NVIDIA Dynamo Rebuilds Inference Stack for Agentic Coding Workloads

NVIDIA

Apr 25, 2026 · Updated May 3, 2026

NVIDIA updated its Dynamo inference orchestrator with agent-native optimizations that deliver up to 7x more throughput for multi-step AI workflows. By introducing KV-aware routing and a four-tier memory hierarchy, the system eliminates redundant recomputations in long-running agent sessions.

NVIDIA updated Dynamo, its open-source inference orchestration layer, with optimizations for agentic AI (systems that plan and act autonomously). The update introduces agent_hints, an API for passing metadata like task priority to the stack, plus native support for protocols that handle interleaved thinking and tool calls.

Throughput increase: Up to 7x more
Cache hit rate: 85-97%
Memory hierarchy tiers: 4 (GPU, CPU, NVMe, Remote)
Routing performance: 170M ops/s
Read to write ratio: 11.7x
Supported protocols: v1/chat, v1/responses, v1/messages

Traditional inference was built for chat, but agents make hundreds of API calls with 97% context overlap. This "write-once-read-many" pattern overwhelms standard caches. The optimizations build on NVIDIA Dynamo 1.0 and aim to give self-hosted models the same cache-reuse efficiency found in frontier APIs.
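To make the overlap figure concrete, here is a rough back-of-the-envelope sketch of how many tokens an engine has to prefill over one session, with and without prefix reuse. The session size (200 calls, 50,000-token context) and the assumption of a constant context length are illustrative choices, not numbers from NVIDIA; only the 97% overlap comes from the article.

```python
def prefill_tokens(calls: int, context_tokens: int, overlap: float, cached: bool) -> int:
    """Total tokens that must be prefilled across a session.

    Simplifying assumptions (not from the source): every call carries the
    same context length, and a cache hit reuses exactly the overlapping part.
    """
    if not cached:
        # Without reuse, every call recomputes its full context.
        return calls * context_tokens
    # With reuse, the first call pays full price; later calls only
    # prefill the part of the context that did not overlap.
    fresh = round(context_tokens * (1.0 - overlap))
    return context_tokens + (calls - 1) * fresh


# Hypothetical session: 200 calls, 50,000-token context, 97% overlap.
without_cache = prefill_tokens(200, 50_000, 0.97, cached=False)
with_cache = prefill_tokens(200, 50_000, 0.97, cached=True)
print(without_cache, with_cache, round(without_cache / with_cache, 1))
```

Under these assumed numbers the cached session prefills 348,500 tokens instead of 10,000,000. That is a reduction in prefill work only; it is not the same measurement as the throughput figure NVIDIA reports.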

You can now use the Dynamo router for KV-aware placement (routing based on which worker already holds a request's cached key-value state), sending requests to the GPU worker holding the relevant context. The system uses a four-tier memory hierarchy to pin system prompts while evicting ephemeral reasoning tokens. The agent_hints API is available on GitHub.

View the full update on developer.nvidia.com

NVIDIA AI

@NVIDIAAI · Apr 25

Traditional inference wasn’t built for agentic coding. Agentic tools make hundreds of API calls per coding session, often with recomputed context, creating bottlenecks that drive up cost per token. NVIDIA Dynamo rebuilds the stack for agents with: → KV-aware routing → Agent-aware scheduling → Multi-tier caching → Unified orchestration The result: higher cache hit rates, lower latency, and up to 7× more throughput: https://t.co/E9tRgiLmar


Still wondering? A few quick answers below.

What is NVIDIA Dynamo?

NVIDIA Dynamo is an open-source inference orchestration layer that acts as a distributed operating system for AI clusters. It coordinates between agent frameworks and inference engines like vLLM or TensorRT-LLM. It is designed to manage complex workloads by handling request routing, scheduling, and memory management across multiple GPU nodes to improve performance.

How does NVIDIA Dynamo improve performance for AI agents?

AI agents often reuse the same conversation history across hundreds of API calls. Dynamo optimizes this by using KV-aware routing, which sends requests to the specific worker that already has the relevant context in its memory. This reduces redundant recomputations, leading to lower latency and up to seven times higher throughput for agentic sessions.
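The routing idea can be sketched in a few lines of Python. This is a toy illustration of prefix-overlap routing, not Dynamo's router: the block size, the `Worker` class, the scoring formula, and the `load_weight` parameter are all assumptions made for the example.

```python
from dataclasses import dataclass, field

BLOCK = 4  # tokens per cache block; tiny on purpose, real systems use larger blocks


def block_hashes(tokens: list[int]) -> list[int]:
    """Hash each full block chained with its prefix, so equal hashes imply equal prefixes."""
    hashes, prev = [], 0
    full = len(tokens) - len(tokens) % BLOCK  # ignore a trailing partial block
    for i in range(0, full, BLOCK):
        prev = hash((prev, tuple(tokens[i:i + BLOCK])))
        hashes.append(prev)
    return hashes


@dataclass
class Worker:
    name: str
    cached: set[int] = field(default_factory=set)  # block hashes held in this worker's KV cache
    active: int = 0                                # requests currently in flight


def overlap(worker: Worker, hashes: list[int]) -> int:
    """Number of leading blocks of the request already cached on the worker."""
    n = 0
    for h in hashes:
        if h not in worker.cached:
            break
        n += 1
    return n


def route(workers: list[Worker], tokens: list[int], load_weight: float = 1.0) -> Worker:
    """Pick the worker with the best trade-off between cache overlap and current load."""
    hashes = block_hashes(tokens)
    best = max(workers, key=lambda w: overlap(w, hashes) - load_weight * w.active)
    best.cached.update(hashes)  # the chosen worker will now hold this prefix
    best.active += 1
    return best


def finish(worker: Worker) -> None:
    """Mark one request on the worker as completed."""
    worker.active -= 1
```

A follow-up call in the same session shares most of its leading blocks with the previous one, so it scores highest on the worker that served the earlier call unless that worker is heavily loaded. Subtracting a load term is one simple way to keep a popular prefix from turning a single worker into a hotspot.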

What are NVIDIA Dynamo agent hints?

Agent hints are a new API extension that allows agent frameworks to pass metadata to the inference stack. These hints include task priority, estimated output length, and speculative prefill signals. By sharing this context, the orchestrator can make better decisions about which requests to prioritize and when to warm up the cache before a tool call returns.

How does the NVIDIA Dynamo 4-tier memory hierarchy work?

Dynamo manages the Key-Value cache across four storage layers: GPU high-bandwidth memory, CPU pinned DRAM, local NVMe storage, and remote cluster-wide storage. This hierarchy allows the system to offload and share cached context between different workers. It ensures that high-value data, like system prompts, is retained while ephemeral data, such as reasoning tokens, is evicted.

Is NVIDIA Dynamo open source and how can I access it?

Yes, NVIDIA Dynamo is an open-source project. Developers can access the source code, documentation, and the new agent hints API through the official GitHub repository. It is designed to work with various open-source models and inference engines, providing a standardized infrastructure layer for teams running their own GPU-backed agentic workflows.

Every HeadsUpAI update is written based on its original source and reviewed before it's published.

Keep reading

NVIDIA Hardens Dynamo to Match Frontier Agent Performance on Custom Stacks

NVIDIA updated its Dynamo inference framework to support the specific multi-turn requirements of agent harnesses like Claude Code and Codex. The update eliminates infrastructure friction that causes reasoning drift and cache misses, allowing developers to run complex agents on private stacks with the same fidelity as managed frontier endpoints.

Qwen Sets 580 TPS Record for Agentic Workloads on Blackwell GPUs

Qwen · May 27

Qwen achieved a record 580 tokens per second running its Qwen3.5-397B-A17B model on NVIDIA Blackwell GPUs using the TokenSpeed inference engine. The optimization targets agentic workloads, where multi-turn reasoning and tool-calling typically suffer from high latency. By combining a hybrid attention architecture with deep kernel fusion, the system maintains high throughput even as context scales to one million tokens.

LangChain Adds NVIDIA Nemotron 3 Ultra for Faster AI Agents

LangChain · Jun 7

LangChain announced immediate support for NVIDIA Nemotron 3 Ultra, an open frontier model designed for long-running AI agents. This integration makes the model's 5x faster inference and up to 30% lower cost for complex agentic tasks directly available to developers using the LangChain framework.

Cohere Integrates W4A8 Inference into vLLM for Faster Hopper Performance

Cohere · Apr 24

Cohere released production-ready W4A8 quantization kernels for dense and Mixture of Experts models, now integrated into the vLLM inference framework. By combining 4-bit weights with 8-bit activations, the update achieves up to 58 percent faster prefill and 45 percent faster decoding on NVIDIA Hopper GPUs.
