NVIDIA Hardens Dynamo to Match Frontier Agent Performance on Custom Stacks

NVIDIA

May 9, 2026 · Updated May 17, 2026

NVIDIA updated its Dynamo inference framework to support the specific multi-turn requirements of agent harnesses like Claude Code and Codex. The update eliminates infrastructure friction that causes reasoning drift and cache misses, allowing developers to run complex agents on private stacks with the same fidelity as managed frontier endpoints.

Dynamo, NVIDIA's open-source distributed inference framework, now targets the demands of multi-turn AI agents. The new --strip-anthropic-preamble flag restores KV cache reuse, and streaming tool dispatch lets a harness start a tool call before the turn finishes. These changes follow the launch of the Dynamo agent-native inference stack.

TTFT reduction: about 5.4x (from 912 ms to 169 ms)
Test prompt size: 52K tokens
Supported APIs: Anthropic Messages, OpenAI Responses
New flags: --strip-anthropic-preamble, --enable-streaming-tool-dispatch
Standalone crates: dynamo-protocols, dynamo-parsers, dynamo-tokenizers

Standard inference servers often fail in agentic loops in two ways: they buffer tool calls until the end of the turn, or they drop reasoning tokens between turns. Buffering creates a wait-to-act bottleneck, and dropped reasoning causes agents to lose their train of thought. By preserving interleaved reasoning, Dynamo aims to give custom deployments the same behavioral correctness as native frontier APIs.

You can now use Dynamo to serve models like Nemotron 3 Super for Claude Code or OpenClaw. For OpenAI Codex users, model-catalog aliasing prevents harnesses from falling back to low-performance profiles. Protocol and parser layers are also available as standalone crates like dynamo-parsers.

View the full update on developer.nvidia.com

NVIDIA AI

@NVIDIAAI · May 8

Most agentic stacks run into the same problems pretty quickly: reasoning and tool parsing drift across turns, KV cache reuse falls apart, or tools fire too late.

We’ve been hardening Dynamo’s harness-facing path so @Claudeai Code, @OpenClaw, and @openai Codex-style agent patterns behave reliably on custom stacks and inference endpoints:

• Stable prompts for KV reuse and lower TTFT
• Interleaved reasoning + tool calls preserved across turns
• Streaming tool dispatch instead of end-of-turn buffering
• Harness behavior aligned with real multi-turn agent runtimes

If you’re building your own agent stack or serving endpoint, this blog goes through the infrastructure issues that tend to show up in practice and the patterns we’ve been using to fix them.

Tech blog ➡️ https://t.co/dCWgk4OmyL

Still wondering? A few quick answers below.

What is NVIDIA Dynamo?

NVIDIA Dynamo is an open-source, distributed inference-serving framework designed to deploy AI models at data center scale. It acts as an orchestrator for GPU clusters, optimizing how models are served to handle complex workloads like autonomous agents that require high throughput, low latency, and efficient memory management across multiple computing nodes.

How does NVIDIA Dynamo reduce time to first token for agents?

Dynamo's --strip-anthropic-preamble flag removes session-specific billing headers from incoming requests. Because these headers vary from session to session and precede the prompt instructions, they change the leading tokens of every request and defeat prefix-based KV cache reuse. Stripping them before tokenization puts the stable prompt instructions at the beginning of the sequence. This restores KV cache reuse, which reduced time to first token by approximately 5x in NVIDIA benchmarks.
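The effect of a varying preamble on prefix reuse can be illustrated with a small sketch. This is not Dynamo's implementation: the header name `x-session-billing`, the helper names, and the whitespace tokenizer are all assumptions made for illustration, since the source does not give the real header format.

```python
import re

# Hypothetical per-session line; the real header format is not given in the source.
SESSION_LINE = re.compile(r"^x-session-billing: .*\n", re.MULTILINE)


def strip_preamble(prompt: str) -> str:
    """Drop the (assumed) session-specific line so the stable text comes first."""
    return SESSION_LINE.sub("", prompt, count=1)


def tokenize(text: str) -> list[str]:
    """Whitespace split, standing in for a real tokenizer."""
    return text.split()


def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count the leading tokens two requests share, i.e. what a prefix cache could reuse."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


SYSTEM = "You are a coding agent. Use tools when needed."
req_a = "x-session-billing: id=aaa111\n" + SYSTEM + " List the files."
req_b = "x-session-billing: id=bbb222\n" + SYSTEM + " Run the tests."

# With the preamble in place, the requests diverge at the session id.
raw_shared = shared_prefix_len(tokenize(req_a), tokenize(req_b))

# After stripping, the whole system prompt is a shared prefix.
clean_shared = shared_prefix_len(
    tokenize(strip_preamble(req_a)), tokenize(strip_preamble(req_b))
)

print(raw_shared, clean_shared)
```

In this toy case the shared prefix grows from a single token to the full system prompt once the varying line is removed, which is the property a prefix-keyed KV cache depends on.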

What is streaming tool dispatch in NVIDIA Dynamo?

Streaming tool dispatch lets an inference engine emit each tool call as a structured event as soon as it is parsed. Instead of buffering the entire response until the turn ends, Dynamo sends a side-channel notification that tells the agent harness a tool is ready to execute, so the tool can run in parallel while text continues to stream.
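A minimal sketch of the idea follows, under stated assumptions: the `<tool_call>` markers, the JSON payload, and the function names are hypothetical, and the events are yielded on the same iterator rather than on a separate side channel. It shows only the core behavior, which is that a tool event is produced the moment its closing marker arrives rather than at the end of the turn.

```python
import json
from typing import Iterable, Iterator

# Hypothetical markers; real models and parsers use their own formats.
OPEN, CLOSE = "<tool_call>", "</tool_call>"


def _partial_suffix(buf: str, marker: str) -> int:
    """Length of the longest tail of buf that could be the start of marker."""
    for k in range(min(len(buf), len(marker) - 1), 0, -1):
        if buf.endswith(marker[:k]):
            return k
    return 0


def dispatch_stream(chunks: Iterable[str]) -> Iterator[tuple[str, object]]:
    """Yield ("text", str) and ("tool", dict) events as soon as each is complete."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        while True:
            start = buf.find(OPEN)
            if start == -1:
                # No tool call in sight: flush text, but hold back a possible partial marker.
                keep = _partial_suffix(buf, OPEN)
                if len(buf) > keep:
                    yield ("text", buf[: len(buf) - keep])
                    buf = buf[len(buf) - keep:]
                break
            if start > 0:
                # Flush the text that precedes the tool call.
                yield ("text", buf[:start])
                buf = buf[start:]
            end = buf.find(CLOSE)
            if end == -1:
                break  # Tool call still incomplete; wait for more chunks.
            # The call is complete: emit it now instead of at end of turn.
            yield ("tool", json.loads(buf[len(OPEN):end]))
            buf = buf[end + len(CLOSE):]
    if buf:
        yield ("text", buf)


chunks = ["Checking. <tool", '_call>{"name": "ls"}</tool_call> Then', " more text"]
events = list(dispatch_stream(chunks))
for event in events:
    print(event)
```

A harness consuming these events could start executing the `ls` call as soon as the second chunk arrives, while the remaining text is still streaming.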

How does NVIDIA Dynamo handle reasoning tokens in multi-turn agents?

Dynamo preserves interleaved reasoning by ensuring that thinking tokens stay attached to the specific tool calls they explain across multiple turns. It uses model-specific reasoning parsers to control whether prior thinking should be retained or truncated, preventing the model from losing its logical context during complex, multi-step agentic workflows that require reasoning replay.
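One way to picture the retain-or-truncate decision is a per-model policy applied when a prior turn is replayed into the next request. The sketch below is an illustration only: the `Block` type, the policy table, and the model names are hypothetical and do not reflect Dynamo's parsers or configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    kind: str      # "thinking", "tool_call", or "text"
    content: str


# Hypothetical per-model replay policy; the model names are placeholders.
REPLAY_POLICY = {"model-a": "retain", "model-b": "truncate"}


def replay_turn(blocks: list[Block], model: str) -> list[Block]:
    """Return the blocks of a prior turn that should be sent back to the model.

    Under "retain", a thinking block is kept only when it directly precedes
    a tool call, so the reasoning stays attached to the call it explains.
    Under "truncate" (also the default here), all thinking is dropped.
    """
    policy = REPLAY_POLICY.get(model, "truncate")
    kept: list[Block] = []
    for i, block in enumerate(blocks):
        if block.kind != "thinking":
            kept.append(block)
            continue
        explains_tool = i + 1 < len(blocks) and blocks[i + 1].kind == "tool_call"
        if policy == "retain" and explains_tool:
            kept.append(block)
    return kept


turn = [
    Block("thinking", "I need the file list first."),
    Block("tool_call", "ls"),
    Block("thinking", "That is enough to answer."),
    Block("text", "Here are the files."),
]

retained = [b.kind for b in replay_turn(turn, "model-a")]
truncated = [b.kind for b in replay_turn(turn, "model-b")]
print(retained, truncated)
```

The design choice illustrated here is that the policy is looked up per model, since the source states that whether prior thinking is kept or cut depends on the model-specific parser.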

Which agent tools are compatible with NVIDIA Dynamo?

NVIDIA Dynamo is designed to be compatible with frontier agent harnesses including Claude Code, OpenClaw, and OpenAI Codex. It achieves this by providing high-fidelity API endpoints that match the specific metadata, streaming event sequences, and model catalog requirements these tools expect, allowing them to run on custom infrastructure without losing their native behavioral characteristics.

Every HeadsUpAI update is written based on its original source and reviewed before it's published.

Keep reading

NVIDIA Dynamo Rebuilds Inference Stack for Agentic Coding Workloads

NVIDIA updated its Dynamo inference orchestrator with agent-native optimizations that deliver up to 7x more throughput for multi-step AI workflows. By introducing KV-aware routing and a four-tier memory hierarchy, the system eliminates redundant recomputations in long-running agent sessions.

LangChain Adds NVIDIA Nemotron 3 Ultra for Faster AI Agents

LangChain · Jun 7

LangChain announced immediate support for NVIDIA Nemotron 3 Ultra, an open frontier model designed for long-running AI agents. This integration makes the model's 5x faster inference and up to 30% lower cost for complex agentic tasks directly available to developers using the LangChain framework.

Fireworks AI Adds NVIDIA Nemotron 3 Ultra for Agentic Reasoning

Fireworks AI · Jun 4

Fireworks AI now offers NVIDIA Nemotron 3 Ultra, an open model for advanced autonomous agents, with immediate deployment support. This provides developers with optimized infrastructure for long-running agentic tasks that require frontier reasoning and orchestration.

Arena.ai Adds Nemotron 3 Ultra to Agent Mode for Real-World Agent Evaluation

Arena · Jun 5

Arena.ai has integrated NVIDIA's Nemotron 3 Ultra model into its Agent Mode, enabling users to run the model for complex, multi-step tasks. These sessions contribute to the new Agent Arena leaderboard, which evaluates agentic AI models on real-world performance using tools like web search and terminal. This expands the range of frontier models available for practical agentic workflows and provides new data for understanding their capabilities in autonomous tasks.
