HeadsUpAI

NVIDIA Hardens Dynamo to Match Frontier Agent Performance on Custom Stacks


NVIDIA updated Dynamo, its open-source distributed inference framework, to support the high-pressure requirements of multi-turn AI agents. New features like --strip-anthropic-preamble restore KV cache reuse, while streaming tool dispatch improves responsiveness. These changes follow the Dynamo agent-native inference stack launch.
TTFT reduction: 5x (from 912ms to 169ms)
Test prompt size: 52K tokens
Supported APIs: Anthropic Messages, OpenAI Responses
New flags: --strip-anthropic-preamble, --enable-streaming-tool-dispatch
Standalone crates: dynamo-protocols, dynamo-parsers, dynamo-tokenizers

Standard inference servers often fail in agentic loops by buffering tool calls or dropping reasoning tokens. This creates a wait-to-act bottleneck and causes agents to lose their train of thought. By preserving interleaved reasoning, Dynamo ensures custom deployments maintain the same behavioral correctness as native frontier APIs.

You can now use Dynamo to serve models like Nemotron 3 Super for Claude Code or OpenClaw. For OpenAI Codex users, model-catalog aliasing prevents harnesses from falling back to low-performance profiles. Protocol and parser layers are also available as standalone crates like dynamo-parsers.

NVIDIA AI (@NVIDIAAI) on X:

Most agentic stacks run into the same problems pretty quickly: reasoning and tool parsing drift across turns, KV cache reuse falls apart, or tools fire too late. We’ve been hardening Dynamo’s harness-facing path so @Claudeai Code, @OpenClaw, and @openai Codex-style agent patterns behave reliably on custom stacks and inference endpoints:

• Stable prompts for KV reuse and lower TTFT
• Interleaved reasoning + tool calls preserved across turns
• Streaming tool dispatch instead of end-of-turn buffering
• Harness behavior aligned with real multi-turn agent runtimes

If you’re building your own agent stack or serving endpoint, this blog goes through the infrastructure issues that tend to show up in practice and the patterns we’ve been using to fix them.

Tech blog ➡️ https://t.co/dCWgk4OmyL


Still wondering? A few quick answers below.

NVIDIA Dynamo is an open-source, distributed inference-serving framework designed to deploy AI models at data center scale. It acts as an orchestrator for GPU clusters, optimizing how models are served to handle complex workloads like autonomous agents that require high throughput, low latency, and efficient memory management across multiple computing nodes.

Dynamo's --strip-anthropic-preamble flag removes session-specific billing headers from incoming requests. Because these headers vary from session to session, leaving them in place changes the first tokens of every prompt and defeats prefix matching. Stripping them before tokenization means the stable prompt instructions start at the beginning of the sequence. This restores KV cache reuse, which reduced time to first token (TTFT) by approximately 5x, from 912ms to 169ms, in NVIDIA benchmarks.
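To see why a varying header at the front of a prompt hurts cache reuse, here is a minimal sketch of the idea, not Dynamo's implementation. The header format (`x-billing-session: ...`), the whitespace tokenizer, and all function names are assumptions made up for illustration. Prefix caches can only reuse the tokens two requests share from the very start, so the sketch measures that shared prefix before and after stripping.

```python
import re

# Hypothetical header format, assumed purely for illustration;
# the real preamble that Dynamo strips may look different.
HEADER_RE = re.compile(r"^x-billing-session: [^\n]*\n")


def strip_preamble(prompt: str) -> str:
    """Drop a leading per-session header line, if one is present."""
    return HEADER_RE.sub("", prompt, count=1)


def toy_tokenize(text: str) -> list[str]:
    """Stand-in tokenizer: split on whitespace (real tokenizers differ)."""
    return text.split()


def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


SYSTEM = "You are a coding agent. Use tools when needed."
req_a = "x-billing-session: abc123\n" + SYSTEM + " Fix the failing test."
req_b = "x-billing-session: zzz999\n" + SYSTEM + " Add a README."

# With the header in place, the prompts diverge almost immediately.
raw_overlap = shared_prefix_len(toy_tokenize(req_a), toy_tokenize(req_b))

# After stripping, the whole system prompt is a reusable prefix.
clean_overlap = shared_prefix_len(
    toy_tokenize(strip_preamble(req_a)),
    toy_tokenize(strip_preamble(req_b)),
)

print(raw_overlap, clean_overlap)
```

In this toy case the reusable prefix grows from a single token to the full system prompt. With a 52K-token prompt the same effect decides whether most of the prefill work is skipped or redone.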

Streaming tool dispatch is a feature that allows an inference engine to emit tool calls as structured events as soon as they are parsed. Instead of buffering the entire response until the turn ends, Dynamo sends a side-channel notification that tells the agent harness a tool is ready to execute, enabling parallel execution while text continues to stream.
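The difference from end-of-turn buffering can be sketched with a small incremental parser. This is an illustration under stated assumptions, not Dynamo's parser: the `<tool_call>` delimiters, the JSON payload shape, and the event tuples are all hypothetical, and the sketch emits events in one stream rather than over a separate side channel.

```python
import json
from typing import Iterable, Iterator

# Hypothetical delimiters; real models use model-specific tool-call formats.
OPEN, CLOSE = "<tool_call>", "</tool_call>"


def _partial_suffix(buf: str, marker: str) -> int:
    """Length of the longest tail of buf that could be the start of marker."""
    for k in range(min(len(buf), len(marker) - 1), 0, -1):
        if buf.endswith(marker[:k]):
            return k
    return 0


def dispatch_stream(chunks: Iterable[str]) -> Iterator[tuple[str, object]]:
    """Yield ("text", str) and ("tool_call", dict) events as chunks arrive."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        while True:
            start = buf.find(OPEN)
            if start == -1:
                # No tool call in sight: flush text, but hold back a tail
                # that might be the beginning of an opening delimiter.
                keep = _partial_suffix(buf, OPEN)
                if len(buf) > keep:
                    yield ("text", buf[: len(buf) - keep])
                    buf = buf[len(buf) - keep:]
                break
            if start > 0:
                yield ("text", buf[:start])
                buf = buf[start:]
            end = buf.find(CLOSE)
            if end == -1:
                break  # tool call still incomplete; wait for more chunks
            # The call is complete: emit it now, before the turn ends.
            yield ("tool_call", json.loads(buf[len(OPEN):end]))
            buf = buf[end + len(CLOSE):]
    if buf:
        yield ("text", buf)  # leftover text, or an unterminated call


chunks = [
    "Let me check. <tool",
    '_call>{"name": "ls", ',
    '"args": {}}</tool_call> Then',
    " I will summarize.",
]
events = list(dispatch_stream(chunks))
for event in events:
    print(event)
```

The tool-call event appears as soon as its closing delimiter arrives, ahead of the remaining text, so a harness consuming these events could start the tool while the model keeps streaming.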

Dynamo preserves interleaved reasoning by ensuring that thinking tokens stay attached to the specific tool calls they explain across multiple turns. It uses model-specific reasoning parsers to control whether prior thinking should be retained or truncated, preventing the model from losing its logical context during complex, multi-step agentic workflows that require reasoning replay.
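The retain-or-truncate choice can be pictured as a filter over the blocks of a prior assistant turn. The following is a rough sketch only: the `Block` type, the `replay_turn` function, and the boolean policy flag are hypothetical names, and a real reasoning parser would apply model-specific rules rather than a single switch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    kind: str  # "thinking", "tool_call", or "text" (illustrative labels)
    body: str


def replay_turn(blocks: list[Block], retain_thinking: bool) -> list[Block]:
    """Build the version of a prior turn that is replayed to the model.

    retain_thinking=True keeps each thinking block in place next to the
    tool call it explains; False drops thinking and keeps everything else.
    """
    if retain_thinking:
        return list(blocks)
    return [b for b in blocks if b.kind != "thinking"]


turn = [
    Block("thinking", "Need the file list first."),
    Block("tool_call", "ls src/"),
    Block("thinking", "Now read the config."),
    Block("tool_call", "cat src/config.toml"),
]

kept = replay_turn(turn, retain_thinking=True)
dropped = replay_turn(turn, retain_thinking=False)

print([b.kind for b in kept])
print([b.kind for b in dropped])
```

In both cases the original ordering is preserved, which is the point of interleaving: when thinking is retained, each thought still sits directly before the tool call it motivated.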

NVIDIA Dynamo is designed to be compatible with frontier agent harnesses including Claude Code, OpenClaw, and OpenAI Codex. It achieves this by providing high-fidelity API endpoints that match the specific metadata, streaming event sequences, and model catalog requirements these tools expect, allowing them to run on custom infrastructure without losing their native behavioral characteristics.
