Together AI DeepSeek V4 Pro Deployment Tops Industry Speed Benchmarks

Together AI

Jun 15, 2026

Together AI now ranks first on Artificial Analysis for DeepSeek V4 Pro inference, delivering 211.9 output tokens per second. The company attributes its lead over the other 10 tracked providers to inference-system optimizations, including custom KV cache management, prefix reuse, and kernel tuning on NVIDIA HGX B200 hardware. Among the 11 providers, the deployment posts both the lowest latency and the highest output speed for the model.
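To make the idea of prefix reuse concrete, here is a minimal sketch of a block-level prefix cache. It is an illustration only, not Together AI's implementation: the `PrefixCache` class, the `prefill` helper, the 4-token block size, and the token values are all hypothetical. The point it shows is that when two requests share a leading run of tokens (for example, the same system prompt), the second request only needs to compute attention state for the tokens past the cached prefix.

```python
from dataclasses import dataclass, field

BLOCK = 4  # tokens per cache block (deliberately tiny; an assumption for this sketch)


@dataclass
class PrefixCache:
    """Toy prefix cache: maps block-aligned token prefixes to placeholder KV entries."""
    blocks: dict = field(default_factory=dict)

    def match(self, tokens):
        """Return how many leading tokens are already covered by cached blocks."""
        hit = 0
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            if tuple(tokens[:end]) not in self.blocks:
                break
            hit = end
        return hit

    def insert(self, tokens):
        """Record every full block-aligned prefix of this request."""
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            # A real server would store attention key/value tensors here.
            self.blocks.setdefault(tuple(tokens[:end]), f"kv[{end - BLOCK}:{end}]")


def prefill(cache, tokens):
    """Return (reused, computed) token counts for one request."""
    reused = cache.match(tokens)
    cache.insert(tokens)
    return reused, len(tokens) - reused


cache = PrefixCache()
system_prompt = list(range(8))  # 8 tokens shared by both requests

first = prefill(cache, system_prompt + [100, 101, 102])
second = prefill(cache, system_prompt + [200, 201, 202, 203, 204])

print("first request  (reused, computed):", first)
print("second request (reused, computed):", second)
```

The first request finds nothing cached and computes all of its tokens, while the second reuses the two blocks that cover the shared prompt. Keying each block by its full token prefix keeps the sketch short; production systems generally use more compact structures such as chained block hashes or a radix tree.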

Output speed, fastest 5 of 11 providers:
#1 Together.ai 211.9 t/s
#2 Makora 168.8 t/s
#3 Lightning AI 153.0 t/s
#4 Baseten 125.2 t/s
#5 Fireworks 115.2 t/s
Together AI leads industry benchmarks for both output speed and latency across eleven major inference providers.
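As a rough way to read the ranking above, the following sketch computes how far ahead the fastest listed speed is of each provider, and how long each would take to stream a response of a given length. The speeds are the figures listed above; the 2,000-token response length and the helper names are assumptions chosen for illustration, and the estimate ignores time to first token.

```python
# Output speeds in tokens per second, as listed in the ranking above.
speeds = {
    "Together AI": 211.9,
    "Makora": 168.8,
    "Lightning AI": 153.0,
    "Baseten": 125.2,
    "Fireworks": 115.2,
}

RESPONSE_TOKENS = 2_000  # hypothetical response length, not from the source

leader = max(speeds.values())


def lead_pct(speed):
    """Percentage by which the fastest listed speed exceeds `speed`."""
    return (leader / speed - 1) * 100


def decode_seconds(speed, tokens=RESPONSE_TOKENS):
    """Seconds to stream `tokens` output tokens at a steady `speed`."""
    return tokens / speed


for name, speed in speeds.items():
    print(
        f"{name:<13} {speed:6.1f} t/s  "
        f"leader ahead by {lead_pct(speed):5.1f}%  "
        f"{decode_seconds(speed):5.1f} s per {RESPONSE_TOKENS} tokens"
    )
```

On these figures, the top speed is about 25.5% above the second-place result and about 83.9% above the fifth.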

View the full update on together.ai

Together AI

@togethercompute, 13h ago

DeepSeek V4 Pro on Together AI is now #1 on Artificial Analysis for both output speed and latency. Serving V4 well is an inference systems problem: KV cache, prefix reuse, kernels, and endpoint profiles. We break down the systems work here: https://t.co/RLHi35DFif

Keep reading

Together AI Delivers 31% Faster Coding Agent Inference on Blackwell

Together AI published coding agent benchmarks showing its inference engine achieves 31% more tokens per second than the next-fastest open-source engine on NVIDIA Blackwell hardware. These performance gains result from custom kernels targeting Blackwell Tensor Core instructions. Cursor now runs its real-time coding agents on this production stack to maintain low-latency feedback loops during development.

NVIDIA Blackwell Ultra Powers DeepSeek V4 Pro at 150 Tokens Per Second

NVIDIA, Apr 25

NVIDIA reported that DeepSeek-V4-Pro achieves over 150 tokens per second on Blackwell Ultra hardware. This performance level makes 1.6-trillion-parameter models viable for real-time autonomous agents. Future software updates such as Dynamo and NVFP4 are expected to push these speeds even higher.

Artificial Analysis Launches AA-AgentPerf Benchmark for Agentic Inference Workloads

Artificial Analysis, 3h ago

Artificial Analysis launched AA-AgentPerf, the first benchmark measuring agentic inference performance using real coding trajectories. The benchmark’s lead metric, Agents per Megawatt, evaluates concurrent agent capacity at production service levels. Initial results for DeepSeek V4 Pro show NVIDIA’s rack-scale GB300 system sustains 61,354 agents per megawatt, significantly outperforming single-node Blackwell and Hopper configurations in power efficiency.

Qwen Sets 580 TPS Record for Agentic Workloads on Blackwell GPUs

Qwen, May 27

Qwen achieved a record 580 tokens per second running its Qwen3.5-397B-A17B model on NVIDIA Blackwell GPUs using the TokenSpeed inference engine. The optimization targets agentic workloads, where multi-turn reasoning and tool-calling typically suffer from high latency. By combining a hybrid attention architecture with deep kernel fusion, the system maintains high throughput even as context scales to one million tokens.