Baseten Delivery Network Cuts Large Model Cold Starts by 2–3x

Baseten

Mar 20, 2026 · Updated Apr 25, 2026

Baseten launched the Baseten Delivery Network (BDN), cutting cold starts 2–3x for large models. BDN eliminates weight download bottlenecks with three-tier caching, and it uses single-flight downloads to prevent thundering herd issues during burst scaling. BDN is available to all Baseten customers today.

Baseten, an AI inference platform, launched the Baseten Delivery Network (BDN) to cut cold starts 2–3x for large models. Three mechanisms work together. First, weights are mirrored to Baseten-managed storage at push time, which removes runtime dependencies on Hugging Face, S3, and GCS. Second, a three-tier cache (local NVMe → peer nodes → mirrored origin) serves weight data to replicas. Third, single-flight downloads ensure that only one node fetches any given file from origin.
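To make the lookup order concrete, here is a minimal Python sketch of a three-tier read path. It is an illustration only and not Baseten's implementation: the class name `TieredWeightCache`, the in-memory dictionaries standing in for NVMe and peer nodes, and the `origin_fetch` callback are all hypothetical.

```python
from typing import Callable, Dict, List


class TieredWeightCache:
    """Toy lookup order: local NVMe -> peer nodes -> mirrored origin.

    Hypothetical sketch: dicts stand in for real disks and network peers.
    """

    def __init__(
        self,
        peers: List[Dict[str, bytes]],
        origin_fetch: Callable[[str], bytes],
    ) -> None:
        self.local: Dict[str, bytes] = {}   # stands in for local NVMe
        self.peers = peers                  # other nodes' local caches
        self.origin_fetch = origin_fetch    # stands in for the mirrored origin
        self.hits = {"local": 0, "peer": 0, "origin": 0}

    def get(self, name: str) -> bytes:
        # Tier 1: the file is already on this node.
        if name in self.local:
            self.hits["local"] += 1
            return self.local[name]
        # Tier 2: some peer node already holds the file.
        for peer in self.peers:
            if name in peer:
                self.hits["peer"] += 1
                self.local[name] = peer[name]
                return self.local[name]
        # Tier 3: fall back to the mirrored origin and keep a local copy.
        self.hits["origin"] += 1
        self.local[name] = self.origin_fetch(name)
        return self.local[name]


origin_calls: List[str] = []


def fetch_from_origin(name: str) -> bytes:
    origin_calls.append(name)
    return b"weights-for-" + name.encode()


node_a = TieredWeightCache(peers=[], origin_fetch=fetch_from_origin)
node_b = TieredWeightCache(peers=[node_a.local], origin_fetch=fetch_from_origin)

node_a.get("model.safetensors")  # goes to origin
node_b.get("model.safetensors")  # served by peer node_a
node_b.get("model.safetensors")  # served from node_b's own local copy
```

In this sketch only `node_a` touches origin; `node_b` is served first by its peer and then by its own local copy.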

BDN targets the thundering herd problem: at scale-up, 50–100 replicas simultaneously pulling the same hundreds of gigabytes from origin saturate bandwidth for every replica. Single-flight assigns one responsible fetcher per file; the rest load from cache instead of racing to origin.
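The single-flight idea can be sketched in a few lines of Python. This is a generic illustration of the pattern under stated assumptions, not BDN's code: `SingleFlightCache`, `slow_origin_fetch`, and the file name are hypothetical, and threads stand in for replicas. The first caller for a file becomes the fetcher; everyone else waits on the same result, and the result is kept so late arrivals read it from cache.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List


class SingleFlightCache:
    """One fetcher per file; concurrent and later callers share its result."""

    def __init__(self, fetch_from_origin: Callable[[str], str]) -> None:
        self._fetch = fetch_from_origin
        self._lock = threading.Lock()
        self._entries: Dict[str, Future] = {}

    def get(self, name: str) -> str:
        with self._lock:
            fut = self._entries.get(name)
            is_fetcher = fut is None
            if is_fetcher:
                # First caller for this file: register a pending result.
                fut = Future()
                self._entries[name] = fut
        if is_fetcher:
            try:
                fut.set_result(self._fetch(name))
            except Exception as exc:
                # Drop the failed entry so a later caller can retry.
                with self._lock:
                    self._entries.pop(name, None)
                fut.set_exception(exc)
        # Fetcher and waiters all return the same result (or raise the same error).
        return fut.result()


origin_fetches: List[str] = []


def slow_origin_fetch(name: str) -> str:
    origin_fetches.append(name)
    time.sleep(0.05)  # simulate a slow download from origin
    return f"bytes-of-{name}"


cache = SingleFlightCache(slow_origin_fetch)

# 50 simulated replicas request the same file at once.
with ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(cache.get, ["model.safetensors"] * 50))

print(len(results), "requests served,", len(origin_fetches), "origin fetch(es)")
```

Without the shared entry, each of the 50 callers would have issued its own origin request, which is the burst pattern the paragraph above describes.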

Deploy large models on Baseten Cloud and let BDN handle weight delivery. The gains are largest in burst scenarios where dozens of replicas need the same model weights at once.

View the full update on baseten.co

Baseten

@baseten · Mar 19

Cold starts for large models are one of the hardest problems in AI inference infrastructure. Today we're launching the Baseten Delivery Network (BDN) to solve one of the hardest parts of this problem. 2–3x faster cold starts for large models at scale via optimizations at the pod, node, and cluster levels.

