Zyphra Launches AMD-First Inference Cloud Optimized for Long-Horizon Agents

Zyphra

May 15, 2026 · Updated Jun 6, 2026

Zyphra launched Zyphra Cloud, a full-stack AI platform built on AMD MI355X GPUs rather than NVIDIA hardware, opening with serverless inference for long-horizon agents. Each MI355X carries 288GB of memory, versus 192GB on NVIDIA's B200, which Zyphra says keeps nearly double the agent sessions resident in VRAM at long context.

Zyphra, an open superintelligence research company, launched Zyphra Cloud, a full-stack AI platform built on AMD infrastructure. The platform debuts with Zyphra Inference, a serverless service for hosting frontier open-weight models like DeepSeek V3.2 and Kimi K2.6. It is optimized for agentic workloads requiring long-running sessions and massive context windows.

Hardware: AMD Instinct MI355X
Memory per GPU: 288GB HBM3E
Memory per node: 2.3TB (8-GPU node)
Memory bandwidth: 8 TB/s per GPU
Initial models: DeepSeek V3.2, Kimi K2.6, GLM 5.1
Agent capacity: 184 sessions per node (Kimi K2.6 at 256K context)

As models reach trillion-parameter scale, memory capacity becomes the bottleneck for industrial-scale inference. Zyphra uses AMD Instinct MI355X GPUs, which offer 288GB of high-bandwidth memory, 1.5 times the 192GB in NVIDIA's B200. This capacity lets more user sessions stay resident in VRAM, avoiding the costly cache evictions that occur once memory is exhausted.
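To see why capacity dominates at long context, a rough per-session estimate of the KV cache helps. The sketch below uses the standard sizing formula for grouped-query attention. The model shape is a hypothetical placeholder, not the published configuration of any model named here, and "256K" is assumed to mean 256 × 1024 tokens. Models that compress the cache, as DeepSeek's multi-head latent attention does, need less per token.

```python
def kv_cache_bytes(tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache for one session.

    The leading 2 counts keys and values; every layer stores both
    for every cached token. bytes_per_elem=2 assumes 16-bit storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens


# Hypothetical model shape, chosen only for illustration.
CONTEXT_TOKENS = 256 * 1024
per_session = kv_cache_bytes(
    tokens=CONTEXT_TOKENS, n_layers=60, n_kv_heads=8, head_dim=128
)
per_session_gb = per_session / 1e9

print(f"KV cache per session: {per_session_gb:.1f} GB")
```

Under these assumed dimensions a single full-length session needs tens of gigabytes, so the number of sessions a GPU can keep resident is set by memory capacity well before compute becomes the limit.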

You can access the service now to run long-context models with custom optimizations like Tree Attention. The platform supports DeepSeek, Kimi, and GLM models, with DeepSeek V4-Pro support coming soon. Sign up for the serverless API to build agents that maintain up to 256K tokens of context.

View the full update on zyphra.com

Zyphra

@ZyphraAI · May 4

Introducing Zyphra Cloud: A full stack AI platform on AMD. Launching today with Zyphra Inference: serverless inference for frontier open-weight models focused on long horizon agentic workloads. Powered by @AMD MI355X GPUs on @TensorWave. Learn more at https://t.co/ZltAxAzo94 https://t.co/RqFC6DUR2B

Still wondering? A few quick answers below.

What is Zyphra Cloud?

Zyphra Cloud is a full-stack AI platform built on AMD infrastructure and designed for developers, enterprises, and hyperscalers. It unifies model serving, agent infrastructure, and scalable compute in a single environment. The platform aims to bring research innovations in model architecture and systems design into production for building and deploying advanced AI systems.

What is Zyphra Inference?

Zyphra Inference is a serverless inference service and the first component of the Zyphra Cloud platform. It is purpose-built for large open-weight models and long-running agentic workloads. The service is optimized for tasks that require long context windows, large KV caches, and high concurrency, and it relies on the high memory capacity and bandwidth of AMD hardware.
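The cost of running out of memory under high concurrency can be illustrated with a toy residency model. The sketch below assumes a least-recently-used eviction policy and equal-sized sessions served round-robin. Both are assumptions made for illustration, since the source does not describe how Zyphra Inference schedules or evicts sessions, and the names SessionCache and run are hypothetical.

```python
from collections import OrderedDict


class SessionCache:
    """Toy LRU pool of per-session KV caches; sizes are in GB."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.resident = OrderedDict()  # session_id -> size_gb, oldest first
        self.used_gb = 0
        self.evictions = 0

    def touch(self, session_id, size_gb):
        """Mark a session active. Returns True on a hit, False on a (re)load."""
        if session_id in self.resident:
            self.resident.move_to_end(session_id)
            return True
        # Evict least-recently-used sessions until the new one fits.
        while self.resident and self.used_gb + size_gb > self.capacity_gb:
            _, freed = self.resident.popitem(last=False)
            self.used_gb -= freed
            self.evictions += 1
        self.resident[session_id] = size_gb
        self.used_gb += size_gb
        return False


def run(capacity_gb, n_sessions, size_gb, rounds):
    """Serve every session once per round and count evictions."""
    cache = SessionCache(capacity_gb)
    for _ in range(rounds):
        for sid in range(n_sessions):
            cache.touch(sid, size_gb)
    return cache.evictions


print(run(capacity_gb=100, n_sessions=10, size_gb=10, rounds=3))  # all sessions fit
print(run(capacity_gb=90, n_sessions=10, size_gb=10, rounds=3))   # one session short
```

In this toy model, being short by a single session's worth of memory is enough to make almost every request evict another session, which is the behavior that extra capacity is meant to avoid.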

How does Zyphra Inference compare to NVIDIA B200 nodes?

Zyphra Inference uses AMD MI355X GPUs, which provide 288GB of memory per chip compared to the 192GB found in NVIDIA B200 GPUs. This higher memory capacity allows nearly twice as many active agent sessions to remain resident in VRAM. For example, an AMD node running Kimi K2.6 can support 184 active agents at 256K context, while a B200 node supports roughly 100.
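The session count grows by more than the 1.5x memory ratio, and a back-of-the-envelope calculation suggests why. The sketch below assumes node memory splits into a fixed footprint (mainly model weights) plus a constant cost per session, and that the B200 node also has 8 GPUs. Neither assumption comes from the source, and the B200 figure of 100 sessions is only approximate, so the result is illustrative rather than a measurement.

```python
GPUS_PER_NODE = 8  # stated for the MI355X node; assumed for the B200 node


def node_memory_gb(gb_per_gpu, gpus=GPUS_PER_NODE):
    """Total GPU memory in one node, in GB."""
    return gb_per_gpu * gpus


def implied_footprint(mem_a, sessions_a, mem_b, sessions_b):
    """Solve mem = fixed + sessions * per_session from two data points."""
    per_session = (mem_a - mem_b) / (sessions_a - sessions_b)
    fixed = mem_a - sessions_a * per_session
    return fixed, per_session


mi355x_node = node_memory_gb(288)  # 2304 GB, the article's 2.3TB
b200_node = node_memory_gb(192)    # 1536 GB
fixed_gb, per_session_gb = implied_footprint(mi355x_node, 184, b200_node, 100)

print(f"memory ratio:  {mi355x_node / b200_node:.2f}x")
print(f"session ratio: {184 / 100:.2f}x")
print(f"implied fixed footprint: {fixed_gb:.0f} GB")
print(f"implied cost per session: {per_session_gb:.1f} GB")
```

With the article's figures this linear model implies roughly 620 GB of fixed footprint and roughly 9 GB per session. Because the fixed part is paid once per node regardless of size, the memory left over for sessions grows faster than total memory does, which is consistent with a 1.5x memory advantage yielding close to twice the sessions.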

Which AI models are available on Zyphra Inference?

At launch, Zyphra Inference supports several leading frontier open-weight models, including DeepSeek V3.2, Kimi K2.6, and GLM 5.1. The company has also announced that support for DeepSeek V4-Pro is currently in development. These models are optimized end-to-end for the AMD MI355X hardware using custom kernels and novel parallelism schemes developed by Zyphra Research.

What is Zyphra Tree Attention?

Tree Attention is a specialized attention algorithm developed by Zyphra to optimize performance on AMD's point-to-point hardware fabric. Unlike standard Ring Attention, which can perform poorly on this topology, Tree Attention restructures the attention computation as a collective tree reduction. According to Zyphra, this yields significantly better bandwidth and lower latency for the long-context and agentic workloads handled by the platform.
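The idea of attention as a tree reduction can be sketched in a few lines. Softmax attention over a sharded KV cache can be computed from per-shard partial results (a running maximum, a sum of exponentials, and an exponent-weighted sum of values), and the rule for merging two partials is associative, so the merges can be arranged as a tree. The code below simulates this in a single process for one query vector. It is a minimal illustration of the general technique, not Zyphra's implementation, and the function names are hypothetical. In a real deployment the merge would run as a collective across GPUs.

```python
import math


def local_partial(q, keys, values):
    """One shard's contribution: (max score, sum of exps, exp-weighted value sum)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    acc = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return m, sum(weights), acc


def combine(a, b):
    """Associative merge of two partials, rescaled to a shared max for stability."""
    m = max(a[0], b[0])
    sa, sb = math.exp(a[0] - m), math.exp(b[0] - m)
    total = a[1] * sa + b[1] * sb
    acc = [x * sa + y * sb for x, y in zip(a[2], b[2])]
    return m, total, acc


def tree_reduce(partials):
    """Merge partials pairwise, level by level (log depth in the shard count)."""
    level = list(partials)
    while len(level) > 1:
        nxt = [combine(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an odd shard out is carried to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]


def tree_attention(q, key_shards, value_shards):
    """Attention output for one query over a KV cache split into shards."""
    partials = [local_partial(q, k, v) for k, v in zip(key_shards, value_shards)]
    _, total, acc = tree_reduce(partials)
    return [x / total for x in acc]


def reference_attention(q, keys, values):
    """Plain softmax attention over the unsharded cache, for comparison."""
    _, total, acc = local_partial(q, keys, values)
    return [x / total for x in acc]
```

Because each merge only needs two small partials rather than a full block of keys and values, the data exchanged between devices stays small, and the number of sequential communication steps grows with the logarithm of the device count rather than linearly as in a ring.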

Keep reading

Z.ai Deploys ZCube Network to Slash Inference Costs and Latency

Z.ai successfully deployed its ZCube network architecture in production to power GLM-5.1 coding services, reducing hardware costs by 33% while boosting throughput. By flattening the network topology, the system eliminates the congestion typically caused by moving massive amounts of data between GPUs during long-context inference.

Fireworks AI Adds GLM 5.1 Training to Build Long Horizon Coding Agents

Fireworks AI · Apr 28

Fireworks AI added Z.ai's GLM 5.1 to its training platform, supporting supervised fine-tuning and direct preference optimization with a 200K context window. This allows developers to customize the flagship agentic model for multi-hour autonomous tasks without the numerical drift common in fragmented training and inference stacks.

Cloudflare Launches Agent Memory to Give AI Agents Persistent Long Term State

Cloudflare · Apr 18

Cloudflare introduced Agent Memory, a managed service that extracts and stores key information from agent conversations to prevent context rot. By moving state management to a dedicated pipeline, agents can recall past decisions and facts across sessions without exhausting their context windows.

Ollama Adds NVIDIA Nemotron 3 Ultra for Faster, Cheaper AI Agents

Ollama · Jun 7

Ollama has made NVIDIA's Nemotron 3 Ultra model available on its cloud. This 550 billion parameter Mixture of Experts (MoE) model is designed for long-running AI agents, delivering 5x faster inference and up to 30% lower costs for complex agentic tasks.
