Introducing Zyphra Cloud: A full stack AI platform on AMD. Launching today with Zyphra Inference: serverless inference for frontier open-weight models focused on long horizon agentic workloads. Powered by @AMD MI355X GPUs on @TensorWave. Learn more at https://t.co/ZltAxAzo94 https://t.co/RqFC6DUR2B
Zyphra Launches AMD-First Inference Cloud Optimized for Long-Horizon Agents
Zyphra, an open superintelligence research company, has launched Zyphra Cloud, a full-stack AI platform built on AMD infrastructure. The platform debuts with Zyphra Inference, a serverless service for hosting frontier open-weight models such as DeepSeek V3.2 and Kimi K2.6. It is optimized for agentic workloads that require long-running sessions and very large context windows.
- Hardware: AMD Instinct MI355X
- Memory per GPU: 288GB HBM3E
- Memory per node: 2.3TB (8-GPU node)
- Memory bandwidth: 8 TB/s per GPU
- Initial models: DeepSeek V3.2, Kimi K2.6, GLM 5.1
- Agent capacity: 184 sessions per node (Kimi K2.6 at 256K context)
As models reach trillion-parameter scale, memory capacity becomes the bottleneck for industrial-scale inference. Zyphra uses AMD Instinct MI355X GPUs, which offer 288GB of high-bandwidth memory per GPU, compared with 192GB on NVIDIA's B200. The extra capacity lets more user sessions keep their KV caches resident in VRAM, avoiding the costly cache evictions that occur once memory is exhausted.
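A back-of-the-envelope KV-cache estimate shows why capacity matters at long context. The sketch below assumes a Multi-head Latent Attention (MLA) cache laid out like DeepSeek-V3's: 61 layers, with a 512-dimensional compressed latent plus a 64-dimensional RoPE key cached per token per layer. These dimensions come from DeepSeek-V3's published configuration and serve only as an illustration. They are not confirmed for the exact models, model versions, or cache precision Zyphra serves, and the sketch ignores the extra indexer cache in DeepSeek V3.2. It also assumes "256K" means 262,144 tokens.

```python
# Rough per-session KV-cache estimate for an MLA-style model.
# ASSUMPTION: dimensions are DeepSeek-V3's published config values, used as a
# stand-in; they are not confirmed for the exact models Zyphra serves, and the
# extra indexer cache used by DeepSeek V3.2 is ignored.
NUM_LAYERS = 61      # transformer layers
KV_LATENT_DIM = 512  # compressed KV latent cached per token per layer
ROPE_KEY_DIM = 64    # decoupled RoPE key cached alongside the latent

CONTEXT = 256 * 1024  # ASSUMPTION: "256K" read as 262,144 tokens


def kv_bytes_per_token(bytes_per_elem: int) -> int:
    """Bytes of KV cache that one token adds across all layers."""
    return NUM_LAYERS * (KV_LATENT_DIM + ROPE_KEY_DIM) * bytes_per_elem


def kv_gib_per_session(context_tokens: int, bytes_per_elem: int) -> float:
    """KV cache (GiB) held by one session with `context_tokens` in context."""
    return context_tokens * kv_bytes_per_token(bytes_per_elem) / 2**30


# BF16 uses 2 bytes per element, FP8 uses 1.
for name, width in (("BF16", 2), ("FP8", 1)):
    print(f"{name}: {kv_gib_per_session(CONTEXT, width):.2f} GiB per session")
```

Under these assumptions, one 256K-token session holds about 17.2 GiB of cache in BF16, or about 8.6 GiB in FP8. Multiplying by the number of concurrent sessions shows how quickly long contexts consume whatever memory remains after the weights are loaded.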
You can access the service now to run long-context models with custom optimizations like Tree Attention. The platform supports DeepSeek, Kimi, and GLM models, with DeepSeek V4-Pro support coming soon. Sign up for the serverless API to build agents that maintain up to 256K tokens of context.
Zyphra
@ZyphraAI
Still wondering? A few quick answers below.
Zyphra Cloud is a full-stack AI platform built on AMD infrastructure for developers, enterprises, and hyperscalers. It unifies model serving, agent infrastructure, and scalable compute in a single environment, with the goal of bringing research innovations in model architecture and systems design into production for building and deploying advanced AI systems.
Zyphra Inference is a serverless inference service and the first component of the Zyphra Cloud platform. It is purpose-built for large open-weight models and long-running agentic workloads. The service is optimized for tasks that require long context windows, large KV caches, and high concurrency, specifically leveraging the high memory capacity and bandwidth of AMD hardware.
Zyphra Inference runs on AMD MI355X GPUs, which provide 288GB of memory per chip, compared with 192GB on NVIDIA B200 GPUs. This 1.5x difference in memory lets nearly twice as many active agent sessions stay resident in VRAM. For example, an MI355X node can support 184 active agents at 256K context, while a B200 node supports roughly 100. The gain in sessions is larger than the gain in memory because the model weights occupy a fixed amount of each node, so all of the additional memory is available for KV cache.
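A simple linear model turns the quoted session counts into rough per-node budgets: sessions = (node memory - fixed overhead) / per-session KV cost. Here the fixed overhead covers the weights and runtime buffers. The sketch assumes both nodes have 8 GPUs and that the relationship is exactly linear. Because the B200 figure is only "roughly 100", the derived values are our estimates, not numbers Zyphra has published.

```python
# Back out the fixed overhead and per-session KV cost implied by the quoted
# session counts. ASSUMPTIONS: 8 GPUs per node on both platforms, and
# sessions = (node_memory - fixed_overhead) / per_session_cost exactly.
GPUS_PER_NODE = 8


def node_memory_gb(gb_per_gpu: float) -> float:
    """Total HBM in one node."""
    return gb_per_gpu * GPUS_PER_NODE


def implied_costs(mem_a: float, sessions_a: int, mem_b: float, sessions_b: int):
    """Solve the two-point linear model for (per_session_gb, fixed_gb)."""
    per_session = (mem_a - mem_b) / (sessions_a - sessions_b)
    fixed = mem_a - sessions_a * per_session
    return per_session, fixed


mi355x_node = node_memory_gb(288)  # 2304 GB, the "2.3TB" per node
b200_node = node_memory_gb(192)    # 1536 GB
per_session_gb, fixed_gb = implied_costs(mi355x_node, 184, b200_node, 100)
print(f"implied per-session KV: {per_session_gb:.2f} GB")
print(f"implied fixed overhead: {fixed_gb:.0f} GB")
```

With these inputs, the model implies about 9.1 GB of KV cache per 256K-token session and about 620 GB of fixed overhead per node. Changing the B200 session count shows how sensitive both estimates are to that approximate figure.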
At launch, Zyphra Inference supports several leading frontier open-weight models, including DeepSeek V3.2, Kimi K2.6, and GLM 5.1. The company has also announced that support for DeepSeek V4-Pro is currently in development. These models are optimized end-to-end for the AMD MI355X hardware using custom kernels and novel parallelism schemes developed by Zyphra Research.
Tree Attention is an attention algorithm developed by Zyphra to improve performance on the point-to-point interconnect fabric of AMD hardware. Standard Ring Attention can perform poorly on this topology, so Tree Attention instead restructures the attention computation as a collective tree reduction. This yields significantly better bandwidth utilization and lower latency for the long-context and agentic workloads the platform handles.
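A tree reduction is possible because softmax attention over a long sequence can be split into shards and merged associatively. Each shard returns three values: its maximum score, its sum of exponentials, and its unnormalized weighted sum of values. Any two such partials can then be combined with a log-sum-exp rescaling. The single-process sketch below simulates shards that would live on different GPUs. Its helper names (`partial_attention`, `combine`, `tree_reduce`) are hypothetical and are not Zyphra's code. It merges p shards pairwise in about log2(p) levels, where a ring pass needs p - 1 sequential hops, and checks the result against unsharded attention for a single query vector.

```python
import math

# Sketch of sharded softmax attention merged by a pairwise (tree) reduction.
# Shards are simulated in one process; in a real system each would sit on a
# different GPU and `combine` would run inside a collective reduction.


def partial_attention(q, keys, values):
    """Attend one query to one shard; return (max_score, sum_exp, weighted_sum)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
    return m, sum(weights), out


def combine(a, b):
    """Associative merge of two partials using log-sum-exp rescaling."""
    m_a, l_a, o_a = a
    m_b, l_b, o_b = b
    m = max(m_a, m_b)
    ca, cb = math.exp(m_a - m), math.exp(m_b - m)
    return m, l_a * ca + l_b * cb, [x * ca + y * cb for x, y in zip(o_a, o_b)]


def tree_reduce(parts):
    """Merge partials pairwise; return (result, number_of_levels)."""
    levels = 0
    while len(parts) > 1:
        merged = [combine(parts[i], parts[i + 1]) for i in range(0, len(parts) - 1, 2)]
        if len(parts) % 2:
            merged.append(parts[-1])  # odd one out waits for the next level
        parts = merged
        levels += 1
    return parts[0], levels


def tree_attention(q, keys, values, num_shards):
    """Shard keys/values, attend per shard, then tree-reduce and normalize."""
    size = math.ceil(len(keys) / num_shards)
    parts = [partial_attention(q, keys[i:i + size], values[i:i + size])
             for i in range(0, len(keys), size)]
    (_, total, out), levels = tree_reduce(parts)
    return [x / total for x in out], levels


def full_attention(q, keys, values):
    """Unsharded reference attention for one query."""
    _, total, out = partial_attention(q, keys, values)
    return [x / total for x in out]


# Usage: 16 keys/values split across 8 simulated GPUs.
q = [0.5, -0.25, 1.0, 0.1]
keys = [[math.sin(i + j) for j in range(4)] for i in range(16)]
values = [[math.cos(i * (j + 1)) for j in range(3)] for i in range(16)]
sharded, levels = tree_attention(q, keys, values, num_shards=8)
reference = full_attention(q, keys, values)
print(levels)  # 3 levels for 8 shards
print(max(abs(a - b) for a, b in zip(sharded, reference)))  # floating-point rounding only
```

Because `combine` is associative (up to floating-point rounding), the merge order does not change the result. This is what allows a runtime to choose a reduction tree that matches the interconnect rather than forcing data around a fixed ring.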