Together AI Presents Untied Ulysses for Memory-Efficient Long-Context Training

Together AI

Jun 15, 2026

Together AI researcher Max Ryabinin introduced Untied Ulysses, a context parallelism technique that optimizes GPU memory usage during transformer training. By chunking attention heads and reusing buffers across iterations, the method enables training 8B and 32B scale models on a single 8xH100 node with 25% longer sequences than prior implementations, overcoming memory limits that previously stalled 3M-token context training.
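To give a rough sense of why chunking attention heads can matter at this scale, here is a back-of-the-envelope sketch. It is not the Untied Ulysses implementation. It assumes a Ulysses-style layout in which, after the all-to-all exchange, each GPU holds Q, K, and V for the full sequence but only its own share of the heads, and it assumes that head chunking caps how many of those heads are staged in the buffers at once. The function names, the 32-head and 128-dimension figures, bf16 storage, and the omission of grouped-query attention are all illustrative assumptions, not details from the talk.

```python
def qkv_buffer_bytes(seq_len, heads_in_flight, head_dim, bytes_per_elem=2):
    """Bytes needed to hold Q, K and V for `heads_in_flight` heads over the full sequence.

    Assumes plain multi-head attention (same head count for Q, K and V) and
    bf16 storage by default. Both are simplifying assumptions.
    """
    return 3 * seq_len * heads_in_flight * head_dim * bytes_per_elem


def compare_buffers(seq_len, num_heads, world_size, head_dim, chunk_heads,
                    bytes_per_elem=2):
    """Return (unchunked_bytes, chunked_bytes) per GPU for the staged Q/K/V buffers."""
    if num_heads % world_size != 0:
        raise ValueError("num_heads must be divisible by world_size")
    heads_per_gpu = num_heads // world_size
    if not 1 <= chunk_heads <= heads_per_gpu:
        raise ValueError("chunk_heads must be between 1 and heads_per_gpu")

    # Without chunking: all of this GPU's heads are staged at once.
    unchunked = qkv_buffer_bytes(seq_len, heads_per_gpu, head_dim, bytes_per_elem)
    # With chunking: only `chunk_heads` heads are staged, and the same
    # buffer is reused for the next chunk.
    chunked = qkv_buffer_bytes(seq_len, chunk_heads, head_dim, bytes_per_elem)
    return unchunked, chunked


# Hypothetical configuration: 3M tokens, 32 heads, 8 GPUs, head_dim 128, bf16.
unchunked, chunked = compare_buffers(3_000_000, 32, 8, 128, chunk_heads=1)
print(f"unchunked: {unchunked / 1e9:.2f} GB, chunked: {chunked / 1e9:.2f} GB")
```

Under these assumptions the staged buffers shrink in proportion to the chunk size. The estimate covers only this one set of buffers and says nothing about parameters, optimizer state, or other activations.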


Together AI (@togethercompute) · 3d ago

Training a Llama 3B model with a 3M token context on a single 8xH100 node fails because model parameters alone exhaust GPU memory. @m_ryabinin explains how Untied Ulysses, his team's latest research, pushes past that wall, training at 8B and 32B scale with 25% longer sequences than prior implementations. https://t.co/nm0sjLUSUL


