Taalas Launches Hard-Wired Llama Chip Delivering 10x Faster Inference

Taalas Inc.

Feb 19, 2026 · Updated Apr 25, 2026

Taalas, an AI hardware startup, launched a chip with Llama 3.1 8B permanently hard-wired into silicon. It delivers 17K tokens/sec per user, nearly 10x faster than H200 GPUs, at 20x lower build cost. The inference API is in beta, and developers can apply for access.

Taalas, an AI hardware startup building model-specific silicon, has launched its first product: a chip with Llama 3.1 8B permanently hard-wired into the hardware. Its HC1 platform achieves 17K tokens/sec per user, nearly 10x faster than current GPU-based inference, while costing 20x less to build and consuming 10x less power.
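To put the throughput figure in concrete terms, the short sketch below converts the quoted 17K tokens/sec into per-token and per-response latency. It is only back-of-the-envelope arithmetic: the 1,000-token response length is a hypothetical value chosen for illustration, "nearly 10x" is rounded to exactly 10, and the GPU baseline is simply what that ratio implies, not a measured number.

```python
# Figures quoted in the article.
HC1_TOKENS_PER_SEC = 17_000   # per-user throughput claimed for HC1
SPEEDUP_VS_GPU = 10           # "nearly 10x", rounded to 10 (assumption)

# Hypothetical response length, chosen only for illustration.
RESPONSE_TOKENS = 1_000


def seconds_for(tokens: int, tokens_per_sec: float) -> float:
    """Time to stream `tokens` output tokens at a steady rate."""
    return tokens / tokens_per_sec


# GPU throughput implied by the quoted speedup (not a measurement).
implied_gpu_tokens_per_sec = HC1_TOKENS_PER_SEC / SPEEDUP_VS_GPU

hc1_seconds = seconds_for(RESPONSE_TOKENS, HC1_TOKENS_PER_SEC)
gpu_seconds = seconds_for(RESPONSE_TOKENS, implied_gpu_tokens_per_sec)
per_token_us = 1e6 / HC1_TOKENS_PER_SEC  # microseconds per token on HC1

print(f"HC1: {hc1_seconds * 1000:.1f} ms per {RESPONSE_TOKENS} tokens")
print(f"Implied GPU baseline: {gpu_seconds * 1000:.1f} ms per {RESPONSE_TOKENS} tokens")
print(f"HC1 per-token latency: {per_token_us:.1f} microseconds")
```

Under these assumptions, a 1,000-token response streams in roughly 59 ms on HC1, against roughly 588 ms at the implied GPU rate.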

The performance comes from a fundamentally different architecture. Modern inference hardware keeps memory and compute separate, so data movement between them becomes a bottleneck and the system needs HBM stacks, advanced packaging, and liquid cooling. Taalas merges memory and compute onto a single chip at DRAM-level density, removing that bottleneck. Each chip is produced for a specific model, trading generality for extreme efficiency.

The HC1 Llama 3.1 8B is available as a chatbot demo and a beta inference API. Apply for API access at Taalas' site. A mid-sized reasoning LLM on HC1 is expected in spring, with a frontier model on their next-generation HC2 platform planned for winter.

View the full update on taalas.com

Taalas Inc. (@taalas_inc) · Feb 19

24 dedicated people. $30M spent on development. Extreme specialization, speed, and power efficiency. Today we launch Taalas’ first product. Check it out: Details: https://t.co/88CA0XAL71 Demo chatbot: https://t.co/ec4ladcKnw API: https://t.co/M3EkaxEqPj

