Intel Arc Pro B70 Delivers 1.8x Performance Boost for Large AI Models

Intel News

Apr 1, 2026 · Updated Apr 25, 2026

Intel's latest MLPerf Inference v6.0 results show the Arc Pro B70 GPU delivering up to 1.8x higher inference performance than previous generations. Combined with Intel Xeon 6 CPUs, the GPUs can run 120B-parameter models on workstation-class systems using an open software stack.

Intel released MLPerf Inference v6.0 results for its Intel Xeon 6 CPUs and Intel Arc Pro B-Series GPUs. The Intel Arc Pro B70 demonstrated up to 1.8x higher inference performance than the previous generation. A four-GPU setup provides 128GB of VRAM in total, enough to run 120B-parameter models with high concurrency.
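As a rough, back-of-the-envelope sketch of why 128GB can hold a 120B-parameter model, the snippet below estimates weight memory at a few precisions. The bytes-per-parameter values are illustrative assumptions, not figures from Intel's submission, and the estimate ignores activations, KV cache, and runtime overhead.

```python
TOTAL_VRAM_GB = 128   # four-GPU setup described in the article
PARAMS = 120e9        # 120B-parameter model

# Assumed storage cost per parameter (illustrative, not from the source).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params, bytes_per_param):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


def headroom_gb(params, bytes_per_param, total_gb=TOTAL_VRAM_GB):
    """VRAM left after loading weights; a negative value means they do not fit."""
    return total_gb - weight_memory_gb(params, bytes_per_param)


if __name__ == "__main__":
    for name, size in BYTES_PER_PARAM.items():
        need = weight_memory_gb(PARAMS, size)
        left = headroom_gb(PARAMS, size)
        print(f"{name}: weights {need:.0f} GB, headroom {left:.0f} GB")
```

Under these assumptions, 16-bit weights would not fit in 128GB, while 8-bit or 4-bit weights would, with the remaining headroom available for KV cache and concurrent requests.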

The results position Intel as a viable alternative for developers who want high-performance AI inference without proprietary lock-in. Intel also cites 1.6x more KV cache capacity than comparable competing hardware, which leaves room for larger models and longer context windows. The open, containerized Linux stack aims to simplify enterprise-grade deployments.
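To make the link between KV cache capacity and context length concrete, here is a minimal sketch of the usual per-token KV cache estimate for a transformer decoder: one key vector and one value vector per layer and per KV head. The model dimensions in the example are hypothetical placeholders, not the configuration of any model in Intel's results.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: each layer stores one key vector and one value vector.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(cache_bytes, num_layers, num_kv_heads, head_dim,
                      bytes_per_value=2):
    """How many tokens fit in a KV cache budget of cache_bytes."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim,
                                   bytes_per_value)
    return cache_bytes // per_token


if __name__ == "__main__":
    # Hypothetical model shape and cache budget, for illustration only.
    layers, kv_heads, head_dim = 36, 8, 64
    budget = 60 * 10**9  # 60 GB reserved for KV cache (assumption)
    print(kv_bytes_per_token(layers, kv_heads, head_dim), "bytes per token")
    print(max_cached_tokens(budget, layers, kv_heads, head_dim), "tokens")
```

Because the token count scales linearly with the cache budget, 1.6x more KV cache capacity translates to roughly 1.6x more cached tokens, which can be spent on longer contexts or more concurrent sequences.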

You can use these systems for local LLM inference and fine-tuning. Intel Xeon 6 processors include built-in acceleration such as AMX and AVX-512, which lets some AI tasks run efficiently without a dedicated accelerator. The systems target workstations where data privacy and cost efficiency are priorities.
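On Linux, the kernel lists supported instruction-set extensions on the flags line of /proc/cpuinfo. The sketch below checks for AMX and the AVX-512 foundation set using the kernel's flag names (amx_tile, amx_bf16, amx_int8, avx512f). The helper names and the choice of flags to check are assumptions made for illustration.

```python
from pathlib import Path

# Kernel flag names for AMX and the AVX-512 foundation instructions.
WANTED_FLAGS = ("amx_tile", "amx_bf16", "amx_int8", "avx512f")


def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, value = line.partition(":")
            return set(value.split())
    return set()


def detect_features(cpuinfo_text, wanted=WANTED_FLAGS):
    """Map each wanted flag to True or False depending on CPU support."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print(detect_features(cpuinfo.read_text()))
    else:
        print("/proc/cpuinfo not available on this platform")
```

A flag being present only shows that the CPU and kernel expose the extension; whether a given inference framework actually uses it depends on how that framework was built and configured.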

View the full update on newsroom.intel.com

Intel News

@intelnews · Apr 1

Demand for AI inference = demand for greater performance. See how today’s newly released MLPerf Inference v6.0 benchmarks show how #IntelXeon 6 CPUs and #IntelArcPro B-series GPUs deliver—with Intel Arc Pro B70 providing up to 1.8x higher inference performance over previous generations. Read more: https://t.co/94aZZbXsGf

