Artificial Analysis Benchmarks Guardrail Models for Safety and Latency

Artificial Analysis

Jun 15, 2026

Artificial Analysis, in partnership with NVIDIA, benchmarked 19 guardrail and moderation models across three open datasets. The analysis reveals no single winner, highlighting a critical tradeoff between catching unsafe content and over-refusing safe inputs. Models cluster into permissive or restrictive categories, with NVIDIA’s Nemotron 3.5, Alibaba’s Qwen3Guard 8B, and AI2’s WildGuard defining the current quality-latency frontier.

Figure: Average F1 across WildGuardTest prompt harm, ToxicChat, and XSTest · Higher is better. Source: Artificial Analysis. The chart ranks 19 guardrail models from AI2, Alibaba, OpenAI, NVIDIA, IBM, Meta, and Google by average F1; scores range from 88.9% (WildGuard) to 67.9% (ShieldGemma 9B).
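To make the chart's metric concrete, here is a minimal Python sketch of how an average F1 and an over-refusal rate could be computed from per-dataset confusion counts. It rests on assumptions the source does not confirm: that "unsafe" is the positive class, and that the average is an unweighted mean over the datasets. The names `Counts`, `f1`, `over_refusal_rate`, and `average_f1` are hypothetical, and the numbers are illustrative, not benchmark results.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion counts for one dataset, with 'unsafe' as the positive class."""
    tp: int  # unsafe inputs the guardrail flagged
    fp: int  # safe inputs the guardrail flagged (over-refusal)
    fn: int  # unsafe inputs the guardrail missed
    tn: int  # safe inputs the guardrail allowed


def f1(c: Counts) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


def over_refusal_rate(c: Counts) -> float:
    """Share of safe inputs that were wrongly flagged (false positive rate)."""
    safe_total = c.fp + c.tn
    return c.fp / safe_total if safe_total else 0.0


def average_f1(per_dataset: dict[str, Counts]) -> float:
    """Unweighted mean of per-dataset F1 (an assumption about the averaging)."""
    if not per_dataset:
        raise ValueError("need at least one dataset")
    return sum(f1(c) for c in per_dataset.values()) / len(per_dataset)


# Hypothetical counts for two guardrail styles; not benchmark results.
restrictive = Counts(tp=90, fp=30, fn=10, tn=70)
permissive = Counts(tp=70, fp=5, fn=30, tn=95)

for name, c in [("restrictive", restrictive), ("permissive", permissive)]:
    print(f"{name}: F1={f1(c):.3f}, over-refusal={over_refusal_rate(c):.2f}")
```

In this made-up example the two models land within two points of each other on F1 (0.818 versus 0.800) while their over-refusal rates differ sixfold (0.30 versus 0.05), which is one way a single averaged score can hide the permissive-versus-restrictive split the analysis describes.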

View the full update on artificialanalysis.ai

Artificial Analysis (@ArtificialAnlys) · 3d ago

Users and enterprises are handing AI models and agents more autonomy, so the guardrails that screen their inputs and outputs matter more than ever. However, the benchmarks for evaluating those guardrails haven’t kept pace with model intelligence. In partnership with @nvidia, we independently benchmarked guardrail and moderation models across three open datasets, measuring detection quality, latency, and the tradeoff between catching unsafe content and over-refusing safe content. No model wins outright, and there is still no common standard for judging them. We see this as an early step in a measurement problem that will continue to grow more important as models take on more real-world work.



Keep reading

Artificial Analysis Ranks Nemotron 3 Ultra Fastest for Agentic Tasks

Artificial Analysis evaluated NVIDIA's newly launched Nemotron 3 Ultra, finding it completes agentic tasks significantly faster than peers due to high inference speed. The model achieves competitive performance on Terminal-Bench v2.1, positioning it as a leading option for efficient autonomous AI workflows.

Arena.ai Ranks NVIDIA Nemotron 3 Ultra #20 on Agent Arena Leaderboard

Arena · Yesterday

Arena.ai added NVIDIA's Nemotron 3 Ultra to the Agent Arena leaderboard, where it ranks #20 overall and #5 among open models. The model shows strong tool-use discipline, tying for #1 in tool hallucination, but struggles with steerability and bash recovery. These scores, based on 2,849 sessions, remain subject to wide confidence intervals as data stabilizes.

NVIDIA · May 20

NVIDIA Releases Nemotron-Labs-Diffusion for 6x Faster Parallel Token Generation

NVIDIA released Nemotron-Labs-Diffusion, a family of open-weight models that unify standard autoregressive decoding with parallel diffusion-based generation. By switching attention patterns within a single model, these 3B to 14B parameter models achieve up to 4x higher throughput on modern hardware compared to traditional sequential generation.