What is Artificial Analysis?

Artificial Analysis is an independent AI benchmarking and analysis company that evaluates AI models and API providers across quality, price, and performance. HeadsUpAI tracks Artificial Analysis across the AI ecosystem and curates every significant update, so you get the whole story in a 30-second read. The latest is "Artificial Analysis Benchmarks New OpenAI GPT Transcribe and Live Models" (July 29, 2026).

What's new from Artificial Analysis?

The most recent Artificial Analysis update is "Artificial Analysis Benchmarks New OpenAI GPT Transcribe and Live Models" (July 29, 2026). HeadsUpAI curates every significant Artificial Analysis release as a 30-second read — what shipped and why it matters.

What does Artificial Analysis do?

Artificial Analysis is an independent AI benchmarking and analysis company that evaluates AI models and API providers across quality, price, and performance. This page lists every significant Artificial Analysis development HeadsUpAI has tracked recently, covering analysis and launches, so you can keep up with where Artificial Analysis is heading without reading a dozen sources.

How often is Artificial Analysis news updated here?

Continuously. HeadsUpAI adds new Artificial Analysis updates as they're announced — usually within hours — and the 68 updates currently shown cover the past 90 days, newest first.

Artificial Analysis AI News & Updates — Latest Releases & Features

What are the latest Artificial Analysis updates and releases?

The latest Artificial Analysis updates: "Artificial Analysis Benchmarks New OpenAI GPT Transcribe and Live Models", "Artificial Analysis Ranks Alibaba Qwen Audio 3.0 Realtime #1", "Artificial Analysis: Moonshot AI Releases Kimi K3 Open Weights", "Artificial Analysis: Claude Opus 5 Leads Agentic Knowledge Work Benchmark", and "Artificial Analysis: OpenAI GPT-5.6 Sol Dominates Token-Efficiency Frontier". HeadsUpAI has curated 68 Artificial Analysis updates over the last 90 days, covering analysis and launches — listed newest first, presented straight, no hype, no bias.

Artificial Analysis · 8h ago

Artificial Analysis Benchmarks New OpenAI GPT Transcribe and Live Models

Artificial Analysis reports OpenAI released GPT Transcribe, a batch speech-to-text model scoring 3.31% on AA-WER. The model improves accuracy by 0.7 percentage points over its predecessor while reducing price by 25% to $4.50 per 1,000 minutes. It now accepts contextual prompts, keywords, and language hints. OpenAI also launched GPT-Live-Transcribe, a streaming model currently undergoing benchmarking.

Artificial Analysis · 16h ago

Artificial Analysis Ranks Alibaba Qwen Audio 3.0 Realtime #1

Artificial Analysis benchmarked Alibaba’s Qwen Audio 3.0 Realtime, finding the Plus variant leads its Speech to Speech Index at 84.1%. The model outperforms GPT-Realtime-2.1 High on speech reasoning, conversational dynamics, and agentic performance. However, it records a 4.02-second time to first audio, significantly slower than the ~1.1-second latency achieved by OpenAI’s GPT-Realtime-2 series.

Artificial Analysis · Jul 27

Artificial Analysis: Moonshot AI Releases Kimi K3 Open Weights

Artificial Analysis reports that Moonshot AI released the weights for Kimi K3, a 2.6T parameter model. It now leads the open-weight category with a score of 57 on the Artificial Analysis Intelligence Index. The release includes a custom license requiring separate agreements for high-revenue businesses and mandatory UI attribution for products exceeding 100 million monthly active users.

Artificial Analysis · Jul 25

Artificial Analysis: Claude Opus 5 Leads Agentic Knowledge Work Benchmark

Artificial Analysis benchmarked Anthropic’s Claude Opus 5 on its AA-Briefcase agentic knowledge work test, where it took the top spot with a 1720 Elo score. The model outperforms Claude Fable 5 in analytical quality while reducing task costs by 20%. While Opus 5 leads in analytical rigor, it trails GPT-5.6 Sol in presentation quality and requires longer task times.

Artificial Analysis · Jul 23

Artificial Analysis: OpenAI GPT-5.6 Sol Dominates Token-Efficiency Frontier

Artificial Analysis reports that OpenAI’s GPT-5.6 Sol dominates the token-efficiency Pareto frontier on its Intelligence Index. The model achieves higher intelligence with fewer output tokens than competing models launched this month. Within the GPT-5.6 family, the Sol and Luna tiers provide superior token efficiency compared to the Terra tier.

Artificial Analysis · Jul 23

Artificial Analysis Benchmarks Thinking Machines Lab Inkling on AA-Briefcase

Artificial Analysis benchmarked Thinking Machines Lab’s Inkling model on its AA-Briefcase agentic knowledge-work test, where it achieved an Elo of 836. The model scored 19.3% on the rubric, ranking ahead of DeepSeek V4 Flash but behind Nemotron 3 Ultra. Inkling showed higher presentation quality than analytical depth, averaging 81 turns per task with 0.5 tool calls per turn.

Artificial Analysis · Jul 22

Artificial Analysis Ranks Kimi K3 Second on AA-Briefcase Benchmark

Artificial Analysis benchmarked Moonshot AI’s Kimi K3 on its AA-Briefcase agentic knowledge work benchmark, where the model achieved an Elo of 1543, ranking second overall. While Kimi K3 matches Claude Fable 5 in analytical quality, it records a 56.4-minute average task time and a $10.57 cost per task, significantly higher than its predecessor and several competitors.

Artificial Analysis · Jul 18

Artificial Analysis Ranks Kimi K3 Joint Fifth on Coding Agent Index

Artificial Analysis benchmarked Moonshot AI’s Kimi K3 in Kimi Code CLI, which scored 57 on its Coding Agent Index. This result places the model joint fifth, outperforming Claude Opus 4.8 while leading all tested open-weight configurations. At $3.18 per task, Kimi K3 offers cost-efficient performance compared to other frontier models on the leaderboard.

Artificial Analysis · Jul 17

Artificial Analysis: Frontier AI Expands to Six Labs as Costs Drop

Artificial Analysis reports that four new frontier models launched in eight days, increasing the number of labs with models scoring above 50 on its Intelligence Index from two to six. Kimi K3 debuted at #3, while near-frontier intelligence costs fell by 2–3×. Claude Fable 5 remains #1, though its lead over the field has narrowed to one point.

Artificial Analysis · Jul 16

Artificial Analysis Benchmarks Moonshot AI's Kimi K3 Reasoning Model

Artificial Analysis benchmarked Moonshot AI’s Kimi K3, which scores 57 on its Intelligence Index and reaches a 1668 Elo on GDPval-AA v2, outperforming GPT-5.5 and Claude Opus 4.8 in agentic tasks. While K3 leads in agentic knowledge work and Moonshot AI plans an open-weights release, the model shows a 51% hallucination rate and costs $0.94 per task, significantly higher than its predecessor.

Artificial Analysis · Jul 15

Artificial Analysis Ranks Alibaba Qwen-Audio-3.0-TTS-Plus First on Speech Leaderboard

Artificial Analysis ranks Alibaba’s Qwen-Audio-3.0-TTS-Plus as the leading model on its Speech Arena Leaderboard for provider voices. The model achieved an Elo score of 1,236 across 1,305 evaluations. It generates 16 characters per second and costs $27.59 per 1 million characters via Alibaba Cloud Model Studio.

Artificial Analysis · Jul 14

Artificial Analysis Benchmarks Meta Muse Spark 1.1 on AA-Briefcase

Artificial Analysis benchmarked Meta’s Muse Spark 1.1 on its AA-Briefcase agentic knowledge work test, where the model achieved an Elo score of 863. While the model shows significant gains in analytical quality and rubric pass rates, its presentation quality dropped to an Elo of 432, marking a 67-point decrease from the previous version.

Artificial Analysis · Jul 13

Artificial Analysis Ranks Gemini Omni Flash #1 in Video Generation

Artificial Analysis ranked Google’s Gemini Omni Flash #1 on its Text to Video and Image to Video leaderboards, narrowly surpassing ByteDance’s Seedance 2.0. The model is priced at $0.10 per second and is available via the Gemini API, Google AI Studio, and consumer platforms including YouTube Shorts and the Gemini app.

Artificial Analysis · Jul 11

Artificial Analysis Benchmarks Meta Muse Spark 1.1 Coding Agent Performance

Artificial Analysis benchmarked Meta’s Muse Spark 1.1 (xhigh) on its Coding Agent Index, where the model achieved a score of 69. This result places it between GPT-5.5 (medium) at 71 and Claude Opus 4.8 (medium) at 67. The model costs approximately $1.40 per task, offering high cost-efficiency at the expense of longer task-completion times.

Artificial Analysis · Jul 11

Artificial Analysis: GPT-5.6 Sol and Luna Dominate Terra in Cost-Efficiency

Artificial Analysis benchmarked OpenAI’s GPT-5.6 model family, finding that Sol and Luna outperform Terra in cost-efficiency at every reasoning level. While all three tiers push past GPT-5.5 on the Pareto frontier, Luna stands out as the most cost-efficient option. For any Terra effort level, Sol or Luna provides higher intelligence at no extra cost, or equal intelligence for less.
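
The dominance claim in this item (higher intelligence at no extra cost, or equal intelligence for less) is the usual definition of a Pareto frontier. The following is a minimal sketch of how such a frontier could be computed from (cost, score) pairs. The `Point` type, the `pareto_frontier` function, and the sample values are all hypothetical illustrations, not Artificial Analysis data or code.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    name: str     # hypothetical model label
    cost: float   # cost per task; lower is better
    score: float  # intelligence score; higher is better


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point beats on both cost and score."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, visit the higher score first.
    for p in sorted(points, key=lambda p: (p.cost, -p.score)):
        # A point survives only if it scores strictly higher than everything cheaper.
        if p.score > best_score:
            frontier.append(p)
            best_score = p.score
    return frontier


# Made-up values purely for illustration.
sample = [
    Point("A", cost=1.0, score=40),
    Point("B", cost=2.0, score=35),  # dominated by A: costs more, scores less
    Point("C", cost=3.0, score=50),
    Point("D", cost=3.0, score=45),  # dominated by C: same cost, scores less
]
print([p.name for p in pareto_frontier(sample)])
```

In this toy data, only A and C remain on the frontier; B and D are each beaten by a model that costs no more and scores higher.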

Artificial Analysis · Jul 11

Artificial Analysis Benchmarks China Mobile's JT-4.1 Flash 236B Model

Artificial Analysis benchmarked China Mobile’s JT-4.1 Flash 236B A21B, which scored 39 on the Intelligence Index. The non-reasoning model achieved 28% on τ³-Banking and a 20-point reduction in hallucination rate compared to its predecessor. It features a 256k context window and shows significant gains in complex reasoning tasks on the Humanity’s Last Exam benchmark.

Artificial Analysis · Jul 10

Artificial Analysis Benchmarks Meta Muse Spark 1.1 Performance and Efficiency

Artificial Analysis benchmarked Meta’s Muse Spark 1.1, which scored 51 on their Intelligence Index, an 8-point improvement over the previous version. The model demonstrates significant gains in coding and scientific reasoning while maintaining high token efficiency, costing approximately $0.26 per task. It features a 1-million-token context window and is available via Meta’s first-party API.

Artificial Analysis · Jul 10

Artificial Analysis Ranks GPT-5.6 Sol Highest for Presentation Quality

Artificial Analysis reports that OpenAI’s GPT-5.6 Sol (max) achieved the highest Presentation Elo of any model in its AA-Briefcase agentic knowledge work benchmark. Scoring 1656, the model shows a 500-point improvement over GPT-5.5 (xhigh), resulting in a projected 95% win rate in head-to-head visual comparisons for professional document presentation.
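
A 500-point gap mapping to roughly a 95% win rate is what the conventional Elo expected-score formula gives. The sketch below assumes the standard 400-point logistic scale; the source does not say which scale Artificial Analysis uses, and `elo_win_probability` is a hypothetical helper name.

```python
def elo_win_probability(rating_gap: float, scale: float = 400.0) -> float:
    """Expected win rate of the higher-rated side under the standard Elo model.

    `scale=400` is the conventional chess value and is an assumption here.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# Gap reported in the item above: about 500 Elo points.
print(round(elo_win_probability(500), 3))
```

Under that assumption a 500-point gap works out to just under 0.95, in line with the projected 95% figure.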

Artificial Analysis · Jul 10

Artificial Analysis Ranks GPT-5.6 Sol (max) First on CritPt Benchmark

Artificial Analysis ranks OpenAI’s GPT-5.6 Sol (max) as the new leader on the CritPt benchmark, achieving 32.3% accuracy on research-level physics problems. This result surpasses the previous leader, GPT-5.5 (xhigh), by about 5 points and outperforms Claude Fable 5 by 4 points. The benchmark tests models on unpublished challenges contributed by global physics researchers.

Artificial Analysis · Jul 9

Artificial Analysis Benchmarks Grok 4.5 on AA-Briefcase Agentic Benchmark

Artificial Analysis ranks SpaceXAI Grok 4.5 as the top non-Anthropic model on its AA-Briefcase agentic knowledge work benchmark with an Elo of 1328. The model completes tasks in 12.4 minutes at a cost of $1.12 per task, using an average of 23 turns. It shows a 578-point improvement over Grok 4.3, trailing only Anthropic’s Claude 5-series models.

Artificial Analysis · Jul 9

Artificial Analysis Launches EnterpriseOps-Gym-AA Leaderboard for Agentic Enterprise Workflows

Artificial Analysis launched the EnterpriseOps-Gym-AA leaderboard, evaluating how AI agents perform on stateful, multi-step enterprise workflows across eight business domains. Claude Fable 5 leads with a 51.1% success rate, while Gemini 3.5 Flash follows at 50.1%. The benchmark reveals a 90x cost variance per task, with higher spending failing to guarantee higher accuracy in real-world enterprise environments.

Artificial Analysis · Jul 9

Artificial Analysis Ranks Grok 4.5 First on AutomationBench-AA Leaderboard

Artificial Analysis ranks SpaceXAI's Grok 4.5 first on its AutomationBench-AA leaderboard with a 51% score. It is the first model to complete over half of workflow objectives without guardrail violations. At $0.34 per task, Grok 4.5 outperforms leading models like Claude Fable 5 and GPT-5.5 in both cost and objective completion, while maintaining high token efficiency.

Artificial Analysis · Jul 8

Artificial Analysis Benchmarks SpaceXAI Grok 4.5 Performance and Efficiency

Artificial Analysis ranks SpaceXAI’s Grok 4.5 fourth on its Intelligence Index with a score of 54, marking a 16-point improvement over Grok 4.3. The model achieves parity with GPT-5.5 on coding agent tasks while using significantly fewer tokens and costing $2.49 per task. It also leads on the τ³-Banking benchmark, outperforming GPT-5.5.

Artificial Analysis · Jul 8

Artificial Analysis Benchmarks Google Nano Banana 2 Lite Image Performance

Artificial Analysis benchmarked Google’s Nano Banana 2 Lite, finding it ranks #5 in text-to-image generation, outperforming the base Nano Banana 2. However, it ranks #18 in image editing, trailing larger models. The model generates 1K-resolution images in 3.4 seconds at $33.60 per 1,000 images, offering a faster, cheaper alternative for generation but not for complex editing tasks.

Artificial Analysis · Jul 7

Artificial Analysis Launches Harvey LAB-AA Legal Agent Benchmark Leaderboard

Artificial Analysis launched Harvey LAB-AA, an independent benchmark evaluating AI models on 120 real-world legal tasks across 24 practice areas. Claude Fable 5 leads with a 14.2% all-pass rate, while 13 of 28 models passed zero tasks. The results show a 950x cost variance, highlighting the gap between partial criteria completion and professional-grade legal deliverables.

Artificial Analysis · Jul 7

Artificial Analysis Launches Industry Indices to Benchmark AI on Professional Tasks

Artificial Analysis released six new Capability Indices evaluating AI models across Finance, Legal, Healthcare, Strategy, Engineering, and Economics. The benchmarks use occupational data to weight model performance based on the actual frequency of professional tasks like contract review and clinical documentation. Results reveal a massive frontier premium, with top-tier models costing over 100x more than mid-tier alternatives for incremental accuracy gains.

Artificial Analysis · Jul 7

Artificial Analysis Ranks SpeechifyAI Simba 3.2 First for TTS Quality and Price

Artificial Analysis ranked SpeechifyAI’s Simba 3.2 as the top text-to-speech model on its Speech Arena leaderboard following blind human preference testing. The model achieved a 1,233 Elo score, surpassing previous leaders from Google and Cartesia while maintaining the lowest price point among the top five models.

Artificial Analysis · Jul 6

Artificial Analysis Launches AutomationBench-AA for Agentic SaaS Workflow Evaluation

Artificial Analysis launched AutomationBench-AA, an independent leaderboard evaluating AI agents on 657 real-world SaaS workflow tasks. Claude Fable 5 leads with a 48.6% objective completion rate, though it falls back to Opus 4.8 on 18% of tasks. Gemini 3.5 Flash achieves the best guardrail-adjusted efficiency, completing 15.0 objectives per violation while matching GPT-5.5 performance at 37% of the cost.

Artificial Analysis · Jul 6

Artificial Analysis Benchmarks AssemblyAI Universal-3.5 Pro Realtime Model

Artificial Analysis benchmarked AssemblyAI’s Universal-3.5 Pro Realtime, which achieves a 4.1% word error rate at 0.44 seconds latency in Max Accuracy mode. The model now supports 18 languages and accepts turn-by-turn conversation context without reconnecting. Pricing remains $0.45 per hour, with a Min Latency mode offering a 4.3% word error rate at 0.40 seconds.

Artificial Analysis · Jul 2

Artificial Analysis Benchmarks Fish Audio S2.1 Pro Speech Model

Artificial Analysis benchmarked Fish Audio’s new S2.1 Pro text-to-speech model, which supports 83 languages and voice cloning. The model achieved an Elo score of 1,153, ranking 13th on the Speech Arena leaderboard. It processes 56.3 characters per second, and Fish Audio is providing free API access to the model through July 24, 2026.

Artificial Analysis · Jul 2

Artificial Analysis Benchmarks Claude Sonnet 5 on AA-Briefcase Agentic Benchmark

Artificial Analysis benchmarked Anthropic’s Claude Sonnet 5 on its AA-Briefcase agentic knowledge work benchmark. The model achieved a 1391 Elo score, a 312-point improvement over Claude Sonnet 4.6. Claude Sonnet 5 shows a 17x cost-per-task range across five effort settings, with max effort requiring 183 turns per task, a fourfold increase over the previous generation.

Artificial Analysis · Jun 30

Artificial Analysis Launches Controlled Voice Arena for TTS Model Evaluation

Artificial Analysis launched the Controlled Voice Arena to standardize Text-to-Speech model evaluation. By cloning all models onto the same 8 voices—2 US Male, 2 US Female, 2 UK Male, and 2 UK Female—the arena isolates model quality from voice preference. Voting is open now, with the first leaderboard results expected later this week.

Artificial Analysis · Jun 30

Artificial Analysis Ranks Alibaba HappyHorse 1.1 Second on Video Leaderboards

Artificial Analysis ranks Alibaba’s HappyHorse 1.1 second on its text-to-video and image-to-video leaderboards, trailing only ByteDance’s Seedance 2.0. The model features improved audio-visual synchronization, seven-language lip-sync, and support for nine reference images. It is available on Alibaba Cloud Model Studio, Qwen Cloud, and fal, priced at $9.90 per minute for 1080p video generation.

Artificial Analysis · Jun 25

Artificial Analysis Ranks HappyHorse-1.0 First on Video Editing Leaderboard

Artificial Analysis published its Video Editing Leaderboard, ranking HappyHorse-1.0 first overall after ~80,000 blind human votes. The benchmark evaluates six models across five capabilities, revealing that Kling 3.0 leads in Visual Effects while Wan 2.7 excels in Sound and Speech. HappyHorse-1.0 is the only model to rank in the top three across all five categories.

Artificial Analysis · Jun 25

Artificial Analysis Ranks Microsoft MAI-Image-2.5 Models and Details Pricing

Artificial Analysis ranked Microsoft’s MAI-Image-2.5 second in text-to-image and third in image editing on its leaderboards. The model costs $48 per 1,000 images on the Foundry API, while the faster MAI-Image-2.5-Flash variant ranks eighth and sixth at $20 per 1,000 images. Both models are available in the MAI Playground, with MAI-Image-2.5 also integrated into PowerPoint and OneDrive.

Artificial Analysis · Jun 23

Artificial Analysis Launches Composite Index for Native Speech to Speech Models

Artificial Analysis launched a Speech to Speech Index evaluating native audio models across reasoning, conversational dynamics, and agentic performance. OpenAI’s GPT-Realtime-2 (High) leads at 77.2%, while xAI’s Grok Voice Think Fast 1.0 leads in agentic performance at 52.1%. Deepslate Opal is the fastest at 0.44 seconds, and Gemini 3.1 Flash Live Preview (Minimal) is the lowest-cost model.

Artificial Analysis · Jun 22

Artificial Analysis Ranks GLM-5.2 as Leading Open-Weights Agentic Model

Artificial Analysis benchmarked Z.ai's GLM-5.2 on its GDPval-AA agentic work evaluation, where the model achieved a 1524 Elo score. This result ranks GLM-5.2 as the leading open-weights model and third overall, making it competitive with proprietary frontier models on long-horizon, multi-turn tasks. The model is available via API at $1.40 per 1M input tokens.

Artificial Analysis · Jun 22

Artificial Analysis Launches Video Editing Arena for Frontier AI Models

Artificial Analysis launched a Video Editing Arena to benchmark frontier video models on text-instructed editing tasks. The platform features blind human-preference voting for six models, including Seedance 2.0, Runway Aleph 2.0, and Kling 3.0 Omni. It evaluates capabilities like visual effects, sound editing, and physics simulation, with the first leaderboard results expected within 24 hours.

Artificial Analysis · Jun 19

Artificial Analysis Launches AA-Briefcase Benchmark for Agentic Knowledge Work

Artificial Analysis launched AA-Briefcase, a benchmark evaluating AI models on long-horizon knowledge work across multi-week projects. Claude Fable 5 leads with a 1587 Elo score, yet satisfies all rubric criteria on only 3% of tasks. The benchmark reveals an 800x cost variance across models, with 31 of 91 tasks resulting in no model scoring above 50%.

Artificial Analysis · Jun 18

Artificial Analysis: Claude Fable 5 Is the Most Expensive Model

Artificial Analysis reports that Claude Fable 5 cost $6,228 to run on its Intelligence Index, making it the most expensive model the company has benchmarked. This cost is driven by a doubling of token and cache pricing compared to Claude Opus 4.8. Anthropic’s new top-tier model is priced at $10 per million input tokens and $50 per million output tokens.

Artificial Analysis · Jun 18

Artificial Analysis Benchmarks Z.ai GLM-5.2 on Research-Level Physics Problems

Artificial Analysis reports Z.ai's GLM-5.2 scored 20.9% on the CritPt benchmark of research-level physics problems, tying Claude Opus 4.8. This result marks a 4.5× generational jump from GLM-5.1's 4.6% score ten weeks ago. GLM-5.2 leads all open-weights models, with only proprietary systems like GPT-5.5 Pro at 30.6% scoring higher on the privately-graded test.

Artificial Analysis · Jun 17

Artificial Analysis Benchmarks Soniox v5 Real-Time on Pareto Frontier

Artificial Analysis benchmarked Soniox v5 Real-Time, finding it lands on the Pareto frontier for streaming speech-to-text. The model achieves a 4.5% word error rate at 0.05 seconds latency for final transcripts. At $2 per 1,000 minutes, it is the lowest-priced proprietary streaming model tested, supporting over 60 languages with real-time translation and identification.

Artificial Analysis · Jun 17

Artificial Analysis Ranks Z.ai's GLM-5.2 as Leading Open-Weights Model

Artificial Analysis ranks Z.ai's GLM-5.2 as the leading open-weights model on its Intelligence Index v4.1 with a score of 51. The model achieves parity with proprietary frontier models on agentic tasks and sits on the Pareto frontier for intelligence versus cost per task. It represents an 11-point improvement over GLM-5.1 despite maintaining the same parameter size.

Artificial Analysis · Jun 16

Artificial Analysis Updates Intelligence Index with Agentic Benchmarks and Metrics

Artificial Analysis released a video walkthrough of its Intelligence Index v4.1, detailing three upgraded agentic evaluations: Terminal-Bench 2.1, τ³-Bench Banking, and GDPval-AA v2. The update introduces new per-task metrics for cost, time, and output tokens, alongside reporting for cached input tokens to provide transparency into the real-world costs of running benchmark tasks.

Artificial Analysis · Jun 15

Artificial Analysis Launches AA-AgentPerf Benchmark for Agentic Inference Workloads

Artificial Analysis launched AA-AgentPerf, the first benchmark measuring agentic inference performance using real coding trajectories. The benchmark’s lead metric, Agents per Megawatt, evaluates concurrent agent capacity at production service levels. Initial results for DeepSeek V4 Pro show NVIDIA’s rack-scale GB300 system sustains 61,354 agents per megawatt, significantly outperforming single-node Blackwell and Hopper configurations in power efficiency.

Artificial Analysis · Jun 15

Artificial Analysis Updates Coding Agent Index with Contamination-Proof DeepSWE Benchmark

Artificial Analysis updated its Coding Agent Index to v1.1, replacing the gameable SWE-Bench Pro with Datacurve's DeepSWE. DeepSWE tasks are written from scratch to prevent training data contamination. The refreshed leaderboard ranks Claude Code with Fable 5 at 77, followed by Codex with GPT-5.5 at 76 and Claude Code with Opus 4.8 at 73.

Artificial Analysis · Jun 15

Artificial Analysis Ranks Ideogram 4.0 #8 on Open-Weights Image Leaderboard

Artificial Analysis ranked Ideogram 4.0 as the eighth-best model on its Open Weights Text-to-Image Leaderboard. The model, Ideogram's first open-weights release, excels in design, layout, and text rendering but performs lower on photorealism and anime benchmarks. It is available via API starting at $30 per thousand images, with weights downloadable for non-commercial use.

Artificial Analysis · Jun 15

Artificial Analysis Benchmarks Guardrail Models for Safety and Latency

Artificial Analysis, in partnership with NVIDIA, benchmarked 19 guardrail and moderation models across three open datasets. The analysis reveals no single winner, highlighting a critical tradeoff between catching unsafe content and over-refusing safe inputs. Models cluster into permissive or restrictive categories, with NVIDIA’s Nemotron 3.5, Alibaba’s Qwen3Guard 8B, and AI2’s WildGuard defining the current quality-latency frontier.

Artificial Analysis · Jun 15

Artificial Analysis Ranks Claude Fable 5 #1 on Intelligence Index

Artificial Analysis ranks Anthropic’s Claude Fable 5 first on its Intelligence Index with a score of 64.9, leading the next-best model by nearly 5 points. The model introduces safety guardrails that route flagged queries to Claude Opus 4.8 and achieves frontier performance across agentic benchmarks, including GDPval-AA and Terminal-Bench Hard.

Artificial Analysis · Jun 15

Artificial Analysis Ranks HiDream-O1-Image-1.5 Third on Image Leaderboard

Artificial Analysis ranked HiDream-O1-Image-1.5 third on its Text to Image Leaderboard with an Elo score of 1,264. The model, built on HiDream’s Unified Transformer architecture, now sits behind only OpenAI’s image models. It is available on the HiHarness and Vivago platforms, priced at $80 per 1,000 images.

Artificial Analysis · Jun 10

Cohere Releases North Mini Code, a Small Open-Weight Model for Coding

Cohere released North Mini Code, a small 30B-parameter (3B active) open-weights coding model. The model achieves competitive coding performance for its size and speed, positioning it as a focused option in the open-weight ecosystem.

Artificial Analysis · Jun 10

Anthropic Releases Claude Fable 5, Tops Agentic Work Benchmark with Safeguards

Anthropic has released Claude Fable 5, its first publicly available Mythos-class model, which ranks #1 on Artificial Analysis's GDPval-AA benchmark. This model includes new security guardrails for high-risk domains and a fallback mechanism to Claude Opus 4.8, setting a new standard for capable and responsibly scaled AI.

Artificial Analysis · Jun 6

Artificial Analysis Benchmarks Google's Gemma 4 12B Transcription at 8.8% WER

Artificial Analysis benchmarked Google DeepMind's new open-weight Gemma 4 12B model for transcription, reporting an 8.8% Word Error Rate (WER). This places the model behind specialized open-weight transcription solutions, but it is available for local deployment alongside Google's new Eloquent dictation app.
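
For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below shows that textbook definition only. The text normalization Artificial Analysis applies before scoring is not described in the source, so the whitespace tokenization here is an assumption, and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

On this definition, an 8.8% WER corresponds to roughly nine word-level errors per hundred reference words.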

Artificial Analysis · Jun 4

Artificial Analysis Ranks Nemotron 3 Ultra Fastest for Agentic Tasks

Artificial Analysis evaluated NVIDIA's newly launched Nemotron 3 Ultra, finding it completes agentic tasks significantly faster than peers due to high inference speed. The model achieves competitive performance on Terminal-Bench v2.1, positioning it as a leading option for efficient autonomous AI workflows.

Artificial Analysis · Jun 4

Artificial Analysis finds Step 3.7 Flash sets a new speed intelligence frontier

Artificial Analysis has released independent benchmarking for StepFun's Step 3.7 Flash, confirming the model delivers over 412 output tokens per second. The results place the open-weights model on the Pareto frontier for speed versus intelligence, showing significant gains in autonomous agentic tasks.

Artificial Analysis · Jun 4

Alibaba Fun-Realtime-TTS claims top spot on Speech Arena leaderboard

Alibaba's latest text-to-speech model has reached #1 on the Artificial Analysis Speech Arena, surpassing Google's Gemini. The model delivers high-fidelity real-time audio with native support for regional accents and voice cloning at a competitive price point.

Artificial Analysis · Jun 2

Microsoft MAI-Transcribe-1.5 delivers top tier accuracy at 276x real time speed

Microsoft has released MAI-Transcribe-1.5, a speech-to-text model that ranks third for accuracy while processing audio at 276x real-time speed. The model leads the accuracy-speed Pareto frontier, offering a high-performance alternative for high-volume enterprise audio workloads.

Artificial Analysis · Jun 1

NVIDIA Cosmos 3 takes top open weights rank with agentic reasoning

NVIDIA's Cosmos 3 Super models have reached #1 on the Artificial Analysis open-weights leaderboards for both image and video generation. The system uses a reasoning-based architecture to refine prompts before generating high-fidelity visual content.

Artificial Analysis · Jun 1

NVIDIA Nemotron 3 Ultra Claims Top US Open Weights Intelligence Spot

NVIDIA released Nemotron 3 Ultra, a 550B-parameter model that leads US open-weights benchmarks with an intelligence score of 48. The model delivers high-throughput performance exceeding 300 tokens per second, significantly outpacing similarly sized frontier models from China.

Artificial Analysis · Jun 1

Artificial Analysis Ranks xAI Grok Imagine Quality in Top Five

Artificial Analysis has ranked xAI's high-fidelity image model as the leading alternative to OpenAI and Google offerings. The model delivers top-tier visual quality and editing capabilities at a significantly lower price point than its primary competitors.