HeadsUpAI

Arena.ai Data Shows Open Source Models Have Mostly Closed the Proprietary Gap


Arena.ai tracked three years of human preference data and found that open-source models have mostly closed the performance gap with proprietary systems. The +250-point lead once held by closed-source models in the Text Arena has collapsed to about +30 points, a margin that separates rank #1 from roughly rank #18.

While open-source models like DeepSeek V4 Pro have reached parity on general tasks, "Expert" prompts remain the hardest category for them. Proprietary models maintain a +40-point lead here, representing the distance between rank #1 and rank #8. The gap briefly flipped in open source's favor in early 2025, but proprietary labs have since regained the lead.

For general applications, the performance difference is now marginal, suggesting high-level reasoning is becoming commoditized. However, proprietary models still offer superior consistency for complex tasks, a trend seen in recent GPT-5.5 leaderboard rankings. You can view the full historical data and filter by use cases online.
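To put the point leads quoted above in perspective, here is a minimal sketch that converts a score gap into an expected head-to-head win rate. It assumes Arena scores follow a standard Elo-style logistic scale (base 10, 400-point scale); the article does not state this, and the function name `expected_win_rate` is purely illustrative.

```python
def expected_win_rate(point_gap: float, scale: float = 400.0) -> float:
    """Expected win probability of the higher-rated model.

    Assumption: ratings sit on an Elo-style logistic curve with base 10
    and a 400-point scale. This is not confirmed by the article.
    """
    return 1.0 / (1.0 + 10.0 ** (-point_gap / scale))


# Leads quoted in the article: the former Text Arena lead (+250),
# the current Expert-prompt lead (+40), and the current Text Arena lead (+30).
for gap in (250, 40, 30):
    print(f"+{gap} points -> {expected_win_rate(gap):.1%} expected win rate")
```

Under that assumption, a +250 lead implies the stronger model is preferred in roughly four out of five matchups, while +30 or +40 is only a few percentage points better than a coin flip. That is consistent with the article's description of the remaining gap as small in points but still large in rank.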

Arena.ai (@arena) on X:

Have open source models closed the gap with proprietary ones? We've tracked three years of Arena data across three arenas. The short answer: mostly yes.

In Text Arena, the proprietary winner had a +250 Arena lead. By early 2025, it had fallen to low double digits, and at its narrowest was almost closed entirely. Today, the proprietary lead is about +30 points. It separates #1 from roughly #18 on the current leaderboard.

- Open source has quickly closed most of the gap
- The biggest gains happened before 2025
- The remaining gap is small in points, but still large in rank

Get a deeper look into the race for Code Arena: Frontend and Expert prompts in the thread 🧵


Still wondering? A few quick answers below.

According to three years of Arena data, the gap has narrowed significantly. In the Text Arena, the lead held by proprietary models dropped from 250 points to roughly 30 points today. Although that point difference is small, it still separates the top model at #1 from the best open-source alternative at roughly #18 on the current leaderboard.

Expert prompts remain the most difficult challenge for open-source models. Proprietary systems currently maintain a 40-point lead in this category, which is the distance between the first and eighth ranks. While open-source models have shown they can reach the top of the leaderboard for hard prompts, proprietary models have been more consistent at maintaining the number one spot.

Open-source models have briefly taken the lead in specific categories. In early 2025, the DeepSeek R1 model moved ahead of proprietary competitors on expert-level prompts, turning a narrow gap into a short-lived open-source lead. However, proprietary models quickly regained the top position and have generally remained more consistent in holding the lead on the toughest challenges.

Arena data shows that the majority of the progress made by open-source models happened before 2025. During that period, the massive 250-point lead held by proprietary winners in the Text Arena fell to low double digits. By early 2025, the gap had narrowed to its closest point, nearly closing entirely before proprietary models established their current 30-point lead.
