How to build a better AI benchmark

MIT Technology Review·2025-05-08 20:00

It’s not easy being one of Silicon Valley’s favorite benchmarks. 

SWE-Bench (pronounced “swee bench”) launched in November 2023 to evaluate an AI model’s coding skill, using more than 2,000 real-world programming problems pulled from the public GitHub repositories of 12 different Python-based projects. 

In the months since then, it has quickly become one of the most popular tests in AI. A SWE-Bench score is now a mainstay of major model releases from OpenAI, Anthropic, and Google. Outside the foundation models, the fine-tuners at AI firms are in constant competition to see who can rise above the pack. The top of the leaderboard is a pileup of three different fine-tunings of Anthropic’s Claude Sonnet model and Amazon’s Q developer agent. Auto Code Rover, one of the Claude modifications, nabbed the number two spot in November and was acquired just three months later.

……
