ML//benchmark//SWE-Bench

2026-03-04

- Real GitHub issues from popular Python repos: model must write a patch that passes the test suite.

Real GitHub issues from popular Python repos: model must write a patch that passes the test suite.

Tests agent capabilities: planning, code understanding, debugging, not just generation.

SWE-Bench Verified: human-validated subset. Adopted by Anthropic, Devin, OpenAI as the de-facto agent benchmark.