ML//benchmark//SWE-Bench
Real GitHub issues from popular Python repos — model must write a patch that passes the test suite.
Tests agent capabilities: planning, code understanding, debugging — not just generation.
SWE-Bench Verified: human-validated subset of 500 tasks. Adopted by Anthropic, Cognition (Devin), and OpenAI as the de facto agent benchmark.
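
The "patch must pass the test suite" criterion can be sketched roughly as follows. This is a minimal illustration, not the official evaluation harness: it assumes each task lists tests the fix should turn green (fail-to-pass) and tests that must stay green (pass-to-pass), and that the patched repo's test outcomes are already available as a name-to-bool mapping. `TaskInstance`, `is_resolved`, and `resolve_rate` are hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class TaskInstance:
    """Hypothetical container for one benchmark task (names are illustrative)."""
    instance_id: str
    fail_to_pass: list[str] = field(default_factory=list)  # tests the fix should make pass
    pass_to_pass: list[str] = field(default_factory=list)  # tests that must not regress


def is_resolved(task: TaskInstance, test_results: dict[str, bool]) -> bool:
    """A task counts as resolved only if every required test passed.

    A test missing from test_results is treated as a failure.
    """
    required = task.fail_to_pass + task.pass_to_pass
    return all(test_results.get(name, False) for name in required)


def resolve_rate(tasks: list[TaskInstance],
                 results_by_id: dict[str, dict[str, bool]]) -> float:
    """Fraction of tasks whose patch satisfied all required tests."""
    if not tasks:
        return 0.0
    resolved = sum(
        is_resolved(task, results_by_id.get(task.instance_id, {}))
        for task in tasks
    )
    return resolved / len(tasks)


if __name__ == "__main__":
    tasks = [
        TaskInstance("demo-1", ["test_bug_fixed"], ["test_existing"]),
        TaskInstance("demo-2", ["test_other_bug"], ["test_existing"]),
    ]
    results = {
        "demo-1": {"test_bug_fixed": True, "test_existing": True},
        # demo-2's patch fixes the bug but breaks an existing test.
        "demo-2": {"test_other_bug": True, "test_existing": False},
    }
    print(resolve_rate(tasks, results))
```

Treating a regression as a failure is the point of the second list: a patch that fixes the issue but breaks existing behavior does not count as resolved.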