ML//benchmark//SWE-Bench

Real GitHub issues from popular Python repos: given the issue text and the repo, the model must write a patch that makes the test suite pass.
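A minimal sketch of the pass/fail check, assuming the usual split into tests the fix must turn green (`fail_to_pass`) and tests that must stay green (`pass_to_pass`). `TaskInstance`, `is_resolved`, and `resolve_rate` are hypothetical names for illustration, not the official harness API, and the real harness also handles checking out the repo, applying the patch, and running tests in an isolated environment.

```python
from dataclasses import dataclass


@dataclass
class TaskInstance:
    # Hypothetical, simplified record for one benchmark task.
    instance_id: str
    fail_to_pass: list[str]  # tests that fail before the fix and must pass after it
    pass_to_pass: list[str]  # tests that already pass and must keep passing


def is_resolved(task: TaskInstance, results: dict[str, bool]) -> bool:
    """Return True if every required test passed after applying the model's patch.

    `results` maps test name -> whether it passed. A test missing from
    `results` (e.g. it never ran) is counted as a failure.
    """
    required = task.fail_to_pass + task.pass_to_pass
    return all(results.get(name, False) for name in required)


def resolve_rate(tasks: list[TaskInstance],
                 all_results: dict[str, dict[str, bool]]) -> float:
    """Fraction of tasks resolved; 0.0 when there are no tasks."""
    if not tasks:
        return 0.0
    resolved = sum(
        is_resolved(task, all_results.get(task.instance_id, {}))
        for task in tasks
    )
    return resolved / len(tasks)
```

The all-or-nothing check per task is a deliberate choice in this sketch: a patch that fixes the issue but breaks an existing test does not count as resolved.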

Tests agent capabilities (planning, code understanding, debugging), not just code generation.

SWE-Bench Verified: human-validated subset. Adopted by Anthropic, Cognition (Devin), and OpenAI as the de facto agent benchmark.