ML//benchmark

- Standardized tests for AI: MMLU, HumanEval, GSM8K, ARC, HellaSwag.


The scoreboard that drives development — and Goodhart's favorite target.

Test-set contamination, overfitting to the test format, and cherry-picked results are all real problems.
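
Contamination is the one problem here that can be probed mechanically. A rough sketch, assuming a simple word-level n-gram overlap check: the function names, the whitespace tokenization, and the default `n=8` are illustrative choices, not a standard taken from any of the benchmarks above.

```python
def ngrams(text, n):
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=8):
    """Fraction of benchmark items that share at least one n-gram with the training docs."""
    # Collect every n-gram seen anywhere in the (hypothetical) training corpus.
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)

    # An item is flagged if any of its n-grams also appears in the training data.
    flagged = [item for item in benchmark_items if ngrams(item, n) & train_grams]
    return len(flagged) / len(benchmark_items) if benchmark_items else 0.0
```

A nonzero rate only flags items worth inspecting by hand. Paraphrased or translated leaks share no exact n-grams and will slip past a check like this.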

Still the best imperfect tool we have for comparing models.