ML//benchmark
Standardized tests for AI: MMLU, HumanEval, GSM8K, ARC, HellaSwag.
The scoreboard that drives development — and Goodhart's favorite target.
Test-set contamination of training data, overfitting to test format, and cherry-picked results are all real problems.
Still the best imperfect tool we have for comparing models.
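
To make the contamination problem concrete, here is a minimal sketch of an n-gram overlap check, assuming whitespace tokenization and an in-memory training corpus. The function names, the 8-token window, and the exact-match rule are illustrative assumptions, not the method of any particular benchmark or lab.

```python
def ngrams(text, n):
    """Return the set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_items, training_corpus, n=8):
    """Fraction of test items sharing at least one n-gram with the corpus.

    Hypothetical helper: real checks typically normalize text more
    carefully and use an index instead of an in-memory set.
    """
    corpus_grams = set()
    for doc in training_corpus:
        corpus_grams |= ngrams(doc, n)
    # An item is flagged if any of its n-grams appears verbatim in the corpus.
    flagged = [item for item in test_items if ngrams(item, n) & corpus_grams]
    return len(flagged) / len(test_items) if test_items else 0.0


if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog near the river bank today"]
    tests = [
        "Q: the quick brown fox jumps over the lazy dog near what",
        "what is two plus two",
    ]
    print(contamination_rate(tests, corpus))
```

A nonzero rate only flags verbatim overlap; paraphrased or translated test items would slip past a check like this.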