ML//benchmark

2023-06-15

Standardized tests for AI: MMLU, HumanEval, GSM8K, ARC, HellaSwag.

The scoreboard that drives development, and Goodhart's favorite target: once a score becomes the goal, it stops measuring what it was meant to.

Test-set contamination, overfitting to the test format, cherry-picked results. All real problems.
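
Contamination is the one problem here that can be roughly checked in code. A minimal sketch, assuming whitespace tokenization and a word n-gram overlap heuristic; the names `ngrams` and `contamination_rate` and the default `n=8` are illustrative choices, not a standard API or an agreed threshold:

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text (whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_items, training_docs, n=8):
    """Fraction of test items sharing at least one n-gram with the training docs.

    Assumption: any shared n-gram flags the item. This is a crude heuristic
    that misses paraphrases and can flag common boilerplate phrases.
    """
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = [item for item in test_items if ngrams(item, n) & train_grams]
    return len(flagged) / len(test_items) if test_items else 0.0


if __name__ == "__main__":
    training_docs = ["the quick brown fox jumps over the lazy dog"]
    test_items = [
        "A quick brown fox appears",
        "completely unrelated question here",
    ]
    print(contamination_rate(test_items, training_docs, n=3))
```

A nonzero rate only says that some test text appears verbatim in the training data. It says nothing about the other two problems, which are reporting practices rather than data properties.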

Still the best imperfect tool we have for comparing models.