ML//benchmark//BIG-Bench
- Google's massive collaborative benchmark: 200+ tasks spanning reasoning, translation, QA, math, logic.
BIG-Bench Hard (BBH): a 23-task subset on which earlier models failed to beat the average human rater; multi-step reasoning focus.
So broad that aggregate scores obscure weaknesses. Useful for finding blind spots, not ranking models.
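
A minimal sketch of why a per-task breakdown is more informative than a single aggregate. The task names, the scores, the `weak_tasks` helper, and the 0.5 threshold are all hypothetical, made up for illustration; they are not real BIG-Bench results or part of any BIG-Bench tooling.

```python
from statistics import mean

# Hypothetical per-task accuracies, invented for illustration only.
# These are NOT real BIG-Bench results.
task_scores = {
    "translation": 0.91,
    "qa": 0.88,
    "math": 0.84,
    "logic": 0.35,
}


def weak_tasks(scores, threshold=0.5):
    """Return (task, score) pairs below the threshold, worst first."""
    below = [(name, s) for name, s in scores.items() if s < threshold]
    return sorted(below, key=lambda pair: pair[1])


# A single number averaged over all tasks.
aggregate = mean(task_scores.values())

# The per-task view: which tasks fall below the (arbitrary) threshold.
blind_spots = weak_tasks(task_scores)

print(f"aggregate: {aggregate:.3f}")
for name, score in blind_spots:
    print(f"blind spot: {name} ({score:.2f})")
```

With these made-up numbers the aggregate is roughly 0.75, which looks respectable, while the logic task sits at 0.35. Only the per-task view surfaces that gap.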