ML//benchmark//BIG-Bench

Google-led collaborative benchmark: 200+ community-contributed tasks spanning reasoning, translation, QA, math, and logic.

BIG-Bench Hard (BBH): a 23-task subset on which earlier language models scored below the average human rater; the focus is multi-step reasoning.

So broad that a single aggregate score hides task-level weaknesses. More useful for finding blind spots than for ranking models.
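
A minimal sketch of why the aggregate can mislead. The scores, model names, task names, and the 0.5 threshold are all invented for illustration; none of them are real BIG-Bench results or part of any official tooling.

```python
from statistics import mean

# Hypothetical per-task accuracies (made up for this example,
# not real BIG-Bench numbers). Task names are placeholders.
scores = {
    "model_a": {"translation": 0.90, "qa": 0.88, "arithmetic": 0.86, "logic": 0.20},
    "model_b": {"translation": 0.72, "qa": 0.70, "arithmetic": 0.71, "logic": 0.71},
}


def aggregate(task_scores):
    """Unweighted mean over tasks: the single number a leaderboard might show."""
    return mean(task_scores.values())


def blind_spots(task_scores, threshold=0.5):
    """Tasks scoring below an (arbitrary, assumed) threshold."""
    return sorted(task for task, s in task_scores.items() if s < threshold)


for name, task_scores in scores.items():
    print(name, round(aggregate(task_scores), 2), blind_spots(task_scores))
```

Both hypothetical models land on the same rounded aggregate, but only the per-task view shows that one of them collapses on a single task. That is the sense in which the benchmark is better read task by task than as one ranking number.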