ML//benchmark//BIG-Bench

2026-03-04

Google-led collaborative benchmark (Beyond the Imitation Game), with tasks contributed by many outside institutions: 200+ tasks spanning reasoning, translation, QA, math, and logic.

BIG-Bench Hard (BBH): the 23-task subset where earlier models failed to beat the average human rater (multi-step reasoning focus).

So broad that a single aggregate score hides task-level weaknesses. More useful for finding blind spots than for ranking models.
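
Rough sketch of why the aggregate misleads. Everything here is hypothetical: the category names, the scores (invented, not real BIG-Bench results), the 50-point threshold, and the helpers `aggregate` and `blind_spots`. Two models tie on the mean, but only the per-task view shows where one of them collapses.

```python
# Invented per-category accuracies in percent. NOT real benchmark results.
model_a = {"logic": 90, "translation": 90, "math": 30, "qa": 70}
model_b = {"logic": 70, "translation": 70, "math": 70, "qa": 70}


def aggregate(scores: dict[str, int]) -> float:
    """Unweighted mean over tasks (one possible aggregate, assumed here)."""
    return sum(scores.values()) / len(scores)


def blind_spots(scores: dict[str, int], threshold: int) -> list[str]:
    """Tasks scoring below the threshold, worst first."""
    weak = [task for task, score in scores.items() if score < threshold]
    return sorted(weak, key=lambda task: scores[task])


if __name__ == "__main__":
    for name, scores in (("model_a", model_a), ("model_b", model_b)):
        print(name, aggregate(scores), blind_spots(scores, threshold=50))
```

Same aggregate for both, so a leaderboard would call it a tie. The per-task breakdown is what flags `math` for `model_a`.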