ML//benchmark//GPQA

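Graduate-Level Google-Proof Q&A: 448 multiple-choice questions in biology, physics, and chemistry, hard enough that domain experts (people who have or are pursuing a PhD in the field) score ~65%, while skilled non-experts reach only ~34% even with unrestricted internet access.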

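GPQA Diamond: the hardest, highest-quality subset (198 questions that both expert validators answered correctly and most non-experts got wrong), and the score frontier labs typically report.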

Often reported alongside MMLU-Pro as a measure of deep, expert-level reasoning.
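Since GPQA questions are four-option multiple choice, the reported score is plain accuracy, with 25% as the random-guess baseline. A minimal sketch of that scoring follows; the `Item` class, the `accuracy` helper, and the sample answers are hypothetical illustrations, not part of any official GPQA harness.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One scored question (hypothetical structure for illustration)."""
    gold: str       # correct option letter, "A" through "D"
    predicted: str  # option letter the model chose


def accuracy(items):
    """Fraction of items where the predicted letter matches the gold letter."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(
        1 for it in items
        if it.predicted.strip().upper() == it.gold.strip().upper()
    )
    return correct / len(items)


# Random guessing over four options gives this baseline.
CHANCE = 1 / 4

# Made-up answers, only to exercise the function.
items = [Item("A", "A"), Item("C", "b"), Item("D", "D"), Item("B", "B")]
score = accuracy(items)
print(f"accuracy={score:.2%}, chance={CHANCE:.0%}")
```

Comparing a score against `CHANCE` rather than against zero is what makes numbers like ~34% for non-experts meaningful: that is only modestly above guessing.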