ML//benchmark//GPQA
- Graduate-Level Google-Proof Q&A: 448 multiple-choice questions in biology, physics, and chemistry, hard enough that experts with or pursuing a PhD in the field score ~65%, while skilled non-experts reach only ~34% even with unrestricted web access.
GPQA Diamond: the 198-question highest-quality subset (both expert validators answered correctly, most non-experts did not); the variant frontier labs usually report.
Often reported alongside MMLU-Pro as a measure of deep expert-level reasoning.
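
The reported number is plain multiple-choice accuracy, so it helps to read it against the 25% chance level of a four-option question. A minimal sketch of that scoring, assuming answers are stored as option letters; the function name and the toy data are hypothetical and not taken from any official GPQA harness:

```python
from typing import Sequence

# Assumption: four answer options per question, so random guessing gets 25%.
CHANCE_ACCURACY = 0.25


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the fraction of questions where the predicted letter matches the key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Hypothetical toy data, not real GPQA items or model outputs.
toy_answers = ["A", "C", "B", "D"]
toy_predictions = ["A", "C", "D", "D"]
toy_score = accuracy(toy_predictions, toy_answers)
margin_over_chance = toy_score - CHANCE_ACCURACY
```

On a subset as small as Diamond, each question moves the score by about half a percentage point, so small gaps between models are within noise.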