ML//benchmark//HumanEval

- 164 Python programming problems with unit tests (OpenAI)



Measures pass@k: the probability that at least one of k sampled solutions to a problem passes all of its unit tests, averaged over the problems.
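The note does not say how pass@k is computed. A minimal sketch, assuming the unbiased estimator from the paper that introduced HumanEval (Chen et al., 2021): draw n >= k samples per problem, count the c that pass, and compute 1 - C(n-c, k) / C(n, k). The function names `pass_at_k` and `mean_pass_at_k` are illustrative, not taken from any particular library.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed all unit tests
    k: number of attempts allowed
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, so any
    # draw of k samples must contain a passing one and the result is 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimate reduces to c / n, the fraction of samples that pass.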

More objective than multiple choice — code either passes the tests or it doesn't.
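The pass/fail check can be sketched as follows. This is a toy illustration under stated assumptions, not the benchmark's actual harness: the tests are assumed to define a `check(candidate)` function that raises on failure, and `passes_tests`, `solution_src`, `test_src`, and `entry_point` are hypothetical names. It runs code in the current process with no timeout, so it is only suitable for trusted snippets; a real harness would isolate each run.

```python
def passes_tests(solution_src: str, test_src: str, entry_point: str) -> bool:
    """Return True if the candidate function passes every assertion."""
    namespace: dict = {}
    try:
        exec(solution_src, namespace)  # defines the candidate function
        exec(test_src, namespace)      # defines check(candidate)
        namespace["check"](namespace[entry_point])
    except Exception:
        # Any error (failed assertion, syntax error, missing name) is a fail.
        return False
    return True


# Hypothetical toy problem: implement add(a, b).
tests = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(0, 0) == 0\n"
)
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"

good_result = passes_tests(good, tests, "add")
bad_result = passes_tests(bad, tests, "add")
```

The result is binary per sample, which is what makes the counts n and c for pass@k well defined.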