Benchmarks

HumanEval

A coding benchmark measuring a model's ability to write correct Python functions from docstrings.

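HumanEval is a benchmark of 164 hand-written Python programming problems, introduced by OpenAI in 2021. For each problem the model is given a function signature and docstring and must generate a function body that passes the problem's automated unit tests. Results are usually reported as pass@k, the fraction of problems solved by at least one of k sampled completions. It is one of the most widely reported measures of coding ability for LLMs, and top frontier models now score above 90% on it (pass@1). SWE-bench is a separate, harder benchmark rather than a variant of HumanEval: it tests whether a model can resolve real-world GitHub issues.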
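As a rough illustration of what "passes automated unit tests" means here, the sketch below checks one candidate completion against one problem. It assumes each problem supplies a prompt (signature plus docstring), a test snippet defining `check(candidate)`, and an entry-point name, which mirrors the published dataset's fields. The helper `passes_tests` and the toy `add` problem are hypothetical and not part of the benchmark. It also runs code with a bare `exec`, whereas a real harness would sandbox untrusted model output and enforce timeouts.

```python
def passes_tests(prompt: str, completion: str, test_code: str, entry_point: str) -> bool:
    """Return True if prompt + completion passes the problem's unit tests.

    Hypothetical helper for illustration only. It runs code with exec(),
    which is unsafe for untrusted model output.
    """
    # Assemble one program: the function under test, the tests, and the call
    # that runs the tests against the named function.
    program = f"{prompt}{completion}\n{test_code}\ncheck({entry_point})\n"
    namespace = {}
    try:
        exec(program, namespace)
    except Exception:
        # Any failed assertion, runtime error, or syntax error counts as a failure.
        return False
    return True


# Toy problem (an assumption for this sketch, not taken from the benchmark).
PROMPT = 'def add(a, b):\n    """Return the sum of a and b."""\n'
TEST = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(-1, 1) == 0\n"
)

GOOD_COMPLETION = "    return a + b\n"
BAD_COMPLETION = "    return a - b\n"

# One sampled completion per problem; the score is the fraction that pass.
results = [
    passes_tests(PROMPT, GOOD_COMPLETION, TEST, "add"),
    passes_tests(PROMPT, BAD_COMPLETION, TEST, "add"),
]
pass_rate = sum(results) / len(results)
```

With one sample per problem, the fraction of problems whose completion passes is the headline percentage that gets reported for a model.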

Related Terms