Benchmarks

HumanEval

A coding benchmark measuring a model's ability to write correct Python functions from a function signature and docstring.

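HumanEval is a benchmark of 164 hand-written Python programming problems. For each problem, the model is given a function signature and docstring and must generate code that passes automated unit tests. It has been one of the most widely reported measures of coding ability for LLMs. SWE-bench is a separate, harder benchmark (not a variant of HumanEval) that tests whether a model can resolve real-world GitHub issues. Top frontier models now score above 90% on HumanEval.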
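To make "passes automated unit tests" concrete, here is a minimal sketch of how a completion might be checked. The problem, the tests, and the `passes` helper are all hypothetical illustrations, not taken from the benchmark or its official harness, and a real evaluation would run generated code in a sandbox rather than calling `exec` directly.

```python
# Hypothetical problem in the style described above: the prompt is a function
# signature plus docstring, and unit tests decide whether a completion passes.
PROMPT = '''def add(a, b):
    """Return the sum of a and b."""
'''

TESTS = '''def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
'''


def passes(prompt: str, completion: str, tests: str, entry_point: str) -> bool:
    """Return True if prompt + completion satisfies every assertion in tests.

    Assumption: this runs code in-process for illustration only. Executing
    model-generated code this way is unsafe outside a sandbox.
    """
    namespace = {}
    try:
        exec(prompt + completion, namespace)        # define the candidate function
        exec(tests, namespace)                      # define check()
        namespace["check"](namespace[entry_point])  # run the unit tests
        return True
    except Exception:
        # Any failure (syntax error, wrong answer, runtime error) counts as a miss.
        return False


good = "    return a + b\n"  # a correct completion
bad = "    return a - b\n"   # an incorrect completion

results = [passes(PROMPT, c, TESTS, "add") for c in (good, bad)]
score = sum(results) / len(results)  # fraction of completions that passed
```

A benchmark score such as "above 90%" is this kind of pass rate, computed over all problems instead of the two toy completions used here.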

Related Terms

MMLU

Massive Multitask Language Understanding — a benchmark testing knowledge across 57 academic subjects.

LLM

Large Language Model — a neural network trained on vast text corpora to generate human-like text.