Benchmarks
HumanEval
A coding benchmark measuring a model's ability to write correct Python functions from docstrings.
HumanEval is a benchmark of 164 hand-written Python programming problems. For each problem, the model is given a function signature and docstring and must generate code that passes automated unit tests. It has been one of the most widely reported measures of coding ability for LLMs. SWE-bench is a separate, harder benchmark that tests resolution of real-world GitHub issues. Top frontier models now score above 90% on HumanEval.
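To make the "passes automated unit tests" criterion concrete, here is a minimal sketch of how one problem might be scored. The problem below is a made-up toy example, not taken from the dataset; its fields (`prompt`, `entry_point`, `test` with a `check(candidate)` function) mirror the layout of HumanEval problems. The `passes` helper is hypothetical, and a real harness would run completions in a sandbox with timeouts instead of calling `exec` directly.

```python
# Toy problem in the style of a HumanEval task (hypothetical, not from the dataset).
PROBLEM = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "entry_point": "add",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
}


def passes(problem, completion):
    """Return True if prompt + completion passes the problem's unit tests.

    Hypothetical helper for illustration only.
    """
    # The model only writes the function body; the prompt supplies the header.
    program = problem["prompt"] + completion + "\n\n" + problem["test"]
    namespace = {}
    try:
        # WARNING: unsandboxed execution; never do this with untrusted model output.
        exec(program, namespace)
        namespace["check"](namespace[problem["entry_point"]])
        return True
    except Exception:
        # Any syntax error, runtime error, or failed assertion counts as a failure.
        return False


good = "    return a + b\n"  # a correct completion
bad = "    return a - b\n"   # an incorrect completion
results = [passes(PROBLEM, c) for c in (good, bad)]
print(results)  # [True, False]
```

A benchmark score is then the fraction of the 164 problems for which a generated completion passes in this way.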