Benchmarks

HumanEval

A coding benchmark measuring a model's ability to write correct Python functions from docstrings.

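HumanEval is a benchmark of 164 hand-written Python programming problems. For each problem the model receives a function signature and docstring and must generate code that passes automated unit tests. It has been one of the most widely reported measures of coding ability for LLMs. SWE-bench is a separate, harder benchmark (not a variant of HumanEval) that tests whether a model can resolve real-world GitHub issues. Top frontier models now score above 90% on HumanEval.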
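The pass/fail check described above can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation harness: the helper `passes_tests` and the `add` problem are hypothetical and not taken from the dataset, and a real harness would run untrusted model output in a sandbox with timeouts rather than calling `exec` directly.

```python
def passes_tests(prompt: str, completion: str, test_code: str) -> bool:
    """Return True if prompt + completion passes every assertion in test_code.

    Hypothetical helper for illustration only. It runs code with exec(),
    which is unsafe for untrusted model output.
    """
    namespace = {}
    try:
        exec(prompt + completion, namespace)  # define the candidate function
        exec(test_code, namespace)            # run the unit tests against it
    except Exception:
        # Any syntax error, runtime error, or failed assertion counts as a failure.
        return False
    return True


# Hypothetical problem in the HumanEval style: signature + docstring as the prompt.
PROMPT = 'def add(a, b):\n    """Return the sum of a and b."""\n'
TEST = "assert add(1, 2) == 3\nassert add(-1, 1) == 0\n"

# Two made-up model completions: one correct, one buggy.
good = "    return a + b\n"
bad = "    return a - b\n"

results = [passes_tests(PROMPT, c, TEST) for c in (good, bad)]
pass_rate = sum(results) / len(results)
print(results, pass_rate)
```

A benchmark score is then the fraction of problems for which a generated solution passes all of its tests.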

Related Terms

MMLU

Massive Multitask Language Understanding: a benchmark testing knowledge across 57 academic subjects.

LLM (Large Language Model)

An artificial neural network trained on large amounts of text data that can generate human-like text.