Benchmarks

MMLU

Massive Multitask Language Understanding — a benchmark testing knowledge across 57 academic subjects.

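MMLU evaluates a model's factual knowledge and reasoning with four-option multiple-choice questions across 57 subjects, including mathematics, history, medicine, law, and computer science. It is one of the most widely cited benchmarks for comparing frontier models. Random guessing yields about 25%, and the benchmark's authors estimated expert-level human accuracy at roughly 90%, so scores above 85% are commonly described as approaching expert level. MMLU-Pro is a harder variant with more reasoning-heavy questions and ten answer options instead of four, which makes guessing less effective.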
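As a rough illustration of how a multiple-choice benchmark like MMLU is typically scored, the sketch below computes per-subject accuracy plus two overall averages. The function name `score_multiple_choice`, the record layout, and the toy predictions are assumptions made for this example; they are not taken from the official MMLU evaluation code or from real results.

```python
from collections import defaultdict


def score_multiple_choice(records):
    """Score (subject, predicted_letter, correct_letter) records.

    Returns per-subject accuracy, micro average (over all questions),
    and macro average (mean of the per-subject accuracies).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, answer in records:
        total[subject] += 1
        correct[subject] += int(predicted == answer)

    per_subject = {s: correct[s] / total[s] for s in total}
    micro = sum(correct.values()) / sum(total.values())
    macro = sum(per_subject.values()) / len(per_subject)
    return per_subject, micro, macro


# Hypothetical toy predictions, not real benchmark data.
toy_records = [
    ("mathematics", "A", "A"),
    ("mathematics", "B", "C"),
    ("history", "D", "D"),
    ("history", "B", "B"),
    ("history", "A", "A"),
    ("history", "C", "D"),
]

per_subject, micro, macro = score_multiple_choice(toy_records)
print(per_subject, micro, macro)
```

The micro and macro averages differ whenever subjects have different numbers of questions, so it is worth checking which one a reported score uses before comparing models.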

Related Terms

HumanEval

A coding benchmark measuring a model's ability to write correct Python functions from docstrings.

Perplexity

A metric of how well a language model predicts a sample of text — lower is better.

LLM

Large Language Model — a neural network trained on vast text corpora to generate human-like text.
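To make the perplexity entry above concrete: perplexity is commonly defined as the exponential of the average negative log-probability a model assigns to each token. The sketch below assumes the per-token probabilities are already available as plain floats; the function name `perplexity` and the sample values are illustrative, not output from any real model.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model gave to each observed token.

    Computed as exp of the mean negative log-probability.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)


# Hypothetical probabilities, chosen only to show the direction of the metric.
confident = perplexity([0.9, 0.8, 0.95])
unsure = perplexity([0.2, 0.1, 0.3])
print(confident, unsure)
```

A model that assigns probability 0.25 to every token has a perplexity of 4, which matches the intuition of being as uncertain as a uniform choice among four options. Higher probabilities on the observed tokens give a lower perplexity.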