Benchmarks

Perplexity

A metric of how well a language model predicts a sample of text; lower is better.

Perplexity measures how surprised a model is by a held-out test set: it is the exponential of the average negative log-likelihood per test token. Lower perplexity indicates better language modeling. It is used mainly to compare base models on raw language modeling quality, but it is a weak predictor of downstream task performance, which is why task-specific benchmarks such as MMLU and HumanEval are preferred for judging what a model can actually do.
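As a rough illustration of the definition above, the sketch below computes perplexity from per-token log-probabilities. The function name `perplexity` and the sample probabilities are made up for the example; in practice the log-probabilities would come from the model being evaluated.

```python
import math

def perplexity(token_log_probs):
    """Exponentiated average negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Hypothetical probabilities a model assigned to each of four test tokens.
probs = [0.5, 0.25, 0.5, 0.25]
log_probs = [math.log(p) for p in probs]

ppl = perplexity(log_probs)
print(f"perplexity = {ppl:.3f}")
```

One way to read the result: a model that assigns probability 1/k to every token has perplexity k, so perplexity can be thought of as the effective number of equally likely choices the model is weighing at each step.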

Related Terms

MMLU

Massive Multitask Language Understanding: a benchmark testing knowledge across 57 academic subjects.

LLM (Large Language Model)

A neural network trained on large volumes of text to generate content.

Pre-Training

The initial large-scale training phase where a model learns language from massive text corpora.

Tokenizer

The algorithm that converts raw text into a sequence of tokens for a language model.