Benchmarks

Perplexity

A measure of how well a language model predicts a sample of text; lower is better.

Perplexity measures how surprised a model is by a held-out test set: it is the exponential of the average negative log-likelihood per token of the test text. Lower perplexity indicates better language modeling. It is used mainly to compare base models on raw language modeling quality, but it is a poor predictor of downstream task performance, which is why task-specific benchmarks such as MMLU and HumanEval are preferred for evaluating model capabilities.
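
The definition above can be made concrete with a minimal sketch. It assumes you already have the natural-log probability the model assigned to each test token; the function name `perplexity` and the sample probabilities are hypothetical and chosen only for illustration.

```python
import math


def perplexity(token_log_probs):
    """Return exp of the average negative log-likelihood per token.

    token_log_probs: natural-log probabilities the model assigned to each
    token of the held-out text (assumed input format for this sketch).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical example: the model gives every one of 4 test tokens
# a probability of 0.25.
uniform_probs = [0.25, 0.25, 0.25, 0.25]
ppl = perplexity([math.log(p) for p in uniform_probs])
print(ppl)  # approximately 4
```

In this example the model is as uncertain as if it were choosing uniformly among 4 options at every step, which is one common way to read a perplexity value. A model that assigns higher probability to the actual tokens yields a lower result.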

Related Terms