Architecture

Quantization

Compressing model weights to lower numerical precision to reduce memory and speed up inference.

Quantization reduces the bit-width of model weights from 32-bit or 16-bit floating point to 8-bit integers (INT8) or even 4-bit integers (INT4). This shrinks the memory footprint and can increase throughput, usually at a small cost in output quality that grows as the bit-width drops. Quantized models power consumer-grade local inference tools such as llama.cpp and Ollama, and some cloud providers use INT8 inference to reduce costs.
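To make the idea concrete, here is a minimal sketch of one common scheme: symmetric per-tensor INT8 quantization, in which a single scale factor maps floats to integers in [-127, 127]. It is an illustration only, not how llama.cpp, Ollama, or any particular provider implements quantization. The function names, the sample weights, and the 7-billion-parameter model size are all hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization (illustrative sketch).

    Maps each float to an integer in [-127, 127] using one shared scale.
    """
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


def weight_memory_gb(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical weights, chosen only for illustration.
weights = [0.4, -1.0, 0.2, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# Weight storage for a hypothetical 7-billion-parameter model.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

The restored weights differ from the originals by at most half a quantization step, which is where the quality loss comes from. The memory estimate counts weights only; activations, caches, and the stored scale factors add to the real footprint.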

Related Terms

Inference

The process of running a trained model to generate outputs from new inputs.

Throughput

The number of tokens or requests a model can process per second.

Latency

The time between sending a request and receiving the first token of a response.

Parameter

A learnable weight in a neural network; the size of large models is commonly stated in billions of parameters.