Architecture

Quantization

Compressing model weights to lower numerical precision to reduce memory use and speed up inference.

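Quantization reduces the bit-width of model weights from 32-bit or 16-bit floating point to 8-bit integers (INT8) or even 4-bit integers (INT4). Because each weight occupies fewer bits, the memory footprint shrinks roughly in proportion: INT8 weights take about a quarter of the space of 32-bit weights. Throughput typically improves as well, usually with only a small loss in output quality. Quantized models power consumer-grade local inference tools such as llama.cpp and Ollama, and some cloud providers use INT8 inference to reduce costs.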
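
To make the idea concrete, here is a minimal sketch of one simple scheme: symmetric, per-tensor INT8 quantization, in which a single scale factor maps the largest absolute weight to 127. It is an illustration under assumptions, not how llama.cpp, Ollama, or any particular provider implements quantization; the function names quantize_int8 and dequantize_int8 and the random test weights are hypothetical.

```python
from array import array
import random

def quantize_int8(weights):
    """Symmetric per-tensor quantization (illustrative, hypothetical helper)."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor: the largest magnitude maps to 127.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round each weight to the nearest integer step and clamp to [-127, 127].
    q = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate 32-bit float weights from the INT8 values."""
    return array("f", (v * scale for v in q))

# Made-up weights standing in for one layer of a model.
random.seed(0)
weights = array("f", (random.gauss(0.0, 0.02) for _ in range(4096)))

q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

fp32_bytes = len(weights) * weights.itemsize
int8_bytes = len(q) * q.itemsize
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"FP32: {fp32_bytes} bytes, INT8: {int8_bytes} bytes")
print(f"max round-trip error: {max_error:.6f} (scale = {scale:.6f})")
```

The INT8 copy uses a quarter of the bytes of the 32-bit original, and the round-trip error of each weight is bounded by about half of one quantization step (scale / 2). Practical schemes often use finer-grained scales, for example per channel or per block, to keep that error smaller.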

Related Terms