Quantization
Compressing model weights to lower numerical precision to reduce memory and speed up inference.
Quantization reduces the bit-width of model weights from 32-bit or 16-bit floating point to 8-bit integers (INT8) or even 4-bit (INT4). This shrinks memory footprint and increases throughput with minimal quality loss. Quantized models power consumer-grade local inference tools (llama.cpp, Ollama) and some cloud providers use INT8 inference to reduce costs.
Verwandte Begriffe
The process of running a trained model to generate outputs from new inputs.
The number of tokens or requests a model can process per second.
The time between sending a request and receiving the first token of a response.
A learnable weight in a neural network; model size is measured in billions of parameters.