Latency
The time between sending a request and receiving the first token of a response.
Latency, usually measured for language models as Time to First Token (TTFT), is the delay before output begins. It depends on model size, server load, and prompt length, because the entire prompt must be processed before the first token can be emitted. High latency is tolerable in batch workflows, where no one is waiting on any single response, but it badly degrades real-time chat. Smaller, quantized, or purpose-built fast models (like Gemini Flash) optimize for low TTFT at the cost of some quality.
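As a rough illustration of how TTFT differs from total response time, the sketch below times any iterable of tokens. The names `measure_ttft` and `fake_stream` are hypothetical, and the simulated stream stands in for a real streaming client, which is assumed here to expose its output as a Python iterator.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_seconds, total_seconds, token_count) for a token stream."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            # The first token has arrived: this is the latency (TTFT).
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    if ttft is None:
        raise ValueError("stream produced no tokens")
    return ttft, total, count


def fake_stream(first_delay: float, per_token_delay: float, n: int) -> Iterator[str]:
    """Hypothetical stand-in for a model: waits, then yields n tokens."""
    time.sleep(first_delay)  # simulates queueing and prompt processing
    for i in range(n):
        if i > 0:
            time.sleep(per_token_delay)  # simulates per-token generation
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, total, count = measure_ttft(fake_stream(0.05, 0.01, 5))
    print(f"TTFT: {ttft:.3f}s, total: {total:.3f}s, tokens: {count}")
```

Because `fake_stream` is a generator, its initial delay only starts once iteration begins, so the timer in `measure_ttft` covers the full wait for the first token. The exact timings printed will vary from run to run.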
Related Terms
Throughput: The number of tokens or requests a model can process per second.
Streaming: Delivering model output token-by-token as it is generated rather than waiting for the full response.
Inference: The process of running a trained model to generate outputs from new inputs.
Quantization: Compressing model weights to lower numerical precision to reduce memory use and speed up inference.