Core Concepts

Latency

The time between sending a request and receiving the first token of a response.

In LLM serving, latency is most often reported as Time to First Token (TTFT): the delay before output begins. It depends on model size, server load, and prompt length, since the whole prompt must be processed before the first token can be generated. High latency is acceptable for batch workflows but catastrophic for real-time chat. Smaller, quantized, or purpose-built fast models (like Gemini Flash) optimize for low TTFT at the cost of some quality.
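
To make the definition concrete, here is a minimal sketch of how TTFT might be measured on a streaming response. It assumes the model output arrives as a Python iterator of tokens. The `fake_token_stream` function and its delay parameters are hypothetical stand-ins for a real streaming client, not any particular provider's API.

```python
import time
from typing import Iterable, Iterator, List, Optional, Tuple


def fake_token_stream(
    tokens: List[str],
    first_token_delay: float = 0.05,
    per_token_delay: float = 0.01,
) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model API.

    Waits first_token_delay seconds before the first token (simulating
    queueing and prompt processing), then per_token_delay between tokens.
    """
    time.sleep(first_token_delay)
    for i, tok in enumerate(tokens):
        if i > 0:
            time.sleep(per_token_delay)
        yield tok


def measure_stream(
    stream: Iterable[str],
) -> Tuple[Optional[float], float, List[str]]:
    """Consume a token stream and return (ttft, total_time, tokens).

    ttft is the time in seconds from the start of consumption to the first
    token, or None if the stream produced no tokens.
    """
    start = time.perf_counter()
    ttft: Optional[float] = None
    tokens: List[str] = []
    for tok in stream:
        if ttft is None:
            # The first token has arrived: this elapsed time is the TTFT.
            ttft = time.perf_counter() - start
        tokens.append(tok)
    total = time.perf_counter() - start
    return ttft, total, tokens


if __name__ == "__main__":
    ttft, total, tokens = measure_stream(
        fake_token_stream(["Hello", ",", " world", "!"])
    )
    print(f"TTFT: {ttft:.3f}s, total: {total:.3f}s, tokens: {len(tokens)}")
```

With a real client, the timer would start just before the request is sent so that network and queueing time are included. Dividing the token count by the total time gives a rough single-request throughput figure.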

Related Terms

Throughput

The number of tokens or requests a model can process per second.

Streaming

Delivering model output token-by-token as it is generated rather than waiting for the full response.

Inference

The process of running a trained model to generate outputs from new inputs.

Quantization

Compressing model weights to lower numerical precision (for example, from 16-bit floats to 8-bit integers) in order to reduce memory use and speed up inference.
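
As a rough illustration of the idea, the sketch below applies symmetric 8-bit quantization to a short list of weights using a single shared scale. This is one simple scheme chosen for illustration; production systems typically use per-channel or per-group scales and more elaborate methods. The function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map float weights to integers in [-127, 127] with one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Choose the scale so the largest-magnitude weight maps to +/-127.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    for w, r in zip(weights, restored):
        print(f"original={w:+.4f} restored={r:+.4f} error={abs(w - r):.5f}")
```

Each weight is stored in one byte instead of two or four, and the rounding error per weight is at most about half the scale. That error is the source of the quality trade-off.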