Core Concepts

Latency

The time between sending a request and receiving the first token of a response.

Latency, in this sense often called Time to First Token (TTFT), is the delay before output begins; it does not include the time needed to generate the rest of the response. It depends on model size, server load, and prompt length. High latency is acceptable for batch workflows, where no one is waiting on each response, but it makes real-time chat feel unresponsive. Smaller, quantized, or purpose-built fast models (like Gemini Flash) optimize for low TTFT at the cost of some quality.
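
One way to make the definition concrete is to time a streamed response. The sketch below is illustrative only: `measure_ttft` and `fake_stream` are hypothetical names, and `fake_stream` merely simulates a streaming model with sleeps rather than calling any real API. With a real client, the timer would need to start just before the request is sent.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[Optional[float], float]:
    """Return (seconds until the first token, seconds until the stream ends).

    The first value is None if the stream yields no tokens.
    """
    start = time.perf_counter()
    ttft: Optional[float] = None
    for _ in stream:
        if ttft is None:
            # The first token has just arrived: this is the TTFT.
            ttft = time.perf_counter() - start
    total = time.perf_counter() - start
    return ttft, total


def fake_stream(first_delay: float, per_token_delay: float, n_tokens: int) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response (assumption)."""
    time.sleep(first_delay)  # simulated queueing and prompt processing
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_delay)  # simulated per-token generation time
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, total = measure_ttft(fake_stream(0.05, 0.01, 5))
    print(f"TTFT: {ttft:.3f}s, total: {total:.3f}s")
```

Reporting both numbers separates the two things a user notices: how long before anything appears (TTFT) and how long until the response is complete.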

Related Terms