Core Concepts
Streaming
Delivering model output token-by-token as it is generated rather than waiting for the full response.
Streaming typically uses server-sent events (SSE) to push each token to the client as soon as it is produced. This reduces perceived latency: users see text appear almost immediately instead of waiting for the full completion. Streaming does not change token cost, but it is essential for conversational UIs and real-time applications.
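To make the SSE mechanism concrete, here is a minimal sketch of how a client might pull text out of an event stream. It assumes each event carries a plain-text fragment in its `data:` field; real APIs often wrap the fragment in JSON, so the payload format here is an assumption. The helper name `iter_sse_data` is hypothetical and not part of any library.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event in a stream of lines."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line marks the end of the current event.
            if data_parts:
                yield "\n".join(data_parts)
                data_parts = []
        elif line.startswith(":"):
            # Lines starting with a colon are comments (often keep-alives).
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format drops a single leading space after the colon.
            data_parts.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.


# Hypothetical stream: two events, each carrying a text fragment.
sample = [
    "data: Hel\n",
    "\n",
    ": keep-alive\n",
    "data: lo\n",
    "\n",
]

text = "".join(iter_sse_data(sample))
print(text)  # prints "Hello"
```

In a real client the `lines` argument would come from an open HTTP response read line by line, so each fragment can be rendered the moment its event arrives.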
Related Terms
Latency
The time between sending a request and receiving a response. For streamed output it is usually measured as the time to the first token.
Throughput
The number of tokens or requests a model can process per second.
Completion
The text output generated by a language model in response to a prompt.
Inference
The process of running a trained model to generate outputs from new inputs.
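The latency and throughput definitions above can be measured directly on a token stream. The sketch below is one possible approach, not a standard tool: `measure_stream` and `fake_stream` are hypothetical names, and the simulated delays are arbitrary values chosen for illustration.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (time to first token in seconds, tokens per second, token count)."""
    start = clock()
    first_token_at = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            # Record the arrival time of the first token only.
            first_token_at = clock()
        count += 1
    end = clock()
    ttft = first_token_at - start if first_token_at is not None else float("nan")
    elapsed = end - start
    throughput = count / elapsed if elapsed > 0 else 0.0
    return ttft, throughput, count


def fake_stream() -> Iterator[str]:
    """Simulated model output; the sleep stands in for generation time."""
    for token in ["Stream", "ing", " works", "."]:
        time.sleep(0.01)
        yield token


ttft, tokens_per_second, n = measure_stream(fake_stream())
print(f"first token after {ttft:.3f}s, {tokens_per_second:.1f} tokens/s over {n} tokens")
```

Passing the clock in as a parameter keeps the measurement testable with a fake clock. Throughput here is computed per request on the client side, which is narrower than a server-wide tokens-per-second figure.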