Core Concepts

Streaming

Delivering model output token-by-token as it is generated rather than waiting for the full response.

Streaming typically uses server-sent events (SSE) to push each token to the client as soon as it is produced. This substantially reduces perceived latency for end users: they see text appear immediately instead of waiting for the full completion. Streaming does not change token cost, but it is essential for conversational UIs and real-time applications.
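
As an illustration of how a client might consume such a stream, here is a minimal sketch of an SSE parser in Python. It assumes each event carries one token as plain text in its `data:` field; real APIs often wrap tokens in JSON and may add their own end-of-stream marker. The function name `iter_sse_data` and the sample stream are hypothetical, and no network call is made.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event from an iterable of text lines."""
    data_parts: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event; emit it if it has data.
            if data_parts:
                yield "\n".join(data_parts)
                data_parts = []
        elif line.startswith(":"):
            # Comment line, often sent as a keep-alive; ignored.
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format drops a single leading space after the colon.
            if value.startswith(" "):
                value = value[1:]
            data_parts.append(value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.
    # An event not terminated by a blank line is discarded, as in the SSE format.


# Hypothetical stream: three events, one token each (assumed payload shape).
SAMPLE_STREAM = [
    ": keep-alive\n",
    "data: Hello\n",
    "\n",
    "data: ,\n",
    "\n",
    "data:  world\n",
    "\n",
]

tokens = list(iter_sse_data(SAMPLE_STREAM))
text = "".join(tokens)
print(text)
```

Because the parser is a generator, a UI can render each token as soon as its event arrives rather than waiting for the whole list; the `list(...)` call above is only there to keep the demonstration short.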

Related Terms

Latency

The time between sending a request and receiving the first token of a response.

Throughput

The number of tokens or requests a model can process per second.

Completion

The text output generated by a language model in response to a prompt.

Inference

The process by which a trained AI model produces output for new inputs.