Core Concepts

Throughput

The number of tokens or requests a model can process per second.

Throughput measures how many tokens an inference endpoint can generate per second across all concurrent users. High throughput is critical for batch processing jobs and high-traffic applications. It is often inversely correlated with model size: on the same hardware, smaller models generally achieve higher throughput.
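As a rough illustration of the definition above, aggregate throughput can be estimated by dividing the total number of generated tokens by the wall-clock window that covers all concurrent requests. The sketch below assumes a hypothetical `CompletedRequest` record with start time, end time, and output token count; these names and the sample numbers are illustrative and do not come from any particular serving stack.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CompletedRequest:
    """Hypothetical record of one finished request (times in seconds)."""
    start_s: float
    end_s: float
    output_tokens: int


def aggregate_throughput(requests: Sequence[CompletedRequest]) -> float:
    """Tokens generated per second across all requests.

    The window runs from the earliest start to the latest end, so
    overlapping (concurrent) requests are counted against shared time.
    """
    if not requests:
        return 0.0
    window_start = min(r.start_s for r in requests)
    window_end = max(r.end_s for r in requests)
    elapsed = window_end - window_start
    if elapsed <= 0:
        raise ValueError("measurement window must have positive duration")
    return sum(r.output_tokens for r in requests) / elapsed


# Three overlapping requests: 400 tokens in total over a 4-second window.
requests = [
    CompletedRequest(start_s=0.0, end_s=2.0, output_tokens=100),
    CompletedRequest(start_s=0.5, end_s=2.5, output_tokens=120),
    CompletedRequest(start_s=1.0, end_s=4.0, output_tokens=180),
]
tokens_per_second = aggregate_throughput(requests)
print(tokens_per_second)
```

Measuring over the shared window, rather than averaging per-request speeds, is what makes this an endpoint-level figure instead of a per-user one.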

Related Terms

Latency

The time between sending a request and receiving the first token of a response.
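A minimal sketch of how this definition might be measured against a streaming response: start a timer, begin consuming the stream, and stop at the first token. The `time_to_first_token` helper and the `fake_stream` generator are hypothetical stand-ins for a real streaming client, which is not specified here.

```python
import time
from typing import Callable, Iterable, Iterator


def time_to_first_token(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Seconds from starting to consume the stream until the first token."""
    start = clock()
    for _ in stream:
        # Return as soon as the first token arrives; later tokens are ignored.
        return clock() - start
    raise ValueError("stream produced no tokens")


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response."""
    time.sleep(0.05)  # simulated queueing and prompt-processing delay
    yield "Hello"
    yield " world"


latency_s = time_to_first_token(fake_stream())
print(f"time to first token: {latency_s:.3f} s")
```

The clock is passed in as a parameter so the measurement can be checked deterministically with a fake clock.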

Batch Processing

Submitting requests asynchronously in bulk for a 50% price discount.

Inference

The process by which a trained AI model produces outputs for new inputs.

Quantization

Compressing model weights to lower numerical precision to reduce memory use and speed up inference.
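To make the idea concrete, here is a minimal sketch of one common scheme, symmetric per-tensor int8 quantization, written in plain Python. It is an illustrative assumption rather than the method any specific model or library uses; production systems typically work on tensors and often use per-channel or per-group scales.

```python
from typing import List, Sequence, Tuple


def quantize_int8(weights: Sequence[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: Sequence[int], scale: float) -> List[float]:
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.03, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

print(quantized)
print(restored)
```

Each weight is stored in 8 bits instead of 32, at the cost of a rounding error of at most half the scale per value.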