Core Concepts
Throughput
The number of tokens or requests a model can process per second.
Throughput measures how many tokens an inference endpoint generates per second across all concurrent users; it can also be expressed as requests completed per second. High throughput is critical for batch processing jobs and high-traffic applications. It is often inversely correlated with model size: smaller models typically achieve higher throughput on the same hardware.
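As a rough illustration of the definition above, aggregate throughput can be computed by summing the tokens generated across all requests that completed in a measurement window and dividing by the window length. The sketch below is a minimal example under stated assumptions: the `CompletedRequest` type, the `aggregate_throughput` helper, and the sample numbers are all hypothetical and do not come from any particular API.

```python
from dataclasses import dataclass


@dataclass
class CompletedRequest:
    """Hypothetical record of one finished request."""
    output_tokens: int  # tokens generated for this request


def aggregate_throughput(requests, window_seconds):
    """Return tokens generated per second across all requests in the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    total_tokens = sum(r.output_tokens for r in requests)
    return total_tokens / window_seconds


# Made-up numbers: three concurrent requests that finished within a 2-second window.
window_seconds = 2.0
requests = [CompletedRequest(120), CompletedRequest(80), CompletedRequest(200)]

tokens_per_second = aggregate_throughput(requests, window_seconds)
requests_per_second = len(requests) / window_seconds
```

Note that this is a whole-endpoint figure: a single user may see far fewer tokens per second than the aggregate when many requests share the same hardware.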
Related Terms
Latency
The time between sending a request and receiving the first token of a response, often reported as time to first token (TTFT).
Batch Processing
Submitting requests asynchronously in bulk for a 50% price discount.
Inference
The process of running a trained model to generate outputs from new inputs.
Quantization
Compressing model weights to lower numerical precision to reduce memory and speed up inference.
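To make the quantization entry concrete, here is a minimal sketch of one simple scheme, symmetric per-tensor quantization to 8-bit integers. It is only an illustration: production systems typically use more elaborate methods, and the function names and sample weights below are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    # Fall back to a scale of 1.0 when every weight is zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer representation."""
    return [q * scale for q in quantized]


# Made-up weights for illustration only.
weights = [0.4, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each weight is stored as a small integer plus one shared scale, which is where the memory saving comes from; the cost is a rounding error of at most half a scale step per weight.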