Multimodal
A model that can process and generate multiple types of data, such as text and images.
Multimodal models accept inputs beyond text — images, audio, video, or documents — and integrate them into a unified representation. GPT-4o, Gemini, and Claude 3 are multimodal: you can send an image alongside text and the model reasons across both. Multimodal inference typically costs more than text-only due to the additional tokens consumed by vision encoding.
Términos Relacionados
Large Language Model — a neural network trained on vast text corpora to generate human-like text.
The basic unit of text that language models process and are billed by.
A large pre-trained model that serves as the base for many downstream applications.
The process of running a trained model to generate outputs from new inputs.