Architektur

Multimodal

A model that can process and generate multiple types of data, such as text and images.

Multimodal models accept inputs beyond text — images, audio, video, or documents — and integrate them into a unified representation. GPT-4o, Gemini, and Claude 3 are multimodal: you can send an image alongside text and the model reasons across both. Multimodal inference typically costs more than text-only due to the additional tokens consumed by vision encoding.

Verwandte Begriffe