Architektur

Attention Mechanism

The core operation in transformers that lets each token attend to all other tokens.

Self-attention computes a weighted sum of all token representations, where weights reflect how relevant each token is to each other. This gives transformers their ability to model long-range dependencies and contextual meaning. Multi-head attention runs this operation in parallel across many representation subspaces, enabling richer feature extraction.

Verwandte Begriffe