machine-learning attention

Definition

Multi-Head Attention

Multi-head attention is a method that combines multiple heads (parallel attention layers). The key ideas, illustrated in the code sketch below, are:

  1. Different focuses: Each head learns to attend to a different type of relationship in the data.
  2. Concatenation: The outputs of all heads are concatenated and linearly transformed into a single, richer representation.
  3. Capacity: By combining multiple perspectives, the model gains robustness and the ability to capture complex patterns.

For h heads (the standard formulation):

  MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O
  where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
  and Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
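
The following is a minimal NumPy sketch of this computation. The function names, the random initialization of the projection matrices, and the dimension choices are illustrative assumptions; a real implementation would learn the projections and typically add batching and masking.

    # Minimal multi-head attention sketch (assumptions: random projections
    # instead of learned weights, no masking, no batching).
    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax along the given axis.
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(q, k, v):
        # q, k, v: (seq_len, d_k) for a single head.
        d_k = q.shape[-1]
        scores = q @ k.T / np.sqrt(d_k)      # (seq_len, seq_len) similarity scores
        weights = softmax(scores, axis=-1)   # attention weights per query position
        return weights @ v                   # weighted sum of the values

    def multi_head_attention(x, num_heads, rng):
        # x: (seq_len, d_model); d_model must be divisible by num_heads.
        seq_len, d_model = x.shape
        d_k = d_model // num_heads

        # Per-head projections and the output projection (random here,
        # learned parameters in a real model).
        w_q = rng.standard_normal((num_heads, d_model, d_k)) / np.sqrt(d_model)
        w_k = rng.standard_normal((num_heads, d_model, d_k)) / np.sqrt(d_model)
        w_v = rng.standard_normal((num_heads, d_model, d_k)) / np.sqrt(d_model)
        w_o = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

        # 1. Each head projects the input into its own query/key/value subspace
        #    and computes attention there (different focuses).
        heads = []
        for i in range(num_heads):
            q, k, v = x @ w_q[i], x @ w_k[i], x @ w_v[i]
            heads.append(scaled_dot_product_attention(q, k, v))

        # 2. Concatenate the head outputs and apply the output projection.
        concatenated = np.concatenate(heads, axis=-1)   # (seq_len, d_model)
        return concatenated @ w_o

    rng = np.random.default_rng(0)
    x = rng.standard_normal((5, 16))                    # 5 tokens, d_model = 16
    out = multi_head_attention(x, num_heads=4, rng=rng)
    print(out.shape)                                    # (5, 16)

In this example each of the 4 heads works in a 4-dimensional subspace; the concatenation followed by the output projection W_O returns the result to the full model dimension.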