Definition
Multi-Head Attention
Multi-head attention is a mechanism that runs multiple heads (parallel attention layers) side by side and combines their outputs:
- Different focuses: Each head learns to focus on different types of relationships in the data.
- Concatenation: The outputs of all heads are concatenated and linearly transformed into a single, unified representation.
- Capacity: By leveraging multiple perspectives, the model gains robustness and the ability to handle complex patterns.
For $h$ heads:

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)
$$
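A minimal NumPy sketch of this computation, assuming a single input sequence and equal-sized heads; the function names, dimensions, and random weights below are illustrative assumptions rather than any particular library's API:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention, applied independently per head.
    d_head = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    return weights @ V                                    # (heads, seq, d_head)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # X: (seq_len, d_model); projection matrices: (d_model, d_model)
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

    # Each head attends over its own subspace ("different focuses").
    heads = attention(Q, K, V)                            # (heads, seq, d_head)

    # Concatenate the heads and apply the output projection W_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

# Example usage with random weights (illustrative only).
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)  # (5, 16)
```

Splitting the model dimension into `num_heads` subspaces keeps the total cost comparable to a single full-width attention layer while letting each head specialize.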