machine-learning transformers attention

Definition

Scaled Dot-Product Attention

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where:

  • $Q$ is the query vector representing what the current token is searching for.
  • $K$ is the key vector representing what each token contains.
  • $V$ is the value vector representing the actual content to be aggregated or propagated.
  • $d_k$ is the dimensionality of the key vectors, used to scale the dot products.
  • $\text{softmax}$ is the softmax function, applied row-wise so each row of weights sums to 1.

Note that this is usually computed as a batch across all tokens in the sequence for efficiency, in which case $Q$, $K$, and $V$ are matrices whose rows are the per-token vectors.
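
A minimal NumPy sketch of this batched computation (the sequence length and the dimensions $d_k$, $d_v$ below are arbitrary illustrative choices, not values from the source):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # (seq_len, d_v) weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 6, 4, 4             # assumed toy sizes
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_v))
print(scaled_dot_product_attention(Q, K, V).shape)  # (6, 4)
```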

Shape Visualisation

(Figure visualising the tensor shapes in the attention computation; taken from [1].)
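
Since the figure itself is not reproduced here, the following sketch traces the intermediate shapes instead (sequence length and dimensions are assumed toy values):

```python
import numpy as np

seq_len, d_k, d_v = 6, 4, 4
Q = np.zeros((seq_len, d_k))
K = np.zeros((seq_len, d_k))
V = np.zeros((seq_len, d_v))

scores = Q @ K.T / np.sqrt(d_k)
print(scores.shape)    # (6, 6): one score per (query token, key token) pair

weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
print(weights.shape)   # (6, 6): softmax leaves the shape unchanged, rows sum to 1

out = weights @ V
print(out.shape)       # (6, 4): one d_v-dimensional output per token
```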

Footnotes

  1. https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html