machine-learning transformers attention

Definition

Scaled Dot-Product Attention

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where:

  • $Q$ is the query vector representing what the current token is searching for.
  • $K$ is the key vector representing what each token contains.
  • $V$ is the value vector representing the actual content to be aggregated or propagated.
  • $d_k$ is the dimensionality of the key vectors, used to scale the dot products.
  • $\text{softmax}$ is the softmax function, applied row-wise so each row of weights sums to 1.

Note that this is usually computed as a batch across all tokens in the sequence for efficiency, in which case $Q$, $K$, and $V$ are matrices whose rows are the per-token vectors.
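
A minimal NumPy sketch of this batched computation (the sequence length and the dimensions $d_k$, $d_v$ below are arbitrary illustrative choices, not values from the source):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # (seq_len, d_v) weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 6, 4, 4             # assumed toy sizes
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_v))
print(scaled_dot_product_attention(Q, K, V).shape)  # (6, 4)
```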

Shape Visualisation

(Figure visualising the tensor shapes in the attention computation; taken from [1].)
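
Since the figure itself is not reproduced here, the following sketch traces the intermediate shapes instead (sequence length and dimensions are assumed toy values):

```python
import numpy as np

seq_len, d_k, d_v = 6, 4, 4
Q = np.zeros((seq_len, d_k))
K = np.zeros((seq_len, d_k))
V = np.zeros((seq_len, d_v))

scores = Q @ K.T / np.sqrt(d_k)
print(scores.shape)    # (6, 6): one score per (query token, key token) pair

weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
print(weights.shape)   # (6, 6): softmax leaves the shape unchanged, rows sum to 1

out = weights @ V
print(out.shape)       # (6, 4): one d_v-dimensional output per token
```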

Footnotes

  1. https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html