machine-learning deep-learning nlp
Definition
Transformer
A Transformer is a deep learning architecture based on the self-attention mechanism, designed to process sequential data without recurrent or convolutional layers. Formally, each layer maps an input sequence of embeddings to an output sequence of the same length by weighting every element against every other element using Scaled Dot-Product Attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$ (Query), $K$ (Key), and $V$ (Value) are linear projections of the input embeddings, and $d_k$ is the dimensionality of the keys.
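The formula above can be sketched directly in NumPy. This is a minimal single-head illustration, not a full implementation: the projection matrices `Wq`, `Wk`, `Wv` are random placeholders standing in for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

# toy example: 3 tokens, model dimension 4
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
# hypothetical projections (learned in a real model)
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)  # (3, 4): same sequence length as the input
```

Note that every token attends to every other token in a single matrix multiplication, which is what makes the parallelisation described below possible.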
Core Innovations
Parallelisation: Unlike recurrent neural networks, which process tokens sequentially, the transformer’s attention mechanism allows for the parallel processing of the entire sequence, significantly reducing training time for long documents.
Global Context: Self-attention enables the model to capture long-range dependencies regardless of their distance in the sequence, mitigating the information bottleneck found in traditional sequence models.
Architecture: The original model consists of stacked encoder and decoder layers; this design is the foundation for state-of-the-art models such as BERT (encoder-only) and the GPT series (decoder-only).