machine-learning deep-learning nlp
Definition
Transformer
A Transformer is a deep learning architecture based on the self-attention mechanism, designed to process sequential data without recurrent or convolutional layers. Formally, each layer maps an input sequence of embeddings to an output sequence of the same length by weighting every element against every other element using Scaled Dot-Product Attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$ (Query), $K$ (Key), and $V$ (Value) are linear projections of the input embeddings, and $d_k$ is the dimensionality of the keys.
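The formula above can be sketched directly in NumPy. This is a minimal single-head illustration, not a full implementation: the projection matrices `Wq`, `Wk`, `Wv` are random placeholders standing in for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

# toy example: 3 tokens, model dimension 4
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
# hypothetical projections (learned in a real model)
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)  # (3, 4): same sequence length as the input
```

Note that every token attends to every other token in a single matrix multiplication, which is what makes the parallelisation described below possible.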
Core Innovations
Parallelisation: Unlike recurrent neural networks, which process tokens sequentially, the transformer’s attention mechanism allows for the parallel processing of the entire sequence, significantly reducing training time for long documents.
Global Context: Self-attention enables the model to capture long-range dependencies regardless of their distance in the sequence, mitigating the information bottleneck found in traditional sequence models.
Architecture: The original model consists of stacked encoder and decoder layers; this design is the foundation for state-of-the-art models such as BERT (encoder-only) and the GPT series (decoder-only).