large-language-models

Definition

Byte Pair Encoding

Byte pair encoding is an algorithm for encoding strings or text into smaller strings by creating and using a translation table.

Algorithm

Counting Pairs: For a given text corpus, count all adjacent pairs of symbol in every word. If we denote a word as a sequence of symbols

then each pair is counted over the entire corpus.

Merging: Identify the most frequent pair, say . This pair is merged into a new token, which can be denoted as . Mathematically, if is the frequency of the pair, choose the pair with the maximum frequency:

Replace every occurrence of followed by in the corpus with the new token .

Repeat: Repeat the counting and merging steps until a predefined vocabulary size is reached or no pair occurs more than once.