Definition
Byte Pair Encoding
Byte pair encoding is an algorithm for encoding strings or text into smaller strings by creating and using a translation table.
Algorithm
Counting Pairs: For a given text corpus, count all adjacent pairs of symbol in every word. If we denote a word as a sequence of symbols
then each pair is counted over the entire corpus.
Merging: Identify the most frequent pair, say . This pair is merged into a new token, which can be denoted as . Mathematically, if is the frequency of the pair, choose the pair with the maximum frequency:
Replace every occurrence of followed by in the corpus with the new token .
Repeat: Repeat the counting and merging steps until a predefined vocabulary size is reached or no pair occurs more than once.