Knuth-Morris-Pratt Algorithm

Definition

Knuth–Morris–Pratt Algorithm

The Knuth–Morris–Pratt algorithm solves the string matching problem: it finds all occurrences of a pattern $P$ in a text $T$ in $O(|P| + |T|)$ time. It precomputes the prefix function of $P$ and uses it to shift the pattern past characters that cannot match, avoiding the backtracking of the naive approach.

Intuition

KMP separates the work into two phases. The preparation phase is the single call compute_pi(P): it looks only at the pattern $P$ and builds the prefix table $π$. The search phase then scans the text $T$ once, consulting $π$ only to fall back after a mismatch or a reported match.

The variable $q$ appears in both pieces of pseudocode, but with different roles. In compute_pi, $q$ is the pattern index whose value $π(q)$ is being computed. In kmp_search, $q$ is the current matched-prefix length.

During the search, $q$ stores how many initial characters of $P$ currently match a suffix of the text scanned so far. The text index $i$ only moves to the right.

On a mismatch, KMP already knows something useful. Suppose $P$ begins with $ababa$ and $q = 5$: the first five characters of $P$ have just matched the last five scanned text characters. The next comparison fails, so $P[5]$ does not match the current text character $T[i]$.

KMP cannot keep the whole matched prefix $ababa$, because the next character failed. But it can keep any proper suffix of $ababa$ that is also a prefix of $P$. The longest such suffix is $aba$, so $π(4) = 3$ and $q$ falls back from 5 to 3.

After the fallback, the text index $i$ has not moved. The known suffix $aba$ is reused as the new matched prefix of $P$. Now KMP compares the same text character $T[i] = b$ with the new candidate $P[3] = b$, which matches, so $q$ becomes 4.
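The fallback in this example can be reproduced with a small sketch. It assumes a concrete pattern `ababac` (the notes only fix the first five characters, so the final `c` is a hypothetical choice) and computes the prefix table directly from its definition, which is slower than compute_pi but easy to check by hand.

```python
def prefix_function_bruteforce(p: str) -> list[int]:
    """pi[q] = length of the longest proper prefix of p[:q+1] that is also its suffix."""
    pi = []
    for q in range(len(p)):
        s = p[: q + 1]
        best = 0
        # Try proper border lengths from the longest down to 1.
        for k in range(len(s) - 1, 0, -1):
            if s[:k] == s[-k:]:
                best = k
                break
        pi.append(best)
    return pi


pattern = "ababac"  # assumption: only the prefix "ababa" comes from the example
pi = prefix_function_bruteforce(pattern)
print(pi)  # [0, 0, 1, 2, 3, 0]

# One search step: five characters matched, the current text character is "b".
q, ch = 5, "b"
while q > 0 and pattern[q] != ch:
    q = pi[q - 1]  # 5 -> pi[4] = 3
if pattern[q] == ch:
    q += 1  # pattern[3] == "b", so the match extends
print(q)  # 4
```

The text index never appears in the step above, which mirrors the point that only $q$ changes during a fallback.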

When $q = |P|$, the pattern is a suffix of the scanned text prefix. The algorithm reports the start index $i - |P| + 1$, then falls back once more to $π(|P| - 1)$ so overlapping matches can still be found.
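The linear running time can be sketched with a standard amortized argument (not spelled out in these notes): $q$ grows by at most one per text character, and every fallback strictly decreases $q$ because $π(q - 1) < q$. Since $q$ never drops below zero, the fallbacks cannot outnumber the increments:

$$\#\text{fallbacks} \le \#\text{increments} \le |T|$$

The same reasoning applied to the border length tracked inside compute_pi bounds the preparation phase by $O(|P|)$, giving $O(|P| + |T|)$ in total.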

Pseudocode

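int[] compute_pi(String P)
    int n := |P|
    int[] pi := new int[n]
    pi[0] := 0                      // a single character has no proper border
    int k := 0                      // length of the current border
    for q := 1 to n-1:
        while k > 0 and P[k] != P[q]:
            k := pi[k-1]            // fall back to the next shorter border
        if P[k] == P[q]:
            k := k + 1              // border extends by one character
        pi[q] := k
    return pi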

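List<int> kmp_search(String T, String P)
    int n := |T|
    int m := |P|
    int[] pi := compute_pi(P)
    List<int> result := []
    int q := 0                      // number of pattern characters currently matched
    for i := 0 to n-1:
        while q > 0 and P[q] != T[i]:
            q := pi[q-1]            // mismatch: fall back to the longest shorter candidate
        if P[q] == T[i]:
            q := q + 1
        if q == m:
            result.append(i - m + 1)    // occurrence starts at i - m + 1
            q := pi[q-1]            // keep the longest border so overlaps are found
    return result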
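As a runnable counterpart, the pseudocode above can be translated into Python roughly as follows. This is a sketch, not a reference implementation: the early return for an empty pattern is an added assumption, since the pseudocode indexes `pi[0]` and `P[q]` and so implicitly requires a non-empty pattern.

```python
def compute_pi(p: str) -> list[int]:
    """Prefix table: pi[q] is the longest proper border of p[:q+1]."""
    pi = [0] * len(p)
    k = 0
    for q in range(1, len(p)):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_search(t: str, p: str) -> list[int]:
    """Return every start index at which p occurs in t."""
    if not p:
        return []  # assumption: the empty pattern is not covered by the pseudocode
    m = len(p)
    pi = compute_pi(p)
    result = []
    q = 0  # number of pattern characters currently matched
    for i, ch in enumerate(t):
        while q > 0 and p[q] != ch:
            q = pi[q - 1]
        if p[q] == ch:
            q += 1
        if q == m:
            result.append(i - m + 1)
            q = pi[q - 1]  # fall back so overlapping matches are found
    return result


print(kmp_search("abababa", "aba"))  # [0, 2, 4]
print(kmp_search("aaaa", "aa"))      # [0, 1, 2]
```

Both calls return overlapping occurrences, which is the effect of the extra fallback after a reported match.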

Correctness

For $a > b$, interpret $P[a..b]$ as the empty string $ε$.

Correctness of KMP search

The KMP search procedure reports exactly the shifts at which $P$ occurs in $T$ .

Proof sketch

Say that $(ℓ, i)$ is good if $P[0..ℓ-1]$ is a suffix of $T[0..i]$. Thus $(0, i)$ is always good, because the empty string is a suffix of every string.

The loop invariant is:
$q = \max \{ ℓ \mid (ℓ, i) \text{ is good} \}$
after the iteration that processes $T [i]$ .

Two facts justify the fallback step:

Fact 1: If $(ℓ, i)$ is good and $ℓ > 0$, then $(ℓ-1, i-1)$ is good.

Fact 2: If $(ℓ, i)$ is good and $ℓ > 0$, then $(π(ℓ-1), i)$ is good, and no length strictly between $π(ℓ-1)$ and $ℓ$ is good.

Fact 1 says that extending a suffix match by the current text character requires the previous prefix to have matched one character earlier. Fact 2 says that, after a mismatch, the longest possible remaining candidate is exactly the maximal border given by the prefix function.

The invariant follows by induction over $i$. The base case is immediate. For the inductive step, assume that before processing $T[i]$, $q$ is the maximum length such that $P[0..q-1]$ is a suffix of $T[0..i-1]$. While $q > 0$ and $P[q] \neq T[i]$, fact 2 allows the algorithm to replace $q$ by $π(q-1)$ without skipping any possible longer match. When the while loop stops, either $q = 0$ or $P[q] = T[i]$. In the latter case, fact 1 shows that incrementing $q$ gives the longest suffix match ending at $i$. If instead $q = 0$ and $P[0] \neq T[i]$, no nonempty prefix of $P$ ends at $i$, and $q$ correctly stays 0.

Therefore, after each iteration, $q$ is the maximum length of a prefix of $P$ that is a suffix of the text prefix seen so far. If $q = |P|$, then all of $P$ is a suffix of $T[0..i]$, so an occurrence starts at shift $i - |P| + 1$. If $q < |P|$, then $P$ is not a suffix of $T[0..i]$. Hence the algorithm reports all and only valid occurrences.
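The invariant can also be checked empirically. The sketch below compares $q$ after each iteration with a brute-force maximum over all good lengths, for every short text and non-empty pattern over the alphabet {a, b}. The helper names `longest_good` and `check_invariant` are hypothetical, and compute_pi is repeated only so the block runs on its own. Passing this test illustrates the proof sketch; it does not replace it.

```python
from itertools import product


def compute_pi(p: str) -> list[int]:
    pi = [0] * len(p)
    k = 0
    for q in range(1, len(p)):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi


def longest_good(t: str, p: str, i: int) -> int:
    """Brute force: the largest l such that p[:l] is a suffix of t[:i+1]."""
    scanned = t[: i + 1]
    limit = min(len(p), len(scanned))
    return max(l for l in range(limit + 1) if scanned.endswith(p[:l]))


def check_invariant(t: str, p: str) -> bool:
    """Run the search loop and compare q with the brute-force maximum."""
    pi = compute_pi(p)
    q = 0
    for i, ch in enumerate(t):
        while q > 0 and p[q] != ch:
            q = pi[q - 1]
        if p[q] == ch:
            q += 1
        if q != longest_good(t, p, i):
            return False
        if q == len(p):
            q = pi[q - 1]  # same fallback as after a reported match
    return True


ok = all(
    check_invariant("".join(t), "".join(p))
    for n in range(7)
    for t in product("ab", repeat=n)
    for m in range(1, 4)
    for p in product("ab", repeat=m)
)
print(ok)  # True
```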

Lukas' Notes
