Definition
Knuth–Morris–Pratt Algorithm
The Knuth–Morris–Pratt algorithm is an algorithm for the string matching problem that finds all occurrences of a pattern in a text in time. It precomputes the prefix function of and uses it to shift the pattern past characters that cannot match, avoiding the backtracking of the naive approach.
Intuition
KMP separates the work into two phases. The entire preparation phase is exactly the first step, compute_pi(P): it looks only at the pattern and builds the prefix table . After that, preparation is finished. The search phase then scans the text once, using only when a mismatch occurs.
The variable appears in both pieces of pseudocode, but it has different local roles. In compute_pi, is just the current pattern index whose value is being computed. In kmp_search, is the current matched-prefix length in the text scan.
During the search, is the current matched-prefix length: it stores how many initial characters of currently match the suffix of the text prefix already scanned. The text index only moves to the right.
On a mismatch, KMP already knows something useful. Suppose : the first five characters of have just matched the last five scanned text characters. The next comparison fails, so does not match the current text character .
KMP cannot keep the whole matched prefix , because the next character failed. But it can keep any suffix of that is also a prefix of . The longest such suffix is , so .
After the fallback, the text index has not moved. The known suffix is reused as the new matched prefix of . Now KMP compares the same text character with the new candidate .
When , the pattern is a suffix of the scanned text prefix. The algorithm reports the start index, then falls back once more so overlapping matches can still be found.
Pseudocode
int[] compute_pi(String P)
int n := |P|
int[] pi := new int[n]
pi[0] := 0
int k := 0
for q := 1 to n-1:
while k > 0 and P[k] != P[q]:
k := pi[k-1]
if P[k] == P[q]:
k := k + 1
pi[q] := k
return pi
List<int> kmp_search(String T, String P)
int n := |T|
int m := |P|
int[] pi := compute_pi(P)
List<int> result := []
int q := 0
for i := 0 to n-1:
while q > 0 and P[q] != T[i]:
q := pi[q-1]
if P[q] == T[i]:
q := q + 1
if q == m:
result.append(i - m + 1)
q := pi[q-1]
return result
Correctness
For , interpret as the empty string .
Correctness of KMP search
The KMP search procedure reports exactly the shifts at which occurs in .
Proof sketch
Say that is good if is a suffix of . Thus is always good, because it represents the empty string.
The loop invariant is:
after the iteration that processes .
Two facts justify the fallback step:
- If is good and , then is good.
- If is good, then is good, and no length strictly between and is good.
Fact 1 says that extending a suffix match by the current text character requires the previous prefix to have matched one character earlier. Fact 2 says that, after a mismatch, the longest possible remaining candidate is exactly the maximal border given by the prefix function.
The invariant follows by induction over . The base case is immediate. For the inductive step, assume that before processing , is the maximum length such that is a suffix of . While , fact 2 allows the algorithm to replace by without skipping any possible longer match. When the while loop stops, either or . In the latter case, fact 1 shows that incrementing gives the longest suffix match ending at .
Therefore, after each iteration, is the maximum length of a prefix of that is a suffix of the text prefix seen so far. If , then all of is a suffix of , so an occurrence starts at shift . If , then is not a suffix of . Hence the algorithm reports all and only valid occurrences.