Pumping lemma

In the theory of formal languages, a pumping lemma is a statement that any language of a given class can be "pumped". A language can be pumped if any sufficiently long string in the language can be broken into pieces and these pieces can be repeated to produce an even longer string in the language. Thus, if there is a pumping lemma for a given language class, any language in the class will contain an infinite set of strings all produced by a simple rule given by the pumping lemma.

The two most important examples are the pumping lemma for regular languages and the pumping lemma for context-free languages. These are called 'lemmas' rather than 'theorems' not because they are unimportant, but because their primary use is to generate quick proofs that a particular language is not in the language class. However, they cannot be used to prove the statement that a language is in the class; satisfying the pumping lemma is a necessary, but not sufficient, condition.

Someone familiar with the pumping lemma for regular languages can see immediately that a language that allows parenthesized expressions, but requires the parentheses to balance, can not be a regular language, and so the language can not be generated by a regular grammar nor recognized by a finite state machine. Trying to compose such a proof directly would take a significant length of time.

Pumping lemma for regular languages

If a language L is regular, then there exists a number m > 0 such that every string w in L with length |w| ≥ m can be written in the form
w = xyz
with strings x, y and z such that |xy| ≤ m, |y| ≥ 1 and
xy^kz is in L for every k≥0.

Using this lemma, one can for instance prove that the language L = {aⁿbⁿ : n ≥ 0} over the alphabet Σ = {a, b} is not regular. For if it were, we could pick m as in the pumping lemma. The string w = a^mb^m is in L, and the pumping lemma therefore yields a decomposition w = xyz with |xy|≤ m, |y|≥ 1 and xy^kz in L for every k≥0. But then y has to consist of a non-zero number of a's, and xy²z has more a's than b's and is therefore not in L, a contradiction.

The proof that the language of balanced (i.e. properly nested) parentheses is not regular follows the same idea. Given m, there is a string of balanced parentheses that begins with more than m left parentheses, so that y will consist entirely of left parentheses. By repeating y, we can produce a string that does not even contain the same number of left and right parentheses, and so they can not be balanced.

The proof idea for the pumping lemma itself is as follows: the regular language is accepted by a certain deterministic finite acceptor; every string that's longer than the number m of states of that acceptor will revisit a certain state, thereby causing a loop which can be repeated; the loop corresponds to the string y.

Note that the pumping lemma does not give a sufficient condition for a language to be regular: for example, the language {aⁿb^m : n ≥ m} is not regular but can still be "pumped." For a practical test that exactly characterizes regular languages, see the Myhill-Nerode Theorem.

Pumping lemma for context-free languages

If a language L is context-free, then there exists some number m > 0 such that any string w in L with |w| ≥ m can be written as
w = uvxyz
with strings u, v, x, y and z, such that |vxy| ≤ m, |vy| ≥ 1, and
uv^kxy^kz is in L for every k ≥ 0.

A typical application of this pumping lemma is the proof that the language L = {aⁿbⁿcⁿ : n ≥ 0} over the alphabet Σ = {a, b, c} is not context-free. Indeed, if this language were context-free, then by the pumping lemma we could pick n greater than m and produce a string so long that when it is written as uvxyz, since vxy is short it can contain only two of the three letters. Then by pumping we produce a string which can not be in the given language.