Regular expression

A regular expression (abbreviated as regexp or regex) is a string that describes a whole set of strings, according to certain syntax rules. These expressions are used by many text editors and utilities (especially in the Unix operating system) to search bodies of text for certain patterns and, for example, replace the found strings with a certain other string.

The origin of regular expressions lies in automata theory and formal language theory (both part of theoretical computer science). These fields study models of computation (automata) and ways to describe and classify formal languages. A formal language is nothing but a set of strings. In the 1940s, Warren McCulloch and Walter Pitts described the nervous system by modelling neurons as small simple automata. The mathematician, Stephen Kleene, later described these models using his mathematical notation called regular sets. Ken Thompson built this notation into the editor qed, then into the Unix editor ed and eventually into grep. Ever since that time, regular expressions have been widely used in Unix and Unix-like utilities such as: expr, awk, Emacs, vim, lex, and Perl. Most tools use an implementation of the regex library built by Henry Spencer.

Regular expressions correspond to the type 3 grammars of the Chomsky hierarchy and may be used to describe a regular language.

Table of contents

1 Regular expressions in Formal Language Theory
2 Regular expression syntaxes

2.1 Traditional Unix regexps
2.2 POSIX modern (extended) regexps
2.3 Perl Compatible Regular Expressions (PCRE)

3 External links

Regular expressions in Formal Language Theory

Regular expressions consist of constants and operators that denote sets of strings and operations over these sets, respectively. Given a finite alphabet Σ the following constants are defined:

(empty set) ∅ denoting the set ∅
(empty string) ε denoting the set {ε}
(literal character) a in Σ denoting the set {"a"}

and the following operations:

(concatenation) RS denoting the set { αβ | α in R and β in S }. For example {"ab", "c"}{"d", "ef"} = {"abd", "abef", "cd", "cef"}.
(set union) R U S denoting the set union of R and S.
(Kleene star) R* denoting the smallest superset of R that contains ε and is closed under string concatenation. This is the set of all strings that can be made by concatenating zero or more strings in R. For example, {"ab", "c"}* = {ε, "ab", "c", "abab", "abc", "cab", "cc", "ababab", ... }.

To avoid brackets it is assumed that the Kleene star has the highest priority, then concatenation and then set union. If there is no ambiguity then brackets may be omitted. For example, (ab)c is written as abc and a U (b(c*)) can be written as a U bc*.

Sometimes the complement operator ~ is added; ~R denotes the set of all strings over Σ that are not in R. In that case the resulting operators form a Kleene algebra. The complement operator is redundant: it can always be expressed by only using the other operators.

Examples:

a U b* denotes {"a", ε, "b", "bb", "bbb", ...}
(a U b)* denotes the set of all strings consisting of 'a's and 'b's, including the empty string
b*(ab*)* the same
ab*(c U ε) denotes the set of strings starting with 'a', then zero or more 'b's and finally optionally a 'c'.
( bb U a(bb)*aa U a(bb)*(ab U ba)(bb)*(ab U ba) )* denotes the set of all strings which contain an even number of 'b's and a number of 'a's divisible by three.

Regular expressions in this sense can express exactly the class of languages accepted by finite state automata: the regular languages. There is, however, a significant difference in compactness: some classes of regular languages can only be described by automata that grow exponentially in size, while the required regular expressions only grow linearly.

We can also study expressive power within the formalism. As the example shows, different regular expressions can express the same language: the formalism is redundant.

It is possible to write an algorithm which for two given regular expressions decides whether the described languages are equal - essentially, it reduces each expression to a minimal deterministic finite state automaton and determines whether they are isomorphic (equivalent).

To what extent can this redundancy be eliminated? Can we find an interesting subset of regular expressions that is still fully expressive? Kleene star and set union are obviously required, but perhaps we can restrict their use. This turns out to be a surprisingly difficult problem. As simple as the regular expressions are, it turns out there is no method to systematically rewrite them to some normal form. They are not finitely axiomatizable. So we have to resort to other methods. This leads to the star height problem.

Regular expression syntaxes

Traditional Unix regexps

The "basic" Unix regexp syntax is now defined as obsolete by POSIX, but is still widely used for the purposes of backwards compatibility. Most Unix utilities (grep, sed...) use it by default.

In this syntax, most characters are treated as literals - they match only themselves ("a" matches "a", "abc" matches "abc", etc). The exceptions are called metacharacters:

. Matches any single character

[ ] Matches a single character that is contained within the brackets - [abc] matches "a", "b", or "c". [a-z] matches any lowercase letter.

[^ ] Matches a single character that is not contained within the brackets - [^a-z] matches any single character that isn't a lowercase letter

^ Matches the start of the line

$ Matches the end of the line

\$ \$ Treat the expression enclosed within the brackets as a single block

* Match the last "block" zero or more times - "\$abc\$*" matches the empty string, or "abc", or "abcabc", and so on

\\{x,y\\} Match the last "block" at least x and not more than y times - "a\\{3,5\\}" matches "aaa", "aaaa" or "aaaaa".

There is no representation of the set union operator in this syntax.

Examples:

".at" matches any three-letter word ending with "at"

"[hc]at" matches "hat" and "cat"

"[^b]at" matches any three-letter word ending with "at" and not beginning with 'b'.

"^[hc]at" matches "hat" and "cat" but only at the beginning of a line

"[hc]at$" matches "hat" and "cat" but only at the end of a line

POSIX modern (extended) regexps

The more modern "extended" regexp can often be used with modern Unix utilities by including the command line flag "-E".

POSIX extended regexps are similar in syntax to the traditional Unix regexp, with some exceptions. The following metacharacters are added:

+ Match the last "block" one or more times - "ba+" matches "ba", "baa", "baaa" and so on

? Match the last "block" zero or one times - "ba?" matches "b" or "ba"

| The choice (or set union) operator: match either the expression before or the expression after the operator - "abc|def" matches "abc" or "def".

Also, backslashes are removed: \\{...\\} becomes {...} and \$...\$ becomes (...)

Examples:

"[hc]+at" matches with "hat", "cat", "hhat", "chat", "hcat", "ccat" et cetera

"[hc]?at" matches "hat", "cat" and "at"

"([cC]at | [dD]og)" matches "cat", "Cat", "dog" and "Dog"

Since the characters '(', ')', '[', ']', '.', '*', '?', '+', '^' and '$' are used as special symbols they have to be "escaped" somehow if they are meant literally. This is done by preceding them with '\\' which therefore also has to be "escaped" this way if meant literally.

Examples:

".\\.(\$|\$)" matches with the string "a.)"

Perl Compatible Regular Expressions (PCRE)

Perl has a much richer syntax than even the extended POSIX regexp. This syntax has also been used in other utilities and applications -- exim, for example. Although still named "regular expressions", the Perl extensions give an expressive power that far exceeds the regular languages.

This extra power comes at a cost. The worst-case complexity of matching a string against a Perl regular expression is exponential in the size of the input. That means the regular expression could take an extremely long time to process under just the right conditions of expression and input string, although it rarely happens in practice.

External links

Syntax and Semantics of Regular Expressions by the Xerox Research Centre
Regexp Syntax Summary, reference table for different styles of regular expressions
Regexp tutorial written by David Mertz
The Regexp Coach, a very powerful interactive learning system to hone your regexp skills.
Mastering Regular Expressions

.	Matches any single character
[ ]	Matches a single character that is contained within the brackets - [abc] matches "a", "b", or "c". [a-z] matches any lowercase letter.
[^ ]	Matches a single character that is not contained within the brackets - [^a-z] matches any single character that isn't a lowercase letter
^	Matches the start of the line
$	Matches the end of the line
\\( \\)	Treat the expression enclosed within the brackets as a single block
*	Match the last "block" zero or more times - "\\(abc\\)*" matches the empty string, or "abc", or "abcabc", and so on
\\{x,y\\}	Match the last "block" at least x and not more than y times - "a\\{3,5\\}" matches "aaa", "aaaa" or "aaaaa".

+	Match the last "block" one or more times - "ba+" matches "ba", "baa", "baaa" and so on
?	Match the last "block" zero or one times - "ba?" matches "b" or "ba"
\|	The choice (or set union) operator: match either the expression before or the expression after the operator - "abc\|def" matches "abc" or "def".