Next: POSIX Regular Expression Syntax
Up: Appendix: Regular Expressions
Previous: Brief History
  Contents
Regular expressions consist of constants and operators that denote sets of strings and
operations over these sets, respectively. Given a finite alphabet the following
constants are defined
- ( empty set )
denoting the set
- ( empty string )
denoting the set
- ( literal character )
denoting the set
and the following operations
- ( concatenation )
denoting the set
. For example,
.
- ( set union )
denoting the set union of and .
- ( Keeene star )
denoting the smallest superset of that contains
and is closed under string concatenation. This is the set of all strings that
can be made by concatenating zero or more strings in . For example,
.
To avoid brackets it is assumed that the Kleene star has the highest priority, then concatenation
and then set union. If there is no ambiguity then brackets may be omitted. For example,
is written as and a
can be written as a .
Sometimes the complement operator is added; denotes the set of all strings over
that are not in . In that case the resulting operators form a Kleene algebra. The
complement operator is redundant: it can always be expressed by only using the other operators.
Examples:
- A
denotes
denotes the set of all strings consisting of 's and 's, including the empty string
the same
-
denotes the set of strings starting with , then zero or more 's and
finally optionally a .
-
denotes the set of all strings
which contain an even number of 's and a number of 's divisible by three.
Regular expressions in this sense can express exactly the class of languages accepted by finite
state automata, the regular languages. There is, however, a significant difference in compactness.
Some classes of regular languages can only be described by automata that grow exponentially in
size, while the required regular expressions only grow linearly. Regular expressions correspond to
the type 3 grammars of the Chomsky hierarchy and may be used to describe a regular language.
We can also study expressive power within the formalism. As the example shows, different regular
expressions can express the same language, the formalism is redundant.
It is possible to write an algorithm which given two regular expressions decides whether the described
languages are equal - essentially, it reduces each expression to a minimal deterministic finite state
automaton and determines whether they are equivalent.
To what extent can this redundancy be eliminated? Can we find an interesting subset of regular
expressions that is still fully expressive? Kleene star and set union are obviously required, but
perhaps we can restrict their use. This turns out to be a surprisingly difficult problem. As simple
as the regular expressions are, it turns out there is no method to systematically rewrite them to
some normal form. They are not finitely axiomatizable. So, we have to resort to other methods.
This leads to the star height problem.
Next: POSIX Regular Expression Syntax
Up: Appendix: Regular Expressions
Previous: Brief History
  Contents
Andre Merzky
2004-05-13
|