Springe direkt zu Inhalt

Boris Dimitrov:

Complexity of regular expression matching


Regular expressions, or regexes for short, are a pattern matching standard for string parsing and replacement. They are widely used computational primitive employed in many programming languages and text processing utilities such as JavaScript, Perl, Python, Ruby, Google RE2. They are also used in computer networks, databases and data mining, computational biology etc. A regular expression consists of symbols from some alphabet ∑ and a set of operators O. We will start by introducing various operators and explain their usage, giving us the basics with the help of which we will be able to construct different types of regular expressions. Then we will show the Thompson transformation, which converts an arbitrary regular expression into an equivalent nondeterministic finite automata, thus providing us with algorithm that solves the regular expression matching problem for the general case in the “rectangular” O(mn) time. Later there were improvements to this method that led to an algorithm that is the fastest for this problem known to date. However there are some specific types of regular expressions, for which faster matching algorithms exist. We will introduce some examples, which are reason to be considered that faster algorithm for the general case might exist. Finally we will classify the regular expressions based on their depth and the used set of operators and present the respective results.

Bachelor of Science (B.Sc.)