By Eugene Kirpichov <ekirpichov@gmail.com>
Code available on GitHub: http://github.com/jkff/ire
This article was originally published in the Russian functional programming journal http://fprog.ru/ and this translation was intended to be published in Peter Seibel’s http://codequarterly.com/ journal, but as Peter eventually decided not to proceed with the journal, I am publishing it on my personal space.
I would like to thank Peter for his multitude of extremely valuable comments that have helped me greatly improve the clarity and flow of the article.
The problem of regular expression matching is a well-known one, with a multitude of solutions.
The corresponding algorithms are very beautiful, motivating many curious programmers to write their own regular expression engines for fun.
The approaches to this problem differ widely in their areas of application, and the developer of an engine has to answer a number of design questions.
Let us list some of the existing approaches and engines.
awk, grep and re2 use the so-called “automata-theoretic” approach, which allows them to guarantee linear matching time; however, some features (for example, backreferences) cannot be implemented at all within this approach, and others (such as capturing groups) are quite hard to implement efficiently.
Also, with this approach it is difficult to control memory overhead while retaining high efficiency — even re2 at times suffers from exponentially large memory consumption, though it is designed to avoid such situations.
Besides, this approach allows for certain curious uses, for example, re2 uses automata theory to compute the minimal and maximal possible strings matching the regular expression, thus making Google Code Search possible by greatly reducing the amount of code to be actually matched against the expression when processing a query.
Modified automata-theoretic approach, “tagged” automata: libtre, regex-tdfa — linear matching time is also guaranteed; it is possible to do approximate search;
Semiring-based approach: weighted-regexp — also linear matching time and constant memory consumption; a very simple, beautiful and efficient implementation;
Recursive descent: most of the other engines (Perl and PCRE-compatible engines, Java, irregexp etc.)—the whole range of features, but “pathological” cases are possible where matching time sharply increases.
In a fantastic blog post Dan Piponi [1] outlined yet another approach using monoids and finger trees.
This approach only works for “true” regular expressions (i.e. we only have character classes, braces and the operators +, *, ?, |), but it allows the matching to be performed incrementally: after small changes to the input string we can recompute the result very quickly without scanning the whole string. These changes include concatenating two strings and cutting a string in two pieces. Obviously these two operations form a basis for many others (insertion into the middle, appending or prepending a character, etc.).
In this article we further develop Piponi’s approach, which employs a number of beautiful algorithmic techniques from the world of functional programming, to build a Java library for incremental regular expression matching. This is also an interesting investigation into how the functional approach blends together with an imperative language.
There were several reasons to choose Java:
Where could such an approach to regular expression matching be useful?
The ability to quickly recompute the result after changes of the input string would be useful in text editors for incremental highlighting of arbitrary syntax, with tokens specified by their regular expressions (vim, for instance, allows rather flexible definition of syntax rules, but does not always do the highlighting and re-highlighting correctly, because it uses a rather naive approach to highlighting).
Unfortunately, the engine developed in this article has performance characteristics that do not allow it to serve this purpose — however, not all hope is lost: some possible improvements will be outlined.
One could imagine a problem in bioinformatics where one assembles a DNA sequence, say, using a genetic algorithm, from some “basis” sequences using glue and scissors (concatenations and cuts), and optimizes a function depending on the presence and position of particular patterns in the sequence.
The problem that we’re going to solve is incremental matching of strings against regular expressions. There are several ambiguities to resolve here, however. Let us outline the major features of our engine: they are basically the same as those in Dan Piponi’s article.
There is one more ambiguity to resolve: whether to make the “incremental” interface pure or impure in the mathematical sense: whether concatenation and splitting modify their arguments or create new results.
We choose the pure option (concatenation and splitting are mathematical functions) because, as is the case with nearly any kind of data structure having both a “pure” and an “impure” implementation, it turns out dramatically easier to reason about mathematically and declaratively, and also dramatically easier to implement, debug and test. We shall elaborate on the importance of this design decision in the section “The importance of purity”.
The most well-known approach to regular expression matching is based on finite automata and is studied in most university courses on compiler construction. You can thoroughly familiarize yourself with it, for example, in the article by Russ Cox. Let us provide just a quick refresher of the most basic concepts of how finite automata are used for matching text. Incrementalization will follow in a surprisingly simple fashion.
Given a regular expression, one builds, using for example Thompson’s construction (due to Ken Thompson), a finite automaton where one state is declared “initial”, one is declared “final” (though in theory nothing prevents having more than one initial or final state, and we shall use this opportunity in our algorithm), and some states are connected to each other by edges.
An edge from A to B with subscript c means: “If state A is active and the character c is input, then state B becomes active instead of A”. There are also ε (epsilon)-edges: if A is connected to B by an ε-edge, then, upon activation of A, B is also immediately activated.
Epsilon edges are needed, for example, when constructing an automaton for the “?” (“optional”) construction: given an automaton for R, you can insert an epsilon transition from the initial to the final state, giving an automaton for “R?”. If no edge from state A fits the input, then it is simply deactivated.
The automaton is called “non-deterministic” because, given a state and a character, you can’t determine which edge to follow, because there can be several of them. Instead, you follow all of them at once and therefore, at any given moment, several states may be active.
To perform matching, the initial states are activated and each character of the sequence is fed as input in turn. If in the end at least one “final” state is active, then the matching is declared successful.
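For concreteness, here is a minimal sketch of this matching procedure in Java (the names are illustrative, not the library’s API; for simplicity we assume an automaton without ε-edges, with delta mapping a state and a character to the set of states reachable by one edge):

import java.util.*;

// Keep the set of active states; feed characters one by one. States
// with no fitting edge simply drop out of the set.
static boolean accepts(Map<Integer, Map<Character, Set<Integer>>> delta,
                       Set<Integer> initial, Set<Integer> finals, String input) {
    Set<Integer> active = new HashSet<>(initial);
    for (char c : input.toCharArray()) {
        Set<Integer> next = new HashSet<>();
        for (int s : active)
            next.addAll(delta.getOrDefault(s, Map.of()).getOrDefault(c, Set.of()));
        active = next;
    }
    // Matching succeeds iff at least one final state is active at the end
    for (int s : active)
        if (finals.contains(s)) return true;
    return false;
}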
Example. The regular expression “.*a(b*a|bc+)a”.
An automaton for this expression is shown on the picture below (somewhat simplified with respect to the Thompson’s construction, by removing several redundant nodes and epsilon edges), and here is the sequence of its active states upon feeding it with the string “aabcca”.
A non-deterministic automaton for the regular expression “.*a(b*a|bc+)a”
String seen so far | Active states |
---|---|
(empty) | s1 |
a | s1, s2, s3, s4 |
aa | s1, s2, s3, s4, s5, s8 |
aab | s1, s3, s6 |
aabc | s1, s6, s7, s8 |
aabcc | s1, s6, s7, s8 |
aabcca | s1, s2, s3, s4, s9 |
Since in the end the active set contains a final state, s9, the matching is declared successful.
To speed up matching, one sometimes eliminates ε-edges and performs determinization of the automaton (so that each state has at most one outgoing edge with a given subscript) [2]. As a result, a completely different automaton is obtained, which, however, corresponds to the same regular expression.
In a deterministic automaton, at any given moment, exactly one state is active, which allows for a more efficient implementation of the matching procedure.
However we won’t use determinization because as a result, the automaton may blow up exponentially [3], which, as we’ll later see, is completely unacceptable in our case.
We shall use only the first part, namely elimination of ε-edges: it can only decrease the size of the automaton (compare the NFA above and below).
The algorithm is very simple: take “shortcuts” through ε-edges, i.e. whenever one node is reachable from another through a chain of ε-edges, copy edges from the second node to the first. Then remove all ε-edges (since they’re now unnecessary) and nodes that became redundant (unreachable) as a result of this.
The non-deterministic automaton corresponding to the expression “.*a(b*a|bc+)a”, after removing ε-edges
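A sketch of this shortcut-taking in Java might look as follows (an illustrative representation, not the library’s: character edges live in delta, ε-edges in eps, and we assume delta has an entry, possibly empty, for every state):

import java.util.*;

// For every state s, copy to s the character edges (and finality) of
// everything in s's ε-closure, then drop the ε-edges altogether.
static void eliminateEpsilon(Map<Integer, Map<Character, Set<Integer>>> delta,
                             Map<Integer, Set<Integer>> eps,
                             Set<Integer> finals) {
    for (int s : delta.keySet()) {
        for (int t : epsilonClosure(s, eps)) {
            if (t == s) continue;
            for (Map.Entry<Character, Set<Integer>> e : delta.get(t).entrySet())
                delta.get(s).computeIfAbsent(e.getKey(), k -> new HashSet<>())
                            .addAll(e.getValue());
            if (finals.contains(t))
                finals.add(s);  // s reaches a final state "for free"
        }
    }
    eps.clear();  // ε-edges are now redundant; unreachable states can be pruned separately
}

static Set<Integer> epsilonClosure(int s, Map<Integer, Set<Integer>> eps) {
    Set<Integer> closure = new HashSet<>();
    Deque<Integer> stack = new ArrayDeque<>(List.of(s));
    while (!stack.isEmpty()) {
        int t = stack.pop();
        if (closure.add(t))
            stack.addAll(eps.getOrDefault(t, Set.of()));
    }
    return closure;
}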
Thus we have outlined the approach to testing whether a string matches a regular expression using finite automata. Finding match positions and capturing groups is more difficult, and we direct the reader to Russ Cox’s article for a more thorough treatment — we do not present the traditional approach here ourselves, because we shall use a different one.
Now let us gradually arrive at the basic ideas of incremental matching.
First, let us depict this automaton in a slightly different way.
Every possible input character has a “transition function”: “What will be the next states of the automaton after feeding this character to the input, if its current state is S?” It can also be seen that the notion of such a “transition function” makes sense not only for individual characters, but for whole strings. Strings have transition functions, too!
Given a string’s transition function, however long this string might be, we can emulate feeding it to the automaton’s input without even looking at its characters. A string’s transition function is computed from the transition functions of its characters (which is also shown on the picture below); in the same way, given transition functions for two strings, we can compute the transition function of their concatenation.
Transition functions of a non-deterministic automaton and their composition
Note that the transition function of a concatenation of two strings is the composition of their transition functions, in a slightly unusual sense.
If we were speaking of deterministic automata, then transition functions would be regular mathematical functions from S to S, where S is the automaton’s set of states.
Given two strings p and q with corresponding transition functions f and g, feeding pq to the automaton will take s to f(s) and then to g(f(s)), which means that the transition function of pq is g∘f.
However, we’re using NFAs, and transition functions of characters and strings take states to sets of states. Note that we do not say that the inputs of these functions are also sets of states: it suffices to define them only at individual “singleton” states: applying a transition function to a set of states means applying it to each of these states individually and taking a union of the results (imagine that a simulation of the automaton occurs simultaneously for all these states).
So suppose again that the transition functions of p and q are f and g. Then if the automaton’s initial state is s (some individual state), then pq will take the automaton first to f(s) (a set of states) and then to ⋃_{r∈f(s)} g(r).
This is the definition of composition for transition functions of non-deterministic automata: (f∘g)(s) = ⋃_{r∈f(s)} g(r).
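As a minimal sketch in Java (representing a transition function as a list mapping each state index to its set of successor states; the representation is illustrative, not the library’s):

import java.util.*;

// (f ∘ g)(s) = union of g(r) over all r in f(s)
static List<Set<Integer>> compose(List<Set<Integer>> f, List<Set<Integer>> g) {
    List<Set<Integer>> fg = new ArrayList<>();
    for (Set<Integer> fs : f) {
        Set<Integer> out = new HashSet<>();
        for (int r : fs)          // first follow f to every intermediate state r...
            out.addAll(g.get(r)); // ...then follow g from r
        fg.add(out);
    }
    return fg;
}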
This definition is curious because it has several interpretations.
The interpretation given just now;
The graphical interpretation as connectivity in a two-layer graph, as on the picture above;
Multiplication of boolean matrices: we represent the transition function f as an N×N boolean matrix (where N is the number of states in the automaton) with a 1 in cell (s,t) whenever t ∈ f(s).
Then we may rephrase the definition (f∘g)(s) = ⋃_{r∈f(s)} g(r) as follows: (f∘g)(s,t) = ⋁_{r∈f(s)} [t ∈ g(r)] = ⋁_{r=1..N} ((s,r) ∈ f ∧ (r,t) ∈ g).
Note the extreme similarity with matrix multiplication, (AB)[i,j] = Σ_{k=1..N} A[i,k]·B[k,j]: only summation is replaced with logical “or” (∨) and multiplication with logical “and” (∧).
This interpretation is of course not new; it is a well-known fact, shown in most textbooks on graph theory, that connectivity in graphs may be computed by multiplying their boolean adjacency matrices. However, it opens some opportunities for optimization by employing well-known algorithms for fast matrix multiplication.
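In the matrix interpretation the same composition is a boolean matrix product. A sketch (the library packs the bits much more tightly, as described later in the article):

// Boolean matrix "multiplication": ∨ replaces +, ∧ replaces ×.
static boolean[][] multiply(boolean[][] f, boolean[][] g) {
    int n = f.length;
    boolean[][] fg = new boolean[n][n];
    for (int s = 0; s < n; s++)
        for (int r = 0; r < n; r++)
            if (f[s][r])                     // (s,r) ∈ f ...
                for (int t = 0; t < n; t++)
                    fg[s][t] |= g[r][t];     // ... and (r,t) ∈ g
    return fg;
}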
Let us now consider how this knowledge of transition function multiplication helps us implement incremental matching.
We shall start with a simpler problem: match testing, i.e. answering the question “Does the string s match regular expression R?”
This is by definition equivalent to the question “Does the transition function of s take the initial state of R’s automaton to a final state?” So, if we maintain transition functions of strings under the incremental operations (concatenations and splits), we’ll also be able to maintain the answer to this question.
Handling concatenations is simple: given that we’ve learnt to rapidly compute transition functions of string concatenations, we’ve essentially learnt to rapidly recompute results of “yes/no”-style match tests after concatenations.
Therefore, if we carry with each string its transition function (a device for rapidly performing a match test), then when concatenating two strings, we’d compose their functions, yielding again a device for rapidly testing their concatenation for a match.
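As a sketch of such a “device” (the names are illustrative, not the library’s API; compose is the function from the sketch above, assumed to be in scope):

import java.util.*;

// A string value carrying its transition function: concatenation costs
// one composition instead of a rescan of the characters.
final class MeasuredString {
    final String text;
    final List<Set<Integer>> f;  // transition function of text

    MeasuredString(String text, List<Set<Integer>> f) {
        this.text = text;
        this.f = f;
    }

    static MeasuredString concat(MeasuredString p, MeasuredString q) {
        return new MeasuredString(p.text + q.text, compose(p.f, q.f));
    }

    boolean matches(int initialState, Set<Integer> finalStates) {
        for (int t : f.get(initialState))
            if (finalStates.contains(t)) return true;
        return false;
    }
}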
Handling splits is harder: it is not obvious how to get the transition function of a part of a string, knowing only that of the whole string.
However, this problem of splitting is, curiously, reduced to the problem of concatenation: if we represent a string as a hierarchical concatenation (a tree) of several smaller parts (chunks), then parts of this string will be concatenations of some of these chunks. More precisely, a part of the string equals the concatenation of all complete subtrees fully enclosed within that part, with incomplete chunks giving birth to new tiny but complete subtrees — see the picture below.
Representing a part of a string as the concatenation of its smaller parts
If a split goes through the middle of a chunk, we’ll still have to recompute the transition function for the resulting “sub-chunks” character by character, but most of the chunks or bigger parts of the hierarchy will remain intact and we won’t have to recompute their transition functions, thus saving most of the computation.
All that remains is to choose a good way of representing a string as many chunks, so that concatenation and splitting are efficient, and memory overhead is not too high. This is what we consider in the next section.
Note again that this still does not allow finding positions of matches — only whether the string matches the expression or not. This is perhaps the most interesting algorithmic problem in this article, and we shall address it later when more background is given.
Now, before going to the technical and most interesting parts, let us recap the basic idea of incremental matching.
An example of such a datastructure is illustrated on the picture below.
Example representation with cached transition functions for the string bcabbccabcccab
This data structure, which is a tree of small arrays, is called a “rope”, and ropes are frequently used for representing strings where efficient concatenation and splitting are needed while preserving reasonably low memory overhead (e.g. having a balanced tree of individual characters is not an option because it blows up memory usage).
Ropes are a long-known datastructure and there are many varieties of them, usually differing in the kind of balanced trees they use. One of these varieties is described in the article Ropes: An alternative to strings by Hans-J. Boehm, Russ Atkinson, and Michael Plass, but we’ll use a different one, to be shown later in the article.
Maintaining the transition functions of rope nodes at concatenations and splits is simple for every variety of ropes: exactly as we assemble new nodes from old nodes during rebalancing, we assemble the transition functions of new nodes from transition functions of old nodes (by composing them).
Even the definition of rebalancing operations doesn’t have to be modified, just the constructor for composite nodes (nodes with children) has to multiply the children’s transition functions to obtain one for the parent, and the constructor for chunks has to assemble the transition function from transition functions of characters.
Note that there is nothing very special about transition functions that allows us to maintain them under splits and concatenations. The only thing we relied on is that the transition function of a concatenation of two strings can be computed from their transition functions.
So, essentially, ropes allow us to maintain absolutely any kind of “additive” characteristic (and we’re of course not restricted to strings, i.e. lists of characters — lists of numbers, for example, are just as fine). There is just one restriction: for the additivity to make any sense, the characteristic must not depend on the order in which we add up the items to obtain the whole. This property is called associativity: since concatenation satisfies a(bc) = (ab)c, the additive characteristic f must obey f(a(bc)) = f((ab)c); that is, if the additivity is expressed as f(ab) = g(f(a), f(b)), then g must obey g(g(x,y),z) = g(x,g(y,z)).
Here are some examples of additive (associative) characteristics:
There are many more examples, you can find them near the end of the article, in the section “Monoids”.
Below is an example of a rope of numbers maintaining the minimum and maximum — the combining operation here is g((m1,M1),(m2,M2))=(min(m1,m2),max(M1,M2))
A rope of numbers with “cached” minimums and maximums.
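To make the combining operation concrete, here is a minimal sketch in the style of the library’s Reducer (the interface is re-declared here just to make the sketch self-contained; only its compose method mirrors the real one, matching its use in the append code shown later; the MinMax type is illustrative):

// An associative "additive measure": the min and max of a sequence of
// numbers, combined exactly as in the formula above.
interface Reducer<M> { M compose(M a, M b); }

final class MinMax {
    final int min, max;
    MinMax(int min, int max) { this.min = min; this.max = max; }
}

final class MinMaxReducer implements Reducer<MinMax> {
    public MinMax compose(MinMax a, MinMax b) {
        // g((m1,M1),(m2,M2)) = (min(m1,m2), max(M1,M2))
        return new MinMax(Math.min(a.min, b.min), Math.max(a.max, b.max));
    }
}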
In addition to splitting at a position, one may implement one more very important and beautiful operation on ropes: splitting on a monotone predicate. We shall need this operation when we get from “testing for a match” to “locating matches”, but we first present it in an abstract setting: it is a lot easier to understand how locating matches can be done after seeing this abstract algorithm than to recognize its beauty inside the full, complicated algorithm for locating matches.
Suppose f is a predicate on strings such that a string may only gain (but never lose) the property f as symbols are appended to it on the right, i.e. ∀s1, s2: f(s1) ⇒ f(s1+s2). In this case let us call f monotone.
Then obviously each string s satisfying the property f has a minimal prefix satisfying f. Let us illustrate this notion and the algorithm for its efficient computation on a rope:
Splitting a rope of numbers by the monotone predicate “Sum of squares exceeds 140”
This picture shows a rope of numbers, storing in each node the sum of squares of numbers within this node, and it also shows how the rope is split to find the minimal prefix whose sum of squares is greater than 140. The algorithm resembles a “hot and cold” game; the figure shows sums of squares of prefixes before and after various edges and (when scanning a leaf chunk) individual numbers; those that do not yet satisfy the condition are marked with “cold” blue color, and those that do are marked with “hot” red. This picture also shows that when scanning through a leaf chunk, we have to recompute and add up the squares of numbers, i.e. the information stored in nodes of the rope is not sufficient.
On this picture an edge is marked red if the predicate is true for the prefix which ends where this edge ends. To find the split point, we have to descend from the root of the rope to a leaf chunk, doing steps downward or to the right, at each level scanning edges left-to-right until we find a red edge (this is similar to a binary search procedure), and finally split the leaf chunk, this time using regular linear search.
Since we only move downward or to the right, at any moment the edges that have been considered cover together an ever-growing prefix of the original rope, and each new scanned edge appends the rope covered by this edge to this prefix. If the predicate is not true before scanning an edge but becomes true after scanning it, this means that it becomes true somewhere inside the part of the rope covered by the destination node of this edge, and we have to descend downward into this node in order to find the split point.
In order to be constantly aware of whether the predicate is true, we should be able to quickly compute f(ps) given f(p) and f(s) for any two ropes p and s: when moving downward or to the right, we extend the “current prefix” p with the sub-rope s covered by each scanned edge, and when we reach a leaf chunk, the linear search extends p with single-element sub-ropes corresponding to the chunk’s elements.
Now note that match testing fits this pattern: the predicate “the string contains a match of R” (that is, “the string matches .*R.*”) is monotone, because appending characters to a string cannot destroy a match it already contains.
So, we can use this “monotone split” procedure to find the smallest prefix of the string containing a match of our regex.
Splitting a rope for the string acabaabacabccaaba on the monotone predicate “matches .*bcc.* ”
This is key to finding match positions.
Suppose we have to find a match of R in the string. This problem can partly be reformulated as testing the string against the expression “.*R.*”, but this only answers the question “is there a match of R somewhere in the string?”, not “where is the match?”
Two key ideas will help us find the match positions.
So, we use the “split on monotone predicate” operation to find the end of the first match, and use it again, but this time in backward direction and with the backward automaton, to find its beginning. If the expression is such that strings matching it are usually small, we can just run the backward automaton character by character; if not, we can also have the nodes of the rope store transition functions not only for the “forward” automaton, but also transition functions of reversed parts of the string with respect to the “backward” automaton.
It’s easy to see that all rope operations change trivially — instead of composing one pair of transition functions, we compose two: (f1,b1)∘(f2,b2)=(f2∘f1,b1∘b2) — note that the order of composition for backward transition functions is reversed because for strings if a=bc, then reverse(a)=reverse(c)reverse(b).
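As a sketch (using a Pair type with first/second fields and a static of, like the one in the code shown later in the article; then(x, y) stands for “apply x, then y”):

import java.util.function.BinaryOperator;

// Composing the paired measure of two concatenated strings: forward
// functions compose in string order, backward ones in reverse order,
// per the formula above.
static <F> Pair<F, F> composePair(Pair<F, F> left, Pair<F, F> right,
                                  BinaryOperator<F> then) {
    F forward  = then.apply(left.first, right.first);    // read the left part first
    F backward = then.apply(right.second, left.second);  // reversed string: right part first
    return Pair.of(forward, backward);
}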
There is actually a number of complications here, related to possible overlaps of occurrences of different items from the given system of regular expressions (or even self-overlaps), but the main idea is the same: split to find the end of the match, split backward to find the beginning. The curious reader is directed to the source code.
Let us put together the presented algorithms and overview the structure of the whole program. Graphically this structure is shown on the pictures below. Diagrams of this kind are called “concept maps”; these were drawn with the IHMC CmapTools software.
Program structure overview
The user specifies several regular expressions as strings, which through compilation (parsing and converting to a finite automaton) get transformed to an object of type PatternSet. Such an object is capable of “indexing” regular strings, yielding objects of type IndexedString. They, in turn, are capable of efficiently searching for all matches of the patterns in themselves, and they can also efficiently be concatenated or split (on a monotone predicate). These are of course the very ropes maintaining transition functions.
Program structure: ropes and additive measures
“Indexed strings” are implemented with ropes (Rope). The program implements only ropes over characters, because further generalization was not necessary within the scope of our problem, and the Java type system would burden a generic implementation with a lot of syntactic garbage. A rope is a string that knows its “measure” of type M. The measure is computed as the sum of measures of individual characters (Function<Character,M>) under an arbitrary additive operation (Reducer). Ropes are implemented with a special kind of balanced trees that will be described later in the article. During rebalancing operations, measures of new nodes are summed from measures of old nodes using this additive operation. For regular expression matching, the additive operation is composition of transition functions for the expression’s automaton (more precisely, for two automata: forward and backward).
Program structure: finite automata
Automata are implemented in a way that facilitates representing and computing their “transition functions”. An automaton can tell its transition function for any given character, and the transition function is a function from the automaton’s state type to the same type. A value of the state type is a “black box” that only tells what patterns are “terminated” by this state: if after scanning a string, the automaton’s state terminates certain patterns, then there are occurrences of these patterns ending at the end of this string. Transition functions are represented not as arbitrary functions but as a special kind of objects with efficient composition [5]. These objects are placed into the ropes used as “indexed strings”.
Program structure: the connection between deterministic and non-deterministic automata
The notion of an automaton is used in the program in two ways: for deterministic and non-deterministic automata. The state type of a deterministic automaton [6] is the set of integer numbers from 0 to some N; correspondingly, transition functions are functions from 0…N to 0…N. They are implemented as one-dimensional int[] arrays, and their composition is computed as easily as c[i] = b[a[i]].
In the case of non-deterministic automata, values of the state type are subsets of some “basis” state set, and a transition function is determined by how it transforms each individual basis state into a set of other basis states, i.e. it is specified by a transformation of the form int → int[]. Composition of such functions is expressed as c[i] = ⋃_{j∈a[i]} b[j] (thread the second function through all the outputs of the first). For the sake of efficiency, this transformation is implemented as a boolean matrix with every element represented as one bit in a single array [7].
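A sketch of this bit-packed representation (assuming, for brevity, at most 64 states so that a matrix row fits into one long; the real implementation keeps all bits in a single array):

// Row s of the boolean matrix is a bitmask of the states reachable
// from s. Composing f with g ORs together the rows of g selected by
// the bits of f — exactly c[i] = ⋃_{j∈a[i]} b[j].
static long[] compose(long[] f, long[] g) {
    long[] fg = new long[f.length];
    for (int s = 0; s < f.length; s++) {
        long row = f[s];
        while (row != 0) {
            int r = Long.numberOfTrailingZeros(row); // next state reachable by f
            fg[s] |= g[r];                           // add everything g reaches from r
            row &= row - 1;                          // clear that bit
        }
    }
    return fg;
}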
The aforementioned datastructure allows matching a string against a single regular expression. It is natural to demand a more practically useful generalization: matching against multiple expressions (there also exist other approaches to this problem; see, for example, the article “Compact DFA structure for multiple regular expressions matching” by Lin, Tang et al.). The result of such a match is a set of facts of the form “the portion i..j matched expression k”.
The problem of matching against multiple regular expressions is solved quite easily: we build an automaton for their union, R1|R2|..., but we distinguish between final states for different expressions, i.e. a state of the resulting automaton is not simply “final” or “non-final”, but it has an associated bitset: for which of the expressions it is final.
This change only slightly influences other parts of the program, such as automata minimization or finding match positions.
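For illustration, with at most 64 patterns this bookkeeping can be as simple as the following sketch (not the library’s API):

import java.util.Set;

// finalMask[s] has bit k set iff state s is final for pattern k; the
// patterns matched so far are the union over the active states.
static long matchedPatterns(Set<Integer> activeStates, long[] finalMask) {
    long matched = 0;
    for (int s : activeStates)
        matched |= finalMask[s];
    return matched;  // bit k set ⇔ pattern k has a match ending here
}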
Let us show an example of usage of the library:
PatternSet pat = RegexCompiler.compile("007","008");
IndexedString s1 = pat.match("as00haklsdjhfla00");
IndexedString s2 = pat.match("7jhd7dsh008dsfa");
System.out.println(s1.append(s2).getMatches());
The program prints:
[0@(15,3), 1@(25,3)]
This means that occurrences of the first and second pattern were found at positions 15–17 and 25–27, respectively.
This code uses the following aspects of the API:
Now let us discuss the library’s performance.
This discussion won’t be a simple one, because the performance is influenced by a large number of factors.
It is also necessary to balance the share of time devoted to indexing the string, performing concatenations/splits, and searching. Indexing is done quite slowly, and one needs a large number of concatenations/splits and searches to outweigh it and gain an advantage over a “traditional” regular expression engine.
In light of the above, let us consider just a single test and analyze its performance. Take the set of regular expressions from the regex-dna problem of the Language Shootout and consider the performance of matching operations compared to the standard regular expression engine bundled with Java (java.util.regex.Pattern), varying the length of the input DNA string (but keeping the total number of matches constant) and the size of the leaf chunks: 8, 16, 32, …, 512 characters.
We do not measure splitting performance separately, because splitting is used during search, and we do not measure concatenation performance because it is so fast (allocate a few objects and compose a few transition functions) that it is difficult to imagine a scenario where it would be the bottleneck.
Here is the pattern set:
[cgt]gggtaaa|tttaccc[acg]
a[act]ggtaaa|tttacc[agt]t
ag[act]gtaaa|tttac[agt]ct
agg[act]taaa|ttta[agt]cct
aggg[acg]aaa|ttt[cgt]ccct
agggt[cgt]aa|tt[acg]accct
agggta[cgt]a|t[acg]taccct
agggtaa[cgt]|[acg]ttaccct
Let us generate the input as a random sequence of the characters “a, g, c, t” of length 50000N (N will vary from 1 to 10) in which any two consecutive characters are distinct (therefore the aforementioned patterns can’t occur in it), choose 100 random positions in the sequence, and insert occurrences of strings randomly chosen from the set of 8×2×3 = 48 strings defined by the given pattern set (8 patterns, each with 2 alternatives, each alternative matching 3 different strings). The program will compute the occurrence count of each pattern.
Results of the benchmark are shown on the pictures below. The performance characteristics of each of the two programs (our engine and the standard Java engine) are shown in the terms most appropriate for them: for our engine these are the indexing speed (in characters per second, because indexing time is proportional to the number of characters) and the search speed (in occurrences per second, because search time is proportional to the number of occurrences). For the Java engine a more appropriate characteristic is “characters processed per second”; it is displayed on the same graph as our engine’s “indexing speed”, though this comparison is somewhat flawed.
On graphs in the left part of the picture, different curves correspond to different base sizes of chunks in the rope datastructure, and the bold curve corresponds to the Java engine.
The question “When is our engine better than the Java engine?” is best answered by the top left graph, which shows the dependency of search speed on the string length. It can be seen that the Java engine’s search time is proportional to the length of the string, and our engine’s time is proportional to the number of occurrences. With small base chunk sizes (4–32 characters) our engine is much faster for large strings.
On graphs in the right part of the picture, different curves correspond to different lengths of the input string. They are displayed to show how the base chunk size influences search and indexing speed. It can be seen that with increase of this chunk size indexing speed increases rapidly (but with a limit) and search speed decreases just as rapidly.
We can conclude that for large strings with a small number of occurrences our engine is more efficient, especially if tuned with a small base chunk size. However, in this case there is a sharp increase in memory consumption: memory consumption per leaf chunk does not depend on the chunk size, but there are 128 times more 4-character chunks in a string than there are 512-character chunks, therefore the memory consumption is also 128 times larger.
Performance benchmarks
Memory overhead
Almost all the time is spent composing transition functions of characters (which is done through boolean matrix multiplication) for recomputing transition functions of leaf chunks during splits, and almost all the memory overhead is devoted to storing these boolean matrices.
We haven’t considered how performance depends on the complexity of regular expressions and on the number of occurrences. A comprehensive treatment of performance questions would take up too much space; curious readers are encouraged to play with the library themselves.
The current implementation has a number of drawbacks. It is not yet clear which of these can be fixed and which can’t, but in any case, they are interesting algorithmic problems worth thinking about.
So, we’ve built a library that does incremental matching of strings against a set of regular expressions using balanced trees and monoids. The library is very efficient in the case of long strings, few expressions, few occurrences, frequent incremental recomputation and a lot of free memory, and is quite inefficient in other cases. It is hard to say for which of the remaining cases it can be made efficient at all: for example, whether it is possible to build a high-performance incremental lexer on top of it, and whether this experiment can be called an algorithmic success at all. The author hopes at least that the presented techniques will inspire algorithmically inclined readers to new research and will prove useful to them in other problems.
In any case, we can say that this development is an interesting and successful experience of blending the functional and imperative approaches.
Let us list the used techniques, ideas and traditions from functional programming, and discuss how well they fit with the imperative nature of the target language (Java).
The author would like to thank Dmitry Demeschchuk, Julia Astakhova, Alexey Ott and other reviewers of the original Russian version of this article for their feedback.
The project is published on GitHub at http://github.com/jkff/ire.
It has already been said that, when representing a string by a balanced tree, in order to keep memory usage reasonable, one should associate each leaf of the tree not with one character but with a chunk. Therefore, the datastructure suggested in Dan’s post (finger trees, described, for example, by Heinrich Apfelmus and in the original paper) is not a good fit: it assumes one node per element of the sequence (string).
We should choose one of the kinds of balanced trees satisfying our requirements. Let us list the requirements.
Among the simplest (in terms of implementation) but nevertheless quite efficient balanced trees are trees of uniform height, for example “2–3 trees” and “B-trees” (which are frequently used for DBMS indices).
In such trees, the length of the path from the root to every leaf is the same, therefore (since each non-leaf node has at least 2 children) the height is logarithmic. Usually they are used for a quite different class of problems, namely representing sets and searching in them, but they are also a perfect fit for representing sequences (strings). The basic idea is that a node is allowed to have K to 2K−1 children (for some K), and most operations, such as insertion, splitting and concatenation, preserve this property; when they don’t, a rebalancing occurs: either an overfull node is split into two, or two underfull nodes are merged into one.
We shall use a variation on this theme: 2–3 trees with chunks in leaves, where the chunk size may vary from N to 2N−1, and all data is stored in leaves, not in nodes [9].
The next picture illustrates the implementations of all operations on such trees. Essentially two operations suffice: splitting and concatenation; all others can be expressed through them. When digesting the pictures, it is important to remember that we’re dealing with trees of uniform height. Note also that the chunk size invariant may be broken, but only when there are fewer than N elements in total: in this case the tree is represented by a single underfull chunk.
Rope operations
Let us explain these pictures briefly in the order in which they appear, top to bottom, left to right:
One of the most important aspects of this datastructure is its “purity”: operations on it do not change an existing instance but form a new one instead, i.e. they are functions in the mathematical sense.
We have already mentioned the importance of the decision to make the incremental interface “pure”, but now it is time to elaborate.
There are many advantages to using a pure approach to algorithms and datastructures, which have manifested themselves during the implementation of this program, particularly in the implementation of ropes.
Exceptional ease of implementation. Essentially, we can take the diagrams drawn on the picture above and translate them mechanically into code. The lack of mutability makes the code a lot simpler, and its correctness (or lack thereof) becomes more obvious, because the code no longer has a time dimension: to understand how it works, one does not need to mentally trace a sequence of intermediate steps [10]; the code is just an enumeration of cases, each declaring that “such and such input yields such and such output”. And indeed, to the author’s surprise, after the first successful compilation only one or two silly mistakes had to be fixed before the code passed all the tests.
Ease of debugging. During debugging one often wants to evaluate some expressions “in advance” in order to understand whether it is necessary to step into them, or whether their result is correct and the error is somewhere later in the code [11]. When these expressions are “pure” (i.e. have no side effects), such an approach is possible. If side effects are present, then evaluating the expression in the debugger will change the program’s internal state and further debugging will be pointless.
Complete thread safety. It is well known that most standard mutable datastructures do not allow concurrent reading and modification, and one must synchronize access to them in a multi-threaded program. However, it is often desirable to provide non-blocking read access, even at the cost of reading a slightly stale state of the datastructure. There exist tricks allowing this for mutable datastructures (see, for example, the implementation of the ConcurrentHashMap or ConcurrentSkipListMap classes in the Java standard library), but for immutable datastructures no tricks are necessary: every instance can be safely read without worrying about concurrent modification, because it cannot be modified at all.
High performance and low memory consumption in certain scenarios. There exist situations where it is useful to preserve the “original” version of a datastructure after applying an operation to it (for example, to preserve access to two ropes after computing their concatenation). Most importantly, these situations arise in backtracking enumeration algorithms and genetic algorithms (for example, when two genomes can be combined in several ways and one wants to keep both the genomes and the result of their crossover). Of course, one might just copy the original datastructure, but that may be very inefficient, especially if the structure is large. Pure datastructures, on the contrary, need no copying, and we get a performance advantage. Also, as shown on the picture above, many operations on ropes allocate minuscule (constant or logarithmic) amounts of extra memory. The picture below shows the object graph for two ropes and their concatenation. It can be seen that most of the memory is shared, but each object nevertheless remains accessible independently.
Sharing of memory after rope concatenation.
To even better explain how rope concatenation and splitting work, and why they are so easy to implement correctly, let us simply show the code.
private static <M> Rope<M> append(Rope<M> left, Rope<M> right) {
    int blockSize = left.factory.getBlockSize();
    Reducer<M> reducer = left.factory.getReducer();
    M sum = reducer.compose(left.sum, right.sum);
    if (left.h == right.h) {
        // Case "Two non-leaves of equal height"
        if (left.h > 0)
            return new Rope<M>(left, right, sum);
        // Case "Two leaves, both large enough to be children of a 2-node"
        if (!left.isUnderflownBlock() && !right.isUnderflownBlock())
            return new Rope<M>(left, right, sum);
        // Case "Two leaf chunks, rebalancing needed"
        String bigBlock = left.block + right.block;
        if (bigBlock.length() <= 2 * blockSize - 1)
            return new Rope<M>(left.factory, bigBlock, sum);
        return new Rope<M>(
                new Rope<M>(left.factory, bigBlock.substring(0, blockSize)),
                new Rope<M>(left.factory, bigBlock.substring(blockSize, bigBlock.length())),
                sum);
    } else if (left.h == right.h + 1) {
        if (left.c == null) // 2-node of h + h -> 3-node
            return new Rope<M>(left.a, left.b, right, sum);
        else // 3-node of h + h -> 2-node of 2-nodes
            return new Rope<M>(
                    new Rope<M>(left.a, left.b, reducer.compose(left.a.sum, left.b.sum)),
                    new Rope<M>(left.c, right, reducer.compose(left.c.sum, right.sum)),
                    sum);
    } else if (right.h == left.h + 1) {
        // Symmetrical
    } else if (left.h > right.h + 1) {
        // Break the larger tree into nodes, regroup using associativity
        if (left.c == null)
            return left.a.append(left.b.append(right));
        else
            return (left.a.append(left.b)).append(left.c.append(right));
    } else { // right.h > left.h + 1
        // Symmetrical
    }
}
And here is the splitting code, with a slightly curious interface. This function splits a rope into two, given a monotone function on the string represented by the rope. Monotonicity is exploited by requiring the caller to provide functions that compute f(a+b) given f(a) and b, where b is either a string (represented by a rope) or a single character (the latter is used for finding the rising edge within a leaf-level chunk).
public <S> Pair<Rope<M>, Rope<M>> splitAfterRise(
        S seed,
        Function2<S, Rope<M>, S> addChunk, Function2<S, Character, S> addChar,
        Predicate<S> toBool)
{
    if (block != null) {
        // Simple linear search inside the chunk
        S s = seed;
        for (int i = 0; i < block.length(); ++i) {
            if (toBool.isTrueFor(s))
                return Pair.of(
                        new Rope<M>(this.factory, block.substring(0, i)),
                        new Rope<M>(this.factory, block.substring(i, block.length())));
            s = addChar.applyTo(s, block.charAt(i));
        }
        if (toBool.isTrueFor(s))
            return Pair.of(this, new Rope<M>(this.factory, ""));
        return null;
    } else {
        // Start from seed
        if (toBool.isTrueFor(seed))
            return Pair.of(new Rope<M>(this.factory, ""), this);
        S afterA = addChunk.applyTo(seed, a);
        // If adding node A made the condition true, descend into A
        if (toBool.isTrueFor(afterA)) {
            // Split A and assemble result from b, c and parts of a
            Pair<Rope<M>, Rope<M>> sa = a.splitAfterRise(seed, addChunk, addChar, toBool);
            return (c == null)
                    ? Pair.of(sa.first, sa.second.append(b))
                    : Pair.of(sa.first, sa.second.append(b).append(c));
        }
        // Same for B
        S afterB = addChunk.applyTo(afterA, b);
        if (toBool.isTrueFor(afterB)) {
            Pair<Rope<M>, Rope<M>> sb = b.splitAfterRise(afterA, addChunk, addChar, toBool);
            return (c == null)
                    ? Pair.of(a.append(sb.first), sb.second)
                    : Pair.of(a.append(sb.first), sb.second.append(c));
        }
        // Same for C, if this is a 3-node
        if (c == null)
            return null;
        S afterC = addChunk.applyTo(afterB, c);
        if (toBool.isTrueFor(afterC)) {
            Pair<Rope<M>, Rope<M>> sc = c.splitAfterRise(afterB, addChunk, addChar, toBool);
            return Pair.of(a.append(b).append(sc.first), sc.second);
        }
        return null;
    }
}
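For illustration, here is how the engine’s use of this interface for finding the end of the first match might look (a sketch written with Java 8 lambdas for brevity; only the splitAfterRise signature is real, the helper names are illustrative):

// S = set of active states of the automaton for ".*R.*"; the predicate
// becomes (and, by monotonicity, stays) true once a final state is active.
Pair<Rope<M>, Rope<M>> parts = rope.splitAfterRise(
        initialStates,
        (states, chunk) -> applyTransitionFunction(chunk, states), // add a whole sub-rope at once
        (states, c) -> step(states, c),                            // add a single character
        states -> containsFinalState(states));
// parts.first is the shortest prefix containing a match of R;
// the first match ends exactly at its end.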
Remember how, given a finite automaton, we can associate every string with a “transition function” with respect to this automaton, and when concatenating two strings their transition functions are composed (let us denote the composition of f1 and f2 by f1∘f2).
Composition of transition functions (similarly to concatenation of strings) has a few simple and useful properties: it is associative, (f1∘f2)∘f3 = f1∘(f2∘f3), and it has a unit — the transition function of the empty string, which maps every state to itself and whose composition with any function on either side gives back that function.
These two properties allow us to say that transition functions of a finite automaton form a monoid.
More precisely, it is said that the set M, the operation ⊗ and the element u∈M (called the “unit” of this operation) form a monoid if the aforementioned two properties hold.
Since the notion of a monoid is so simple and general, it is unsurprising that upon a close look at everyday objects in programming one may see dozens of monoids. Some of them are listed in the table below. Some applications of monoids to programming are also listed in Dan Piponi’s article Monoids and their uses.
The set M | Operation ⊗ | Unit u | Comment |
---|---|---|---|
Numbers | + | 0 | Natural, integer, real, complex, quaternions… |
Numbers | × | 1 | |
Integers | LCM | 1 | |
Polynomials | LCM | 1 | |
Numbers, strings… | MIN, MAX | Maximal and minimal element | |
Booleans | AND | TRUE | |
Booleans | OR | FALSE | |
Matrices | Multiplication | Identity matrix | Over numbers (+, ×), over numbers (+, MIN), over booleans (OR, AND), … |
Sets | Union | Empty set | |
Sets | Intersection | Complete set | Restricted to subsets of the “complete” set |
Lists, strings… | Concatenation | Empty sequence | |
Dictionaries | Union | Empty dictionary | “Conflicts” are resolved in another monoid: (dic1⊗dic2)[key]=dic1[key]⊕dic2[key] |
Functions of type A → B | (f⊗g)(a) = f(a) ⊕ g(a) | e(a) = e_B | (B, ⊕, e_B) is a monoid |
Permutations | Multiplication | Identity permutation | |
Functions | Composition | Identity function | |
Tuples (x,y) where x∈X, y∈Y | (x1,y1) ⊗ (x2,y2) = (x1 ⊕_X x2, y1 ⊕_Y y2) | (u_X, u_Y) | If (X, ⊕_X, u_X) and (Y, ⊕_Y, u_Y) form monoids |
… | … | … |
[1] Dan Piponi is a specialist in computer graphics who participated in the making of all three Matrix movies, Star Trek and some other films.
[2] There exist several algorithms for determinization and minimization, described, for example, in Hopcroft’s article “An O(n log n) algorithm for minimizing the states in a finite automaton”; see also Brzozowski’s algorithm.
[3] For example, any deterministic automaton for an expression of the form (0|(01*)(01*)(01*)… 0)* will have size O(2^n), where n is the number of repetitions of (01*).
[4] This idea is taken from the article Regular Expression Matching in the Wild by Russ Cox and is used in his re2 engine.
[5] There is a similar situation in graphics programming: coordinate transformations are also represented not as arbitrary functions but as matrices of numbers that can be efficiently multiplied (composed).
[6] Actually we don’t need deterministic automata in the final program; they were only used during testing, and they encouraged creating the automaton abstraction, of which these two kinds are particular cases.
[7] Curiously, composition of such transformations is then also captured by multiplication of such boolean matrices.
[8] In the early stages of development there was a problem where computing the transition function of a chunk of N characters required N intermediate matrices, but it was easily solved by a small API change without sacrificing purity.
[9] A similar datastructure is used for a similar purpose in the Data.Sequence module of the Haskell standard library.
[10] It is instructive to look, for comparison, at some implementation of rebalancing in mutable red-black trees.
[11] However, it’s no secret that Real Programmers don’t use debuggers: unfortunately, they won’t be able to appreciate this particular advantage.