JRegexPlus

Regular Expression engine - Backreferences without backtracking

Download .zip Download .tar.gz View on GitHub

A tour of JRegexPlus

Step 1. Lexical Analysis

This is conducted by the Tokenizer class which takes the regular expression string and converts it into a stream of tokens. these tokens are represented internally by the RegexToken class and its subclasses. The subclasses contain information specific to various types of tokens such as character classes, lookaround operators, backreferences etc.

Step 2. Parsing

This is the job of the Pattern class. This is a handwritten recursive descent parser that converts the stream of tokens into a transition table and then returns a Matcher object seeded with that transition table. More below on how the transition table gets built.

Step 3. Match

The user can now use the Matcher instance to run matches on any input string. The Matcher instance can then be used to extract any results found.


How it works


Most popular regular expression engines(Perl, java.util.regex, PCRE etc) use a traditional NFA with backtracking to match regular expressions. These are highly effective as seen from their wide use and popularity but are known to have corner cases that they cannot handle. For eg., the regular expression X(.+)+X with the string =XX============== will cause these regular expression engines to hang or crash given sufficient numbers of = signs at the end(30-40 are usually more than enough).

JRegexPlus implements the McNaughton-Yamada-Thompson algorithm to construct an NFA. This algorithm is described in detail in Aho, Lam, Sethi, and Ullman's 2006 Compilers: Principles, Techniques, and Tools(Algorithm 3.23) and has the advantage of being as fast as the standard backtracking implementations in normal cases while maintaining its performance in the corner cases.

However, the algorithm, as described in textbooks, does not allow either for sub-match tracking or for backreferences. The following paragraphs explain how JRegexPlus adapts the algorithm to allow both. The description below is a bit terse for now and is best read after understanding how existing regular expression engines work. Russ Cox's description of RE2 and Eli Bendersky's series of articles both provide good explanations as does Jeff Friedl's book Mastering Regular Expressions.

Submatch grouping

This happens at the parser level and during the simulation. When parsing, the parser stores a list in each token that contains the ID's of the groups to which that token belongs. Later, when the engine simulates the NFA, each state carries a satellite object around with it which contains the partial string match that has happened so far. The satellite object contains an array which records which parts of the string matched which subgroups in the regular expression.

Note that since this is an NFA, it exists in multiple states at the same time, each of which may have followed different paths. As a result, the same letter in the text may belong to different subgroups in different state objects. However, since this is a Thompson NFA, the simulation is text-directed and not regex-directed, meaning that the string is read one character at a time and all the states are fed any given character simultaneously.

At each step, the simulation tracks if the NFA has reached the finish state and records the longest match so far, if any. The simulation ends when the string ends or when the NFA is dead(potentially after it reaches a finish state, records a success and then reads one more character which makes it die). If we did reach the finish state, the NFA reports success and the longest match found is the matched string. As described earlier, a satellite state object records the submatch groupings as it goes along so we can just read the results from the state object associated with the finish state.

Backreferences

This essentially involves simulating multiple NFA's simultaneously. Recall that the Thompson construction has the NFA in multiple states at the same time while we read the string character by character. To accommodate back-references, what we do is maintain a stack of NFA's and, every time we read a character from the input string, we pop the NFA's from the stack one at a time and perform state transitions on them with the one character in the usual way. NFA's that are alive after this step get pushed back on to the stack until all NFA's are dead. At this point, we look if any NFA reached a finish state and recorded a success for us to return.

The next question is where these additional NFA's come from and what their relationship is with backreferences. The answer is as follows - every time an NFA simulation encounters a backreference token in its transition table, it looks at the path taken to that state and figures out what string that backreference is supposed to match. This information is available in the satellite state object associated with each live state in the NFA. Note that the satellite state object records the actual string matched and not the full-fledged regular expression.

Now that we know what the backreference is supposed to match, we create a transition table to match that string and, rather than try to match that table and backtrack and so on, we create a new transition matrix by inserting that table into the existing transition table and create one big table which has replaced the backreference token with a series of tokens that represent the string the backreference should match at that position. This new table then gets pushed onto the stack of NFA's and is simulated along with the parent table and any other table that gets created along the way.

Here is an example to make it clearer - consider the regex (a*)bc\1 on the string zzzaabcaazzz.

The engine chugs along as you would expect with a Thompson NFA and matches aabc in the string with (a*)bc in the regex. Now, it has reached the backreference and the next character needs to match whatever the backreference refers to. In this case, the submatch for group 1 is aa so the simulator inserts a table to match aa and deletes the backreference token from the new table. The simulator now looks for two additional acharacters, finds them and returns a match for the regex which is aabcaa.

And that is the match we were looking for!