JRegexPlus

Regular Expression engine - Backreferences without backtracking

Download .zip Download .tar.gz View on GitHub

This is a proof of concept Regular Expression engine that handles sub-match grouping and back-references without backtracking. See here for a User Guide, here for a test suite that can be used to test the engine and here for a short explanation of how it achieves this. If there are any questions, I can be contacted at karthik_jayaraman at msn dot com.

What it does

This is a full-fledged regular expression engine written in Java. It is unique in that it is a one-pass NFA engine without backtracking that can handle back-references, reluctant quantifiers, provide sub-match groupings and process lookaround operators with arbitrary embedded regular expressions(including variable-length operators like + and *). The ability to handle back-references without backtracking or multiple passes is unique among publicly available regular expression engines - including popular ones like Perl, PCRE or java.util.regex and goes against the popular belief that only backtracking implementations can handle backreferences.

What it includes

It handles most constructs supported by the java.util.regex engine as described below except for those outlined here.

Additional features

  • Backreferences are processed without backtracking or multiple passes.

  • Lookbehind operator can handle arbitrary regular expressions including + and * operators but not backreferences. Lookaround does use a second pass over the input at the point where the lookaround begins.

Advantages and disadvantages

  • The primary advantage of a non-backtracking implementation is the performance improvement in cases that would cause a backtracking engine to hang. Compared to a DFA, this implementation also supports submatch grouping, backreferences and lookaround with arbitrary regular expressions.

  • The general problem of matching regexes that include backreferences is NP-hard so it is still possible to construct combinations of regular expressions and text that will cause exponential blowup. However, this engine's technique has the advantage of other Thompson NFA implementations in that expressions without backreferences will not create exponential blowup. As a result, the number of corner cases with backreferences that blowup the engine are also fewer.

  • The results honour POSIX-style left-most longest matching. Given the expression (nfa|nfa karthik) on a string "nfa karthik", a traditional nfa will match just nfa. JRegexPlus will match "nfa karthik".

  • For simple regexes, the processing speed is competitive though not necessarily the fastest. However, this has more to do with the fact that this is meant as a proof of concept demo to show that backreferences can work without backtracking and not as a production regex engine. The core algorithm can be expected to perform comparably to a backtracking implementation in normal cases while continuing to respond fast in cases that would make a backtracker hang.

Summary of JRegexPlus regular-expression constructs

Construct Matches
 
Characters
x The character x
\\ The backslash character
\0mnn The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh The character with hexadecimal value 0xhh
\t The tab character ('\u0009')
\n The newline (line feed) character ('\u000A')
\r The carriage-return character ('\u000D')
\f The form-feed character ('\u000C')
 
Character classes
[abc] a, b, or c (simple class)
[^abc] Any character except a, b, or c (negation)
[a-zA-Z] a through z or A through Z, inclusive (range)
[a-d[m-p]] a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]] d, e, or f (intersection)
[a-z&&[^bc]] a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]] a through z, and not m through p: [a-lq-z](subtraction)
 
Predefined character classes
. Any character (may or may not match line terminators)
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]
 
POSIX character classes (US-ASCII only)
\p{Lower} A lower-case alphabetic character: [a-z]
\p{Upper} An upper-case alphabetic character:[A-Z]
\p{ASCII} All ASCII:[\x00-\x7F]
\p{Alpha} An alphabetic character:[\p{Lower}\p{Upper}]
\p{Digit} A decimal digit: [0-9]
\p{Alnum} An alphanumeric character:[\p{Alpha}\p{Digit}]
\p{Punct} Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph} A visible character: [\p{Alnum}\p{Punct}]
\p{Print} A printable character: [\p{Graph}\x20]
\p{Blank} A space or a tab: [ \t]
\p{Cntrl} A control character: [\x00-\x1F\x7F]
\p{XDigit} A hexadecimal digit: [0-9a-fA-F]
\p{Space} A whitespace character: [ \t\n\x0B\f\r]
\P{<classname>} Negated POSIX class <classname>
 
Boundary matchers
^ The beginning of a line
$ The end of a line
\b A word boundary
\B A non-word boundary
\A The beginning of the input
\Z The end of the input but for the final terminator, if any
\z The end of the input
 
Greedy quantifiers
X? X, once or not at all
X* X, zero or more times
X+ X, one or more times
X{n} X, exactly n times
X{n,} X, at least n times
X{n,m} X, at least n but not more than m times
 
Reluctant quantifiers
X?? X, once or not at all
X*? X, zero or more times
X+? X, one or more times
X{n}? X, exactly n times
X{n,}? X, at least n times
X{n,m}? X, at least n but not more than m times
 
Logical operators
XY X followed by Y
X|Y Either X or Y
(X) X, as a capturing group
 
Back references
\n Whatever the nth capturing group matched (0 < n < 10)
\k<name> Whatever the named-capturing group "name" matched
 
Special constructs
(?<name>X) X, as a named-capturing group
(?=X) X, via zero-width positive lookahead
(?!X) X, via zero-width negative lookahead
(?<=X) X, via zero-width positive lookbehind
(?<!X) X, via zero-width negative lookbehind

What's not included

Atomic grouping and possessive quantifiers are not supported but are not necessary since they are primarily a way to work around issues created by backtracking.

Non-capturing groups are not supported since they do not provide a significant performance advantage with this engine's method of operation.

The boundary matcher \G is not supported in regexes but the Matcher class searches starting at the end of the previous match by default so the same functionality is available.

UNICODE characters and UNICODE character classes are not yet supported.

Quotations - \Q and \E are not yet supported.

Java.util.regex supports various flags - only case-sensitive, comment and the newline related flags are relevant in the absence of UNICODE support but are not yet supported by this engine. The default mode is CASE_INSENSITIVE off, UNIX_LINES and DOTALL modes on