Regular expression matching details

This section contains additional technical details of regular expression matching.

Repetition and match length

The repetition expressions *, +, ? and {} match by default a locally maximal match. Each repeated expression matches the longest possible substring in isolation, but the whole expression does not always match the longest possible match. The regular expression is matched sequentially from left (start) to right, each subexpression matching a locally maximal substring so that the rest of the regular expression matches as well. The first successful found match is always reported. The alternation operator | is evaluated in a similar fashion: if the whole expression matches after trying the first alternative, the second alternative is not tried at all.

For example, the regular expression a?(b|abc) matches ab, not abc, when matched against the string abc.

Minimizing repetition

If a repetition expression is followed by ?, it matches a locally minimal match, i.e. the smallest number of repetitions so that the rest of the regular expression matches.

For example, the regular expression .*?b+ matches the string ab when matched against the string ababb.

Grouping and back references

Each parenthesized subexpression forms a grouped regular expression, or simply a group. The substrings matched by groups can be queried using the match object methods and they can be referred to later in the regular expression using back references of the form \1, \2, etc. Only the first 9 groups can be matched as back references.

A back reference matches the string that was last matched by the group. If a group is repeated, each repetition is considered a separate match. If the group has not matched anything, a back reference fails to match anything.

For example, the regular expression (a|b)+\1 matches any number of a's and b's followed by aa or bb.

Precedence

The alternation operator | has the lowest precedence, followed by the concatenation operator. Repetition has the highest precedence. Use grouping to override the operator precedence.

For example, a|bc* is interpreted as a|(b(c*)) (except for grouping behavior).

Character class syntax

The right square bracket (]) and dash (-) have special meanings within character classes ([...] or [^...]). Therefore they usually have to be escaped using a backslash to have them interpreted literally. As an exception, the right square bracket can be included without escaping as the first character of a character class, and the dash does not need to be escaped as the first character of a character class, as the first character after a character range and before the right square bracket.

The following regular expressions have examples of these special cases:

[]] match right square bracket
[0-] match 0 or -
[-] match dash
[a-c-e] match a, b, c, e or dash

Technical limitations

The maximum regular expression length is limited to approximately 16000 characters. The regular expression matching engine uses backtracking to handle repetition and alternation, so that matching requires exponential time in the worst case. In general, the more complex an expression is, the longer time the matching will take. This is different from regular expression matching based on a DFA (Deterministic Finite Automata).