re: Regular expressions

This module provides operations for matching and manipulating strings using regular expressions. Regular expressions may be represented as strings or instances of the RegExp class.

Examples:

if Match("foo|bar", "bar   ") != nil -- Succeeds
  WriteLn("Match!")
end

var r = RegExp("a*b", IgnoreCase)
var m = Search(r, "... AaaB ...")
m.group(0)                           -- "AaaB"

Note: Case insensitive matching requires a RegExp instance constructed with the IgnoreCase flag.

Functions

Match(regexp as Str, str as Str[, pos as Int]) as MatchResult
Match(regexp as RegExp, str as Str[, pos as Int]) as MatchResult
Test if a regular expression matches at the start of the string. If the regular expression matches a (potentially empty) prefix of the string, return a match object describing the match. Otherwise, return nil. If the pos argument is provided, start matching at the specified string index instead of at the start of the string.
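The following minimal sketch illustrates the anchoring behavior of Match and the effect of the pos argument. It assumes a program entry point named Main and uses WriteLn as in the examples above; the sample strings are made up for illustration.

```alore
import re

def Main()
  -- Match only succeeds if the match begins exactly at the start index.
  if Match("b+", "abba") == nil
    WriteLn("no match at index 0")
  end
  -- With pos = 1, matching starts at the first "b".
  var m = Match("b+", "abba", 1)
  WriteLn(m.group(0))              -- "bb"
end
```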
Search(regexp as Str, str as Str[, start as Int]) as MatchResult
Search(regexp as RegExp, str as Str[, start as Int]) as MatchResult
Search a string for a match of a regular expression. Return a match object describing the leftmost match or nil if no match could be found. If the start parameter is provided, start the matching at the specified string index instead of the string start.
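As a hedged illustration, the sketch below finds the leftmost match and then resumes the search from the stop index of that match by passing it as the start argument. The sample string and the Main entry point are assumptions for the example.

```alore
import re

def Main()
  var s = "id=17, id=42"
  var m = Search("[0-9]+", s)
  WriteLn(m.group(0))              -- leftmost match: "17"
  -- Continue searching from the end of the previous match.
  var m2 = Search("[0-9]+", s, m.stop(0))
  WriteLn(m2.group(0))             -- next match: "42"
end
```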
Subst(str as Str, regexp as Str, new) as Str
Subst(str as Str, regexp as RegExp, new) as Str
Substitute all non-overlapping occurrences of a regular expression in a string with replacement values described by the new parameter.

If new is a string, it acts as a template for the replacement string. \0 in the new string is replaced with the string matched by the regular expression and \n, where n is a positive integer, is replaced with the string matched by the group n of the regular expression. A backslash not followed by a digit is replaced with the character following the backslash. Finally, the rest of the string is included literally in the replacement string. Example:

Subst("foo fox", "fo+", "<\0>")   -- Result: "<foo> <fo>x"
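Group references in the template can also reorder parts of the match. The sketch below is an illustration only (the date-like sample string is made up); it swaps two groups using \1 and \2.

```alore
import re

def Main()
  -- \1 refers to the first group, \2 to the second.
  var s = Subst("2024-05", "([0-9]+)-([0-9]+)", "\2/\1")
  WriteLn(s)                       -- "05/2024"
end
```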

The new parameter can also be a callable object (of type def (MatchResult) as Str). In this case, Subst calls the object once for each occurrence of the regular expression in the string, passing the corresponding match object as the argument. The object should return the replacement string. Example:

Subst("cat sits on a table", "cat|table", def (m)
                                            return m.group(0).upper()
                                          end)
  -- Result: "CAT sits on a TABLE"
Split(str as Str, regexp as Str) as Array<Str>
Split(str as Str, regexp as RegExp) as Array<Str>
Split the string into fields at each non-overlapping occurrence of the regular expression. Return an array containing the fields. Example:
Split("cat;  dog;horse", "; *")   -- Result: ["cat", "dog", "horse"]
FindAll(regexp as Str, str as Str) as Array<MatchResult>
FindAll(regexp as RegExp, str as Str) as Array<MatchResult>
Find all non-overlapping matches of a regular expression in a string. Return an array of MatchResult objects that represent the matches. The string is scanned from start to end. Empty matches are also included, unless they coincide with the start of another match.
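A minimal sketch of iterating over the result of FindAll, assuming a Main entry point and a made-up input string:

```alore
import re

def Main()
  -- Each element of the returned array is a MatchResult object.
  for m in FindAll("[a-z]+", "one, two; three")
    WriteLn(m.group(0))            -- "one", "two", "three" in turn
  end
end
```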

Class RegExp

class RegExp(regexp as Str, ... as Constant)
Construct a regular expression object. The first parameter must be a regular expression string. Optionally, the IgnoreCase constant can be given as an additional parameter to enable case insensitive matching. RegExp instances can be used instead of Str objects whenever regular expressions are expected.

Constructing a RegExp instance or passing a Str object as a regular expression may be costly, since the regular expression must be compiled into an internal form before matching. Using a single RegExp instance for multiple operations is often faster, since this allows reusing the compiled form.
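The reuse pattern described above might look like the following sketch. The variable name word and the sample inputs are hypothetical; the point is only that the RegExp instance is constructed once, outside the loop.

```alore
import re

def Main()
  -- Compile once and reuse the same instance for every test.
  var word = RegExp("[a-z]+$", IgnoreCase)
  for s in ["Alore", "x86", "regexp"]
    if Match(word, s) != nil
      WriteLn(s)                   -- "Alore" and "regexp"
    end
  end
end
```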

Constants

IgnoreCase as Constant
Flag for case insensitive matching, passed to RegExp.

MatchResult objects

Match result objects are instances of the MatchResult class. Do not call MatchResult directly to construct match result objects; they are returned by the functions of this module. Match result objects have these methods:

group(n as Int) as Str
Return the substring matched by a specific group. The group 0 is the substring matched by the entire regular expression. Return nil if the group exists, but it is within a part of the regular expression that was not matched.
start(n as Int) as Int
Return the start index of the substring matched by a specific group. The group 0 refers to the substring matched by the entire regular expression. Return -1 if the group exists in the regular expression, but it is within a part of the regular expression that was not matched. If a group matched an empty string, the start index is equal to the stop index.
stop(n as Int) as Int
Return the stop index of the substring matched by a specific group. The group 0 refers to the substring matched by the entire regular expression. Return -1 if the group exists in the regular expression, but it is within a part of the regular expression that was not matched. If a group matched an empty string, the start index is equal to the stop index.
span(n as Int) as Pair<Int, Int>
Deprecated (this feature will be removed in a future Alore version). Return a Pair object representing the (non-negative) start and stop indices of the substring matched by a specific group. The group 0 refers to the substring matched by the entire regular expression. Return nil if the group exists in the regular expression, but it is within a part of the regular expression that was not matched. If a group matched an empty string, the span start index is equal to the stop index.
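The sketch below illustrates the group, start and stop methods, including the results for a group that did not take part in the match. It assumes 0-based string indices and a Main entry point; the pattern and input are made up for illustration.

```alore
import re

def Main()
  -- Only the second alternative matches, so group 1 is left unmatched.
  var m = Search("(a+)|(b+)", "xxbbx")
  WriteLn(m.group(0))              -- "bb"
  WriteLn(Str(m.start(0)))         -- 2
  WriteLn(Str(m.stop(0)))          -- 4
  if m.group(1) == nil
    WriteLn("group 1 was not matched")
  end
  WriteLn(Str(m.start(1)))         -- -1
end
```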

Exceptions

class RegExpError
Raised when one of the operations in this module is passed an invalid regular expression string. Inherits from std::ValueError.
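A hedged sketch of handling this exception follows. It assumes that an unbalanced parenthesis is rejected as an invalid regular expression, which the text above does not state explicitly.

```alore
import re

def Main()
  try
    -- Assumption: "(ab" is invalid because the group is never closed.
    Match("(ab", "abc")
  except e is RegExpError
    WriteLn("Invalid regular expression")
  end
end
```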

Regular expression syntax overview

Any character that does not have any other significance is a regular expression that matches itself, i.e. "x" matches the letter "x" and so on. Additionally, regular expressions can be constructed by following the rules below (a and b may refer to any regular expression):

. Match any single character.
^ Anchor match at the beginning of a string.
$ Anchor match at the end of a string.
ab Match a followed by b.
a* Match a repeated zero or more times.
a+ Match a repeated one or more times.
a? Optionally match a.
a|b Match a or b.
a{n} Match a exactly n times.
a{n,} Match a at least n times.
a{n,m} Match a at least n and at most m times.
(a) Match a. Use parentheses to group expressions. Each regular expression within parentheses is a group. Groups within a regular expression are numbered so that the leftmost parenthesized expression (the one where the index of the "(" character is smallest) is the group 1, the next is the group 2, etc.
[...] Match any character inside the brackets. If the brackets contain character ranges of the form x-y, each such range matches any character in the range from x to y. Backslash sequences have the same behavior as described below, unless otherwise noted.
[^...] Match the inverse of the corresponding [...] expression.

Finally, combining backslash ("\") and another character or characters forms special regular expressions. If the backslash sequence is not special, a backslash followed by any character matches the following character. These are the special backslash sequences:

\n Match the string matched by the parenthesized group n (back reference) or the character with the octal code n, where n is an integer.
\xnn Match the character with the hexadecimal code nn. nn must be an integer in hexadecimal.
\< Match the empty string at the beginning of a word.
\> Match the empty string at the end of a word.
\a Match ASCII bell.
\b Match ASCII backspace.
\d Match any decimal digit [0-9].
\D Match any character except decimal digit [^0-9].
\f Match ASCII form feed.
\n Match ASCII linefeed.
\r Match ASCII carriage return.
\s Match any whitespace character.
\S Match any non-whitespace character.
\t Match ASCII horizontal tab.
\v Match ASCII vertical tab.
\w Match any alphanumeric character or the underscore ("_").
\W Match any non-alphanumeric and non-underscore character.
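As an illustration of how these constructs combine, the sketch below uses a repetition count, a group with a back reference, and the word boundary sequences. The patterns and input strings are made up for the example.

```alore
import re

def Main()
  -- {4} repeats the bracket expression exactly four times.
  var year = Search("[0-9]{4}", "released in 2011")
  WriteLn(year.group(0))           -- "2011"
  -- \1 must match the same text as group 1; \< and \> anchor at word edges.
  var dup = Search("\<([a-z]+) \1\>", "it is is fine")
  WriteLn(dup.group(0))            -- "is is"
  WriteLn(dup.group(1))            -- "is"
end
```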

See also: The section "Regular expression matching details" contains additional information about regular expression matching.