@head @module re @title re: Regular expressions

This module provides operations for matching and manipulating strings using regular expressions. Regular expressions may be represented as strings or instances of the @ref{RegExp} class.

Examples:
@example
  if Match("foo|bar", "bar ") != nil   -- Succeeds
    WriteLn("Match!")
  end

  var r = RegExp("a*b", IgnoreCase)
  var m = Search(r, "... AaaB ...")
  m.group(0)    -- "AaaB"
@end

@note Use the @ref{RegExp} class for case-insensitive matching.
@end

Functions

@fun Match(regexp as Str, str as Str[, pos as Int]) as MatchResult
@fun Match(regexp as RegExp, str as Str[, pos as Int]) as MatchResult
@desc Test if a regular expression matches at the start of the string. If the regular expression matches a (potentially empty) prefix of the string, return a match object describing the match. Otherwise, return nil. If the pos argument is provided, start the matching at the specified string index instead.
@end

@fun Search(regexp as Str, str as Str[, start as Int]) as MatchResult
@fun Search(regexp as RegExp, str as Str[, start as Int]) as MatchResult
@desc Search a string for a match of a regular expression. Return a match object describing the leftmost match or nil if no match could be found. If the start parameter is provided, start the matching at the specified string index instead of the string start.
@end

@fun Subst(str as Str, regexp as Str, new) as Str
@fun Subst(str as Str, regexp as RegExp, new) as Str
@desc Substitute all non-overlapping occurrences of a regular expression in a string with replacement values described by the new parameter.

If new is a string, it acts as a template for the replacement string. \0 in the new string is replaced with the string matched by the regular expression and \n, where n is a positive integer, is replaced with the string matched by the group n of the regular expression. A backslash not followed by a digit is replaced with the character following the backslash. Finally, the rest of the string is included literally in the replacement string. Example:
@example
  Subst("foo fox", "fo+", "<\0>")   -- Result: "<foo> <fo>x"
@end
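
As a further sketch of the template rules above, numbered group references can reorder the matched parts. This assumes that \w and parenthesized groups behave as described in the syntax overview of this module:
@example
  Subst("John Smith", "(\w+) (\w+)", "\2, \1")   -- Result: "Smith, John"
@end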

The new parameter can also be a callable object (of type def (MatchResult) as Str). In this case, the object is called with the corresponding match object as the argument for each occurrence of the regular expression in the string. The object should return the replacement string when called. Example:
@example
  Subst("cat sits on a table", "cat|table",
        def (m)
          return m.group(0).upper()
        end)
  -- Result: "CAT sits on a TABLE"
@end
@end

@fun Split(str as Str, regexp as Str) as Array
@fun Split(str as Str, regexp as RegExp) as Array
@desc Split the string into fields at each non-overlapping occurrence of the regular expression. Return an array containing the fields. Example:
@example
  Split("cat; dog;horse", "; *")   -- Result: ["cat", "dog", "horse"]
@end
@end

@fun FindAll(regexp as Str, str as Str) as Array
@fun FindAll(regexp as RegExp, str as Str) as Array
@desc Find all non-overlapping matches of a regular expression in a string. Return an array of @ref{MatchResult} objects that represent the matches. Scan the string from start to end. Also include empty matches, unless they coincide with the start of another match.
@end
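
The following sketch contrasts Match, Search and FindAll as described above. The input strings are made up for illustration, and the results in the comments follow from the descriptions in this section rather than from a recorded run:
@example
  Match("bar", "foo bar")              -- nil: "bar" is not a prefix of the string
  Match("bar", "foo bar", 4)           -- Match object: matching starts at index 4
  Search("bar", "foo bar").start(0)    -- 4: the leftmost match anywhere in the string

  for m in FindAll("[0-9]+", "10 apples, 25 pears")
    WriteLn(m.group(0))                -- Writes 10, then 25
  end
@end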

Class RegExp

@class RegExp(regexp as Str, ... as Constant)
@desc Construct a regular expression object. The first parameter must be a regular expression string. Optionally, the IgnoreCase constant can be given as an additional parameter to enable case-insensitive matching. RegExp instances can be used instead of Str objects whenever regular expressions are expected.

Constructing a RegExp instance or passing a Str object as a regular expression may be costly, since the regular expression must be compiled into an internal form before matching. Using a single RegExp instance for multiple operations is often faster, since this allows reusing the compiled form. @end @end-class
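
A minimal sketch of the reuse pattern described above: the expression is compiled once and the same RegExp instance is then passed to several calls. The variable names and the sample strings are illustrative only:
@example
  var word = RegExp("[a-z]+", IgnoreCase)   -- Compile once
  for s in ["Alpha", "42", "beta gamma"]
    if Search(word, s) != nil               -- Reuse the compiled form
      WriteLn(s)
    end
  end
@end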

Constants

@var IgnoreCase as Constant
@desc Flag for case-insensitive matching, passed to @ref{RegExp}.
@end

@class-hidden MatchResult

MatchResult objects

Match result objects are instances of the MatchResult class. Do not call MatchResult directly to construct match result objects. Match result objects have these methods:

@fun group(n as Int) as Str
@desc Return the substring matched by a specific group. The group 0 is the substring matched by the entire regular expression. Return nil if the group exists, but it is within a part of the regular expression that was not matched.
@end

@fun start(n as Int) as Int
@desc Return the start index of the substring matched by a specific group. The group 0 refers to the substring matched by the entire regular expression. Return -1 if the group exists in the regular expression, but it is within a part of the regular expression that was not matched. If a group matched an empty string, the start index is equal to the stop index.
@end

@fun stop(n as Int) as Int
@desc Return the stop index of the substring matched by a specific group. The group 0 refers to the substring matched by the entire regular expression. Return -1 if the group exists in the regular expression, but it is within a part of the regular expression that was not matched. If a group matched an empty string, the start index is equal to the stop index.
@end

@fun span(n as Int) as Pair
@desc @deprecated Return a @ref{Pair} object representing the (non-negative) start and stop indices of the substring matched by a specific group. The group 0 refers to the substring matched by the entire regular expression. Return nil if the group exists in the regular expression, but it is within a part of the regular expression that was not matched. If a group matched an empty string, the span start index is equal to the stop index.
@end
@end-class
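
The method descriptions above can be illustrated with a group that does not take part in the match. This is a sketch with a made-up pattern and input. The values in the comments are derived from the descriptions above:
@example
  var m = Match("(a+)(b)?c", "aac")
  m.group(0)   -- "aac": the entire match
  m.group(1)   -- "aa"
  m.group(2)   -- nil: the optional group (b) was not matched
  m.start(1)   -- 0
  m.start(2)   -- -1: the group exists but was not matched
@end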

Exceptions

@class RegExpError @desc Raised when one of the operations in this module is passed an invalid regular expression string. Inherits from @ref{std::ValueError}. @end
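
A hedged sketch of handling this exception. It assumes that an unbalanced parenthesis counts as an invalid regular expression, which the text above does not state explicitly:
@example
  try
    Match("(ab", "abc")     -- Assumed invalid: missing ")"
  except RegExpError
    WriteLn("Invalid regular expression")
  end
@end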

Regular expression syntax overview

Any character that does not have any other significance is a regular expression that matches itself, i.e. "x" matches the letter "x" and so on. Additionally, regular expressions can be constructed by following the rules below (a and b may refer to any regular expression):
. Match any single character.
^ Anchor match at the beginning of a string.
$ Anchor match at the end of a string.
ab Match a followed by b.
a* Match a repeated zero or more times.
a+ Match a repeated one or more times.
a? Optionally match a.
a|b Match a or b.
a{n} Match a exactly n times.
a{n,} Match a at least n times.
a{n,m} Match a at least n and at most m times.
(a) Match a. Use parentheses to group expressions. Each regular expression within parentheses is a group. Groups within a regular expression are numbered so that the leftmost parenthesized expression (the one where the index of the "(" character is smallest) is the group 1, the next is the group 2, etc.
[...] Match any character inside the brackets. If the brackets contain character ranges of the form x-y, each such range matches any character in the range from x to y. Backslash sequences have the same behavior as described below, unless otherwise noted.
[^...] Match the inverse of the corresponding [...] expression.
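
A few of the constructs listed above, sketched with made-up inputs. The results in the comments assume that repetition operators match as many repetitions as possible, which this overview does not spell out:
@example
  Match("colou?r", "color") != nil    -- True: "u" is optional
  Match("a{2,3}", "aaaa").group(0)    -- "aaa": at most three repetitions
  Match("(ab)+", "ababx").group(0)    -- "abab"
  Match("[a-c]+", "abcd").group(0)    -- "abc"
  Match("[^0-9]+", "ab12").group(0)   -- "ab"
@end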

Finally, combining backslash ("\") and another character or characters forms special regular expressions. If the backslash sequence is not special, a backslash followed by any character matches the following character. These are the special backslash sequences:
\n Where n is an integer, match the string matched by a parenthesized group (back reference) or an octal character code.
\xnn Match a hexadecimal character code. nn must be an integer in hexadecimal.
\< Match the empty string at the beginning of a word.
\> Match the empty string at the end of a word.
\a Match ASCII bell.
\b Match ASCII backspace.
\d Match any decimal digit [0-9].
\D Match any character except decimal digit [^0-9].
\f Match ASCII form feed.
\n Match ASCII linefeed.
\r Match ASCII carriage return.
\s Match any whitespace character.
\S Match any non-whitespace character.
\t Match ASCII horizontal tab.
\v Match ASCII vertical tab.
\w Match any alphanumeric character or the underscore ("_").
\W Match any non-alphanumeric and non-underscore character.
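
Some of the backslash sequences above in use. This is a sketch with made-up inputs. The results in the comments assume the same greedy repetition as usual and that a word is a run of \w characters:
@example
  Search("\d+", "room 42").group(0)             -- "42"
  Search("\<\w+\>", "  hello world").group(0)   -- "hello"
  Match("(\w)\1", "aab") != nil                 -- True: \1 repeats the "a" matched by group 1
@end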
@see @link re-details.html contains additional information about regular expression matching. @end