# Regular Expressions
> This chapter is a modified version of \[Sitaram05\].
A _regexp_ value encapsulates a pattern that is described by a string or
byte string. The regexp matcher tries to match this pattern against \(a
portion of\) another string or byte string, which we will call the _text
string_, when you call functions like `regexp-match`. The text string
is treated as raw text, and not as a pattern.

* 1 Writing Regexp Patterns
* 2 Matching Regexp Patterns
* 3 Basic Assertions
* 4 Characters and Character Classes
  * 4.1 Some Frequently Used Character Classes
  * 4.2 POSIX character classes
* 5 Quantifiers
* 6 Clusters
  * 6.1 Backreferences
  * 6.2 Non-capturing Clusters
  * 6.3 Cloisters
* 7 Alternation
* 8 Backtracking
* 9 Looking Ahead and Behind
  * 9.1 Lookahead
  * 9.2 Lookbehind
* 10 An Extended Example

> +\[missing\] in \[missing\] provides more on regexps.
## 1. Writing Regexp Patterns
A string or byte string can be used directly as a regexp pattern, or it
can be prefixed with `#rx` to form a literal regexp value. For example,
`#rx"abc"` is a string-based regexp value, and `#rx#"abc"` is a byte
string-based regexp value. Alternately, a string or byte string can be
prefixed with `#px`, as in `#px"abc"`, for a slightly extended syntax of
patterns within the string.
Most of the characters in a regexp pattern are meant to match
occurrences of themselves in the text string. Thus, the pattern
`#rx"abc"` matches a string that contains the characters `a`, `b`, and
`c` in succession. Other characters act as _metacharacters_, and some
character sequences act as _metasequences_. That is, they specify
something other than their literal selves. For example, in the pattern
`#rx"a.c"`, the characters `a` and `c` stand for themselves, but the
metacharacter `.` can match _any_ character. Therefore, the pattern
`#rx"a.c"` matches an `a`, any character, and `c` in succession.
> When we want a literal `\` inside a Racket string or regexp literal, we
> must escape it so that it shows up in the string at all. Racket strings
> use `\` as the escape character, so we end up with two `\`s: one
> Racket-string `\` to escape the regexp `\`, which then escapes the `.`.
> Another character that would need escaping inside a Racket string is
> `"`.

If we need to match the character `.` itself, we can escape it by
preceding it with a `\`. The character sequence `\.` is thus a
metasequence, since it doesn't match itself but rather just `.`. So, to
match `a`, `.`, and `c` in succession, we use the regexp pattern
`#rx"a\\.c"`; the double `\` is an artifact of Racket strings, not the
regexp pattern itself.
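
As a quick check of the escaping rules above, the following sketch compares the unescaped and the escaped pattern on the same inputs. The variable names are illustrative only.

```racket
#lang racket/base

;; Unescaped `.` is a metacharacter: it matches any single character,
;; including a literal dot.
(define any-vs-letter (regexp-match? #rx"a.c" "abc"))       ; #t
(define any-vs-dot (regexp-match? #rx"a.c" "a.c"))          ; #t

;; Escaped `\.` matches only a literal dot.
(define escaped-vs-letter (regexp-match? #rx"a\\.c" "abc")) ; #f
(define escaped-vs-dot (regexp-match? #rx"a\\.c" "a.c"))    ; #t

;; The Racket string "a\\.c" holds four characters: a \ . c
(define pattern-length (string-length "a\\.c"))             ; 4
```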
The `regexp` function takes a string or byte string and produces a
regexp value. Use `regexp` when you construct a pattern to be matched
against multiple strings, since a pattern is compiled to a regexp value
before it can be used in a match. The `pregexp` function is like
`regexp`, but using the extended syntax. Regexp values as literals with
`#rx` or `#px` are compiled once and for all when they are read.
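
A minimal sketch of building a pattern at run time with `regexp` and `pregexp`, then reusing the compiled value. The search term and the sample strings are made up for illustration.

```racket
#lang racket/base

;; Build the pattern string at run time, then compile it once.
(define word "needle") ; hypothetical search term
(define needle-rx (regexp (string-append word "s?")))

;; Reuse the compiled regexp value against several text strings.
(define hits
  (for/list ([text (list "hay needle stack" "two needles" "just hay")])
    (regexp-match? needle-rx text)))
;; hits is '(#t #t #f)

;; `pregexp` works the same way but accepts the extended syntax, such as \d.
(define digits-rx (pregexp "\\d+"))
(define first-number (regexp-match digits-rx "room 101"))
;; first-number is '("101")
```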
The `regexp-quote` function takes an arbitrary string and returns a
string for a pattern that matches exactly the original string. In
particular, characters in the input string that could serve as regexp
metacharacters are escaped with a backslash, so that they safely match
only themselves.
```racket
> (regexp-quote "cons")
"cons"
> (regexp-quote "list?")
"list\\?"
```
The `regexp-quote` function is useful when building a composite regexp
from a mix of regexp strings and verbatim strings.
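
For instance, a composite pattern might combine a verbatim name with a regexp fragment. In this sketch, `versioned-rx` is a hypothetical helper, not part of Racket.

```racket
#lang racket/base

;; Hypothetical helper: match `name` verbatim, followed by a space and a
;; version number such as 8.2. Only the version part is a regexp fragment.
(define (versioned-rx name)
  (regexp (string-append (regexp-quote name) " [0-9]+[.][0-9]+")))

(define cpp-rx (versioned-rx "c++"))

;; The quoted `+` characters match only themselves.
(define found (regexp-match cpp-rx "built with c++ 8.2"))
;; found is '("c++ 8.2")
(define not-found (regexp-match cpp-rx "built with ccc 8.2"))
;; not-found is #f
```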
## 2. Matching Regexp Patterns
The `regexp-match-positions` function takes a regexp pattern and a text
string, and it returns a match if the regexp matches \(some part of\)
the text string, or `#f` if the regexp did not match the string. A
successful match produces a list of _index pairs_.
Examples:
```racket
> (regexp-match-positions #rx"brain" "bird")
#f
> (regexp-match-positions #rx"needle" "hay needle stack")
'((4 . 10))
```
In the second example, the integers `4` and `10` identify the substring
that was matched. The `4` is the starting \(inclusive\) index, and `10`
the ending \(exclusive\) index of the matching substring:
```racket
> (substring "hay needle stack" 4 10)
"needle"
```
In this example, `regexp-match-positions`'s return list contains
only one index pair, and that pair represents the entire substring
matched by the regexp. When we discuss subpatterns later, we will see
how a single match operation can yield a list of submatches.
The `regexp-match-positions` function takes optional third and fourth
arguments that specify the indices of the text string within which the
matching should take place.
```racket
> (regexp-match-positions
   #rx"needle"
   "his needle stack -- my needle stack -- her needle stack"
   20 39)
'((23 . 29))
```
Note that the returned indices are still reckoned relative to the full
text string.
The `regexp-match` function is like `regexp-match-positions`, but
instead of returning index pairs, it returns the matching substrings:
```racket
> (regexp-match #rx"brain" "bird")
#f
> (regexp-match #rx"needle" "hay needle stack")
'("needle")
```
When `regexp-match` is used with a byte-string regexp, the result is a
matching byte substring:
```racket
> (regexp-match #rx#"needle" #"hay needle stack")
'(#"needle")
```
> A byte-string regexp can be applied to a string, and a string regexp can
> be applied to a byte string. In both cases, the result is a byte string.
> Internally, all regexp matching is in terms of bytes, and a string
> regexp is expanded to a regexp that matches UTF-8 encodings of
> characters. For maximum efficiency, use byte-string matching instead of
> string, since matching bytes directly avoids UTF-8 encodings.
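
The mixing rule in the note above can be checked with a short sketch (variable names are illustrative):

```racket
#lang racket/base

;; String regexp on a string: the result holds strings.
(define str-on-str (regexp-match #rx"needle" "hay needle stack"))
;; str-on-str is '("needle")

;; Byte-string regexp on a string, or string regexp on a byte string:
;; either way the result holds byte strings.
(define bytes-on-str (regexp-match #rx#"needle" "hay needle stack"))
(define str-on-bytes (regexp-match #rx"needle" #"hay needle stack"))
;; both are '(#"needle")
```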

If you have data that is in a port, there's no need to first read it
into a string. Functions like `regexp-match` can match on the port
directly:
```racket
> (define-values (i o) (make-pipe))
> (write "hay needle stack" o)
> (close-output-port o)
> (regexp-match #rx#"needle" i)
'(#"needle")
```
The `regexp-match?` function is like `regexp-match-positions`, but
simply returns a boolean indicating whether the match succeeded:
```racket
> (regexp-match? #rx"brain" "bird")
#f
> (regexp-match? #rx"needle" "hay needle stack")
#t
```
The `regexp-split` function takes two arguments, a regexp pattern and a
text string, and it returns a list of substrings of the text string; the
pattern identifies the delimiter separating the substrings.
```racket
> (regexp-split #rx":" "/bin:/usr/bin:/usr/bin/X11:/usr/local/bin")
'("/bin" "/usr/bin" "/usr/bin/X11" "/usr/local/bin")
> (regexp-split #rx" " "pea soup")
'("pea" "soup")
```
If the first argument matches empty strings, then the list of all the
single-character substrings is returned.
```racket
> (regexp-split #rx"" "smithereens")
'("" "s" "m" "i" "t" "h" "e" "r" "e" "e" "n" "s" "")
```
Thus, to identify one-or-more spaces as the delimiter, take care to use
the regexp `#rx" +"`, not `#rx" *"`.
```racket
> (regexp-split #rx" +" "split pea soup")
'("split" "pea" "soup")
> (regexp-split #rx" *" "split pea soup")
'("" "s" "p" "l" "i" "t" "" "p" "e" "a" "" "s" "o" "u" "p" "")
```
The `regexp-replace` function replaces the matched portion of the text
string by another string. The first argument is the pattern, the second
the text string, and the third is either the string to be inserted or a
procedure to convert matches to the insert string.
```racket
> (regexp-replace #rx"te" "liberte" "ty")
"liberty"
> (regexp-replace #rx"." "racket" string-upcase)
"Racket"
```
If the pattern doesn't occur in the text string, the returned string is
identical to the text string.
The `regexp-replace*` function replaces _all_ matches in the text string
by the insert string:
```racket
> (regexp-replace* #rx"te" "liberte egalite fraternite" "ty")
"liberty egality fratyrnity"
> (regexp-replace* #rx"[ds]" "drracket" string-upcase)
"Drracket"
```
## 3. Basic Assertions
The _assertions_ `^` and `$` identify the beginning and the end of the
text string, respectively. They ensure that their adjoining regexps
match at one or other end of the text string:
```racket
> (regexp-match-positions #rx"^contact" "first contact")
#f
```
The regexp above fails to match because `contact` does not occur at the
beginning of the text string. In
```racket
> (regexp-match-positions #rx"laugh$" "laugh laugh laugh laugh")
'((18 . 23))
```
the regexp matches the _last_ `laugh`.
The metasequence `\b` asserts that a word boundary exists, but this
metasequence works only with `#px` syntax. In
```racket
> (regexp-match-positions #px"yack\\b" "yackety yack")
'((8 . 12))
```
the `yack` in `yackety` doesn't end at a word boundary so it isn't
matched. The second `yack` does and is.
The metasequence `\B` \(also `#px` only\) has the opposite effect to
`\b`; it asserts that a word boundary does not exist. In
```racket
> (regexp-match-positions #px"an\\B" "an analysis")
'((3 . 5))
```
the `an` that doesn't end in a word boundary is matched.
## 4. Characters and Character Classes
Typically, a character in the regexp matches the same character in the
text string. Sometimes it is necessary or convenient to use a regexp
metasequence to refer to a single character. For example, the
metasequence `\.` matches the period character.
The metacharacter `.` matches _any_ character \(other than newline in
multi-line mode; see Cloisters\):
```racket
> (regexp-match #rx"p.t" "pet")
'("pet")
```
The above pattern also matches `pat`, `pit`, `pot`, `put`, and `p8t`,
but not `peat` or `pfffft`.
A _character class_ matches any one character from a set of characters.
A typical format for this is the _bracketed character class_ `[`...`]`,
which matches any one character from the non-empty sequence of
characters enclosed within the brackets. Thus, `#rx"p[aeiou]t"` matches
`pat`, `pet`, `pit`, `pot`, `put`, and nothing else.
Inside the brackets, a `-` between two characters specifies the Unicode
range between the characters. For example, `#rx"ta[b-dgn-p]"` matches
`tab`, `tac`, `tad`, `tag`, `tan`, `tao`, and `tap`.
An initial `^` after the left bracket inverts the set specified by the
rest of the contents; i.e., it specifies the set of characters _other
than_ those identified in the brackets. For example, `#rx"do[^g]"`
matches all three-character sequences starting with `do` except `dog`.
Note that the metacharacter `^` inside brackets means something quite
different from what it means outside. Most other metacharacters \(`.`,
`*`, `+`, `?`, etc.\) cease to be metacharacters when inside brackets,
although you may still escape them for peace of mind. A `-` is a
metacharacter only when it's inside brackets, and when it is neither the
first nor the last character between the brackets.
Bracketed character classes cannot contain other bracketed character
classes \(although they can contain certain other types of character
classes; see below\). Thus, a `[` inside a bracketed character class
doesn't have to be a metacharacter; it can stand for itself. For
example, `#rx"[a[b]"` matches `a`, `[`, and `b`.
Furthermore, since empty bracketed character classes are disallowed, a
`]` immediately occurring after the opening left bracket also doesn't
need to be a metacharacter. For example, `#rx"[]ab]"` matches `]`, `a`,
and `b`.
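
The bracket rules above can be tried out directly. The following sketch uses made-up text strings; the expected results are noted in the comments.

```racket
#lang racket/base

;; `^` right after `[` inverts the class: `do` followed by anything but g.
(define not-dog (regexp-match #rx"do[^g]" "dog dot"))      ; '("dot")

;; A `-` that is first or last between the brackets is literal.
(define literal-dash (regexp-match #rx"[a-]+" "b-a-d"))    ; '("-a-")

;; `[` inside brackets stands for itself.
(define open-bracket (regexp-match #rx"[a[b]+" "x[ab]"))   ; '("[ab")

;; `]` immediately after the opening bracket stands for itself.
(define close-bracket (regexp-match #rx"[]ab]+" "x[ab]"))  ; '("ab]")
```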
### 4.1. Some Frequently Used Character Classes
In `#px` syntax, some standard character classes can be conveniently
represented as metasequences instead of as explicit bracketed
expressions: `\d` matches a digit \(the same as `[0-9]`\); `\s` matches
an ASCII whitespace character; and `\w` matches a character that could
be part of a “word”.
> Following regexp custom, we identify “word” characters as
> `[A-Za-z0-9_]`, although these are too restrictive for what a Racketeer
> might consider a “word.”
The upper-case versions of these metasequences stand for the inversions
of the corresponding character classes: `\D` matches a non-digit, `\S` a
non-whitespace character, and `\W` a non-“word” character.
Remember to include a double backslash when putting these metasequences
in a Racket string:
```racket
> (regexp-match #px"\\d\\d"
                "0 dear, 1 have 2 read catch 22 before 9")
'("22")
```
These character classes can be used inside a bracketed expression. For
example, `#px"[a-z\\d]"` matches a lower-case letter or a digit.
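
A short sketch of the inverted classes and of `\d` used inside brackets, with illustrative text strings:

```racket
#lang racket/base

;; \D, \S, and \W invert \d, \s, and \w.
(define non-digits (regexp-match #px"\\D+" "2024-05"))          ; '("-")
(define non-space (regexp-match #px"\\S+" "  hi there"))        ; '("hi")
(define non-word (regexp-match #px"\\W+" "foo_bar, baz"))       ; '(", ")

;; \d inside a bracketed expression: lower-case letters or digits.
(define lower-or-digit (regexp-match #px"[a-z\\d]+" "XY-z9k!")) ; '("z9k")
```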
### 4.2. POSIX character classes
A _POSIX character class_ is a special metasequence of the form
`[:`...`:]` that can be used only inside a bracketed expression in `#px`
syntax. The POSIX classes supported are
* `[:alnum:]` — ASCII letters and digits
* `[:alpha:]` — ASCII letters
* `[:ascii:]` — ASCII characters
* `[:blank:]` — ASCII widthful whitespace: space and tab
* `[:cntrl:]` — “control” characters: ASCII 0 to 31
* `[:digit:]` — ASCII digits, same as `\d`
* `[:graph:]` — ASCII characters that use ink
* `[:lower:]` — ASCII lower-case letters
* `[:print:]` — ASCII ink-users plus widthful whitespace
* `[:space:]` — ASCII whitespace, same as `\s`
* `[:upper:]` — ASCII upper-case letters
* `[:word:]` — ASCII letters, digits, and `_`, same as `\w`
* `[:xdigit:]` — ASCII hex digits
For example, `#px"[[:alpha:]_]"` matches a letter or underscore.
```racket
> (regexp-match #px"[[:alpha:]_]" "--x--")
'("x")
> (regexp-match #px"[[:alpha:]_]" "--_--")
'("_")
> (regexp-match #px"[[:alpha:]_]" "--:--")
#f
```
The POSIX class notation is valid _only_ inside a bracketed expression.
For instance, `[:alpha:]`, when not inside a bracketed expression, will
not be read as the letter class. Rather, it is \(from previous
principles\) the character class containing the characters `:`, `a`,
`l`, `p`, `h`.
```racket
> (regexp-match #px"[:alpha:]" "--a--")
'("a")
> (regexp-match #px"[:alpha:]" "--x--")
#f
```
## 5. Quantifiers
The _quantifiers_ `*`, `+`, and `?` match respectively: zero or more,
one or more, and zero or one instances of the preceding subpattern.
```racket
> (regexp-match-positions #rx"c[ad]*r" "cadaddadddr")
'((0 . 11))
> (regexp-match-positions #rx"c[ad]*r" "cr")
'((0 . 2))
> (regexp-match-positions #rx"c[ad]+r" "cadaddadddr")
'((0 . 11))
> (regexp-match-positions #rx"c[ad]+r" "cr")
#f
> (regexp-match-positions #rx"c[ad]?r" "cadaddadddr")
#f
> (regexp-match-positions #rx"c[ad]?r" "cr")
'((0 . 2))
> (regexp-match-positions #rx"c[ad]?r" "car")
'((0 . 3))
```
In `#px` syntax, you can use braces to specify much finer-tuned
quantification than is possible with `*`, `+`, `?`:
* The quantifier `{`_m_`}` matches _exactly_ _m_ instances of the
  preceding subpattern; _m_ must be a nonnegative integer.
* The quantifier `{`_m_`,`_n_`}` matches at least _m_ and at most _n_
  instances. _m_ and _n_ are nonnegative integers with _m_ less than or
  equal to _n_. You may omit either or both numbers, in which case _m_
  defaults to 0 and _n_ to infinity.
It is evident that `+` and `?` are abbreviations for `{1,}` and `{0,1}`
respectively, and `*` abbreviates `{,}`, which is the same as `{0,}`.
```racket
> (regexp-match #px"[aeiou]{3}" "vacuous")
'("uou")
> (regexp-match #px"[aeiou]{3}" "evolve")
#f
> (regexp-match #px"[aeiou]{2,3}" "evolve")
#f
> (regexp-match #px"[aeiou]{2,3}" "zeugma")
'("eu")
```
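
The stated equivalences between `+`, `?`, `*` and their brace forms can be spot-checked as follows. This is a sketch on a few sample strings, not a proof.

```racket
#lang racket/base

(define text "cadaddadddr")

;; `+` versus `{1,}`
(define plus-pos (regexp-match-positions #px"c[ad]+r" text))
(define plus-brace-pos (regexp-match-positions #px"c[ad]{1,}r" text))
;; both are '((0 . 11))

;; `?` versus `{0,1}`
(define opt-pos (regexp-match-positions #px"c[ad]?r" "car"))
(define opt-brace-pos (regexp-match-positions #px"c[ad]{0,1}r" "car"))
;; both are '((0 . 3))

;; `*` versus `{0,}`
(define star-pos (regexp-match-positions #px"c[ad]*r" "cr"))
(define star-brace-pos (regexp-match-positions #px"c[ad]{0,}r" "cr"))
;; both are '((0 . 2))
```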
The quantifiers described so far are all _greedy_: they match the
maximal number of instances that would still lead to an overall match
for the full pattern.
```racket
> (regexp-match #rx"<.*>" "<tag1> <tag2> <tag3>")
'("<tag1> <tag2> <tag3>")
```
To make these quantifiers _non-greedy_, append a `?` to them.
Non-greedy quantifiers match the minimal number of instances needed to
ensure an overall match.
```racket
> (regexp-match #rx"<.*?>" "<tag1> <tag2> <tag3>")
'("<tag1>")
```
The non-greedy quantifiers are `*?`, `+?`, `??`, `{`_m_`}?`, and
`{`_m_`,`_n_`}?`, although `{`_m_`}?` is always the same as `{`_m_`}`.
Note that the metacharacter `?` has two different uses, and both uses
are represented in `??`.
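
To see the other non-greedy forms at work, the following sketch pairs each greedy quantifier with its non-greedy counterpart on small, made-up inputs:

```racket
#lang racket/base

;; `+` versus `+?`
(define greedy-plus (regexp-match #rx"a+" "aaa"))        ; '("aaa")
(define lazy-plus (regexp-match #rx"a+?" "aaa"))         ; '("a")

;; `?` versus `??`: the second `?` makes the optional `b` non-greedy.
(define greedy-opt (regexp-match #rx"ab?" "ab"))         ; '("ab")
(define lazy-opt (regexp-match #rx"ab??" "ab"))          ; '("a")

;; `{m,n}` versus `{m,n}?` (braces need #px syntax).
(define greedy-range (regexp-match #px"a{2,3}" "aaaa"))  ; '("aaa")
(define lazy-range (regexp-match #px"a{2,3}?" "aaaa"))   ; '("aa")
```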
## 6. Clusters
_Clustering_—enclosure within parens `(`...`)`—identifies the enclosed
_subpattern_ as a single entity. It causes the matcher to capture the
_submatch_, or the portion of the string matching the subpattern, in
addition to the overall match:
```racket
> (regexp-match #rx"([a-z]+) ([0-9]+), ([0-9]+)" "jan 1, 1970")
'("jan 1, 1970" "jan" "1" "1970")
```
Clustering also causes a following quantifier to treat the entire
enclosed subpattern as an entity:
```racket
> (regexp-match #rx"(pu )*" "pu pu platter")
'("pu pu " "pu ")
```
The number of submatches returned is always equal to the number of
subpatterns specified in the regexp, even if a particular subpattern
happens to match more than one substring or no substring at all.
```racket
> (regexp-match #rx"([a-z ]+;)*" "lather; rinse; repeat;")
'("lather; rinse; repeat;" " repeat;")
```
Here, the `*`-quantified subpattern matches three times, but it is the
last submatch that is returned.
It is also possible for a quantified subpattern to fail to match, even
if the overall pattern matches. In such cases, the failing submatch is
represented by `#f`:
```racket
> (define date-re
    ; match 'month year' or 'month day, year';
    ; subpattern matches day, if present
    #rx"([a-z]+) +([0-9]+,)? *([0-9]+)")
> (regexp-match date-re "jan 1, 1970")
'("jan 1, 1970" "jan" "1," "1970")
> (regexp-match date-re "jan 1970")
'("jan 1970" "jan" #f "1970")
```
### 6.1. Backreferences
Submatches can be used in the insert string argument of the procedures
`regexp-replace` and `regexp-replace*`. The insert string can use
`\`_n_ as a _backreference_ to refer back to the _n_th submatch, which
is the substring that matched the _n_th subpattern. A `\0` refers to
the entire match, and it can also be specified as `&`; to insert a
literal `&`, write `\&`.
```racket
> (regexp-replace #rx"_(.+?)_"
                  "the _nina_, the _pinta_, and the _santa maria_"
                  "*\\1*")
"the *nina*, the _pinta_, and the _santa maria_"
> (regexp-replace* #rx"_(.+?)_"
                   "the _nina_, the _pinta_, and the _santa maria_"
                   "*\\1*")
"the *nina*, the *pinta*, and the *santa maria*"
> (regexp-replace #px"(\\S+) (\\S+) (\\S+)"
                  "eat to live"
                  "\\3 \\2 \\1")
"live to eat"
"live to eat"
```
Use `\\` in the insert string to specify a literal backslash. Also, `\$`
stands for an empty string, and is useful for separating a backreference
`\`_n_ from an immediately following number.
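
A small sketch of these insert-string escapes, using made-up inputs. Each backslash that the regexp replacer should see is doubled inside the Racket string.

```racket
#lang racket/base

;; \0 inserts the entire match.
(define bracketed (regexp-replace #rx"[0-9]+" "page 42" "[\\0]"))
;; bracketed is "page [42]"

;; \$ is empty; here it separates the backreference \1 from a literal 0.
(define suffixed (regexp-replace #rx"(a+)" "baaad" "\\1\\$0"))
;; suffixed is "baaa0d"

;; \\ in the insert string (written "\\\\" in Racket) yields one backslash.
(define backslashed (regexp-replace #rx"/" "usr/bin" "\\\\"))
;; backslashed is the seven characters usr\bin
```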
Backreferences can also be used within a `#px` pattern to refer back to
an already matched subpattern in the pattern. `\`_n_ stands for an exact
repeat of the _n_th submatch. Note that `\0`, which is useful in an
insert string, makes no sense within the regexp pattern, because the
entire regexp has not matched yet so you cannot refer back to it.
```racket
> (regexp-match #px"([a-z]+) and \\1"
                "billions and billions")
'("billions and billions" "billions")
```
Note that the backreference is not simply a repeat of the previous
subpattern. Rather it is a repeat of the particular substring already
matched by the subpattern.
In the above example, the backreference can only match `billions`. It
will not match `millions`, even though the subpattern it harks back
to—`([a-z]+)`—would have had no problem doing so:
```racket
> (regexp-match #px"([a-z]+) and \\1"
                "billions and millions")
#f
```
The following example marks all immediately repeating patterns in a
number string:
```racket
> (regexp-replace* #px"(\\d+)\\1"
                   "123340983242432420980980234"
                   "{\\1,\\1}")
"12{3,3}40983{24,24}3242{098,098}0234"
```
The following example corrects doubled words:
```racket
> (regexp-replace* #px"\\b(\\S+) \\1\\b"
                   (string-append "now is the the time for all good men to "
                                  "to come to the aid of of the party")
                   "\\1")
"now is the time for all good men to come to the aid of the party"
```
### 6.2. Non-capturing Clusters
It is often required to specify a cluster \(typically for
quantification\) but without triggering the capture of submatch
information. Such clusters are called _non-capturing_. To create a
non-capturing cluster, use `(?:` instead of `(` as the cluster opener.
In the following example, a non-capturing cluster eliminates the
“directory” portion of a given Unix pathname, and a capturing cluster
identifies the basename.
> But don't parse paths with regexps. Use functions like `split-path`,
> instead.
```racket
> (regexp-match #rx"^(?:[a-z]*/)*([a-z]+)$"
                "/usr/local/bin/racket")
'("/usr/local/bin/racket" "racket")
```
### 6.3. Cloisters
The location between the `?` and the `:` of a non-capturing cluster is
called a _cloister_. You can put modifiers there that will cause the
enclustered subpattern to be treated specially. The modifier `i` causes
the subpattern to match case-insensitively:
> The term _cloister_ is a useful, if terminally cute, coinage from the
> abbots of Perl.
```racket
> (regexp-match #rx"(?i:hearth)" "HeartH")
'("HeartH")
```
The modifier `m` causes the subpattern to match in _multi-line mode_,
where `.` does not match a newline character, `^` can match just after a
newline, and `$` can match just before a newline.
```racket
> (regexp-match #rx"." "\na\n")
'("\n")
> (regexp-match #rx"(?m:.)" "\na\n")
'("a")
> (regexp-match #rx"^A plan$" "A man\nA plan\nA canal")
#f
> (regexp-match #rx"(?m:^A plan$)" "A man\nA plan\nA canal")
'("A plan")
```
You can put more than one modifier in the cloister:
```racket
> (regexp-match #rx"(?mi:^A Plan$)" "a man\na plan\na canal")
'("a plan")
```
A minus sign before a modifier inverts its meaning. Thus, you can use
`-i` in a _subcluster_ to overturn the case-insensitivity caused by an
enclosing cluster.
```racket
> (regexp-match #rx"(?i:the (?-i:TeX)book)"
                "The TeXbook")
'("The TeXbook")
```
The above regexp will allow any casing for `the` and `book`, but it
insists that `TeX` not be differently cased.
## 7. Alternation
You can specify a list of _alternate_ subpatterns by separating them by
`|`. The `|` separates subpatterns in the nearest enclosing cluster
\(or in the entire pattern string if there are no enclosing parens\).
```racket
> (regexp-match #rx"f(ee|i|o|um)" "a small, final fee")
'("fi" "i")
> (regexp-replace* #rx"([yi])s(e[sdr]?|ing|ation)"
                   (string-append
                    "analyse an energising organisation"
                    " pulsing with noisy organisms")
                   "\\1z\\2")
"analyze an energizing organization pulsing with noisy organisms"
```
Note again that if you wish to use clustering merely to specify a list
of alternate subpatterns but do not want the submatch, use `(?:` instead
of `(`.
```racket
> (regexp-match #rx"f(?:ee|i|o|um)" "fun for all")
'("fo")
```
An important thing to note about alternation is that the leftmost
matching alternate is picked regardless of its length. Thus, if one of
the alternates is a prefix of a later alternate, the latter may not have
a chance to match.
```racket
> (regexp-match #rx"call|call-with-current-continuation"
                "call-with-current-continuation")
'("call")
```
To allow the longer alternate to have a shot at matching, place it
before the shorter one:
```racket
> (regexp-match #rx"call-with-current-continuation|call"
                "call-with-current-continuation")
'("call-with-current-continuation")
```
In any case, an overall match for the entire regexp is always preferred
to an overall non-match. In the following, the longer alternate still
wins, because its preferred shorter prefix fails to yield an overall
match.
```racket
> (regexp-match
   #rx"(?:call|call-with-current-continuation) constrained"
   "call-with-current-continuation constrained")
'("call-with-current-continuation constrained")
```
## 8. Backtracking
We've already seen that greedy quantifiers match the maximal number of
times, but the overriding priority is that the overall match succeed.
Consider
```racket
> (regexp-match #rx"a*a" "aaaa")
'("aaaa")
```
The regexp consists of two subregexps: `a*` followed by `a`. The
subregexp `a*` cannot be allowed to match all four `a`s in the text
string `aaaa`, even though `*` is a greedy quantifier. It may match
only the first three, leaving the last one for the second subregexp.
This ensures that the full regexp matches successfully.
The regexp matcher accomplishes this via a process called
_backtracking_. The matcher tentatively allows the greedy quantifier to
match all four `a`s, but then when it becomes clear that the overall
match is in jeopardy, it _backtracks_ to a less greedy match of three
`a`s. If even this fails, as in the call
```racket
> (regexp-match #rx"a*aa" "aaaa")
'("aaaa")
```
the matcher backtracks even further. Overall failure is conceded only
when all possible backtracking has been tried with no success.
Backtracking is not restricted to greedy quantifiers. Non-greedy
quantifiers match as few instances as possible, and progressively
backtrack to more and more instances in order to attain an overall
match. There is backtracking in alternation too, as the more rightward
alternates are tried when locally successful leftward ones fail to yield
an overall match.
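
Both kinds of backtracking described above can be observed in a short sketch (the inputs are illustrative):

```racket
#lang racket/base

;; On its own, the non-greedy `.*?` stops at the first `>`.
(define first-tag (regexp-match #rx"<.*?>" "<tag1> <tag2>"))
;; first-tag is '("<tag1>")

;; With a trailing `$`, the matcher has to extend `.*?` step by step
;; until the `>` it stops at is also the end of the text string.
(define whole-line (regexp-match #rx"<.*?>$" "<tag1> <tag2>"))
;; whole-line is '("<tag1> <tag2>")

;; Alternation: `a` succeeds locally, but `c` cannot follow it, so the
;; matcher backtracks and tries the alternate `ab`.
(define alt (regexp-match #rx"(a|ab)c" "abc"))
;; alt is '("abc" "ab")
```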
Sometimes it is efficient to disable backtracking. For example, we may
wish to commit to a choice, or we know that trying alternatives is
fruitless. A nonbacktracking regexp is enclosed in `(?>`...`)`.
```racket
> (regexp-match #rx"(?>a+)." "aaaa")
#f
```
In this call, the subregexp `?>a+` greedily matches all four `a`s, and
is denied the opportunity to backtrack. So, the overall match is
denied. The effect of the regexp is therefore to match one or more
`a`s followed by something that is definitely non-`a`.
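
The following sketch contrasts the nonbacktracking cluster with its ordinary counterpart on sample strings:

```racket
#lang racket/base

;; With backtracking, `a+` gives back one `a` so that `.` can match it.
(define with-backtracking (regexp-match #rx"a+." "aaaa"))
;; with-backtracking is '("aaaa")

;; Without backtracking, `a+` keeps every `a`, so `.` finds nothing left.
(define all-as (regexp-match #rx"(?>a+)." "aaaa"))
;; all-as is #f

;; A following non-`a` character lets the overall match succeed.
(define as-then-b (regexp-match #rx"(?>a+)." "aaab"))
;; as-then-b is '("aaab")
```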
## 9. Looking Ahead and Behind
You can have assertions in your pattern that look _ahead_ or _behind_ to
ensure that a subpattern does or does not occur. These “look around”
assertions are specified by putting the subpattern checked for in a
cluster whose leading characters are: `?=` \(for positive lookahead\),
`?!` \(negative lookahead\), `?<=` \(positive lookbehind\), `?<!`
\(negative lookbehind\). Note that the subpattern in the assertion does
not generate a match in the final result; it merely allows or disallows
the rest of the match.
### 9.1. Lookahead
Positive lookahead with `?=` peeks ahead to ensure that its subpattern
_could_ match.
```racket
> (regexp-match-positions #rx"grey(?=hound)"
                          "i left my grey socks at the greyhound")
'((28 . 32))
```
The regexp `#rx"grey(?=hound)"` matches `grey`, but _only_ if it is
followed by `hound`. Thus, the first `grey` in the text string is not
matched.
Negative lookahead with `?!` peeks ahead to ensure that its subpattern
_could not_ possibly match.
```racket
> (regexp-match-positions #rx"grey(?!hound)"
                          "the gray greyhound ate the grey socks")
'((27 . 31))
```
The regexp `#rx"grey(?!hound)"` matches `grey`, but only if it is _not_
followed by `hound`. Thus the `grey` just before `socks` is matched.
### 9.2. Lookbehind
Positive lookbehind with `?<=` checks that its subpattern _could_ match
immediately to the left of the current position in the text string.
```racket
> (regexp-match-positions #rx"(?<=grey)hound"
                          "the hound in the picture is not a greyhound")
'((38 . 43))
```
The regexp `#rx"(?<=grey)hound"` matches `hound`, but only if it is
preceded by `grey`.
Negative lookbehind with `?<!` checks that its subpattern could not
possibly match immediately to the left.
```racket
> (regexp-match-positions #rx"(?<!grey)hound"
                          "the greyhound in the picture is not a hound")
'((38 . 43))
```
The regexp `#rx"(?<!grey)hound"` matches `hound`, but only if it is
_not_ preceded by `grey`.
Lookaheads and lookbehinds can be convenient when they are not
confusing.
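
As a further illustration, here is a sketch that uses lookbehind to pick out numbers by what precedes them. The text string is invented, and the point is that the asserted text never appears in the result.

```racket
#lang racket/base

(define text "3 items for $45")

;; Positive lookbehind: digits that directly follow a `$`.
;; The `$` itself is not part of the match.
(define price (regexp-match #rx"(?<=[$])[0-9]+" text))
;; price is '("45")
(define price-pos (regexp-match-positions #rx"(?<=[$])[0-9]+" text))
;; price-pos is '((13 . 15))

;; Negative lookbehind: digits not preceded by `$` or by another digit.
(define quantity (regexp-match #rx"(?<![$0-9])[0-9]+" text))
;; quantity is '("3")
```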
## 10. An Extended Example
Here's an extended example from Friedl's _Mastering Regular
Expressions_, page 189, that covers many of the features described in
this chapter. The problem is to fashion a regexp that will match any
and only IP addresses or _dotted quads_: four numbers separated by three
dots, with each number between 0 and 255.
First, we define a subregexp `n0-255` that matches 0 through 255:
```racket
> (define n0-255
    (string-append
     "(?:"
     "\\d|"        ; 0 through 9
     "\\d\\d|"     ; 00 through 99
     "[01]\\d\\d|" ; 000 through 199
     "2[0-4]\\d|"  ; 200 through 249
     "25[0-5]"     ; 250 through 255
     ")"))
```
> Note that `n0-255` lists prefixes as preferred alternates, which is
> something we cautioned against in Alternation. However, since we intend
> to anchor this subregexp explicitly to force an overall match, the order
> of the alternates does not matter.
The first two alternates simply get all single- and double-digit
numbers. Since 0-padding is allowed, we need to match both 1 and 01.
We need to be careful when getting 3-digit numbers, since numbers above
255 must be excluded. So we fashion alternates to get 000 through 199,
then 200 through 249, and finally 250 through 255.
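
To see why anchoring matters for `n0-255`, the sketch below repeats its definition, so that the block stands alone, and tries it with and without anchors. The name `octet-rx` is introduced only for this illustration.

```racket
#lang racket/base

(define n0-255
  (string-append "(?:"
                 "\\d|"        ; 0 through 9
                 "\\d\\d|"     ; 00 through 99
                 "[01]\\d\\d|" ; 000 through 199
                 "2[0-4]\\d|"  ; 200 through 249
                 "25[0-5]"     ; 250 through 255
                 ")"))

;; Unanchored, the first alternate wins and only one digit is matched.
(define unanchored (regexp-match (pregexp n0-255) "255"))
;; unanchored is '("2")

;; Anchored at both ends, the matcher must find an alternate that
;; covers the whole string.
(define octet-rx (pregexp (string-append "^" n0-255 "$")))
(define results
  (for/list ([s (list "0" "07" "199" "249" "255" "256" "1234")])
    (regexp-match? octet-rx s)))
;; results is '(#t #t #t #t #t #f #f)
```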
An IP address is a string that consists of four `n0-255`s with three
dots separating them.
```racket
> (define ip-re1
    (string-append
     "^"    ; nothing before
     n0-255 ; the first n0-255,
     "(?:"  ; then the subpattern of
     "\\."  ; a dot followed by
     n0-255 ; an n0-255,
     ")"    ; which is
     "{3}"  ; repeated exactly 3 times
     "$"))  ; with nothing following
```
Let's try it out:
```racket
> (regexp-match (pregexp ip-re1) "1.2.3.4")
'("1.2.3.4")
> (regexp-match (pregexp ip-re1) "55.155.255.265")
#f
```
which is fine, except that we also have
```racket
> (regexp-match (pregexp ip-re1) "0.00.000.00")
'("0.00.000.00")
```
All-zero sequences are not valid IP addresses! Lookahead to the rescue.
Before starting to match `ip-re1`, we look ahead to ensure we don't have
all zeros. We could use positive lookahead to ensure there _is_ a digit
other than zero.
```racket
> (define ip-re
    (pregexp
     (string-append
      "(?=.*[1-9])" ; ensure there's a non-0 digit
      ip-re1)))
```
Or we could use negative lookahead to ensure that what's ahead isn't
composed of _only_ zeros and dots.
```racket
> (define ip-re
    (pregexp
     (string-append
      "(?![0.]*$)" ; not just zeros and dots
                   ; (note: . is not metachar inside [...])
      ip-re1)))
```
The regexp `ip-re` will match all and only valid IP addresses.
```racket
> (regexp-match ip-re "1.2.3.4")
'("1.2.3.4")
> (regexp-match ip-re "0.0.0.0")
#f
```
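
As a final check, the sketch below rebuilds both lookahead variants in one self-contained module and runs them over a handful of sample strings. The names `ip-re-positive`, `ip-re-negative`, and `samples` are introduced here for illustration; the patterns themselves are the ones developed above.

```racket
#lang racket/base

(define n0-255 "(?:\\d|\\d\\d|[01]\\d\\d|2[0-4]\\d|25[0-5])")
(define ip-re1 (string-append "^" n0-255 "(?:\\." n0-255 "){3}" "$"))

;; Variant 1: positive lookahead for at least one non-zero digit.
(define ip-re-positive (pregexp (string-append "(?=.*[1-9])" ip-re1)))
;; Variant 2: negative lookahead against "only zeros and dots".
(define ip-re-negative (pregexp (string-append "(?![0.]*$)" ip-re1)))

(define samples
  (list "1.2.3.4" "0.0.0.0" "0.00.000.00"
        "192.168.0.1" "55.155.255.265" "1.2.3"))

(define positive-results
  (map (lambda (s) (regexp-match? ip-re-positive s)) samples))
(define negative-results
  (map (lambda (s) (regexp-match? ip-re-negative s)) samples))
;; both are '(#t #f #f #t #f #f)
```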