SYNOPSIS
PCRE - Perl-compatible regular expressions

DESCRIPTION
This document describes the regular expressions supported by the PCRE
package. When the package is compiled into the driver, the macro
__PCRE__ is defined.

Most of this manpage is lifted directly from the original PCRE manpage
(dated January 2003).

The PCRE library is a set of functions that implement regular
expression pattern matching using the same syntax and semantics as
Perl 5, with just a few differences (see below). The current
implementation corresponds to Perl 5.005, with some additional features
from later versions. This includes some experimental, incomplete
support for UTF-8 encoded strings. Details of exactly what is and what
is not supported are given below.

PCRE REGULAR EXPRESSION DETAILS
The syntax and semantics of the regular expressions supported by PCRE
are described below. Regular expressions are also described in the Perl
documentation and in a number of other books, some of which have
copious examples. Jeffrey Friedl's "Mastering Regular Expressions",
published by O'Reilly, covers them in great detail. The description
here is intended as reference documentation.

The basic operation of PCRE is on strings of bytes. However, there is
also support for UTF-8 character strings. To use this support you must
build PCRE to include UTF-8 support, and then call pcre_compile() with
the PCRE_UTF8 option. How this affects the pattern matching is
mentioned in several places below. There is also a summary of UTF-8
features in the section on UTF-8 support in the main pcre page.

A regular expression is a pattern that is matched against a subject
string from left to right. Most characters stand for themselves in a
pattern, and match the corresponding characters in the subject. As a
trivial example, the pattern

  The quick brown fox

matches a portion of a subject string that is identical to itself. The
power of regular expressions comes from the ability to include
alternatives and repetitions in the pattern. These are encoded in the
pattern by the use of meta-characters, which do not stand for
themselves but instead are interpreted in some special way.

There are two different sets of meta-characters: those that are
recognized anywhere in the pattern except within square brackets, and
those that are recognized in square brackets. Outside square brackets,
the meta-characters are as follows:

  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also "possessive quantifier"
  {      start min/max quantifier

Part of a pattern that is in square brackets is called a "character
class". In a character class the only meta-characters are:

  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (only if followed by POSIX
         syntax)
  ]      terminates the character class

The following sections describe the use of each of the meta-characters.

BACKSLASH
The backslash character has several uses. Firstly, if it is followed by
a non-alphanumeric character, it takes away any special meaning that
character may have. This use of backslash as an escape character
applies both inside and outside character classes.

For example, if you want to match a * character, you write \* in the
pattern. This escaping action applies whether or not the following
character would otherwise be interpreted as a meta-character, so it is
always safe to precede a non-alphanumeric character with a backslash to
specify that it stands for itself. In particular, if you want to match
a backslash, you write \\.
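
The escaping rule can be tried out with a short sketch. It assumes
Python's re module as a stand-in; re is not PCRE, but it treats a
backslash in front of a non-alphanumeric character the same way. The
subject strings are made up for illustration.

```python
import re

# \* matches a literal asterisk; an unescaped * would be a quantifier.
star = re.search(r"\*", "2*3")

# \\ in the pattern matches one literal backslash in the subject.
backslash = re.search(r"\\", "C:\\temp")

# Escaping a character that has no special meaning is harmless.
percent = re.search(r"\%", "100%")

print(star.group(), backslash.group(), percent.group())
```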

If a pattern is compiled with the PCRE_EXTENDED option, whitespace in
the pattern (other than in a character class) and characters between a
# outside a character class and the next newline character are ignored.
An escaping backslash can be used to include a whitespace or #
character as part of the pattern.
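
A minimal sketch of this layout style, assuming that Python's
re.VERBOSE flag behaves like PCRE_EXTENDED for the features used here
(whitespace skipping, # comments, escaped space). The pattern and
subject are illustrative only.

```python
import re

# re.VERBOSE stands in for PCRE_EXTENDED in this sketch.
pattern = re.compile(r"""
    \d+      # one or more digits
    \        # an escaped space is part of the pattern
    apples   # literal text; the surrounding whitespace is ignored
""", re.VERBOSE)

match = pattern.search("I ate 12 apples today")
print(match.group())  # 12 apples
```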

If you want to remove the special meaning from a sequence of
characters, you can do so by putting them between \Q and \E. This is
different from Perl in that $ and @ are handled as literals in \Q...\E
sequences in PCRE, whereas in Perl, $ and @ cause variable
interpolation. Note the following examples:

  Pattern            PCRE matches   Perl matches

  \Qabc$xyz\E        abc$xyz        abc followed by the
                                    contents of $xyz
  \Qabc\$xyz\E       abc\$xyz       abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz

The \Q...\E sequence is recognized both inside and outside character
classes.

A second use of backslash provides a way of encoding non-printing
characters in patterns in a visible manner. There is no restriction on
the appearance of non-printing characters, apart from the binary zero
that terminates a pattern, but when a pattern is being prepared by text
editing, it is usually easier to use one of the following escape
sequences than the binary character it represents:

  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any character
  \e         escape (hex 1B)
  \f         formfeed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \ddd       character with octal code ddd, or backreference
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh... (UTF-8 mode only)

The precise effect of \cx is as follows: if x is a lower case letter,
it is converted to upper case. Then bit 6 of the character (hex 40) is
inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
becomes hex 7B.
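
The \cx rule is plain bit arithmetic, so it can be checked directly.
The helper name control_char below is hypothetical and not part of
PCRE; it only restates the rule given above.

```python
def control_char(x: str) -> int:
    """Return the character code that \\cx denotes, per the rule above."""
    code = ord(x)
    if "a" <= x <= "z":
        code = ord(x.upper())   # lower case letters are converted first
    return code ^ 0x40          # then bit 6 (hex 40) is inverted

print(hex(control_char("z")))   # 0x1a
print(hex(control_char("{")))   # 0x3b
print(hex(control_char(";")))   # 0x7b
```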

After \x, from zero to two hexadecimal digits are read (letters can be
in upper or lower case). In UTF-8 mode, any number of hexadecimal
digits may appear between \x{ and }, but the value of the character
code must be less than 2**31 (that is, the maximum hexadecimal value is
7FFFFFFF). If characters other than hexadecimal digits appear between
\x{ and }, or if there is no terminating }, this form of escape is not
recognized. Instead, the initial \x will be interpreted as a basic
hexadecimal escape, with no following digits, giving a byte whose value
is zero.

Characters whose value is less than 256 can be defined by either of the
two syntaxes for \x when PCRE is in UTF-8 mode. There is no difference
in the way they are handled. For example, \xdc is exactly the same as
\x{dc}.

After \0 up to two further octal digits are read. In both cases, if
there are fewer than two digits, just those that are present are used.
Thus the sequence \0\x\07 specifies two binary zeros followed by a BEL
character (code value 7). Make sure you supply two digits after the
initial zero if the character that follows is itself an octal digit.

The handling of a backslash followed by a digit other than 0 is
complicated. Outside a character class, PCRE reads it and any following
digits as a decimal number. If the number is less than 10, or if there
have been at least that many previous capturing left parentheses in the
expression, the entire sequence is taken as a back reference. A
description of how this works is given later, following the discussion
of parenthesized subpatterns.

Inside a character class, or if the decimal number is greater than 9
and there have not been that many capturing subpatterns, PCRE re-reads
up to three octal digits following the backslash, and generates a
single byte from the least significant 8 bits of the value. Any
subsequent digits stand for themselves. For example:

  \040   is another way of writing a space
  \40    is the same, provided there are fewer than 40
         previous capturing subpatterns
  \7     is always a back reference
  \11    might be a back reference, or another way of
         writing a tab
  \011   is always a tab
  \0113  is a tab followed by the character "3"
  \113   might be a back reference, otherwise the
         character with octal code 113
  \377   might be a back reference, otherwise
         the byte consisting entirely of 1 bits
  \81    is either a back reference, or a binary zero
         followed by the two characters "8" and "1"

Note that octal values of 100 or greater must not be introduced by a
leading zero, because no more than three octal digits are ever read.
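
The decision rule for a backslash followed by a non-zero digit can be
modelled in a few lines. This is only a sketch of the rule as described
above, not PCRE's actual code; the function name and its parameters are
hypothetical.

```python
def interpret_backslash_digits(digits, groups_so_far, in_class=False):
    """Classify the digit run after a backslash (first digit not 0).

    Returns (kind, value, literal_rest):
      ("backreference", group_number, "") or
      ("byte", byte_value, digits_that_stand_for_themselves).
    """
    number = int(digits)
    if not in_class and (number < 10 or number <= groups_so_far):
        return ("backreference", number, "")

    # Otherwise re-read up to three octal digits after the backslash.
    octal = ""
    for ch in digits[:3]:
        if ch not in "01234567":
            break
        octal += ch
    # Keep only the least significant 8 bits; no octal digits gives zero.
    value = (int(octal, 8) & 0xFF) if octal else 0
    return ("byte", value, digits[len(octal):])

print(interpret_backslash_digits("40", groups_so_far=0))
print(interpret_backslash_digits("11", groups_so_far=11))
print(interpret_backslash_digits("81", groups_so_far=0))
```

With no capturing groups, "40" yields the space character (code 32),
while "81" yields a binary zero with "81" left over as literal text.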

All the sequences that define a single byte value or a single UTF-8
character (in UTF-8 mode) can be used both inside and outside character
classes. In addition, inside a character class, the sequence \b is
interpreted as the backspace character (hex 08). Outside a character
class it has a different meaning (see below).

The third use of backslash is for specifying generic character types:

  \d     any decimal digit
  \D     any character that is not a decimal digit
  \s     any whitespace character
  \S     any character that is not a whitespace character
  \w     any "word" character
  \W     any "non-word" character

Each pair of escape sequences partitions the complete set of characters
into two disjoint sets. Any given character matches one, and only one,
of each pair.

In UTF-8 mode, characters with values greater than 255 never match \d,
\s, or \w, and always match \D, \S, and \W.

For compatibility with Perl, \s does not match the VT character (code
11). This makes it different from the POSIX "space" class. The \s
characters are HT (9), LF (10), FF (12), CR (13), and space (32).

A "word" character is any letter or digit or the underscore character,
that is, any character which can be part of a Perl "word". The
definition of letters and digits is controlled by PCRE's character
tables, and may vary if locale-specific matching is taking place (see
"Locale support" in the pcreapi page). For example, in the "fr"
(French) locale, some character codes greater than 128 are used for
accented letters, and these are matched by \w.

These character type sequences can appear both inside and outside
character classes. They each match one character of the appropriate
type. If the current matching point is at the end of the subject
string, all of them fail, since there is no character to match.

The fourth use of backslash is for certain simple assertions. An
assertion specifies a condition that has to be met at a particular
point in a match, without consuming any characters from the subject
string. The use of subpatterns for more complicated assertions is
described below. The backslashed assertions are:

  \b     matches at a word boundary
  \B     matches when not at a word boundary
  \A     matches at start of subject
  \Z     matches at end of subject or before newline at end
  \z     matches at end of subject
  \G     matches at first matching position in subject

These assertions may not appear in character classes (but note that \b
has a different meaning, namely the backspace character, inside a
character class).

A word boundary is a position in the subject string where the current
character and the previous character do not both match \w or \W (i.e.
one matches \w and the other matches \W), or the start or end of the
string if the first or last character matches \w, respectively.
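
The word boundary definition can be observed with a small sketch. It
assumes Python's re module with the re.ASCII flag as a stand-in for
PCRE's default character tables; the subject string is made up.

```python
import re

subject = "cat concat cat_nap (cat)"

# \b requires a \w character on exactly one side of the position,
# or a \w character at the very start or end of the string.
hits = [m.start() for m in re.finditer(r"\bcat\b", subject, re.ASCII)]
print(hits)  # [0, 20]
```

The "cat" inside "concat" and the one in "cat_nap" are rejected because
the neighbouring "n" and "_" are both word characters.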

The \A, \Z, and \z assertions differ from the traditional circumflex
and dollar (described below) in that they only ever match at the very
start and end of the subject string, whatever options are set. Thus,
they are independent of multiline mode.

They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options. If the
startoffset argument of pcre_exec() is non-zero, indicating that
matching is to start at a point other than the beginning of the
subject, \A can never match. The difference between \Z and \z is that
\Z matches before a newline that is the last character of the string as
well as at the end of the string, whereas \z matches only at the end.

The \G assertion is true only when the current matching position is at
the start point of the match, as specified by the startoffset argument
of pcre_exec(). It differs from \A when the value of startoffset is
non-zero. By calling pcre_exec() multiple times with appropriate
arguments, you can mimic Perl's /g option, and it is in this kind of
implementation where \G can be useful.

Note, however, that PCRE's interpretation of \G, as the start of the
current match, is subtly different from Perl's, which defines it as the
end of the previous match. In Perl, these can be different when the
previously matched string was empty. Because PCRE does just one match
at a time, it cannot reproduce this behaviour.

If all the alternatives of a pattern begin with \G, the expression is
anchored to the starting match position, and the "anchored" flag is set
in the compiled regular expression.

CIRCUMFLEX AND DOLLAR
Outside a character class, in the default matching mode, the circumflex
character is an assertion which is true only if the current matching
point is at the start of the subject string. If the startoffset
argument of pcre_exec() is non-zero, circumflex can never match if the
PCRE_MULTILINE option is unset. Inside a character class, circumflex
has an entirely different meaning (see below).

Circumflex need not be the first character of the pattern if a number
of alternatives are involved, but it should be the first thing in each
alternative in which it appears if the pattern is ever to match that
branch. If all possible alternatives start with a circumflex, that is,
if the pattern is constrained to match only at the start of the
subject, it is said to be an "anchored" pattern. (There are also other
constructs that can cause a pattern to be anchored.)

A dollar character is an assertion which is true only if the current
matching point is at the end of the subject string, or immediately
before a newline character that is the last character in the string (by
default). Dollar need not be the last character of the pattern if a
number of alternatives are involved, but it should be the last item in
any branch in which it appears. Dollar has no special meaning in a
character class.
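
The default behaviour of dollar can be illustrated as follows. The
sketch assumes Python's re module as a stand-in, whose default $ has the
same meaning as described above; it does not use PCRE itself.

```python
import re

# By default $ matches at the very end of the subject ...
at_end = re.search(r"abc$", "abc")

# ... or immediately before a newline that is the last character.
before_final_newline = re.search(r"abc$", "abc\n")

# A newline that is not the last character does not qualify.
internal_newline = re.search(r"abc$", "abc\ndef")

print(at_end is not None, before_final_newline is not None,
      internal_newline is not None)  # True True False
```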

The meaning of dollar can be changed so that it matches only at the
very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
compile time. This does not affect the \Z assertion.

The meanings of the circumflex and dollar characters are changed if the
PCRE_MULTILINE option is set. When this is the case, they match
immediately after and immediately before an internal newline character,
respectively, in addition to matching at the start and end of the
subject string. For example, the pattern /^abc$/ matches the subject
string "def\nabc" in multiline mode, but not otherwise. Consequently,
patterns that are anchored in single line mode because all branches
start with ^ are not anchored in multiline mode, and a match for
circumflex is possible when the startoffset argument of pcre_exec() is
non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE
is set.
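
The /^abc$/ example can be reproduced with a short sketch, assuming
that Python's re.MULTILINE flag corresponds to PCRE_MULTILINE for this
purpose.

```python
import re

subject = "def\nabc"

# Without multiline mode, ^ can only match at the start of the subject.
single_line = re.search(r"^abc$", subject)

# In multiline mode, ^ also matches just after the internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

print(single_line)         # None
print(multi_line.start())  # 4
```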

Note that the sequences \A, \Z, and \z can be used to match the start
and end of the subject in both modes, and if all branches of a pattern
start with \A it is always anchored, whether PCRE_MULTILINE is set or
not.

FULL STOP (PERIOD, DOT)
Outside a character class, a dot in the pattern matches any one
character in the subject, including a non-printing character, but not
(by default) newline. In UTF-8 mode, a dot matches any UTF-8 character,
which might be more than one byte long, except (by default) for
newline. If the PCRE_DOTALL option is set, dots match newlines as well.
The handling of dot is entirely independent of the handling of
circumflex and dollar, the only relationship being that they both
involve newline characters. Dot has no special meaning in a character
class.
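
A brief sketch of the dot rules, assuming that Python's re.DOTALL flag
mirrors PCRE_DOTALL; the subject string is illustrative.

```python
import re

subject = "a\nb"

# By default a dot does not match a newline.
default_dot = re.search(r"a.b", subject)

# With DOTALL set, the dot matches the newline as well.
dotall_dot = re.search(r"a.b", subject, re.DOTALL)

# Inside a character class the dot is an ordinary character.
literal_dot = re.search(r"[.]", "a.b")

print(default_dot, dotall_dot.group() == subject, literal_dot.start())
```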

MATCHING A SINGLE BYTE
Outside a character class, the escape sequence \C matches any one byte,
both in and out of UTF-8 mode. Unlike a dot, it always matches a
newline. The feature is provided in Perl in order to match individual
bytes in UTF-8 mode. Because it breaks up UTF-8 characters into
individual bytes, what remains in the string may be a malformed UTF-8
string. For this reason it is best avoided.

PCRE does not allow \C to appear in lookbehind assertions (see below),
because in UTF-8 mode it makes it impossible to calculate the length of
the lookbehind.

SQUARE BRACKETS
An opening square bracket introduces a character class, terminated by a
closing square bracket. A closing square bracket on its own is not
special. If a closing square bracket is required as a member of the
class, it should be the first data character in the class (after an
initial circumflex, if present) or escaped with a backslash.

A character class matches a single character in the subject. In UTF-8
mode, the character may occupy more than one byte. A matched character
must be in the set of characters defined by the class, unless the first
character in the class definition is a circumflex, in which case the
subject character must not be in the set defined by the class. If a
circumflex is actually required as a member of the class, ensure it is
not the first character, or escape it with a backslash.

For example, the character class [aeiou] matches any lower case vowel,
while [^aeiou] matches any character that is not a lower case vowel.
Note that a circumflex is just a convenient notation for specifying the
characters which are in the class by enumerating those that are not. It
is not an assertion: it still consumes a character from the subject
string, and fails if the current pointer is at the end of the string.
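
That a negated class still consumes a character can be seen in a small
sketch. Python's re module is assumed as a stand-in, and the pattern
q[^u] is a made-up example.

```python
import re

# "q" is the last character, so [^u] has nothing to consume and fails.
at_end = re.search(r"q[^u]", "Iraq")

# Here [^u] consumes the "i" that follows the "q".
mid_word = re.search(r"q[^u]", "Iraqi")

print(at_end)            # None
print(mid_word.group())  # qi
```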

In UTF-8 mode, characters with values greater than 255 can be included
in a class as a literal string of bytes, or by using the \x{ escaping
mechanism.

When caseless matching is set, any letters in a class represent both
their upper case and lower case versions, so for example, a caseless
[aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would. PCRE does not support the
concept of case for characters with values greater than 255.

The newline character is never treated in any special way in character
classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE
options is. A class such as [^a] will always match a newline.

The minus (hyphen) character can be used to specify a range of
characters in a character class. For example, [d-m] matches any letter
between d and m, inclusive. If a minus character is required in a
class, it must be escaped with a backslash or appear in a position
where it cannot be interpreted as indicating a range, typically as the
first or last character in the class.

It is not possible to have the literal character "]" as the end
character of a range. A pattern such as [W-]46] is interpreted as a
class of two characters ("W" and "-") followed by a literal string
"46]", so it would match "W46]" or "-46]". However, if the "]" is
escaped with a backslash it is interpreted as the end of range, so
[W-\]46] is interpreted as a single class containing a range followed
by two separate characters. The octal or hexadecimal representation of
"]" can also be used to end a range.
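
The two readings of the bracket patterns can be compared in a sketch.
It assumes Python's re module, which parses these two particular
patterns the same way as described above; this is not a statement
about every class syntax PCRE accepts.

```python
import re

# Class containing "W" and "-", followed by the literal text 46]
two_char_class = re.compile(r"[W-]46]")

# One class: the range W..] plus the characters 4 and 6
range_class = re.compile(r"[W-\]46]")

print(bool(two_char_class.fullmatch("W46]")))   # True
print(bool(two_char_class.fullmatch("-46]")))   # True
print(bool(range_class.fullmatch("X")))         # True, X lies in W..]
print(bool(range_class.fullmatch("-")))         # False
```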

Ranges operate in the collating sequence of character values. They can
also be used for characters specified numerically, for example
[\000-\037]. In UTF-8 mode, ranges can include characters whose values
are greater than 255, for example [\x{100}-\x{2ff}].

If a range that includes letters is used when caseless matching is set,
it matches the letters in either case. For example, [W-c] is equivalent
to [][\^_`wxyzabc], matched caselessly, and if character tables for the
"fr" locale are in use, [\xc8-\xcb] matches accented E characters in
both cases.
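
The [W-c] equivalence follows from the ASCII code order and can be
checked by plain enumeration. The sketch uses no regular expression
engine at all; folding to lower case is used here only as a simple
model of caseless matching for ASCII characters.

```python
# Characters covered by the range W-c, in character code order.
range_chars = {chr(code) for code in range(ord("W"), ord("c") + 1)}

# Fold letters to lower case to model caseless matching.
folded = {ch.lower() for ch in range_chars}

# Members of the equivalent class [][\^_`wxyzabc]
expected = set("][\\^_`wxyzabc")

print(folded == expected)  # True
```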

The character types \d, \D, \s, \S, \w, and \W may also appear in a
character class, and add the characters that they match to the class.
For example, [\dABCDEF] matches any hexadecimal digit (written with
upper case letters). A circumflex can conveniently be used with the
upper case character types to specify a more restricted set of
characters than the matching lower case type. For example, the class
[^\W_] matches any letter or digit, but not underscore.
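
Both example classes can be exercised in a short sketch, assuming
Python's re module with re.ASCII as a stand-in for PCRE's default
character tables; the subject strings are illustrative.

```python
import re

# [^\W_] is "word characters minus underscore": letters and digits.
letters_and_digits = re.compile(r"[^\W_]+", re.ASCII)
print(letters_and_digits.findall("foo_bar 42_baz"))

# [\dABCDEF] adds the digits matched by \d to the listed letters.
upper_hex = re.compile(r"[\dABCDEF]+", re.ASCII)
print(bool(upper_hex.fullmatch("1A2F")), bool(upper_hex.fullmatch("1a2f")))
```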

All non-alphanumeric characters other than \, -, ^ (at the start) and
the terminating ] are non-special in character classes, but it does no
harm if they are escaped.

POSIX CHARACTER CLASSES
Perl supports the POSIX notation for character classes, which uses
names enclosed by [: and :] within the enclosing square brackets. PCRE
also supports this notation. For example,

  [01[:alpha:]%]

matches "0", "1", any alphabetic character, or "%". The supported class
names are

  alnum    letters and digits
  alpha    letters
  ascii    character codes 0 - 127
  blank    space or tab only
  cntrl    control characters
  digit    decimal digits (same as \d)
  graph    printing characters, excluding space
  lower    lower case letters
  print    printing characters, including space
  punct    printing characters, excluding letters and digits
  space    white space (not quite the same as \s)
  upper    upper case letters
  word     "word" characters (same as \w)
  xdigit   hexadecimal digits

The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
and space (32). Notice that this list includes the VT character (code
11). This makes "space" different to \s, which does not include VT (for
Perl compatibility).

The name "word" is a Perl extension, and "blank" is a GNU extension
from Perl 5.8. Another Perl extension is negation, which is indicated
by a ^ character after the colon. For example,

| 459 | [12[:^digit:]] |
| 460 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 461 | matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the |
| 462 | POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but |
| 463 | these are not supported, and an error is given if they are encountered. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 464 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 465 | In UTF-8 mode, characters with values greater than 255 do not match any |
| 466 | of the POSIX character classes. |
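
Python's re module has no POSIX class notation, so the sketch below only spells out what the two examples above would mean, assuming the default "C" locale tables in which "alpha" is A-Z and a-z. The constant names are hypothetical. The two sets at the end restate the difference between "space" and \s given above.

```python
import re

# Assumed ASCII-only equivalents of the two examples.
ALPHA_EXAMPLE = re.compile(r"[01A-Za-z%]")            # [01[:alpha:]%]
NON_DIGIT_EXAMPLE = re.compile(r"[12\D]", re.ASCII)   # [12[:^digit:]]

# Character codes of the "space" class as listed above.
POSIX_SPACE = {9, 10, 11, 12, 13, 32}
# \s as described above: the same set without VT (11).
BACKSLASH_S = POSIX_SPACE - {11}
```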
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 467 | |
| 468 | VERTICAL BAR |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 469 | Vertical bar characters are used to separate alternative patterns. For |
| 470 | example, the pattern |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 471 | |
| 472 | gilbert|sullivan |
| 473 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 474 | matches either "gilbert" or "sullivan". Any number of alternatives may |
| 475 | appear, and an empty alternative is permitted (matching the empty |
| 476 | string). The matching process tries each alternative in turn, from |
| 477 | left to right, and the first one that succeeds is used. If the |
| 478 | alternatives are within a subpattern (defined below), "succeeds" means |
| 479 | matching the rest of the main pattern as well as the alternative in the |
| 480 | subpattern. |
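
A short sketch of the left-to-right rule, using Python's re module as a stand-in for PCRE (an assumption; both engines try alternatives in order). The helper name first_match is made up for this example.

```python
import re

def first_match(pattern, subject):
    """Return the text matched at the start of subject, or None."""
    m = re.match(pattern, subject)
    return m.group(0) if m else None

simple = first_match(r"gilbert|sullivan", "sullivan")
# The first alternative that succeeds is used, even if a later one is longer.
leftmost = first_match(r"a|ab", "ab")
# Inside a subpattern, "a" does not succeed because "c" cannot follow it.
with_rest = first_match(r"(a|ab)c", "abc")
# An empty alternative matches the empty string.
empty_alt = first_match(r"cat(aract|erpillar|)", "cat")
```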
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 481 | |
| 482 | INTERNAL OPTION SETTING |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 483 | The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and |
| 484 | PCRE_EXTENDED options can be changed from within the pattern by a |
| 485 | sequence of Perl option letters enclosed between "(?" and ")". The |
| 486 | option letters are |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 487 | |
| 488 | i for PCRE_CASELESS |
| 489 | m for PCRE_MULTILINE |
| 490 | s for PCRE_DOTALL |
| 491 | x for PCRE_EXTENDED |
| 492 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 493 | For example, (?im) sets caseless, multiline matching. It is also |
| 494 | possible to unset these options by preceding the letter with a hyphen, |
| 495 | and a combined setting and unsetting such as (?im-sx), which sets |
| 496 | PCRE_CASELESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and |
| 497 | PCRE_EXTENDED, is also permitted. If a letter appears both before and |
| 498 | after the hyphen, the option is unset. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 499 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 500 | When an option change occurs at top level (that is, not inside |
| 501 | subpattern parentheses), the change applies to the remainder of the |
| 502 | pattern that follows. If the change is placed right at the start of a |
| 503 | pattern, PCRE extracts it into the global options (and it will |
| 504 | therefore show up in data extracted by the pcre_fullinfo() function). |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 505 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 506 | An option change within a subpattern affects only that part of the |
| 507 | current pattern that follows it, so |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 508 | |
| 509 | (a(?i)b)c |
| 510 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 511 | matches abc and aBc and no other strings (assuming PCRE_CASELESS is not |
| 512 | used). By this means, options can be made to have different settings |
| 513 | in different parts of the pattern. Any changes made in one alternative |
| 514 | do carry on into subsequent branches within the same subpattern. For |
| 515 | example, |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 516 | |
| 517 | (a(?i)b|c) |
| 518 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 519 | matches "ab", "aB", "c", and "C", even though when matching "C" the |
| 520 | first branch is abandoned before the option setting. This is because |
| 521 | the effects of option settings happen at compile time. There would be |
| 522 | some very weird behaviour otherwise. |
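
The scoping of an option change can be tried out in Python's re module, with two caveats that are assumptions of this sketch: re is not PCRE, and recent Python versions reject an unscoped (?i) in the middle of a pattern. The scoped form (?i:b) is therefore used in place of (a(?i)b)c. The last lines show Python's analogue of a leading option setting ending up in the global options.

```python
import re

# Only "b" is matched caselessly; "a" and "c" stay caseful.
SCOPED = re.compile(r"(a(?i:b))c")

def accepts(subject):
    return SCOPED.fullmatch(subject) is not None

# A setting at the very start becomes a flag of the compiled pattern.
LEADING = re.compile(r"(?im)^abc$")
leading_flags_set = (
    bool(LEADING.flags & re.IGNORECASE) and bool(LEADING.flags & re.MULTILINE)
)
```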
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 523 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 524 | The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed |
| 525 | in the same way as the Perl-compatible options by using the characters |
| 526 | U and X respectively. The (?X) flag setting is special in that it must |
| 527 | always occur earlier in the pattern than any of the additional features |
| 528 | it turns on, even when it is at top level. It is best put at the start. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 529 | |
| 530 | SUBPATTERNS |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 531 | Subpatterns are delimited by parentheses (round brackets), which can be |
| 532 | nested. Marking part of a pattern as a subpattern does two things: |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 533 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 534 | 1. It localizes a set of alternatives. For example, the pattern |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 535 | |
| 536 | cat(aract|erpillar|) |
| 537 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 538 | matches one of the words "cat", "cataract", or "caterpillar". Without |
| 539 | the parentheses, it would match "cataract", "erpillar" or the empty |
| 540 | string. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 541 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 542 | 2. It sets up the subpattern as a capturing subpattern (as defined |
| 543 | above). When the whole pattern matches, that portion of the subject |
| 544 | string that matched the subpattern is passed back to the caller via the |
| 545 | ovector argument of pcre_exec(). Opening parentheses are counted from |
| 546 | left to right (starting from 1) to obtain the numbers of the capturing |
| 547 | subpatterns. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 548 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 549 | For example, if the string "the red king" is matched against the |
| 550 | pattern |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 551 | |
| 552 | the ((red|white) (king|queen)) |
| 553 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 554 | the captured substrings are "red king", "red", and "king", and are |
| 555 | numbered 1, 2, and 3, respectively. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 556 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 557 | The fact that plain parentheses fulfil two functions is not always |
| 558 | helpful. There are often times when a grouping subpattern is required |
| 559 | without a capturing requirement. If an opening parenthesis is followed |
| 560 | by a question mark and a colon, the subpattern does not do any |
| 561 | capturing, and is not counted when computing the number of any |
| 562 | subsequent capturing subpatterns. For example, if the string "the white |
| 563 | queen" is matched against the pattern |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 564 | |
| 565 | the ((?:red|white) (king|queen)) |
| 566 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 567 | the captured substrings are "white queen" and "queen", and are numbered |
| 568 | 1 and 2. The maximum number of capturing subpatterns is 65535, and the |
| 569 | maximum depth of nesting of all subpatterns, both capturing and |
| 570 | noncapturing, is 200. |
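
The numbering of the two examples above can be reproduced with Python's re module, used here as a stand-in for PCRE (an assumption; group numbering by opening parenthesis is the same in both).

```python
import re

capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
non_capturing = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")

# groups() lists the captured substrings in order, starting with number 1.
capturing_groups = capturing.groups()
non_capturing_groups = non_capturing.groups()
```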
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 571 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 572 | As a convenient shorthand, if any option settings are required at the |
| 573 | start of a non-capturing subpattern, the option letters may appear |
| 574 | between the "?" and the ":". Thus the two patterns |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 575 | |
| 576 | (?i:saturday|sunday) |
| 577 | (?:(?i)saturday|sunday) |
| 578 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 579 | match exactly the same set of strings. Because alternative branches are |
| 580 | tried from left to right, and options are not reset until the end of |
| 581 | the subpattern is reached, an option setting in one branch does affect |
| 582 | subsequent branches, so the above patterns match "SUNDAY" as well as |
| 583 | "Saturday". |
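
A minimal check of the shorthand form, again with Python's re module standing in for PCRE (an assumption). Only the (?i:...) spelling is used, because recent Python versions do not accept an unscoped (?i) inside a group.

```python
import re

DAY = re.compile(r"(?i:saturday|sunday)")

def is_weekend_day(word):
    """True if word is Saturday or Sunday in any mix of cases."""
    return DAY.fullmatch(word) is not None
```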
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 584 | |
| 585 | NAMED SUBPATTERNS |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 586 | Identifying capturing parentheses by number is simple, but it can be |
| 587 | very hard to keep track of the numbers in complicated regular |
| 588 | expressions. Furthermore, if an expression is modified, the numbers may |
| 589 | change. To help with the difficulty, PCRE supports the naming of |
| 590 | subpatterns, something that Perl does not provide. The Python syntax |
| 591 | (?P<name>...) is used. Names consist of alphanumeric characters and |
| 592 | underscores, and must be unique within a pattern. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 593 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 594 | Named capturing parentheses are still allocated numbers as well as |
| 595 | names. The PCRE API provides function calls for extracting the name-to- |
| 596 | number translation table from a compiled pattern. For further details |
| 597 | see the pcreapi documentation. |
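
Since the naming syntax is the Python one, it can be illustrated directly in Python's re module. This is only a sketch: the pattern and the names "year" and "month" are made up, and groupindex is Python's counterpart of the name-to-number table, not the PCRE API call.

```python
import re

DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Named groups still receive numbers, counted from the left.
name_to_number = dict(DATE.groupindex)

m = DATE.fullmatch("2003-01")
by_name = m.group("year")
by_number = m.group(1)
```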
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 598 | |
| 599 | REPETITION |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 600 | Repetition is specified by quantifiers, which can follow any of the |
| 601 | following items: |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 602 | |
| 603 | a literal data character |
| 604 | the . metacharacter |
| 605 | the \C escape sequence |
| 606 | escapes such as \d that match single characters |
| 607 | a character class |
| 608 | a back reference (see next section) |
| 609 | a parenthesized subpattern (unless it is an assertion) |
| 610 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 611 | The general repetition quantifier specifies a minimum and maximum |
| 612 | number of permitted matches, by giving the two numbers in curly |
| 613 | brackets (braces), separated by a comma. The numbers must be less than |
| 614 | 65536, and the first must be less than or equal to the second. For |
| 615 | example: |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 616 | |
| 617 | z{2,4} |
| 618 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 619 | matches "zz", "zzz", or "zzzz". A closing brace on its own is not a |
| 620 | special character. If the second number is omitted, but the comma is |
| 621 | present, there is no upper limit; if the second number and the comma |
| 622 | are both omitted, the quantifier specifies an exact number of required |
| 623 | matches. Thus |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 624 | |
| 625 | [aeiou]{3,} |
| 626 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 627 | matches at least 3 successive vowels, but may match many more, while |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 628 | |
| 629 | \d{8} |
| 630 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 631 | matches exactly 8 digits. An opening curly bracket that appears in a |
| 632 | position where a quantifier is not allowed, or one that does not match |
| 633 | the syntax of a quantifier, is taken as a literal character. For |
| 634 | example, {,6} is not a quantifier, but a literal string of four |
| 635 | characters. |
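
The three quantifier forms can be exercised with Python's re module as a stand-in (an assumption). One known difference is left out on purpose: Python reads {,6} as {0,6}, unlike the literal treatment described above.

```python
import re

def full(pattern, subject):
    """True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

two_to_four = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz") if full(r"z{2,4}", s)]
at_least_three_vowels = full(r"[aeiou]{3,}", "aeiouae")
exactly_eight_digits = full(r"\d{8}", "20030115")
```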
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 636 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 637 | In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to |
| 638 | individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 |
| 639 | characters, each of which is represented by a two-byte sequence. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 640 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 641 | The quantifier {0} is permitted, causing the expression to behave as if |
| 642 | the previous item and the quantifier were not present. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 643 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 644 | For convenience (and historical compatibility) the three most common |
| 645 | quantifiers have single-character abbreviations: |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 646 | |
| 647 | * is equivalent to {0,} |
| 648 | + is equivalent to {1,} |
| 649 | ? is equivalent to {0,1} |
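
The equivalences, and the {0} case described earlier, can be checked on a few sample strings. Python's re module is used as a stand-in for PCRE (an assumption), and same_result is a helper invented for this sketch.

```python
import re

SUBJECTS = ["", "a", "ab", "abb", "abbb", "b"]

def same_result(p1, p2):
    """True if both patterns accept exactly the same sample subjects."""
    return all(
        (re.fullmatch(p1, s) is None) == (re.fullmatch(p2, s) is None)
        for s in SUBJECTS
    )

star = same_result(r"ab*", r"ab{0,}")
plus = same_result(r"ab+", r"ab{1,}")
optional = same_result(r"ab?", r"ab{0,1}")
# b{0} behaves as if "b" and the quantifier were not present.
zero = same_result(r"ab{0}", r"a")
```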
| 650 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 651 | It is possible to construct infinite loops by following a subpattern |
| 652 | that can match no characters with a quantifier that has no upper limit, |
| 653 | for example: |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 654 | |
| 655 | (a?)* |
| 656 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 657 | Earlier versions of Perl and PCRE used to give an error at compile time |
| 658 | for such patterns. However, because there are cases where this can be |
| 659 | useful, such patterns are now accepted, but if any repetition of the |
| 660 | subpattern does in fact match no characters, the loop is forcibly |
| 661 | broken. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 662 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 663 | By default, the quantifiers are "greedy", that is, they match as much |
| 664 | as possible (up to the maximum number of permitted times), without |
| 665 | causing the rest of the pattern to fail. The classic example of where |
| 666 | this gives problems is in trying to match comments in C programs. These |
| 667 | appear between the sequences /* and */ and within the sequence, |
| 668 | individual * and / characters may appear. An attempt to match C |
| 669 | comments by applying the pattern |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 670 | |
| 671 | /\*.*\*/ |
| 672 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 673 | to the string |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 674 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 675 | /* first comment */ not comment /* second comment */ |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 676 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 677 | fails, because it matches the entire string owing to the greediness of |
| 678 | the .* item. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 679 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 680 | However, if a quantifier is followed by a question mark, it ceases to |
| 681 | be greedy, and instead matches the minimum number of times possible, so |
| 682 | the pattern |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 683 | |
| 684 | /\*.*?\*/ |
| 685 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 686 | does the right thing with the C comments. The meaning of the various |
| 687 | quantifiers is not otherwise changed, just the preferred number of |
| 688 | matches. Do not confuse this use of question mark with its use as a |
| 689 | quantifier in its own right. Because it has two uses, it can sometimes |
| 690 | appear doubled, as in |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 691 | |
| 692 | \d??\d |
| 693 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 694 | which matches one digit by preference, but can match two if that is the |
| 695 | only way the rest of the pattern matches. |
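
Greedy and minimizing quantifiers can be compared side by side on a subject of the same shape as the C comment example. The sketch uses Python's re module as a stand-in for PCRE (an assumption; the ? suffix has the same meaning in both).

```python
import re

SUBJECT = "/* first comment */ not comment /* second comment */"

greedy = re.search(r"/\*.*\*/", SUBJECT).group(0)
lazy = re.search(r"/\*.*?\*/", SUBJECT).group(0)

# \d??\d prefers one digit, but takes two when the rest requires it.
preferred = re.match(r"\d??\d", "12").group(0)
forced = re.fullmatch(r"\d??\d", "12").group(0)
```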
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 696 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 697 | If the PCRE_UNGREEDY option is set (an option which is not available in |
| 698 | Perl), the quantifiers are not greedy by default, but individual ones |
| 699 | can be made greedy by following them with a question mark. In other |
| 700 | words, it inverts the default behaviour. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 701 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 702 | When a parenthesized subpattern is quantified with a minimum repeat |
| 703 | count that is greater than 1 or with a limited maximum, more store is |
| 704 | required for the compiled pattern, in proportion to the size of the |
| 705 | minimum or maximum. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 706 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 707 | If a pattern starts with .* or .{0,} and the PCRE_DOTALL option |
| 708 | (equivalent to Perl's /s) is set, thus allowing the . to match |
| 709 | newlines, the pattern is implicitly anchored, because whatever follows |
| 710 | will be tried against every character position in the subject string, |
| 711 | so there is no point in retrying the overall match at any position |
| 712 | after the first. PCRE normally treats such a pattern as though it were |
| 713 | preceded by \A. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 714 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 715 | In cases where it is known that the subject string contains no |
| 716 | newlines, it is worth setting PCRE_DOTALL in order to obtain this |
| 717 | optimization, or alternatively using ^ to indicate anchoring |
| 718 | explicitly. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 719 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 720 | However, there is one situation where the optimization cannot be used. |
| 721 | When .* is inside capturing parentheses that are the subject of a |
| 722 | backreference elsewhere in the pattern, a match at the start may fail, |
| 723 | and a later one succeed. Consider, for example: |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 724 | |
| 725 | (.*)abc\1 |
| 726 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 727 | If the subject is "xyz123abc123" the match point is the fourth |
| 728 | character. For this reason, such a pattern is not implicitly anchored. |
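
The match point can be confirmed with Python's re module, used as a stand-in for PCRE (an assumption). Offsets are counted from 0, so the fourth character is offset 3.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

start = m.start()      # offset where the overall match begins
captured = m.group(1)  # text that \1 had to repeat
```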
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 729 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 730 | When a capturing subpattern is repeated, the value captured is the |
| 731 | substring that matched the final iteration. For example, after |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 732 | |
| 733 | (tweedle[dume]{3}\s*)+ |
| 734 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 735 | has matched "tweedledum tweedledee" the value of the captured substring |
| 736 | is "tweedledee". However, if there are nested capturing subpatterns, |
| 737 | the corresponding captured values may have been set in previous |
| 738 | iterations. For example, after |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 739 | |
| 740 | /(a|(b))+/ |
| 741 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 742 | matches "aba" the value of the second captured substring is "b". |
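
Both examples can be reproduced with Python's re module as a stand-in for PCRE. That the inner group keeps its value from an earlier iteration is also how Python behaves, but treat the sketch as an illustration rather than as a statement about the driver.

```python
import re

tweedle = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = tweedle.group(1)

nested = re.fullmatch(r"(a|(b))+", "aba")
outer = nested.group(1)  # set by the final iteration
inner = nested.group(2)  # still set from the second iteration
```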
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 743 | |
| 744 | ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 745 | With both maximizing and minimizing repetition, failure of what follows |
| 746 | normally causes the repeated item to be re-evaluated to see if a |
| 747 | different number of repeats allows the rest of the pattern to match. |
| 748 | Sometimes it is useful to prevent this, either to change the nature of |
| 749 | the match, or to cause it to fail earlier than it otherwise might, when |
| 750 | the author of the pattern knows there is no point in carrying on. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 751 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 752 | Consider, for example, the pattern \d+foo when applied to the subject |
| 753 | line |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 754 | |
| 755 | 123456bar |
| 756 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 757 | After matching all 6 digits and then failing to match "foo", the normal |
| 758 | action of the matcher is to try again with only 5 digits matching the |
| 759 | \d+ item, and then with 4, and so on, before ultimately failing. |
| 760 | "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides |
| 761 | the means for specifying that once a subpattern has matched, it is not |
| 762 | to be re-evaluated in this way. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 763 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 764 | If we use atomic grouping for the previous example, the matcher would |
| 765 | give up immediately on failing to match "foo" the first time. The |
| 766 | notation is a kind of special parenthesis, starting with (?> as in this |
| 767 | example: |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 768 | |
| 769 | (?>\d+)foo |
| 770 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 771 | This kind of parenthesis "locks up" the part of the pattern it |
| 772 | contains once it has matched, and a failure further into the pattern is |
| 773 | prevented from backtracking into it. Backtracking past it to previous |
| 774 | items, however, works as normal. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 775 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 776 | An alternative description is that a subpattern of this type matches |
| 777 | the string of characters that an identical standalone pattern would |
| 778 | match, if anchored at the current point in the subject string. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 779 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 780 | Atomic grouping subpatterns are not capturing subpatterns. Simple cases |
| 781 | such as the above example can be thought of as a maximizing repeat that |
| 782 | must swallow everything it can. So, while both \d+ and \d+? are |
| 783 | prepared to adjust the number of digits they match in order to make the |
| 784 | rest of the pattern match, (?>\d+) can only match an entire sequence of |
| 785 | digits. |
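
The difference between \d+ and (?>\d+) shows up when the following item needs one of the digits. The sketch below uses Python's re module as a stand-in for PCRE, which is an assumption; it supports (?>...) only from Python 3.11, so the atomic cases are skipped on older versions.

```python
import re
import sys

# Atomic groups were added to Python's re module in version 3.11.
HAS_ATOMIC = sys.version_info >= (3, 11)

def found(pattern, subject):
    return re.search(pattern, subject) is not None

# \d+ gives back the final "6" so that the literal 6 can match.
backtracking = found(r"\d+6", "123456")

if HAS_ATOMIC:
    # The atomic group keeps every digit, so nothing is left for "6".
    atomic = found(r"(?>\d+)6", "123456")
    atomic_foo = found(r"(?>\d+)foo", "123456foo")
else:
    atomic = atomic_foo = None
```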
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 786 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 787 | Atomic groups in general can of course contain arbitrarily complicated |
| 788 | subpatterns, and can be nested. However, when the subpattern for an |
| 789 | atomic group is just a single repeated item, as in the example above, a |
| 790 | simpler notation, called a "possessive quantifier" can be used. This |
| 791 | consists of an additional + character following a quantifier. Using |
| 792 | this notation, the previous example can be rewritten as |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 793 | |
| 794 | \d++foo |
| 795 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 796 | Possessive quantifiers are always greedy; the setting of the |
| 797 | PCRE_UNGREEDY option is ignored. They are a convenient notation for the |
| 798 | simpler forms of atomic group. However, there is no difference in the |
| 799 | meaning or processing of a possessive quantifier and the equivalent |
| 800 | atomic group. |
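
As a hedged check of that equivalence, the possessive form of (?>\d+)foo can be compared with the atomic group on a few subjects. Python's re module is used as a stand-in for PCRE (an assumption), and both notations need Python 3.11 or later, so the comparison is skipped on older versions.

```python
import re
import sys

HAS_POSSESSIVE = sys.version_info >= (3, 11)

SUBJECTS = ["123456foo", "123456bar", "foo", "1foo"]

def results(pattern):
    """Match outcome of the pattern for each sample subject."""
    return [re.search(pattern, s) is not None for s in SUBJECTS]

if HAS_POSSESSIVE:
    same = results(r"\d++foo") == results(r"(?>\d+)foo")
else:
    same = None
```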
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 801 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 802 | The possessive quantifier syntax is an extension to the Perl syntax. It |
| 803 | originates in Sun's Java package. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 804 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 805 | When a pattern contains an unlimited repeat inside a subpattern that |
| 806 | can itself be repeated an unlimited number of times, the use of an |
| 807 | atomic group is the only way to avoid some failing matches taking a |
| 808 | very long time indeed. The pattern |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 809 | |
| 810 | (\D+|<\d+>)*[!?] |
| 811 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 812 | matches an unlimited number of substrings that either consist of non- |
| 813 | digits, or digits enclosed in <>, followed by either ! or ?. When it |
| 814 | matches, it runs quickly. However, if it is applied to |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 815 | |
| 816 | aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
| 817 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 818 | it takes a long time before reporting failure. This is because the |
| 819 | string can be divided between the two repeats in a large number of |
| 820 | ways, and all have to be tried. (The example used [!?] rather than a |
| 821 | single character at the end, because both PCRE and Perl have an |
| 822 | optimization that allows for fast failure when a single character is |
| 823 | used. They remember the last single character that is required for a |
| 824 | match, and fail early if it is not present in the string.) If the |
| 825 | pattern is changed to |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 826 | |
| 827 | ((?>\D+)|<\d+>)*[!?] |
| 828 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 829 | sequences of non-digits cannot be broken, and failure happens quickly. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 830 | |
| 831 | BACK REFERENCES |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 832 | Outside a character class, a backslash followed by a digit greater than |
| 833 | 0 (and possibly further digits) is a back reference to a capturing |
| 834 | subpattern earlier (that is, to its left) in the pattern, provided |
| 835 | there have been that many previous capturing left parentheses. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 836 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 837 | However, if the decimal number following the backslash is less than 10, |
| 838 | it is always taken as a back reference, and causes an error only if |
| 839 | there are not that many capturing left parentheses in the entire |
| 840 | pattern. In other words, the parentheses that are referenced need not |
| 841 | be to the left of the reference for numbers less than 10. See the |
| 842 | section entitled "Backslash" above for further details of the handling |
| 843 | of digits following a backslash. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 844 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 845 | A back reference matches whatever actually matched the capturing |
| 846 | subpattern in the current subject string, rather than anything matching |
| 847 | the subpattern itself (see "Subpatterns as subroutines" below for a way |
| 848 | of doing that). So the pattern |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 849 | |
| 850 | (sens|respons)e and \1ibility |
| 851 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 852 | matches "sense and sensibility" and "response and responsibility", but |
| 853 | not "sense and responsibility". If caseful matching is in force at the |
| 854 | time of the back reference, the case of letters is relevant. For |
| 855 | example, |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 856 | |
| 857 | ((?i)rah)\s+\1 |
| 858 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 859 | matches "rah rah" and "RAH RAH", but not "RAH rah", even though the |
| 860 | original capturing subpattern is matched caselessly. |
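
Both examples can be tried in Python's re module, used here as a stand-in for PCRE (an assumption). Because recent Python versions reject an unscoped (?i) inside a group, the second pattern is written with the scoped form (?i:rah); the back reference itself stays caseful.

```python
import re

AUSTEN = re.compile(r"(sens|respons)e and \1ibility")
RAH = re.compile(r"((?i:rah))\s+\1")

def full(pattern, subject):
    return pattern.fullmatch(subject) is not None
```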
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 861 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 862 | Back references to named subpatterns use the Python syntax (?P=name). |
| 863 | We could rewrite the above example as follows: |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 864 | |
| 865 | (?P<p1>(?i)rah)\s+(?P=p1) |
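
A named back reference can be sketched in Python's re module, where the group is defined with (?P<p1>...) and referenced with (?P=p1). This is a stand-in for PCRE, and the scoped (?i:rah) replaces the unscoped (?i), which recent Python versions do not accept inside a group.

```python
import re

RAH_NAMED = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

def full(subject):
    return RAH_NAMED.fullmatch(subject) is not None
```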
| 866 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 867 | There may be more than one back reference to the same subpattern. If a |
| 868 | subpattern has not actually been used in a particular match, any back |
| 869 | references to it always fail. For example, the pattern |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 870 | |
| 871 | (a|(bc))\2 |
| 872 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 873 | always fails if it starts to match "a" rather than "bc". Because there |
| 874 | may be many capturing parentheses in a pattern, all digits following |
| 875 | the backslash are taken as part of a potential back reference number. |
| 876 | If the pattern continues with a digit character, some delimiter must be |
| 877 | used to terminate the back reference. If the PCRE_EXTENDED option is |
| 878 | set, this can be whitespace. Otherwise an empty comment can be used. |
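The delimiter rule can be sketched as follows. Python's re module is used as a stand-in; its re.VERBOSE flag is assumed here to play the role of PCRE_EXTENDED, and (?#) is an empty comment. Both patterns mean "group 1, the same text again, then a literal 1".

```python
import re

# An empty comment separates the back reference \1 from the literal digit 1.
with_comment = re.compile(r"(a)\1(?#)1")

# In verbose (extended) mode, whitespace does the same job.
with_space = re.compile(r"(a)\1 1", re.VERBOSE)

comment_ok = with_comment.fullmatch("aa1") is not None
space_ok = with_space.fullmatch("aa1") is not None

print(comment_ok, space_ok)
```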
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 879 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 880 | A back reference that occurs inside the parentheses to which it refers |
| 881 | fails when the subpattern is first used, so, for example, (a\1) never |
| 882 | matches. However, such references can be useful inside repeated |
| 883 | subpatterns. For example, the pattern |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 884 | |
| 885 | (a|b\1)+ |
| 886 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 887 | matches any number of "a"s and also "aba", "ababbaa" etc. At each |
| 888 | iteration of the subpattern, the back reference matches the character |
| 889 | string corresponding to the previous iteration. In order for this to |
| 890 | work, the pattern must be such that the first iteration does not need |
| 891 | to match the back reference. This can be done using alternation, as in |
| 892 | the example above, or by a quantifier with a minimum of zero. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 893 | |
| 894 | ASSERTIONS |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 895 | An assertion is a test on the characters following or preceding the |
| 896 | current matching point that does not actually consume any characters. |
| 897 | The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are |
| 898 | described above. More complicated assertions are coded as subpatterns. |
| 899 | There are two kinds: those that look ahead of the current position in |
| 900 | the subject string, and those that look behind it. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 901 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 902 | An assertion subpattern is matched in the normal way, except that it |
| 903 | does not cause the current matching position to be changed. Lookahead |
| 904 | assertions start with (?= for positive assertions and (?! for negative |
| 905 | assertions. For example, |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 906 | |
| 907 | \w+(?=;) |
| 908 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 909 | matches a word followed by a semicolon, but does not include the |
| 910 | semicolon in the match, and |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 911 | |
| 912 | foo(?!bar) |
| 913 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 914 | matches any occurrence of "foo" that is not followed by "bar". Note |
| 915 | that the apparently similar pattern |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 916 | |
| 917 | (?!foo)bar |
| 918 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 919 | does not find an occurrence of "bar" that is preceded by something |
| 920 | other than "foo"; it finds any occurrence of "bar" whatsoever, because |
| 921 | the assertion (?!foo) is always true when the next three characters are |
| 922 | "bar". A lookbehind assertion is needed to achieve this effect. |
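The three lookahead patterns above behave as follows in a small sketch. Python's re module is used as a stand-in on the assumption that lookahead works the same way as in PCRE; the subject strings are made up for illustration.

```python
import re

# The semicolon is required but is not part of the match.
word = re.search(r"\w+(?=;)", "int x;").group()

# The first "foo" is followed by "bar", so the match is the second one.
foo_start = re.search(r"foo(?!bar)", "foobar foobaz").start()

# (?!foo) is tested at the point where "bar" begins, so it is always true
# there; the preceding "foo" does not prevent the match.
bar_start = re.search(r"(?!foo)bar", "foobar").start()

print(word, foo_start, bar_start)
```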
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 923 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 924 | If you want to force a matching failure at some point in a pattern, the |
| 925 | most convenient way to do it is with (?!) because an empty string |
| 926 | always matches, so an assertion that requires there not to be an empty |
| 927 | string must always fail. |
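A minimal sketch of forced failure, again using Python's re module as a stand-in for PCRE. The patterns and subjects are illustrative only.

```python
import re

# "a(?!)" can never match: the empty negative lookahead always fails.
never = re.search(r"a(?!)", "aaa")

# Placed in one branch of an alternation, (?!) rules that branch out.
forced = re.compile(r"cat(?!)|dog")
dog_ok = forced.fullmatch("dog") is not None
cat_ok = forced.fullmatch("cat") is not None

print(never, dog_ok, cat_ok)
```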
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 928 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 929 | Lookbehind assertions start with (?<= for positive assertions and (?<! |
| 930 | for negative assertions. For example, |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 931 | |
| 932 | (?<!foo)bar |
| 933 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 934 | does find an occurrence of "bar" that is not preceded by "foo". The |
| 935 | contents of a lookbehind assertion are restricted such that all the |
| 936 | strings it matches must have a fixed length. However, if there are |
| 937 | several alternatives, they do not all have to have the same fixed |
| 938 | length. Thus |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 939 | |
| 940 | (?<=bullock|donkey) |
| 941 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 942 | is permitted, but |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 943 | |
| 944 | (?<!dogs?|cats?) |
| 945 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 946 | causes an error at compile time. Branches that match different length |
| 947 | strings are permitted only at the top level of a lookbehind assertion. |
| 948 | This is an extension compared with Perl (at least for 5.8), which |
| 949 | requires all branches to match the same length of string. An assertion |
| 950 | such as |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 951 | |
| 952 | (?<=ab(c|de)) |
| 953 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 954 | is not permitted, because its single top-level branch can match two |
| 955 | different lengths, but it is acceptable if rewritten to use two top- |
| 956 | level branches: |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 957 | |
| 958 | (?<=abc|abde) |
| 959 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 960 | The implementation of lookbehind assertions is, for each alternative, |
| 961 | to temporarily move the current position back by the fixed width and |
| 962 | then try to match. If there are insufficient characters before the |
| 963 | current position, the match is deemed to fail. |
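The fixed-length rule can be illustrated with a sketch in Python. Note the assumption: Python's re module is stricter than the PCRE rule described above, because it requires every alternative inside one lookbehind to have the same length. The sketch therefore writes the two branches as separate lookbehinds, which mirrors the "one fixed width per alternative" implementation just described.

```python
import re

# Negative lookbehind: "bar" that is not preceded by "foo".
start = re.search(r"(?<!foo)bar", "foobar bazbar").start()

# One lookbehind per alternative, each with its own fixed width
# (7 for "bullock", 6 for "donkey").
plural = re.compile(r"(?:(?<=bullock)|(?<=donkey))s\b")
hits = [w for w in ("bullocks", "donkeys", "horses") if plural.search(w)]

# A branch of variable length, such as dogs?, is rejected at compile time.
try:
    re.compile(r"(?<!dogs?|cats?)")
    variable_ok = True
except re.error:
    variable_ok = False

print(start, hits, variable_ok)
```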
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 964 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 965 | PCRE does not allow the \C escape (which matches a single byte in UTF-8 |
| 966 | mode) to appear in lookbehind assertions, because it makes it |
| 967 | impossible to calculate the length of the lookbehind. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 968 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 969 | Atomic groups can be used in conjunction with lookbehind assertions to |
| 970 | specify efficient matching at the end of the subject string. Consider a |
| 971 | simple pattern such as |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 972 | |
| 973 | abcd$ |
| 974 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 975 | when applied to a long string that does not match. Because matching |
| 976 | proceeds from left to right, PCRE will look for each "a" in the subject |
| 977 | and then see if what follows matches the rest of the pattern. If the |
| 978 | pattern is specified as |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 979 | |
| 980 | ^.*abcd$ |
| 981 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 982 | the initial .* matches the entire string at first, but when this fails |
| 983 | (because there is no following "a"), it backtracks to match all but the |
| 984 | last character, then all but the last two characters, and so on. Once |
| 985 | again the search for "a" covers the entire string, from right to left, |
| 986 | so we are no better off. However, if the pattern is written as |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 987 | |
| 988 | ^(?>.*)(?<=abcd) |
| 989 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 990 | or, equivalently, |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 991 | |
| 992 | ^.*+(?<=abcd) |
| 993 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 994 | there can be no backtracking for the .* item; it can match only the |
| 995 | entire string. The subsequent lookbehind assertion does a single test |
| 996 | on the last four characters. If it fails, the match fails immediately. |
| 997 | For long strings, this approach makes a significant difference to the |
| 998 | processing time. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 999 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1000 | Several assertions (of any sort) may occur in succession. For example, |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1001 | |
| 1002 | (?<=\d{3})(?<!999)foo |
| 1003 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1004 | matches "foo" preceded by three digits that are not "999". Notice that |
| 1005 | each of the assertions is applied independently at the same point in |
| 1006 | the subject string. First there is a check that the previous three |
| 1007 | characters are all digits, and then there is a check that the same |
| 1008 | three characters are not "999". This pattern does not match "foo" |
| 1009 | preceded by six characters, the first three of which are digits and the last |
| 1010 | three of which are not "999". For example, it doesn't match |
| 1011 | "123abcfoo". A pattern to do that is |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1012 | |
| 1013 | (?<=\d{3}...)(?<!999)foo |
| 1014 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1015 | This time the first assertion looks at the preceding six characters, |
| 1016 | checking that the first three are digits, and then the second assertion |
| 1017 | checks that the preceding three characters are not "999". |
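The difference between the two patterns can be checked with a short sketch. Python's re module is used as a stand-in, assuming successive fixed-width lookbehinds behave as in PCRE; the subject strings are illustrative.

```python
import re

# Both assertions look at the same three characters before "foo".
same_three = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion now spans six characters; the second still checks
# only the last three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ("123foo", "999foo", "123abcfoo", "123999foo")
same_three_hits = [s for s in subjects if same_three.search(s)]
six_back_hits = [s for s in subjects if six_back.search(s)]

print(same_three_hits, six_back_hits)
```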
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1018 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1019 | Assertions can be nested in any combination. For example, |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1020 | |
| 1021 | (?<=(?<!foo)bar)baz |
| 1022 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1023 | matches an occurrence of "baz" that is preceded by "bar" which in turn |
| 1024 | is not preceded by "foo", while |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1025 | |
| 1026 | (?<=\d{3}(?!999)...)foo |
| 1027 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1028 | is another pattern which matches "foo" preceded by three digits and any |
| 1029 | three characters that are not "999". |
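A sketch of the two nested forms, using Python's re module as a stand-in. It assumes that an assertion nested inside a lookbehind counts as zero width there, so the outer lookbehind still has a fixed length.

```python
import re

# "baz" preceded by "bar", which in turn is not preceded by "foo".
nested = re.compile(r"(?<=(?<!foo)bar)baz")
nested_results = [nested.search(s) is not None
                  for s in ("xyzbarbaz", "foobarbaz")]

# A lookahead inside the lookbehind: three digits, then three characters
# that are not "999", then "foo".
inner = re.compile(r"(?<=\d{3}(?!999)...)foo")
inner_results = [inner.search(s) is not None
                 for s in ("123abcfoo", "123999foo")]

print(nested_results, inner_results)
```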
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1030 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1031 | Assertion subpatterns are not capturing subpatterns, and may not be |
| 1032 | repeated, because it makes no sense to assert the same thing several |
| 1033 | times. If any kind of assertion contains capturing subpatterns within |
| 1034 | it, these are counted for the purposes of numbering the capturing |
| 1035 | subpatterns in the whole pattern. However, substring capturing is |
| 1036 | carried out only for positive assertions, because it does not make |
| 1037 | sense for negative assertions. |
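Capturing inside a positive assertion can be sketched like this, with Python's re module standing in for PCRE. The lookahead captures the whole run of digits without consuming it, and groups inside a negative assertion are still counted for numbering.

```python
import re

# Group 1 sits inside the lookahead; only one digit is actually consumed.
m = re.match(r"(?=(\d+))\d", "2024x")
whole, peeked = m.group(0), m.group(1)

# The group inside (?!...) is counted, so (b) is group number 2.
group_count = re.compile(r"(?!(a))(b)").groups

print(whole, peeked, group_count)
```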
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1038 | |
| 1039 | CONDITIONAL SUBPATTERNS |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1040 | It is possible to cause the matching process to obey a subpattern |
| 1041 | conditionally or to choose between two alternative subpatterns, |
| 1042 | depending on the result of an assertion, or whether a previous |
| 1043 | capturing subpattern matched or not. The two possible forms of |
| 1044 | conditional subpattern are |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1045 | |
| 1046 | (?(condition)yes-pattern) |
| 1047 | (?(condition)yes-pattern|no-pattern) |
| 1048 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1049 | If the condition is satisfied, the yes-pattern is used; otherwise the |
| 1050 | no-pattern (if present) is used. If there are more than two |
| 1051 | alternatives in the subpattern, a compile-time error occurs. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1052 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1053 | There are three kinds of condition. If the text between the parentheses |
| 1054 | consists of a sequence of digits, the condition is satisfied if the |
| 1055 | capturing subpattern of that number has previously matched. The number |
| 1056 | must be greater than zero. Consider the following pattern, which |
| 1057 | contains non-significant white space to make it more readable (assume |
| 1058 | the PCRE_EXTENDED option) and to divide it into three parts for ease of |
| 1059 | discussion: |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1060 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1061 | ( \( )? [^()]+ (?(1) \) ) |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1062 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1063 | The first part matches an optional opening parenthesis, and if that |
| 1064 | character is present, sets it as the first captured substring. The |
| 1065 | second part matches one or more characters that are not parentheses. |
| 1066 | The third part is a conditional subpattern that tests whether the first |
| 1067 | set of parentheses matched or not. If they did, that is, if the subject |
| 1068 | started with an opening parenthesis, the condition is true, and so the |
| 1069 | yes-pattern is executed and a closing parenthesis is required. |
| 1070 | Otherwise, since no-pattern is not present, the subpattern matches |
| 1071 | nothing. In other words, this pattern matches a sequence of |
| 1072 | non-parentheses, optionally enclosed in parentheses. |
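The three-part pattern can be tried out as follows. Python's re module also supports the (?(number)yes-pattern) form, and its re.VERBOSE flag is assumed here to correspond to PCRE_EXTENDED; fullmatch is used so that the whole subject has to fit the pattern.

```python
import re

# Optional "(", a run of non-parentheses, and ")" only if group 1 matched.
optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

subjects = ("abc", "(abc)", "(abc", "abc)")
accepted = [s for s in subjects if optional_parens.fullmatch(s)]

print(accepted)
```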
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1073 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1074 | If the condition is the string (R), it is satisfied if a recursive call |
| 1075 | to the pattern or subpattern has been made. At "top level", the |
| 1076 | condition is false. This is a PCRE extension. Recursive patterns are |
| 1077 | described in the next section. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1078 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1079 | If the condition is not a sequence of digits or (R), it must be an |
| 1080 | assertion. This may be a positive or negative lookahead or lookbehind |
| 1081 | assertion. Consider this pattern, again containing non-significant |
| 1082 | white space, and with the two alternatives on the second line: |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1083 | |
| 1084 | (?(?=[^a-z]*[a-z]) |
| 1085 | \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) |
| 1086 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1087 | The condition is a positive lookahead assertion that matches an |
| 1088 | optional sequence of non-letters followed by a letter. In other words, |
| 1089 | it tests for the presence of at least one letter in the subject. If a |
| 1090 | letter is found, the subject is matched against the first alternative; |
| 1091 | otherwise it is matched against the second. This pattern matches |
| 1092 | strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are |
| 1093 | letters and dd are digits. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1094 | |
| 1095 | COMMENTS |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1096 | The sequence (?# marks the start of a comment which continues up to the |
| 1097 | next closing parenthesis. Nested parentheses are not permitted. The |
| 1098 | characters that make up a comment play no part in the pattern matching |
| 1099 | at all. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1100 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1101 | If the PCRE_EXTENDED option is set, an unescaped # character outside a |
| 1102 | character class introduces a comment that continues up to the next |
| 1103 | newline character in the pattern. |
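Both comment styles can be seen in a small sketch. Python's re module is used as a stand-in; it accepts (?#...) comments, and re.VERBOSE is assumed to match PCRE_EXTENDED in treating an unescaped # as the start of a comment. The day/month labels are purely illustrative.

```python
import re

# Inline comments take no part in the matching.
inline = re.compile(r"\d{2}(?#day)-\d{2}(?#month)")

# In verbose mode, # starts a comment that runs to the end of the line.
verbose = re.compile(r"""
    \d{2}   # day
    -
    \d{2}   # month
""", re.VERBOSE)

inline_ok = inline.fullmatch("24-06") is not None
verbose_ok = verbose.fullmatch("24-06") is not None

print(inline_ok, verbose_ok)
```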
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1104 | |
| 1105 | RECURSIVE PATTERNS |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1106 | Consider the problem of matching a string in parentheses, allowing for |
| 1107 | unlimited nested parentheses. Without the use of recursion, the best |
| 1108 | that can be done is to use a pattern that matches up to some fixed |
| 1109 | depth of nesting. It is not possible to handle an arbitrary nesting |
| 1110 | depth. Perl has provided an experimental facility that allows regular |
| 1111 | expressions to recurse (amongst other things). It does this by |
| 1112 | interpolating Perl code in the expression at run time, and the code can |
| 1113 | refer to the expression itself. A Perl pattern to solve the parentheses |
| 1114 | problem can be created like this: |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1115 | |
| 1116 | $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x; |
| 1117 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1118 | The (?p{...}) item interpolates Perl code at run time, and in this case |
| 1119 | refers recursively to the pattern in which it appears. Obviously, PCRE |
| 1120 | cannot support the interpolation of Perl code. Instead, it supports |
| 1121 | some special syntax for recursion of the entire pattern, and also for |
| 1122 | individual subpattern recursion. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1123 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1124 | The special item that consists of (? followed by a number greater than |
| 1125 | zero and a closing parenthesis is a recursive call of the subpattern of |
| 1126 | the given number, provided that it occurs inside that subpattern. (If |
| 1127 | not, it is a "subroutine" call, which is described in the next |
| 1128 | section.) The special item (?R) is a recursive call of the entire |
| 1129 | regular expression. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1130 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1131 | For example, this PCRE pattern solves the nested parentheses problem |
| 1132 | (assume the PCRE_EXTENDED option is set so that white space is |
| 1133 | ignored): |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1134 | |
| 1135 | \( ( (?>[^()]+) | (?R) )* \) |
| 1136 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1137 | First it matches an opening parenthesis. Then it matches any number of |
| 1138 | substrings which can either be a sequence of non-parentheses, or a |
| 1139 | recursive match of the pattern itself (that is, a correctly |
| 1140 | parenthesized substring). Finally there is a closing parenthesis. |
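Python's re module has no (?R), so the following is not a regular expression at all. It is a hand-written sketch that follows the same three steps as the pattern above, to show which strings the recursion accepts. The function name match_parens is hypothetical.

```python
def match_parens(s, i=0):
    """Return the index just past a balanced group starting at s[i],
    or None if there is no such group."""
    # Step 1: an opening parenthesis is required.
    if i >= len(s) or s[i] != "(":
        return None
    i += 1
    # Step 2: any number of non-parentheses or nested groups.
    while i < len(s) and s[i] != ")":
        if s[i] == "(":
            i = match_parens(s, i)  # plays the role of (?R)
            if i is None:
                return None
        else:
            i += 1
    # Step 3: a closing parenthesis is required.
    return i + 1 if i < len(s) else None


print(match_parens("(ab(cd)ef)"), match_parens("(a(b)"))
```

A subject is accepted as a whole when the returned index equals its length, which corresponds to anchoring the pattern at both ends.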
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1141 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1142 | If this were part of a larger pattern, you would not want to recurse |
| 1143 | the entire pattern, so instead you could use this: |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1144 | |
| 1145 | ( \( ( (?>[^()]+) | (?1) )* \) ) |
| 1146 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1147 | We have put the pattern into parentheses, and caused the recursion to |
| 1148 | refer to them instead of the whole pattern. In a larger pattern, |
| 1149 | keeping track of parenthesis numbers can be tricky. It may be more |
| 1150 | convenient to use named parentheses instead. For this, PCRE uses |
| 1151 | (?P>name), which is an extension to the Python syntax that PCRE uses |
| 1152 | for named parentheses (Perl does not provide named parentheses). We |
| 1153 | could rewrite the above example as follows: |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1154 | |
| 1155 | (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) ) |
| 1156 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1157 | This particular example pattern contains nested unlimited repeats, and |
| 1158 | so the use of atomic grouping for matching strings of non-parentheses |
| 1159 | is important when applying the pattern to strings that do not match. |
| 1160 | For example, when this pattern is applied to |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1161 | |
| 1162 | (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
| 1163 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1164 | it yields "no match" quickly. However, if atomic grouping is not used, |
| 1165 | the match runs for a very long time indeed because there are so many |
| 1166 | different ways the + and * repeats can carve up the subject, and all |
| 1167 | have to be tested before failure can be reported. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1168 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1169 | At the end of a match, the values set for any capturing subpatterns are |
| 1170 | those from the outermost level of the recursion at which the subpattern |
| 1171 | value is set. If you want to obtain intermediate values, a callout |
| 1172 | function can be used (see below and the pcrecallout documentation). If |
| 1173 | the pattern above is matched against |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1174 | |
| 1175 | (ab(cd)ef) |
| 1176 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1177 | the value for the capturing parentheses is "ef", which is the last |
| 1178 | value taken on at the top level. If additional parentheses are added, |
| 1179 | giving |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1180 | |
| 1181 | \( ( ( (?>[^()]+) | (?R) )* ) \) |
| 1182 |    ^                        ^ |
| 1183 |    ^                        ^ |
| 1184 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1185 | the string they capture is "ab(cd)ef", the contents of the top level |
| 1186 | parentheses. If there are more than 15 capturing parentheses in a |
| 1187 | pattern, PCRE has to obtain extra memory to store data during a |
| 1188 | recursion, which it does by using pcre_malloc, freeing it via pcre_free |
| 1189 | afterwards. If no memory can be obtained, the match fails with the |
| 1190 | PCRE_ERROR_NOMEMORY error. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1191 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1192 | Do not confuse the (?R) item with the condition (R), which tests for |
| 1193 | recursion. Consider this pattern, which matches text in angle |
| 1194 | brackets, allowing for arbitrary nesting. Only digits are allowed in |
| 1195 | nested brackets (that is, when recursing), whereas any characters are |
| 1196 | permitted at the outer level. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1197 | |
| 1198 | < (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
| 1199 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1200 | In this pattern, (?(R) is the start of a conditional subpattern, with |
| 1201 | two different alternatives for the recursive and non-recursive cases. |
| 1202 | The (?R) item is the actual recursive call. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1203 | |
| 1204 | SUBPATTERNS AS SUBROUTINES |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1205 | If the syntax for a recursive subpattern reference (either by number or |
| 1206 | by name) is used outside the parentheses to which it refers, it |
| 1207 | operates like a subroutine in a programming language. An earlier |
| 1208 | example pointed out that the pattern |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1209 | |
| 1210 | (sens|respons)e and \1ibility |
| 1211 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1212 | matches "sense and sensibility" and "response and responsibility", but |
| 1213 | not "sense and responsibility". If instead the pattern |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1214 | |
| 1215 | (sens|respons)e and (?1)ibility |
| 1216 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1217 | is used, it does match "sense and responsibility" as well as the other |
| 1218 | two strings. Such references must, however, follow the subpattern to |
| 1219 | which they refer. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1220 | |
| 1221 | CALLOUTS |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1222 | Perl has a feature whereby using the sequence (?{...}) causes arbitrary |
| 1223 | Perl code to be obeyed in the middle of matching a regular expression. |
| 1224 | This makes it possible, amongst other things, to extract different |
| 1225 | substrings that match the same pair of parentheses when there is a |
| 1226 | repetition. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1227 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1228 | PCRE provides a similar feature, but of course it cannot obey arbitrary |
| 1229 | Perl code. The feature is called "callout". The caller of PCRE provides |
| 1230 | an external function by putting its entry point in the global variable |
| 1231 | pcre_callout. By default, this variable contains NULL, which disables |
| 1232 | all calling out. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1233 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1234 | Within a regular expression, (?C) indicates the points at which the |
| 1235 | external function is to be called. If you want to identify different |
| 1236 | callout points, you can put a number less than 256 after the letter C. |
| 1237 | The default value is zero. For example, this pattern has two callout |
| 1238 | points: |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1239 | |
| 1240 | (?C1)abc(?C2)def |
| 1241 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1242 | During matching, when PCRE reaches a callout point (and pcre_callout is |
| 1243 | set), the external function is called. It is provided with the number |
| 1244 | of the callout, and, optionally, one item of data originally supplied |
| 1245 | by the caller of pcre_exec(). The callout function may cause matching |
| 1246 | to backtrack, or to fail altogether. A complete description of the |
| 1247 | interface to the callout function is given in the pcrecallout |
| 1248 | documentation. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1249 | |
| 1250 | DIFFERENCES FROM PERL |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1251 | This section describes the differences in the ways that PCRE and Perl |
| 1252 | handle regular expressions. The differences described here are with |
| 1253 | respect to Perl 5.8. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1254 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1255 | 1. PCRE does not have full UTF-8 support. Details of what it does have |
| 1256 | are given in the section on UTF-8 support in the main pcre page. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1257 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1258 | 2. PCRE does not allow repeat quantifiers on lookahead assertions. |
| 1259 | Perl permits them, but they do not mean what you might think. For |
| 1260 | example, (?!a){3} does not assert that the next three characters are |
| 1261 | not "a". It just asserts that the next character is not "a" three |
| 1262 | times. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1263 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1264 | 3. Capturing subpatterns that occur inside negative lookahead |
| 1265 | assertions are counted, but their entries in the offsets vector are |
| 1266 | never set. Perl sets its numerical variables from any such patterns |
| 1267 | that are matched before the assertion fails to match something |
| 1268 | (thereby succeeding), but only if the negative lookahead assertion |
| 1269 | contains just one branch. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1270 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1271 | 4. Though binary zero characters are supported in the subject string, |
| 1272 | they are not allowed in a pattern string because it is passed as a |
| 1273 | normal C string, terminated by zero. The escape sequence "\0" can be |
| 1274 | used in the pattern to represent a binary zero. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1275 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1276 | 5. The following Perl escape sequences are not supported: \l, \u, \L, |
| 1277 | \U, \P, \p, \N, and \X. In fact these are implemented by Perl's general |
| 1278 | string-handling and are not part of its pattern matching engine. If any |
| 1279 | of these are encountered by PCRE, an error is generated. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1280 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1281 | 6. PCRE does support the \Q...\E escape for quoting substrings. |
| 1282 | Characters in between are treated as literals. This is slightly |
| 1283 | different from Perl in that $ and @ are also handled as literals inside |
| 1284 | the quotes. In Perl, they cause variable interpolation (but of course |
| 1285 | PCRE does not have variables). Note the following examples: |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1286 | |
| 1287 | Pattern PCRE matches Perl matches |
| 1288 | |
| 1289 | \Qabc$xyz\E abc$xyz abc followed by the |
| 1290 | contents of $xyz |
| 1291 | \Qabc\$xyz\E abc\$xyz abc\$xyz |
| 1292 | \Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
| 1293 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1294 | The \Q...\E sequence is recognized both inside and outside character |
| 1295 | classes. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1296 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1297 | 7. Fairly obviously, PCRE does not support the (?{code}) and |
| 1298 | (?p{code}) constructions. However, there is some experimental support |
| 1299 | for recursive patterns using the non-Perl items (?R), (?number) and |
| 1300 | (?P>name). Also, the PCRE "callout" feature allows an external function |
| 1301 | to be called during pattern matching. |
MG Mud User | 88f1247 | 2016-06-24 23:31:02 +0200 | [diff] [blame] | 1302 | |
Zesstra | 7ea4a03 | 2019-11-26 20:11:40 +0100 | [diff] [blame] | 1303 | 8. There are some differences that are concerned with the settings of |
| 1304 | captured strings when part of a pattern is repeated. For example, |
| 1305 | matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2 |
| 1306 | unset, but in PCRE it is set to "b". |

9. PCRE provides some extensions to the Perl regular expression
facilities:

(a) Although lookbehind assertions must match fixed length strings,
each alternative branch of a lookbehind assertion can match a different
length of string. Perl requires them all to have the same length.

(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
meta-character matches only at the very end of the string.

(c) If PCRE_EXTRA is set, a backslash followed by a letter with no
special meaning is faulted.

(d) If PCRE_UNGREEDY is set, the greediness of the repetition
quantifiers is inverted, that is, by default they are not greedy, but
if followed by a question mark they are.

(e) PCRE_ANCHORED can be used to force a pattern to be tried only at
the first matching position in the subject string.

(f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and
PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no Perl equivalents.

(g) The (?R), (?number), and (?P>name) constructs allow for recursive
pattern matching (Perl can do this using the (?p{code}) construct,
which PCRE cannot support).

(h) PCRE supports named capturing substrings, using the Python syntax.

(i) PCRE supports the possessive quantifier "++" syntax, taken from
Sun's Java package.

(j) The (R) condition, for testing recursion, is a PCRE extension.

(k) The callout facility is PCRE-specific.
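
Two of the points above can be tried outside the driver. The sketch
below uses Python's re module purely as a stand-in: it is not PCRE, but
it keeps the last value captured by a group inside a repeat (the
behaviour item 8 describes for PCRE), and it uses the same
(?P<name>...) syntax that 9 (h) refers to. The subject strings and
variable names are made up for illustration.

```python
import re

# Item 8: the inner group takes no part in the final iteration of the
# repeat, yet it keeps the text it captured in the first iteration.
m = re.match(r"^(a(b)?)+$", "aba")
outer, inner = m.group(1), m.group(2)

# Item 9 (h): a named capturing group, plus a back-reference to it by name.
doubled = re.compile(r"\b(?P<word>\w+) (?P=word)\b")
hit = doubled.search("this is is a test")
repeated_word = hit.group("word")

print(outer, inner, repeated_word)
```

Other extensions from the list, such as recursion with (?R) or
lookbehind branches of different lengths, are not available in Python's
re module and need a real PCRE build to try out.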

NOTES
The \< and \> metacharacters from Henry Spencer's package
are not available in PCRE, but can be emulated with \b,
if necessary in conjunction with \W or \w.
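
A rough sketch of that emulation, written with Python's re module as a
stand-in (its \b, \w and lookaround assertions behave like PCRE's for
plain ASCII text): \b(?=\w) plays the role of \< and \b(?<=\w) the role
of \>. The names WORD_START and WORD_END are illustrative only.

```python
import re

# \b with a word character ahead: start of a word, like \<.
WORD_START = re.compile(r"\b(?=\w)")
# \b with a word character behind: end of a word, like \>.
WORD_END = re.compile(r"\b(?<=\w)")

text = "foo bar"
starts = [m.start() for m in WORD_START.finditer(text)]
ends = [m.start() for m in WORD_END.finditer(text)]
print(starts, ends)
```

Both patterns match the empty string, so they can be placed in front of
or behind a longer pattern without consuming any characters.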

In LDMud, backtracks are limited by the EVAL_COST runtime
limit, to avoid freezing the driver with a match
like regexp(({"=XX==================="}), "X(.+)+X").
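
The pattern above backtracks heavily because (.+)+ can split the same
run of characters in exponentially many ways. Where the capture is not
needed, X.+X accepts the same subjects without the nested quantifier.
The sketch below checks that equivalence on a few short subjects, using
Python's re module as a stand-in for the driver's regexp(). The sample
strings are invented and kept short so that the nested form still
finishes quickly.

```python
import re

NESTED = re.compile(r"X(.+)+X")  # nested quantifier, as in the note above
FLAT = re.compile(r"X.+X")       # accepts the same subjects, one quantifier

subjects = ["=XX=====", "XaX", "XX", "X==X==", "no match here"]
results = [(s, bool(NESTED.search(s)), bool(FLAT.search(s))) for s in subjects]

for subject, nested_hit, flat_hit in results:
    print(subject, nested_hit, flat_hit)
```

The two patterns agree on whether a subject matches. They differ only in
what group 1 would contain, so the rewrite applies when the capture is
not used.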

LDMud doesn't support PCRE callouts.

LIMITATIONS
There are some size limitations in PCRE but it is hoped that
they will never in practice be relevant. The maximum length
of a compiled pattern is 65539 (sic) bytes. All values in
repeating quantifiers must be less than 65536. The
maximum number of capturing subpatterns is 65535. There is no
limit to the number of non-capturing subpatterns, but the
maximum depth of nesting of all kinds of parenthesized
subpattern, including capturing subpatterns, assertions,
and other types of subpattern, is 200.

The maximum length of a subject string is the largest
positive number that an integer variable can hold. However,
PCRE uses recursion to handle subpatterns and indefinite
repetition. This means that the available stack space may
limit the size of a subject string that can be processed by
certain patterns.

AUTHOR
Philip Hazel <ph10@cam.ac.uk>
University Computing Service,
New Museums Site,
Cambridge CB2 3QG, England.
Phone: +44 1223 334714

SEE ALSO
regexp(C), hsregexp(C)