SYNOPSIS
        PCRE - Perl-compatible regular expressions

DESCRIPTION
        This document describes the regular expressions supported by the PCRE
        package. When the package is compiled into the driver, the macro
        __PCRE__ is defined.

        Most of this manpage is lifted directly from the original PCRE manpage
        (dated January 2003).

        The PCRE library is a set of functions that implement regular
        expression pattern matching using the same syntax and semantics as
        Perl 5, with just a few differences (see below). The current
        implementation corresponds to Perl 5.005, with some additional features
        from later versions. This includes some experimental, incomplete
        support for UTF-8 encoded strings. Details of exactly what is and what
        is not supported are given below.

PCRE REGULAR EXPRESSION DETAILS
        The syntax and semantics of the regular expressions supported by PCRE
        are described below. Regular expressions are also described in the Perl
        documentation and in a number of other books, some of which have
        copious examples. Jeffrey Friedl's "Mastering Regular Expressions",
        published by O'Reilly, covers them in great detail. The description
        here is intended as reference documentation.

        The basic operation of PCRE is on strings of bytes. However, there is
        also support for UTF-8 character strings. To use this support you must
        build PCRE to include UTF-8 support, and then call pcre_compile() with
        the PCRE_UTF8 option. How this affects the pattern matching is
        mentioned in several places below. There is also a summary of UTF-8
        features in the section on UTF-8 support in the main pcre page.

        A regular expression is a pattern that is matched against a subject
        string from left to right. Most characters stand for themselves in a
        pattern, and match the corresponding characters in the subject. As a
        trivial example, the pattern

          The quick brown fox

        matches a portion of a subject string that is identical to itself. The
        power of regular expressions comes from the ability to include
        alternatives and repetitions in the pattern. These are encoded in the
        pattern by the use of meta-characters, which do not stand for
        themselves but instead are interpreted in some special way.

        There are two different sets of meta-characters: those that are
        recognized anywhere in the pattern except within square brackets, and
        those that are recognized in square brackets. Outside square brackets,
        the meta-characters are as follows:

          \      general escape character with several uses
          ^      assert start of string (or line, in multiline mode)
          $      assert end of string (or line, in multiline mode)
          .      match any character except newline (by default)
          [      start character class definition
          |      start of alternative branch
          (      start subpattern
          )      end subpattern
          ?      extends the meaning of (
                 also 0 or 1 quantifier
                 also quantifier minimizer
          *      0 or more quantifier
          +      1 or more quantifier
                 also "possessive quantifier"
          {      start min/max quantifier

        Part of a pattern that is in square brackets is called a "character
        class". In a character class the only meta-characters are:

          \      general escape character
          ^      negate the class, but only if the first character
          -      indicates character range
          [      POSIX character class (only if followed by POSIX
                   syntax)
          ]      terminates the character class

        The following sections describe the use of each of the meta-characters.

BACKSLASH
        The backslash character has several uses. Firstly, if it is followed by
        a non-alphanumeric character, it takes away any special meaning that
        character may have. This use of backslash as an escape character
        applies both inside and outside character classes.

        For example, if you want to match a * character, you write \* in the
        pattern. This escaping action applies whether or not the following
        character would otherwise be interpreted as a meta-character, so it is
        always safe to precede a non-alphanumeric character with a backslash
        to specify that it stands for itself. In particular, if you want to
        match a backslash, you write \\.
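
        The escaping rule can be tried out with a short sketch. It uses
        Python's re module purely as a stand-in, on the assumption that it
        agrees with PCRE for these simple escapes; it is not the PCRE
        library, and the sample strings are made up for illustration.

```python
import re

# Raw strings pass the backslashes through to the regex engine unchanged.
star = re.compile(r"\*")         # matches a literal "*"
backslash = re.compile(r"\\")    # matches a literal backslash
at_sign = re.compile(r"\@")      # "@" is not a meta-character; escaping is harmless

stars = star.findall("2*3*4")
slashes = backslash.findall("a\\b")      # the subject is the 3 characters a \ b
found_at = at_sign.search("user@host") is not None

print(stars, slashes, found_at)
```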

        If a pattern is compiled with the PCRE_EXTENDED option, whitespace in
        the pattern (other than in a character class) and characters between a
        # outside a character class and the next newline character are ignored.
        An escaping backslash can be used to include a whitespace or #
        character as part of the pattern.

        If you want to remove the special meaning from a sequence of
        characters, you can do so by putting them between \Q and \E. This is
        different from Perl in that $ and @ are handled as literals in \Q...\E
        sequences in PCRE, whereas in Perl, $ and @ cause variable
        interpolation. Note the following examples:

          Pattern            PCRE matches   Perl matches

          \Qabc$xyz\E        abc$xyz        abc followed by the
                                              contents of $xyz
          \Qabc\$xyz\E       abc\$xyz       abc\$xyz
          \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz

        The \Q...\E sequence is recognized both inside and outside character
        classes.

        A second use of backslash provides a way of encoding non-printing
        characters in patterns in a visible manner. There is no restriction on
        the appearance of non-printing characters, apart from the binary zero
        that terminates a pattern, but when a pattern is being prepared by text
        editing, it is usually easier to use one of the following escape
        sequences than the binary character it represents:

          \a         alarm, that is, the BEL character (hex 07)
          \cx        "control-x", where x is any character
          \e         escape (hex 1B)
          \f         formfeed (hex 0C)
          \n         newline (hex 0A)
          \r         carriage return (hex 0D)
          \t         tab (hex 09)
          \ddd       character with octal code ddd, or backreference
          \xhh       character with hex code hh
          \x{hhh..}  character with hex code hhh... (UTF-8 mode only)

        The precise effect of \cx is as follows: if x is a lower case letter,
        it is converted to upper case. Then bit 6 of the character (hex 40) is
        inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
        becomes hex 7B.
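
        The \cx rule is plain arithmetic and can be checked with a few
        lines of Python. The function name control_char is hypothetical
        and exists only for this illustration; it is not part of PCRE.

```python
def control_char(x: str) -> int:
    """Return the byte value denoted by \\cx, following the rule above."""
    if len(x) != 1 or ord(x) > 127:
        raise ValueError("expected a single ASCII character")
    if "a" <= x <= "z":
        x = x.upper()          # lower case letters are folded to upper case
    return ord(x) ^ 0x40       # then bit 6 (hex 40) is inverted

for ch in "z{;":
    print(f"\\c{ch} -> hex {control_char(ch):02X}")
```

        Run as written, this reports hex 1A, 3B and 7B for \cz, \c{ and
        \c; respectively, matching the values quoted in the text.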

        After \x, from zero to two hexadecimal digits are read (letters can be
        in upper or lower case). In UTF-8 mode, any number of hexadecimal
        digits may appear between \x{ and }, but the value of the character
        code must be less than 2**31 (that is, the maximum hexadecimal value is
        7FFFFFFF). If characters other than hexadecimal digits appear between
        \x{ and }, or if there is no terminating }, this form of escape is not
        recognized. Instead, the initial \x will be interpreted as a basic
        hexadecimal escape, with no following digits, giving a byte whose value
        is zero.

        Characters whose value is less than 256 can be defined by either of the
        two syntaxes for \x when PCRE is in UTF-8 mode. There is no difference
        in the way they are handled. For example, \xdc is exactly the same as
        \x{dc}.

        After \0 up to two further octal digits are read. In both cases (\x
        and \0), if there are fewer than two digits, just those that are
        present are used. Thus the sequence \0\x\07 specifies two binary zeros
        followed by a BEL character (code value 7). Make sure you supply two
        digits after the initial zero if the character that follows is itself
        an octal digit.

        The handling of a backslash followed by a digit other than 0 is
        complicated. Outside a character class, PCRE reads it and any following
        digits as a decimal number. If the number is less than 10, or if there
        have been at least that many previous capturing left parentheses in the
        expression, the entire sequence is taken as a back reference. A
        description of how this works is given later, following the discussion
        of parenthesized subpatterns.

        Inside a character class, or if the decimal number is greater than 9
        and there have not been that many capturing subpatterns, PCRE re-reads
        up to three octal digits following the backslash, and generates a
        single byte from the least significant 8 bits of the value. Any
        subsequent digits stand for themselves. For example:

          \040   is another way of writing a space
          \40    is the same, provided there are fewer than 40
                   previous capturing subpatterns
          \7     is always a back reference
          \11    might be a back reference, or another way of
                   writing a tab
          \011   is always a tab
          \0113  is a tab followed by the character "3"
          \113   might be a back reference, otherwise the
                   character with octal code 113
          \377   might be a back reference, otherwise
                   the byte consisting entirely of 1 bits
          \81    is either a back reference, or a binary zero
                   followed by the two characters "8" and "1"

        Note that octal values of 100 or greater must not be introduced by a
        leading zero, because no more than three octal digits are ever read.
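
        The decision procedure for a backslash followed by digits can be
        sketched as a small function. This is an illustration of the rules
        as stated above, not PCRE's actual source; the function name and
        its return format are assumptions made for the example.

```python
def classify_digit_escape(digits: str, groups: int, in_class: bool = False):
    """Interpret the decimal digits that follow a backslash.

    digits   -- all consecutive decimal digits after the backslash
    groups   -- number of capturing left parentheses seen so far
    in_class -- True when the escape is inside a character class
    Returns ("backref", n) or ("byte", value, trailing_literal_digits).
    """
    if not digits or any(c not in "0123456789" for c in digits):
        raise ValueError("expected one or more decimal digits")
    if digits[0] != "0" and not in_class:
        number = int(digits)
        if number < 10 or number <= groups:
            return ("backref", number)
    # Otherwise re-read up to three octal digits and keep the low 8 bits.
    n = 0
    while n < 3 and n < len(digits) and digits[n] in "01234567":
        n += 1
    value = (int(digits[:n], 8) & 0xFF) if n else 0
    return ("byte", value, digits[n:])

for d in ("040", "40", "7", "11", "011", "0113", "113", "377", "81"):
    print("\\" + d, classify_digit_escape(d, groups=0))
```

        With groups=0 the loop reproduces the table: \40 and \040 give
        byte 32 (space), \0113 gives a tab followed by a literal "3", and
        \81 gives a zero byte followed by the literal digits "81".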

        All the sequences that define a single byte value or a single UTF-8
        character (in UTF-8 mode) can be used both inside and outside character
        classes. In addition, inside a character class, the sequence \b is
        interpreted as the backspace character (hex 08). Outside a character
        class it has a different meaning (see below).

        The third use of backslash is for specifying generic character types:

          \d     any decimal digit
          \D     any character that is not a decimal digit
          \s     any whitespace character
          \S     any character that is not a whitespace character
          \w     any "word" character
          \W     any "non-word" character

        Each pair of escape sequences partitions the complete set of characters
        into two disjoint sets. Any given character matches one, and only one,
        of each pair.

        In UTF-8 mode, characters with values greater than 255 never match \d,
        \s, or \w, and always match \D, \S, and \W.

        For compatibility with Perl, \s does not match the VT character (code
        11). This makes it different from the POSIX "space" class. The \s
        characters are HT (9), LF (10), FF (12), CR (13), and space (32).

        A "word" character is any letter or digit or the underscore character,
        that is, any character which can be part of a Perl "word". The
        definition of letters and digits is controlled by PCRE's character
        tables, and may vary if locale-specific matching is taking place (see
        "Locale support" in the pcreapi page). For example, in the "fr"
        (French) locale, some character codes greater than 128 are used for
        accented letters, and these are matched by \w.

        These character type sequences can appear both inside and outside
        character classes. They each match one character of the appropriate
        type. If the current matching point is at the end of the subject
        string, all of them fail, since there is no character to match.

        The fourth use of backslash is for certain simple assertions. An
        assertion specifies a condition that has to be met at a particular
        point in a match, without consuming any characters from the subject
        string. The use of subpatterns for more complicated assertions is
        described below. The backslashed assertions are:

          \b     matches at a word boundary
          \B     matches when not at a word boundary
          \A     matches at start of subject
          \Z     matches at end of subject or before newline at end
          \z     matches at end of subject
          \G     matches at first matching position in subject

        These assertions may not appear in character classes (but note that \b
        has a different meaning, namely the backspace character, inside a
        character class).

        A word boundary is a position in the subject string where the current
        character and the previous character do not both match \w or \W (i.e.
        one matches \w and the other matches \W), or the start or end of the
        string if the first or last character matches \w, respectively.
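
        The word boundary definition can be written out directly and
        compared with a regex engine. The sketch below assumes the default
        (ASCII) character tables and uses Python's re module with re.ASCII
        as a stand-in whose \b follows the same definition; it is not the
        PCRE library. The helper name is_word_boundary is hypothetical.

```python
import re
import string

# "Word" characters under the default tables: letters, digits, underscore.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")

def is_word_boundary(subject: str, pos: int) -> bool:
    """True if exactly one of the characters around pos is a word character."""
    prev_is_word = pos > 0 and subject[pos - 1] in WORD_CHARS
    curr_is_word = pos < len(subject) and subject[pos] in WORD_CHARS
    return prev_is_word != curr_is_word

subject = "hi there"
by_definition = [p for p in range(len(subject) + 1) if is_word_boundary(subject, p)]
by_engine = [m.start() for m in re.finditer(r"\b", subject, re.ASCII)]
print(by_definition, by_engine)
```

        Treating "before the start" and "after the end" as non-word
        positions covers the start-of-string and end-of-string cases
        without special handling.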

        The \A, \Z, and \z assertions differ from the traditional circumflex
        and dollar (described below) in that they only ever match at the very
        start and end of the subject string, whatever options are set. Thus,
        they are independent of multiline mode.

        They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options. If the
        startoffset argument of pcre_exec() is non-zero, indicating that
        matching is to start at a point other than the beginning of the
        subject, \A can never match. The difference between \Z and \z is that
        \Z matches before a newline that is the last character of the string as
        well as at the end of the string, whereas \z matches only at the end.

        The \G assertion is true only when the current matching position is at
        the start point of the match, as specified by the startoffset argument
        of pcre_exec(). It differs from \A when the value of startoffset is
        non-zero. By calling pcre_exec() multiple times with appropriate
        arguments, you can mimic Perl's /g option, and it is in this kind of
        implementation where \G can be useful.
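
        The repeated-call technique can be sketched in Python, with a
        loud caveat: Python's re module has no \G, so the example swaps in
        Pattern.match(subject, pos), which only succeeds when a match
        starts exactly at pos. That plays the role of a pattern beginning
        with \G run with startoffset set to pos. The token pattern and the
        function name are made up for the illustration.

```python
import re

# Hypothetical token pattern; every alternative consumes at least one
# character, so the loop below always makes progress.
TOKEN = re.compile(r"\d+|[a-z]+|\s+")

def tokens(subject: str):
    """Collect consecutive tokens, stopping at the first position with none."""
    pos = 0
    found = []
    while pos < len(subject):
        m = TOKEN.match(subject, pos)   # anchored at pos, like \G at startoffset
        if m is None:
            break                       # nothing starts exactly here
        found.append(m.group())
        pos = m.end()                   # the next call resumes where this one ended
    return found

result = tokens("abc 123!x")
print(result)
```

        Without the anchoring, a search would skip over the "!" and pick
        up the trailing "x"; anchoring each attempt at the previous end is
        what makes the scan stop there instead.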

        Note, however, that PCRE's interpretation of \G, as the start of the
        current match, is subtly different from Perl's, which defines it as the
        end of the previous match. In Perl, these can be different when the
        previously matched string was empty. Because PCRE does just one match
        at a time, it cannot reproduce this behaviour.

        If all the alternatives of a pattern begin with \G, the expression is
        anchored to the starting match position, and the "anchored" flag is set
        in the compiled regular expression.

CIRCUMFLEX AND DOLLAR
        Outside a character class, in the default matching mode, the circumflex
        character is an assertion which is true only if the current matching
        point is at the start of the subject string. If the startoffset
        argument of pcre_exec() is non-zero, circumflex can never match if the
        PCRE_MULTILINE option is unset. Inside a character class, circumflex
        has an entirely different meaning (see below).

        Circumflex need not be the first character of the pattern if a number
        of alternatives are involved, but it should be the first thing in each
        alternative in which it appears if the pattern is ever to match that
        branch. If all possible alternatives start with a circumflex, that is,
        if the pattern is constrained to match only at the start of the
        subject, it is said to be an "anchored" pattern. (There are also other
        constructs that can cause a pattern to be anchored.)

        A dollar character is an assertion which is true only if the current
        matching point is at the end of the subject string, or immediately
        before a newline character that is the last character in the string (by
        default). Dollar need not be the last character of the pattern if a
        number of alternatives are involved, but it should be the last item in
        any branch in which it appears. Dollar has no special meaning in a
        character class.

        The meaning of dollar can be changed so that it matches only at the
        very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
        compile time. This does not affect the \Z assertion.

        The meanings of the circumflex and dollar characters are changed if the
        PCRE_MULTILINE option is set. When this is the case, they match
        immediately after and immediately before an internal newline character,
        respectively, in addition to matching at the start and end of the
        subject string. For example, the pattern /^abc$/ matches the subject
        string "def\nabc" in multiline mode, but not otherwise. Consequently,
        patterns that are anchored in single line mode because all branches
        start with ^ are not anchored in multiline mode, and a match for
        circumflex is possible when the startoffset argument of pcre_exec() is
        non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE
        is set.
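
        The /^abc$/ example can be reproduced with Python's re module,
        used here only as a stand-in on the assumption that its ^, $ and
        re.MULTILINE behave like PCRE's circumflex, dollar and
        PCRE_MULTILINE for this case; it is not the PCRE library.

```python
import re

subject = "def\nabc"

single_line = re.search(r"^abc$", subject)
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# By default, dollar also matches just before a newline that ends the subject.
before_final_newline = re.search(r"abc$", "abc\n")

print(single_line, multi_line.span(), before_final_newline.span())
```

        In single line mode the circumflex can only match at offset 0, so
        the search fails; in multiline mode it matches after the internal
        newline, at offset 4.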

        Note that the sequences \A, \Z, and \z can be used to match the start
        and end of the subject in both modes, and if all branches of a pattern
        start with \A it is always anchored, whether PCRE_MULTILINE is set or
        not.

FULL STOP (PERIOD, DOT)
        Outside a character class, a dot in the pattern matches any one
        character in the subject, including a non-printing character, but not
        (by default) newline. In UTF-8 mode, a dot matches any UTF-8 character,
        which might be more than one byte long, except (by default) for
        newline. If the PCRE_DOTALL option is set, dots match newlines as well.
        The handling of dot is entirely independent of the handling of
        circumflex and dollar, the only relationship being that they both
        involve newline characters. Dot has no special meaning in a character
        class.
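
        A brief sketch of the dot rules, again using Python's re module
        as a stand-in (re.DOTALL and re.MULTILINE are assumed to
        correspond to PCRE_DOTALL and PCRE_MULTILINE here). The subject
        string is made up for the illustration.

```python
import re

subject = "abc a\nc"

default_dot = re.findall(r"a.c", subject)
dotall_dot = re.findall(r"a.c", subject, re.DOTALL)
# Multiline mode changes circumflex and dollar only; dot is unaffected.
multiline_dot = re.findall(r"a.c", subject, re.MULTILINE)

print(default_dot, dotall_dot, multiline_dot)
```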

MATCHING A SINGLE BYTE
        Outside a character class, the escape sequence \C matches any one byte,
        both in and out of UTF-8 mode. Unlike a dot, it always matches a
        newline. The feature is provided in Perl in order to match individual
        bytes in UTF-8 mode. Because it breaks up UTF-8 characters into
        individual bytes, what remains in the string may be a malformed UTF-8
        string. For this reason it is best avoided.

        PCRE does not allow \C to appear in lookbehind assertions (see below),
        because in UTF-8 mode it makes it impossible to calculate the length of
        the lookbehind.

SQUARE BRACKETS
        An opening square bracket introduces a character class, terminated by a
        closing square bracket. A closing square bracket on its own is not
        special. If a closing square bracket is required as a member of the
        class, it should be the first data character in the class (after an
        initial circumflex, if present) or escaped with a backslash.

        A character class matches a single character in the subject. In UTF-8
        mode, the character may occupy more than one byte. A matched character
        must be in the set of characters defined by the class, unless the first
        character in the class definition is a circumflex, in which case the
        subject character must not be in the set defined by the class. If a
        circumflex is actually required as a member of the class, ensure it is
        not the first character, or escape it with a backslash.

        For example, the character class [aeiou] matches any lower case vowel,
        while [^aeiou] matches any character that is not a lower case vowel.
        Note that a circumflex is just a convenient notation for specifying the
        characters which are in the class by enumerating those that are not. It
        is not an assertion: it still consumes a character from the subject
        string, and fails if the current pointer is at the end of the string.

        In UTF-8 mode, characters with values greater than 255 can be included
        in a class as a literal string of bytes, or by using the \x{ escaping
        mechanism.

        When caseless matching is set, any letters in a class represent both
        their upper case and lower case versions, so for example, a caseless
        [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
        match "A", whereas a caseful version would. PCRE does not support the
        concept of case for characters with values greater than 255.

        The newline character is never treated in any special way in character
        classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE
        options is. A class such as [^a] will always match a newline.
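
        The points made above about negated classes (they consume a
        character, they respect caseless matching, and they match
        newline) can be checked with a short sketch. Python's re module
        stands in for PCRE on the assumption that the two agree for these
        simple classes; re.IGNORECASE is taken as the counterpart of
        PCRE_CASELESS.

```python
import re

non_vowels = re.findall(r"[^aeiou]", "queue!")

# A negated class is not an assertion: at the end of the subject there is
# no character left to consume, so the match fails.
at_end = re.search(r"x[^aeiou]", "x")

caseless_vowels = re.findall(r"[aeiou]", "AeIoU", re.IGNORECASE)
caseless_non_vowels = re.findall(r"[^aeiou]", "AxE", re.IGNORECASE)
caseful_non_vowels = re.findall(r"[^aeiou]", "AxE")

# Newline gets no special treatment inside a class.
matches_newline = re.search(r"[^a]", "\n") is not None

print(non_vowels, at_end, caseless_non_vowels, caseful_non_vowels, matches_newline)
```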

        The minus (hyphen) character can be used to specify a range of
        characters in a character class. For example, [d-m] matches any letter
        between d and m, inclusive. If a minus character is required in a
        class, it must be escaped with a backslash or appear in a position
        where it cannot be interpreted as indicating a range, typically as the
        first or last character in the class.

        It is not possible to have the literal character "]" as the end
        character of a range. A pattern such as [W-]46] is interpreted as a
        class of two characters ("W" and "-") followed by a literal string
        "46]", so it would match "W46]" or "-46]". However, if the "]" is
        escaped with a backslash it is interpreted as the end of range, so
        [W-\]46] is interpreted as a single class containing a range followed
        by two separate characters. The octal or hexadecimal representation of
        "]" can also be used to end a range.
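
        The two readings of the "]" examples can be compared side by
        side. Python's re module is used as a stand-in, on the assumption
        that it parses these two classes the same way PCRE does; the
        candidate strings are made up for the illustration.

```python
import re

# Class of "W" and "-" followed by the literal string "46]".
class_then_literal = re.compile(r"[W-]46]")
# One class: the range W..] plus the separate characters "4" and "6".
single_class = re.compile(r"[W-\]46]")

literal_hits = [s for s in ("W46]", "-46]", "X46]")
                if class_then_literal.fullmatch(s)]
class_hits = [s for s in ("W", "Z", "]", "4", "6", "-", "a")
              if single_class.fullmatch(s)]

print(literal_hits, class_hits)
```

        "Z" is accepted by the second pattern because it lies between "W"
        and "]" in character-code order, while "-" no longer is, since the
        hyphen now denotes a range.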

        Ranges operate in the collating sequence of character values. They can
        also be used for characters specified numerically, for example
        [\000-\037]. In UTF-8 mode, ranges can include characters whose values
        are greater than 255, for example [\x{100}-\x{2ff}].

        If a range that includes letters is used when caseless matching is set,
        it matches the letters in either case. For example, [W-c] is equivalent
        to [][\^_`wxyzabc], matched caselessly, and if character tables for the
        "fr" locale are in use, [\xc8-\xcb] matches accented E characters in
        both cases.

        The character types \d, \D, \s, \S, \w, and \W may also appear in a
        character class, and add the characters that they match to the class.
        For example, [\dABCDEF] matches any hexadecimal digit. A circumflex can
        conveniently be used with the upper case character types to specify a
        more restricted set of characters than the matching lower case type.
        For example, the class [^\W_] matches any letter or digit, but not
        underscore.
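
        Both class examples can be tried with Python's re module as a
        stand-in. The re.ASCII flag is an assumption made to approximate
        PCRE's default character tables; the subject strings are made up.

```python
import re

hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

hex_hits = hex_digit.findall("7G_aF")           # lower case "a" is not listed
alnum_hits = letter_or_digit.findall("a_1-Z")   # "_" and "-" are excluded

print(hex_hits, alnum_hits)
```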

        All non-alphanumeric characters other than \, -, ^ (at the start) and
        the terminating ] are non-special in character classes, but it does no
        harm if they are escaped.

POSIX CHARACTER CLASSES
        Perl supports the POSIX notation for character classes, which uses
        names enclosed by [: and :] within the enclosing square brackets. PCRE
        also supports this notation. For example,

          [01[:alpha:]%]

        matches "0", "1", any alphabetic character, or "%". The supported class
        names are

          alnum    letters and digits
          alpha    letters
          ascii    character codes 0 - 127
          blank    space or tab only
          cntrl    control characters
          digit    decimal digits (same as \d)
          graph    printing characters, excluding space
          lower    lower case letters
          print    printing characters, including space
          punct    printing characters, excluding letters and digits
          space    white space (not quite the same as \s)
          upper    upper case letters
          word     "word" characters (same as \w)
          xdigit   hexadecimal digits

        The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
        and space (32). Notice that this list includes the VT character (code
        11). This makes "space" different to \s, which does not include VT (for
        Perl compatibility).

        The name "word" is a Perl extension, and "blank" is a GNU extension
        from Perl 5.8. Another Perl extension is negation, which is indicated
        by a ^ character after the colon. For example,

          [12[:^digit:]]

        matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
        POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
        these are not supported, and an error is given if they are encountered.

        In UTF-8 mode, characters with values greater than 255 do not match any
        of the POSIX character classes.

VERTICAL BAR
        Vertical bar characters are used to separate alternative patterns. For
        example, the pattern

          gilbert|sullivan

        matches either "gilbert" or "sullivan". Any number of alternatives may
        appear, and an empty alternative is permitted (matching the empty
        string). The matching process tries each alternative in turn, from
        left to right, and the first one that succeeds is used. If the
        alternatives are within a subpattern (defined below), "succeeds" means
        matching the rest of the main pattern as well as the alternative in the
        subpattern.
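
        The left-to-right rule, and what "succeeds" means inside a
        subpattern, can be seen in a short sketch. Python's re module is
        used as a stand-in on the assumption that it orders alternatives
        the same way PCRE does; the patterns are made up for the
        illustration.

```python
import re

# The first alternative that succeeds wins, even if a later one is longer.
first_wins = re.match(r"gil|gilbert", "gilbert").group()

# Inside a subpattern the rest of the pattern must match too, so "gil"
# is rejected (it is not followed by the end) and "gilbert" is used.
rest_must_match = re.match(r"(gil|gilbert)$", "gilbert").group(1)

# An empty alternative matches the empty string.
empty_alternative = re.fullmatch(r"gilbert|sullivan|", "") is not None

print(first_wins, rest_must_match, empty_alternative)
```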

INTERNAL OPTION SETTING
        The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
        PCRE_EXTENDED options can be changed from within the pattern by a
        sequence of Perl option letters enclosed between "(?" and ")". The
        option letters are

          i  for PCRE_CASELESS
          m  for PCRE_MULTILINE
          s  for PCRE_DOTALL
          x  for PCRE_EXTENDED

        For example, (?im) sets caseless, multiline matching. It is also
        possible to unset these options by preceding the letter with a hyphen,
        and a combined setting and unsetting such as (?im-sx), which sets
        PCRE_CASELESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and
        PCRE_EXTENDED, is also permitted. If a letter appears both before and
        after the hyphen, the option is unset.

        When an option change occurs at top level (that is, not inside
        subpattern parentheses), the change applies to the remainder of the
        pattern that follows. If the change is placed right at the start of a
        pattern, PCRE extracts it into the global options (and it will
        therefore show up in data extracted by the pcre_fullinfo() function).

        An option change within a subpattern affects only that part of the
        current pattern that follows it, so

          (a(?i)b)c

        matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
        used). By this means, options can be made to have different settings
        in different parts of the pattern. Any changes made in one alternative
        do carry on into subsequent branches within the same subpattern. For
        example,

          (a(?i)b|c)

        matches "ab", "aB", "c", and "C", even though when matching "C" the
        first branch is abandoned before the option setting. This is because
        the effects of option settings happen at compile time. There would be
        some very weird behaviour otherwise.
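
        The scoping of (a(?i)b)c can be approximated in Python, with an
        important caveat: recent versions of Python's re module reject an
        option setting such as (?i) in the middle of a pattern, so the
        sketch swaps in the scoped form (?i:...), which limits the option
        to the enclosed part. This is a substitute technique chosen for
        the illustration, not the PCRE syntax described above.

```python
import re

# Stand-in for the PCRE pattern (a(?i)b)c: only "b" is matched caselessly.
pattern = re.compile(r"(a(?i:b))c")

candidates = ("abc", "aBc", "Abc", "abC", "ABC")
accepted = [s for s in candidates if pattern.fullmatch(s)]

print(accepted)
```

        Only "abc" and "aBc" are accepted, which mirrors the set of
        strings the text gives for (a(?i)b)c.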
MG Mud User88f12472016-06-24 23:31:02 +0200523
Zesstra7ea4a032019-11-26 20:11:40 +0100524 The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed
525 in the same way as the Perl-compatible options by using the characters
526 U and X respectively. The (?X) flag setting is special in that it must
527 always occur earlier in the pattern than any of the additional features
528 it turns on, even when it is at top level. It is best put at the start.
MG Mud User88f12472016-06-24 23:31:02 +0200529
SUBPATTERNS
Zesstra7ea4a032019-11-26 20:11:40 +0100531 Subpatterns are delimited by parentheses (round brackets), which can be
532 nested. Marking part of a pattern as a subpattern does two things:
MG Mud User88f12472016-06-24 23:31:02 +0200533
Zesstra7ea4a032019-11-26 20:11:40 +0100534 1. It localizes a set of alternatives. For example, the pattern
MG Mud User88f12472016-06-24 23:31:02 +0200535
536 cat(aract|erpillar|)
537
Zesstra7ea4a032019-11-26 20:11:40 +0100538 matches one of the words "cat", "cataract", or "caterpillar". Without
539 the parentheses, it would match "cataract", "erpillar" or the empty
540 string.
MG Mud User88f12472016-06-24 23:31:02 +0200541
Zesstra7ea4a032019-11-26 20:11:40 +0100542 2. It sets up the subpattern as a capturing subpattern (as defined
543 above). When the whole pattern matches, that portion of the subject
544 string that matched the subpattern is passed back to the caller via the
545 ovector argument of pcre_exec(). Opening parentheses are counted from
546 left to right (starting from 1) to obtain the numbers of the capturing
547 subpatterns.
MG Mud User88f12472016-06-24 23:31:02 +0200548
Zesstra7ea4a032019-11-26 20:11:40 +0100549 For example, if the string "the red king" is matched against the
550 pattern
MG Mud User88f12472016-06-24 23:31:02 +0200551
552 the ((red|white) (king|queen))
553
Zesstra7ea4a032019-11-26 20:11:40 +0100554 the captured substrings are "red king", "red", and "king", and are
555 numbered 1, 2, and 3, respectively.
MG Mud User88f12472016-06-24 23:31:02 +0200556
Zesstra7ea4a032019-11-26 20:11:40 +0100557 The fact that plain parentheses fulfil two functions is not always
558 helpful. There are often times when a grouping subpattern is required
559 without a capturing requirement. If an opening parenthesis is followed
560 by a question mark and a colon, the subpattern does not do any
561 capturing, and is not counted when computing the number of any
562 subsequent capturing subpatterns. For example, if the string "the white
563 queen" is matched against the pattern
MG Mud User88f12472016-06-24 23:31:02 +0200564
565 the ((?:red|white) (king|queen))
566
Zesstra7ea4a032019-11-26 20:11:40 +0100567 the captured substrings are "white queen" and "queen", and are numbered
568 1 and 2. The maximum number of capturing subpatterns is 65535, and the
569 maximum depth of nesting of all subpatterns, both capturing and
570 noncapturing, is 200.
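        The numbering rule can be checked with a short sketch. It uses
        Python's re module as a stand-in for PCRE (an assumption of this
        example, not part of the driver); both number capturing groups by
        their opening parenthesis and both skip (?: groups.

```python
import re

# Capturing groups are numbered by opening parenthesis, left to right.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
numbered = capturing.groups()        # groups 1, 2 and 3

# (?:...) groups without capturing, so the following group moves down to 2.
non_capturing = re.fullmatch(r"the ((?:red|white) (king|queen))",
                             "the white queen")
renumbered = non_capturing.groups()  # groups 1 and 2

print(numbered)
print(renumbered)
```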
MG Mud User88f12472016-06-24 23:31:02 +0200571
Zesstra7ea4a032019-11-26 20:11:40 +0100572 As a convenient shorthand, if any option settings are required at the
573 start of a non-capturing subpattern, the option letters may appear
574 between the "?" and the ":". Thus the two patterns
MG Mud User88f12472016-06-24 23:31:02 +0200575
576 (?i:saturday|sunday)
577 (?:(?i)saturday|sunday)
578
Zesstra7ea4a032019-11-26 20:11:40 +0100579 match exactly the same set of strings. Because alternative branches are
580 tried from left to right, and options are not reset until the end of
581 the subpattern is reached, an option setting in one branch does affect
582 subsequent branches, so the above patterns match "SUNDAY" as well as
583 "Saturday".
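        As a rough illustration of the shorthand form, the sketch below
        uses Python's re module (an assumption of this example). Python
        accepts (?i:...) from version 3.6 on, but recent versions reject an
        unscoped (?i) that is not at the start of the pattern, so only the
        first of the two patterns is shown.

```python
import re

# The option letter between "?" and ":" applies to every branch in the group.
weekend = re.compile(r"(?i:saturday|sunday)")

subjects = ("Saturday", "SUNDAY", "monday")
matches = [bool(weekend.fullmatch(s)) for s in subjects]
print(dict(zip(subjects, matches)))
```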
MG Mud User88f12472016-06-24 23:31:02 +0200584
NAMED SUBPATTERNS
Zesstra7ea4a032019-11-26 20:11:40 +0100586 Identifying capturing parentheses by number is simple, but it can be
587 very hard to keep track of the numbers in complicated regular
588 expressions. Furthermore, if an expression is modified, the numbers may
589 change. To help with the difficulty, PCRE supports the naming of
590 subpatterns, something that Perl does not provide. The Python syntax
591 (?P<name>...) is used. Names consist of alphanumeric characters and
592 underscores, and must be unique within a pattern.
MG Mud User88f12472016-06-24 23:31:02 +0200593
Zesstra7ea4a032019-11-26 20:11:40 +0100594 Named capturing parentheses are still allocated numbers as well as
595 names. The PCRE API provides function calls for extracting the name-to-
596 number translation table from a compiled pattern. For further details
597 see the pcreapi documentation.
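        Since the syntax is borrowed from Python, a small Python sketch can
        show the idea. The pattern and the names "year" and "month" are
        invented for illustration; groupindex is Python's counterpart of
        the name-to-number table, not the PCRE API call.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.fullmatch("2003-01")

by_name = m.group("year")   # look the group up by name
by_number = m.group(1)      # the same group still has a number

# Mapping from each name to the number of its group.
name_to_number = dict(date.groupindex)
print(by_name, by_number, name_to_number)
```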
MG Mud User88f12472016-06-24 23:31:02 +0200598
REPETITION
Zesstra7ea4a032019-11-26 20:11:40 +0100600 Repetition is specified by quantifiers, which can follow any of the
601 following items:
MG Mud User88f12472016-06-24 23:31:02 +0200602
603 a literal data character
604 the . metacharacter
605 the \C escape sequence
606 escapes such as \d that match single characters
607 a character class
608 a back reference (see next section)
609 a parenthesized subpattern (unless it is an assertion)
610
Zesstra7ea4a032019-11-26 20:11:40 +0100611 The general repetition quantifier specifies a minimum and maximum
612 number of permitted matches, by giving the two numbers in curly
613 brackets (braces), separated by a comma. The numbers must be less than
614 65536, and the first must be less than or equal to the second. For
615 example:
MG Mud User88f12472016-06-24 23:31:02 +0200616
617 z{2,4}
618
Zesstra7ea4a032019-11-26 20:11:40 +0100619 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
620 special character. If the second number is omitted, but the comma is
621 present, there is no upper limit; if the second number and the comma
622 are both omitted, the quantifier specifies an exact number of required
623 matches. Thus
MG Mud User88f12472016-06-24 23:31:02 +0200624
625 [aeiou]{3,}
626
Zesstra7ea4a032019-11-26 20:11:40 +0100627 matches at least 3 successive vowels, but may match many more, while
MG Mud User88f12472016-06-24 23:31:02 +0200628
629 \d{8}
630
Zesstra7ea4a032019-11-26 20:11:40 +0100631 matches exactly 8 digits. An opening curly bracket that appears in a
632 position where a quantifier is not allowed, or one that does not match
633 the syntax of a quantifier, is taken as a literal character. For
634 example, {,6} is not a quantifier, but a literal string of four
635 characters.
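        The three quantifier forms above can be tried out as follows. This
        is a sketch using Python's re module, which is an assumption of the
        example. Note one difference: Python reads {,6} as {0,6}, so the
        literal-string rule for {,6} described above is not reproduced.

```python
import re

def whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# z{2,4}: between two and four repetitions (tested with 1 to 5 letters).
bounded = [whole(r"z{2,4}", "z" * n) for n in range(1, 6)]

# [aeiou]{3,}: at least three vowels, no upper limit.
open_ended = [whole(r"[aeiou]{3,}", s) for s in ("ae", "aei", "aeiouaeiou")]

# \d{8}: exactly eight digits.
exact = [whole(r"\d{8}", s) for s in ("1234567", "12345678", "123456789")]

print(bounded, open_ended, exact)
```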
MG Mud User88f12472016-06-24 23:31:02 +0200636
Zesstra7ea4a032019-11-26 20:11:40 +0100637 In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
638 individual bytes. Thus, for example, \x{100}{2} matches two UTF-8
639 characters, each of which is represented by a two-byte sequence.
MG Mud User88f12472016-06-24 23:31:02 +0200640
Zesstra7ea4a032019-11-26 20:11:40 +0100641 The quantifier {0} is permitted, causing the expression to behave as if
642 the previous item and the quantifier were not present.
MG Mud User88f12472016-06-24 23:31:02 +0200643
Zesstra7ea4a032019-11-26 20:11:40 +0100644 For convenience (and historical compatibility) the three most common
645 quantifiers have single-character abbreviations:
MG Mud User88f12472016-06-24 23:31:02 +0200646
647 * is equivalent to {0,}
648 + is equivalent to {1,}
649 ? is equivalent to {0,1}
650
Zesstra7ea4a032019-11-26 20:11:40 +0100651 It is possible to construct infinite loops by following a subpattern
652 that can match no characters with a quantifier that has no upper limit,
653 for example:
MG Mud User88f12472016-06-24 23:31:02 +0200654
655 (a?)*
656
Zesstra7ea4a032019-11-26 20:11:40 +0100657 Earlier versions of Perl and PCRE used to give an error at compile time
658 for such patterns. However, because there are cases where this can be
659 useful, such patterns are now accepted, but if any repetition of the
660 subpattern does in fact match no characters, the loop is forcibly
661 broken.
MG Mud User88f12472016-06-24 23:31:02 +0200662
Zesstra7ea4a032019-11-26 20:11:40 +0100663 By default, the quantifiers are "greedy", that is, they match as much
664 as possible (up to the maximum number of permitted times), without
665 causing the rest of the pattern to fail. The classic example of where
666 this gives problems is in trying to match comments in C programs. These
667 appear between the sequences /* and */ and within the sequence,
668 individual * and / characters may appear. An attempt to match C
669 comments by applying the pattern
MG Mud User88f12472016-06-24 23:31:02 +0200670
671 /\*.*\*/
672
Zesstra7ea4a032019-11-26 20:11:40 +0100673 to the string
MG Mud User88f12472016-06-24 23:31:02 +0200674
          /* first comment */ not comment /* second comment */
MG Mud User88f12472016-06-24 23:31:02 +0200676
Zesstra7ea4a032019-11-26 20:11:40 +0100677 fails, because it matches the entire string owing to the greediness of
678 the .* item.
MG Mud User88f12472016-06-24 23:31:02 +0200679
Zesstra7ea4a032019-11-26 20:11:40 +0100680 However, if a quantifier is followed by a question mark, it ceases to
681 be greedy, and instead matches the minimum number of times possible, so
682 the pattern
MG Mud User88f12472016-06-24 23:31:02 +0200683
684 /\*.*?\*/
685
Zesstra7ea4a032019-11-26 20:11:40 +0100686 does the right thing with the C comments. The meaning of the various
687 quantifiers is not otherwise changed, just the preferred number of
688 matches. Do not confuse this use of question mark with its use as a
689 quantifier in its own right. Because it has two uses, it can sometimes
690 appear doubled, as in
MG Mud User88f12472016-06-24 23:31:02 +0200691
692 \d??\d
693
Zesstra7ea4a032019-11-26 20:11:40 +0100694 which matches one digit by preference, but can match two if that is the
695 only way the rest of the pattern matches.
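        The difference between greedy and minimizing repeats can be seen
        in a short sketch. It uses Python's re module (an assumption of
        this example), which treats a trailing question mark the same way.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last */ and swallows everything in between.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing: .*? stops at the first */ that lets the pattern match.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two if the rest of the match needs it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy, lazy, one_digit, two_digits)
```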
MG Mud User88f12472016-06-24 23:31:02 +0200696
Zesstra7ea4a032019-11-26 20:11:40 +0100697 If the PCRE_UNGREEDY option is set (an option which is not available in
698 Perl), the quantifiers are not greedy by default, but individual ones
699 can be made greedy by following them with a question mark. In other
700 words, it inverts the default behaviour.
MG Mud User88f12472016-06-24 23:31:02 +0200701
Zesstra7ea4a032019-11-26 20:11:40 +0100702 When a parenthesized subpattern is quantified with a minimum repeat
703 count that is greater than 1 or with a limited maximum, more store is
704 required for the compiled pattern, in proportion to the size of the
705 minimum or maximum.
MG Mud User88f12472016-06-24 23:31:02 +0200706
Zesstra7ea4a032019-11-26 20:11:40 +0100707 If a pattern starts with .* or .{0,} and the PCRE_DOTALL option
708 (equivalent to Perl's /s) is set, thus allowing the . to match
709 newlines, the pattern is implicitly anchored, because whatever follows
710 will be tried against every character position in the subject string,
711 so there is no point in retrying the overall match at any position
712 after the first. PCRE normally treats such a pattern as though it were
713 preceded by \A.
MG Mud User88f12472016-06-24 23:31:02 +0200714
Zesstra7ea4a032019-11-26 20:11:40 +0100715 In cases where it is known that the subject string contains no
716 newlines, it is worth setting PCRE_DOTALL in order to obtain this
717 optimization, or alternatively using ^ to indicate anchoring
718 explicitly.
MG Mud User88f12472016-06-24 23:31:02 +0200719
Zesstra7ea4a032019-11-26 20:11:40 +0100720 However, there is one situation where the optimization cannot be used.
721 When .* is inside capturing parentheses that are the subject of a
722 backreference elsewhere in the pattern, a match at the start may fail,
723 and a later one succeed. Consider, for example:
MG Mud User88f12472016-06-24 23:31:02 +0200724
725 (.*)abc\1
726
Zesstra7ea4a032019-11-26 20:11:40 +0100727 If the subject is "xyz123abc123" the match point is the fourth
728 character. For this reason, such a pattern is not implicitly anchored.
MG Mud User88f12472016-06-24 23:31:02 +0200729
Zesstra7ea4a032019-11-26 20:11:40 +0100730 When a capturing subpattern is repeated, the value captured is the
731 substring that matched the final iteration. For example, after
MG Mud User88f12472016-06-24 23:31:02 +0200732
733 (tweedle[dume]{3}\s*)+
734
Zesstra7ea4a032019-11-26 20:11:40 +0100735 has matched "tweedledum tweedledee" the value of the captured substring
736 is "tweedledee". However, if there are nested capturing subpatterns,
737 the corresponding captured values may have been set in previous
738 iterations. For example, after
MG Mud User88f12472016-06-24 23:31:02 +0200739
740 /(a|(b))+/
741
Zesstra7ea4a032019-11-26 20:11:40 +0100742 matches "aba" the value of the second captured substring is "b".
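        Both examples can be reproduced with Python's re module, which is
        used here only as an assumed stand-in that follows the same rule
        for repeated groups.

```python
import re

# The group keeps only what its final iteration matched.
m1 = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m1.group(1)

# The inner group is not used in the final iteration ("a"),
# so it still holds the value set in the previous one.
m2 = re.fullmatch(r"(a|(b))+", "aba")
outer, inner = m2.group(1), m2.group(2)

print(last_iteration, outer, inner)
```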
MG Mud User88f12472016-06-24 23:31:02 +0200743
ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
        With both maximizing and minimizing repetition, failure of what follows
        normally causes the repeated item to be re-evaluated to see if a
        different number of repeats allows the rest of the pattern to match.
        Sometimes it is useful to prevent this, either to change the nature of
        the match, or to cause it to fail earlier than it otherwise might, when
        the author of the pattern knows there is no point in carrying on.
MG Mud User88f12472016-06-24 23:31:02 +0200751
Zesstra7ea4a032019-11-26 20:11:40 +0100752 Consider, for example, the pattern \d+foo when applied to the subject
753 line
MG Mud User88f12472016-06-24 23:31:02 +0200754
755 123456bar
756
Zesstra7ea4a032019-11-26 20:11:40 +0100757 After matching all 6 digits and then failing to match "foo", the normal
758 action of the matcher is to try again with only 5 digits matching the
759 \d+ item, and then with 4, and so on, before ultimately failing.
760 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
761 the means for specifying that once a subpattern has matched, it is not
762 to be re-evaluated in this way.
MG Mud User88f12472016-06-24 23:31:02 +0200763
Zesstra7ea4a032019-11-26 20:11:40 +0100764 If we use atomic grouping for the previous example, the matcher would
765 give up immediately on failing to match "foo" the first time. The
766 notation is a kind of special parenthesis, starting with (?> as in this
767 example:
MG Mud User88f12472016-06-24 23:31:02 +0200768
769 (?>\d+)foo
770
Zesstra7ea4a032019-11-26 20:11:40 +0100771 This kind of parenthesis "locks up" the part of the pattern it
772 contains once it has matched, and a failure further into the pattern is
773 prevented from backtracking into it. Backtracking past it to previous
774 items, however, works as normal.
MG Mud User88f12472016-06-24 23:31:02 +0200775
Zesstra7ea4a032019-11-26 20:11:40 +0100776 An alternative description is that a subpattern of this type matches
777 the string of characters that an identical standalone pattern would
778 match, if anchored at the current point in the subject string.
MG Mud User88f12472016-06-24 23:31:02 +0200779
Zesstra7ea4a032019-11-26 20:11:40 +0100780 Atomic grouping subpatterns are not capturing subpatterns. Simple cases
781 such as the above example can be thought of as a maximizing repeat that
782 must swallow everything it can. So, while both \d+ and \d+? are
783 prepared to adjust the number of digits they match in order to make the
784 rest of the pattern match, (?>\d+) can only match an entire sequence of
785 digits.
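        The "must swallow everything" behaviour can be sketched in Python.
        Because (?>...) is only available in Python's re module from
        version 3.11, this example swaps in a different technique: a
        lookahead captures the longest digit run and a back reference then
        consumes it. This is an emulation chosen for the example, not how
        PCRE implements atomic groups.

```python
import re

subject = "1234"

# Ordinary greedy repeat: \d+ gives back one digit so the final \d can match.
backtracking = re.search(r"\d+\d", subject)

# Emulated atomic group: once the lookahead has captured the digit run,
# \1 must consume all of it, so no digit is left for the final \d.
atomic_like = re.search(r"(?=(\d+))\1\d", subject)

print(backtracking.group(), atomic_like)
```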
MG Mud User88f12472016-06-24 23:31:02 +0200786
Zesstra7ea4a032019-11-26 20:11:40 +0100787 Atomic groups in general can of course contain arbitrarily complicated
788 subpatterns, and can be nested. However, when the subpattern for an
789 atomic group is just a single repeated item, as in the example above, a
790 simpler notation, called a "possessive quantifier" can be used. This
791 consists of an additional + character following a quantifier. Using
792 this notation, the previous example can be rewritten as
MG Mud User88f12472016-06-24 23:31:02 +0200793
          \d++foo
795
Zesstra7ea4a032019-11-26 20:11:40 +0100796 Possessive quantifiers are always greedy; the setting of the
797 PCRE_UNGREEDY option is ignored. They are a convenient notation for the
798 simpler forms of atomic group. However, there is no difference in the
799 meaning or processing of a possessive quantifier and the equivalent
800 atomic group.
MG Mud User88f12472016-06-24 23:31:02 +0200801
Zesstra7ea4a032019-11-26 20:11:40 +0100802 The possessive quantifier syntax is an extension to the Perl syntax. It
803 originates in Sun's Java package.
MG Mud User88f12472016-06-24 23:31:02 +0200804
Zesstra7ea4a032019-11-26 20:11:40 +0100805 When a pattern contains an unlimited repeat inside a subpattern that
806 can itself be repeated an unlimited number of times, the use of an
807 atomic group is the only way to avoid some failing matches taking a
808 very long time indeed. The pattern
MG Mud User88f12472016-06-24 23:31:02 +0200809
810 (\D+|<\d+>)*[!?]
811
Zesstra7ea4a032019-11-26 20:11:40 +0100812 matches an unlimited number of substrings that either consist of non-
813 digits, or digits enclosed in <>, followed by either ! or ?. When it
814 matches, it runs quickly. However, if it is applied to
MG Mud User88f12472016-06-24 23:31:02 +0200815
816 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
817
Zesstra7ea4a032019-11-26 20:11:40 +0100818 it takes a long time before reporting failure. This is because the
819 string can be divided between the two repeats in a large number of
820 ways, and all have to be tried. (The example used [!?] rather than a
821 single character at the end, because both PCRE and Perl have an
822 optimization that allows for fast failure when a single character is
823 used. They remember the last single character that is required for a
824 match, and fail early if it is not present in the string.) If the
825 pattern is changed to
MG Mud User88f12472016-06-24 23:31:02 +0200826
827 ((?>\D+)|<\d+>)*[!?]
828
Zesstra7ea4a032019-11-26 20:11:40 +0100829 sequences of non-digits cannot be broken, and failure happens quickly.
MG Mud User88f12472016-06-24 23:31:02 +0200830
BACK REFERENCES
Zesstra7ea4a032019-11-26 20:11:40 +0100832 Outside a character class, a backslash followed by a digit greater than
833 0 (and possibly further digits) is a back reference to a capturing
834 subpattern earlier (that is, to its left) in the pattern, provided
835 there have been that many previous capturing left parentheses.
MG Mud User88f12472016-06-24 23:31:02 +0200836
Zesstra7ea4a032019-11-26 20:11:40 +0100837 However, if the decimal number following the backslash is less than 10,
838 it is always taken as a back reference, and causes an error only if
839 there are not that many capturing left parentheses in the entire
840 pattern. In other words, the parentheses that are referenced need not
841 be to the left of the reference for numbers less than 10. See the
842 section entitled "Backslash" above for further details of the handling
843 of digits following a backslash.
MG Mud User88f12472016-06-24 23:31:02 +0200844
Zesstra7ea4a032019-11-26 20:11:40 +0100845 A back reference matches whatever actually matched the capturing
846 subpattern in the current subject string, rather than anything matching
847 the subpattern itself (see "Subpatterns as subroutines" below for a way
848 of doing that). So the pattern
MG Mud User88f12472016-06-24 23:31:02 +0200849
850 (sens|respons)e and \1ibility
851
Zesstra7ea4a032019-11-26 20:11:40 +0100852 matches "sense and sensibility" and "response and responsibility", but
853 not "sense and responsibility". If caseful matching is in force at the
854 time of the back reference, the case of letters is relevant. For
855 example,
MG Mud User88f12472016-06-24 23:31:02 +0200856
857 ((?i)rah)\s+\1
858
Zesstra7ea4a032019-11-26 20:11:40 +0100859 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
860 original capturing subpattern is matched caselessly.
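        A short sketch of both examples, using Python's re module as an
        assumed stand-in. Recent Python versions reject an unscoped (?i)
        inside a group, so the second pattern is written with the scoped
        form (?i:rah); the back reference behaves as described above.

```python
import re

# \1 must repeat the text that group 1 actually matched.
pair = re.compile(r"(sens|respons)e and \1ibility")
pair_subjects = ("sense and sensibility",
                 "response and responsibility",
                 "sense and responsibility")
pair_results = [bool(pair.fullmatch(s)) for s in pair_subjects]

# The group matches caselessly, but the back reference is caseful.
rah = re.compile(r"((?i:rah))\s+\1")
rah_results = [bool(rah.fullmatch(s)) for s in ("rah rah", "RAH RAH", "RAH rah")]

print(pair_results, rah_results)
```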
MG Mud User88f12472016-06-24 23:31:02 +0200861
Zesstra7ea4a032019-11-26 20:11:40 +0100862 Back references to named subpatterns use the Python syntax (?P=name).
863 We could rewrite the above example as follows:
MG Mud User88f12472016-06-24 23:31:02 +0200864
          (?P<p1>(?i)rah)\s+(?P=p1)
866
Zesstra7ea4a032019-11-26 20:11:40 +0100867 There may be more than one back reference to the same subpattern. If a
868 subpattern has not actually been used in a particular match, any back
869 references to it always fail. For example, the pattern
MG Mud User88f12472016-06-24 23:31:02 +0200870
871 (a|(bc))\2
872
Zesstra7ea4a032019-11-26 20:11:40 +0100873 always fails if it starts to match "a" rather than "bc". Because there
874 may be many capturing parentheses in a pattern, all digits following
875 the backslash are taken as part of a potential back reference number.
876 If the pattern continues with a digit character, some delimiter must be
877 used to terminate the back reference. If the PCRE_EXTENDED option is
878 set, this can be whitespace. Otherwise an empty comment can be used.
MG Mud User88f12472016-06-24 23:31:02 +0200879
Zesstra7ea4a032019-11-26 20:11:40 +0100880 A back reference that occurs inside the parentheses to which it refers
881 fails when the subpattern is first used, so, for example, (a\1) never
882 matches. However, such references can be useful inside repeated
883 subpatterns. For example, the pattern
MG Mud User88f12472016-06-24 23:31:02 +0200884
885 (a|b\1)+
886
Zesstra7ea4a032019-11-26 20:11:40 +0100887 matches any number of "a"s and also "aba", "ababbaa" etc. At each
888 iteration of the subpattern, the back reference matches the character
889 string corresponding to the previous iteration. In order for this to
890 work, the pattern must be such that the first iteration does not need
891 to match the back reference. This can be done using alternation, as in
892 the example above, or by a quantifier with a minimum of zero.
MG Mud User88f12472016-06-24 23:31:02 +0200893
ASSERTIONS
Zesstra7ea4a032019-11-26 20:11:40 +0100895 An assertion is a test on the characters following or preceding the
896 current matching point that does not actually consume any characters.
897 The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
898 described above. More complicated assertions are coded as subpatterns.
899 There are two kinds: those that look ahead of the current position in
900 the subject string, and those that look behind it.
MG Mud User88f12472016-06-24 23:31:02 +0200901
Zesstra7ea4a032019-11-26 20:11:40 +0100902 An assertion subpattern is matched in the normal way, except that it
903 does not cause the current matching position to be changed. Lookahead
904 assertions start with (?= for positive assertions and (?! for negative
905 assertions. For example,
MG Mud User88f12472016-06-24 23:31:02 +0200906
907 \w+(?=;)
908
Zesstra7ea4a032019-11-26 20:11:40 +0100909 matches a word followed by a semicolon, but does not include the
910 semicolon in the match, and
MG Mud User88f12472016-06-24 23:31:02 +0200911
912 foo(?!bar)
913
Zesstra7ea4a032019-11-26 20:11:40 +0100914 matches any occurrence of "foo" that is not followed by "bar". Note
915 that the apparently similar pattern
MG Mud User88f12472016-06-24 23:31:02 +0200916
917 (?!foo)bar
918
Zesstra7ea4a032019-11-26 20:11:40 +0100919 does not find an occurrence of "bar" that is preceded by something
920 other than "foo"; it finds any occurrence of "bar" whatsoever, because
921 the assertion (?!foo) is always true when the next three characters are
922 "bar". A lookbehind assertion is needed to achieve this effect.
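        The lookahead examples can be checked with a short sketch in
        Python's re module (an assumption of this example; the subject
        strings are invented). The last line uses a negative lookbehind
        for contrast.

```python
import re

# The semicolon is required but is not part of the match.
word = re.search(r"\w+(?=;)", "count;").group()

# Only the "foo" that is not followed by "bar" is found (offset 7).
foo_offsets = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar looks ahead from where "bar" starts, so it matches regardless.
misplaced = re.search(r"(?!foo)bar", "foobar")

# A lookbehind tests what precedes "bar", so this one does not match.
behind = re.search(r"(?<!foo)bar", "foobar")

print(word, foo_offsets, misplaced.start(), behind)
```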
MG Mud User88f12472016-06-24 23:31:02 +0200923
Zesstra7ea4a032019-11-26 20:11:40 +0100924 If you want to force a matching failure at some point in a pattern, the
925 most convenient way to do it is with (?!) because an empty string
926 always matches, so an assertion that requires there not to be an empty
927 string must always fail.
MG Mud User88f12472016-06-24 23:31:02 +0200928
Zesstra7ea4a032019-11-26 20:11:40 +0100929 Lookbehind assertions start with (?<= for positive assertions and (?<!
930 for negative assertions. For example,
MG Mud User88f12472016-06-24 23:31:02 +0200931
932 (?<!foo)bar
933
Zesstra7ea4a032019-11-26 20:11:40 +0100934 does find an occurrence of "bar" that is not preceded by "foo". The
935 contents of a lookbehind assertion are restricted such that all the
936 strings it matches must have a fixed length. However, if there are
937 several alternatives, they do not all have to have the same fixed
938 length. Thus
MG Mud User88f12472016-06-24 23:31:02 +0200939
940 (?<=bullock|donkey)
941
Zesstra7ea4a032019-11-26 20:11:40 +0100942 is permitted, but
MG Mud User88f12472016-06-24 23:31:02 +0200943
944 (?<!dogs?|cats?)
945
Zesstra7ea4a032019-11-26 20:11:40 +0100946 causes an error at compile time. Branches that match different length
947 strings are permitted only at the top level of a lookbehind assertion.
948 This is an extension compared with Perl (at least for 5.8), which
949 requires all branches to match the same length of string. An assertion
950 such as
MG Mud User88f12472016-06-24 23:31:02 +0200951
952 (?<=ab(c|de))
953
Zesstra7ea4a032019-11-26 20:11:40 +0100954 is not permitted, because its single top-level branch can match two
955 different lengths, but it is acceptable if rewritten to use two top-
956 level branches:
MG Mud User88f12472016-06-24 23:31:02 +0200957
958 (?<=abc|abde)
959
Zesstra7ea4a032019-11-26 20:11:40 +0100960 The implementation of lookbehind assertions is, for each alternative,
961 to temporarily move the current position back by the fixed width and
962 then try to match. If there are insufficient characters before the
963 current position, the match is deemed to fail.
MG Mud User88f12472016-06-24 23:31:02 +0200964
Zesstra7ea4a032019-11-26 20:11:40 +0100965 PCRE does not allow the \C escape (which matches a single byte in UTF-8
966 mode) to appear in lookbehind assertions, because it makes it
967 impossible to calculate the length of the lookbehind.
MG Mud User88f12472016-06-24 23:31:02 +0200968
Zesstra7ea4a032019-11-26 20:11:40 +0100969 Atomic groups can be used in conjunction with lookbehind assertions to
970 specify efficient matching at the end of the subject string. Consider a
971 simple pattern such as
MG Mud User88f12472016-06-24 23:31:02 +0200972
973 abcd$
974
Zesstra7ea4a032019-11-26 20:11:40 +0100975 when applied to a long string that does not match. Because matching
976 proceeds from left to right, PCRE will look for each "a" in the subject
977 and then see if what follows matches the rest of the pattern. If the
978 pattern is specified as
MG Mud User88f12472016-06-24 23:31:02 +0200979
980 ^.*abcd$
981
Zesstra7ea4a032019-11-26 20:11:40 +0100982 the initial .* matches the entire string at first, but when this fails
983 (because there is no following "a"), it backtracks to match all but the
984 last character, then all but the last two characters, and so on. Once
985 again the search for "a" covers the entire string, from right to left,
986 so we are no better off. However, if the pattern is written as
MG Mud User88f12472016-06-24 23:31:02 +0200987
988 ^(?>.*)(?<=abcd)
989
Zesstra7ea4a032019-11-26 20:11:40 +0100990 or, equivalently,
MG Mud User88f12472016-06-24 23:31:02 +0200991
992 ^.*+(?<=abcd)
993
Zesstra7ea4a032019-11-26 20:11:40 +0100994 there can be no backtracking for the .* item; it can match only the
995 entire string. The subsequent lookbehind assertion does a single test
996 on the last four characters. If it fails, the match fails immediately.
997 For long strings, this approach makes a significant difference to the
998 processing time.
MG Mud User88f12472016-06-24 23:31:02 +0200999
Zesstra7ea4a032019-11-26 20:11:40 +01001000 Several assertions (of any sort) may occur in succession. For example,
MG Mud User88f12472016-06-24 23:31:02 +02001001
1002 (?<=\d{3})(?<!999)foo
1003
        matches "foo" preceded by three digits that are not "999". Notice that
        each of the assertions is applied independently at the same point in
        the subject string. First there is a check that the previous three
        characters are all digits, and then there is a check that the same
        three characters are not "999". This pattern does not match "foo"
        preceded by six characters, the first three of which are digits and
        the last three of which are not "999". For example, it doesn't match
        "123abcfoo". A pattern to do that is
MG Mud User88f12472016-06-24 23:31:02 +02001012
1013 (?<=\d{3}...)(?<!999)foo
1014
Zesstra7ea4a032019-11-26 20:11:40 +01001015 This time the first assertion looks at the preceding six characters,
1016 checking that the first three are digits, and then the second assertion
1017 checks that the preceding three characters are not "999".
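        The two patterns can be compared side by side. This sketch uses
        Python's re module (an assumption of this example), which accepts
        both lookbehinds because each has a fixed width.

```python
import re

same_three = re.compile(r"(?<=\d{3})(?<!999)foo")     # both look at 3 chars
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")    # first looks at 6 chars

subjects = ("123foo", "999foo", "123abcfoo", "123999foo")
first = [bool(same_three.search(s)) for s in subjects]
second = [bool(six_back.search(s)) for s in subjects]

for s, a, b in zip(subjects, first, second):
    print(s, a, b)
```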
MG Mud User88f12472016-06-24 23:31:02 +02001018
Zesstra7ea4a032019-11-26 20:11:40 +01001019 Assertions can be nested in any combination. For example,
MG Mud User88f12472016-06-24 23:31:02 +02001020
1021 (?<=(?<!foo)bar)baz
1022
Zesstra7ea4a032019-11-26 20:11:40 +01001023 matches an occurrence of "baz" that is preceded by "bar" which in turn
1024 is not preceded by "foo", while
MG Mud User88f12472016-06-24 23:31:02 +02001025
1026 (?<=\d{3}(?!999)...)foo
1027
Zesstra7ea4a032019-11-26 20:11:40 +01001028 is another pattern which matches "foo" preceded by three digits and any
1029 three characters that are not "999".
MG Mud User88f12472016-06-24 23:31:02 +02001030
Zesstra7ea4a032019-11-26 20:11:40 +01001031 Assertion subpatterns are not capturing subpatterns, and may not be
1032 repeated, because it makes no sense to assert the same thing several
1033 times. If any kind of assertion contains capturing subpatterns within
1034 it, these are counted for the purposes of numbering the capturing
1035 subpatterns in the whole pattern. However, substring capturing is
1036 carried out only for positive assertions, because it does not make
1037 sense for negative assertions.
MG Mud User88f12472016-06-24 23:31:02 +02001038
CONDITIONAL SUBPATTERNS
Zesstra7ea4a032019-11-26 20:11:40 +01001040 It is possible to cause the matching process to obey a subpattern
1041 conditionally or to choose between two alternative subpatterns,
1042 depending on the result of an assertion, or whether a previous
1043 capturing subpattern matched or not. The two possible forms of
1044 conditional subpattern are
MG Mud User88f12472016-06-24 23:31:02 +02001045
1046 (?(condition)yes-pattern)
1047 (?(condition)yes-pattern|no-pattern)
1048
Zesstra7ea4a032019-11-26 20:11:40 +01001049 If the condition is satisfied, the yes-pattern is used; otherwise the
1050 no-pattern (if present) is used. If there are more than two
1051 alternatives in the subpattern, a compile-time error occurs.
MG Mud User88f12472016-06-24 23:31:02 +02001052
Zesstra7ea4a032019-11-26 20:11:40 +01001053 There are three kinds of condition. If the text between the parentheses
1054 consists of a sequence of digits, the condition is satisfied if the
1055 capturing subpattern of that number has previously matched. The number
1056 must be greater than zero. Consider the following pattern, which
1057 contains non-significant white space to make it more readable (assume
1058 the PCRE_EXTENDED option) and to divide it into three parts for ease of
1059 discussion:
MG Mud User88f12472016-06-24 23:31:02 +02001060
          ( \( )?    [^()]+    (?(1) \) )
MG Mud User88f12472016-06-24 23:31:02 +02001062
Zesstra7ea4a032019-11-26 20:11:40 +01001063 The first part matches an optional opening parenthesis, and if that
1064 character is present, sets it as the first captured substring. The
1065 second part matches one or more characters that are not parentheses.
1066 The third part is a conditional subpattern that tests whether the first
1067 set of parentheses matched or not. If they did, that is, if subject
1068 started with an opening parenthesis, the condition is true, and so the
1069 yes-pattern is executed and a closing parenthesis is required.
1070 Otherwise, since no-pattern is not present, the subpattern matches
1071 nothing. In other words, this pattern matches a sequence of
1072 non-parentheses, optionally enclosed in parentheses.
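
        The three-part pattern above can be tried out with the following
        minimal sketch. It uses Python's standard re module rather than PCRE
        itself, on the assumption that the two agree for this particular
        construct; re.VERBOSE stands in for PCRE_EXTENDED, and the names
        OPTIONALLY_PARENTHESIZED and is_optionally_parenthesized are
        hypothetical.

```python
import re

# Python's `re` accepts the (?(1)...) form used in the text above.
OPTIONALLY_PARENTHESIZED = re.compile(
    r"""
    ( \( )?      # part 1: optional opening parenthesis, captured as group 1
    [^()]+       # part 2: one or more characters that are not parentheses
    (?(1) \) )   # part 3: demand a closing parenthesis only if group 1 matched
    """,
    re.VERBOSE,
)


def is_optionally_parenthesized(text):
    """Return True if the whole of `text` matches the pattern."""
    return OPTIONALLY_PARENTHESIZED.fullmatch(text) is not None
```

        With this sketch, "abc" and "(abc)" are accepted, while "(abc" and
        "abc)" are rejected because the parentheses do not pair up.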

        If the condition is the string (R), it is satisfied if a recursive call
        to the pattern or subpattern has been made. At "top level", the
        condition is false. This is a PCRE extension. Recursive patterns are
        described in the RECURSIVE PATTERNS section below.

        If the condition is not a sequence of digits or (R), it must be an
        assertion. This may be a positive or negative lookahead or lookbehind
        assertion. Consider this pattern, again containing non-significant
        white space, and with the two alternatives on the second line:

          (?(?=[^a-z]*[a-z])
          \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

        The condition is a positive lookahead assertion that matches an
        optional sequence of non-letters followed by a letter. In other words,
        it tests for the presence of at least one letter in the subject. If a
        letter is found, the subject is matched against the first alternative;
        otherwise it is matched against the second. This pattern matches
        strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
        letters and dd are digits.
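
        Python's standard re module does not accept an assertion as a
        condition, so the sketch below is an emulation, not PCRE syntax. It
        rewrites the conditional as two alternatives, guarding the first with
        the lookahead and the second with its negation, which selects the
        same branch in each case. The names HAS_LETTER and DATE_LIKE are
        hypothetical.

```python
import re

# The assertion used as the condition in the text above.
HAS_LETTER = r"[^a-z]*[a-z]"

# (?(?=A) yes | no)  is emulated as  (?: (?=A) yes | (?!A) no )
DATE_LIKE = re.compile(
    r"(?:(?=" + HAS_LETTER + r")\d{2}-[a-z]{3}-\d{2}"
    r"|(?!" + HAS_LETTER + r")\d{2}-\d{2}-\d{2})"
)
```

        Here "12-abc-34" and "12-34-56" match in full, while "12-ab-34" does
        not: the letter selects the first alternative, which then fails.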

COMMENTS
        The sequence (?# marks the start of a comment which continues up to the
        next closing parenthesis. Nested parentheses are not permitted. The
        characters that make up a comment play no part in the pattern matching
        at all.

        If the PCRE_EXTENDED option is set, an unescaped # character outside a
        character class introduces a comment that continues up to the next
        newline character in the pattern.
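
        Both comment styles can be illustrated with a short sketch in Python's
        standard re module, assuming that its (?#...) handling and its
        re.VERBOSE flag behave like PCRE and PCRE_EXTENDED for these cases.
        The names INLINE and EXTENDED are hypothetical.

```python
import re

# A (?#...) comment ends at the next closing parenthesis.
INLINE = re.compile(r"abc(?#this comment is ignored)def")

# With re.VERBOSE, an unescaped "#" outside a character class starts a
# comment that runs to the end of the pattern line.
EXTENDED = re.compile(
    r"""
    abc   # ignored up to the newline
    [#]   # inside a character class the hash is an ordinary character
    def
    """,
    re.VERBOSE,
)
```

        INLINE matches "abcdef", and EXTENDED matches "abc#def" but not
        "abcdef", since the bracketed hash is part of the pattern.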

RECURSIVE PATTERNS
        Consider the problem of matching a string in parentheses, allowing for
        unlimited nested parentheses. Without the use of recursion, the best
        that can be done is to use a pattern that matches up to some fixed
        depth of nesting. It is not possible to handle an arbitrary nesting
        depth. Perl has provided an experimental facility that allows regular
        expressions to recurse (amongst other things). It does this by
        interpolating Perl code in the expression at run time, and the code can
        refer to the expression itself. A Perl pattern to solve the parentheses
        problem can be created like this:

          $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

        The (?p{...}) item interpolates Perl code at run time, and in this case
        refers recursively to the pattern in which it appears. Obviously, PCRE
        cannot support the interpolation of Perl code. Instead, it supports
        some special syntax for recursion of the entire pattern, and also for
        individual subpattern recursion.

        The special item that consists of (? followed by a number greater than
        zero and a closing parenthesis is a recursive call of the subpattern of
        the given number, provided that it occurs inside that subpattern. (If
        not, it is a "subroutine" call, which is described in the next
        section.) The special item (?R) is a recursive call of the entire
        regular expression.

        For example, this PCRE pattern solves the nested parentheses problem
        (assume the PCRE_EXTENDED option is set so that white space is
        ignored):

          \( ( (?>[^()]+) | (?R) )* \)

        First it matches an opening parenthesis. Then it matches any number of
        substrings which can either be a sequence of non-parentheses, or a
        recursive match of the pattern itself (that is, a correctly
        parenthesized substring). Finally there is a closing parenthesis.
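
        Python's standard re module has no (?R) item, so the following sketch
        does not run the pattern. It is a hand-written recursive function that
        mirrors the three steps just described, to make the control flow
        concrete. The function name match_nested is hypothetical.

```python
def match_nested(subject, pos=0):
    """Return the index just past a balanced group starting at `pos`,
    or -1 if no balanced group starts there."""
    # Step 1: an opening parenthesis is required.
    if pos >= len(subject) or subject[pos] != "(":
        return -1
    pos += 1
    # Step 2: any number of non-parentheses or nested groups.
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            # Step 3: the closing parenthesis ends this level.
            return pos + 1
        if ch == "(":
            # Recurse, as (?R) does in the pattern.
            pos = match_nested(subject, pos)
            if pos == -1:
                return -1
        else:
            pos += 1
    # The subject ended before this level was closed.
    return -1
```

        For the subject "(ab(cd)ef)" the function consumes all ten characters,
        while an unbalanced subject such as "(a(b)" is rejected.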

        If this were part of a larger pattern, you would not want to recurse
        the entire pattern, so instead you could use this:

          ( \( ( (?>[^()]+) | (?1) )* \) )

        We have put the pattern into parentheses, and caused the recursion to
        refer to them instead of the whole pattern. In a larger pattern,
        keeping track of parenthesis numbers can be tricky. It may be more
        convenient to use named parentheses instead. For this, PCRE uses
        (?P>name), which is an extension to the Python syntax that PCRE uses
        for named parentheses (Perl does not provide named parentheses). We
        could rewrite the above example as follows:

          (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )

        This particular example pattern contains nested unlimited repeats, and
        so the use of atomic grouping for matching strings of non-parentheses
        is important when applying the pattern to strings that do not match.
        For example, when this pattern is applied to

          (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

        it yields "no match" quickly. However, if atomic grouping is not used,
        the match runs for a very long time indeed because there are so many
        different ways the + and * repeats can carve up the subject, and all
        have to be tested before failure can be reported.

        At the end of a match, the values set for any capturing subpatterns are
        those from the outermost level of the recursion at which the subpattern
        value is set. If you want to obtain intermediate values, a callout
        function can be used (see below and the pcrecallout documentation). If
        the pattern above is matched against

          (ab(cd)ef)

        the value for the capturing parentheses is "ef", which is the last
        value taken on at the top level. If additional parentheses are added,
        giving

          \( ( ( (?>[^()]+) | (?R) )* ) \)
             ^                        ^
             ^                        ^

        the string they capture is "ab(cd)ef", the contents of the top level
        parentheses. If there are more than 15 capturing parentheses in a
        pattern, PCRE has to obtain extra memory to store data during a
        recursion, which it does by using pcre_malloc, freeing it via pcre_free
        afterwards. If no memory can be obtained, the match fails with the
        PCRE_ERROR_NOMEMORY error.

        Do not confuse the (?R) item with the condition (R), which tests for
        recursion. Consider this pattern, which matches text in angle
        brackets, allowing for arbitrary nesting. Only digits are allowed in
        nested brackets (that is, when recursing), whereas any characters are
        permitted at the outer level.

          < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >

        In this pattern, (?(R) is the start of a conditional subpattern, with
        two different alternatives for the recursive and non-recursive cases.
        The (?R) item is the actual recursive call.

SUBPATTERNS AS SUBROUTINES
        If the syntax for a recursive subpattern reference (either by number or
        by name) is used outside the parentheses to which it refers, it
        operates like a subroutine in a programming language. An earlier
        example pointed out that the pattern

          (sens|respons)e and \1ibility

        matches "sense and sensibility" and "response and responsibility", but
        not "sense and responsibility". If instead the pattern

          (sens|respons)e and (?1)ibility

        is used, it does match "sense and responsibility" as well as the other
        two strings. Such references must, however, follow the subpattern to
        which they refer.
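
        The difference between a back reference and a subroutine call can be
        sketched in Python's standard re module. That module has no (?1) item,
        so the subroutine call is emulated here by splicing the source text of
        the subpattern into the pattern a second time; this is an assumption
        made for illustration, not PCRE syntax. The names STEM, BACKREF and
        SUBROUTINE_LIKE are hypothetical.

```python
import re

# Source text of the subpattern that is reused.
STEM = r"(?:sens|respons)"

# A back reference repeats the text that group 1 actually matched.
BACKREF = re.compile(r"(sens|respons)e and \1ibility")

# A subroutine call re-runs the subpattern itself, so either
# alternative may match the second time round.
SUBROUTINE_LIKE = re.compile(STEM + r"e and " + STEM + r"ibility")
```

        BACKREF rejects "sense and responsibility", whereas SUBROUTINE_LIKE
        accepts it along with the two strings that BACKREF accepts.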

CALLOUTS
        Perl has a feature whereby using the sequence (?{...}) causes arbitrary
        Perl code to be obeyed in the middle of matching a regular expression.
        This makes it possible, amongst other things, to extract different
        substrings that match the same pair of parentheses when there is a
        repetition.

        PCRE provides a similar feature, but of course it cannot obey arbitrary
        Perl code. The feature is called "callout". The caller of PCRE provides
        an external function by putting its entry point in the global variable
        pcre_callout. By default, this variable contains NULL, which disables
        all calling out.

        Within a regular expression, (?C) indicates the points at which the
        external function is to be called. If you want to identify different
        callout points, you can put a number less than 256 after the letter C.
        The default value is zero. For example, this pattern has two callout
        points:

          (?C1)abc(?C2)def

        During matching, when PCRE reaches a callout point (and pcre_callout is
        set), the external function is called. It is provided with the number
        of the callout, and, optionally, one item of data originally supplied
        by the caller of pcre_exec(). The callout function may cause matching
        to backtrack, or to fail altogether. A complete description of the
        interface to the callout function is given in the pcrecallout
        documentation.

DIFFERENCES FROM PERL
        This section describes the differences in the ways that PCRE and Perl
        handle regular expressions. The differences described here are with
        respect to Perl 5.8.

        1. PCRE does not have full UTF-8 support. Details of what it does have
        are given in the section on UTF-8 support in the main pcre page.

        2. PCRE does not allow repeat quantifiers on lookahead assertions.
        Perl permits them, but they do not mean what you might think. For
        example, (?!a){3} does not assert that the next three characters are
        not "a". It just asserts that the next character is not "a" three
        times.

        3. Capturing subpatterns that occur inside negative lookahead
        assertions are counted, but their entries in the offsets vector are
        never set. Perl sets its numerical variables from any such patterns
        that are matched before the assertion fails to match something
        (thereby succeeding), but only if the negative lookahead assertion
        contains just one branch.

        4. Though binary zero characters are supported in the subject string,
        they are not allowed in a pattern string because it is passed as a
        normal C string, terminated by zero. The escape sequence "\0" can be
        used in the pattern to represent a binary zero.

        5. The following Perl escape sequences are not supported: \l, \u, \L,
        \U, \P, \p, \N, and \X. In fact these are implemented by Perl's general
        string-handling and are not part of its pattern matching engine. If any
        of these are encountered by PCRE, an error is generated.

        6. PCRE does support the \Q...\E escape for quoting substrings.
        Characters in between are treated as literals. This is slightly
        different from Perl in that $ and @ are also handled as literals inside
        the quotes. In Perl, they cause variable interpolation (but of course
        PCRE does not have variables). Note the following examples:

            Pattern            PCRE matches      Perl matches

            \Qabc$xyz\E        abc$xyz           abc followed by the
                                                   contents of $xyz
            \Qabc\$xyz\E       abc\$xyz          abc\$xyz
            \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz

        The \Q...\E sequence is recognized both inside and outside character
        classes.

        7. Fairly obviously, PCRE does not support the (?{code}) and
        (?p{code}) constructions. However, there is some experimental support
        for recursive patterns using the non-Perl items (?R), (?number) and
        (?P>name). Also, the PCRE "callout" feature allows an external function
        to be called during pattern matching.

        8. There are some differences that are concerned with the settings of
        captured strings when part of a pattern is repeated. For example,
        matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
        unset, but in PCRE it is set to "b".

        9. PCRE provides some extensions to the Perl regular expression
        facilities:

        (a) Although lookbehind assertions must match fixed length strings,
        each alternative branch of a lookbehind assertion can match a different
        length of string. Perl requires them all to have the same length.

        (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
        meta-character matches only at the very end of the string.

        (c) If PCRE_EXTRA is set, a backslash followed by a letter with no
        special meaning is faulted.

        (d) If PCRE_UNGREEDY is set, the greediness of the repetition
        quantifiers is inverted, that is, by default they are not greedy, but
        if followed by a question mark they are.

        (e) PCRE_ANCHORED can be used to force a pattern to be tried only at
        the first matching position in the subject string.

        (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and
        PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no Perl equivalents.

        (g) The (?R), (?number), and (?P>name) constructs allow for recursive
        pattern matching (Perl can do this using the (?p{code}) construct,
        which PCRE cannot support).

        (h) PCRE supports named capturing substrings, using the Python syntax.

        (i) PCRE supports the possessive quantifier "++" syntax, taken from
        Sun's Java package.

        (j) The (R) condition, for testing recursion, is a PCRE extension.

        (k) The callout facility is PCRE-specific.

NOTES
        The \< and \> metacharacters from Henry Spencer's package
        are not available in PCRE, but can be emulated with \b,
        as required, also in conjunction with \W or \w.
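
        One possible emulation is sketched below in Python's standard re
        module, assuming its \b, \w and lookaround items behave like their
        PCRE counterparts here. Pairing \b with a lookahead or lookbehind for
        \w is one way of using \b "in conjunction with \w"; the names
        WORD_START, WORD_END and starts are hypothetical.

```python
import re

# \< : a word boundary with a word character to its right.
WORD_START = r"\b(?=\w)"
# \> : a word boundary with a word character to its left.
WORD_END = r"\b(?<=\w)"


def starts(pattern, text):
    """Return the start offsets of all matches of `pattern` in `text`."""
    return [m.start() for m in re.finditer(pattern, text)]
```

        In the subject "cat cats concat", WORD_START followed by "cat" is
        found at offsets 0 and 4, while "cat" followed by WORD_END is found
        at offsets 0 and 12.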

        In LDMud, backtracks are limited by the EVAL_COST runtime
        limit, to avoid freezing the driver with a match
        like regexp(({"=XX==================="}), "X(.+)+X").

        LDMud doesn't support PCRE callouts.

LIMITATIONS
        There are some size limitations in PCRE but it is hoped that
        they will never in practice be relevant. The maximum length
        of a compiled pattern is 65539 (sic) bytes. All values in
        repeating quantifiers must be less than 65536. The
        maximum number of capturing subpatterns is 65535. There is no
        limit to the number of non-capturing subpatterns, but the
        maximum depth of nesting of all kinds of parenthesized
        subpattern, including capturing subpatterns, assertions,
        and other types of subpattern, is 200.

        The maximum length of a subject string is the largest
        positive number that an integer variable can hold. However,
        PCRE uses recursion to handle subpatterns and indefinite
        repetition. This means that the available stack space may
        limit the size of a subject string that can be processed by
        certain patterns.

AUTHOR
        Philip Hazel <ph10@cam.ac.uk>
        University Computing Service,
        New Museums Site,
        Cambridge CB2 3QG, England.
        Phone: +44 1223 334714

SEE ALSO
        regexp(C), hsregexp(C)