SYNOPSIS

PCRE - Perl-compatible regular expressions

DESCRIPTION

This document describes the regular expressions supported by the PCRE
package. When the package is compiled into the driver, the macro
__PCRE__ is defined.

Most of this manpage is lifted directly from the original PCRE manpage
(dated January 2003).

The PCRE library is a set of functions that implement regular
expression pattern matching using the same syntax and semantics as
Perl 5, with just a few differences (see below). The current
implementation corresponds to Perl 5.005, with some additional features
from later versions. This includes some experimental, incomplete
support for UTF-8 encoded strings. Details of exactly what is and what
is not supported are given below.

PCRE REGULAR EXPRESSION DETAILS

The syntax and semantics of the regular expressions supported by PCRE
are described below. Regular expressions are also described in the Perl
documentation and in a number of other books, some of which have
copious examples. Jeffrey Friedl's "Mastering Regular Expressions",
published by O'Reilly, covers them in great detail. The description
here is intended as reference documentation.

The basic operation of PCRE is on strings of bytes. However, there is
also support for UTF-8 character strings. To use this support you must
build PCRE to include UTF-8 support, and then call pcre_compile() with
the PCRE_UTF8 option. How this affects the pattern matching is
mentioned in several places below. There is also a summary of UTF-8
features in the section on UTF-8 support in the main pcre page.

A regular expression is a pattern that is matched against a subject
string from left to right. Most characters stand for themselves in a
pattern, and match the corresponding characters in the subject. As a
trivial example, the pattern

  The quick brown fox

matches a portion of a subject string that is identical to itself. The
power of regular expressions comes from the ability to include
alternatives and repetitions in the pattern. These are encoded in the
pattern by the use of meta-characters, which do not stand for
themselves but instead are interpreted in some special way.

There are two different sets of meta-characters: those that are
recognized anywhere in the pattern except within square brackets, and
those that are recognized in square brackets. Outside square brackets,
the meta-characters are as follows:

  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also "possessive quantifier"
  {      start min/max quantifier

Part of a pattern that is in square brackets is called a "character
class". In a character class the only meta-characters are:

  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (only if followed by POSIX
           syntax)
  ]      terminates the character class

The following sections describe the use of each of the meta-characters.

BACKSLASH

The backslash character has several uses. Firstly, if it is followed by
a non-alphameric character, it takes away any special meaning that
character may have. This use of backslash as an escape character
applies both inside and outside character classes.

For example, if you want to match a * character, you write \* in the
pattern. This escaping action applies whether or not the following
character would otherwise be interpreted as a meta-character, so it is
always safe to precede a non-alphameric with backslash to specify that
it stands for itself. In particular, if you want to match a backslash,
you write \\.
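
The escaping rule can be tried out in a short sketch. Python's re
module is used here only as a stand-in for PCRE (an assumption; the
driver's own calls are not shown in this document), and the subject
strings are made up for illustration.

```python
import re

# A bare * is a quantifier; \* matches a literal asterisk.
star = re.search(r"\*", "2 * 3")

# Escaping a non-alphameric character that is not special is harmless.
percent = re.search(r"\%", "100%")

# Two backslashes in the pattern match one backslash in the subject.
backslash = re.search(r"\\", r"C:\temp")

print(star.span(), percent.group(), backslash.group())
```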

If a pattern is compiled with the PCRE_EXTENDED option, whitespace in
the pattern (other than in a character class) and characters between a
# outside a character class and the next newline character are ignored.
An escaping backslash can be used to include a whitespace or #
character as part of the pattern.
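
As an illustration of extended patterns, the sketch below uses Python's
re.VERBOSE flag, which is assumed here to behave like PCRE_EXTENDED for
the cases shown. The date-and-issue format is invented for the example.

```python
import re

pattern = re.compile(
    r"""
    \d{4}   # year
    -       # literal hyphen
    \d{2}   # month
    \ \#    # an escaped space followed by an escaped '#'
    \d+     # issue number
    """,
    re.VERBOSE,
)

# The layout whitespace and comments are ignored; the escaped space
# and '#' are part of the pattern.
match = pattern.fullmatch("2003-01 #42")
print(match is not None)
```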

If you want to remove the special meaning from a sequence of
characters, you can do so by putting them between \Q and \E. This is
different from Perl in that $ and @ are handled as literals in \Q...\E
sequences in PCRE, whereas in Perl, $ and @ cause variable
interpolation. Note the following examples:

  Pattern            PCRE matches   Perl matches

  \Qabc$xyz\E        abc$xyz        abc followed by the
                                      contents of $xyz
  \Qabc\$xyz\E       abc\$xyz       abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz

The \Q...\E sequence is recognized both inside and outside character
classes.
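
Python's re module has no \Q...\E, so the sketch below emulates the
PCRE column of the table with a hypothetical helper, expand_quoted(),
that rewrites each quoted span into escaped literal text. It is a
simplification for illustration only, not how PCRE implements quoting.

```python
import re

def expand_quoted(pattern):
    """Replace each \\Q...\\E span with its escaped literal text.

    Hypothetical helper. Simplification: an escaped backslash
    directly in front of the Q is not recognized.
    """
    return re.sub(
        r"\\Q(.*?)(?:\\E|\Z)",            # \Q, shortest body, then \E or end
        lambda m: re.escape(m.group(1)),  # quote every special character
        pattern,
        flags=re.DOTALL,
    )

# The three rows of the table, matched the way the PCRE column describes.
row1 = re.fullmatch(expand_quoted(r"\Qabc$xyz\E"), "abc$xyz")
row2 = re.fullmatch(expand_quoted(r"\Qabc\$xyz\E"), r"abc\$xyz")
row3 = re.fullmatch(expand_quoted(r"\Qabc\E\$\Qxyz\E"), "abc$xyz")
print(all(m is not None for m in (row1, row2, row3)))
```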

A second use of backslash provides a way of encoding non-printing
characters in patterns in a visible manner. There is no restriction on
the appearance of non-printing characters, apart from the binary zero
that terminates a pattern, but when a pattern is being prepared by text
editing, it is usually easier to use one of the following escape
sequences than the binary character it represents:

  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any character
  \e         escape (hex 1B)
  \f         formfeed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \ddd       character with octal code ddd, or backreference
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh... (UTF-8 mode only)

The precise effect of \cx is as follows: if x is a lower case letter,
it is converted to upper case. Then bit 6 of the character (hex 40) is
inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
becomes hex 7B.
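
The \cx rule is plain bit arithmetic and can be reproduced without any
regular expression engine. The function name control_char() is made up
for this sketch.

```python
def control_char(x):
    """Return the byte value that \\cx stands for, per the rule above."""
    code = ord(x)
    if ord("a") <= code <= ord("z"):
        code -= 0x20      # convert a lower case letter to upper case
    return code ^ 0x40    # invert bit 6 (hex 40)

for ch in "z{;":
    print(ch, hex(control_char(ch)))
```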

After \x, from zero to two hexadecimal digits are read (letters can be
in upper or lower case). In UTF-8 mode, any number of hexadecimal
digits may appear between \x{ and }, but the value of the character
code must be less than 2**31 (that is, the maximum hexadecimal value is
7FFFFFFF). If characters other than hexadecimal digits appear between
\x{ and }, or if there is no terminating }, this form of escape is not
recognized. Instead, the initial \x will be interpreted as a basic
hexadecimal escape, with no following digits, giving a byte whose value
is zero.

Characters whose value is less than 256 can be defined by either of the
two syntaxes for \x when PCRE is in UTF-8 mode. There is no difference
in the way they are handled. For example, \xdc is exactly the same as
\x{dc}.

After \0 up to two further octal digits are read. In both cases, if
there are fewer than two digits, just those that are present are used.
Thus the sequence \0\x\07 specifies two binary zeros followed by a BEL
character (code value 7). Make sure you supply two digits after the
initial zero if the character that follows is itself an octal digit.

The handling of a backslash followed by a digit other than 0 is
complicated. Outside a character class, PCRE reads it and any following
digits as a decimal number. If the number is less than 10, or if there
have been at least that many previous capturing left parentheses in the
expression, the entire sequence is taken as a back reference. A
description of how this works is given later, following the discussion
of parenthesized subpatterns.

Inside a character class, or if the decimal number is greater than 9
and there have not been that many capturing subpatterns, PCRE re-reads
up to three octal digits following the backslash, and generates a
single byte from the least significant 8 bits of the value. Any
subsequent digits stand for themselves. For example:

  \040   is another way of writing a space
  \40    is the same, provided there are fewer than 40
           previous capturing subpatterns
  \7     is always a back reference
  \11    might be a back reference, or another way of
           writing a tab
  \011   is always a tab
  \0113  is a tab followed by the character "3"
  \113   might be a back reference, otherwise the
           character with octal code 113
  \377   might be a back reference, otherwise
           the byte consisting entirely of 1 bits
  \81    is either a back reference, or a binary zero
           followed by the two characters "8" and "1"

Note that octal values of 100 or greater must not be introduced by a
leading zero, because no more than three octal digits are ever read.
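
A few of the unambiguous rows can be checked with a short sketch.
Python's re module is used as a stand-in; its rules for the ambiguous
cases (such as \40 or \81) are not the same as PCRE's, so only escapes
with a leading zero and a plain back reference are shown.

```python
import re

# A leading zero always introduces an octal escape: \040 is a space.
space = re.fullmatch(r"a\040b", "a b")

# At most three octal digits are read, so \0113 is a tab followed by "3".
tab_three = re.fullmatch(r"\0113", "\t3")

# With one capturing subpattern present, \1 refers back to its match.
doubled = re.fullmatch(r"(ab)\1", "abab")
not_doubled = re.fullmatch(r"(ab)\1", "abba")

print(space is not None, tab_three is not None,
      doubled is not None, not_doubled is None)
```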

All the sequences that define a single byte value or a single UTF-8
character (in UTF-8 mode) can be used both inside and outside character
classes. In addition, inside a character class, the sequence \b is
interpreted as the backspace character (hex 08). Outside a character
class it has a different meaning (see below).

The third use of backslash is for specifying generic character types:

  \d     any decimal digit
  \D     any character that is not a decimal digit
  \s     any whitespace character
  \S     any character that is not a whitespace character
  \w     any "word" character
  \W     any "non-word" character

Each pair of escape sequences partitions the complete set of characters
into two disjoint sets. Any given character matches one, and only one,
of each pair.

In UTF-8 mode, characters with values greater than 255 never match \d,
\s, or \w, and always match \D, \S, and \W.

For compatibility with Perl, \s does not match the VT character (code
11). This makes it different from the POSIX "space" class. The \s
characters are HT (9), LF (10), FF (12), CR (13), and space (32).

A "word" character is any letter or digit or the underscore character,
that is, any character which can be part of a Perl "word". The
definition of letters and digits is controlled by PCRE's character
tables, and may vary if locale-specific matching is taking place (see
"Locale support" in the pcreapi page). For example, in the "fr"
(French) locale, some character codes greater than 128 are used for
accented letters, and these are matched by \w.

These character type sequences can appear both inside and outside
character classes. They each match one character of the appropriate
type. If the current matching point is at the end of the subject
string, all of them fail, since there is no character to match.

The fourth use of backslash is for certain simple assertions. An
assertion specifies a condition that has to be met at a particular
point in a match, without consuming any characters from the subject
string. The use of subpatterns for more complicated assertions is
described below. The backslashed assertions are:

  \b     matches at a word boundary
  \B     matches when not at a word boundary
  \A     matches at start of subject
  \Z     matches at end of subject or before newline at end
  \z     matches at end of subject
  \G     matches at first matching position in subject

These assertions may not appear in character classes (but note that \b
has a different meaning, namely the backspace character, inside a
character class).

A word boundary is a position in the subject string where the current
character and the previous character do not both match \w or \W (i.e.
one matches \w and the other matches \W), or the start or end of the
string if the first or last character matches \w, respectively.
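
The word boundary definition can be seen at work in the sketch below,
which uses Python's re module as a stand-in (assumed to agree with PCRE
on \b and \B for plain ASCII text). The subject string is invented.

```python
import re

subject = "cat concat cat_ cat"

# \b needs a \w / \W transition, or a string edge next to a \w character.
# "cat_" fails because "_" is itself a word character.
starts = [m.start() for m in re.finditer(r"\bcat\b", subject)]

# \B is the complement: "cat" that does not begin at a word boundary.
inner = [m.start() for m in re.finditer(r"\Bcat", subject)]

print(starts, inner)  # [0, 16] [7]
```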

The \A, \Z, and \z assertions differ from the traditional circumflex
and dollar (described below) in that they only ever match at the very
start and end of the subject string, whatever options are set. Thus,
they are independent of multiline mode.

They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options. If the
startoffset argument of pcre_exec() is non-zero, indicating that
matching is to start at a point other than the beginning of the
subject, \A can never match. The difference between \Z and \z is that
\Z matches before a newline that is the last character of the string as
well as at the end of the string, whereas \z matches only at the end.

The \G assertion is true only when the current matching position is at
the start point of the match, as specified by the startoffset argument
of pcre_exec(). It differs from \A when the value of startoffset is
non-zero. By calling pcre_exec() multiple times with appropriate
arguments, you can mimic Perl's /g option, and it is in this kind of
implementation where \G can be useful.

Note, however, that PCRE's interpretation of \G, as the start of the
current match, is subtly different from Perl's, which defines it as the
end of the previous match. In Perl, these can be different when the
previously matched string was empty. Because PCRE does just one match
at a time, it cannot reproduce this behaviour.
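
The idea of repeated matching from a moving start offset can be
sketched as a small tokenizer. Python's re module has no \G, so the
sketch relies on Pattern.match(subject, pos), which only succeeds when
a match begins exactly at pos. This is offered as a loose analogue of
\G with a non-zero startoffset, not as PCRE's own behaviour; the token
pattern and the tokens() helper are hypothetical.

```python
import re

# Every alternative consumes at least one character, so pos always advances.
token = re.compile(r"\d+|[a-z]+|\s+")

def tokens(subject):
    """Collect consecutive tokens, stopping at the first position with none."""
    found, pos = [], 0
    while pos < len(subject):
        m = token.match(subject, pos)   # anchored at pos, like \G
        if m is None:
            break
        found.append(m.group())
        pos = m.end()
    return found

print(tokens("12 apples!"))  # ['12', ' ', 'apples']
```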

If all the alternatives of a pattern begin with \G, the expression is
anchored to the starting match position, and the "anchored" flag is set
in the compiled regular expression.

CIRCUMFLEX AND DOLLAR

Outside a character class, in the default matching mode, the circumflex
character is an assertion which is true only if the current matching
point is at the start of the subject string. If the startoffset
argument of pcre_exec() is non-zero, circumflex can never match if the
PCRE_MULTILINE option is unset. Inside a character class, circumflex
has an entirely different meaning (see below).

Circumflex need not be the first character of the pattern if a number
of alternatives are involved, but it should be the first thing in each
alternative in which it appears if the pattern is ever to match that
branch. If all possible alternatives start with a circumflex, that is,
if the pattern is constrained to match only at the start of the
subject, it is said to be an "anchored" pattern. (There are also other
constructs that can cause a pattern to be anchored.)

A dollar character is an assertion which is true only if the current
matching point is at the end of the subject string, or immediately
before a newline character that is the last character in the string (by
default). Dollar need not be the last character of the pattern if a
number of alternatives are involved, but it should be the last item in
any branch in which it appears. Dollar has no special meaning in a
character class.

The meaning of dollar can be changed so that it matches only at the
very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
compile time. This does not affect the \Z assertion.

The meanings of the circumflex and dollar characters are changed if the
PCRE_MULTILINE option is set. When this is the case, they match
immediately after and immediately before an internal newline character,
respectively, in addition to matching at the start and end of the
subject string. For example, the pattern /^abc$/ matches the subject
string "def\nabc" in multiline mode, but not otherwise. Consequently,
patterns that are anchored in single line mode because all branches
start with ^ are not anchored in multiline mode, and a match for
circumflex is possible when the startoffset argument of pcre_exec() is
non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE
is set.
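
The /^abc$/ example can be reproduced in a short sketch. Python's
re.MULTILINE flag is assumed here to correspond to PCRE_MULTILINE for
these cases; Python has no direct counterpart of PCRE_DOLLAR_ENDONLY,
so that option is not shown.

```python
import re

subject = "def\nabc"

single = re.search(r"^abc$", subject)               # default mode
multi = re.search(r"^abc$", subject, re.MULTILINE)  # ^ and $ see the newline

# By default, $ also matches just before a newline that ends the subject.
before_final_newline = re.search(r"abc$", "abc\n")

print(single, multi.span(), before_final_newline.span())
```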

Note that the sequences \A, \Z, and \z can be used to match the start
and end of the subject in both modes, and if all branches of a pattern
start with \A it is always anchored, whether PCRE_MULTILINE is set or
not.

FULL STOP (PERIOD, DOT)

Outside a character class, a dot in the pattern matches any one
character in the subject, including a non-printing character, but not
(by default) newline. In UTF-8 mode, a dot matches any UTF-8 character,
which might be more than one byte long, except (by default) for
newline. If the PCRE_DOTALL option is set, dots match newlines as well.
The handling of dot is entirely independent of the handling of
circumflex and dollar, the only relationship being that they both
involve newline characters. Dot has no special meaning in a character
class.
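
A minimal sketch of the dot rules, with Python's re.DOTALL flag standing
in for PCRE_DOTALL (an assumption that holds for these simple cases):

```python
import re

default_dot = re.search(r"a.c", "a\nc")          # dot refuses the newline
dotall = re.search(r"a.c", "a\nc", re.DOTALL)    # dot accepts the newline
nonprinting = re.search(r"a.c", "a\x01c")        # non-printing is fine
in_class = re.search(r"a[.]c", "abc")            # inside a class, dot is literal

print(default_dot, dotall is not None, nonprinting is not None, in_class)
```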

MATCHING A SINGLE BYTE

Outside a character class, the escape sequence \C matches any one byte,
both in and out of UTF-8 mode. Unlike a dot, it always matches a
newline. The feature is provided in Perl in order to match individual
bytes in UTF-8 mode. Because it breaks up UTF-8 characters into
individual bytes, what remains in the string may be a malformed UTF-8
string. For this reason it is best avoided.

PCRE does not allow \C to appear in lookbehind assertions (see below),
because in UTF-8 mode it makes it impossible to calculate the length of
the lookbehind.

SQUARE BRACKETS

An opening square bracket introduces a character class, terminated by a
closing square bracket. A closing square bracket on its own is not
special. If a closing square bracket is required as a member of the
class, it should be the first data character in the class (after an
initial circumflex, if present) or escaped with a backslash.

A character class matches a single character in the subject. In UTF-8
mode, the character may occupy more than one byte. A matched character
must be in the set of characters defined by the class, unless the first
character in the class definition is a circumflex, in which case the
subject character must not be in the set defined by the class. If a
circumflex is actually required as a member of the class, ensure it is
not the first character, or escape it with a backslash.

For example, the character class [aeiou] matches any lower case vowel,
while [^aeiou] matches any character that is not a lower case vowel.
Note that a circumflex is just a convenient notation for specifying the
characters which are in the class by enumerating those that are not. It
is not an assertion: it still consumes a character from the subject
string, and fails if the current pointer is at the end of the string.
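
The point that a negated class still consumes a character is easy to
miss, so here is a small sketch. Python's re module is used as a
stand-in for PCRE; the subject strings are invented.

```python
import re

vowels = re.findall(r"[aeiou]", "audio")
non_vowels = re.findall(r"[^aeiou]", "audio")

# [^aeiou] must consume one character, so it fails at the end of the
# subject even though "nothing" is certainly not a vowel.
at_end = re.search(r"a[^aeiou]", "a")

print(vowels, non_vowels, at_end)  # ['a', 'u', 'i', 'o'] ['d'] None
```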

In UTF-8 mode, characters with values greater than 255 can be included
in a class as a literal string of bytes, or by using the \x{ escaping
mechanism.

When caseless matching is set, any letters in a class represent both
their upper case and lower case versions, so for example, a caseless
[aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would. PCRE does not support the
concept of case for characters with values greater than 255.

The newline character is never treated in any special way in character
classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE
options is. A class such as [^a] will always match a newline.

The minus (hyphen) character can be used to specify a range of
characters in a character class. For example, [d-m] matches any letter
between d and m, inclusive. If a minus character is required in a
class, it must be escaped with a backslash or appear in a position
where it cannot be interpreted as indicating a range, typically as the
first or last character in the class.
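
The two ways of getting a literal hyphen into a class can be compared
in a sketch (Python's re module as a stand-in for PCRE, which is
assumed to treat these classes the same way):

```python
import re

in_range = re.fullmatch(r"[d-m]", "h")      # h lies between d and m

escaped_hyphen = re.compile(r"[a\-z]")      # a, hyphen, or z
leading_hyphen = re.compile(r"[-az]")       # the same set, hyphen first

for cls in (escaped_hyphen, leading_hyphen):
    # The hyphen matches literally; "m" does not, since no range is formed.
    print(cls.fullmatch("-") is not None, cls.fullmatch("m") is None)
```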

It is not possible to have the literal character "]" as the end
character of a range. A pattern such as [W-]46] is interpreted as a
class of two characters ("W" and "-") followed by a literal string
"46]", so it would match "W46]" or "-46]". However, if the "]" is
escaped with a backslash it is interpreted as the end of range, so
[W-\]46] is interpreted as a single class containing a range followed
by two separate characters. The octal or hexadecimal representation of
"]" can also be used to end a range.

Ranges operate in the collating sequence of character values. They can
also be used for characters specified numerically, for example
[\000-\037]. In UTF-8 mode, ranges can include characters whose values
are greater than 255, for example [\x{100}-\x{2ff}].

If a range that includes letters is used when caseless matching is set,
it matches the letters in either case. For example, [W-c] is equivalent
to [][\\^_`wxyzabc], matched caselessly, and if character tables for the
"fr" locale are in use, [\xc8-\xcb] matches accented E characters in
both cases.
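
The [W-c] example can be checked with a sketch. Python's re.IGNORECASE
is used as a stand-in for PCRE's caseless matching and is assumed to
agree with it for ASCII letters.

```python
import re

caseless = re.compile(r"[W-c]", re.IGNORECASE)
caseful = re.compile(r"[W-c]")

# "x" is outside W..c, but its upper case form "X" is inside.
print(caseless.fullmatch("x") is not None, caseful.fullmatch("x") is None)

# "B" is outside W..c, but its lower case form "b" is inside.
print(caseless.fullmatch("B") is not None, caseful.fullmatch("B") is None)

# "d" is outside the range in both cases.
print(caseless.fullmatch("d") is None)
```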

The character types \d, \D, \s, \S, \w, and \W may also appear in a
character class, and add the characters that they match to the class.
For example, [\dABCDEF] matches any hexadecimal digit. A circumflex can
conveniently be used with the upper case character types to specify a
more restricted set of characters than the matching lower case type.
For example, the class [^\W_] matches any letter or digit, but not
underscore.
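
Both example classes can be tried in a sketch. Python's re module is a
stand-in here, and the re.ASCII flag is an added assumption so that \d
and \W are limited to ASCII, as in PCRE's default tables.

```python
import re

hex_digit = re.compile(r"[\dABCDEF]+", re.ASCII)
alnum_only = re.compile(r"[^\W_]+", re.ASCII)

# Only upper case A-F were added to the class, so "7f" is rejected.
print(hex_digit.fullmatch("7F3A") is not None, hex_digit.fullmatch("7f") is None)

# [^\W_] is "word characters minus underscore".
print(alnum_only.fullmatch("abc123") is not None, alnum_only.fullmatch("a_b") is None)
```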

All non-alphameric characters other than \, -, ^ (at the start) and the
terminating ] are non-special in character classes, but it does no harm
if they are escaped.

POSIX CHARACTER CLASSES

Perl supports the POSIX notation for character classes, which uses
names enclosed by [: and :] within the enclosing square brackets. PCRE
also supports this notation. For example,

  [01[:alpha:]%]

matches "0", "1", any alphabetic character, or "%". The supported class
names are

  alnum    letters and digits
  alpha    letters
  ascii    character codes 0 - 127
  blank    space or tab only
  cntrl    control characters
  digit    decimal digits (same as \d)
  graph    printing characters, excluding space
  lower    lower case letters
  print    printing characters, including space
  punct    printing characters, excluding letters and digits
  space    white space (not quite the same as \s)
  upper    upper case letters
  word     "word" characters (same as \w)
  xdigit   hexadecimal digits

The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
and space (32). Notice that this list includes the VT character (code
11). This makes "space" different to \s, which does not include VT (for
Perl compatibility).
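
The difference between the two whitespace sets is a single code point,
which a short set comparison makes explicit. The sets are copied from
the text above; no regular expression engine is involved, and the
variable names are made up for this sketch.

```python
# Character codes matched by \s as described in this document.
PCRE_BACKSLASH_S = {9, 10, 12, 13, 32}      # HT, LF, FF, CR, space

# Character codes in the POSIX "space" class.
POSIX_SPACE = {9, 10, 11, 12, 13, 32}       # the same, plus VT

only_in_posix = POSIX_SPACE - PCRE_BACKSLASH_S
print(only_in_posix)  # {11}
```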

The name "word" is a Perl extension, and "blank" is a GNU extension
from Perl 5.8. Another Perl extension is negation, which is indicated
by a ^ character after the colon. For example,

  [12[:^digit:]]

matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
these are not supported, and an error is given if they are encountered.

In UTF-8 mode, characters with values greater than 255 do not match any
of the POSIX character classes.
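
Engines without POSIX class support need these names written out as
explicit ranges. The Python sketch below does that for a handful of
names. The POSIX_ASCII table and the expand_posix() helper are
hypothetical names made up for this illustration, the expansions assume
plain ASCII, negated names such as [:^digit:] are not handled, and
Python's re module is used only as a stand-in to check the result.

```python
import re

# ASCII-only expansions for a few of the class names listed above.
# Assumption for this sketch: no locale-specific letters are involved.
POSIX_ASCII = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "blank": r" \t",
    "digit": "0-9",
    "lower": "a-z",
    "space": r" \t\n\x0b\f\r",   # HT, LF, VT, FF, CR and space
    "upper": "A-Z",
    "xdigit": "0-9A-Fa-f",
}

def expand_posix(pattern):
    """Replace each [:name:] with the explicit ranges from POSIX_ASCII."""
    return re.sub(r"\[:([a-z]+):\]", lambda m: POSIX_ASCII[m.group(1)], pattern)

expanded = expand_posix("[01[:alpha:]%]")
print(expanded)  # [01a-zA-Z%]

for ch in ("0", "q", "%", "2"):
    print(repr(ch), bool(re.fullmatch(expanded, ch)))
```

The helper does not check that a name really sits inside square
brackets, which is acceptable for a sketch but not for general use.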

VERTICAL BAR
Vertical bar characters are used to separate alternative patterns. For
example, the pattern

    gilbert|sullivan

matches either "gilbert" or "sullivan". Any number of alternatives may
appear, and an empty alternative is permitted (matching the empty
string). The matching process tries each alternative in turn, from
left to right, and the first one that succeeds is used. If the
alternatives are within a subpattern (defined below), "succeeds" means
matching the rest of the main pattern as well as the alternative in the
subpattern.
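
The left-to-right rule can be observed with a short Python sketch.
Python's re module is not PCRE, but it follows the same rule for
alternation; the patterns other than gilbert|sullivan are made up for
illustration.

```python
import re

# The first alternative that lets the whole pattern succeed is used,
# not the longest one.
first = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes the rest of the pattern:
# choosing "a" would leave "bc" for the trailing c, so "ab" is used.
grouped = re.fullmatch(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty = re.fullmatch(r"cat(s|)", "cat").group(1)

print(first)        # a
print(grouped)      # ab
print(repr(empty))  # ''
print(re.search(r"gilbert|sullivan", "arthur sullivan").group())  # sullivan
```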

INTERNAL OPTION SETTING
The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
PCRE_EXTENDED options can be changed from within the pattern by a
sequence of Perl option letters enclosed between "(?" and ")". The
option letters are

    i  for PCRE_CASELESS
    m  for PCRE_MULTILINE
    s  for PCRE_DOTALL
    x  for PCRE_EXTENDED

For example, (?im) sets caseless, multiline matching. It is also
possible to unset these options by preceding the letter with a hyphen,
and a combined setting and unsetting such as (?im-sx), which sets
PCRE_CASELESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and
PCRE_EXTENDED, is also permitted. If a letter appears both before and
after the hyphen, the option is unset.

When an option change occurs at top level (that is, not inside
subpattern parentheses), the change applies to the remainder of the
pattern that follows. If the change is placed right at the start of a
pattern, PCRE extracts it into the global options (and it will
therefore show up in data extracted by the pcre_fullinfo() function).
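
As a rough illustration of a setting at the very start of a pattern,
the following sketch uses Python's re module as a stand-in for PCRE.
This is an assumption of the sketch: recent Python versions accept
such a setting only at the start of the pattern, so the mid-pattern
changes described in this section cannot be reproduced with it. The
subject string is invented.

```python
import re

# (?im) at the very start behaves like passing re.I | re.M explicitly.
inline = re.compile(r"(?im)^abc$")
explicit = re.compile(r"^abc$", re.I | re.M)

subject = "ABC\nabc\nxabc"
print(inline.findall(subject))    # ['ABC', 'abc']
print(explicit.findall(subject))  # ['ABC', 'abc']

# The leading setting is folded into the compiled pattern's flags,
# comparable to the global options reported by pcre_fullinfo().
print(bool(inline.flags & re.I), bool(inline.flags & re.M))  # True True
```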

An option change within a subpattern affects only that part of the
current pattern that follows it, so

    (a(?i)b)c

matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
used). By this means, options can be made to have different settings
in different parts of the pattern. Any changes made in one alternative
do carry on into subsequent branches within the same subpattern. For
example,

    (a(?i)b|c)

matches "ab", "aB", "c", and "C", even though when matching "C" the
first branch is abandoned before the option setting. This is because
the effects of option settings happen at compile time. There would be
some very weird behaviour otherwise.

The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed
in the same way as the Perl-compatible options by using the characters
U and X respectively. The (?X) flag setting is special in that it must
always occur earlier in the pattern than any of the additional features
it turns on, even when it is at top level. It is best put at the start.

SUBPATTERNS
Subpatterns are delimited by parentheses (round brackets), which can be
nested. Marking part of a pattern as a subpattern does two things:

1. It localizes a set of alternatives. For example, the pattern

    cat(aract|erpillar|)

matches one of the words "cat", "cataract", or "caterpillar". Without
the parentheses, it would match "cataract", "erpillar" or the empty
string.

2. It sets up the subpattern as a capturing subpattern (as defined
above). When the whole pattern matches, that portion of the subject
string that matched the subpattern is passed back to the caller via the
ovector argument of pcre_exec(). Opening parentheses are counted from
left to right (starting from 1) to obtain the numbers of the capturing
subpatterns.

For example, if the string "the red king" is matched against the
pattern

    the ((red|white) (king|queen))

the captured substrings are "red king", "red", and "king", and are
numbered 1, 2, and 3, respectively.

The fact that plain parentheses fulfil two functions is not always
helpful. There are often times when a grouping subpattern is required
without a capturing requirement. If an opening parenthesis is followed
by a question mark and a colon, the subpattern does not do any
capturing, and is not counted when computing the number of any
subsequent capturing subpatterns. For example, if the string "the white
queen" is matched against the pattern

    the ((?:red|white) (king|queen))

the captured substrings are "white queen" and "queen", and are numbered
1 and 2. The maximum number of capturing subpatterns is 65535, and the
maximum depth of nesting of all subpatterns, both capturing and
noncapturing, is 200.
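
The numbering of capturing and non-capturing parentheses can be
checked with a short sketch. Python's re module is used here as a
stand-in for PCRE, which is an assumption of the sketch; it numbers
groups by opening parenthesis in the same way.

```python
import re

capturing = re.match(r"the ((red|white) (king|queen))", "the red king")
print(capturing.groups())      # ('red king', 'red', 'king')

# (?: ... ) groups without capturing, so the later group moves up to 2.
non_capturing = re.match(r"the ((?:red|white) (king|queen))",
                         "the white queen")
print(non_capturing.groups())  # ('white queen', 'queen')
```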

As a convenient shorthand, if any option settings are required at the
start of a non-capturing subpattern, the option letters may appear
between the "?" and the ":". Thus the two patterns

    (?i:saturday|sunday)
    (?:(?i)saturday|sunday)

match exactly the same set of strings. Because alternative branches are
tried from left to right, and options are not reset until the end of
the subpattern is reached, an option setting in one branch does affect
subsequent branches, so the above patterns match "SUNDAY" as well as
"Saturday".

NAMED SUBPATTERNS
Identifying capturing parentheses by number is simple, but it can be
very hard to keep track of the numbers in complicated regular
expressions. Furthermore, if an expression is modified, the numbers may
change. To help with the difficulty, PCRE supports the naming of
subpatterns, something that Perl does not provide. The Python syntax
(?P<name>...) is used. Names consist of alphanumeric characters and
underscores, and must be unique within a pattern.

Named capturing parentheses are still allocated numbers as well as
names. The PCRE API provides function calls for extracting the name-to-
number translation table from a compiled pattern. For further details
see the pcreapi documentation.
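
Since the syntax is borrowed from Python, Python's own re module can
illustrate it. The date pattern and the group names below are invented
for this sketch, and groupindex is Python's counterpart of the
name-to-number table, not a PCRE API call.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
m = date.match("2003-01-15")

# A named group can be read by name or by its allocated number.
print(m.group("year"), m.group("month"), m.group("day"))  # 2003 01 15
print(m.group(1) == m.group("year"))                       # True

# Name-to-number translation table of the compiled pattern.
print(dict(date.groupindex))  # {'year': 1, 'month': 2, 'day': 3}
```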

REPETITION
Repetition is specified by quantifiers, which can follow any of the
following items:

    a literal data character
    the . metacharacter
    the \C escape sequence
    escapes such as \d that match single characters
    a character class
    a back reference (see next section)
    a parenthesized subpattern (unless it is an assertion)

The general repetition quantifier specifies a minimum and maximum
number of permitted matches, by giving the two numbers in curly
brackets (braces), separated by a comma. The numbers must be less than
65536, and the first must be less than or equal to the second. For
example:

    z{2,4}

matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
special character. If the second number is omitted, but the comma is
present, there is no upper limit; if the second number and the comma
are both omitted, the quantifier specifies an exact number of required
matches. Thus

    [aeiou]{3,}

matches at least 3 successive vowels, but may match many more, while

    \d{8}

matches exactly 8 digits. An opening curly bracket that appears in a
position where a quantifier is not allowed, or one that does not match
the syntax of a quantifier, is taken as a literal character. For
example, {,6} is not a quantifier, but a literal string of four
characters.
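
The counted quantifiers can be tried out with a short sketch. Python's
re module is used as a stand-in for PCRE, which is an assumption of the
sketch, and the subject strings are invented.

```python
import re

# Between two and four z's; the greedy match takes four of the five.
runs = re.findall(r"z{2,4}", "z zz zzz zzzzz")
print(runs)  # ['zz', 'zzz', 'zzzz']

# At least three vowels, no upper limit.
print(bool(re.fullmatch(r"[aeiou]{3,}", "aeiou")))  # True
print(bool(re.fullmatch(r"[aeiou]{3,}", "ae")))     # False

# Exactly eight digits.
print(bool(re.fullmatch(r"\d{8}", "20030115")))     # True
print(bool(re.fullmatch(r"\d{8}", "2003011")))      # False
```

One difference worth knowing: Python's re module treats {,6} as a
quantifier meaning zero to six repeats, so the literal-brace rule
described above does not carry over to it.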

In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
individual bytes. Thus, for example, \x{100}{2} matches two UTF-8
characters, each of which is represented by a two-byte sequence.

The quantifier {0} is permitted, causing the expression to behave as if
the previous item and the quantifier were not present.

For convenience (and historical compatibility) the three most common
quantifiers have single-character abbreviations:

    *    is equivalent to {0,}
    +    is equivalent to {1,}
    ?    is equivalent to {0,1}

It is possible to construct infinite loops by following a subpattern
that can match no characters with a quantifier that has no upper limit,
for example:

    (a?)*

Earlier versions of Perl and PCRE used to give an error at compile time
for such patterns. However, because there are cases where this can be
useful, such patterns are now accepted, but if any repetition of the
subpattern does in fact match no characters, the loop is forcibly
broken.

By default, the quantifiers are "greedy", that is, they match as much
as possible (up to the maximum number of permitted times), without
causing the rest of the pattern to fail. The classic example of where
this gives problems is in trying to match comments in C programs. These
appear between the sequences /* and */ and within the sequence,
individual * and / characters may appear. An attempt to match C
comments by applying the pattern

    /\*.*\*/

to the string

    /* first comment */ not comment /* second comment */

fails, because it matches the entire string owing to the greediness of
the .* item.

However, if a quantifier is followed by a question mark, it ceases to
be greedy, and instead matches the minimum number of times possible, so
the pattern

    /\*.*?\*/

does the right thing with the C comments. The meaning of the various
quantifiers is not otherwise changed, just the preferred number of
matches. Do not confuse this use of question mark with its use as a
quantifier in its own right. Because it has two uses, it can sometimes
appear doubled, as in

    \d??\d

which matches one digit by preference, but can match two if that is the
only way the rest of the pattern matches.
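
The difference between greedy and minimizing repeats can be seen in a
short sketch. Python's re module stands in for PCRE here, which is an
assumption of the sketch, and the subject string is invented.

```python
import re

subject = "/* one */ x = 1; /* two */"

greedy = re.findall(r"/\*.*\*/", subject)
lazy = re.findall(r"/\*.*?\*/", subject)

print(greedy)  # ['/* one */ x = 1; /* two */']
print(lazy)    # ['/* one */', '/* two */']

# ?? prefers zero repeats, but takes one when the rest needs it.
print(re.match(r"\d??\d", "12").group())      # 1
print(re.fullmatch(r"\d??\d", "12").group())  # 12
```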

If the PCRE_UNGREEDY option is set (an option which is not available in
Perl), the quantifiers are not greedy by default, but individual ones
can be made greedy by following them with a question mark. In other
words, it inverts the default behaviour.

When a parenthesized subpattern is quantified with a minimum repeat
count that is greater than 1 or with a limited maximum, more memory is
required for the compiled pattern, in proportion to the size of the
minimum or maximum.

If a pattern starts with .* or .{0,} and the PCRE_DOTALL option
(equivalent to Perl's /s) is set, thus allowing the . to match
newlines, the pattern is implicitly anchored, because whatever follows
will be tried against every character position in the subject string,
so there is no point in retrying the overall match at any position
after the first. PCRE normally treats such a pattern as though it were
preceded by \A.

In cases where it is known that the subject string contains no
newlines, it is worth setting PCRE_DOTALL in order to obtain this
optimization, or alternatively using ^ to indicate anchoring
explicitly.

However, there is one situation where the optimization cannot be used.
When .* is inside capturing parentheses that are the subject of a
backreference elsewhere in the pattern, a match at the start may fail,
and a later one succeed. Consider, for example:

    (.*)abc\1

If the subject is "xyz123abc123" the match point is the fourth
character. For this reason, such a pattern is not implicitly anchored.
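
The match point can be confirmed with a short sketch. Python's re
module stands in for PCRE, which is an assumption of the sketch; note
that its offsets count from zero, so offset 3 is the fourth character.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

print(m.start())   # 3
print(m.group(0))  # 123abc123
print(m.group(1))  # 123
```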

When a capturing subpattern is repeated, the value captured is the
substring that matched the final iteration. For example, after

    (tweedle[dume]{3}\s*)+

has matched "tweedledum tweedledee" the value of the captured substring
is "tweedledee". However, if there are nested capturing subpatterns,
the corresponding captured values may have been set in previous
iterations. For example, after

    /(a|(b))+/

matches "aba" the value of the second captured substring is "b".
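
Both cases can be reproduced in a short sketch. Python's re module is
used as a stand-in for PCRE, which is an assumption of the sketch; it
also keeps the value a nested group received in an earlier iteration.

```python
import re

# The outer group holds what its final iteration matched.
m = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
print(m.group(1))  # tweedledee

# Group 2 was last set in the second iteration and keeps that value.
n = re.match(r"(a|(b))+", "aba")
print(n.group(1), n.group(2))  # a b
```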

ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
With both maximizing and minimizing repetition, failure of what follows
normally causes the repeated item to be re-evaluated to see if a
different number of repeats allows the rest of the pattern to match.
Sometimes it is useful to prevent this, either to change the nature of
the match, or to cause it to fail earlier than it otherwise might, when
the author of the pattern knows there is no point in carrying on.

Consider, for example, the pattern \d+foo when applied to the subject
line

    123456bar

After matching all 6 digits and then failing to match "foo", the normal
action of the matcher is to try again with only 5 digits matching the
\d+ item, and then with 4, and so on, before ultimately failing.
"Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
the means for specifying that once a subpattern has matched, it is not
to be re-evaluated in this way.

If we use atomic grouping for the previous example, the matcher would
give up immediately on failing to match "foo" the first time. The
notation is a kind of special parenthesis, starting with (?> as in this
example:

    (?>\d+)foo

This kind of parenthesis "locks up" the part of the pattern it
contains once it has matched, and a failure further into the pattern is
prevented from backtracking into it. Backtracking past it to previous
items, however, works as normal.

An alternative description is that a subpattern of this type matches
the string of characters that an identical standalone pattern would
match, if anchored at the current point in the subject string.

Atomic grouping subpatterns are not capturing subpatterns. Simple cases
such as the above example can be thought of as a maximizing repeat that
must swallow everything it can. So, while both \d+ and \d+? are
prepared to adjust the number of digits they match in order to make the
rest of the pattern match, (?>\d+) can only match an entire sequence of
digits.
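
How an atomic group changes the nature of a match can be sketched in
Python. Python's re module only gained (?>...) in version 3.11, so the
sketch emulates the atomic group with a lookahead followed by a back
reference, a common idiom that relies on a lookahead never being
re-entered once it has succeeded. The emulation, the group name "run"
and the subject are assumptions of this sketch, not a description of
how PCRE implements atomic groups.

```python
import re

subject = "123456"

# Ordinary greedy repeat: \d+ hands back the last digit so "6" can match.
normal = re.search(r"\d+6", subject)

# Emulated (?>\d+)6: the lookahead captures the whole digit run, the
# back reference consumes it, and nothing can be handed back to "6".
atomic = re.search(r"(?=(?P<run>\d+))(?P=run)6", subject)

print(normal.group())  # 123456
print(atomic)          # None
```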

Atomic groups in general can of course contain arbitrarily complicated
subpatterns, and can be nested. However, when the subpattern for an
atomic group is just a single repeated item, as in the example above, a
simpler notation, called a "possessive quantifier" can be used. This
consists of an additional + character following a quantifier. Using
this notation, the previous example can be rewritten as

    \d++foo

Possessive quantifiers are always greedy; the setting of the
PCRE_UNGREEDY option is ignored. They are a convenient notation for the
simpler forms of atomic group. However, there is no difference in the
meaning or processing of a possessive quantifier and the equivalent
atomic group.

The possessive quantifier syntax is an extension to the Perl syntax. It
originates in Sun's Java package.

When a pattern contains an unlimited repeat inside a subpattern that
can itself be repeated an unlimited number of times, the use of an
atomic group is the only way to avoid some failing matches taking a
very long time indeed. The pattern

    (\D+|<\d+>)*[!?]

matches an unlimited number of substrings that either consist of non-
digits, or digits enclosed in <>, followed by either ! or ?. When it
matches, it runs quickly. However, if it is applied to

    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

it takes a long time before reporting failure. This is because the
string can be divided between the two repeats in a large number of
ways, and all have to be tried. (The example used [!?] rather than a
single character at the end, because both PCRE and Perl have an
optimization that allows for fast failure when a single character is
used. They remember the last single character that is required for a
match, and fail early if it is not present in the string.) If the
pattern is changed to

    ((?>\D+)|<\d+>)*[!?]

sequences of non-digits cannot be broken, and failure happens quickly.
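
The quick failure can be sketched in Python. Because Python's re
module only supports (?>...) from version 3.11 on, the atomic group is
emulated here with a lookahead and a back reference; that emulation,
the group name "run" and the subject length of 52 are assumptions of
this sketch. The unguarded pattern is deliberately not run, since it
would take a very long time on the same subject.

```python
import re

subject = "a" * 52

# Emulation of ((?>\D+)|<\d+>)*[!?]: a run of non-digits is taken as
# a whole and cannot be split between iterations of the outer repeat.
guarded = re.compile(r"(?:(?=(?P<run>\D+))(?P=run)|<\d+>)*[!?]")

print(guarded.match(subject))          # None
print(guarded.match("<12>!").group())  # <12>!
```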

BACK REFERENCES
Outside a character class, a backslash followed by a digit greater than
0 (and possibly further digits) is a back reference to a capturing
subpattern earlier (that is, to its left) in the pattern, provided
there have been that many previous capturing left parentheses.

However, if the decimal number following the backslash is less than 10,
it is always taken as a back reference, and causes an error only if
there are not that many capturing left parentheses in the entire
pattern. In other words, the parentheses that are referenced need not
be to the left of the reference for numbers less than 10. See the
section entitled "Backslash" above for further details of the handling
of digits following a backslash.

A back reference matches whatever actually matched the capturing
subpattern in the current subject string, rather than anything matching
the subpattern itself (see "Subpatterns as subroutines" below for a way
of doing that). So the pattern

    (sens|respons)e and \1ibility

matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If caseful matching is in force at the
time of the back reference, the case of letters is relevant. For
example,

    ((?i)rah)\s+\1

matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
original capturing subpattern is matched caselessly.
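
Both patterns can be tried in a short sketch. Python's re module is
used as a stand-in for PCRE, which is an assumption of the sketch.
Recent Python versions reject an option setting in the middle of a
pattern, so the second pattern uses the scoped form (?i:...) instead of
(?i); the effect on the back reference is the same.

```python
import re

pair = re.compile(r"(sens|respons)e and \1ibility")
print(bool(pair.fullmatch("sense and sensibility")))        # True
print(bool(pair.fullmatch("response and responsibility")))  # True
print(bool(pair.fullmatch("sense and responsibility")))     # False

# The group is matched caselessly, the back reference is not.
rah = re.compile(r"((?i:rah))\s+\1")
print(bool(rah.fullmatch("rah rah")))  # True
print(bool(rah.fullmatch("RAH RAH")))  # True
print(bool(rah.fullmatch("RAH rah")))  # False
```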

Back references to named subpatterns use the Python syntax (?P=name).
We could rewrite the above example as follows:

    (?P<p1>(?i)rah)\s+(?P=p1)

There may be more than one back reference to the same subpattern. If a
subpattern has not actually been used in a particular match, any back
references to it always fail. For example, the pattern

    (a|(bc))\2

always fails if it starts to match "a" rather than "bc". Because there
may be many capturing parentheses in a pattern, all digits following
the backslash are taken as part of a potential back reference number.
If the pattern continues with a digit character, some delimiter must be
used to terminate the back reference. If the PCRE_EXTENDED option is
set, this can be whitespace. Otherwise an empty comment can be used.
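
The points above can be illustrated with a short sketch. Python's re
module is used as a stand-in for PCRE, which is an assumption of the
sketch; the scoped form (?i:...) replaces the mid-pattern (?i), which
recent Python versions reject, and the last pattern is invented.

```python
import re

# Named back reference in the Python syntax.
named = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")
print(bool(named.fullmatch("rah rah")))  # True
print(bool(named.fullmatch("RAH rah")))  # False

# Group 2 is unset when the first branch matches, so \2 cannot match.
unset = re.compile(r"(a|(bc))\2")
print(unset.match("a"))             # None
print(unset.match("bcbc").group())  # bcbc

# An empty comment keeps the following digit out of the reference.
delimited = re.compile(r"(x)\1(?#)1")
print(bool(delimited.fullmatch("xx1")))  # True
```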

A back reference that occurs inside the parentheses to which it refers
fails when the subpattern is first used, so, for example, (a\1) never
matches. However, such references can be useful inside repeated
subpatterns. For example, the pattern

    (a|b\1)+

matches any number of "a"s and also "aba", "ababbaa" etc. At each
iteration of the subpattern, the back reference matches the character
string corresponding to the previous iteration. In order for this to
work, the pattern must be such that the first iteration does not need
to match the back reference. This can be done using alternation, as in
the example above, or by a quantifier with a minimum of zero.

ASSERTIONS
An assertion is a test on the characters following or preceding the
current matching point that does not actually consume any characters.
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
described above. More complicated assertions are coded as subpatterns.
There are two kinds: those that look ahead of the current position in
the subject string, and those that look behind it.

An assertion subpattern is matched in the normal way, except that it
does not cause the current matching position to be changed. Lookahead
assertions start with (?= for positive assertions and (?! for negative
assertions. For example,

    \w+(?=;)

matches a word followed by a semicolon, but does not include the
semicolon in the match, and

    foo(?!bar)

matches any occurrence of "foo" that is not followed by "bar". Note
that the apparently similar pattern

    (?!foo)bar

does not find an occurrence of "bar" that is preceded by something
other than "foo"; it finds any occurrence of "bar" whatsoever, because
the assertion (?!foo) is always true when the next three characters are
"bar". A lookbehind assertion is needed to achieve this effect.

If you want to force a matching failure at some point in a pattern, the
most convenient way to do it is with (?!) because an empty string
always matches, so an assertion that requires there not to be an empty
string must always fail.
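
The lookahead forms can be tried in a short sketch. Python's re module
is used as a stand-in for PCRE, which is an assumption of the sketch,
and the subject strings are invented.

```python
import re

# Positive lookahead: the semicolon is required but not consumed.
words = re.findall(r"\w+(?=;)", "alpha; beta gamma;")
print(words)  # ['alpha', 'gamma']

# Negative lookahead: only the "foo" not followed by "bar" is found.
foos = re.findall(r"foo(?!bar)", "foobar foobaz")
print(foos)   # ['foo']

# (?!foo)bar still finds the "bar" inside "foobar".
print(re.search(r"(?!foo)bar", "foobar").start())  # 3

# (?!) can never succeed, so it forces a failure.
print(re.search(r"a(?!)", "aaa"))                  # None
```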

Lookbehind assertions start with (?<= for positive assertions and (?<!
for negative assertions. For example,

    (?<!foo)bar

does find an occurrence of "bar" that is not preceded by "foo". The
contents of a lookbehind assertion are restricted such that all the
strings it matches must have a fixed length. However, if there are
several alternatives, they do not all have to have the same fixed
length. Thus

    (?<=bullock|donkey)

is permitted, but

    (?<!dogs?|cats?)

causes an error at compile time. Branches that match different length
strings are permitted only at the top level of a lookbehind assertion.
This is an extension compared with Perl (at least for 5.8), which
requires all branches to match the same length of string. An assertion
such as

    (?<=ab(c|de))

is not permitted, because its single top-level branch can match two
different lengths, but it is acceptable if rewritten to use two
top-level branches:

    (?<=abc|abde)

The implementation of lookbehind assertions is, for each alternative,
to temporarily move the current position back by the fixed width and
then try to match. If there are insufficient characters before the
current position, the match is deemed to fail.
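
The difference between a lookahead placed before "bar" and a real
lookbehind can be tried out with a small sketch. It uses Python's re
module as a stand-in, which is an assumption of this example and not
part of PCRE or LDMud; the helper names are made up for illustration.
Python is stricter than PCRE here: it wants one fixed width for the
whole lookbehind, so alternatives of different lengths have to be
written as separate assertions.

```python
import re

def starts(pattern, subject):
    """Return the start offset of every match of pattern in subject."""
    return [m.start() for m in re.finditer(pattern, subject)]

def compiles(pattern):
    """Report whether Python's re accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# A lookahead before "bar" is tested at the place where "bar" begins, so it
# always succeeds there and every "bar" is found.
LOOKAHEAD_BEFORE_BAR = r"(?!foo)bar"

# A lookbehind inspects the characters before the current position instead.
BAR_NOT_AFTER_FOO = r"(?<!foo)bar"

# Python-only workaround (hypothetical name): PCRE accepts (?<=bullock|donkey)
# directly, Python's re needs one lookbehind per alternative length.
S_AFTER_ANIMAL = r"(?:(?<=bullock)|(?<=donkey))s"

if __name__ == "__main__":
    print(starts(LOOKAHEAD_BEFORE_BAR, "foobar bar"))
    print(starts(BAR_NOT_AFTER_FOO, "foobar bar"))
    print(compiles(r"(?<!dogs?|cats?)"))
```

The last call reports False: a branch such as dogs? has no fixed
length, which both Python and PCRE reject at compile time.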

PCRE does not allow the \C escape (which matches a single byte in UTF-8
mode) to appear in lookbehind assertions, because it makes it
impossible to calculate the length of the lookbehind.

Atomic groups can be used in conjunction with lookbehind assertions to
specify efficient matching at the end of the subject string. Consider a
simple pattern such as

    abcd$

when applied to a long string that does not match. Because matching
proceeds from left to right, PCRE will look for each "a" in the subject
and then see if what follows matches the rest of the pattern. If the
pattern is specified as

    ^.*abcd$

the initial .* matches the entire string at first, but when this fails
(because there is no following "a"), it backtracks to match all but the
last character, then all but the last two characters, and so on. Once
again the search for "a" covers the entire string, from right to left,
so we are no better off. However, if the pattern is written as

    ^(?>.*)(?<=abcd)

or, equivalently,

    ^.*+(?<=abcd)

there can be no backtracking for the .* item; it can match only the
entire string. The subsequent lookbehind assertion does a single test
on the last four characters. If it fails, the match fails immediately.
For long strings, this approach makes a significant difference to the
processing time.
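
The same idea can be sketched in Python's re module, used here only as
a stand-in for PCRE. Older Python versions have no (?>...) syntax, so
this sketch swaps in a well-known emulation: a lookahead captures the
rest of the line and a backreference consumes it. That gives the same
"match everything, never give anything back" behaviour. The names are
hypothetical.

```python
import re

# Stand-in for ^(?>.*)(?<=abcd): the lookahead captures the rest of the line
# once, and \1 then consumes exactly that text, so there is nothing left to
# backtrack into before the lookbehind is tested.
ENDS_WITH_ABCD = re.compile(r"^(?=(.*))\1(?<=abcd)")

def ends_with_abcd(subject):
    """True if the single-line subject ends with the four characters abcd."""
    return ENDS_WITH_ABCD.search(subject) is not None

if __name__ == "__main__":
    for subject in ("xxabcd", "abcdxx", "abc"):
        print(subject, ends_with_abcd(subject))
```

With a subject shorter than four characters the lookbehind has too few
characters to inspect, so the match fails, as described above.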

Several assertions (of any sort) may occur in succession. For example,

    (?<=\d{3})(?<!999)foo

matches "foo" preceded by three digits that are not "999". Notice that
each of the assertions is applied independently at the same point in
the subject string. First there is a check that the previous three
characters are all digits, and then there is a check that the same
three characters are not "999". This pattern does not match "foo"
preceded by six characters, the first three of which are digits and the
last three of which are not "999". For example, it doesn't match
"123abcfoo". A pattern to do that is

    (?<=\d{3}...)(?<!999)foo

This time the first assertion looks at the preceding six characters,
checking that the first three are digits, and then the second assertion
checks that the preceding three characters are not "999".
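
Both patterns happen to be valid in Python's re module as well, so
their behaviour can be checked with a short sketch. Python is an
assumption of this example, not something LDMud provides, and the
constant names are illustrative only.

```python
import re

# Both assertions look at the same three characters before "foo".
SAME_THREE = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks back six characters, the second only three.
SIX_BACK = re.compile(r"(?<=\d{3}...)(?<!999)foo")

def matches(pattern, subject):
    """True if the compiled pattern matches anywhere in subject."""
    return pattern.search(subject) is not None

if __name__ == "__main__":
    for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
        print(subject, matches(SAME_THREE, subject), matches(SIX_BACK, subject))
```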

Assertions can be nested in any combination. For example,

    (?<=(?<!foo)bar)baz

matches an occurrence of "baz" that is preceded by "bar" which in turn
is not preceded by "foo", while

    (?<=\d{3}(?!999)...)foo

is another pattern which matches "foo" preceded by three digits and any
three characters that are not "999".
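
A small sketch of the two nested patterns, again using Python's re
module as an assumed stand-in for PCRE (both patterns have a fixed
lookbehind width, which is what Python requires). The names are
hypothetical.

```python
import re

# "baz" preceded by "bar", where that "bar" is itself not preceded by "foo".
BAZ_AFTER_PLAIN_BAR = re.compile(r"(?<=(?<!foo)bar)baz")

# "foo" preceded by three digits and then three characters other than "999".
FOO_AFTER_DIGITS = re.compile(r"(?<=\d{3}(?!999)...)foo")

def matches(pattern, subject):
    """True if the compiled pattern matches anywhere in subject."""
    return pattern.search(subject) is not None

if __name__ == "__main__":
    print(matches(BAZ_AFTER_PLAIN_BAR, "xbarbaz"))
    print(matches(BAZ_AFTER_PLAIN_BAR, "foobarbaz"))
    print(matches(FOO_AFTER_DIGITS, "123abcfoo"))
    print(matches(FOO_AFTER_DIGITS, "123999foo"))
```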

Assertion subpatterns are not capturing subpatterns, and may not be
repeated, because it makes no sense to assert the same thing several
times. If any kind of assertion contains capturing subpatterns within
it, these are counted for the purposes of numbering the capturing
subpatterns in the whole pattern. However, substring capturing is
carried out only for positive assertions, because it does not make
sense for negative assertions.

CONDITIONAL SUBPATTERNS
It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative subpatterns,
depending on the result of an assertion, or whether a previous
capturing subpattern matched or not. The two possible forms of
conditional subpattern are

    (?(condition)yes-pattern)
    (?(condition)yes-pattern|no-pattern)

If the condition is satisfied, the yes-pattern is used; otherwise the
no-pattern (if present) is used. If there are more than two
alternatives in the subpattern, a compile-time error occurs.

There are three kinds of condition. If the text between the parentheses
consists of a sequence of digits, the condition is satisfied if the
capturing subpattern of that number has previously matched. The number
must be greater than zero. Consider the following pattern, which
contains non-significant white space to make it more readable (assume
the PCRE_EXTENDED option) and to divide it into three parts for ease of
discussion:

    ( \( )?    [^()]+    (?(1) \) )

The first part matches an optional opening parenthesis, and if that
character is present, sets it as the first captured substring. The
second part matches one or more characters that are not parentheses.
The third part is a conditional subpattern that tests whether the first
set of parentheses matched or not. If they did, that is, if the subject
started with an opening parenthesis, the condition is true, and so the
yes-pattern is executed and a closing parenthesis is required.
Otherwise, since no-pattern is not present, the subpattern matches
nothing. In other words, this pattern matches a sequence of
non-parentheses, optionally enclosed in parentheses.
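
Conditions that test a group number are also understood by Python's re
module, so the three-part pattern can be tried out as follows. Python
and its re.VERBOSE flag (standing in for PCRE_EXTENDED) are assumptions
of this sketch, and the constant name is made up.

```python
import re

OPTIONALLY_PARENTHESIZED = re.compile(
    r"""
    ( \( )?        # part 1: optional opening parenthesis, captured as group 1
    [^()]+         # part 2: one or more characters that are not parentheses
    (?(1) \) )     # part 3: closing parenthesis, required only if group 1 matched
    """,
    re.VERBOSE,
)

if __name__ == "__main__":
    for subject in ("(abc)", "abc", "(abc", "abc)"):
        # fullmatch() demands that the whole subject is consumed
        print(subject, OPTIONALLY_PARENTHESIZED.fullmatch(subject) is not None)
```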

If the condition is the string (R), it is satisfied if a recursive call
to the pattern or subpattern has been made. At "top level", the
condition is false. This is a PCRE extension. Recursive patterns are
described in the next section.

If the condition is not a sequence of digits or (R), it must be an
assertion. This may be a positive or negative lookahead or lookbehind
assertion. Consider this pattern, again containing non-significant
white space, and with the two alternatives on the second line:

    (?(?=[^a-z]*[a-z])
    \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

The condition is a positive lookahead assertion that matches an
optional sequence of non-letters followed by a letter. In other words,
it tests for the presence of at least one letter in the subject. If a
letter is found, the subject is matched against the first alternative;
otherwise it is matched against the second. This pattern matches
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
letters and dd are digits.
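
Python's re module, used here as an assumed stand-in, does not accept
an assertion as a condition. The same decision can be sketched there by
guarding each alternative with the assertion or its negation. This is
an emulation written for illustration, not the PCRE syntax itself, and
the constant name is hypothetical.

```python
import re

DATE_LIKE = re.compile(
    r"""
    (?: (?=[^a-z]*[a-z])  \d{2}-[a-z]{3}-\d{2}    # a letter lies ahead
      | (?![^a-z]*[a-z])  \d{2}-\d{2}-\d{2}       # no letter lies ahead
    )
    """,
    re.VERBOSE,
)

if __name__ == "__main__":
    for subject in ("12-abc-34", "12-34-56", "12-ab-34", "12-34-56x"):
        print(subject, DATE_LIKE.search(subject) is not None)
```

The last subject is rejected even though it starts with dd-dd-dd: the
trailing letter makes the condition true, so only the first alternative
is tried.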

COMMENTS
The sequence (?# marks the start of a comment which continues up to the
next closing parenthesis. Nested parentheses are not permitted. The
characters that make up a comment play no part in the pattern matching
at all.

If the PCRE_EXTENDED option is set, an unescaped # character outside a
character class introduces a comment that continues up to the next
newline character in the pattern.
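
Both comment styles can be illustrated with Python's re module, whose
re.VERBOSE flag plays the role of PCRE_EXTENDED in this sketch. Python
is an assumption here, and the pattern is an invented example.

```python
import re

# Inline comments: everything from (?# up to the next ) is ignored.
INLINE = re.compile(r"\d{3}(?#area code)-\d{4}(?#line number)")

# Extended mode: an unescaped # starts a comment that runs to end of line,
# and white space in the pattern is not significant.
EXTENDED = re.compile(
    r"""
    \d{3}   # area code
    -
    \d{4}   # line number
    """,
    re.VERBOSE,
)

if __name__ == "__main__":
    for pattern in (INLINE, EXTENDED):
        print(pattern.fullmatch("555-1234") is not None)
```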

RECURSIVE PATTERNS
Consider the problem of matching a string in parentheses, allowing for
unlimited nested parentheses. Without the use of recursion, the best
that can be done is to use a pattern that matches up to some fixed
depth of nesting. It is not possible to handle an arbitrary nesting
depth. Perl has provided an experimental facility that allows regular
expressions to recurse (amongst other things). It does this by
interpolating Perl code in the expression at run time, and the code can
refer to the expression itself. A Perl pattern to solve the parentheses
problem can be created like this:

    $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

The (?p{...}) item interpolates Perl code at run time, and in this case
refers recursively to the pattern in which it appears. Obviously, PCRE
cannot support the interpolation of Perl code. Instead, it supports
some special syntax for recursion of the entire pattern, and also for
individual subpattern recursion.

The special item that consists of (? followed by a number greater than
zero and a closing parenthesis is a recursive call of the subpattern of
the given number, provided that it occurs inside that subpattern. (If
not, it is a "subroutine" call, which is described in the next
section.) The special item (?R) is a recursive call of the entire
regular expression.

For example, this PCRE pattern solves the nested parentheses problem
(assume the PCRE_EXTENDED option is set so that white space is
ignored):

    \( ( (?>[^()]+) | (?R) )* \)

First it matches an opening parenthesis. Then it matches any number of
substrings which can either be a sequence of non-parentheses, or a
recursive match of the pattern itself (that is, a correctly
parenthesized substring). Finally there is a closing parenthesis.

If this were part of a larger pattern, you would not want to recurse
the entire pattern, so instead you could use this:

    ( \( ( (?>[^()]+) | (?1) )* \) )

We have put the pattern into parentheses, and caused the recursion to
refer to them instead of the whole pattern. In a larger pattern,
keeping track of parenthesis numbers can be tricky. It may be more
convenient to use named parentheses instead. For this, PCRE uses
(?P>name), which is an extension to the Python syntax that PCRE uses
for named parentheses (Perl does not provide named parentheses). We
could rewrite the above example as follows:

    (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )

This particular example pattern contains nested unlimited repeats, and
so the use of atomic grouping for matching strings of non-parentheses
is important when applying the pattern to strings that do not match.
For example, when this pattern is applied to

    (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

it yields "no match" quickly. However, if atomic grouping is not used,
the match runs for a very long time indeed because there are so many
different ways the + and * repeats can carve up the subject, and all
have to be tested before failure can be reported.
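
Python's re module has no recursion syntax, so the following sketch
does not run the pattern itself. Instead it spells out, as an ordinary
recursive function, what the recursive pattern accepts. The function
name and return convention are assumptions made for this illustration.

```python
def match_parens(subject, pos=0):
    """Return the offset just past a balanced group starting at pos, or -1.

    Mirrors the shape of the recursive pattern: an opening parenthesis,
    then any number of non-parenthesis runs or nested groups, then a
    closing parenthesis.
    """
    if pos >= len(subject) or subject[pos] != "(":
        return -1
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1
        if ch == "(":
            # the recursive call, corresponding to (?R) or (?1)
            pos = match_parens(subject, pos)
            if pos < 0:
                return -1
        else:
            # consume a whole run of non-parentheses in one step and never
            # split it differently later, like the atomic group (?>[^()]+)
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    return -1  # ran out of subject before the closing parenthesis

if __name__ == "__main__":
    print(match_parens("(ab(cd)ef)"))
    print(match_parens("(" + "a" * 53 + "()"))
```

Because each run of non-parentheses is consumed exactly once, the
unbalanced subject is rejected after a single pass. That is the effect
the atomic group has in the PCRE pattern.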

At the end of a match, the values set for any capturing subpatterns are
those from the outermost level of the recursion at which the subpattern
value is set. If you want to obtain intermediate values, a callout
function can be used (see below and the pcrecallout documentation). If
the pattern above is matched against

    (ab(cd)ef)

the value for the capturing parentheses is "ef", which is the last
value taken on at the top level. If additional parentheses are added,
giving

    \( ( ( (?>[^()]+) | (?R) )* ) \)
       ^                        ^
       ^                        ^

the string they capture is "ab(cd)ef", the contents of the top level
parentheses. If there are more than 15 capturing parentheses in a
pattern, PCRE has to obtain extra memory to store data during a
recursion, which it does by using pcre_malloc, freeing it via pcre_free
afterwards. If no memory can be obtained, the match fails with the
PCRE_ERROR_NOMEMORY error.

Do not confuse the (?R) item with the condition (R), which tests for
recursion. Consider this pattern, which matches text in angle
brackets, allowing for arbitrary nesting. Only digits are allowed in
nested brackets (that is, when recursing), whereas any characters are
permitted at the outer level.

    < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >

In this pattern, (?(R) is the start of a conditional subpattern, with
two different alternatives for the recursive and non-recursive cases.
The (?R) item is the actual recursive call.

SUBPATTERNS AS SUBROUTINES
If the syntax for a recursive subpattern reference (either by number or
by name) is used outside the parentheses to which it refers, it
operates like a subroutine in a programming language. An earlier
example pointed out that the pattern

    (sens|respons)e and \1ibility

matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If instead the pattern

    (sens|respons)e and (?1)ibility

is used, it does match "sense and responsibility" as well as the other
two strings. Such references must, however, follow the subpattern to
which they refer.
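
The contrast between a back reference and a subroutine call can be
sketched in Python's re module, which is an assumption of this example.
Python has no (?1) syntax, so the subroutine call is emulated by
inserting the text of the subpattern a second time. The variable names
are hypothetical.

```python
import re

WORD = r"sens|respons"

# Back reference: \1 must repeat the text that group 1 actually matched.
BACKREF = re.compile(rf"({WORD})e and \1ibility")

# Emulated subroutine call: the subpattern itself is applied again, so it
# may match a different alternative the second time.
SUBROUTINE_LIKE = re.compile(rf"({WORD})e and (?:{WORD})ibility")

if __name__ == "__main__":
    for subject in ("sense and sensibility", "sense and responsibility"):
        print(subject,
              BACKREF.fullmatch(subject) is not None,
              SUBROUTINE_LIKE.fullmatch(subject) is not None)
```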

CALLOUTS
Perl has a feature whereby using the sequence (?{...}) causes arbitrary
Perl code to be obeyed in the middle of matching a regular expression.
This makes it possible, amongst other things, to extract different
substrings that match the same pair of parentheses when there is a
repetition.

PCRE provides a similar feature, but of course it cannot obey arbitrary
Perl code. The feature is called "callout". The caller of PCRE provides
an external function by putting its entry point in the global variable
pcre_callout. By default, this variable contains NULL, which disables
all calling out.

Within a regular expression, (?C) indicates the points at which the
external function is to be called. If you want to identify different
callout points, you can put a number less than 256 after the letter C.
The default value is zero. For example, this pattern has two callout
points:

    (?C1)abc(?C2)def

During matching, when PCRE reaches a callout point (and pcre_callout is
set), the external function is called. It is provided with the number
of the callout, and, optionally, one item of data originally supplied
by the caller of pcre_exec(). The callout function may cause matching
to backtrack, or to fail altogether. A complete description of the
interface to the callout function is given in the pcrecallout
documentation.

DIFFERENCES FROM PERL
This section describes the differences in the ways that PCRE and Perl
handle regular expressions. The differences described here are with
respect to Perl 5.8.

1. PCRE does not have full UTF-8 support. Details of what it does have
are given in the section on UTF-8 support in the main pcre page.

2. PCRE does not allow repeat quantifiers on lookahead assertions.
Perl permits them, but they do not mean what you might think. For
example, (?!a){3} does not assert that the next three characters are
not "a". It just asserts that the next character is not "a" three
times.

3. Capturing subpatterns that occur inside negative lookahead
assertions are counted, but their entries in the offsets vector are
never set. Perl sets its numerical variables from any such patterns
that are matched before the assertion fails to match something
(thereby succeeding), but only if the negative lookahead assertion
contains just one branch.

4. Though binary zero characters are supported in the subject string,
they are not allowed in a pattern string because it is passed as a
normal C string, terminated by zero. The escape sequence "\0" can be
used in the pattern to represent a binary zero.

5. The following Perl escape sequences are not supported: \l, \u, \L,
\U, \P, \p, \N, and \X. In fact these are implemented by Perl's general
string-handling and are not part of its pattern matching engine. If any
of these are encountered by PCRE, an error is generated.

6. PCRE does support the \Q...\E escape for quoting substrings.
Characters in between are treated as literals. This is slightly
different from Perl in that $ and @ are also handled as literals inside
the quotes. In Perl, they cause variable interpolation (but of course
PCRE does not have variables). Note the following examples:

    Pattern            PCRE matches      Perl matches

    \Qabc$xyz\E        abc$xyz           abc followed by the
                                           contents of $xyz
    \Qabc\$xyz\E       abc\$xyz          abc\$xyz
    \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz

The \Q...\E sequence is recognized both inside and outside character
classes.
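
Python's re module has no \Q...\E escape. The closest counterpart,
used here purely as an assumed stand-in, is the re.escape() function,
which quotes a literal string before it is placed in a pattern. The
sketch shows why quoting matters when the literal contains a $.

```python
import re

LITERAL = "abc$xyz"

# Quoted: every character of the literal, including $, matches itself.
QUOTED = re.compile(re.escape(LITERAL))

# Unquoted: $ is an end-of-subject assertion, so "xyz" can never follow it.
UNQUOTED = re.compile(LITERAL)

if __name__ == "__main__":
    print(QUOTED.fullmatch("abc$xyz") is not None)
    print(UNQUOTED.search("abc$xyz") is not None)
```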

7. Fairly obviously, PCRE does not support the (?{code}) and
(?p{code}) constructions. However, there is some experimental support
for recursive patterns using the non-Perl items (?R), (?number) and
(?P>name). Also, the PCRE "callout" feature allows an external function
to be called during pattern matching.

8. There are some differences that are concerned with the settings of
captured strings when part of a pattern is repeated. For example,
matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
unset, but in PCRE it is set to "b".

9. PCRE provides some extensions to the Perl regular expression
facilities:

(a) Although lookbehind assertions must match fixed length strings,
each alternative branch of a lookbehind assertion can match a different
length of string. Perl requires them all to have the same length.

(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
meta-character matches only at the very end of the string.

(c) If PCRE_EXTRA is set, a backslash followed by a letter with no
special meaning is faulted.

(d) If PCRE_UNGREEDY is set, the greediness of the repetition
quantifiers is inverted, that is, by default they are not greedy, but
if followed by a question mark they are.

(e) PCRE_ANCHORED can be used to force a pattern to be tried only at
the first matching position in the subject string.

(f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and
PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no Perl equivalents.

(g) The (?R), (?number), and (?P>name) constructs allow for recursive
pattern matching (Perl can do this using the (?p{code}) construct,
which PCRE cannot support).

(h) PCRE supports named capturing substrings, using the Python syntax.

(i) PCRE supports the possessive quantifier "++" syntax, taken from
Sun's Java package.

(j) The (R) condition, for testing recursion, is a PCRE extension.

(k) The callout facility is PCRE-specific.

NOTES
The \< and \> metacharacters from Henry Spencer's package
are not available in PCRE, but can be emulated with \b,
as required, also in conjunction with \W or \w.
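
One possible way to spell out such an emulation is to combine \b with
a lookaround on \w. The sketch below uses Python's re module as an
assumed stand-in, and the two helper fragments are hypothetical names
chosen for this illustration.

```python
import re

WORD_START = r"\b(?=\w)"    # stands in for \< : boundary followed by a word char
WORD_END = r"\b(?<=\w)"     # stands in for \> : boundary preceded by a word char

WHOLE_WORD_CAT = re.compile(WORD_START + "cat" + WORD_END)

def starts(pattern, subject):
    """Return the start offset of every match of pattern in subject."""
    return [m.start() for m in re.finditer(pattern, subject)]

if __name__ == "__main__":
    print(starts(WHOLE_WORD_CAT, "cat concat catalog cat."))
    print(starts(WORD_START, "a bc"))
    print(starts(WORD_END, "a bc"))
```

For a fixed word such as "cat" a plain \b on each side would do. The
extra lookarounds matter when the neighbouring part of the pattern may
begin or end with a non-word character.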

In LDMud, backtracks are limited by the EVAL_COST runtime
limit, to avoid freezing the driver with a match
like regexp(({"=XX==================="}), "X(.+)+X").

LDMud doesn't support PCRE callouts.

LIMITATIONS
There are some size limitations in PCRE but it is hoped that
they will never in practice be relevant. The maximum length
of a compiled pattern is 65539 (sic) bytes. All values in
repeating quantifiers must be less than 65536. The
maximum number of capturing subpatterns is 65535. There is no
limit to the number of non-capturing subpatterns, but the
maximum depth of nesting of all kinds of parenthesized
subpattern, including capturing subpatterns, assertions,
and other types of subpattern, is 200.

The maximum length of a subject string is the largest
positive number that an integer variable can hold. However,
PCRE uses recursion to handle subpatterns and indefinite
repetition. This means that the available stack space may
limit the size of a subject string that can be processed by
certain patterns.


AUTHOR

Philip Hazel <ph10@cam.ac.uk>
University Computing Service,
New Museums Site,
Cambridge CB2 3QG, England.


Phone: +44 1223 334714

1379

1380