Blame - doc/concepts/pcre - mudlib-public

MG Mud User

88f1247

2016-06-24 23:31:02 +0200


SYNOPSIS

PCRE - Perl-compatible regular expressions

DESCRIPTION

This document describes the regular expressions supported by the
PCRE package. When the package is compiled into the driver, the
macro __PCRE__ is defined.

Most of this manpage is lifted directly from the original PCRE
manpage (dated January 2003).

The PCRE library is a set of functions that implement regular
expression pattern matching using the same syntax and semantics
as Perl 5, with just a few differences (see below). The
current implementation corresponds to Perl 5.005, with some
additional features from later versions. This includes some
experimental, incomplete support for UTF-8 encoded strings.
Details of exactly what is and what is not supported are given
below.

PCRE REGULAR EXPRESSION DETAILS

The syntax and semantics of the regular expressions supported by PCRE
are described below. Regular expressions are also described in the Perl
documentation and in a number of other books, some of which have
copious examples. Jeffrey Friedl's "Mastering Regular Expressions",
published by O'Reilly, covers them in great detail. The description
here is intended as reference documentation.

The basic operation of PCRE is on strings of bytes. However, there is
also support for UTF-8 character strings. To use this support you must
build PCRE to include UTF-8 support, and then call pcre_compile() with
the PCRE_UTF8 option. How this affects the pattern matching is
mentioned in several places below. There is also a summary of UTF-8
features in the section on UTF-8 support in the main pcre page.

A regular expression is a pattern that is matched against a subject
string from left to right. Most characters stand for themselves in a
pattern, and match the corresponding characters in the subject. As a
trivial example, the pattern

  The quick brown fox

matches a portion of a subject string that is identical to itself. The
power of regular expressions comes from the ability to include
alternatives and repetitions in the pattern. These are encoded in the
pattern by the use of meta-characters, which do not stand for
themselves but instead are interpreted in some special way.

There are two different sets of meta-characters: those that are
recognized anywhere in the pattern except within square brackets, and
those that are recognized in square brackets. Outside square brackets,
the meta-characters are as follows:

  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also "possessive quantifier"
  {      start min/max quantifier

Part of a pattern that is in square brackets is called a "character
class". In a character class the only meta-characters are:

  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (only if followed by POSIX
           syntax)
  ]      terminates the character class

The following sections describe the use of each of the meta-characters.

BACKSLASH

The backslash character has several uses. Firstly, if it is followed by
a non-alphanumeric character, it takes away any special meaning that
character may have. This use of backslash as an escape character
applies both inside and outside character classes.

For example, if you want to match a * character, you write \* in the
pattern. This escaping action applies whether or not the following
character would otherwise be interpreted as a meta-character, so it is
always safe to precede a non-alphanumeric character with backslash to
specify that it stands for itself. In particular, if you want to match
a backslash, you write \\.
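The escaping rule can be tried out with a minimal sketch. It uses
Python's re module as a stand-in (an assumption: re is not the PCRE
library, but it treats an escaped non-alphanumeric character the same
way). The variable names are illustrative only.

```python
import re

# An escaped metacharacter stands for itself.
star = re.search(r"\*", "2*3")

# A literal backslash is written as two backslashes in the pattern.
# The subject here is the three characters: a \ b
backslash = re.search(r"\\", "a\\b")

# re.escape() adds the backslashes for a whole literal string.
literal = re.fullmatch(re.escape("1+1=2?"), "1+1=2?")
```

Here star and backslash each match a single character, and literal
matches the whole subject even though it contains + and ?.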

If a pattern is compiled with the PCRE_EXTENDED option, whitespace in
the pattern (other than in a character class) and characters between a
# outside a character class and the next newline character are ignored.
An escaping backslash can be used to include a whitespace or #
character as part of the pattern.
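As a sketch of what extended mode allows, the following uses Python's
re.VERBOSE flag, which is assumed here to play the role of
PCRE_EXTENDED (the two libraries are distinct, but both ignore
unescaped whitespace and # comments outside character classes). The
pattern itself is a made-up example.

```python
import re

pattern = re.compile(r"""
    \d{4}    # four digits
    -        # a literal hyphen
    \d{2}    # two digits
    \ UTC    # escaped space: a literal space, then "UTC"
""", re.VERBOSE)

with_space = pattern.fullmatch("2016-06 UTC")
without_space = pattern.fullmatch("2016-06UTC")
```

Only the subject containing the space matches, because the escaped
space is part of the pattern while the layout whitespace is not.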

If you want to remove the special meaning from a sequence of
characters, you can do so by putting them between \Q and \E. This is
different from Perl in that $ and @ are handled as literals in \Q...\E
sequences in PCRE, whereas in Perl, $ and @ cause variable
interpolation. Note the following examples:

  Pattern            PCRE matches   Perl matches

  \Qabc$xyz\E        abc$xyz        abc followed by the
                                      contents of $xyz
  \Qabc\$xyz\E       abc\$xyz       abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz

The \Q...\E sequence is recognized both inside and outside character
classes.

A second use of backslash provides a way of encoding non-printing
characters in patterns in a visible manner. There is no restriction on
the appearance of non-printing characters, apart from the binary zero
that terminates a pattern, but when a pattern is being prepared by text
editing, it is usually easier to use one of the following escape
sequences than the binary character it represents:

  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any character
  \e         escape (hex 1B)
  \f         formfeed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \ddd       character with octal code ddd, or backreference
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh... (UTF-8 mode only)

The precise effect of \cx is as follows: if x is a lower case letter,
it is converted to upper case. Then bit 6 of the character (hex 40) is
inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
becomes hex 7B.
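The \cx rule is plain bit arithmetic and can be checked with a short
sketch. The helper name control_char is hypothetical; it is not part of
PCRE or of the driver.

```python
def control_char(x: str) -> int:
    """Return the character code denoted by backslash-c followed by x."""
    code = ord(x)
    if "a" <= x <= "z":
        # Lower case letters are converted to upper case first.
        code = ord(x.upper())
    # Invert bit 6 (hex 40).
    return code ^ 0x40
```

With this helper, control_char("z") gives hex 1A, control_char("{")
gives hex 3B and control_char(";") gives hex 7B, in line with the
values quoted above.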

After \x, from zero to two hexadecimal digits are read (letters can be
in upper or lower case). In UTF-8 mode, any number of hexadecimal
digits may appear between \x{ and }, but the value of the character
code must be less than 2**31 (that is, the maximum hexadecimal value is
7FFFFFFF). If characters other than hexadecimal digits appear between
\x{ and }, or if there is no terminating }, this form of escape is not
recognized. Instead, the initial \x will be interpreted as a basic
hexadecimal escape, with no following digits, giving a byte whose value
is zero.

Characters whose value is less than 256 can be defined by either of the
two syntaxes for \x when PCRE is in UTF-8 mode. There is no difference
in the way they are handled. For example, \xdc is exactly the same as
\x{dc}.

After \0 up to two further octal digits are read. In both cases, if
there are fewer than two digits, just those that are present are used.
Thus the sequence \0\x\07 specifies two binary zeros followed by a BEL
character (code value 7). Make sure you supply two digits after the
initial zero if the character that follows is itself an octal digit.

The handling of a backslash followed by a digit other than 0 is
complicated. Outside a character class, PCRE reads it and any following
digits as a decimal number. If the number is less than 10, or if there
have been at least that many previous capturing left parentheses in the
expression, the entire sequence is taken as a back reference. A
description of how this works is given later, following the discussion
of parenthesized subpatterns.

Inside a character class, or if the decimal number is greater than 9
and there have not been that many capturing subpatterns, PCRE re-reads
up to three octal digits following the backslash, and generates a
single byte from the least significant 8 bits of the value. Any
subsequent digits stand for themselves. For example:

  \040   is another way of writing a space
  \40    is the same, provided there are fewer than 40
           previous capturing subpatterns
  \7     is always a back reference
  \11    might be a back reference, or another way of
           writing a tab
  \011   is always a tab
  \0113  is a tab followed by the character "3"
  \113   might be a back reference, otherwise the
           character with octal code 113
  \377   might be a back reference, otherwise
           the byte consisting entirely of 1 bits
  \81    is either a back reference, or a binary zero
           followed by the two characters "8" and "1"

Note that octal values of 100 or greater must not be introduced by a
leading zero, because no more than three octal digits are ever read.

All the sequences that define a single byte value or a single UTF-8
character (in UTF-8 mode) can be used both inside and outside character
classes. In addition, inside a character class, the sequence \b is
interpreted as the backspace character (hex 08). Outside a character
class it has a different meaning (see below).

The third use of backslash is for specifying generic character types:

  \d     any decimal digit
  \D     any character that is not a decimal digit
  \s     any whitespace character
  \S     any character that is not a whitespace character
  \w     any "word" character
  \W     any "non-word" character

Each pair of escape sequences partitions the complete set of characters
into two disjoint sets. Any given character matches one, and only one,
of each pair.
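The partition property can be verified mechanically. The sketch below
uses Python's re module with the re.ASCII flag as a stand-in for PCRE
(an assumption; the exact membership of \s differs between the two,
since Python's \s also includes VT, but the partition property holds in
both). The helper name matches is illustrative.

```python
import re

def matches(pattern: str, ch: str) -> bool:
    # True if the single character ch is matched by the pattern.
    return re.fullmatch(pattern, ch, re.ASCII) is not None

pairs = [(r"\d", r"\D"), (r"\s", r"\S"), (r"\w", r"\W")]

# For every character code 0..255, exactly one member of each pair matches.
partitioned = all(
    matches(lower, chr(c)) != matches(upper, chr(c))
    for lower, upper in pairs
    for c in range(256)
)
```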

In UTF-8 mode, characters with values greater than 255 never match \d,
\s, or \w, and always match \D, \S, and \W.

For compatibility with Perl, \s does not match the VT character (code
11). This makes it different from the POSIX "space" class. The \s
characters are HT (9), LF (10), FF (12), CR (13), and space (32).

A "word" character is any letter or digit or the underscore character,
that is, any character which can be part of a Perl "word". The
definition of letters and digits is controlled by PCRE's character
tables, and may vary if locale-specific matching is taking place (see
"Locale support" in the pcreapi page). For example, in the "fr"
(French) locale, some character codes greater than 128 are used for
accented letters, and these are matched by \w.

These character type sequences can appear both inside and outside
character classes. They each match one character of the appropriate
type. If the current matching point is at the end of the subject
string, all of them fail, since there is no character to match.

The fourth use of backslash is for certain simple assertions. An
assertion specifies a condition that has to be met at a particular
point in a match, without consuming any characters from the subject
string. The use of subpatterns for more complicated assertions is
described below. The backslashed assertions are

  \b     matches at a word boundary
  \B     matches when not at a word boundary
  \A     matches at start of subject
  \Z     matches at end of subject or before newline at end
  \z     matches at end of subject
  \G     matches at first matching position in subject

These assertions may not appear in character classes (but note that \b
has a different meaning, namely the backspace character, inside a
character class).

A word boundary is a position in the subject string where the current
character and the previous character do not both match \w or \W (i.e.
one matches \w and the other matches \W), or the start or end of the
string if the first or last character matches \w, respectively.
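A short sketch of the word-boundary definition, using Python's re
module as a stand-in (an assumption; its \b and \B follow the same
definition for the ASCII subjects used here). The subject strings are
made up for illustration.

```python
import re

# \b requires a \w / \W transition (or the subject edge next to a \w).
whole_words = re.findall(r"\bcat\b", "cat concat cat's category")

# \B requires the absence of such a transition on both sides.
embedded = re.findall(r"\Bcat\B", "cat educated concat")
```

whole_words contains two matches (the first word and the "cat" in
"cat's"), while embedded contains only the "cat" inside "educated".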

The \A, \Z, and \z assertions differ from the traditional circumflex
and dollar (described below) in that they only ever match at the very
start and end of the subject string, whatever options are set. Thus,
they are independent of multiline mode.
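The independence from multiline mode can be seen in a minimal sketch
with Python's re module, used here as a stand-in for PCRE. Only \A is
shown, because Python's own \Z behaves like PCRE's \z rather than
PCRE's \Z.

```python
import re

subject = "one\ntwo"

# In multiline mode, circumflex matches at the start of every line ...
line_starts = re.findall(r"^\w+", subject, re.MULTILINE)

# ... whereas \A still matches only at the very start of the subject.
subject_start = re.findall(r"\A\w+", subject, re.MULTILINE)
```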

The \A, \Z, and \z assertions are not affected by the PCRE_NOTBOL or
PCRE_NOTEOL options. If the startoffset argument of pcre_exec() is
non-zero, indicating that matching is to start at a point other than
the beginning of the subject, \A can never match. The difference
between \Z and \z is that \Z matches before a newline that is the last
character of the string as well as at the end of the string, whereas \z
matches only at the end.

The \G assertion is true only when the current matching position is at
the start point of the match, as specified by the startoffset argument
of pcre_exec(). It differs from \A when the value of startoffset is
non-zero. By calling pcre_exec() multiple times with appropriate
arguments, you can mimic Perl's /g option, and it is in this kind of
implementation where \G can be useful.

Note, however, that PCRE's interpretation of \G, as the start of the
current match, is subtly different from Perl's, which defines it as the
end of the previous match. In Perl, these can be different when the
previously matched string was empty. Because PCRE does just one match
at a time, it cannot reproduce this behaviour.

If all the alternatives of a pattern begin with \G, the expression is
anchored to the starting match position, and the "anchored" flag is set
in the compiled regular expression.

CIRCUMFLEX AND DOLLAR

Outside a character class, in the default matching mode, the circumflex
character is an assertion which is true only if the current matching
point is at the start of the subject string. If the startoffset
argument of pcre_exec() is non-zero, circumflex can never match if the
PCRE_MULTILINE option is unset. Inside a character class, circumflex
has an entirely different meaning (see below).

Circumflex need not be the first character of the pattern if a number
of alternatives are involved, but it should be the first thing in each
alternative in which it appears if the pattern is ever to match that
branch. If all possible alternatives start with a circumflex, that is,
if the pattern is constrained to match only at the start of the
subject, it is said to be an "anchored" pattern. (There are also other
constructs that can cause a pattern to be anchored.)

A dollar character is an assertion which is true only if the current
matching point is at the end of the subject string, or immediately
before a newline character that is the last character in the string (by
default). Dollar need not be the last character of the pattern if a
number of alternatives are involved, but it should be the last item in
any branch in which it appears. Dollar has no special meaning in a
character class.

The meaning of dollar can be changed so that it matches only at the
very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
compile time. This does not affect the \Z assertion.

The meanings of the circumflex and dollar characters are changed if the
PCRE_MULTILINE option is set. When this is the case, they match
immediately after and immediately before an internal newline character,
respectively, in addition to matching at the start and end of the
subject string. For example, the pattern /^abc$/ matches the subject
string "def\nabc" in multiline mode, but not otherwise. Consequently,
patterns that are anchored in single line mode because all branches
start with ^ are not anchored in multiline mode, and a match for
circumflex is possible when the startoffset argument of pcre_exec() is
non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE
is set.
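The "def\nabc" example can be reproduced with a small sketch. It uses
Python's re module, where re.MULTILINE is assumed to correspond to
PCRE_MULTILINE; the default behaviour of ^ and $ shown here is the same
in both libraries.

```python
import re

subject = "def\nabc"

# Default mode: ^ only matches at the start of the subject.
single_line = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches just after the internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# Default mode: $ also matches just before a final newline.
before_final_newline = re.search(r"abc$", "abc\n")
```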

Note that the sequences \A, \Z, and \z can be used to match the start
and end of the subject in both modes, and if all branches of a pattern
start with \A it is always anchored, whether PCRE_MULTILINE is set or
not.

FULL STOP (PERIOD, DOT)

Outside a character class, a dot in the pattern matches any one
character in the subject, including a non-printing character, but not
(by default) newline. In UTF-8 mode, a dot matches any UTF-8 character,
which might be more than one byte long, except (by default) for
newline. If the PCRE_DOTALL option is set, dots match newlines as well.
The handling of dot is entirely independent of the handling of
circumflex and dollar, the only relationship being that they both
involve newline characters. Dot has no special meaning in a character
class.
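A minimal sketch of the dot and the effect of the dot-all option,
using Python's re module as a stand-in (re.DOTALL is assumed to
correspond to PCRE_DOTALL):

```python
import re

# By default the dot does not match a newline.
default_dot = re.search(r"a.b", "a\nb")

# With the dot-all option it does.
dotall_dot = re.search(r"a.b", "a\nb", re.DOTALL)

# Inside a character class the dot is an ordinary character.
literal_dot = re.findall(r"[.]", "a.b")
```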


MATCHING A SINGLE BYTE

Outside a character class, the escape sequence \C matches any one byte,
both in and out of UTF-8 mode. Unlike a dot, it always matches a
newline. The feature is provided in Perl in order to match individual
bytes in UTF-8 mode. Because it breaks up UTF-8 characters into
individual bytes, what remains in the string may be a malformed UTF-8
string. For this reason it is best avoided.

PCRE does not allow \C to appear in lookbehind assertions (see below),
because in UTF-8 mode it makes it impossible to calculate the length of
the lookbehind.

SQUARE BRACKETS

An opening square bracket introduces a character class, terminated by a
closing square bracket. A closing square bracket on its own is not
special. If a closing square bracket is required as a member of the
class, it should be the first data character in the class (after an
initial circumflex, if present) or escaped with a backslash.

A character class matches a single character in the subject. In UTF-8
mode, the character may occupy more than one byte. A matched character
must be in the set of characters defined by the class, unless the first
character in the class definition is a circumflex, in which case the
subject character must not be in the set defined by the class. If a
circumflex is actually required as a member of the class, ensure it is
not the first character, or escape it with a backslash.

For example, the character class [aeiou] matches any lower case vowel,
while [^aeiou] matches any character that is not a lower case vowel.
Note that a circumflex is just a convenient notation for specifying the
characters which are in the class by enumerating those that are not. It
is not an assertion: it still consumes a character from the subject
string, and fails if the current pointer is at the end of the string.
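The [aeiou] example, sketched with Python's re module as a stand-in
for PCRE (basic character classes behave the same way in both):

```python
import re

vowels = re.findall(r"[aeiou]", "regular")
non_vowels = re.findall(r"[^aeiou]", "regular")

# A negated class is not an assertion: it needs a character to consume,
# so it fails at the end of the subject.
at_end = re.search(r"x[^aeiou]", "x")
```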

In UTF-8 mode, characters with values greater than 255 can be included
in a class as a literal string of bytes, or by using the \x{ escaping
mechanism.

When caseless matching is set, any letters in a class represent both
their upper case and lower case versions, so for example, a caseless
[aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would. PCRE does not support the
concept of case for characters with values greater than 255.

The newline character is never treated in any special way in character
classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE
options is. A class such as [^a] will always match a newline.

The minus (hyphen) character can be used to specify a range of
characters in a character class. For example, [d-m] matches any letter
between d and m, inclusive. If a minus character is required in a
class, it must be escaped with a backslash or appear in a position
where it cannot be interpreted as indicating a range, typically as the
first or last character in the class.

It is not possible to have the literal character "]" as the end
character of a range. A pattern such as [W-]46] is interpreted as a
class of two characters ("W" and "-") followed by a literal string
"46]", so it would match "W46]" or "-46]". However, if the "]" is
escaped with a backslash it is interpreted as the end of range, so
[W-\]46] is interpreted as a single class containing a range followed
by two separate characters. The octal or hexadecimal representation of
"]" can also be used to end a range.
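The two readings of the bracket can be compared in a short sketch. It
uses Python's re module as a stand-in, which is assumed to parse these
two particular classes the same way PCRE does; the test subjects are
made up.

```python
import re

# [W-] is a class of two characters; "46]" is literal text after it.
literal_tail = [s for s in ("W46]", "-46]", "X46]")
                if re.fullmatch(r"[W-]46]", s)]

# With the bracket escaped, W-\] is a range (codes hex 57 to hex 5D),
# and "4" and "6" are two further members of the same class.
in_class = [c for c in "WZ]46^V"
            if re.fullmatch(r"[W-\]46]", c)]
```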

Ranges operate in the collating sequence of character values. They can
also be used for characters specified numerically, for example
[\000-\037]. In UTF-8 mode, ranges can include characters whose values
are greater than 255, for example [\x{100}-\x{2ff}].

If a range that includes letters is used when caseless matching is set,
it matches the letters in either case. For example, [W-c] is equivalent
to [][\^_`wxyzabc], matched caselessly, and if character tables for the
"fr" locale are in use, [\xc8-\xcb] matches accented E characters in
both cases.

The character types \d, \D, \s, \S, \w, and \W may also appear in a
character class, and add the characters that they match to the class.
For example, [\dABCDEF] matches any hexadecimal digit. A circumflex can
conveniently be used with the upper case character types to specify a
more restricted set of characters than the matching lower case type.
For example, the class [^\W_] matches any letter or digit, but not
underscore.
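Both classes can be exercised in a minimal sketch, again with Python's
re module standing in for PCRE (re.ASCII is used so that \W has its
plain ASCII meaning; the subject strings are illustrative):

```python
import re

# \d adds the decimal digits to the explicitly listed letters.
hex_digits = re.findall(r"[\dABCDEF]", "x1F-g9a")

# [^\W_] is "word characters minus underscore": letters and digits only.
letters_and_digits = re.findall(r"[^\W_]", "a_1 -Z", re.ASCII)
```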

All non-alphanumeric characters other than \, -, ^ (at the start) and
the terminating ] are non-special in character classes, but it does no
harm if they are escaped.

POSIX CHARACTER CLASSES

Perl supports the POSIX notation for character classes, which uses
names enclosed by [: and :] within the enclosing square brackets. PCRE
also supports this notation. For example,

  [01[:alpha:]%]

matches "0", "1", any alphabetic character, or "%". The supported class
names are

  alnum    letters and digits
  alpha    letters
  ascii    character codes 0 - 127
  blank    space or tab only
  cntrl    control characters
  digit    decimal digits (same as \d)
  graph    printing characters, excluding space
  lower    lower case letters
  print    printing characters, including space
  punct    printing characters, excluding letters and digits
  space    white space (not quite the same as \s)
  upper    upper case letters
  word     "word" characters (same as \w)
  xdigit   hexadecimal digits

The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
and space (32). Notice that this list includes the VT character (code
11). This makes "space" different to \s, which does not include VT (for
Perl compatibility).

The name "word" is a Perl extension, and "blank" is a GNU extension
from Perl 5.8. Another Perl extension is negation, which is indicated
by a ^ character after the colon. For example,

  [12[:^digit:]]

matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
these are not supported, and an error is given if they are encountered.

In UTF-8 mode, characters with values greater than 255 do not match any
of the POSIX character classes.

VERTICAL BAR

Vertical bar characters are used to separate alternative patterns. For
example, the pattern

  gilbert|sullivan

matches either "gilbert" or "sullivan". Any number of alternatives may
appear, and an empty alternative is permitted (matching the empty
string). The matching process tries each alternative in turn, from
left to right, and the first one that succeeds is used. If the
alternatives are within a subpattern (defined below), "succeeds" means
matching the rest of the main pattern as well as the alternative in the
subpattern.
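The left-to-right rule, and what "succeeds" means inside a subpattern,
can be sketched as follows. Python's re module is used as a stand-in
for PCRE (an assumption; both try alternatives in order and backtrack
in the same way for these patterns).

```python
import re

names = re.findall(r"gilbert|sullivan", "gilbert and sullivan")

# At top level the first alternative that matches is used,
# even though the second one would match more.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "a" alone does not let the rest of the pattern
# ("c") match, so the second alternative is the one that succeeds.
needs_rest = re.match(r"(a|ab)c", "abc").group(1)
```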

INTERNAL OPTION SETTING

The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
PCRE_EXTENDED options can be changed from within the pattern by a
sequence of Perl option letters enclosed between "(?" and ")". The
option letters are

  i  for PCRE_CASELESS
  m  for PCRE_MULTILINE
  s  for PCRE_DOTALL
  x  for PCRE_EXTENDED

For example, (?im) sets caseless, multiline matching. It is also
possible to unset these options by preceding the letter with a hyphen,
and a combined setting and unsetting such as (?im-sx), which sets
PCRE_CASELESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and
PCRE_EXTENDED, is also permitted. If a letter appears both before and
after the hyphen, the option is unset.
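A minimal sketch of (?im) at the start of a pattern, using Python's re
module as a stand-in. This is an assumption with limits: Python shares
the option letters i, m, s and x, but recent Python versions accept
such a global setting only at the very start of the pattern and do not
support the (?im-sx) form shown above.

```python
import re

# No flags are passed to the call; both options come from the pattern.
inline = re.search(r"(?im)^abc$", "def\nABC")

# Without the inline options the same subject does not match.
plain = re.search(r"^abc$", "def\nABC")
```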

When an option change occurs at top level (that is, not inside
subpattern parentheses), the change applies to the remainder of the
pattern that follows. If the change is placed right at the start of a
pattern, PCRE extracts it into the global options (and it will
therefore show up in data extracted by the pcre_fullinfo() function).

An option change within a subpattern affects only that part of the
current pattern that follows it, so

  (a(?i)b)c

matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
used). By this means, options can be made to have different settings
in different parts of the pattern. Any changes made in one alternative
do carry on into subsequent branches within the same subpattern. For
example,

  (a(?i)b|c)

matches "ab", "aB", "c", and "C", even though when matching "C" the
first branch is abandoned before the option setting. This is because
the effects of option settings happen at compile time. There would be
some very weird behaviour otherwise.

The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed
in the same way as the Perl-compatible options by using the characters
U and X respectively. The (?X) flag setting is special in that it must
always occur earlier in the pattern than any of the additional features
it turns on, even when it is at top level. It is best put at the start.

SUBPATTERNS

Subpatterns are delimited by parentheses (round brackets), which can be
nested. Marking part of a pattern as a subpattern does two things:

1. It localizes a set of alternatives. For example, the pattern

  cat(aract|erpillar|)

matches one of the words "cat", "cataract", or "caterpillar". Without
the parentheses, it would match "cataract", "erpillar" or the empty
string.

2. It sets up the subpattern as a capturing subpattern (as defined
above). When the whole pattern matches, that portion of the subject
string that matched the subpattern is passed back to the caller via the
ovector argument of pcre_exec(). Opening parentheses are counted from
left to right (starting from 1) to obtain the numbers of the capturing
subpatterns.

For example, if the string "the red king" is matched against the
pattern

  the ((red|white) (king|queen))

the captured substrings are "red king", "red", and "king", and are
numbered 1, 2, and 3, respectively.
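The numbering can be checked with a minimal sketch. Python's re module
is used as a stand-in for PCRE (it numbers opening parentheses from
left to right in the same way); m.groups() plays the role of the
substrings returned through ovector.

```python
import re

m = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")

# Groups 1, 2 and 3, in order of their opening parentheses.
captured = m.groups()

# Both functions of parentheses at once: grouping alternatives ...
words = [w for w in ("cat", "cataract", "caterpillar", "erpillar")
         if re.fullmatch(r"cat(aract|erpillar|)", w)]
```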

The fact that plain parentheses fulfil two functions is not always
helpful. There are often times when a grouping subpattern is required
without a capturing requirement. If an opening parenthesis is followed
by a question mark and a colon, the subpattern does not do any
capturing, and is not counted when computing the number of any
subsequent capturing subpatterns. For example, if the string "the white
queen" is matched against the pattern

  the ((?:red|white) (king|queen))

the captured substrings are "white queen" and "queen", and are numbered
1 and 2. The maximum number of capturing subpatterns is 65535, and the
maximum depth of nesting of all subpatterns, both capturing and
non-capturing, is 200.
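The effect of (?: on the numbering, as a minimal sketch with Python's
re module standing in for PCRE (the (?:...) syntax is shared):

```python
import re

m = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")

# The (?:red|white) group is skipped when numbering, so only two
# substrings are captured.
captured = m.groups()
```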

As a convenient shorthand, if any option settings are required at the
start of a non-capturing subpattern, the option letters may appear
between the "?" and the ":". Thus the two patterns

  (?i:saturday|sunday)
  (?:(?i)saturday|sunday)

match exactly the same set of strings. Because alternative branches are
tried from left to right, and options are not reset until the end of
the subpattern is reached, an option setting in one branch does affect
subsequent branches, so the above patterns match "SUNDAY" as well as
"Saturday".

NAMED SUBPATTERNS

Identifying capturing parentheses by number is simple, but it can be
very hard to keep track of the numbers in complicated regular
expressions. Furthermore, if an expression is modified, the numbers may
change. To help with the difficulty, PCRE supports the naming of
subpatterns, something that Perl does not provide. The Python syntax
(?P<name>...) is used. Names consist of alphanumeric characters and
underscores, and must be unique within a pattern.

Named capturing parentheses are still allocated numbers as well as
names. The PCRE API provides function calls for extracting the
name-to-number translation table from a compiled pattern. For further
details see the pcreapi documentation.
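Since the (?P<name>...) syntax is borrowed from Python, it can be
sketched directly with Python's re module. The group names year and
month are made up for the example, and groupindex is Python's
counterpart of the name-to-number table, not the PCRE API call.

```python
import re

pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = pattern.fullmatch("2016-06")

by_name = m.group("year")
by_number = m.group(1)          # named groups keep their numbers

# Name-to-number translation table of the compiled pattern.
name_table = dict(pattern.groupindex)
```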

REPETITION

Repetition is specified by quantifiers, which can follow any of the
following items:

  a literal data character
  the . metacharacter
  the \C escape sequence
  escapes such as \d that match single characters
  a character class
  a back reference (see next section)
  a parenthesized subpattern (unless it is an assertion)

The general repetition quantifier specifies a minimum and maximum
number of permitted matches, by giving the two numbers in curly
brackets (braces), separated by a comma. The numbers must be less than
65536, and the first must be less than or equal to the second. For
example:

  z{2,4}

matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
special character. If the second number is omitted, but the comma is
present, there is no upper limit; if the second number and the comma
are both omitted, the quantifier specifies an exact number of required
matches. Thus

  [aeiou]{3,}

matches at least 3 successive vowels, but may match many more, while

  \d{8}

matches exactly 8 digits. An opening curly bracket that appears in a
position where a quantifier is not allowed, or one that does not match
the syntax of a quantifier, is taken as a literal character. For
example, {,6} is not a quantifier, but a literal string of four
characters.
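The three quantifier forms can be tried in a minimal sketch with
Python's re module as a stand-in for PCRE. One difference is worth
keeping in mind: Python treats {,6} as a quantifier with a minimum of
zero, so that case is deliberately left out. The subject strings are
illustrative.

```python
import re

# {2,4}: between two and four repetitions.
z_runs = [s for s in ("z", "zz", "zzzz", "zzzzz")
          if re.fullmatch(r"z{2,4}", s)]

# {3,}: at least three, with no upper limit.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group()

# {8}: exactly eight.
eight_digits = re.fullmatch(r"\d{8}", "20160624") is not None
seven_digits = re.fullmatch(r"\d{8}", "2016062") is not None
```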

659

660

In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
individual bytes. Thus, for example, \x{100}{2} matches two UTF-8
characters, each of which is represented by a two-byte sequence.

The quantifier {0} is permitted, causing the expression to behave as if
the previous item and the quantifier were not present.

For convenience (and historical compatibility) the three most common
quantifiers have single-character abbreviations:

  *    is equivalent to {0,}
  +    is equivalent to {1,}
  ?    is equivalent to {0,1}

It is possible to construct infinite loops by following a subpattern
that can match no characters with a quantifier that has no upper limit,
for example:

  (a?)*

Earlier versions of Perl and PCRE used to give an error at compile time
for such patterns. However, because there are cases where this can be
useful, such patterns are now accepted, but if any repetition of the
subpattern does in fact match no characters, the loop is forcibly
broken.

By default, the quantifiers are "greedy", that is, they match as much
as possible (up to the maximum number of permitted times), without
causing the rest of the pattern to fail. The classic example of where
this gives problems is in trying to match comments in C programs. These
appear between the sequences /* and */ and within the sequence,
individual * and / characters may appear. An attempt to match C
comments by applying the pattern

  /\*.*\*/

to the string

  /* first comment */ not comment /* second comment */

fails, because it matches the entire string owing to the greediness of
the .* item.

However, if a quantifier is followed by a question mark, it ceases to
be greedy, and instead matches the minimum number of times possible, so
the pattern

  /\*.*?\*/

does the right thing with the C comments. The meaning of the various
quantifiers is not otherwise changed, just the preferred number of
matches. Do not confuse this use of question mark with its use as a
quantifier in its own right. Because it has two uses, it can sometimes
appear doubled, as in

  \d??\d

which matches one digit by preference, but can match two if that is the
only way the rest of the pattern matches.
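
The difference between greedy and minimizing repeats is easy to see in
a small experiment. The sketch below again assumes Python's re module
in place of PCRE; the variable names are purely illustrative.

```python
import re

source = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last "*/" in the subject, swallowing everything.
greedy = re.findall(r"/\*.*\*/", source)

# Minimizing .*? stops at the first "*/" it can reach.
lazy = re.findall(r"/\*.*?\*/", source)

# \d??\d prefers one digit, but takes two when the whole subject must match.
prefers_one = re.match(r"\d??\d", "12").group()
forced_two = re.fullmatch(r"\d??\d", "12").group()

print(greedy, lazy, prefers_one, forced_two)
```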

If the PCRE_UNGREEDY option is set (an option which is not available in
Perl), the quantifiers are not greedy by default, but individual ones
can be made greedy by following them with a question mark. In other
words, it inverts the default behaviour.

When a parenthesized subpattern is quantified with a minimum repeat
count that is greater than 1 or with a limited maximum, more store is
required for the compiled pattern, in proportion to the size of the
minimum or maximum.

If a pattern starts with .* or .{0,} and the PCRE_DOTALL option
(equivalent to Perl's /s) is set, thus allowing the . to match
newlines, the pattern is implicitly anchored, because whatever follows
will be tried against every character position in the subject string,
so there is no point in retrying the overall match at any position
after the first. PCRE normally treats such a pattern as though it were
preceded by \A.

In cases where it is known that the subject string contains no
newlines, it is worth setting PCRE_DOTALL in order to obtain this
optimization, or alternatively using ^ to indicate anchoring
explicitly.

However, there is one situation where the optimization cannot be used.
When .* is inside capturing parentheses that are the subject of a
backreference elsewhere in the pattern, a match at the start may fail,
and a later one succeed. Consider, for example:

  (.*)abc\1

If the subject is "xyz123abc123" the match point is the fourth
character. For this reason, such a pattern is not implicitly anchored.
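
A quick check of this example, assuming Python's re module behaves like
PCRE here (it does for this pattern), shows the match starting at
offset 3, that is, at the fourth character:

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

# Offsets are zero-based, so 3 is the fourth character of the subject.
match_start = m.start()
captured = m.group(1)

print(match_start, captured, m.group())
```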

When a capturing subpattern is repeated, the value captured is the
substring that matched the final iteration. For example, after

  (tweedle[dume]{3}\s*)+

has matched "tweedledum tweedledee" the value of the captured substring
is "tweedledee". However, if there are nested capturing subpatterns,
the corresponding captured values may have been set in previous
iterations. For example, after

  /(a|(b))+/

matches "aba" the value of the second captured substring is "b".
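
Both capture rules can be observed directly. This sketch uses Python's
re module, which is an assumption of the example; it happens to follow
the same rule as PCRE for values left over from earlier iterations.

```python
import re

# The outer group is overwritten on every iteration; the last one wins.
m1 = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m1.group(1)

# Group 2 took part only in the second iteration ("b"); the third
# iteration matches "a" through the other branch and leaves it alone.
m2 = re.fullmatch(r"(a|(b))+", "aba")
outer, nested = m2.group(1), m2.group(2)

print(last_iteration, outer, nested)
```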

ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

With both maximizing and minimizing repetition, failure of what follows
normally causes the repeated item to be re-evaluated to see if a
different number of repeats allows the rest of the pattern to match.
Sometimes it is useful to prevent this, either to change the nature of
the match, or to cause it to fail earlier than it otherwise might, when
the author of the pattern knows there is no point in carrying on.

Consider, for example, the pattern \d+foo when applied to the subject
line

  123456bar

After matching all 6 digits and then failing to match "foo", the normal
action of the matcher is to try again with only 5 digits matching the
\d+ item, and then with 4, and so on, before ultimately failing.
"Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
the means for specifying that once a subpattern has matched, it is not
to be re-evaluated in this way.

If we use atomic grouping for the previous example, the matcher would
give up immediately on failing to match "foo" the first time. The
notation is a kind of special parenthesis, starting with (?> as in this
example:

  (?>\d+)foo

This kind of parenthesis "locks up" the part of the pattern it
contains once it has matched, and a failure further into the pattern is
prevented from backtracking into it. Backtracking past it to previous
items, however, works as normal.

An alternative description is that a subpattern of this type matches
the string of characters that an identical standalone pattern would
match, if anchored at the current point in the subject string.

Atomic grouping subpatterns are not capturing subpatterns. Simple cases
such as the above example can be thought of as a maximizing repeat that
must swallow everything it can. So, while both \d+ and \d+? are
prepared to adjust the number of digits they match in order to make the
rest of the pattern match, (?>\d+) can only match an entire sequence of
digits.
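
The "all or nothing" behaviour of an atomic group can be illustrated
without PCRE. The sketch below is an emulation, not the (?>...) syntax
itself: Python's re module accepts (?>...) only from version 3.11, so
the example swaps in the well-known equivalent of a capturing lookahead
followed by a back reference, (?=(\d+))\1, which behaves like (?>\d+)
because a lookahead is never re-entered once it has succeeded.

```python
import re

subject = "123456bar"

# Ordinary greedy repeat: \d+ hands back its last digit so that the
# trailing \d can match, and the overall match succeeds.
backtracking = re.search(r"\d+\d", subject)

# Emulated atomic group: the lookahead captures the longest digit run and
# is never re-entered, so \1 must consume the whole run and the trailing
# \d finds nothing left to match.
atomic_like = re.search(r"(?=(\d+))\1\d", subject)

# The emulation still matches when the rest of the pattern fits.
atomic_ok = re.search(r"(?=(\d+))\1bar", subject)

print(backtracking.group(), atomic_like, atomic_ok.group())
```

Unlike a real atomic group, the emulation uses up a capture number,
which is one reason the dedicated syntax is preferable where it is
available.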

Atomic groups in general can of course contain arbitrarily complicated
subpatterns, and can be nested. However, when the subpattern for an
atomic group is just a single repeated item, as in the example above, a
simpler notation, called a "possessive quantifier", can be used. This
consists of an additional + character following a quantifier. Using
this notation, the previous example can be rewritten as

  \d++foo

Possessive quantifiers are always greedy; the setting of the
PCRE_UNGREEDY option is ignored. They are a convenient notation for the
simpler forms of atomic group. However, there is no difference in the
meaning or processing of a possessive quantifier and the equivalent
atomic group.

The possessive quantifier syntax is an extension to the Perl syntax. It
originates in Sun's Java package.

When a pattern contains an unlimited repeat inside a subpattern that
can itself be repeated an unlimited number of times, the use of an
atomic group is the only way to avoid some failing matches taking a
very long time indeed. The pattern

  (\D+|<\d+>)*[!?]

matches an unlimited number of substrings that either consist of
non-digits, or digits enclosed in <>, followed by either ! or ?. When
it matches, it runs quickly. However, if it is applied to

  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

it takes a long time before reporting failure. This is because the
string can be divided between the two repeats in a large number of
ways, and all have to be tried. (The example used [!?] rather than a
single character at the end, because both PCRE and Perl have an
optimization that allows for fast failure when a single character is
used. They remember the last single character that is required for a
match, and fail early if it is not present in the string.) If the
pattern is changed to

  ((?>\D+)|<\d+>)*[!?]

sequences of non-digits cannot be broken, and failure happens quickly.

BACK REFERENCES

Outside a character class, a backslash followed by a digit greater than
0 (and possibly further digits) is a back reference to a capturing
subpattern earlier (that is, to its left) in the pattern, provided
there have been that many previous capturing left parentheses.

However, if the decimal number following the backslash is less than 10,
it is always taken as a back reference, and causes an error only if
there are not that many capturing left parentheses in the entire
pattern. In other words, the parentheses that are referenced need not
be to the left of the reference for numbers less than 10. See the
section entitled "Backslash" above for further details of the handling
of digits following a backslash.

A back reference matches whatever actually matched the capturing
subpattern in the current subject string, rather than anything matching
the subpattern itself (see "Subpatterns as subroutines" below for a way
of doing that). So the pattern

  (sens|respons)e and \1ibility

matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If caseful matching is in force at the
time of the back reference, the case of letters is relevant. For
example,

  ((?i)rah)\s+\1

matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
original capturing subpattern is matched caselessly.
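
The first example can be checked with a few lines of code. Python's re
module is assumed here as a substitute for PCRE; numbered back
references work the same way in both for this pattern.

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

# \1 must repeat the text that group 1 actually matched, so the mixed
# third subject is rejected.
results = [bool(pattern.fullmatch(s)) for s in subjects]

print(results)
```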

Back references to named subpatterns use the Python syntax (?P=name).
We could rewrite the above example as follows:

  (?P<p1>(?i)rah)\s+(?P=p1)
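
Because the named syntax was borrowed from Python, it can be tried
unchanged in Python's re module. The sketch below uses a different,
hypothetical pattern (a doubled-word finder) rather than the "rah"
example, since current Python versions reject an inline (?i) in the
middle of a pattern.

```python
import re

# (?P<word>...) names the group; (?P=word) refers back to its text.
doubled = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

m = doubled.search("it was the the best of times")

# A named group is still numbered, so both lookups return the same text.
by_name = m.group("word")
by_number = m.group(1)

print(m.group(), by_name, by_number)
```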

There may be more than one back reference to the same subpattern. If a
subpattern has not actually been used in a particular match, any back
references to it always fail. For example, the pattern

  (a|(bc))\2

always fails if it starts to match "a" rather than "bc". Because there
may be many capturing parentheses in a pattern, all digits following
the backslash are taken as part of a potential back reference number.
If the pattern continues with a digit character, some delimiter must be
used to terminate the back reference. If the PCRE_EXTENDED option is
set, this can be whitespace. Otherwise an empty comment can be used.

A back reference that occurs inside the parentheses to which it refers
fails when the subpattern is first used, so, for example, (a\1) never
matches. However, such references can be useful inside repeated
subpatterns. For example, the pattern

  (a|b\1)+

matches any number of "a"s and also "aba", "ababbaa" etc. At each
iteration of the subpattern, the back reference matches the character
string corresponding to the previous iteration. In order for this to
work, the pattern must be such that the first iteration does not need
to match the back reference. This can be done using alternation, as in
the example above, or by a quantifier with a minimum of zero.

ASSERTIONS

An assertion is a test on the characters following or preceding the
current matching point that does not actually consume any characters.
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
described above. More complicated assertions are coded as subpatterns.
There are two kinds: those that look ahead of the current position in
the subject string, and those that look behind it.

An assertion subpattern is matched in the normal way, except that it
does not cause the current matching position to be changed. Lookahead
assertions start with (?= for positive assertions and (?! for negative
assertions. For example,

  \w+(?=;)

matches a word followed by a semicolon, but does not include the
semicolon in the match, and

  foo(?!bar)

matches any occurrence of "foo" that is not followed by "bar". Note
that the apparently similar pattern

  (?!foo)bar

does not find an occurrence of "bar" that is preceded by something
other than "foo"; it finds any occurrence of "bar" whatsoever, because
the assertion (?!foo) is always true when the next three characters are
"bar". A lookbehind assertion is needed to achieve this effect.
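
The three lookahead patterns above can be exercised as follows. The
sketch assumes Python's re module, whose lookahead assertions behave
like PCRE's for these cases; the subject strings are made up for the
example.

```python
import re

# Positive lookahead: the semicolon is required but not consumed.
word = re.search(r"\w+(?=;)", "total = count;").group()

# Negative lookahead: only the "foo" that starts "foobaz" qualifies.
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar looks at the text *following* the current position, so it
# happily matches the "bar" inside "foobar".
misleading = re.search(r"(?!foo)bar", "foobar").start()

print(word, foo_starts, misleading)
```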

If you want to force a matching failure at some point in a pattern, the
most convenient way to do it is with (?!) because an empty string
always matches, so an assertion that requires there not to be an empty
string must always fail.

Lookbehind assertions start with (?<= for positive assertions and (?<!
for negative assertions. For example,

  (?<!foo)bar

does find an occurrence of "bar" that is not preceded by "foo". The
contents of a lookbehind assertion are restricted such that all the
strings it matches must have a fixed length. However, if there are
several alternatives, they do not all have to have the same fixed
length. Thus

  (?<=bullock|donkey)

is permitted, but

  (?<!dogs?|cats?)

causes an error at compile time. Branches that match different length
strings are permitted only at the top level of a lookbehind assertion.
This is an extension compared with Perl (at least for 5.8), which
requires all branches to match the same length of string. An assertion
such as

  (?<=ab(c|de))

is not permitted, because its single top-level branch can match two
different lengths, but it is acceptable if rewritten to use two
top-level branches:

  (?<=abc|abde)

The implementation of lookbehind assertions is, for each alternative,
to temporarily move the current position back by the fixed width and
then try to match. If there are insufficient characters before the
current position, the match is deemed to fail.
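
A negative lookbehind can be tried with the sketch below, which assumes
Python's re module in place of PCRE. Note that Python is stricter than
PCRE here and requires every alternative of a lookbehind to have the
same width, so only the single-branch example is shown.

```python
import re

not_after_foo = re.compile(r"(?<!foo)bar")

# "bar" at offset 3 follows "foo" and is skipped; the other two qualify.
starts = [m.start() for m in not_after_foo.finditer("foobar bazbar bar")]

# With fewer than three characters before it, the inner "foo" test
# cannot match, so the negative assertion succeeds at offset 0.
at_string_start = not_after_foo.search("bar").start()

print(starts, at_string_start)
```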

PCRE does not allow the \C escape (which matches a single byte in UTF-8
mode) to appear in lookbehind assertions, because it makes it
impossible to calculate the length of the lookbehind.

Atomic groups can be used in conjunction with lookbehind assertions to
specify efficient matching at the end of the subject string. Consider a
simple pattern such as

  abcd$

when applied to a long string that does not match. Because matching
proceeds from left to right, PCRE will look for each "a" in the subject
and then see if what follows matches the rest of the pattern. If the
pattern is specified as

  ^.*abcd$

the initial .* matches the entire string at first, but when this fails
(because there is no following "a"), it backtracks to match all but the
last character, then all but the last two characters, and so on. Once
again the search for "a" covers the entire string, from right to left,
so we are no better off. However, if the pattern is written as

  ^(?>.*)(?<=abcd)

or, equivalently,

  ^.*+(?<=abcd)

there can be no backtracking for the .* item; it can match only the
entire string. The subsequent lookbehind assertion does a single test
on the last four characters. If it fails, the match fails immediately.
For long strings, this approach makes a significant difference to the
processing time.

Several assertions (of any sort) may occur in succession. For example,

  (?<=\d{3})(?<!999)foo

matches "foo" preceded by three digits that are not "999". Notice that
each of the assertions is applied independently at the same point in
the subject string. First there is a check that the previous three
characters are all digits, and then there is a check that the same
three characters are not "999". This pattern does not match "foo"
preceded by six characters, the first three of which are digits and the
last three of which are not "999". For example, it doesn't match
"123abcfoo". A pattern to do that is

  (?<=\d{3}...)(?<!999)foo

This time the first assertion looks at the preceding six characters,
checking that the first three are digits, and then the second assertion
checks that the preceding three characters are not "999".
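
The two patterns can be compared side by side. This is a sketch using
Python's re module as an assumed stand-in for PCRE; both lookbehinds
have a fixed width, so Python accepts them as written.

```python
import re

adjacent = re.compile(r"(?<=\d{3})(?<!999)foo")
spaced = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ["123foo", "999foo", "123abcfoo", "123999foo"]

# Both assertions of a pattern are tested at the same point, just
# before "foo"; they differ only in how far back they look.
adjacent_hits = [bool(adjacent.search(s)) for s in subjects]
spaced_hits = [bool(spaced.search(s)) for s in subjects]

print(adjacent_hits, spaced_hits)
```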

Assertions can be nested in any combination. For example,

  (?<=(?<!foo)bar)baz

matches an occurrence of "baz" that is preceded by "bar" which in turn
is not preceded by "foo", while

  (?<=\d{3}(?!999)...)foo

is another pattern which matches "foo" preceded by three digits and any
three characters that are not "999".

Assertion subpatterns are not capturing subpatterns, and may not be
repeated, because it makes no sense to assert the same thing several
times. If any kind of assertion contains capturing subpatterns within
it, these are counted for the purposes of numbering the capturing
subpatterns in the whole pattern. However, substring capturing is
carried out only for positive assertions, because it does not make
sense for negative assertions.

CONDITIONAL SUBPATTERNS

It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative subpatterns,
depending on the result of an assertion, or whether a previous
capturing subpattern matched or not. The two possible forms of
conditional subpattern are

  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

If the condition is satisfied, the yes-pattern is used; otherwise the
no-pattern (if present) is used. If there are more than two
alternatives in the subpattern, a compile-time error occurs.

There are three kinds of condition. If the text between the parentheses
consists of a sequence of digits, the condition is satisfied if the
capturing subpattern of that number has previously matched. The number
must be greater than zero. Consider the following pattern, which
contains non-significant white space to make it more readable (assume
the PCRE_EXTENDED option) and to divide it into three parts for ease of
discussion:

  ( \( )?    [^()]+    (?(1) \) )

The first part matches an optional opening parenthesis, and if that
character is present, sets it as the first captured substring. The
second part matches one or more characters that are not parentheses.
The third part is a conditional subpattern that tests whether the first
set of parentheses matched or not. If they did, that is, if the subject
started with an opening parenthesis, the condition is true, and so the
yes-pattern is executed and a closing parenthesis is required.
Otherwise, since no-pattern is not present, the subpattern matches
nothing. In other words, this pattern matches a sequence of
non-parentheses, optionally enclosed in parentheses.
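
The numbered condition can be tried as follows. The sketch assumes
Python's re module, which supports the (?(1)...) form, with re.VERBOSE
playing the role of PCRE_EXTENDED; the assertion-based and (R)
conditions described in this section have no Python equivalent.

```python
import re

# Group 1 is the optional "("; the conditional demands ")" only when
# group 1 took part in the match.
optional_parens = re.compile(r"( \( )?  [^()]+  (?(1) \) )", re.VERBOSE)

subjects = ["(abc)", "abc", "(abc", "abc)"]

# fullmatch() is used so that unbalanced leftovers count as failures.
accepted = [bool(optional_parens.fullmatch(s)) for s in subjects]

print(accepted)
```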

If the condition is the string (R), it is satisfied if a recursive call
to the pattern or subpattern has been made. At "top level", the
condition is false. This is a PCRE extension. Recursive patterns are
described in the next section.

If the condition is not a sequence of digits or (R), it must be an
assertion. This may be a positive or negative lookahead or lookbehind
assertion. Consider this pattern, again containing non-significant
white space, and with the two alternatives on the second line:

  (?(?=[^a-z]*[a-z])
  \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

The condition is a positive lookahead assertion that matches an
optional sequence of non-letters followed by a letter. In other words,
it tests for the presence of at least one letter in the subject. If a
letter is found, the subject is matched against the first alternative;
otherwise it is matched against the second. This pattern matches
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
letters and dd are digits.

COMMENTS

The sequence (?# marks the start of a comment which continues up to the
next closing parenthesis. Nested parentheses are not permitted. The
characters that make up a comment play no part in the pattern matching
at all.

If the PCRE_EXTENDED option is set, an unescaped # character outside a
character class introduces a comment that continues up to the next
newline character in the pattern.

RECURSIVE PATTERNS

Consider the problem of matching a string in parentheses, allowing for
unlimited nested parentheses. Without the use of recursion, the best
that can be done is to use a pattern that matches up to some fixed
depth of nesting. It is not possible to handle an arbitrary nesting
depth. Perl has provided an experimental facility that allows regular
expressions to recurse (amongst other things). It does this by
interpolating Perl code in the expression at run time, and the code can
refer to the expression itself. A Perl pattern to solve the parentheses
problem can be created like this:

  $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

The (?p{...}) item interpolates Perl code at run time, and in this case
refers recursively to the pattern in which it appears. Obviously, PCRE
cannot support the interpolation of Perl code. Instead, it supports
some special syntax for recursion of the entire pattern, and also for
individual subpattern recursion.

The special item that consists of (? followed by a number greater than
zero and a closing parenthesis is a recursive call of the subpattern of
the given number, provided that it occurs inside that subpattern. (If
not, it is a "subroutine" call, which is described in the next
section.) The special item (?R) is a recursive call of the entire
regular expression.

For example, this PCRE pattern solves the nested parentheses problem
(assume the PCRE_EXTENDED option is set so that white space is
ignored):

  \( ( (?>[^()]+) | (?R) )* \)

First it matches an opening parenthesis. Then it matches any number of
substrings which can either be a sequence of non-parentheses, or a
recursive match of the pattern itself (that is, a correctly
parenthesized substring). Finally there is a closing parenthesis.
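
To make the shape of the recursion concrete, here is a hand-written
Python function that mirrors the three steps just described. It is not
a regular expression and not how PCRE works internally; Python's re
module has no (?R) item, so the function (whose name, match_balanced,
is invented for this sketch) simply follows the same structure: opening
parenthesis, then any mix of non-parentheses and nested groups, then a
closing parenthesis.

```python
def match_balanced(subject, pos=0):
    """Return the offset just past a balanced group starting at pos, or -1."""
    # Step 1: an opening parenthesis is required.
    if pos >= len(subject) or subject[pos] != "(":
        return -1
    pos += 1
    # Step 2: any number of non-parentheses or nested groups.
    while pos < len(subject) and subject[pos] != ")":
        if subject[pos] == "(":
            pos = match_balanced(subject, pos)  # the "(?R)" step
            if pos < 0:
                return -1
        else:
            pos += 1
    # Step 3: a closing parenthesis is required.
    return pos + 1 if pos < len(subject) else -1


balanced_end = match_balanced("(ab(cd)ef)")
unbalanced_end = match_balanced("(aaa()")

print(balanced_end, unbalanced_end)
```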

If this were part of a larger pattern, you would not want to recurse
the entire pattern, so instead you could use this:

  ( \( ( (?>[^()]+) | (?1) )* \) )

We have put the pattern into parentheses, and caused the recursion to
refer to them instead of the whole pattern. In a larger pattern,
keeping track of parenthesis numbers can be tricky. It may be more
convenient to use named parentheses instead. For this, PCRE uses
(?P>name), which is an extension to the Python syntax that PCRE uses
for named parentheses (Perl does not provide named parentheses). We
could rewrite the above example as follows:

  (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )

This particular example pattern contains nested unlimited repeats, and
so the use of atomic grouping for matching strings of non-parentheses
is important when applying the pattern to strings that do not match.
For example, when this pattern is applied to

  (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

it yields "no match" quickly. However, if atomic grouping is not used,
the match runs for a very long time indeed because there are so many
different ways the + and * repeats can carve up the subject, and all
have to be tested before failure can be reported.

At the end of a match, the values set for any capturing subpatterns are
those from the outermost level of the recursion at which the subpattern
value is set. If you want to obtain intermediate values, a callout
function can be used (see below and the pcrecallout documentation). If
the pattern above is matched against

  (ab(cd)ef)

the value for the capturing parentheses is "ef", which is the last
value taken on at the top level. If additional parentheses are added,
giving

  \( ( ( (?>[^()]+) | (?R) )* ) \)
     ^                        ^
     ^                        ^

the string they capture is "ab(cd)ef", the contents of the top level
parentheses. If there are more than 15 capturing parentheses in a
pattern, PCRE has to obtain extra memory to store data during a
recursion, which it does by using pcre_malloc, freeing it via pcre_free
afterwards. If no memory can be obtained, the match fails with the
PCRE_ERROR_NOMEMORY error.

Do not confuse the (?R) item with the condition (R), which tests for
recursion. Consider this pattern, which matches text in angle
brackets, allowing for arbitrary nesting. Only digits are allowed in
nested brackets (that is, when recursing), whereas any characters are
permitted at the outer level.

  < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >

In this pattern, (?(R) is the start of a conditional subpattern, with
two different alternatives for the recursive and non-recursive cases.
The (?R) item is the actual recursive call.

SUBPATTERNS AS SUBROUTINES

If the syntax for a recursive subpattern reference (either by number or
by name) is used outside the parentheses to which it refers, it
operates like a subroutine in a programming language. An earlier
example pointed out that the pattern

  (sens|respons)e and \1ibility

matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If instead the pattern

  (sens|respons)e and (?1)ibility

is used, it does match "sense and responsibility" as well as the other
two strings. Such references must, however, follow the subpattern to
which they refer.

CALLOUTS

Perl has a feature whereby using the sequence (?{...}) causes arbitrary
Perl code to be obeyed in the middle of matching a regular expression.
This makes it possible, amongst other things, to extract different
substrings that match the same pair of parentheses when there is a
repetition.

PCRE provides a similar feature, but of course it cannot obey arbitrary
Perl code. The feature is called "callout". The caller of PCRE provides
an external function by putting its entry point in the global variable
pcre_callout. By default, this variable contains NULL, which disables
all calling out.

Within a regular expression, (?C) indicates the points at which the
external function is to be called. If you want to identify different
callout points, you can put a number less than 256 after the letter C.
The default value is zero. For example, this pattern has two callout
points:

  (?C1)abc(?C2)def

During matching, when PCRE reaches a callout point (and pcre_callout is
set), the external function is called. It is provided with the number
of the callout, and, optionally, one item of data originally supplied
by the caller of pcre_exec(). The callout function may cause matching
to backtrack, or to fail altogether. A complete description of the
interface to the callout function is given in the pcrecallout
documentation.

DIFFERENCES FROM PERL

This section describes the differences in the ways that PCRE and Perl
handle regular expressions. The differences described here are with
respect to Perl 5.8.

1. PCRE does not have full UTF-8 support. Details of what it does have
are given in the section on UTF-8 support in the main pcre page.

2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl
permits them, but they do not mean what you might think. For example,
(?!a){3} does not assert that the next three characters are not "a". It
just asserts that the next character is not "a" three times.

3. Capturing subpatterns that occur inside negative lookahead
assertions are counted, but their entries in the offsets vector are
never set. Perl sets its numerical variables from any such patterns
that are matched before the assertion fails to match something (thereby
succeeding), but only if the negative lookahead assertion contains just
one branch.

4. Though binary zero characters are supported in the subject string,
they are not allowed in a pattern string because it is passed as a
normal C string, terminated by zero. The escape sequence "\0" can be
used in the pattern to represent a binary zero.

5. The following Perl escape sequences are not supported: \l, \u, \L,
\U, \P, \p, \N, and \X. In fact these are implemented by Perl's general
string-handling and are not part of its pattern matching engine. If any
of these are encountered by PCRE, an error is generated.

6. PCRE does support the \Q...\E escape for quoting substrings.
Characters in between are treated as literals. This is slightly
different from Perl in that $ and @ are also handled as literals inside
the quotes. In Perl, they cause variable interpolation (but of course
PCRE does not have variables). Note the following examples:

  Pattern            PCRE matches      Perl matches

  \Qabc$xyz\E        abc$xyz           abc followed by the
                                         contents of $xyz
  \Qabc\$xyz\E       abc\$xyz          abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz

The \Q...\E sequence is recognized both inside and outside character
classes.

7. Fairly obviously, PCRE does not support the (?{code}) and (?p{code})
constructions. However, there is some experimental support for
recursive patterns using the non-Perl items (?R), (?number) and
(?P>name). Also, the PCRE "callout" feature allows an external function
to be called during pattern matching.

8. There are some differences that are concerned with the settings of
captured strings when part of a pattern is repeated. For example,
matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
unset, but in PCRE it is set to "b".

9. PCRE provides some extensions to the Perl regular expression

1346

facilities:

1347

1348

(a) Although lookbehind assertions must match fixed length strings,

1349

each alternative branch of a lookbehind assertion can match a different

1350

length of string. Perl requires them all to have the same length.

1351

1352

(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $

1353

meta-character matches only at the very end of the string.

1354

1355

(c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-

1356

cial meaning is faulted.

1357

1358

(d) If PCRE_UNGREEDY is set, the greediness of the repetition
quantifiers is inverted, that is, by default they are not greedy, but if
followed by a question mark they are.

(e) PCRE_ANCHORED can be used to force a pattern to be tried only at
the first matching position in the subject string.

(f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and
PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no Perl equivalents.

(g) The (?R), (?number), and (?P>name) constructs allow for recursive
pattern matching (Perl can do this using the (?p{code}) construct,
which PCRE cannot support).

(h) PCRE supports named capturing substrings, using the Python syntax.


(i) PCRE supports the possessive quantifier "++" syntax, taken from
Sun's Java package.

(j) The (R) condition, for testing recursion, is a PCRE extension.


(k) The callout facility is PCRE-specific.
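
The named-group syntax mentioned in (h) can be sketched in Python, the
language PCRE borrowed it from. This is only an illustration under
stated assumptions: it uses Python's re module, not PCRE or the LDMud
efuns, and the sample strings are made up. (?P<name>...) names a
capturing group and (?P=name) refers back to it; the recursion form
(?P>name) has no counterpart in Python's re module and is not shown.

```python
import re

# (?P<name>...) defines a named capturing group.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
m = date.search("committed on 2016-06-24")
fields = m.groupdict()          # mapping of group name to matched text

# Named groups are still numbered, so both access styles work.
same = m.group(1) == m.group("year")

# (?P=name) is a backreference to a named group: find a doubled word.
doubled = re.compile(r"\b(?P<word>\w+) (?P=word)\b")
repeat = doubled.search("this is is a test")

print(fields, same, repeat.group("word"))
```

Named groups keep a pattern readable once it has more than a couple of
capturing subpatterns, because the caller no longer depends on counting
parentheses.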

NOTES

The \< and \> metacharacters from Henry Spencer's package
are not available in PCRE, but can be emulated with \b,
combined with \W or \w where required.
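
One possible way to emulate \< and \> is to pair \b with a lookaround
on \w. The sketch below is an assumption-laden illustration, not part
of the PCRE or LDMud documentation: it runs on Python's re module,
which accepts the same \b, (?=...) and (?<=...) constructs, and the
names WORD_START and WORD_END are hypothetical.

```python
import re

WORD_START = r"\b(?=\w)"    # boundary with a word character after it, like \<
WORD_END = r"\b(?<=\w)"     # boundary with a word character before it, like \>

text = "cat concat catalog"

# Offsets where "cat" begins a word, ends a word, or is a whole word.
starts = [m.start() for m in re.finditer(WORD_START + "cat", text)]
ends = [m.start() for m in re.finditer("cat" + WORD_END, text)]
whole = [m.start() for m in re.finditer(WORD_START + "cat" + WORD_END, text)]

print(starts, ends, whole)
```

A plain \b matches at both ends of a word; the lookaround is what
restricts it to one side.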

In LDMud, backtracks are limited by the EVAL_COST runtime
limit, to avoid freezing the driver with a match
like regexp(({"=XX==================="}), "X(.+)+X").
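
To see why a pattern like X(.+)+X needs such a limit, consider how many
ways the nested repeat (.+)+ can divide a run of characters into
non-empty pieces. The following Python sketch counts them. It assumes a
naive backtracking matcher that tries every division before giving up;
the real PCRE engine applies optimizations, so its actual step counts
will differ. The function name count_splits is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n: int) -> int:
    """Count the ways to cut n characters into non-empty consecutive pieces."""
    if n == 0:
        return 1  # nothing left to cut: one complete division
    # The first piece takes k characters; the remainder is divided recursively.
    return sum(count_splits(n - k) for k in range(1, n + 1))


for n in (5, 10, 20):
    print(n, count_splits(n))
```

The count doubles with every additional character, which is why a short
subject string can already keep a backtracking matcher busy for a very
long time when the final X never matches.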

LDMud doesn't support PCRE callouts.

LIMITATIONS

There are some size limitations in PCRE but it is hoped that
they will never in practice be relevant. The maximum length
of a compiled pattern is 65539 (sic) bytes. All values in
repeating quantifiers must be less than 65536. The maximum
number of capturing subpatterns is 65535. There is no
limit to the number of non-capturing subpatterns, but the
maximum depth of nesting of all kinds of parenthesized
subpattern, including capturing subpatterns, assertions, and
other types of subpattern, is 200.

The maximum length of a subject string is the largest
positive number that an integer variable can hold. However, PCRE
uses recursion to handle subpatterns and indefinite
repetition. This means that the available stack space may limit
the size of a subject string that can be processed by
certain patterns.

AUTHOR

Philip Hazel <ph10@cam.ac.uk>
University Computing Service,
New Museums Site,
Cambridge CB2 3QG, England.
Phone: +44 1223 334714