PCRE2 syntax reference: character classes, quantifiers, anchors, groups, lookarounds, backreferences, flags, and advanced features used in grep -P, nginx, PHP, and dozens of other tools.

PCRE — Perl Compatible Regular Expressions

What it is

PCRE (Perl Compatible Regular Expressions) is a regular expression library written by Philip Hazel that closely follows the syntax and semantics of Perl 5 regexes. It is the engine behind grep -P, nginx, Apache, PHP, and dozens of other tools, and an optional engine in ripgrep (-P / --pcre2), which makes it one of the most widely deployed regex dialects in server-side and CLI work. Python's re module is a separate engine with a largely compatible syntax. Where POSIX BRE and ERE (the dialects of default grep and sed, grep -E, and awk) lack lookaheads, named groups, and non-greedy quantifiers, PCRE provides all three.

PCRE1 vs PCRE2 — version status

The original PCRE library (PCRE1, the 8.xx series) reached end of life with 8.45 on 15-June-2021 and is no longer maintained. All current development happens in the PCRE2 10.xx series, which has a different API (pcre2_* symbols, distinct headers, 8/16/32-bit code-unit variants). When a tool says "PCRE" today it almost always means PCRE2 under the hood: grep -P in recent GNU grep releases, ripgrep --pcre2, pcre2grep (the successor to pcregrep), PHP 7.3+, and recent nginx all link against PCRE2.

| Library | Latest version | Date | Status |
| --- | --- | --- | --- |
| PCRE1 (8.xx) | 8.45 | 2021-06-15 | End of life; no further releases |
| PCRE2 (10.xx) | 10.47 | 2025-10-21 | Active; roughly semi-annual releases |

If you maintain code that links the old libpcre, plan a port to libpcre2: take build flags from pcre2-config --cflags, define PCRE2_CODE_UNIT_WIDTH (usually 8) before including pcre2.h, swap PCRE_* constants for PCRE2_*, and replace pcre_compile/pcre_exec with pcre2_compile/pcre2_match (which resolve to pcre2_compile_8/pcre2_match_8 for 8-bit code units). Distributions are gradually dropping PCRE1 from default installs.

Recent PCRE2 features (10.43 → 10.47)

PCRE2 releases arrive roughly twice a year. Highlights from four recent releases that changed user-visible regex syntax, matcher behavior, or the API:

| Version | Date | Headline additions |
| --- | --- | --- |
| 10.43 | 2024-02-16 | New caseless ASCII/non-ASCII restriction flags; pcre2_get_match_data_heapframes_size(); JIT drops ARMv5 |
| 10.44 | 2024-06-07 | pcre2_set_max_pattern_compiled_length(); max group name length raised from 32 to 128 |
| 10.45 | 2025-02-05 | (*scan_substring:...) assertion; UTS#18 character-class operators (&&, --, ~~); Perl-style (?[...]); new AArch64 SIMD JIT codegen; pcre2_set_optimize() |
| 10.47 | 2025-10-21 | Pattern recursion with group returns; new pcre2_next_match() API for iterating matches |

Scan substring (*scan_substring:(N)...) — PCRE2 10.45+

A scan-substring assertion re-runs a sub-pattern against the text already captured by group N, without re-consuming input. Use it to enforce a secondary constraint on a previously matched span — for example, "match a word, then assert the captured word contains rh somewhere after its first character".

regex
\b(\w++)(*scan_substring:(1).+rh)

UTS#18 character-class operators — PCRE2 10.45+ (with PCRE2_ALT_EXTENDED_CLASS)

When the host enables PCRE2_ALT_EXTENDED_CLASS, character classes accept three new infix operators: && (intersection), -- (subtraction), and ~~ (symmetric difference). This brings PCRE2 in line with Unicode TR#18 and avoids the older (?=[...])[^...] workarounds.

regex
[\p{L}--\p{Lu}]         # any letter except uppercase
[\d&&[2-7]]             # digits intersected with 2..7
[a-z~~aeiou]            # consonants (symmetric difference)

Perl-style extended class (?[...]) — PCRE2 10.45+

(?[...]) is an alternative class-algebra syntax modeled on Perl's experimental extended bracketed character classes. It uses + for union, & for intersection, and - for subtraction, and it tolerates internal whitespace and comments for readability.

regex
(?[ \p{Greek} & \p{L} ])      # Greek letters only
(?[ [a-z] + [0-9] - [aeiou] ]) # lowercase + digits, minus vowels

Larger group names — PCRE2 10.44+

The maximum name length for (?<name>...), (?P<name>...), and \k<name> is now 128 characters (was 32). Useful for code-generated patterns where the names encode pipeline context.

Pattern-compilation size limit — PCRE2 10.44+

Hosts can call pcre2_set_max_pattern_compiled_length() to cap how large a compiled pattern is allowed to become, hardening against pathological user-supplied regexes that would otherwise compile into a very large internal form.

Quick tool reference

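| Tool | PCRE engine | Flag/syntax |
| --- | --- | --- |
| grep (GNU) | yes, with -P | grep -P 'pattern' |
| ripgrep | optional (Rust regex by default, PCRE2 with -P / --pcre2) | rg -P 'pattern' |
| sed | no (POSIX BRE, or ERE with -E) | workaround: perl -pe |
| perl | native Perl regex (the dialect PCRE follows) | perl -ne '/pattern/ && print' |
| pcregrep / pcre2grep | yes | pcregrep 'pattern' file |
| PHP | yes | preg_match('/pattern/', $str) |
| Python re | no (own engine, near-PCRE syntax) | re.search(r'pattern', s) |
| nginx | yes | ~ / ~* match operators |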

Character classes

A character class matches exactly one character from a defined set. PCRE supports literal ranges [a-z], negation [^...], POSIX named classes [[:alpha:]], shorthand escapes \d/\w/\s, and Unicode property classes \p{L} — richer than anything POSIX BRE or ERE offers.

Literal and dot

The dot . is PCRE's wildcard for any single character except newline; enable the /s (dotall) flag to make it match newlines too. Escape any metacharacter with \ to match it literally.

regex
a         literal 'a'
.         any character except newline (unless /s flag)
\.        literal dot (escaped)

Built-in shorthand classes

Shorthand escapes are compact alternatives to bracket expressions: by default \d equals [0-9], \w equals [a-zA-Z0-9_], and \s covers common ASCII whitespace (with (*UCP) they extend to Unicode). Their uppercase inverses (\D, \W, \S) match everything the lowercase versions do not.

regex
\d        digit [0-9]
\D        non-digit
\w        word character [a-zA-Z0-9_]
\W        non-word character
\s        whitespace [ \t\n\x0B\f\r]
\S        non-whitespace
\h        horizontal whitespace (space, tab, and Unicode spaces such as U+00A0)
\H        non-horizontal whitespace
\v        vertical whitespace [\n\x0B\f\r\x85\x{2028}\x{2029}]
\V        non-vertical whitespace
\N        any character except newline (regardless of /s)

Unicode properties (PCRE2 / \p{})

\p{PROPERTY} matches any Unicode code point with the named property; \P{PROPERTY} is its negation. It requires a library built with Unicode support (standard in PCRE2 builds, optional in PCRE1) and is essential for internationalized text, where the default \w misses non-ASCII word characters.

regex
\p{L}     any Unicode letter
\p{Lu}    uppercase letter
\p{Ll}    lowercase letter
\p{Nd}    decimal digit
\p{Z}     separator
\p{P}     punctuation
\P{L}     NOT a letter (negation)
\X        Unicode extended grapheme cluster

Character class syntax

regex
[abc]     a, b, or c
[^abc]    anything except a, b, c
[a-z]     lowercase a through z
[A-Z0-9]  uppercase or digit
[[:alpha:]]  POSIX alpha class
[[:digit:]]  POSIX digit class
[[:alnum:]]  POSIX alphanumeric
[[:space:]]  POSIX whitespace
[[:upper:]]  uppercase
[[:lower:]]  lowercase
[[:punct:]]  punctuation

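Set operations on nested classes (PCRE2 10.45+, and only when the host sets PCRE2_ALT_EXTENDED_CLASS). Without that option, a [ inside a class is a literal character unless it starts a POSIX name such as [:alpha:]:

regex
[[a-z]&&[^aeiou]]   lowercase consonants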

Anchors and boundaries

Anchors assert a position in the string without consuming any characters. ^ and $ match line boundaries in multiline mode; \A and \z always match the absolute string edges regardless of flags. \b matches the zero-width boundary between a word character and a non-word character.

regex
^         start of string (or line with /m)
$         end of string or before a final newline (or end of line with /m)
\A        absolute start of string (ignores /m)
\Z        end of string before optional final newline
\z        absolute end of string
\b        word boundary (between \w and \W)
\B        non-word boundary
\G        position where last match left off

Quantifiers

Quantifiers specify how many times the preceding element must match. PCRE supports three modes: greedy (default, match as much as possible), lazy (add ?, match as little as possible), and possessive (add +, match greedily but give nothing back on backtrack failure — faster but can change results).

Greedy (default — consume as much as possible)

Greedy quantifiers expand as far as possible, then backtrack if the overall match requires it. This is the default and works well in most cases; switch to lazy when the greedy match overshoots the intended boundary.

regex
*         0 or more
+         1 or more
?         0 or 1
{n}       exactly n times
{n,}      n or more times
{n,m}     between n and m times

Lazy / non-greedy (add ? — consume as little as possible)

Lazy quantifiers match the minimum needed, then expand only if the rest of the pattern fails. Use them when you need to match up to the first occurrence of a delimiter rather than the last — for example <.*?> to extract individual HTML tags.

regex
*?        0 or more, lazy
+?        1 or more, lazy
??        0 or 1, lazy
{n,m}?    between n and m, lazy

Possessive (add + — no backtracking)

Possessive quantifiers are like greedy ones but never give characters back during backtracking. They can make matching faster by preventing catastrophic backtracking, but will cause the match to fail if the possessive part consumed characters that the rest of the pattern needs.

regex
*+        0 or more, possessive
++        1 or more, possessive
?+        0 or 1, possessive
{n,m}+    between n and m, possessive

Greedy vs lazy example:

text
Input:   <b>bold</b> and <i>italic</i>
Greedy:  <.*>   matches  <b>bold</b> and <i>italic</i>  (entire thing)
Lazy:    <.*?>  matches  <b>                             (stops at first >)
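
The same comparison can be reproduced in a few lines. This is a minimal sketch that assumes Python's re module as a stand-in engine (it is not PCRE, but greedy and lazy quantifiers behave the same way in both):

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the end of the input, then backtracks to the LAST '>'.
greedy = re.findall(r"<.*>", text)

# Lazy: .*? stops at the FIRST '>' each time, so every tag is found separately.
lazy = re.findall(r"<.*?>", text)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```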

Groups

Groups cluster part of a pattern and control alternation scope. PCRE offers capturing (numbered), non-capturing, named, atomic, and branch-reset groups, each with different trade-offs between feature access and performance.

Capturing group

A capturing group records the text it matches and makes it available as a back-reference (\1, \2, …) or in substitution strings. Each opening ( increments the group counter left to right.

regex
(abc)         captures 'abc'; back-reference: \1, \2, ...

Non-capturing group

Use (?:...) when you need to group for alternation or quantifier scope but do not need to refer back to the matched text. It is slightly faster than a capturing group and keeps the group-number sequence clean.

regex
(?:abc)       groups without capturing

Named capturing group

Named groups attach a label to the captured text, making patterns more readable and substitutions more maintainable than numbered back-references. PCRE accepts both the Python-style (?P<name>...) and the Perl-style (?<name>...) syntax.

regex
(?P<name>abc)    named capture (Python style)
(?<name>abc)     named capture (Perl style)
\k<name>         back-reference to named group
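
A short sketch of reading named captures from a host language, assuming Python's re module (which accepts the (?P<name>...) form but writes named back-references as (?P=name) rather than \k<name>). The date pattern and sample string are illustrative only:

```python
import re

# (?P<name>...) is accepted by both PCRE and Python's re.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

m = date.search("released on 2025-02-05")
year = m.group("year")
fields = m.groupdict()

print(year)    # 2025
print(fields)  # {'year': '2025', 'month': '02', 'day': '05'}

# Named groups are still numbered left to right, so group(1) is the same capture.
same_capture = m.group(1) == m.group("year")
print(same_capture)  # True
```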

Atomic group (no backtracking into the group)

An atomic group ((?>...)) matches like a normal group but, once exited, refuses to give up any of its matched characters during backtracking. Functionally equivalent to a possessive quantifier applied to a sub-expression; useful for eliminating catastrophic backtracking in complex patterns.

regex
(?>abc)

Branch reset group (PCRE2)

regex
(?|(\d+)|(\w+))   both alternatives share capture group 1

Backreferences

A back-reference matches the exact text captured by an earlier group, not just the same pattern. That is the difference from repeating the sub-pattern: (\w+)\s+\w+ accepts any two words, while (\w+)\s+\1 requires the second word to equal the first. Back-references are therefore the primary way to detect repeated words or matching delimiters.

regex
\1        back-reference to capture group 1
\2        back-reference to capture group 2
\k<name>  back-reference to named group

Match a repeated word:

regex
\b(\w+)\s+\1\b

Example:

bash
echo "the the problem" | grep -P '\b(\w+)\s+\1\b'

Output:

text
the the problem
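
To see why both \b anchors matter, the repeated-word pattern can be exercised on a few inputs. The sketch below assumes Python's re module, where \1 and \b behave as in PCRE; find_repeats is a hypothetical helper name:

```python
import re

repeated_word = re.compile(r"\b(\w+)\s+\1\b")

def find_repeats(line):
    """Return every word that is immediately followed by itself."""
    return [m.group(1) for m in repeated_word.finditer(line)]

print(find_repeats("the the problem"))    # ['the']
print(find_repeats("this is is a test"))  # ['is']  (the 'is' inside 'this' is skipped)
print(find_repeats("the theory"))         # []      (trailing \b rejects 'the' + 'the'ory)
```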

Lookaheads and lookbehinds

Lookarounds assert context without consuming characters — the match position does not advance.

Positive lookahead (?=...)

Match "foo" only when followed by "bar":

regex
foo(?=bar)

bash
echo "foobar foobaz" | grep -oP 'foo(?=bar)'

Output:

text
foo

Negative lookahead (?!...)

Match "foo" NOT followed by "bar":

regex
foo(?!bar)

Positive lookbehind (?<=...)

Match "bar" only when preceded by "foo":

regex
(?<=foo)bar

bash
echo "foobar" | grep -oP '(?<=foo)bar'

Output:

text
bar

Negative lookbehind (?<!...)

Match "bar" NOT preceded by "foo":

regex
(?<!foo)bar
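
The four lookaround forms can be compared side by side on one input. This sketch assumes Python's re module, which shares the fixed-length lookaround syntax with PCRE; the starts helper and the sample text are illustrative:

```python
import re

text = "foobar foobaz bazbar"

def starts(pattern):
    """Start offsets of all matches; the lookaround text is never part of a match."""
    return [m.start() for m in re.finditer(pattern, text)]

print(starts(r"foo(?=bar)"))   # [0]   foo followed by bar
print(starts(r"foo(?!bar)"))   # [7]   foo NOT followed by bar
print(starts(r"(?<=foo)bar"))  # [3]   bar preceded by foo
print(starts(r"(?<!foo)bar"))  # [17]  bar NOT preceded by foo
```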

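Variable-length lookbehind

Each top-level branch of a lookbehind may have its own fixed length, as in the example below. Since PCRE2 10.43, branches whose length varies up to a bounded maximum are accepted as well.

regex
(?<=foo|foobar)baz      # alternation of different lengths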

Flags / modifiers

Flags can be set inline with (?flags) or in a group (?flags:pattern) rather than requiring the tool to support a command-line flag.

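| Flag | Inline | Meaning |
| --- | --- | --- |
| i | (?i) | Case-insensitive |
| m | (?m) | Multiline: ^/$ match at line start/end |
| s | (?s) | Dotall: . matches newline |
| x | (?x) | Extended: whitespace and # comments ignored |
| xx | (?xx) | Extra-extended: spaces and tabs inside character classes also ignored |
| u | (*UTF) at pattern start | UTF mode: pattern and subject are treated as Unicode strings (PHP's /u) |
| g | n/a | Global (tool-level flag, not PCRE) |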

Inline flag examples:

regex
(?i)hello         case-insensitive match for "hello"
(?im)^start       multiline + case-insensitive
(?s).+            dot matches newline
(?x)              # free-spacing mode; comments allowed
  \d{4}           # year
  [-/]            # separator
  \d{2}           # month
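
A quick way to check what each inline flag changes is to run the patterns against small inputs. The sketch below assumes Python's re module, which accepts the same (?i), (?m), (?s), and (?x) prefixes (placed at the very start of the pattern); the sample strings are made up:

```python
import re

# (?i): case-insensitive.
ci = re.search(r"(?i)hello", "Say HELLO").group()

# (?m): ^ also matches right after each newline.
line_starts = re.findall(r"(?m)^\w+", "one two\nthree four")

# (?s): the dot is allowed to match a newline.
dotall = re.search(r"(?s)a.+z", "a\nz").group()

# (?x): whitespace and # comments inside the pattern are ignored.
date = re.search(r"""(?x)
    \d{4}    # year
    [-/]     # separator
    \d{2}    # month
""", "due 2024/06").group()

print(ci)            # HELLO
print(line_starts)   # ['one', 'three']
print(repr(dotall))  # 'a\nz'
print(date)          # 2024/06
```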

Alternation

| separates alternatives that are tried left to right; the first branch that allows the overall match to succeed wins, regardless of length. Wrap alternatives in a group to limit their scope — without a group, | spans the entire surrounding expression.

regex
cat|dog           matches "cat" or "dog"
(cat|dog)s?       matches "cat", "cats", "dog", "dogs"
(?:yes|no|maybe)  non-capturing alternation

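Because alternation is ordered, cat|category matches only "cat" in "category"; POSIX engines would instead prefer the longest alternative. Put the longer or more specific branch first when branches share a prefix.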
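
Both points (branch order and grouping scope) are easy to verify. A minimal sketch, assuming Python's re module, whose backtracking alternation is ordered like PCRE's:

```python
import re

# The leftmost branch that lets the match succeed wins, even if a later one is longer.
short_first = re.search(r"cat|category", "category").group()
long_first = re.search(r"category|cat", "category").group()
print(short_first)  # cat
print(long_first)   # category

# Without a group, | splits the whole pattern: this reads as "^cat" OR "dog$".
ungrouped = bool(re.search(r"^cat|dog$", "hotdog"))
# With a group, both anchors apply to every branch.
grouped = bool(re.search(r"^(?:cat|dog)$", "hotdog"))
print(ungrouped)  # True
print(grouped)    # False
```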

Escape sequences

Escape sequences represent characters that cannot be typed directly in a regex, or that would otherwise be interpreted as metacharacters. \Q...\E is particularly useful for quoting a block of user-supplied text verbatim inside a larger pattern.

regex
\t     tab
\n     newline
\r     carriage return
\f     form feed
\a     bell
\e     escape (0x1B)
\0     null
\xHH   hex character (e.g. \x41 = A)
\x{HHHH}  Unicode code point (e.g. \x{1F600})
\cX    control character (e.g. \cM = CR)
\Q...\E  quote literal string (disable metacharacters inside)

Subroutines and recursive patterns

PCRE lets a group call itself or another group as a subroutine, enabling patterns that match recursive structures like nested parentheses or balanced tags — impossible with standard non-recursive regex.

Call by group number:

regex
(?1)    recurse into capture group 1
(?R)    recurse into entire pattern

Call by group name:

regex
(?&name)    recurse into named group

Match arbitrarily nested parentheses:

regex
\((?:[^()]++|(?R))*\)

bash
echo "(a (b (c) d) e)" | grep -oP '\((?:[^()]++|(?R))*\)'

Output:

text
(a (b (c) d) e)

Conditionals

A conditional (?(condition)yes|no) selects between two sub-patterns based on whether a capture group has matched. This allows a single pattern to handle input that can appear in two forms, such as quoted or unquoted values.

Test whether group N matched, then choose pattern:

sql
(?(N)yes|no)           if group N captured, use 'yes' pattern, else 'no'
(?(name)yes|no)        same for named group
(?(DEFINE)...)         define subroutines only — never matches directly

Conditional example: match a value that is either fully double-quoted or not quoted at all. The closing quote is required only if group 1 captured an opening quote:

regex
^(")?\w+(?(1)")$
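
The quoted-or-unquoted case mentioned above can be sketched with a group-existence conditional. This assumes Python's re module, which supports the (?(1)yes|no) form with the same meaning; the pattern and inputs are illustrative:

```python
import re

# Group 1 optionally matches an opening double quote; (?(1)") then demands a
# closing quote only when the opening one was actually captured.
value = re.compile(r'^(")?\w+(?(1)")$')

samples = ['abc', '"abc"', '"abc', 'abc"']
results = {s: bool(value.search(s)) for s in samples}

for s, ok in results.items():
    print(f"{s!r:8} -> {ok}")
# 'abc' and '"abc"' are accepted; the two half-quoted forms are rejected.
```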

DEFINE block (subroutine library)

A (?(DEFINE)...) block declares named subroutines without participating in the match itself, acting as a library of reusable sub-patterns. This keeps complex patterns readable by naming components and calling them with (?&name).

The example is written in free-spacing mode ((?x)) so that the line breaks and indentation are ignored:

regex
(?x)
(?(DEFINE)
  (?<digits>\d+)
  (?<word>[A-Za-z]+)
)
\b(?&digits)-(?&word)\b

Verbs (PCRE2 control verbs)

Control verbs are special tokens embedded in a pattern that direct the PCRE2 engine's backtracking behavior. They are an advanced optimization and control mechanism — most patterns never need them, but they are essential for preventing catastrophic backtracking or building pattern-based tokenizers.

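regex
(*FAIL)   force failure at this point (synonym: (*F))
(*ACCEPT) force success, end matching
(*SKIP)   on backtrack, restart the next match attempt at this position
(*PRUNE)  on backtrack, fail the current attempt and advance the start by one character
(*COMMIT) on backtrack, fail the whole match with no further attempts
(*THEN)   on backtrack, jump to the next alternative in the enclosing group

The following share the (*...) syntax but are option settings that must appear at the very start of a pattern, not backtracking verbs:

regex
(*UTF)    enable UTF mode
(*UCP)    use Unicode properties for \w, \d etc.
(*NOTEMPTY)  fail if match is empty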

Practical patterns

Email (simplified)

regex
[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}

URL

regex
https?://[^\s/$.?#].[^\s]*

IPv4 address

regex
(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)
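
The pattern is unanchored, so it should be wrapped in anchors (or used with a full-match call) when validating input. The sketch below assumes Python's re module with re.ASCII so that \d stays [0-9] as in default PCRE; is_ipv4 is a hypothetical helper:

```python
import re

octet = r"(?:25[0-5]|2[0-4]\d|[01]?\d\d?)"
ipv4 = re.compile(rf"(?:{octet}\.){{3}}{octet}", re.ASCII)

def is_ipv4(s):
    """True only if the WHOLE string is a dotted quad with octets 0..255."""
    return ipv4.fullmatch(s) is not None

print(is_ipv4("192.168.0.1"))  # True
print(is_ipv4("256.1.1.1"))    # False
print(is_ipv4("1.2.3"))        # False

# Unanchored search still finds a valid-looking address inside invalid input.
partial = ipv4.search("256.1.1.1").group()
print(partial)                 # 56.1.1.1
```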

IPv6 address (full form only; no :: compression)

regex
(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}
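
This pattern accepts only addresses written as eight colon-separated groups. As a rough check of its coverage, the sketch below compares it with Python's standard ipaddress module (an assumption of this example, not something the pattern depends on):

```python
import re
import ipaddress

full_ipv6 = re.compile(r"(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}")

def regex_ok(s):
    return full_ipv6.fullmatch(s) is not None

def stdlib_ok(s):
    try:
        ipaddress.IPv6Address(s)
        return True
    except ValueError:
        return False

for addr in ["2001:0db8:0000:0000:0000:ff00:0042:8329",
             "2001:db8::ff00:42:8329",
             "::1"]:
    print(addr, regex_ok(addr), stdlib_ok(addr))
# Only the first (uncompressed) address passes the regex; all three are valid IPv6.
```

For real validation, a dedicated parser is usually the safer choice.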

ISO date (YYYY-MM-DD)

regex
\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])
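
The pattern checks the shape of a date, not whether the day exists in that month. A minimal sketch of combining it with a calendar check, assuming Python's re and datetime modules; both function names are hypothetical:

```python
import re
from datetime import date

iso = re.compile(r"\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])", re.ASCII)

def has_iso_shape(s):
    return iso.fullmatch(s) is not None

def is_real_date(s):
    """Shape check first, then let the calendar reject impossible days."""
    if not has_iso_shape(s):
        return False
    try:
        date.fromisoformat(s)
        return True
    except ValueError:
        return False

print(has_iso_shape("2023-02-30"), is_real_date("2023-02-30"))  # True False
print(has_iso_shape("2024-02-29"), is_real_date("2024-02-29"))  # True True
print(has_iso_shape("2023-13-01"))                              # False
```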

UUID v4

regex
[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}
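
Note that the pattern is lowercase-only and pins the version nibble to 4. The sketch below, which assumes Python's re and uuid modules and uses well-known sample UUIDs, shows what it accepts and rejects:

```python
import re
import uuid

uuid4_re = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}"
)

# str(uuid.uuid4()) is lowercase and hyphenated, so it always fits the pattern.
fresh_ok = uuid4_re.fullmatch(str(uuid.uuid4())) is not None

# A version-1 UUID has '1' where the pattern demands '4'.
v1_ok = uuid4_re.fullmatch("6ba7b810-9dad-11d1-80b4-00c04fd430c8") is not None

# Uppercase input is rejected unless the pattern is made case-insensitive.
upper = "F47AC10B-58CC-4372-A567-0E02B2C3D479"
upper_ok = uuid4_re.fullmatch(upper) is not None
upper_ci_ok = re.fullmatch("(?i)" + uuid4_re.pattern, upper) is not None

print(fresh_ok, v1_ok, upper_ok, upper_ci_ok)  # True False False True
```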

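Semantic version

The pattern is split over three lines for readability, so it needs free-spacing mode ((?x)) to be read as a single pattern:

regex
(?x)v?(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)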
(?:-(?P<pre>[0-9A-Za-z\-]+(?:\.[0-9A-Za-z\-]+)*))?
(?:\+(?P<build>[0-9A-Za-z\-]+(?:\.[0-9A-Za-z\-]+)*))?
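
A sketch of using the three-line pattern from code, assuming Python's re module with re.VERBOSE (its equivalent of (?x)) so the line breaks are ignored; the version strings are made up:

```python
import re

semver = re.compile(r"""
    v?(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)
    (?:-(?P<pre>[0-9A-Za-z\-]+(?:\.[0-9A-Za-z\-]+)*))?
    (?:\+(?P<build>[0-9A-Za-z\-]+(?:\.[0-9A-Za-z\-]+)*))?
""", re.VERBOSE)

full = semver.fullmatch("v1.4.0-rc.1+build.5").groupdict()
plain = semver.fullmatch("2.0.0").groupdict()
leading_zero = semver.fullmatch("01.2.3")

print(full)
# {'major': '1', 'minor': '4', 'patch': '0', 'pre': 'rc.1', 'build': 'build.5'}
print(plain["pre"], plain["build"])  # None None  (optional groups did not participate)
print(leading_zero)                  # None       (0|[1-9]\d* forbids leading zeros)
```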

HTML tag (capture tag name and attributes)

regex
<(?P<tag>[a-zA-Z][a-zA-Z0-9]*)(?P<attrs>[^>]*)>

Extract key=value pairs

regex
(?P<key>\w+)=(?P<value>"[^"]*"|\S+)
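
The quoted alternative comes first so that a quoted value containing spaces is taken whole before \S+ gets a chance. A minimal sketch, assuming Python's re module and a made-up log line:

```python
import re

pair = re.compile(r'(?P<key>\w+)=(?P<value>"[^"]*"|\S+)')

line = 'user=alice msg="disk almost full" code=507'

# Collect every pair; strip the surrounding quotes from quoted values.
parsed = {m["key"]: m["value"].strip('"') for m in pair.finditer(line)}

print(parsed)
# {'user': 'alice', 'msg': 'disk almost full', 'code': '507'}
```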

Multiline block between delimiters (dotall)

bash
grep -Pzo '(?s)BEGIN.*?END' file.txt

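Output: every BEGIN...END block in file.txt. The -z option makes grep read the file as NUL-terminated records, so .*? can cross line breaks; each printed match is followed by a NUL byte. Exit status is 0 when something matched and 1 when nothing did.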

Using PCRE with common tools

grep -P

bash
# Extract all email addresses from a file
grep -oP '[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}' file.txt

# Match lines containing a repeated word
grep -P '\b(\w+)\s+\1\b' file.txt

# Lookbehind: print only the version number (grep -o prints the whole match, not individual groups)
grep -oP '(?<=version: )\d+\.\d+\.\d+' package.json

Output: the matched text (with -o) or the matching lines. Exit status is 0 when something matched and 1 when nothing did.

ripgrep --pcre2 / -P

bash
# Lookbehind (requires --pcre2)
rg --pcre2 '(?<=foo)\d+'

# Named group (rg supports group output with --replace)
rg -P '(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})' --replace '$year/$month/$day' dates.txt

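Output: the first command prints the matching lines; the second prints each matching line with the date rewritten as year/month/day. rg exits 0 when at least one match is found.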

perl one-liners

bash
# In-place substitution with lookahead (sed cannot do lookaheads)
perl -i -pe 's/foo(?=bar)/FOO/g' file.txt

# Multi-line match across lines
perl -0777 -pe 's/BEGIN.*?END/REPLACED/gs' file.txt

# Print the IP address captured by the named group
perl -ne 'if (/(?P<ip>\d+\.\d+\.\d+\.\d+)/) { print "$+{ip}\n" }' access.log

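Output: the first command edits file.txt in place and prints nothing; the second prints the file with each BEGIN...END block replaced; the third prints one captured IP address per matching line.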

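pcregrep / pcre2grep

pcregrep ships with PCRE1; its PCRE2 counterpart is pcre2grep, which accepts the options used here.

bash
# Multi-line match (-M lets a pattern span line breaks)
pcregrep -M 'start.*?\nend' file.txt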

# Print only the matching portion
pcregrep -o '\d+\.\d+\.\d+\.\d+' access.log

Output: the matching lines for -M, or only the matched text for -o. Exit status is 0 when something matched and 1 when nothing did.

PCRE vs POSIX BRE/ERE comparison

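| Feature | POSIX BRE (grep) | POSIX ERE (grep -E) | PCRE (grep -P) |
| --- | --- | --- | --- |
| Non-greedy +? | no | no | yes |
| Lookahead (?=...) | no | no | yes |
| Lookbehind (?<=...) | no | no | yes |
| Named groups | no | no | yes |
| \d \w \s | no (GNU adds \w and \s) | no (GNU adds \w and \s) | yes |
| Back-references | \1 | \1 (GNU extension) | \1 and \k<name> |
| Alternation | GNU extension only (backslash + pipe) | yes | yes |
| Recursive patterns | no | no | yes |
| Unicode \p{...} | no | no | yes (Unicode-enabled build) |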
