PCRE2 syntax reference: character classes, quantifiers, anchors, groups, lookarounds, backreferences, flags, and advanced features used in grep -P, ripgrep, nginx, PHP, and dozens of other tools.
PCRE — Perl Compatible Regular Expressions
What it is
PCRE (Perl Compatible Regular Expressions) is a regular expression library written by Philip Hazel that closely follows the syntax and semantics of Perl 5 regexes. It is the engine behind grep -P, nginx, Apache, and PHP's preg_* functions, and an optional engine in ripgrep (--pcre2), making it one of the most widely deployed regex dialects in server-side and CLI work. Python's re module is a separate implementation with similar Perl-style syntax, not PCRE. Where POSIX BRE and ERE (used by grep, sed, and awk) lack features like lookaheads, named groups, and non-greedy quantifiers, PCRE provides them all.
PCRE1 vs PCRE2 — version status
The original PCRE library (PCRE1, the 8.xx series) reached end of life with 8.45 on 15 June 2021 and is no longer maintained. All current development happens in the PCRE2 10.xx series, which has a different API (pcre2_* symbols, distinct headers, 8/16/32-bit code-unit variants). When a tool says "PCRE" today it almost always means PCRE2 under the hood: current builds of GNU grep -P, ripgrep --pcre2, pcre2grep (the successor to pcregrep), PHP 7.3+, and recent nginx all link against PCRE2.
| Library | Latest version | Date | Status |
|---|---|---|---|
| PCRE1 (8.xx) | 8.45 | 2021-06-15 | End of life; no further releases |
| PCRE2 (10.xx) | 10.47 | 2025-10-21 | Active — semi-annual releases |
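If you maintain code that links the old libpcre, plan a port to libpcre2: get build flags from pcre2-config --cflags, define PCRE2_CODE_UNIT_WIDTH (usually 8) before including pcre2.h, swap PCRE_* constants for PCRE2_*, and replace pcre_compile/pcre_exec with pcre2_compile/pcre2_match (macros that expand to the width-suffixed pcre2_compile_8/pcre2_match_8). Match offsets come back through a pcre2_match_data block rather than a caller-supplied ovector array. Distributions are gradually dropping PCRE1 from default installs.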
Recent PCRE2 features (10.43 → 10.47)
PCRE2's release cadence is roughly semi-annual. Highlights from recent releases that affect user-visible regex syntax or matcher behavior:
| Version | Date | Headline additions |
|---|---|---|
| 10.43 | 2024-02-16 | New caseless ASCII/non-ASCII restriction flags; pcre2_get_match_data_heapframes_size(); JIT drops ARMv5 |
| 10.44 | 2024-06-07 | pcre2_set_max_pattern_compiled_length(); max group name length raised from 32 to 128 |
| 10.45 | 2025-02-05 | (*scan_substring:...) assertion; UTS#18 character-class operators (&&, --, ~~); Perl-style (?[...]); new AArch64 SIMD JIT codegen; pcre2_set_optimize() |
| 10.47 | 2025-10-21 | Pattern recursion with group returns; new pcre2_next_match() API for iterating matches |
Scan substring (*scan_substring:(N)...) — PCRE2 10.45+
A scan-substring assertion re-runs a sub-pattern against the text already captured by group N, without re-consuming input. Use it to enforce a secondary constraint on a previously matched span — for example, "match a word, then assert the captured word contains rh somewhere after its first character".
\b(\w++)(*scan_substring:(1).+rh)
UTS#18 character-class operators — PCRE2 10.45+ (with PCRE2_ALT_EXTENDED_CLASS)
When the host enables PCRE2_ALT_EXTENDED_CLASS, character classes accept three new infix operators: && (intersection), -- (subtraction), and ~~ (symmetric difference). This brings PCRE2 in line with Unicode TR#18 and avoids the older (?=[...])[^...] workarounds.
[\p{L}--\p{Lu}] # any letter except uppercase
[\d&&[2-7]] # digits intersected with 2..7
[a-z~~aeiou] # consonants (symmetric difference)
Perl-style extended class (?[...]) — PCRE2 10.45+
(?[...]) is an alternative class-algebra syntax modeled on Perl's experimental extended classes. It uses + for union and & for intersection, and tolerates internal whitespace and comments for readability.
(?[ \p{Greek} & \p{L} ]) # Greek letters only
(?[ [a-z] + [0-9] - [aeiou] ]) # lowercase + digits, minus vowels
Larger group names — PCRE2 10.44+
The maximum name length for (?<name>...), (?P<name>...), and \k<name> is now 128 characters (was 32). Useful for code-generated patterns where the names encode pipeline context.
Pattern-compilation size limit — PCRE2 10.44+
Hosts can call pcre2_set_max_pattern_compiled_length() to cap how large a compiled pattern is allowed to become, hardening against pathological user-supplied regexes that compile into very large state machines.
Quick tool reference
| Tool | PCRE engine | Flag/syntax |
|---|---|---|
| grep | yes (GNU grep built with PCRE support) | grep -P 'pattern' |
| ripgrep | optional (Rust regex by default; PCRE2 with -P / --pcre2) | rg -P 'pattern' |
| sed | no (POSIX BRE; ERE with -E) | workaround: perl -pe |
| perl | native Perl regex (the dialect PCRE emulates) | perl -ne '/pattern/ && print' |
| pcregrep / pcre2grep | yes | pcre2grep 'pattern' file |
| PHP | yes | preg_match('/pattern/', $str) |
| Python re | no (similar Perl-style syntax) | re.search(r'pattern', s) |
| nginx | yes | ~ / ~* directives |
Character classes
A character class matches exactly one character from a defined set. PCRE supports literal ranges [a-z], negation [^...], POSIX named classes [[:alpha:]], shorthand escapes \d/\w/\s, and Unicode property classes \p{L} — richer than anything POSIX BRE or ERE offers.
Literal and dot
The dot . is PCRE's wildcard for any single character except newline; enable the /s (dotall) flag to make it match newlines too. Escape any metacharacter with \ to match it literally.
a literal 'a'
. any character except newline (unless /s flag)
\. literal dot (escaped)
Built-in shorthand classes
Shorthand escapes are compact alternatives to bracket expressions — \d equals [0-9], \w equals [a-zA-Z0-9_], and \s covers common whitespace. Their uppercase inverses (\D, \W, \S) match everything the lowercase version does not.
\d digit [0-9]
\D non-digit
\w word character [a-zA-Z0-9_]
\W non-word character
\s whitespace [ \t\r\n\f\v]
\S non-whitespace
\h horizontal whitespace (space, tab, and other Unicode horizontal spaces)
\H non-horizontal whitespace
\v vertical whitespace [\n\v\f\r\x85\x{2028}\x{2029}]
\V non-vertical whitespace
\N any character except newline (regardless of /s)
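As a quick check of the shorthand classes, here is a minimal sketch in Python. Python's re module is not PCRE, but \d, \w, \s and their uppercase inverses behave the same way for this input. The re.ASCII flag is an assumption made here so that \d and \w keep the ASCII-only definitions listed above; the sample string and variable names are illustrative.

```python
import re

text = "id=42\tname=Ada_L"

# \d+ : runs of digits; re.ASCII restricts \d to [0-9]
digits = re.findall(r"\d+", text, flags=re.ASCII)

# \w+ : runs of word characters [a-zA-Z0-9_]
words = re.findall(r"\w+", text, flags=re.ASCII)

# \S+ : runs of non-whitespace, so the tab separates the two fields
fields = re.findall(r"\S+", text)

# \D+ : runs of non-digits (the inverse of \d)
non_digits = re.findall(r"\D+", text, flags=re.ASCII)

print(digits)      # ['42']
print(words)       # ['id', '42', 'name', 'Ada_L']
print(fields)      # ['id=42', 'name=Ada_L']
print(non_digits)  # ['id=', '\tname=Ada_L']
```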
Unicode properties (PCRE2 / \p{})
\p{PROPERTY} matches any Unicode code point with the named property; \P{PROPERTY} is its negation. It requires a library built with Unicode support (the default in PCRE2, optional in PCRE1) and is essential for internationalized text where \w would miss non-ASCII word characters.
\p{L} any Unicode letter
\p{Lu} uppercase letter
\p{Ll} lowercase letter
\p{Nd} decimal digit
\p{Z} separator
\p{P} punctuation
\P{L} NOT a letter (negation)
\X Unicode extended grapheme cluster
Character class syntax
[abc] a, b, or c
[^abc] anything except a, b, c
[a-z] lowercase a through z
[A-Z0-9] uppercase or digit
[[:alpha:]] POSIX alpha class
[[:digit:]] POSIX digit class
[[:alnum:]] POSIX alphanumeric
[[:space:]] POSIX whitespace
[[:upper:]] uppercase
[[:lower:]] lowercase
[[:punct:]] punctuation
Class set operations (PCRE2 10.45+ with PCRE2_ALT_EXTENDED_CLASS only; without that option && is not an operator):
[[a-z]&&[^aeiou]] lowercase consonants
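Bracket expressions can be exercised with a short sketch. It uses Python's re module as a stand-in, which is not PCRE and supports neither POSIX named classes nor the && operator, so the consonant example falls back on a negative lookahead instead of class intersection. The sample strings are illustrative.

```python
import re

# [aeiou] : exactly one character from the set
vowels = re.findall(r"[aeiou]", "regex")

# [^aeiou\s]+ : runs of anything that is neither a vowel nor whitespace
chunks = re.findall(r"[^aeiou\s]+", "perl regex")

# [A-Z0-9]+ : several ranges combined in one class
upper_or_digit = re.findall(r"[A-Z0-9]+", "PCRE2 v10")

# Intersection workaround: the lookahead rejects vowels, then [a-z] consumes one letter
consonants = re.findall(r"(?![aeiou])[a-z]", "pcre")

print(vowels)          # ['e', 'e']
print(chunks)          # ['p', 'rl', 'r', 'g', 'x']
print(upper_or_digit)  # ['PCRE2', '10']
print(consonants)      # ['p', 'c', 'r']
```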
Anchors and boundaries
Anchors assert a position in the string without consuming any characters. ^ and $ match line boundaries in multiline mode; \A and \z always match the absolute string edges regardless of flags. \b matches the zero-width boundary between a word character and a non-word character.
^ start of string (or line with /m)
$ end of string (or line with /m)
\A absolute start of string (ignores /m)
\Z end of string before optional final newline
\z absolute end of string
\b word boundary (between \w and \W)
\B non-word boundary
\G position where last match left off
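The effect of multiline mode on ^, and the behavior of \A and \b, can be sketched in Python. Python's re is not PCRE, and its end-of-string anchors differ (Python's \Z behaves like PCRE's \z), so this example deliberately covers only the start anchors and word boundaries, which behave the same way. The sample strings are illustrative.

```python
import re

text = "alpha\nbeta\ngamma"

# Without the multiline flag, ^ matches only at the start of the string
first_only = re.findall(r"^\w+", text)

# With (?m), ^ also matches immediately after every newline
each_line = re.findall(r"(?m)^\w+", text)

# \A ignores the multiline flag and still anchors to the absolute start
absolute = re.findall(r"(?m)\A\w+", text)

# \b needs a word/non-word transition on both sides of "cat"
whole_words = re.findall(r"\bcat\b", "cat concat cat's")

print(first_only)   # ['alpha']
print(each_line)    # ['alpha', 'beta', 'gamma']
print(absolute)     # ['alpha']
print(whole_words)  # ['cat', 'cat']
```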
Quantifiers
Quantifiers specify how many times the preceding element must match. PCRE supports three modes: greedy (default, match as much as possible), lazy (add ?, match as little as possible), and possessive (add +, match greedily but give nothing back on backtrack failure — faster but can change results).
Greedy (default — consume as much as possible)
Greedy quantifiers expand as far as possible, then backtrack if the overall match requires it. This is the default and works well in most cases; switch to lazy when the greedy match overshoots the intended boundary.
* 0 or more
+ 1 or more
? 0 or 1
{n} exactly n times
{n,} n or more times
{n,m} between n and m times
Lazy / non-greedy (add ? — consume as little as possible)
Lazy quantifiers match the minimum needed, then expand only if the rest of the pattern fails. Use them when you need to match up to the first occurrence of a delimiter rather than the last — for example <.*?> to extract individual HTML tags.
*? 0 or more, lazy
+? 1 or more, lazy
?? 0 or 1, lazy
{n,m}? between n and m, lazy
Possessive (add + — no backtracking)
Possessive quantifiers are like greedy ones but never give characters back during backtracking. They can make matching faster by preventing catastrophic backtracking, but will cause the match to fail if the possessive part consumed characters that the rest of the pattern needs.
*+ 0 or more, possessive
++ 1 or more, possessive
?+ 0 or 1, possessive
{n,m}+ between n and m, possessive
Greedy vs lazy example:
Input: <b>bold</b> and <i>italic</i>
Greedy: <.*> matches <b>bold</b> and <i>italic</i> (entire thing)
Lazy: <.*?> matches <b> (stops at first >)
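The same comparison can be reproduced in a few lines of Python. Python's re is not PCRE, but greedy and lazy quantifiers behave identically for this input. Note that findall returns every lazy match, not just the first one.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r"<.*>", html)   # .* runs to the last '>' on the line
lazy = re.findall(r"<.*?>", html)    # .*? stops at the nearest '>'

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```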
Groups
Groups cluster part of a pattern and control alternation scope. PCRE offers four group types: capturing (numbered), non-capturing, named, and atomic — each with different trade-offs between feature access and performance.
Capturing group
A capturing group records the text it matches and makes it available as a back-reference (\1, \2, …) or in substitution strings. Each opening ( increments the group counter left to right.
(abc) captures 'abc'; back-reference: \1, \2, ...
Non-capturing group
Use (?:...) when you need to group for alternation or quantifier scope but do not need to refer back to the matched text. It is slightly faster than a capturing group and keeps the group-number sequence clean.
(?:abc) groups without capturing
Named capturing group
Named groups attach a label to the captured text, making patterns more readable and substitutions more maintainable than numbered back-references. PCRE accepts both the Python-style (?P<name>...) and the Perl-style (?<name>...) syntaxes.
(?P<name>abc) named capture (Python style)
(?<name>abc) named capture (Perl style; (?'name'abc) is equivalent)
\k<name> back-reference to named group
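A minimal sketch of named and non-capturing groups working together, written in Python. Python's re is not PCRE: it accepts only the (?P<name>...) spelling, and \g<name> in the replacement string is Python-specific. The pattern and sample text are illustrative.

```python
import re

# Two named groups; the separator uses a non-capturing group so it gets no number
date_re = re.compile(r"(?P<year>\d{4})(?:-|/)(?P<month>\d{2})")

m = date_re.search("released 2025-02")

year = m.group("year")       # access by name
month_by_number = m.group(2) # named groups are still numbered left to right
as_dict = m.groupdict()

# Reorder the pieces in a substitution by referring to the names
swapped = date_re.sub(r"\g<month>/\g<year>", "released 2025-02")

print(year)             # 2025
print(month_by_number)  # 02
print(as_dict)          # {'year': '2025', 'month': '02'}
print(swapped)          # released 02/2025
```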
Atomic group (no backtracking into the group)
An atomic group ((?>...)) matches like a normal group but, once exited, refuses to give up any of its matched characters during backtracking. Functionally equivalent to a possessive quantifier applied to a sub-expression; useful for eliminating catastrophic backtracking in complex patterns.
(?>abc)
Branch reset group (PCRE2)
(?|(\d+)|(\w+)) both alternatives share capture group 1
Backreferences
A back-reference reuses the exact text captured by an earlier group, not just the pattern. This distinguishes it from repeating the sub-pattern: \1 requires the same characters again, which makes back-references the primary way to detect repeated words or matching delimiters.
\1 back-reference to capture group 1
\2 back-reference to capture group 2
\k<name> back-reference to named group
Match a repeated word:
\b(\w+)\s+\1\b
Example:
echo "the the problem" | grep -P '\b(\w+)\s+\1\b'
Output:
the the problem
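The repeated-word pattern can also be tried programmatically. This sketch uses Python's re module (not PCRE, but numbered back-references work the same way here) and shows how the leading \b prevents a false hit inside "this is". The sample sentences are illustrative.

```python
import re

repeated = re.compile(r"\b(\w+)\s+\1\b")

hit = repeated.search("the the problem")
miss = repeated.search("this is fine")   # "is" inside "this" is not at a word boundary

# Collapse the duplicate by replacing the whole match with group 1
fixed = repeated.sub(r"\1", "the the problem")

print(hit.group(0))  # the the
print(hit.group(1))  # the
print(miss)          # None
print(fixed)         # the problem
```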
Lookaheads and lookbehinds
Lookarounds assert context without consuming characters — the match position does not advance.
Positive lookahead (?=...)
Match "foo" only when followed by "bar":
foo(?=bar)
echo "foobar foobaz" | grep -oP 'foo(?=bar)'
Output:
foo
Negative lookahead (?!...)
Match "foo" NOT followed by "bar":
foo(?!bar)
Positive lookbehind (?<=...)
Match "bar" only when preceded by "foo":
(?<=foo)bar
echo "foobar" | grep -oP '(?<=foo)bar'
Output:
bar
Negative lookbehind (?<!...)
Match "bar" NOT preceded by "foo":
(?<!foo)bar
Variable-length lookbehind
(?<=foo|foobar)baz # top-level alternatives of different fixed lengths: accepted by PCRE1 and PCRE2
(?<=\d{1,3})px # bounded variable length within one branch: PCRE2 10.43+ only
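All four lookaround forms can be compared side by side. The sketch below uses Python's re module, which is not PCRE and only allows fixed-width lookbehind, so every lookbehind here has a fixed length. The sample strings are illustrative.

```python
import re

text = "foobar foobaz"

pos_ahead = re.findall(r"foo(?=bar)", text)      # foo followed by bar
neg_ahead = re.findall(r"foo(?!bar)\w*", text)   # foo NOT followed by bar
pos_behind = re.findall(r"(?<=foo)ba\w", text)   # ba + one char, preceded by foo
neg_behind = re.findall(r"(?<!foo)bar", "foobar crowbar")

# The lookahead is zero-width: the match covers only "foo"
span = re.search(r"foo(?=bar)", text).span()

print(pos_ahead)   # ['foo']
print(neg_ahead)   # ['foobaz']
print(pos_behind)  # ['bar', 'baz']
print(neg_behind)  # ['bar']
print(span)        # (0, 3)
```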
Flags / modifiers
Flags can be set inline with (?flags) or in a group (?flags:pattern) rather than requiring the tool to support a command-line flag.
| Flag | Inline | Meaning |
|---|---|---|
| i | (?i) | Case-insensitive |
| m | (?m) | Multiline: ^/$ match at line start/end |
| s | (?s) | Dotall: . matches newline |
| x | (?x) | Extended: whitespace and # comments ignored |
| xx | (?xx) | Extra-extended: spaces and tabs in character classes also ignored |
| u | n/a (use (*UTF) at pattern start) | UTF mode; a host-level modifier such as PHP's /u, not an inline PCRE2 option |
| g | n/a | Global (tool-level flag, not PCRE) |
Inline flag examples:
(?i)hello case-insensitive match for "hello"
(?im)^start multiline + case-insensitive
(?s).+ dot matches newline
(?x) free-spacing mode; # comments allowed
\d{4} # year
[-/] # separator
\d{2} # month
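The inline flags above can be tried out in Python, whose re module is not PCRE but accepts the same (?i), (?m), (?s), and (?x) spellings when they appear at the start of the pattern. The sample strings are illustrative.

```python
import re

# (?x): whitespace and # comments inside the pattern are ignored
date_re = re.compile(r"""(?x)
    \d{4}   # year
    [-/]    # separator
    \d{2}   # month
""")
verbose_hit = date_re.search("due 2025/02").group(0)

# (?i): case-insensitive
caseless_hit = re.search(r"(?i)hello", "Say HELLO").group(0)

# (?s): without it the dot refuses to cross a newline
no_dotall = re.search(r"a.+z", "a\nz")
with_dotall = re.search(r"(?s)a.+z", "a\nz")

# (?im): flags combine; ^ matches at every line start, ignoring case
line_starts = re.findall(r"(?im)^start", "Start\nSTART\nrestart")

print(verbose_hit)              # 2025/02
print(caseless_hit)             # HELLO
print(no_dotall)                # None
print(with_dotall is not None)  # True
print(line_starts)              # ['Start', 'START']
```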
Alternation
| separates alternatives that are tried left to right; the first branch that allows the overall match to succeed wins, regardless of length. Wrap alternatives in a group to limit their scope — without a group, | spans the entire surrounding expression.
cat|dog matches "cat" or "dog"
(cat|dog)s? matches "cat", "cats", "dog", "dogs"
(?:yes|no|maybe) non-capturing alternation
Unlike POSIX engines, PCRE has no longest-match rule, so list the preferred (usually longer) alternative first.
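A short sketch of why branch order matters, using Python's re module (not PCRE, but it follows the same leftmost-branch-first rule). The sample words are illustrative.

```python
import re

# The first branch that lets the match succeed wins, even if a later one is longer
short_first = re.search(r"cat|category", "category").group(0)
long_first = re.search(r"category|cat", "category").group(0)

# Grouping limits the scope of | so that s? applies to either alternative
plurals = re.findall(r"(?:cat|dog)s?", "cats and dog")

print(short_first)  # cat
print(long_first)   # category
print(plurals)      # ['cats', 'dog']
```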
Escape sequences
Escape sequences represent characters that cannot be typed directly in a regex, or that would otherwise be interpreted as metacharacters. \Q...\E is particularly useful for quoting a block of user-supplied text verbatim inside a larger pattern.
\t tab
\n newline
\r carriage return
\f form feed
\a bell
\e escape (0x1B)
\0 null
\xHH hex character (e.g. \x41 = A)
\x{HHHH} Unicode code point (e.g. \x{1F600})
\cX control character (e.g. \cM = CR)
\Q...\E quote literal string (disable metacharacters inside)
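To illustrate why quoting user-supplied text matters, here is a sketch in Python. Python's re module is not PCRE and has no \Q...\E; re.escape is used here as the closest analogue, so treat this as a demonstration of the idea rather than of PCRE's own syntax. The strings are illustrative.

```python
import re

user_text = "1+1=2 (approx.)"
haystack = "claim: 1+1=2 (approx.) ok"

# Used raw, the metacharacters + ( . ) change the meaning and the search fails
raw = re.search(user_text, haystack)

# re.escape backslash-escapes the metacharacters, similar in effect to \Q...\E
quoted = re.search(re.escape(user_text), haystack)

# \xHH and \t behave as listed above: \x41 is 'A'
hex_and_tab = re.fullmatch(r"\x41\t", "A\t")

print(raw)                      # None
print(quoted.group(0))          # 1+1=2 (approx.)
print(hex_and_tab is not None)  # True
```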
Subroutines and recursive patterns
PCRE lets a group call itself or another group as a subroutine, enabling patterns that match recursive structures like nested parentheses or balanced tags — impossible with standard non-recursive regex.
Call by group number:
(?1) recurse into capture group 1
(?R) recurse into entire pattern
Call by group name:
(?&name) recurse into named group
Match arbitrarily nested parentheses:
\((?:[^()]*|(?R))*\)
echo "(a (b (c) d) e)" | grep -oP '\((?:[^()]*|(?R))*\)'
Output:
(a (b (c) d) e)
Conditionals
A conditional (?(condition)yes|no) selects between two sub-patterns based on whether a capture group has matched. This allows a single pattern to handle input that can appear in two forms, such as quoted or unquoted values.
Test whether group N matched, then choose pattern:
(?(N)yes|no) if group N captured, use 'yes' pattern, else 'no'
(?(name)yes|no) same for named group
(?(DEFINE)...) define subroutines only — never matches directly
Conditional example: match a word that is either fully double-quoted or unquoted (the closing quote is required only if group 1 captured an opening quote):
^(")?\w+(?(1)")$
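A conditional that accepts a value either fully double-quoted or unquoted can be sketched in Python, whose re module (not PCRE) supports the same (?(N)yes|no) syntax. The pattern adds a second capture group for the word itself, and the candidate strings are illustrative.

```python
import re

# Group 1 optionally captures an opening quote; (?(1)") demands a closing
# quote only when group 1 actually took part in the match.
value_re = re.compile(r'^(")?(\w+)(?(1)")$')

results = {}
for candidate in ['"abc"', 'abc', '"abc', 'abc"']:
    m = value_re.match(candidate)
    results[candidate] = m.group(2) if m else None

print(results['"abc"'])  # abc
print(results['abc'])    # abc
print(results['"abc'])   # None  (opening quote without a closing one)
print(results['abc"'])   # None  (closing quote without an opening one)
```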
DEFINE block (subroutine library)
A (?(DEFINE)...) block declares named subroutines without participating in the match itself, acting as a library of reusable sub-patterns. This keeps complex patterns readable by naming components and calling them with (?&name).
Group all named subroutines in a (?(DEFINE)...) block that never participates in the match:
(?(DEFINE)
(?<digits>\d+)
(?<word>[A-Za-z]+)
)
\b(?&digits)-(?&word)\b
Verbs (PCRE2 control verbs)
Control verbs are special tokens embedded in a pattern that direct the PCRE2 engine's backtracking behavior. They are an advanced optimization and control mechanism — most patterns never need them, but they are essential for preventing catastrophic backtracking or building pattern-based tokenizers.
(*FAIL) force failure at this point (synonym: (*F))
(*ACCEPT) force success, end matching
(*SKIP) skip to this position on backtrack
(*PRUNE) prune the backtrack tree
(*COMMIT) prevent any backtracking past this point
(*THEN) try next alternative in enclosing group
(*UTF) enable UTF mode
(*UCP) use Unicode properties for \w, \d etc.
(*NOTEMPTY) fail if match is empty
Practical patterns
Email (simplified)
[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}
URL
https?://[^\s/$.?#].[^\s]*
IPv4 address
(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)
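The IPv4 pattern is unanchored, which matters in practice. The sketch below (Python's re, not PCRE; re.ASCII is an assumption added so \d means [0-9]) shows that a full-string match rejects out-of-range octets, while an unanchored search can still find a valid-looking address inside invalid input. The sample addresses are illustrative.

```python
import re

IPV4 = re.compile(
    r"(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)",
    re.ASCII,
)

valid = IPV4.fullmatch("192.168.0.1")
too_big = IPV4.fullmatch("256.1.1.1")

# Unanchored search skips the first '9' and matches the rest
partial = IPV4.search("999.1.1.1").group(0)

print(valid is not None)  # True
print(too_big)            # None
print(partial)            # 99.1.1.1
```

When validating a whole field, anchor the pattern (\A...\z in PCRE) or use a full-match API; when extracting from free text, consider surrounding it with lookarounds such as (?<![\d.]) and (?![\d.]).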
IPv6 address (full eight-group form only; :: compression is not handled)
(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}
ISO date (YYYY-MM-DD)
\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])
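The ISO date pattern checks shape and ranges, not the calendar. A minimal sketch in Python (re is not PCRE; the helper name is_real_date is hypothetical) pairs the regex with a datetime check to show the difference.

```python
import re
from datetime import date

ISO_DATE = re.compile(r"\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])", re.ASCII)

def is_real_date(text: str) -> bool:
    """Shape check with the regex, then a calendar check with datetime."""
    if ISO_DATE.fullmatch(text) is None:
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True

# The regex alone accepts day 30 for any month, including February
regex_only = ISO_DATE.fullmatch("2025-02-30") is not None

print(regex_only)                  # True
print(is_real_date("2025-02-30"))  # False
print(is_real_date("2025-02-05"))  # True
print(is_real_date("2025-13-01"))  # False
```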
UUID v4
[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}
Semantic version (one pattern wrapped over three lines; join them or use the x flag)
v?(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)
(?:-(?P<pre>[0-9A-Za-z\-]+(?:\.[0-9A-Za-z\-]+)*))?
(?:\+(?P<build>[0-9A-Za-z\-]+(?:\.[0-9A-Za-z\-]+)*))?
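Because the semantic-version pattern spans three lines, it has to be joined or compiled in extended mode. The sketch below does the latter in Python (re is not PCRE, but (?x) and (?P<name>...) work the same way here); the version strings are illustrative.

```python
import re

# (?x) lets the three lines be used as written; whitespace between them is ignored
SEMVER = re.compile(r"""(?x)
    v?(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)
    (?:-(?P<pre>[0-9A-Za-z\-]+(?:\.[0-9A-Za-z\-]+)*))?
    (?:\+(?P<build>[0-9A-Za-z\-]+(?:\.[0-9A-Za-z\-]+)*))?
""")

full = SEMVER.fullmatch("v1.2.3-rc.1+build.5").groupdict()
plain = SEMVER.fullmatch("2.0.0").groupdict()
leading_zero = SEMVER.fullmatch("1.02.3")   # "02" is rejected by 0|[1-9]\d*

print(full)          # {'major': '1', 'minor': '2', 'patch': '3', 'pre': 'rc.1', 'build': 'build.5'}
print(plain["pre"])  # None
print(leading_zero)  # None
```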
HTML tag (capture tag name and attributes)
<(?P<tag>[a-zA-Z][a-zA-Z0-9]*)(?P<attrs>[^>]*)>
Extract key=value pairs
(?P<key>\w+)=(?P<value>"[^"]*"|\S+)
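A short sketch of the key=value pattern in use, written in Python (re is not PCRE, but the pattern is valid in both). The quoted branch is listed first so that a value containing spaces is captured whole; the log line is illustrative.

```python
import re

PAIR = re.compile(r'(?P<key>\w+)=(?P<value>"[^"]*"|\S+)')

line = 'user=alice msg="hello world" n=3'

# Collect the pairs, stripping the surrounding quotes from quoted values
pairs = {m.group("key"): m.group("value").strip('"') for m in PAIR.finditer(line)}

print(pairs)  # {'user': 'alice', 'msg': 'hello world', 'n': '3'}
```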
Multiline block between delimiters (dotall)
grep -Pzo '(?s)BEGIN.*?END' file.txt
Output: every BEGIN...END block in the file, NUL-separated because of -z; grep exits 0 if at least one match was found, 1 otherwise
Using PCRE with common tools
grep -P
# Extract all email addresses from a file
grep -oP '[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}' file.txt
# Match lines containing a repeated word
grep -P '\b(\w+)\s+\1\b' file.txt
# Lookbehind keeps the prefix out of the output (grep -o prints the whole match, not individual groups)
grep -oP '(?<=version: )\d+\.\d+\.\d+' package.json
Output: depends on the input files; each command prints its matches and exits 0 if anything matched, 1 otherwise
ripgrep --pcre2 / -P
# Lookbehind (requires --pcre2)
rg --pcre2 '(?<=foo)\d+'
# Named group (rg supports group output with --replace)
rg -P '(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})' --replace '$year/$month/$day' dates.txt
Output: depends on the input; rg prints the matching lines (rewritten when --replace is given) and exits 0 if anything matched
perl one-liners
# In-place substitution with lookahead (sed cannot do lookaheads)
perl -i -pe 's/foo(?=bar)/FOO/g' file.txt
# Multi-line match across lines
perl -0777 -pe 's/BEGIN.*?END/REPLACED/gs' file.txt
# Print lines matching named group
perl -ne 'if (/(?P<ip>\d+\.\d+\.\d+\.\d+)/) { print "$+{ip}\n" }' access.log
Output: depends on the input; perl -i rewrites file.txt in place and prints nothing, while the other two commands print to stdout
pcregrep
# Multi-line match (like grep -P but with -M for multiline)
pcregrep -M 'start.*?\nend' file.txt
# Print only the matching portion
pcregrep -o '\d+\.\d+\.\d+\.\d+' access.log
Output: depends on the input; the same options work with pcre2grep, the PCRE2 successor to pcregrep
PCRE vs POSIX BRE/ERE comparison
| Feature | POSIX BRE (grep) | POSIX ERE (grep -E) | PCRE (grep -P) |
|---|---|---|---|
| Non-greedy +? | no | no | yes |
| Lookahead (?=...) | no | no | yes |
| Lookbehind (?<=...) | no | no | yes |
| Named groups | no | no | yes |
| \d \w \s | no (GNU adds \w and \s) | no (GNU adds \w and \s) | yes |
| Back-references | \1 | not in POSIX (GNU grep -E accepts \1) | \1 and \k<name> |
| Alternation | GNU extension only | yes | yes |
| Recursive patterns | no | no | yes |
| Unicode \p{...} | no | no | yes (with Unicode support) |