Python's built-in regex module — compile vs not, match/search/findall/finditer/sub, named groups, lookarounds, verbose mode (re.X), and how it differs from PCRE.
re — Python Regular Expressions
What it is
re is Python's standard-library regular-expression engine. It implements a Perl-style dialect close to PCRE (named groups, lookaround, non-greedy quantifiers, Unicode-aware \w, \d, and \s) with a handful of Python-specific divergences: no \K, no recursive patterns, no \p{…} property classes, and named groups written as (?P<name>…) rather than PCRE's (?<name>…). It is the engine Django uses to route URLs, that pandas uses for str.contains, and that every Python script reaches for once a plain substring test with in is no longer enough.
The third-party regex module is largely API-compatible with re and adds variable-length lookbehind, \K, recursive patterns, and \p{…} property classes; consider it if you need those features. For comparison with the dialect used by grep -P, ripgrep, and nginx, see linux/pcre.
Install
re is part of the CPython standard library — available everywhere Python is, with no install step.
python -c "import re; print(re.__doc__.splitlines()[0])"
Output:
Support for regular expressions (RE).
API overview
The module exposes a small set of top-level functions that mirror methods on compiled pattern objects. The two are functionally identical — re.search(pat, s) is re.compile(pat).search(s) with internal caching.
| Function | Method | Returns | Notes |
|---|---|---|---|
| re.match(pat, s) | p.match(s) | Match or None | anchors at start of string |
| re.fullmatch(pat, s) | p.fullmatch(s) | Match or None | must match the entire string |
| re.search(pat, s) | p.search(s) | Match or None | first match anywhere |
| re.findall(pat, s) | p.findall(s) | list[str] or list[tuple] | all non-overlapping matches |
| re.finditer(pat, s) | p.finditer(s) | iterator of Match | lazy; preferred for many matches |
| re.sub(pat, repl, s) | p.sub(repl, s) | str | substitute |
| re.subn(pat, repl, s) | p.subn(repl, s) | (str, count) | sub + count |
| re.split(pat, s) | p.split(s) | list[str] | split by pattern |
| re.compile(pat, flags) | - | Pattern | compile once, reuse |
| re.escape(s) | - | str | escape regex metachars in s |
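As a quick illustration of the function/method mirroring in the table, the sketch below runs one pattern both ways. The sample string and variable names are made up for the example.

```python
import re

text = 'order 66 shipped, order 77 pending'  # hypothetical sample input

# Module-level function: the pattern is compiled (or fetched from the cache) per call
via_function = re.findall(r'\d+', text)

# Compiled pattern: the same operation as a method on a Pattern object
NUMBER = re.compile(r'\d+')
via_method = NUMBER.findall(text)
print(via_function == via_method)

# Pattern methods also accept an optional start position (pos)
late = NUMBER.search(text, 10)
print(late.group())
```

The optional pos argument is only available on the compiled-pattern methods, which is one practical reason to keep a Pattern object around.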
Compile vs not
re.compile(pattern, flags=0) parses the pattern once and returns a reusable Pattern object. Use it whenever the same regex is applied repeatedly: inside a loop, in a hot function, or as a module-level constant. For one-off uses the module-level functions are fine because re keeps an internal cache of recently compiled patterns (up to 512 entries in current CPython; this is an implementation detail, not a documented guarantee).
import re
PHONE = re.compile(r'\d{3}-\d{3}-\d{4}')
for line in ['call 555-123-4567', 'no phone', 'two: 555-000-1111 and 555-222-3333']:
    for match in PHONE.finditer(line):
        print(match.group())
Output:
555-123-4567
555-000-1111
555-222-3333
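For a function called millions of times, calling a precompiled Pattern skips the per-call cache lookup that the module-level functions perform. The saving is in call overhead, not in matching itself, so it shows up mainly with short inputs and cheap patterns; measure before relying on a specific speedup. For one-off matches, the difference is negligible.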
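One way to check the overhead on your own machine is a small timeit comparison. This is only a sketch: the pattern, input line, and iteration count are arbitrary choices, and no particular ratio should be expected.

```python
import re
import timeit

PATTERN = r'\d{3}-\d{3}-\d{4}'
COMPILED = re.compile(PATTERN)
line = 'call 555-123-4567'

# Same match either way; only the per-call overhead differs
t_module = timeit.timeit(lambda: re.search(PATTERN, line), number=100_000)
t_compiled = timeit.timeit(lambda: COMPILED.search(line), number=100_000)

print(f'module-level: {t_module:.3f}s  compiled: {t_compiled:.3f}s')
```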
match vs search vs fullmatch vs findall vs finditer
These five differ in where the regex engine starts looking and what it returns. Mixing them up is the most common re bug.
| Function | Where it anchors | Returns |
|---|---|---|
| match | start of string only | first match (or None) |
| fullmatch | start AND end of string | first match (or None); must cover the whole string |
| search | anywhere in string | first match (or None) |
| findall | anywhere, all non-overlapping | list of strings (or tuples if there are groups) |
| finditer | anywhere, all non-overlapping | iterator of Match objects |
import re
s = 'aaa 123 bbb 456'
print(re.match(r'\d+', s)) # None — string starts with 'a'
print(re.search(r'\d+', s)) # finds '123'
print(re.fullmatch(r'\d+', s)) # None: the string is not digits from start to end
print(re.findall(r'\d+', s)) # all numeric runs
print(list(re.finditer(r'\d+', s))) # same, as Match objects
Output:
None
<re.Match object; span=(4, 7), match='123'>
None
['123', '456']
[<re.Match object; span=(4, 7), match='123'>, <re.Match object; span=(12, 15), match='456'>]
match doesn't mean "match the whole string"; it means "match starting from position 0". For a match that must cover the whole string, use fullmatch or wrap the pattern in \A...\Z.
The Match object
A successful match, search, or fullmatch returns a Match object — never a string. To get the matched text use .group(), to get capture groups use .group(n) or .groups(), and to get the position use .span().
import re
m = re.search(r'(\w+)=(\d+)', 'value=42 elsewhere')
print(m.group()) # full match
print(m.group(0)) # same as .group()
print(m.group(1)) # first capture
print(m.group(2)) # second capture
print(m.groups()) # all captures as tuple
print(m.span()) # (start, end)
print(m.start(), m.end())
Output:
value=42
value=42
value
42
('value', '42')
(0, 8)
0 8
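Because a failed search returns None rather than an empty Match, calling .group() on the result without checking is a common source of AttributeError. A minimal guard might look like the following; parse_pair is a hypothetical helper named only for this example.

```python
import re

def parse_pair(text):
    """Return (key, int(value)) for the first key=value pair, or None if absent."""
    # search() returns None on failure, so test the result before calling .group()
    if (m := re.search(r'(\w+)=(\d+)', text)) is None:
        return None
    return m.group(1), int(m.group(2))

found = parse_pair('value=42 elsewhere')
missing = parse_pair('nothing to see')
print(found, missing)
```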
Named groups
(?P<name>...) captures into a named slot accessible via .group('name'), m['name'], or .groupdict(). Named groups make patterns and sub replacement strings self-documenting; prefer them over numeric groups whenever there are more than two captures.
import re
DATE = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
m = DATE.search('shipped on 2026-05-25 to customer 42')
print(m.group('year'), m.group('month'), m.group('day'))
print(m.groupdict())
Output:
2026 05 25
{'year': '2026', 'month': '05', 'day': '25'}
In a sub replacement, refer to named groups with \g<name> (or \1/\2 for numeric).
import re
print(re.sub(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})',
             r'\g<day>/\g<month>/\g<year>',
             '2026-05-25'))
Output:
25/05/2026
Backreferences
\1, \2, … inside the pattern match the same text that the corresponding capture group matched. Use them to find repeated words, fixed-length palindromes, or matching delimiters.
import re
# Find adjacent duplicated words
print(re.findall(r'\b(\w+) \1\b', 'the the quick brown fox fox jumps'))
# Match content wrapped in matching quote chars (single OR double)
m = re.search(r"(['\"])(.*?)\1", 'set name="alice" and age=30')
print(m.group(2))
Output:
['the', 'fox']
alice
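Backreferences can also be written by name with (?P=name), which tends to read better once a pattern has several groups. The sketch below repeats the matching-quote idea with named groups; the group names and sample text are illustrative.

```python
import re

# (?P=quote) must match whatever the group named 'quote' captured
QUOTED = re.compile(r"""(?P<quote>['"])(?P<body>.*?)(?P=quote)""")

sample = '''say "it's fine" or 'plain' text'''
bodies = [m['body'] for m in QUOTED.finditer(sample)]
print(bodies)
```

The apostrophe inside the double-quoted span is left alone because the closing delimiter has to be the same character that opened it.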
Lookaround — (?=…), (?!…), (?<=…), (?<!…)
Lookaround assertions test whether something does or does not appear adjacent to the current position without consuming characters. Python's re supports all four variants. The lookbehind must be fixed-length (use the regex library if you need variable-length).
import re
# Match price digits only when followed by USD (lookahead)
print(re.findall(r'\d+(?=\s*USD)', 'cost: 100 USD or 50 EUR or 30 USD'))
# Match digit runs NOT preceded by $ (negative lookbehind); \d in the class
# stops the match from restarting inside '$100' and yielding '00'
print(re.findall(r'(?<![$\d])\d+', 'price $100 quantity 50 items'))
# Identifiers not followed by a paren — "variables, not function calls"
print(re.findall(r'\b\w+\b(?!\s*\()', 'foo() bar baz() qux'))
Output:
['100', '30']
['50']
['bar', 'qux']
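Since lookarounds consume nothing, a pattern made only of assertions matches a position rather than text. A small sketch of that idea, inserting thousands separators into a string of digits (add_thousands is a hypothetical helper, and the input is assumed to contain digits only):

```python
import re

def add_thousands(digits):
    # Zero-width match: a position preceded by a digit and followed by
    # one or more complete groups of three digits up to the end of the string
    return re.sub(r'(?<=\d)(?=(?:\d{3})+$)', ',', digits)

big = add_thousands('1234567')
small = add_thousands('999')
print(big, small)
```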
Flags
Flags change how the regex engine interprets the pattern. They can be passed as the flags= keyword argument or embedded inline with (?aiLmsux) at the start of the pattern (or (?i:...) to scope to a group).
| Flag | Short | Effect |
|---|---|---|
| re.IGNORECASE | re.I | case-insensitive matching |
| re.MULTILINE | re.M | ^ and $ match at every line boundary |
| re.DOTALL | re.S | . matches newlines too |
| re.VERBOSE | re.X | ignore whitespace and # comments inside the pattern |
| re.ASCII | re.A | \w, \d, \s match ASCII only (not Unicode) |
| re.UNICODE | re.U | default in Python 3; explicit is rarely needed |
| re.DEBUG | - | print compiled-pattern debug info |
Combine flags with |:
import re
pat = re.compile(r'^error: (.+)$', re.I | re.M)
log = """
INFO: started
ERROR: out of memory
WARN: retrying
Error: connection lost
"""
print(pat.findall(log))
Output:
['out of memory', 'connection lost']
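The inline forms mentioned above can be sketched as follows. The sample strings are invented; the point is only the difference between a global inline flag, a scoped one, and re.S.

```python
import re

# Global inline flag: same effect as passing re.I
global_flag = bool(re.search(r'(?i)error', 'ERROR: disk full'))

# Scoped inline flag: only the group is case-insensitive
scoped = re.compile(r'(?i:error): Disk')
scoped_hit = bool(scoped.search('ERROR: Disk full'))
scoped_miss = bool(scoped.search('ERROR: disk full'))

# re.S lets '.' cross newlines
spanning = re.search(r'<b>(.*)</b>', '<b>two\nlines</b>', re.S).group(1)

print(global_flag, scoped_hit, scoped_miss, repr(spanning))
```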
Verbose mode (re.X)
re.VERBOSE (a.k.a. re.X) tells the engine to ignore unescaped whitespace and #-comments inside the pattern, allowing you to format regexes like prose. This is the single most useful feature for keeping non-trivial regexes maintainable.
import re
# A semver-like version pattern, commented and formatted
VERSION = re.compile(r"""
^ # start of string
v? # optional leading 'v'
(?P<major>\d+) \.
(?P<minor>\d+) \.
(?P<patch>\d+)
(?: - (?P<pre>[\w.]+) )? # optional pre-release tag
(?: \+ (?P<build>[\w.]+) )? # optional build metadata
$
""", re.VERBOSE)
for s in ['1.2.3', 'v1.2.3-beta.4+exp.sha.5114f85', 'not a version']:
    m = VERSION.match(s)
    print(s, '→', m.groupdict() if m else None)
Output:
1.2.3 → {'major': '1', 'minor': '2', 'patch': '3', 'pre': None, 'build': None}
v1.2.3-beta.4+exp.sha.5114f85 → {'major': '1', 'minor': '2', 'patch': '3', 'pre': 'beta.4', 'build': 'exp.sha.5114f85'}
not a version → None
Inside a verbose pattern, write a literal space as "\ " (an escaped space) or "[ ]" (a character class), and a literal # as "\#".
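A short sketch of those escapes in use; the ISSUE pattern and the sample sentence are hypothetical and only show the three spellings side by side.

```python
import re

ISSUE = re.compile(r"""
    issue [ ]          # literal space written as a character class
    \# (?P<num>\d+)    # escaped hash sign, then the issue number
    \ closed           # literal space written as backslash-space
""", re.VERBOSE)

m = ISSUE.search('see issue #42 closed yesterday')
print(m['num'])
```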
re.sub with a callback
re.sub(pattern, repl, string) accepts either a replacement string (with \1, \g<name> etc.) or a callable that takes a Match and returns a replacement string. The callable form is the cleanest way to do conditional, computed, or stateful substitutions.
import re
# Uppercase every word, but skip short ones
def capitalize_long(m):
    word = m.group()
    return word.upper() if len(word) > 3 else word
print(re.sub(r'\b\w+\b', capitalize_long, 'the quick brown fox'))
# Increment every number in a string
print(re.sub(r'\d+', lambda m: str(int(m.group()) + 1), 'a=1 b=2 c=99'))
# Mask emails
print(re.sub(
    r'(?P<name>[\w.]+)@(?P<domain>[\w.]+)',
    lambda m: f"{m['name'][0]}***@{m['domain']}",
    'contact alice@example.com or bob@example.com',
))
Output:
the QUICK BROWN fox
a=2 b=3 c=100
contact a***@example.com or b***@example.com
re.subn and re.split
subn is sub that also returns a count of substitutions made — convenient for "did anything change?" checks. split is the regex-aware equivalent of str.split, useful when the delimiter is a pattern, not a literal.
import re
print(re.subn(r'foo', 'bar', 'foo foo baz foo'))
print(re.split(r'\s*,\s*', 'apple , banana, cherry, date'))
print(re.split(r'(\d+)', 'abc123def456ghi')) # keep delimiters by capturing them
Output:
('bar bar baz bar', 3)
['apple', 'banana', 'cherry', 'date']
['abc', '123', 'def', '456', 'ghi']
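Both sub/subn and split take an optional limit (count and maxsplit respectively). The sketch below shows those keyword arguments plus the "did anything change?" use of subn; the inputs are made up.

```python
import re

# count= caps the number of replacements made by sub/subn
first_only = re.sub(r'foo', 'bar', 'foo foo baz foo', count=1)

# maxsplit= caps the number of splits; the remainder stays in the last item
head, rest = re.split(r'\s*:\s*', 'Subject : re: hello', maxsplit=1)

# subn as a cheap change detector: n is 0 when nothing was rewritten
cleaned, n = re.subn(r'[ \t]+$', '', 'trailing   \nclean\n', flags=re.M)

print(first_only)
print(head, '|', rest)
print(n, repr(cleaned))
```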
re.escape
re.escape(s) returns s with every regex metacharacter prefixed by a backslash — essential whenever you need to embed user-supplied text or an unknown string into a regex.
import re
needle = 'price: $5.99'
pattern = re.escape(needle)
print(pattern)
print(re.search(pattern, 'the price: $5.99 today'))
Output:
price:\ \$5\.99
<re.Match object; span=(4, 16), match='price: $5.99'>
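A common use of re.escape is building an alternation from a list of literal terms. The following is a sketch under the assumption that the terms come from user input; the term list and the text are hypothetical.

```python
import re

terms = ['c++', 'c#', '.net']  # hypothetical user-supplied search terms

# Escape each term; sort longest first so a shorter term cannot shadow a longer one
alternation = '|'.join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
TERMS = re.compile(alternation, re.IGNORECASE)

found = TERMS.findall('I use C++ and .NET, not c#')
print(found)
```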
Differences from PCRE
Python's re is very close to PCRE but not identical. The table below captures the differences you'll actually hit; for the full PCRE syntax reference see linux/pcre.
| Feature | Python re | PCRE |
|---|---|---|
| Named group syntax | (?P<name>…) | (?<name>…) |
| Named backreference (pattern) | (?P=name) | \k<name> |
| Named backreference (replacement) | \g<name> | depends on the host tool (for example ${name} in PCRE2 substitution) |
| \K (reset match start) | not supported | supported |
| Recursive patterns (?R) / (?1) | not supported | supported |
| Atomic groups (?>…) | supported since 3.11, along with possessive *+, ++, ?+ | supported |
| Variable-length lookbehind | not supported (fixed-length only) | partially (alternatives may differ in length; recent PCRE2 allows bounded variable length) |
| Unicode property classes \p{…} | not supported (use the regex package) | supported when built with Unicode support |
| Inline flag scoping (?i:…) | supported | supported |
| Conditionals (?(name)yes\|no) | supported | supported |
| Default Unicode | yes (\w matches Unicode letters) | depends on tool config |
If you need \K, recursion, \p{…}, or variable-length lookbehind in Python (or atomic groups before 3.11), install the regex package (pip install regex); it is a near-superset of re with the same API.
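When porting a PCRE pattern that relies on \K, staying with re is often possible by using a fixed-length lookbehind or a capture group instead. The sketch below shows both rewrites for a hypothetical pattern price=\K\d+; the input text is invented.

```python
import re

text = 'price=100 cost=250'

# PCRE original (not valid in re):  price=\K\d+
# Option 1: a fixed-length lookbehind keeps the prefix out of the match
via_lookbehind = re.search(r'(?<=price=)\d+', text).group()

# Option 2: capture the part you want and read the group
via_group = re.search(r'price=(\d+)', text).group(1)

# In substitutions, re-emit the prefix from a capture instead of using \K
replaced = re.sub(r'(price=)\d+', r'\g<1>0', text)

print(via_lookbehind, via_group, replaced)
```

The \g<1> form is used in the replacement so that the following 0 is not read as part of the group number.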
Performance tips
A short list of patterns that the engine optimizes well, plus the anti-patterns that cause catastrophic backtracking.
Anchor when you can
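Patterns that start with ^, \A, or a literal prefix let the engine reject most start positions immediately instead of attempting a full match at each one. The gain is largest on input that does not match: an unanchored .*error is retried from every position before search gives up, whereas the anchored form can only succeed at position 0.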
import re, timeit
PAT_NAIVE = re.compile(r'.*error')
PAT_FAST = re.compile(r'^.*error')
text = 'a' * 10_000 + 'error'
print(timeit.timeit(lambda: PAT_NAIVE.search(text), number=1000))
print(timeit.timeit(lambda: PAT_FAST.search(text), number=1000))
Output (illustrative; absolute timings vary by machine and Python version):
0.0185
0.0142
Avoid nested quantifiers
(a+)+, (a|a)+, and (.*)+ cause exponential backtracking on non-matching input. Refactor to a single quantifier or use a possessive quantifier (3.11+).
import re
# DANGEROUS — exponential on non-match
# re.match(r'^(a+)+b$', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!')
# SAFE
print(re.match(r'^a+b$', 'aaaaaaaaaab'))
Output:
<re.Match object; span=(0, 11), match='aaaaaaaaaab'>
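To see the safe forms behave on hostile input, the sketch below feeds a long non-matching string to the single-quantifier rewrite and, where the interpreter supports it, to a possessive variant. The input size is arbitrary, and the version guard is there because possessive quantifiers need Python 3.11 or later.

```python
import re
import sys

hostile = 'a' * 5_000 + '!'   # long run of 'a' that is never followed by 'b'

# A single quantifier fails quickly on this input
single = re.match(r'^a+b$', hostile)
print(single)

# Possessive quantifiers (re in Python 3.11+) never give characters back,
# so the outer repeat cannot re-split the run of 'a' in different ways
if sys.version_info >= (3, 11):
    possessive = re.match(r'^(?:a++)+b$', hostile)
    print(possessive)
```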
Use non-capturing groups (?:…)
If you only need grouping for alternation or quantification, use (?:…) — it avoids the overhead of remembering the capture for later retrieval.
import re
# capturing — slower, populates .groups()
print(re.findall(r'(foo|bar)+', 'foofoobar'))
# non-capturing — faster, returns full matches
print(re.findall(r'(?:foo|bar)+', 'foofoobar'))
Output:
['bar']
['foofoobar']
Common pitfalls
- Forgetting the raw-string prefix r'': '\d' works by accident (\d isn't a Python escape, and recent Python versions warn about it) but '\b' does not (\b is the backspace character). Always use r'...' for regex literals.
- Using match when you wanted search: match only checks the start of the string. Use search for "does this appear anywhere" or fullmatch for "is this the entire string".
- findall returns captures when there are groups: with one capture group it returns that group's text, with several it returns tuples, and in neither case the full match. Use non-capturing groups (?:…) to keep the full string, or switch to finditer.
- . doesn't match newlines by default: pass re.S (DOTALL) when scanning across line boundaries.
- ^ and $ only match string ends by default: pass re.M (MULTILINE) for per-line anchoring.
- Greedy vs non-greedy */+: <.+> on <a><b> matches the entire <a><b>. Use <.+?> for non-greedy, or [^>]+ for a negated class (faster).
- Catastrophic backtracking: patterns like (a+)+$ lock up on non-matching input. Refactor to a single quantifier, or use possessive quantifiers (a++) and atomic groups, available in re since 3.11 and in the regex package.
- Mixing bytes and str: a bytes pattern (rb'\d') can only match a bytes input, and vice versa. Mismatching raises TypeError.
- Named groups must be unique: (?P<a>x)(?P<a>y) raises re.error. Use unique names or numeric groups.
- Variable-length lookbehind: (?<=ab|abc) is rejected because the alternatives differ in length. Either split into two patterns or switch to the regex library.
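Several of the pitfalls above are easy to reproduce in a few lines. The snippet below is a sketch with invented inputs, not an exhaustive test of each case.

```python
import re

# Raw strings: '\b' is a backspace character, r'\b' is a word boundary
no_raw = re.findall('\bcat\b', 'cat concat')
raw = re.findall(r'\bcat\b', 'cat concat')
print(no_raw, raw)

# findall returns the captures, not the whole match, once groups are present
captures = re.findall(r'(\w+)@(\w+)', 'a@b c@d')
full = re.findall(r'(?:\w+)@(?:\w+)', 'a@b c@d')
print(captures, full)

# Greedy vs non-greedy vs negated class
greedy = re.findall(r'<.+>', '<a><b>')
lazy = re.findall(r'<.+?>', '<a><b>')
negated = re.findall(r'<[^>]+>', '<a><b>')
print(greedy, lazy, negated)

# A bytes pattern needs bytes input
try:
    re.search(rb'\d', 'abc1')
    mismatch = None
except TypeError as exc:
    mismatch = exc
print(type(mismatch).__name__)
```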
Real-world recipes
Parse a multi-line config file
A .ini-style config parser using verbose regex for clarity. Handles section headers, key=value pairs, and # comments.
import re
CONFIG = re.compile(r"""
^\s*
(?:
\[ (?P<section>[^\]]+) \] # section header
| (?P<key>[\w.]+) \s* = \s* (?P<value>.*?) # key = value
| \# .* # comment
)?
\s* $
""", re.VERBOSE | re.MULTILINE)
text = """
# database config
[database]
host = localhost
port = 5432
[logging]
level = INFO
"""
current = None
config = {}
for m in CONFIG.finditer(text):
    if m['section']:
        current = m['section']
        config[current] = {}
    elif m['key'] and current is not None:
        config[current][m['key']] = m['value']
print(config)
Output:
{'database': {'host': 'localhost', 'port': '5432'}, 'logging': {'level': 'INFO'}}
Extract structured records from logs
Pull timestamp, level, and message from each line of a typical log file using named groups.
import re
LINE = re.compile(r"""
^
(?P<ts>\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2})
\s+
(?P<level>DEBUG|INFO|WARN|ERROR)
\s+
(?P<msg>.*)
$
""", re.VERBOSE)
logs = [
'2026-05-25 12:01:03 INFO starting service',
'2026-05-25 12:01:04 ERROR connection refused',
'malformed line',
]
for line in logs:
    m = LINE.match(line)
    if m:
        print(m.groupdict())
Output:
{'ts': '2026-05-25 12:01:03', 'level': 'INFO', 'msg': 'starting service'}
{'ts': '2026-05-25 12:01:04', 'level': 'ERROR', 'msg': 'connection refused'}
Strip ANSI escape sequences
Normalize colored terminal output before saving to a file.
import re
ANSI = re.compile(r'\x1b\[[0-9;]*m')
colored = '\x1b[31mERROR\x1b[0m: \x1b[1mfile\x1b[0m missing'
print(ANSI.sub('', colored))
Output:
ERROR: file missing
URL slug from title
Lowercase, drop non-alphanumerics, collapse whitespace into a single hyphen.
import re
def slugify(title):
    s = title.lower()
    s = re.sub(r'[^a-z0-9\s-]', '', s)    # strip punctuation
    s = re.sub(r'\s+', '-', s)            # spaces → hyphens
    s = re.sub(r'-+', '-', s).strip('-')  # collapse hyphens
    return s
print(slugify('Hello, World! — Python 3.12 & re.X'))
Output:
hello-world-python-312-rex
Reformat phone numbers
Normalize many input shapes to a single canonical format.
import re
PHONE = re.compile(r"""
\D* # any leading non-digits
(?P<area>\d{3}) \D*
(?P<prefix>\d{3}) \D*
(?P<line>\d{4})
\D*$
""", re.VERBOSE)
for raw in ['(555) 123-4567', '555.123.4567', '5551234567', '+1 555 123 4567']:
    # search, not match: lets the pattern start after a country code such as '+1'
    m = PHONE.search(raw)
    if m:
        print(f"({m['area']}) {m['prefix']}-{m['line']}")
Output:
(555) 123-4567
(555) 123-4567
(555) 123-4567
(555) 123-4567
Find unmatched braces
Repeatedly strip innermost balanced {...} blocks using a negated character class, then count the { characters left in the residue.
import re
# Remove innermost `{...}` blocks until none remain; any `{` that survives has no partner
text = '{ok} stray { and {nested {inner}} {ok}'
residue, n = text, 1
while n:
    residue, n = re.subn(r'\{[^{}]*\}', '', residue)
print('residue:', repr(residue))
print('stray opens:', residue.count('{'))
Output:
residue: ' stray { and  '
stray opens: 1
Replace with a counter
Number every occurrence of a pattern, using a closure as the sub callback.
import re
def numberer():
    n = 0
    def repl(m):
        nonlocal n
        n += 1
        return f'[{n}]{m.group()}'
    return repl
print(re.sub(r'\b\w+\b', numberer(), 'one two three four'))
Output:
[1]one [2]two [3]three [4]four
Tokenize source code
A toy tokenizer using re.Scanner (a hidden gem in re).
import re
scanner = re.Scanner([
    (r'\d+', lambda s, t: ('NUM', int(t))),
    (r'[+\-*/=]', lambda s, t: ('OP', t)),        # '=' included so the sample input scans fully
    (r'[a-zA-Z_]\w*', lambda s, t: ('IDENT', t)),
    (r'\s+', None),                               # None: skip whitespace
])
tokens, remainder = scanner.scan('x = 10 + 20 * y')
print(tokens)
Output:
[('IDENT', 'x'), ('OP', '='), ('NUM', 10), ('OP', '+'), ('NUM', 20), ('OP', '*'), ('IDENT', 'y')]
re.Scanner is undocumented but stable since Python 2.4. For production tokenizers prefer tokenize (for Python source) or a dedicated lexer like ply or lark.
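If relying on an undocumented class is a concern, a similar toy tokenizer can be sketched with documented pieces only: one alternation of named groups, finditer, and Match.lastgroup. The token names and the tokenize helper below are choices made for this example.

```python
import re

TOKEN = re.compile(r"""
    (?P<NUM>\d+)
  | (?P<IDENT>[A-Za-z_]\w*)
  | (?P<OP>[+\-*/=])
  | (?P<SKIP>\s+)
  | (?P<BAD>.)          # any other single character
""", re.VERBOSE)

def tokenize(src):
    for m in TOKEN.finditer(src):
        kind = m.lastgroup          # name of the alternative that matched
        if kind == 'SKIP':
            continue
        if kind == 'BAD':
            raise ValueError(f'unexpected {m.group()!r} at offset {m.start()}')
        yield kind, int(m.group()) if kind == 'NUM' else m.group()

tokens = list(tokenize('x = 10 + 20 * y'))
print(tokens)
```

Unlike the Scanner version, an unrecognized character raises an error here instead of silently ending the scan.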
See also
- linux/pcre: the PCRE dialect used by grep -P, ripgrep, nginx, PHP
- linux/grep and linux/sed: shell tools that consume similar patterns
- javascript/regex: JavaScript regex differences, when you cross language boundaries
- regex: third-party drop-in with variable-length lookbehind, atomic groups, \K