cheat sheet

re

Python's built-in regex module — compile vs not, match/search/findall/finditer/sub, named groups, lookarounds, verbose mode (re.X), and how it differs from PCRE.

updated 05-25-2026

re — Python Regular Expressions

What it is

re is Python's standard-library regular-expression engine. It implements a Perl-style dialect close to PCRE (named groups, lookaround, non-greedy quantifiers, Unicode-aware \w, \d, and \s), with a few Python-specific divergences: no \K, no recursive patterns, no \p{…} Unicode property classes, and named groups written as (?P<name>…) instead of PCRE's (?<name>…). It is the engine Django uses to route URLs, that pandas uses for str.contains, and that every Python script reaches for whenever a plain in check on a string is no longer enough.

The third-party regex module is API-compatible with re and adds features such as variable-length lookbehind, \p{…} property classes, and recursive patterns; consider it if you need them. For comparison with the dialect used by grep -P, ripgrep, and nginx, see linux/pcre.

Install

re is part of the CPython standard library — available everywhere Python is, with no install step.

bash

python -c "import re; print(re.__doc__.splitlines()[0])"

Output:

text

Support for regular expressions (RE).

API overview

The module exposes a small set of top-level functions that mirror methods on compiled pattern objects. The two are functionally identical — re.search(pat, s) is re.compile(pat).search(s) with internal caching.

Function	Method	Returns	Notes
`re.match(pat, s)`	`p.match(s)`	`Match` or `None`	anchors at start of string
`re.fullmatch(pat, s)`	`p.fullmatch(s)`	`Match` or `None`	must match the entire string
`re.search(pat, s)`	`p.search(s)`	`Match` or `None`	first match anywhere
`re.findall(pat, s)`	`p.findall(s)`	`list[str]` or `list[tuple]`	all non-overlapping matches
`re.finditer(pat, s)`	`p.finditer(s)`	iterator of `Match`	lazy — preferred for many matches
`re.sub(pat, repl, s)`	`p.sub(repl, s)`	`str`	substitute
`re.subn(pat, repl, s)`	`p.subn(repl, s)`	`(str, count)`	sub + count
`re.split(pat, s)`	`p.split(s)`	`list[str]`	split by pattern
`re.compile(pat, flags)`	—	`Pattern`	compile once, reuse
`re.escape(s)`	—	`str`	escape regex metachars in `s`
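
As a small sketch of the equivalence described above (the sample string and the DIGITS name are made up for illustration), the module-level function and the compiled-pattern method find the same match:

```python
import re

text = 'order 66 shipped'            # hypothetical sample input
DIGITS = re.compile(r'\d+')          # hypothetical name for the compiled pattern

via_function = re.search(r'\d+', text)   # module-level function
via_method = DIGITS.search(text)         # method on the compiled Pattern

print(via_function.group(), via_method.group())    # same matched text
print(via_function.span() == via_method.span())    # same position
```

Both calls go through the same engine; the only difference is who holds on to the compiled pattern.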

Compile vs not

re.compile(pattern, flags=0) parses the pattern once and returns a reusable Pattern object. Use it whenever the same regex is applied repeatedly: inside a loop, in a hot function, or as a module-level constant. For one-off uses the module-level functions are fine because re keeps an internal cache of recently compiled patterns (up to 512 entries in current CPython; the size is an implementation detail).

python

import re

PHONE = re.compile(r'\d{3}-\d{3}-\d{4}')

for line in ['call 555-123-4567', 'no phone', 'two: 555-000-1111 and 555-222-3333']:
    for match in PHONE.finditer(line):
        print(match.group())

Output:

text

555-123-4567
555-000-1111
555-222-3333

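For a function called millions of times, compiling explicitly is measurably faster than relying on the cache, because every module-level call still pays for a cache lookup keyed on the pattern and flags. The exact speedup depends on the pattern and the Python version, so measure before optimizing. For one-off matches, the difference is negligible.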

match vs search vs fullmatch vs findall vs finditer

These five differ in where the regex engine starts looking and what it returns. Mixing them up is the most common re bug.

Function	Where it anchors	Returns
`match`	start of string only	first match (or `None`)
`fullmatch`	start AND end of string	first match (or `None`) — must cover all
`search`	anywhere in string	first match (or `None`)
`findall`	anywhere, all non-overlapping	list of strings (or tuples if there are groups)
`finditer`	anywhere, all non-overlapping	iterator of `Match` objects

python

import re
s = 'aaa 123 bbb 456'

print(re.match(r'\d+', s))                  # None — string starts with 'a'
print(re.search(r'\d+', s))                 # finds '123'
print(re.fullmatch(r'\d+', s))              # None: the whole string is not digits
print(re.findall(r'\d+', s))                # all numeric runs
print(list(re.finditer(r'\d+', s)))         # same, as Match objects

Output:

text

None
<re.Match object; span=(4, 7), match='123'>
None
['123', '456']
[<re.Match object; span=(4, 7), match='123'>, <re.Match object; span=(12, 15), match='456'>]

match doesn't mean "match the whole string" — it means "match starting from position 0". For end-anchored full matches, use fullmatch or wrap the pattern with \A...\Z.
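
A minimal sketch of that difference when validating input; the candidate strings are hypothetical. It also shows that $ accepts a trailing newline, which fullmatch does not:

```python
import re

candidates = ['12345', '12345abc', '12345\n']   # hypothetical inputs

for s in candidates:
    prefix = bool(re.match(r'\d+', s))        # digits at the start are enough
    dollar = bool(re.match(r'\d+$', s))       # $ also matches just before a trailing newline
    whole = bool(re.fullmatch(r'\d+', s))     # every character must be a digit
    print(repr(s), prefix, dollar, whole)
```

Only fullmatch rejects both the trailing letters and the trailing newline, which is usually what a validation check wants.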

The Match object

A successful match, search, or fullmatch returns a Match object — never a string. To get the matched text use .group(), to get capture groups use .group(n) or .groups(), and to get the position use .span().

python

import re
m = re.search(r'(\w+)=(\d+)', 'value=42 elsewhere')

print(m.group())              # full match
print(m.group(0))             # same as .group()
print(m.group(1))             # first capture
print(m.group(2))             # second capture
print(m.groups())             # all captures as tuple
print(m.span())               # (start, end)
print(m.start(), m.end())

Output:

text

value=42
value=42
value
42
('value', '42')
(0, 8)
0 8
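
Because a failed search returns None rather than a Match, calling .group() on the result without checking raises AttributeError. A small sketch of the usual guard, using a hypothetical extract_port helper:

```python
import re

def extract_port(line):
    """Return the port number found in `line` as an int, or None if absent."""
    m = re.search(r'port=(\d+)', line)
    if m is None:            # no match: there is no Match object to query
        return None
    return int(m.group(1))

print(extract_port('host=db port=5432'))
print(extract_port('host=db'))
```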

Named groups

(?P<name>...) captures into a named slot accessible via .group('name') or .groupdict(). Named groups make extraction code and references in sub replacements self-documenting; prefer them over numeric groups whenever there are more than two captures.

python

import re

DATE = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

m = DATE.search('shipped on 2026-05-25 to customer 42')
print(m.group('year'), m.group('month'), m.group('day'))
print(m.groupdict())

Output:

text

2026 05 25
{'year': '2026', 'month': '05', 'day': '25'}

In a sub replacement, refer to named groups with \g<name> (or \1/\2 for numeric).

python

import re
print(re.sub(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})',
             r'\g<day>/\g<month>/\g<year>',
             '2026-05-25'))

Output:

text

25/05/2026

Backreferences

\1, \2, … inside the pattern match the same text that the corresponding capture group did. Use them to find repeats, palindromes, or matching delimiters.

python

import re

# Find adjacent duplicated words
print(re.findall(r'\b(\w+) \1\b', 'the the quick brown fox fox jumps'))

# Match content wrapped in matching quote chars (single OR double)
m = re.search(r"(['\"])(.*?)\1", 'set name="alice" and age=30')
print(m.group(2))

Output:

text

['the', 'fox']
alice
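
Backreferences can also go by name: inside a pattern, (?P=name) refers back to the group called name. The sketch below redoes the matching-quote idea with named groups; the QUOTED name and the sample text are illustrative only:

```python
import re

# (?P=quote) must repeat whatever character the 'quote' group captured
QUOTED = re.compile(r"""(?P<quote>['"])(?P<body>.*?)(?P=quote)""")

text = '''say "hi" or 'bye' but not "mixed' quotes'''   # hypothetical sample
for m in QUOTED.finditer(text):
    print(m['quote'], m['body'])
```

The last fragment opens with a double quote and closes with a single quote, so the backreference rejects it.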

Lookaround — `(?=…)`, `(?!…)`, `(?<=…)`, `(?<!…)`

Lookaround assertions test whether something does or does not appear adjacent to the current position without consuming characters. Python's re supports all four variants. The lookbehind must be fixed-length (use the regex library if you need variable-length).

python

import re

# Match price digits only when followed by USD (lookahead)
print(re.findall(r'\d+(?=\s*USD)', 'cost: 100 USD or 50 EUR or 30 USD'))

# Match numbers NOT preceded by $ (negative lookbehind); \d in the class stops a match from starting mid-number
print(re.findall(r'(?<![$\d])\d+', 'price $100 quantity 50 items'))

# Identifiers not followed by a paren — "variables, not function calls"
print(re.findall(r'\b\w+\b(?!\s*\()', 'foo() bar baz() qux'))

Output:

text

['100', '30']
['50']
['bar', 'qux']
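
The fourth variant, positive lookbehind, works the same way. A brief sketch on a made-up sample string:

```python
import re

text = 'price $100 quantity 50 items, refund $25'   # hypothetical sample

# (?<=\$) requires a '$' immediately before the digits but does not include it in the match
amounts = re.findall(r'(?<=\$)\d+', text)
print(amounts)
```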

Flags

Flags change how the regex engine interprets the pattern. They can be passed as the flags= keyword argument or embedded inline with (?aiLmsux) at the start of the pattern (or (?i:...) to scope to a group).

Flag	Short	Effect
`re.IGNORECASE`	`re.I`	case-insensitive matching
`re.MULTILINE`	`re.M`	`^` and `$` match at every line boundary
`re.DOTALL`	`re.S`	`.` matches newlines too
`re.VERBOSE`	`re.X`	ignore whitespace and `#` comments inside the pattern
`re.ASCII`	`re.A`	`\w`, `\d`, `\s` match ASCII only (not Unicode)
`re.UNICODE`	`re.U`	default in Python 3 — explicit is rarely needed
`re.DEBUG`	—	print compiled-pattern debug info

Combine flags with |:

python

import re

pat = re.compile(r'^error: (.+)$', re.I | re.M)
log = """
INFO: started
ERROR: out of memory
WARN: retrying
Error: connection lost
"""
print(pat.findall(log))

Output:

text

['out of memory', 'connection lost']
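
Inline flags can be sketched the same way. The sample text is hypothetical; the first pattern applies (?i) to the whole regex, while the others scope case-insensitivity to a single group:

```python
import re

text = 'Error: disk full\nerror: retry\nERROR CODE 7'   # hypothetical sample

# Global inline flag: (?i) at the start applies to the whole pattern
everywhere = re.findall(r'(?i)error', text)

# Scoped inline flag: only the group is case-insensitive
scoped = re.findall(r'(?i:error): \w+', text)
upper_code = re.findall(r'(?i:error) CODE', text)
lower_code = re.findall(r'(?i:error) code', text)   # 'code' stays case-sensitive

print(everywhere)
print(scoped)
print(upper_code, lower_code)
```

The scoped form is useful when only part of a pattern, such as a keyword, should ignore case.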

Verbose mode (`re.X`)

re.VERBOSE (a.k.a. re.X) tells the engine to ignore unescaped whitespace and #-comments inside the pattern, allowing you to format regexes like prose. This is the single most useful feature for keeping non-trivial regexes maintainable.

python

import re

# A semver-like version pattern, commented and formatted
VERSION = re.compile(r"""
    ^                       # start of string
    v?                      # optional leading 'v'
    (?P<major>\d+) \.
    (?P<minor>\d+) \.
    (?P<patch>\d+)
    (?: - (?P<pre>[\w.]+) )?   # optional pre-release tag
    (?: \+ (?P<build>[\w.]+) )?  # optional build metadata
    $
""", re.VERBOSE)

for s in ['1.2.3', 'v1.2.3-beta.4+exp.sha.5114f85', 'not a version']:
    m = VERSION.match(s)
    print(s, '→', m.groupdict() if m else None)

Output:

text

1.2.3 → {'major': '1', 'minor': '2', 'patch': '3', 'pre': None, 'build': None}
v1.2.3-beta.4+exp.sha.5114f85 → {'major': '1', 'minor': '2', 'patch': '3', 'pre': 'beta.4', 'build': 'exp.sha.5114f85'}
not a version → None

Inside a verbose pattern, write a literal whitespace as \ (escaped space) or [ ] (a class), and a literal # as \#.
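
A short sketch of both escapes, using a hypothetical ISSUE pattern that has to match a literal space and a literal #:

```python
import re

# Hypothetical pattern for strings such as "issue #42"
ISSUE = re.compile(r"""
    issue [ ]          # literal space written as a one-character class
    \# (?P<num>\d+)    # escaped '#', then the issue number
""", re.VERBOSE)

m = ISSUE.search('see issue #42 for details')
print(m['num'])

# The escaped-space spelling works as well
print(bool(re.compile(r'a\ b', re.VERBOSE).fullmatch('a b')))
```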

re.sub with a callback

re.sub(pattern, repl, string) accepts either a replacement string (with \1, \g<name> etc.) or a callable that takes a Match and returns a replacement string. The callable form is the cleanest way to do conditional, computed, or stateful substitutions.

python

import re

# Uppercase every word longer than three letters, leave short ones alone
def capitalize_long(m):
    word = m.group()
    return word.upper() if len(word) > 3 else word

print(re.sub(r'\b\w+\b', capitalize_long, 'the quick brown fox'))

# Increment every number in a string
print(re.sub(r'\d+', lambda m: str(int(m.group()) + 1), 'a=1 b=2 c=99'))

# Mask emails
print(re.sub(
    r'(?P<name>[\w.]+)@(?P<domain>[\w.]+)',
    lambda m: f"{m['name'][0]}***@{m['domain']}",
    'contact alice@example.com or bob@example.com',
))

Output:

text

the QUICK BROWN fox
a=2 b=3 c=100
contact a***@example.com or b***@example.com

re.subn and re.split

subn is sub that also returns a count of substitutions made — convenient for "did anything change?" checks. split is the regex-aware equivalent of str.split, useful when the delimiter is a pattern, not a literal.

python

import re

print(re.subn(r'foo', 'bar', 'foo foo baz foo'))
print(re.split(r'\s*,\s*', 'apple , banana, cherry,  date'))
print(re.split(r'(\d+)', 'abc123def456ghi'))      # keep delimiters by capturing them

Output:

text

('bar bar baz bar', 3)
['apple', 'banana', 'cherry', 'date']
['abc', '123', 'def', '456', 'ghi']
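
The "did anything change?" use of subn might look like the sketch below, which counts how many lines had trailing whitespace removed. The input lines are made up for illustration:

```python
import re

lines = ['clean', 'trailing   ', 'tabs\t\t']   # hypothetical input lines

changed = 0
cleaned = []
for line in lines:
    stripped, n = re.subn(r'[ \t]+$', '', line)
    if n:                       # n > 0 means a substitution was made
        changed += 1
    cleaned.append(stripped)

print(cleaned)
print(changed)
```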

re.escape

re.escape(s) returns s with every regex metacharacter prefixed by a backslash — essential whenever you need to embed user-supplied text or an unknown string into a regex.

python

import re

needle = 'price: $5.99'
pattern = re.escape(needle)
print(pattern)
print(re.search(pattern, 'the price: $5.99 today'))

Output:

text

price:\ \$5\.99
<re.Match object; span=(4, 16), match='price: $5.99'>
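
A common use is building one alternation out of several literal search terms. The sketch below assumes a hypothetical list of user-supplied terms that contain metacharacters:

```python
import re

# Hypothetical user-supplied search terms, some containing metacharacters
terms = ['c++', 'a.b', 'node.js']

# Escape each term so it matches literally, then join into one alternation
pattern = re.compile('|'.join(re.escape(t) for t in terms))

hits = pattern.findall('c++ and node.js, but not axb or nodexjs')
print(hits)
```

Without re.escape, 'c++' would not compile and 'a.b' would also match 'axb'.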

Differences from PCRE

Python's re is very close to PCRE but not identical. The table below captures the differences you'll actually hit; for the full PCRE syntax reference see linux/pcre.

Feature	Python `re`	PCRE
Named group syntax	`(?P<name>…)`	`(?<name>…)`
Named backreference (pattern)	`(?P=name)`	`\k<name>`
Named backreference (replacement)	`\g<name>`	`$name` or `\k<name>`
`\K` (reset match start)	not supported	supported
Recursive patterns `(?R)` / `(?1)`	not supported	supported
Atomic groups `(?>…)`	supported since 3.11 (together with possessive `*+`, `++`, `?+`)	supported
Variable-length lookbehind	not supported (fixed-length only)	limited (each alternative fixed-length; newer PCRE2 releases allow bounded variable length)
Unicode property classes `\p{…}`	not supported (available in the `regex` package)	supported
Inline flag scoping `(?i:…)`	supported	supported
Conditionals `(?(name)yes|no)`	supported	supported
Default Unicode	yes (`\w` matches Unicode letters)	depends on tool config

If you need \K, recursive patterns, \p{…} property classes, or variable-length lookbehind in Python (or atomic groups before 3.11), install the regex package (pip install regex). It is a near-superset of re with the same API.
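
Of the rows in the table, conditionals are probably the least familiar. The sketch below, modeled on the bracketed-address idea and using hypothetical group names, requires a closing > only when an opening < was matched:

```python
import re

# If the 'open' group matched '<', require '>'; otherwise require end of string
ADDR = re.compile(r'(?P<open><)?(?P<addr>\w+@\w+(?:\.\w+)+)(?(open)>|$)')

results = {}
for s in ['<user@host.com>', 'user@host.com', '<user@host.com', 'user@host.com>']:
    m = ADDR.match(s)
    results[s] = m['addr'] if m else None
    print(s, '->', results[s])
```

The two inputs with an unbalanced bracket are rejected, while both balanced forms yield the same address.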

Performance tips

A short list of patterns that the engine optimizes well, plus the anti-patterns that cause catastrophic backtracking.

Anchor when you can

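Patterns that start with ^, \A, or a literal prefix let the engine reject positions that cannot match without attempting the rest of the pattern there. The benefit is largest on input that does not match; the timings shown here are illustrative and vary by machine and Python version.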

python

import re, timeit

PAT_NAIVE  = re.compile(r'.*error')
PAT_FAST   = re.compile(r'^.*error')

text = 'a' * 10_000 + 'error'
print(timeit.timeit(lambda: PAT_NAIVE.search(text),  number=1000))
print(timeit.timeit(lambda: PAT_FAST.search(text),   number=1000))

Output:

text

0.0185
0.0142

Avoid nested quantifiers

(a+)+, (a|a)+, and (.*)+ cause exponential backtracking on non-matching input. Refactor to a single quantifier or use a possessive quantifier (3.11+).

python

import re

# DANGEROUS — exponential on non-match
# re.match(r'^(a+)+b$', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!')

# SAFE
print(re.match(r'^a+b$', 'aaaaaaaaaab'))

Output:

text

<re.Match object; span=(0, 11), match='aaaaaaaaaab'>

Use non-capturing groups `(?:…)`

If you only need grouping for alternation or quantification, use (?:…) — it avoids the overhead of remembering the capture for later retrieval.

python

import re
# capturing — slower, populates .groups()
print(re.findall(r'(foo|bar)+', 'foofoobar'))
# non-capturing — faster, returns full matches
print(re.findall(r'(?:foo|bar)+', 'foofoobar'))

Output:

text

['bar']
['foofoobar']

Common pitfalls

Forgetting the raw-string prefix r'': '\d' works by accident (\d isn't a Python escape, though recent Python versions warn about it), but '\b' does not (\b is the backspace character). Always use r'...' for regex literals.
Using match when you wanted search: match only checks the start of the string. Use search for "does this appear anywhere" or fullmatch for "is this the entire string".
findall returns groups, not matches: if the pattern has capture groups, findall returns only the captures (one string per match for a single group, a tuple for several), not the full match. Use non-capturing groups (?:…) to keep the full string, or switch to finditer.
. doesn't match newlines by default: pass re.S (DOTALL) when scanning across line boundaries.
^ and $ only match string ends by default: pass re.M (MULTILINE) for per-line anchoring.
Greedy vs non-greedy */+: <.+> on <a><b> matches the whole of <a><b>. Use <.+?> for a non-greedy match, or <[^>]+> with a negated class (faster).
Catastrophic backtracking: patterns like (a+)+$ lock up on non-matching input. Refactor to a single quantifier, or use possessive quantifiers (a++) and atomic groups, available in re from Python 3.11 and in the regex package.
Mixing bytes and str: a bytes pattern (rb'\d') can only match a bytes input, and vice versa. Mismatching raises TypeError.
Named groups must be unique: (?P<a>x)(?P<a>y) raises a re.error. Use unique names or numeric groups.
Variable-length lookbehind: (?<=ab|abc) is rejected because the alternatives differ in length. Either split into two patterns or switch to the regex library.
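
A few of these pitfalls can be reproduced in a handful of lines. This is an illustrative sketch with made-up inputs, not an exhaustive test of every item in the list:

```python
import re

# Raw strings: '\b' is a backspace character, r'\b' is a word boundary
no_raw = re.findall('\bcat\b', 'cat concat')
raw = re.findall(r'\bcat\b', 'cat concat')
print(no_raw, raw)

# findall with a capture group returns the group, not the full match
print(re.findall(r'(\d+)px', '10px 20px'))
print(re.findall(r'(?:\d+)px', '10px 20px'))

# A bytes pattern applied to a str raises TypeError
try:
    re.search(rb'\d', 'abc1')
except TypeError as exc:
    print('TypeError:', exc)

# Lookbehind alternatives of different lengths: use one lookbehind per length
split_lookbehind = re.findall(r'(?:(?<=ab)|(?<=abc))\d', 'ab1 abc2 x3')
print(split_lookbehind)
```

The last pattern is one way to stay within re when the alternatives differ in length; each lookbehind is fixed-width on its own.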

Real-world recipes

Parse a multi-line config file

A .ini-style config parser using verbose regex for clarity. Handles section headers, key=value pairs, and # comments.

python

import re

CONFIG = re.compile(r"""
    ^\s*
    (?:
        \[ (?P<section>[^\]]+) \]            # section header
      | (?P<key>[\w.]+) \s* = \s* (?P<value>.*?)  # key = value
      | \# .*                                # comment
    )?
    \s* $
""", re.VERBOSE | re.MULTILINE)

text = """
# database config
[database]
host = localhost
port = 5432

[logging]
level = INFO
"""

current = None
config = {}
for m in CONFIG.finditer(text):
    if m['section']:
        current = m['section']
        config[current] = {}
    elif m['key'] and current is not None:
        config[current][m['key']] = m['value']

print(config)

Output:

text

{'database': {'host': 'localhost', 'port': '5432'}, 'logging': {'level': 'INFO'}}

Extract structured records from logs

Pull timestamp, level, and message from each line of a typical log file using named groups.

python

import re

LINE = re.compile(r"""
    ^
    (?P<ts>\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2})
    \s+
    (?P<level>DEBUG|INFO|WARN|ERROR)
    \s+
    (?P<msg>.*)
    $
""", re.VERBOSE)

logs = [
    '2026-05-25 12:01:03 INFO  starting service',
    '2026-05-25 12:01:04 ERROR connection refused',
    'malformed line',
]
for line in logs:
    m = LINE.match(line)
    if m:
        print(m.groupdict())

Output:

text

{'ts': '2026-05-25 12:01:03', 'level': 'INFO', 'msg': 'starting service'}
{'ts': '2026-05-25 12:01:04', 'level': 'ERROR', 'msg': 'connection refused'}

Strip ANSI escape sequences

Normalize colored terminal output before saving to a file.

python

import re

ANSI = re.compile(r'\x1b\[[0-9;]*m')
colored = '\x1b[31mERROR\x1b[0m: \x1b[1mfile\x1b[0m missing'
print(ANSI.sub('', colored))

Output:

text

ERROR: file missing

URL slug from title

Lowercase, drop non-alphanumerics, collapse whitespace into a single hyphen.

python

import re

def slugify(title):
    s = title.lower()
    s = re.sub(r'[^a-z0-9\s-]', '', s)         # strip punctuation
    s = re.sub(r'\s+', '-', s)                  # spaces → hyphens
    s = re.sub(r'-+', '-', s).strip('-')        # collapse hyphens
    return s

print(slugify('Hello, World! — Python 3.12 & re.X'))

Output:

text

hello-world-python-312-rex

Reformat phone numbers

Normalize many input shapes to a single canonical format.

python

import re

PHONE = re.compile(r"""
    \D*                              # any leading non-digits
    (?: 1 \D+ )?                     # optional country code 1, as in '+1 555 ...'
    (?P<area>\d{3}) \D*
    (?P<prefix>\d{3}) \D*
    (?P<line>\d{4})
    \D*$
""", re.VERBOSE)

for raw in ['(555) 123-4567', '555.123.4567', '5551234567', '+1 555 123 4567']:
    m = PHONE.match(raw)
    if m:
        print(f"({m['area']}) {m['prefix']}-{m['line']}")

Output:

text

(555) 123-4567
(555) 123-4567
(555) 123-4567
(555) 123-4567

Find unmatched braces

Repeatedly delete innermost {...} pairs using a negated character class; any { left in the residue has no matching }.

python

import re

# Delete innermost `{...}` pairs until none remain; leftover braces are unmatched
text = '{ok} stray { and {nested {inner}} {ok}'
residue = text
while True:
    residue, n = re.subn(r'\{[^{}]*\}', '', residue)
    if n == 0:
        break
print('residue:', repr(residue))
print('stray opens:', residue.count('{'))

Output:

text

residue: ' stray { and  '
stray opens: 1

Replace with a counter

Number every occurrence of a pattern, using a closure as the sub callback.

python

import re

def numberer():
    n = 0
    def repl(m):
        nonlocal n
        n += 1
        return f'[{n}]{m.group()}'
    return repl

print(re.sub(r'\b\w+\b', numberer(), 'one two three four'))

Output:

text

[1]one [2]two [3]three [4]four

Tokenize source code

A toy tokenizer using re.Scanner (a hidden gem in re).

python

import re

scanner = re.Scanner([
    (r'\d+',           lambda s, t: ('NUM', int(t))),
    (r'[+\-*/=]',      lambda s, t: ('OP', t)),
    (r'[a-zA-Z_]\w*',  lambda s, t: ('IDENT', t)),
    (r'\s+',           None),
])

tokens, remainder = scanner.scan('x = 10 + 20 * y')
print(tokens)

Output:

text

[('IDENT', 'x'), ('OP', '='), ('NUM', 10), ('OP', '+'), ('NUM', 20), ('OP', '*'), ('IDENT', 'y')]

re.Scanner is undocumented but stable since Python 2.4. For production tokenizers prefer tokenize (for Python source) or a dedicated lexer like ply or lark.
