Python's built-in regex module — compile vs not, match/search/findall/finditer/sub, named groups, lookarounds, verbose mode (re.X), and how it differs from PCRE.

re — Python Regular Expressions

What it is

re is Python's standard-library regular-expression engine. It implements a Perl-style dialect close to PCRE (named groups, lookaround, non-greedy quantifiers) with a small number of Python-specific divergences: no \K, no recursive patterns, no \p{…} Unicode property classes, and named groups written (?P<name>…) rather than PCRE's (?<name>…). It is the engine Django uses to route URLs, that pandas uses for str.contains, and that every Python script reaches for whenever a plain substring test with in is no longer enough.

The third-party regex module is API-compatible with re and adds variable-length lookbehind, \K, recursive patterns, and \p{…} Unicode properties; consider it if you need those features. For comparison with the dialect used by grep -P, ripgrep, and nginx, see linux/pcre.

Install

re is part of the CPython standard library — available everywhere Python is, with no install step.

bash
python -c "import re; print(re.__doc__.splitlines()[0])"

Output:

text
Support for regular expressions (RE).

API overview

The module exposes a small set of top-level functions that mirror methods on compiled pattern objects. The two are functionally identical — re.search(pat, s) is re.compile(pat).search(s) with internal caching.

| Function | Method | Returns | Notes |
| --- | --- | --- | --- |
| re.match(pat, s) | p.match(s) | Match or None | anchors at start of string |
| re.fullmatch(pat, s) | p.fullmatch(s) | Match or None | must match the entire string |
| re.search(pat, s) | p.search(s) | Match or None | first match anywhere |
| re.findall(pat, s) | p.findall(s) | list[str] or list[tuple] | all non-overlapping matches |
| re.finditer(pat, s) | p.finditer(s) | iterator of Match | lazy; preferred for many matches |
| re.sub(pat, repl, s) | p.sub(repl, s) | str | substitute |
| re.subn(pat, repl, s) | p.subn(repl, s) | (str, count) | sub + count |
| re.split(pat, s) | p.split(s) | list[str] | split by pattern |
| re.compile(pat, flags) | (none) | Pattern | compile once, reuse |
| re.escape(s) | (none) | str | escape regex metachars in s |
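
As a quick check of the equivalence described above, the sketch below runs the same search through the module-level function and through a compiled pattern. The sample string and pattern are made up for illustration. It also shows one thing only the compiled method offers: optional pos and endpos arguments that restrict the search window.

```python
import re

text = 'id=17; id=42'          # hypothetical sample input
pat = r'id=(\d+)'

# Module-level function and compiled-pattern method find the same match.
m1 = re.search(pat, text)
m2 = re.compile(pat).search(text)
print(m1.group(1), m2.group(1))      # 17 17
print(m1.span() == m2.span())        # True

# Only the compiled method takes pos/endpos to limit where it searches.
m3 = re.compile(pat).search(text, 5)  # start looking at index 5
print(m3.group(1))                    # 42
```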

Compile vs not

re.compile(pattern, flags=0) parses the pattern once and returns a reusable Pattern object. Use it whenever the same regex is applied repeatedly: inside a loop, in a hot function, or as a module-level constant. For one-off uses the module-level functions are fine because re keeps an internal cache of recently compiled patterns (up to 512 entries in current CPython).

python
import re

PHONE = re.compile(r'\d{3}-\d{3}-\d{4}')

for line in ['call 555-123-4567', 'no phone', 'two: 555-000-1111 and 555-222-3333']:
    for match in PHONE.finditer(line):
        print(match.group())

Output:

text
555-123-4567
555-000-1111
555-222-3333

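For a function called millions of times, compiling explicitly is measurably faster than relying on the cache, because each module-level call still pays for a cache lookup keyed on the pattern and flags before matching can start. The gap is most visible when the pattern and input are short; for one-off matches the difference is negligible.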

match vs search vs fullmatch vs findall vs finditer

These five differ in where the regex engine starts looking and what it returns. Mixing them up is the most common re bug.

| Function | Where it anchors | Returns |
| --- | --- | --- |
| match | start of string only | first match (or None) |
| fullmatch | start AND end of string | first match (or None), must cover the whole string |
| search | anywhere in string | first match (or None) |
| findall | anywhere, all non-overlapping | list of strings (or tuples if there are groups) |
| finditer | anywhere, all non-overlapping | iterator of Match objects |

python
import re
s = 'aaa 123 bbb 456'

print(re.match(r'\d+', s))                  # None — string starts with 'a'
print(re.search(r'\d+', s))                 # finds '123'
print(re.fullmatch(r'\d+', s))              # None: the whole string is not digits
print(re.findall(r'\d+', s))                # all numeric runs
print(list(re.finditer(r'\d+', s)))         # same, as Match objects

Output:

text
None
<re.Match object; span=(4, 7), match='123'>
None
['123', '456']
[<re.Match object; span=(4, 7), match='123'>, <re.Match object; span=(12, 15), match='456'>]

match doesn't mean "match the whole string" — it means "match starting from position 0". For end-anchored full matches, use fullmatch or wrap the pattern with \A...\Z.
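
One subtlety worth checking when choosing between these: $ also matches just before a trailing newline, so match with a $ anchor is not quite the same as fullmatch. A small sketch, using a made-up input string:

```python
import re

s = '123\n'   # hypothetical input with a trailing newline

# $ matches at the very end OR just before a final newline.
print(re.match(r'\d+$', s))        # a Match covering '123'
# fullmatch and \Z both insist on the true end of the string.
print(re.fullmatch(r'\d+', s))     # None
print(re.match(r'\d+\Z', s))       # None
```

If trailing newlines should be rejected, fullmatch or \Z is the safer choice.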

The Match object

A successful match, search, or fullmatch returns a Match object — never a string. To get the matched text use .group(), to get capture groups use .group(n) or .groups(), and to get the position use .span().

python
import re
m = re.search(r'(\w+)=(\d+)', 'value=42 elsewhere')

print(m.group())              # full match
print(m.group(0))             # same as .group()
print(m.group(1))             # first capture
print(m.group(2))             # second capture
print(m.groups())             # all captures as tuple
print(m.span())               # (start, end)
print(m.start(), m.end())

Output:

text
value=42
value=42
value
42
('value', '42')
(0, 8)
0 8
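
Because a failed match returns None rather than a Match, calling .group() on the result without checking raises AttributeError. One common guard is sketched below; the helper name first_number is hypothetical, and the assignment expression requires Python 3.8 or later.

```python
import re

def first_number(text):
    """Return the first run of digits in text as an int, or None."""
    # search returns None on failure, so test before calling .group()
    if (m := re.search(r'\d+', text)) is not None:
        return int(m.group())
    return None

print(first_number('order 66 shipped'))   # 66
print(first_number('no digits here'))     # None
```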

Named groups

(?P<name>...) captures into a named slot accessible via .group('name') or .groupdict(). Named groups make patterns, and back-references in sub, self-documenting; prefer them over numeric groups whenever there are more than two captures.

python
import re

DATE = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

m = DATE.search('shipped on 2026-05-25 to customer 42')
print(m.group('year'), m.group('month'), m.group('day'))
print(m.groupdict())

Output:

text
2026 05 25
{'year': '2026', 'month': '05', 'day': '25'}

In a sub replacement, refer to named groups with \g<name> (or \1/\2 for numeric).

python
import re
print(re.sub(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})',
             r'\g<day>/\g<month>/\g<year>',
             '2026-05-25'))

Output:

text
25/05/2026

Backreferences

\1, \2, … inside the pattern match the same text that the corresponding capture group did. Use them to find repeated words, fixed-length palindromes, or matching delimiters.

python
import re

# Find adjacent duplicated words
print(re.findall(r'\b(\w+) \1\b', 'the the quick brown fox fox jumps'))

# Match content wrapped in matching quote chars (single OR double)
m = re.search(r"(['\"])(.*?)\1", 'set name="alice" and age=30')
print(m.group(2))

Output:

text
['the', 'fox']
alice
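
Backreferences can also be written by name inside the pattern, using (?P=name). The sketch below repeats the matching-quote idea with named groups; the group names quote and body and the sample strings are chosen here for illustration.

```python
import re

# (?P=quote) must repeat whatever the group named "quote" captured.
QUOTED = re.compile(r'''(?P<quote>['"])(?P<body>.*?)(?P=quote)''')

for s in ['say "hi" now', "name='alice'", 'broken "mix\' here']:
    m = QUOTED.search(s)
    print(m.group('body') if m else None)
# prints hi, then alice, then None (the third string mixes quote types)
```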

Lookaround — (?=…), (?!…), (?<=…), (?<!…)

Lookaround assertions test whether something does or does not appear adjacent to the current position without consuming characters. Python's re supports all four variants. The lookbehind must be fixed-length (use the regex library if you need variable-length).

python
import re

# Match price digits only when followed by USD (lookahead)
print(re.findall(r'\d+(?=\s*USD)', 'cost: 100 USD or 50 EUR or 30 USD'))

# Match digit runs NOT preceded by $ (negative lookbehind); the \d in the
# class stops the match from restarting in the middle of "$100"
print(re.findall(r'(?<![$\d])\d+', 'price $100 quantity 50 items'))

# Identifiers not followed by a paren — "variables, not function calls"
print(re.findall(r'\b\w+\b(?!\s*\()', 'foo() bar baz() qux'))

Output:

text
['100', '30']
['50']
['bar', 'qux']
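
Since lookarounds consume nothing, a pattern made only of assertions matches a position rather than text. A minimal sketch of that idea, inserting thousands separators; the function name add_commas is hypothetical and the input is assumed to be a plain string of digits.

```python
import re

def add_commas(digits):
    # Zero-width match: after a digit, with a multiple of three digits left.
    return re.sub(r'(?<=\d)(?=(?:\d{3})+$)', ',', digits)

print(add_commas('1234567'))   # 1,234,567
print(add_commas('999'))       # 999
```

The lookbehind here is a single \d, so it satisfies the fixed-length requirement.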

Flags

Flags change how the regex engine interprets the pattern. They can be passed as the flags= keyword argument or embedded inline with (?aiLmsux) at the start of the pattern (or (?i:...) to scope to a group).

| Flag | Short | Effect |
| --- | --- | --- |
| re.IGNORECASE | re.I | case-insensitive matching |
| re.MULTILINE | re.M | ^ and $ match at every line boundary |
| re.DOTALL | re.S | . matches newlines too |
| re.VERBOSE | re.X | ignore whitespace and # comments inside the pattern |
| re.ASCII | re.A | \w, \d, \s match ASCII only (not Unicode) |
| re.UNICODE | re.U | default in Python 3; explicit is rarely needed |
| re.DEBUG | (none) | print compiled-pattern debug info |

Combine flags with |:

python
import re

pat = re.compile(r'^error: (.+)$', re.I | re.M)
log = """
INFO: started
ERROR: out of memory
WARN: retrying
Error: connection lost
"""
print(pat.findall(log))

Output:

text
['out of memory', 'connection lost']
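
The inline forms mentioned above can be sketched the same way. In this example (the sample strings are invented), (?i:...) limits case-insensitivity to one part of the pattern, while a leading (?i) behaves like passing flags=re.I.

```python
import re

# (?i:...) makes only the wrapped part case-insensitive.
pat = re.compile(r'(?i:error) code (\d+)')

print(bool(pat.search('ERROR code 7')))   # True: "error" part ignores case
print(bool(pat.search('error CODE 7')))   # False: "code" is still case-sensitive

# A leading (?i) applies to the whole pattern, like flags=re.I.
print(bool(re.search(r'(?i)error code (\d+)', 'error CODE 7')))   # True
```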

Verbose mode (re.X)

re.VERBOSE (a.k.a. re.X) tells the engine to ignore unescaped whitespace and #-comments inside the pattern, allowing you to format regexes like prose. This is the single most useful feature for keeping non-trivial regexes maintainable.

python
import re

# A semver-like version pattern, commented and formatted
VERSION = re.compile(r"""
    ^                       # start of string
    v?                      # optional leading 'v'
    (?P<major>\d+) \.
    (?P<minor>\d+) \.
    (?P<patch>\d+)
    (?: - (?P<pre>[\w.]+) )?   # optional pre-release tag
    (?: \+ (?P<build>[\w.]+) )?  # optional build metadata
    $
""", re.VERBOSE)

for s in ['1.2.3', 'v1.2.3-beta.4+exp.sha.5114f85', 'not a version']:
    m = VERSION.match(s)
    print(s, '→', m.groupdict() if m else None)

Output:

text
1.2.3 → {'major': '1', 'minor': '2', 'patch': '3', 'pre': None, 'build': None}
v1.2.3-beta.4+exp.sha.5114f85 → {'major': '1', 'minor': '2', 'patch': '3', 'pre': 'beta.4', 'build': 'exp.sha.5114f85'}
not a version → None

Inside a verbose pattern, write a literal whitespace as \ (escaped space) or [ ] (a class), and a literal # as \#.
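
A short sketch of those two escapes in practice, using a made-up pattern that matches text like "issue #42":

```python
import re

ISSUE = re.compile(r"""
    issue [ ]      # a literal space, written as a one-character class
    \# (\d+)       # a literal hash sign, escaped so it does not start a comment
""", re.VERBOSE)

print(ISSUE.search('see issue #42 for details').group(1))   # 42
```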

re.sub with a callback

re.sub(pattern, repl, string) accepts either a replacement string (with \1, \g<name> etc.) or a callable that takes a Match and returns a replacement string. The callable form is the cleanest way to do conditional, computed, or stateful substitutions.

python
import re

# Uppercase every word, but skip short ones (three letters or fewer)
def capitalize_long(m):
    word = m.group()
    return word.upper() if len(word) > 3 else word

print(re.sub(r'\b\w+\b', capitalize_long, 'the quick brown fox'))

# Increment every number in a string
print(re.sub(r'\d+', lambda m: str(int(m.group()) + 1), 'a=1 b=2 c=99'))

# Mask emails
print(re.sub(
    r'(?P<name>[\w.]+)@(?P<domain>[\w.]+)',
    lambda m: f"{m['name'][0]}***@{m['domain']}",
    'contact alice@example.com or bob@example.com',
))

Output:

text
the QUICK BROWN fox
a=2 b=3 c=100
contact a***@example.com or b***@example.com
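
Another common use of the callable form is a lookup-driven substitution. The sketch below fills {placeholder} markers from a dictionary and leaves unknown ones untouched; the placeholder syntax, the values dict, and the function name fill are all assumptions made for this example.

```python
import re

values = {'user': 'alice', 'count': '3'}   # hypothetical data

def fill(m):
    # Look up the captured key; leave unknown placeholders as they were.
    return values.get(m.group(1), m.group(0))

print(re.sub(r'\{(\w+)\}', fill, 'Hi {user}, you have {count} new {things}.'))
# Hi alice, you have 3 new {things}.
```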

re.subn and re.split

subn is sub that also returns a count of substitutions made — convenient for "did anything change?" checks. split is the regex-aware equivalent of str.split, useful when the delimiter is a pattern, not a literal.

python
import re

print(re.subn(r'foo', 'bar', 'foo foo baz foo'))
print(re.split(r'\s*,\s*', '  apple , banana, cherry,  date'.strip()))  # strip first: split does not trim the ends
print(re.split(r'(\d+)', 'abc123def456ghi'))      # keep delimiters by capturing them

Output:

text
('bar bar baz bar', 3)
['apple', 'banana', 'cherry', 'date']
['abc', '123', 'def', '456', 'ghi']
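
Both functions also accept a limit, which may be worth a quick illustration: maxsplit for split and count for sub and subn. The inputs below are invented; the limits are passed as keyword arguments.

```python
import re

# maxsplit=1: split once, keep the rest intact
print(re.split(r'\s*=\s*', 'key = a = b', maxsplit=1))   # ['key', 'a = b']
# count=2: replace only the first two matches
print(re.sub(r'\d', '#', '1234', count=2))               # ##34
```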

re.escape

re.escape(s) returns s with every regex metacharacter prefixed by a backslash — essential whenever you need to embed user-supplied text or an unknown string into a regex.

python
import re

needle = 'price: $5.99'
pattern = re.escape(needle)
print(pattern)
print(re.search(pattern, 'the price: $5.99 today'))

Output:

text
price:\ \$5\.99
<re.Match object; span=(4, 16), match='price: $5.99'>
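
A typical application is building one alternation from a list of literal strings. This is a sketch under the assumption that the terms come from user input; the terms list is hypothetical. Sorting longest-first matters because alternation takes the first alternative that matches.

```python
import re

terms = ['C++', 'C', 'a.b', '$HOME']   # hypothetical user-supplied strings

# Escape each term, and put longer terms first so 'C++' wins over 'C'.
alternation = '|'.join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
pattern = re.compile(alternation)

print(pattern.findall('C++ and C use $HOME, not axb'))
# ['C++', 'C', '$HOME']  ('axb' is not matched because the dot was escaped)
```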

Differences from PCRE

Python's re is very close to PCRE but not identical. The table below captures the differences you'll actually hit; for the full PCRE syntax reference see linux/pcre.

| Feature | Python re | PCRE |
| --- | --- | --- |
| Named group syntax | (?P<name>…) | (?<name>…) |
| Named backreference (pattern) | (?P=name) | \k<name> |
| Named backreference (replacement) | \g<name> | tool-dependent (e.g. $name or ${name}) |
| \K (reset match start) | not supported | supported |
| Recursive patterns (?R) / (?1) | not supported | supported |
| Atomic groups (?>…) | supported since 3.11 (together with possessive *+, ++, ?+) | supported |
| Variable-length lookbehind | not supported (fixed-length only) | alternatives may differ in length; recent PCRE2 releases also allow bounded variable length |
| Unicode property classes \p{…} | not supported (use the regex package) | supported |
| Inline flag scoping (?i:…) | supported | supported |
| Conditionals (?(name)yes\|no) | supported | supported |
| Default Unicode | yes (\w matches Unicode letters) | depends on tool config |

If you need \K, recursive patterns, \p{…}, or variable-length lookbehind in Python (or atomic groups on versions before 3.11), install the regex package (pip install regex). It is a near-superset of re with the same API.
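
When porting a pattern written for a PCRE tool, the named-group rows are usually the first thing to trip over. The helper below is a deliberately naive sketch of that translation; the function name pcre_names_to_python is hypothetical, and it does not try to handle escaped parentheses or character classes that happen to contain the same character sequences.

```python
import re

def pcre_names_to_python(pattern):
    """Naive sketch: rewrite PCRE named-group syntax into Python's."""
    # (?<name>...) -> (?P<name>...), leaving lookbehinds (?<= and (?<! alone
    pattern = re.sub(r'\(\?<(?![=!])', '(?P<', pattern)
    # \k<name> -> (?P=name)
    return re.sub(r'\\k<(\w+)>', r'(?P=\1)', pattern)

converted = pcre_names_to_python(r'(?<word>\w+) \k<word>')
print(converted)                                            # (?P<word>\w+) (?P=word)
print(re.search(converted, 'it is is fine').group('word'))  # is
```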

Performance tips

A short list of patterns that the engine optimizes well, plus the anti-patterns that cause catastrophic backtracking.

Anchor when you can

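Patterns that start with ^, \A, or a literal prefix let the engine rule out most start positions quickly instead of attempting a full match at every position. The saving is largest on input that does not match; when a match exists at the first position tried, as in the example below, the anchored and unanchored versions perform about the same.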

python
import re, timeit

PAT_NAIVE  = re.compile(r'.*error')
PAT_FAST   = re.compile(r'^.*error')

text = 'a' * 10_000 + 'error'
print(timeit.timeit(lambda: PAT_NAIVE.search(text),  number=1000))
print(timeit.timeit(lambda: PAT_FAST.search(text),   number=1000))

Output (illustrative; exact timings vary by machine and Python version):

text
0.0185
0.0142
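
The effect of anchoring tends to show up more clearly on input that never matches, because an unanchored search then retries from each start position in turn. The sketch below reuses the two patterns with a made-up non-matching string; no timings are quoted because they depend on the machine and interpreter, but on CPython the unanchored search would be expected to be far slower.

```python
import re
import timeit

PAT_NAIVE = re.compile(r'.*error')
PAT_FAST = re.compile(r'^.*error')

text = 'a' * 2_000          # hypothetical input that never matches

# Without the anchor, search retries from each start position in turn.
naive = timeit.timeit(lambda: PAT_NAIVE.search(text), number=20)
# With ^ (and no re.M), only position 0 can ever match.
fast = timeit.timeit(lambda: PAT_FAST.search(text), number=20)
print(f'unanchored: {naive:.4f}s  anchored: {fast:.4f}s')
```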

Avoid nested quantifiers

(a+)+, (a|a)+, and (.*)+ cause exponential backtracking on non-matching input. Refactor to a single quantifier or use a possessive quantifier (3.11+).

python
import re

# DANGEROUS — exponential on non-match
# re.match(r'^(a+)+b$', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!')

# SAFE
print(re.match(r'^a+b$', 'aaaaaaaaaab'))

Output:

text
<re.Match object; span=(0, 11), match='aaaaaaaaaab'>
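
Where the nested shape has to stay, a possessive quantifier is one way to remove the backtracking. The following is a sketch assuming Python 3.11 or later for the a++ syntax, with a fallback to the hand-simplified pattern on older interpreters; the variable names are illustrative.

```python
import re
import sys

evil_input = 'a' * 30 + '!'   # hypothetical non-matching input

if sys.version_info >= (3, 11):
    # a++ never gives characters back, so the failure is found quickly
    safe = re.compile(r'^(?:a++)+b$')
else:
    # older versions: collapse the nested quantifier by hand
    safe = re.compile(r'^a+b$')

print(safe.match(evil_input))        # None
print(safe.match('aaab').group())    # aaab
```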

Use non-capturing groups (?:…)

If you only need grouping for alternation or quantification, use (?:…) — it avoids the overhead of remembering the capture for later retrieval.

python
import re
# capturing — slower, populates .groups()
print(re.findall(r'(foo|bar)+', 'foofoobar'))
# non-capturing — faster, returns full matches
print(re.findall(r'(?:foo|bar)+', 'foofoobar'))

Output:

text
['bar']
['foofoobar']

Common pitfalls

  1. Forgetting the raw-string prefix r'': '\d' works by accident (\d isn't a Python escape, although Python 3.12+ emits a SyntaxWarning for it) but '\b' does not (\b is the backspace character). Always use r'...' for regex literals.
  2. Using match when you wanted search: match only checks the start of the string. Use search for "does this appear anywhere" or fullmatch for "is this the entire string".
  3. findall returns captures when there are groups: if your pattern has any capture groups, findall returns the captures only (a tuple per match when there is more than one group), not the full match. Use non-capturing groups (?:…) to keep the full string, or switch to finditer.
  4. . doesn't match newlines by default: pass re.S (DOTALL) when scanning across line boundaries.
  5. ^ and $ only match string ends by default: pass re.M (MULTILINE) for per-line anchoring.
  6. Greedy vs non-greedy */+: <.+> on <a><b> matches the entire <a><b>. Use <.+?> for non-greedy, or <[^>]+> with a negated class (faster).
  7. Catastrophic backtracking: patterns like (a+)+$ lock up on non-matching input. Refactor to a single quantifier, or use possessive quantifiers (a++) or atomic groups (?>…), available in re since Python 3.11 and in the regex package on older versions.
  8. Mixing bytes and str: a bytes pattern (rb'\d') can only match a bytes input, and vice versa. Mismatching raises TypeError.
  9. Named groups must be unique: (?P<a>x)(?P<a>y) raises re.error. Use unique names or numeric groups.
  10. Variable-length lookbehind: (?<=ab|abc) is rejected because the alternatives differ in length. Either split into two patterns or switch to the regex library.
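
Several of these pitfalls are easy to reproduce in a few lines. The sketch below covers items 1, 3, 8, and 9 with made-up inputs; the exception messages are printed rather than quoted here because their wording differs between Python versions.

```python
import re

# 1. Without r'', '\b' is a backspace character, not a word boundary.
print(re.findall('\bcat\b', 'a cat sat'))     # []
print(re.findall(r'\bcat\b', 'a cat sat'))    # ['cat']

# 3. Capture groups change what findall returns.
print(re.findall(r'(\d+)-(\d+)', '10-20 30-40'))      # [('10', '20'), ('30', '40')]
print(re.findall(r'(?:\d+)-(?:\d+)', '10-20 30-40'))  # ['10-20', '30-40']

# 8. A bytes pattern cannot search a str.
try:
    re.search(rb'\d', 'abc1')
except TypeError as exc:
    print('TypeError:', exc)

# 9. Duplicate group names are rejected at compile time.
try:
    re.compile(r'(?P<a>x)(?P<a>y)')
except re.error as exc:
    print('re.error:', exc)
```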

Real-world recipes

Parse a multi-line config file

A .ini-style config parser using verbose regex for clarity. Handles section headers, key=value pairs, and # comments.

python
import re

CONFIG = re.compile(r"""
    ^\s*
    (?:
        \[ (?P<section>[^\]]+) \]            # section header
      | (?P<key>[\w.]+) \s* = \s* (?P<value>.*?)  # key = value
      | \# .*                                # comment
    )?
    \s* $
""", re.VERBOSE | re.MULTILINE)

text = """
# database config
[database]
host = localhost
port = 5432

[logging]
level = INFO
"""

current = None
config = {}
for m in CONFIG.finditer(text):
    if m['section']:
        current = m['section']
        config[current] = {}
    elif m['key'] and current is not None:
        config[current][m['key']] = m['value']

print(config)

Output:

text
{'database': {'host': 'localhost', 'port': '5432'}, 'logging': {'level': 'INFO'}}

Extract structured records from logs

Pull timestamp, level, and message from each line of a typical log file using named groups.

python
import re

LINE = re.compile(r"""
    ^
    (?P<ts>\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2})
    \s+
    (?P<level>DEBUG|INFO|WARN|ERROR)
    \s+
    (?P<msg>.*)
    $
""", re.VERBOSE)

logs = [
    '2026-05-25 12:01:03 INFO  starting service',
    '2026-05-25 12:01:04 ERROR connection refused',
    'malformed line',
]
for line in logs:
    m = LINE.match(line)
    if m:
        print(m.groupdict())

Output:

text
{'ts': '2026-05-25 12:01:03', 'level': 'INFO', 'msg': 'starting service'}
{'ts': '2026-05-25 12:01:04', 'level': 'ERROR', 'msg': 'connection refused'}

Strip ANSI escape sequences

Normalize colored terminal output before saving to a file.

python
import re

ANSI = re.compile(r'\x1b\[[0-9;]*m')
colored = '\x1b[31mERROR\x1b[0m: \x1b[1mfile\x1b[0m missing'
print(ANSI.sub('', colored))

Output:

text
ERROR: file missing

URL slug from title

Lowercase, drop non-alphanumerics, collapse whitespace into a single hyphen.

python
import re

def slugify(title):
    s = title.lower()
    s = re.sub(r'[^a-z0-9\s-]', '', s)         # strip punctuation
    s = re.sub(r'\s+', '-', s)                  # spaces → hyphens
    s = re.sub(r'-+', '-', s).strip('-')        # collapse hyphens
    return s

print(slugify('Hello, World! — Python 3.12 & re.X'))

Output:

text
hello-world-python-312-rex

Reformat phone numbers

Normalize many input shapes to a single canonical format.

python
import re

PHONE = re.compile(r"""
    \D*                              # any leading non-digits
    (?: 1 \D+ )?                     # optional country code 1 plus a separator
    (?P<area>\d{3}) \D*
    (?P<prefix>\d{3}) \D*
    (?P<line>\d{4})
    \D*$
""", re.VERBOSE)

for raw in ['(555) 123-4567', '555.123.4567', '5551234567', '+1 555 123 4567']:
    m = PHONE.match(raw)
    if m:
        print(f"({m['area']}) {m['prefix']}-{m['line']}")

Output:

text
(555) 123-4567
(555) 123-4567
(555) 123-4567
(555) 123-4567

Find unmatched braces

Repeatedly delete innermost balanced {...} pairs using a negated character class; any brace left in the residue is unmatched.

python
import re

# Strip innermost `{...}` pairs until none remain, then look for stray `{`
text = '{ok} stray { and {nested {inner}} {ok}'
residue = text
while True:
    residue, n = re.subn(r'\{[^{}]*\}', '', residue)
    if n == 0:
        break
print('residue:', repr(residue))
print('stray opens:', residue.count('{'))

Output:

text
residue: ' stray { and  '
stray opens: 1

Replace with a counter

Number every occurrence of a pattern, using a closure as the sub callback.

python
import re

def numberer():
    n = 0
    def repl(m):
        nonlocal n
        n += 1
        return f'[{n}]{m.group()}'
    return repl

print(re.sub(r'\b\w+\b', numberer(), 'one two three four'))

Output:

text
[1]one [2]two [3]three [4]four

Tokenize source code

A toy tokenizer using re.Scanner (a hidden gem in re).

python
import re

scanner = re.Scanner([
    (r'\d+',           lambda s, t: ('NUM', int(t))),
    (r'[=+\-*/]',      lambda s, t: ('OP', t)),       # '=' included so assignments tokenize
    (r'[a-zA-Z_]\w*',  lambda s, t: ('IDENT', t)),
    (r'\s+',           None),                         # skip whitespace
])

# scan() stops at the first character no rule matches and returns the rest
tokens, remainder = scanner.scan('x = 10 + 20 * y')
print(tokens)

Output:

text
[('IDENT', 'x'), ('OP', '='), ('NUM', 10), ('OP', '+'), ('NUM', 20), ('OP', '*'), ('IDENT', 'y')]

re.Scanner is undocumented but stable since Python 2.4. For production tokenizers prefer tokenize (for Python source) or a dedicated lexer like ply or lark.
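
For a tokenizer that relies only on documented API, the usual alternative is one alternation of named groups driven by finditer, reading Match.lastgroup to see which rule fired. A minimal sketch follows; the token names and the tokenize function are chosen here for illustration and are not part of re.

```python
import re

TOKEN = re.compile(r"""
    (?P<NUM>\d+)
  | (?P<OP>[=+\-*/])
  | (?P<IDENT>[A-Za-z_]\w*)
  | (?P<SKIP>\s+)
  | (?P<MISMATCH>.)          # anything else is an error
""", re.VERBOSE)

def tokenize(src):
    for m in TOKEN.finditer(src):
        kind = m.lastgroup            # name of the alternative that matched
        if kind == 'SKIP':
            continue
        if kind == 'MISMATCH':
            raise SyntaxError(f'unexpected {m.group()!r} at offset {m.start()}')
        value = int(m.group()) if kind == 'NUM' else m.group()
        yield kind, value

print(list(tokenize('x = 10 + 20 * y')))
```

Unlike scanner.scan, which silently returns the unmatched remainder, this version raises on the first character that no rule recognizes.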

See also

  • linux/pcre — the PCRE dialect used by grep -P, ripgrep, nginx, PHP
  • linux/grep and linux/sed — shell tools that consume the same patterns
  • javascript/regex — JavaScript regex differences, when you cross language boundaries
  • regex — third-party drop-in with variable-length lookbehind, atomic groups, \K