Regular Expressions

A pattern-matching mini-language for searching, validating, and rewriting text — implemented (with subtly different dialects) by every modern language and CLI tool.

Definition

A regular expression (regex, regexp) is a compact mini-language for describing a set of strings: a pattern that an engine can use to test whether text matches, find where it matches, extract sub-parts, or rewrite it. Formally a regex describes a regular language, the class of string sets recognisable by a finite automaton. In practice, modern "regex" dialects (PCRE, ECMAScript, Python's re, .NET, Go's RE2) extend that core with shorthand classes, anchors, lookarounds, named captures, and backreferences. Most of these are conveniences that stay within the regular languages; backreferences (and PCRE-style recursion) are what push a dialect beyond what is strictly regular.
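The four uses named in the definition (test, find, extract, rewrite) can be sketched with Python's re module. The sample log line and the date pattern below are made up for illustration; they are not taken from any real tool.

```python
import re

# Hypothetical sample text and pattern, chosen only for illustration.
LINE = "2024-05-01 ERROR disk /dev/sda1 at 91% capacity"
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# 1. Test: does the text contain a match at all?
has_date = DATE.search(LINE) is not None

# 2. Find: where does the match start and end?
span = DATE.search(LINE).span()

# 3. Extract: pull out the named sub-parts as a dict.
parts = DATE.search(LINE).groupdict()

# 4. Rewrite: reorder the captured groups into day/month/year.
rewritten = DATE.sub(r"\g<day>/\g<month>/\g<year>", LINE)

print(has_date, span, parts, rewritten, sep="\n")
```

Compiling the pattern once and reusing it keeps the four operations visibly tied to the same pattern; for one-off use, the module-level re.search and re.sub functions behave the same way.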

Why it matters

Regex is the lingua franca of text manipulation. grep, sed, awk, find, IDE search-and-replace, URL routers, lexers, and many validators and template engines all accept or embed regular expressions. Knowing the syntax lets you replace 20 lines of if/split/startswith with one expressive pattern, and reading regex fluently lets you understand the configuration of nginx, Apache, CI pipelines, ESLint rules, ripgrep queries, and SIEM detections without translating each one back into prose. Conversely, not knowing the pitfalls, most famously catastrophic backtracking, is how a one-line input-validation pattern turns into a denial-of-service vulnerability (ReDoS) that freezes a production server.

How it works

A regex is compiled by an engine into a state machine; matching then walks that machine over the input. The two engine families behave differently enough that the distinction matters in production:

  • Backtracking engines (PCRE2, Perl, Python re, JavaScript RegExp, .NET, Ruby, Java) build a non-deterministic finite automaton (NFA) and try alternatives recursively, backing up when a branch fails. This buys you backreferences and lookbehinds, but worst-case time is exponential in the input length.
  • Linear-time engines (Google's RE2, Rust's regex crate and therefore ripgrep's default engine, Go's regexp) avoid backtracking: they track every live NFA state in parallel, in the style of Thompson's construction, often caching state sets as a lazily built deterministic finite automaton (DFA). They guarantee match time linear in the input length for a given pattern, at the cost of forbidding backreferences and lookarounds.
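The contrast between the two families can be sketched with a toy matcher. This is a deliberately tiny model, not how any real engine is implemented: a pattern is a list of hypothetical (char, optional) tokens, so only literals and the ? quantifier exist. The function names backtrack_match and parallel_match are invented for this example. The test pattern, n optional a's followed by n required a's, is the classic worst case from Russ Cox's article listed under Sources.

```python
# Toy model only: a pattern is a list of (char, optional) tokens, so
# "a?a?aa" becomes [("a", True), ("a", True), ("a", False), ("a", False)].

def backtrack_match(tokens, text):
    """Backtracking: try one alternative at a time. Returns (matched, steps)."""
    steps = 0

    def walk(i, j):
        nonlocal steps
        steps += 1
        if i == len(tokens):
            return j == len(text)
        char, optional = tokens[i]
        # Greedy: first try to consume a character, then fall back to skipping.
        if j < len(text) and text[j] == char and walk(i + 1, j + 1):
            return True
        return optional and walk(i + 1, j)

    return walk(0, 0), steps


def parallel_match(tokens, text):
    """State-set simulation: advance every live state once per input char."""
    steps = 0

    def closure(states):
        # A state sitting before an optional token may also skip that token.
        result, stack = set(), list(states)
        while stack:
            i = stack.pop()
            if i in result:
                continue
            result.add(i)
            if i < len(tokens) and tokens[i][1]:
                stack.append(i + 1)
        return result

    states = closure({0})
    for ch in text:
        advanced = set()
        for i in states:
            steps += 1
            if i < len(tokens) and tokens[i][0] == ch:
                advanced.add(i + 1)
        states = closure(advanced)
    return len(tokens) in states, steps


n = 12
tokens = [("a", True)] * n + [("a", False)] * n   # a?{n} followed by a{n}
text = "a" * n
print("backtracking:", backtrack_match(tokens, text))
print("parallel:    ", parallel_match(tokens, text))
```

Both functions agree on whether the text matches. The backtracking step count roughly doubles each time n grows by one, while the parallel version does at most len(text) * (len(tokens) + 1) steps, which is the trade-off the two bullets describe.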

The dialect landscape itself has shifted. PCRE1 (the 8.xx series) reached end-of-life on 15 June 2021 at 8.45, so when a tool says "PCRE" today it almost always means PCRE2 (10.xx) under the hood. PCRE2 releases on a roughly semi-annual cadence; recent additions include (*scan_substring:...) for re-asserting against an already-captured group (10.45), UTS#18 character-class operators && / -- / ~~ and Perl-style (?[...]) extended classes (10.45, under PCRE2_ALT_EXTENDED_CLASS), 128-character group names (10.44, up from 32), and, in 10.47 (Oct 2025), pattern recursion with group return values plus a new pcre2_next_match() iteration API. If you maintain code linking the old libpcre, plan a port to libpcre2 (PCRE_* names become PCRE2_*, explicit code-unit width, distinct headers).

A pattern is built from a small set of primitives:

Construct        Meaning                 Example
Literal char     matches itself          a matches a
Metachar         special meaning         . ^ $ * + ? ( ) [ ] { } \ |
Character class  one char from a set     [A-Za-z0-9_]
Shorthand class  predefined set          \d digit, \w word, \s whitespace
Anchor           zero-width position     ^ start, $ end, \b word boundary
Quantifier       repetition              * 0+, + 1+, ? 0–1, {n,m} bounded
Group            sub-pattern + capture   (abc), (?:abc) non-capturing, (?P<name>abc) named
Alternation      A or B                  cat|dog
Backreference    re-match captured text  \1, (?P=name)
Lookaround       zero-width assertion    (?=...) ahead, (?<=...) behind
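As a rough illustration, each construct in the table can be exercised in Python's re dialect. The sample strings are invented for this sketch, and other dialects may spell some constructs differently (named groups in particular).

```python
import re

# Metachar: escape + with a backslash to match it literally.
literal_plus = re.search(r"1\+1", "1+1=2").group()

# Character class and quantifier: one or more characters from the set.
is_identifier = re.fullmatch(r"[A-Za-z0-9_]+", "snake_case_42") is not None

# Shorthand class with bounded repetition.
is_phone = re.fullmatch(r"\d{3}-\d{4}", "555-0199") is not None

# Anchor: \b keeps "cat" from matching inside "concat".
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

# Groups: non-capturing (?:...) for the title, named group for the surname.
surname = re.match(r"(?:Mr|Ms)\. (?P<surname>\w+)", "Ms. Lovelace").group("surname")

# Alternation: either branch may match.
pets = re.findall(r"cat|dog", "dog, cat, bird")

# Backreference: \1 must repeat exactly what group 1 captured.
doubled = re.search(r"\b(\w+) \1\b", "it was the the best").group(1)

# Lookaround: require "$" before and " USD" after without consuming either.
price = re.search(r"(?<=\$)\d+(?= USD)", "total: $42 USD").group()

print(literal_plus, is_identifier, is_phone, whole_words,
      surname, pets, doubled, price, sep="\n")
```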

Dialects you will actually meet, ordered roughly from least to most featureful:

Dialect             Where you find it                                     Notes
Windows findstr /R  cmd.exe                                               Strict subset of POSIX BRE; no +, ?, {n,m}
POSIX BRE           default grep, sed, vi                                 Operators take a backslash: \( \) for groups; \+ and \? are GNU extensions
POSIX ERE           grep -E, egrep, awk, sed -E                           Unescaped + ? ( ); no shorthand classes
ECMAScript          JavaScript /pat/flags, TypeScript                     u for Unicode, v (ES2024) for unicodeSets and set-notation classes, d (ES2022) for match indices, s (ES2018) for dotAll
Python re           every import re                                       Near-PCRE; (?P<name>…) for named groups; atomic groups and possessive quantifiers since Python 3.11. The third-party regex module adds variable-length lookbehind
PCRE2 (10.x)        grep -P, ripgrep --pcre2, PHP 7.3+, nginx, pcre2grep  Full feature set: recursion with return values (10.47), atomic groups, \K, UTS#18 class operators (10.45)
RE2 / Rust regex    Go, ripgrep default, Cloudflare WAF, ugrep            Linear-time DFA; no backreferences

Flags (or inline (?i)-style modifiers) tweak the same pattern: i case-insensitive, m multiline (^/$ per line), s/dotAll (. matches \n), x/verbose (whitespace and comments ignored), g/global (find all). Most engines also expose non-greedy (lazy) quantifiers (*?, +?, ??) that take as few repetitions as possible instead of as many as possible, which is essential for <.*?> patterns over HTML-ish text.
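The flags above map onto Python's re.I, re.M, re.S, and re.X. Python has no g flag; re.findall plays that role. The following is a small sketch over an invented three-line string, showing each flag and the greedy/lazy difference.

```python
import re

# Invented sample: two log-style lines and one line of HTML-ish text.
text = "Error: disk full\nerror: retrying\n<b>one</b> <b>two</b>"

# i + m: ignore case, and let ^ match at the start of every line.
errors = re.findall(r"^error", text, flags=re.I | re.M)

# Without m, ^ matches only at the very start of the string.
first_only = re.findall(r"^error", text, flags=re.I)

# s (DOTALL): by default . refuses to match a newline.
dot_default = re.search(r"full.error", text) is not None
dot_all = re.search(r"full.error", text, flags=re.S) is not None

# x (VERBOSE): whitespace and # comments inside the pattern are ignored.
tag = re.compile(r"""
    <b>      # opening tag
    (.*?)    # lazy: as few characters as possible
    </b>     # closing tag
""", re.X)
lazy = tag.findall(text)

# The greedy version runs on to the last </b> on the line.
greedy = re.findall(r"<b>(.*)</b>", text)

print(errors, first_only, dot_default, dot_all, lazy, greedy, sep="\n")
```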

Common pitfalls

  1. Catastrophic backtracking → ReDoS. Patterns with nested quantifiers over an overlapping subexpression — (a+)+$, (a|a)*b, (.*)+$ — explode to exponential time on near-matching input. Fix: rewrite to remove the ambiguity (a+$), use atomic groups (?>...) or possessive ++, or switch to RE2/Rust regex for untrusted input.
  2. Greedy .* over-matches. <h1>.*</h1> against <h1>A</h1><h1>B</h1> captures the whole line. Use .*? (non-greedy), a negated class ([^<]*), or — for real markup — an actual parser.
  3. Dialect drift. \d works in PCRE/Python/JS but is undefined in POSIX BRE/ERE; (?<=...) lookbehind is unsupported in RE2-syntax engines such as Go's regexp and in old JavaScript (< ES2018). Always know which engine the tool actually runs.
  4. Forgetting to escape user input. Concatenating an untrusted string into a regex turns user-supplied dots and brackets into wildcards. Use re.escape() (Python), RegExp.escape() (TC39 Stage-4), or \Q…\E (PCRE/Java) to quote a literal.
  5. JavaScript lastIndex state with /g or /y. A RegExp object with the global or sticky flag mutates on each .exec()/.test(), so calling it twice on the same input gives different answers. Reset re.lastIndex = 0 between unrelated calls, or use str.match(re) which does not stash state.
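  6. Anchors and multiline mode. ^ and $ match string boundaries by default; in many engines you need the m flag to make them match at every line, and $ may also match just before a final newline. \A and \z are the unambiguous string-only anchors in PCRE; Python spells the end anchor \Z (in PCRE, \Z still allows a trailing newline).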
  7. Unicode is opt-in — and grep -P reverted to ASCII-only. JavaScript needs the u (or modern v) flag for proper surrogate-pair handling and \p{...} properties; Python uses re.UNICODE (the default on str, off on bytes); grep needs a UTF-8 locale. Since GNU grep 3.11 (May 2023), -P shorthand classes \w/\d/\s and the \b boundary are ASCII-only by default — to get Unicode behaviour you must opt in with (*UCP) at the start of the pattern (e.g. grep -P '(*UCP)\w+'). POSIX classes like [[:alpha:]] remain locale-aware and are the portable alternative. Without any of these, . may match half a code point.
  8. Parsing what you should not parse. Regex cannot match balanced structures (HTML tags, JSON, source code) — those are context-free, not regular. PCRE's recursive subroutines fake it, but the right answer is a real parser.
  9. Reaching for the wrong tool entirely. For codebase-wide search, line-oriented regex tools (grep, rg, ugrep) are the wrong abstraction when you actually want to match syntactic structures such as function calls, class declarations, or JSX nodes. ast-grep matches tree-sitter ASTs across some two dozen languages and can rewrite structurally rather than textually; reach for it when a regex would need lookaround gymnastics to express "any call to foo that isn't already wrapped in try". If you want grep-compatible flags plus boolean queries, search inside zip/7z/tar archives and PDFs, or an interactive TUI, ugrep is the modern drop-in worth knowing.
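Several of these pitfalls can be reproduced in a few lines of Python. The inputs below are invented for illustration, and the behaviour shown is that of Python's re module; other engines differ in the details (for example, the end-of-string anchor is spelled \Z in Python).

```python
import re

# Pitfall 1: removing the nested quantifier keeps the same matches on these
# short samples while removing the ambiguity that causes blow-ups.
samples = ["a", "aaa", "aab", ""]
same_result = all(
    bool(re.search(r"(a+)+$", s)) == bool(re.search(r"a+$", s)) for s in samples
)

# Pitfall 2: a negated class stops at the next "<" instead of over-matching.
html = "<h1>A</h1><h1>B</h1>"
over_matched = re.findall(r"<h1>(.*)</h1>", html)
bounded = re.findall(r"<h1>([^<]*)</h1>", html)

# Pitfall 4: an unescaped "." from user input acts as a wildcard.
user_input = "a.b"  # hypothetical untrusted string
unsafe_hit = re.search(user_input, "aXb") is not None
safe_hit = re.search(re.escape(user_input), "aXb") is not None

# Pitfall 6: $ tolerates a trailing newline; \A...\Z does not.
loose = re.search(r"^\d+$", "123\n") is not None
strict = re.search(r"\A\d+\Z", "123\n") is not None

# Pitfall 7: on str patterns, \d is Unicode-aware unless re.ASCII is set.
arabic_indic = "\u0663\u0664"  # ARABIC-INDIC DIGIT THREE and FOUR
unicode_digits = re.fullmatch(r"\d+", arabic_indic) is not None
ascii_digits = re.fullmatch(r"\d+", arabic_indic, flags=re.ASCII) is not None

print(same_result, over_matched, bounded, unsafe_hit, safe_hit,
      loose, strict, unicode_digits, ascii_digits, sep="\n")
```

For validation code, the strict anchors and re.ASCII are usually the safer defaults; whether they are right depends on what the input is allowed to contain.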

Where to go next

Companion cheat sheets that put this concept to work:

  • /sections/linux/pcre — the canonical PCRE2 syntax reference: every class, quantifier, anchor, group, and assertion, plus the 10.43 → 10.47 feature timeline (scan-substring, UTS#18 class operators, recursion-with-returns).
  • /sections/linux/grep — three regex engines in one tool (BRE / ERE / PCRE) and how the -E / -P flags swap dialects; includes the GNU grep 3.11 UCP/Unicode quirk under -P.
  • /sections/linux/ripgrep — Rust-regex by default for linear-time guarantees, -P to opt into PCRE2 for lookaround/backreferences, plus ripgrep 14's --hyperlink-format for clickable paths and 15.x's first-class .jj (Jujutsu) ignore-discovery.
  • /sections/linux/sed: the s/pattern/replacement/ workhorse, covering addressing, capture-group backreferences (\1 through \9), and in-place edits.
  • /sections/python/re — Python's near-PCRE module: compile, match vs search vs fullmatch, named groups, verbose mode.
  • /sections/javascript/regex — ECMAScript regex: literal vs constructor, flags (g/y/u/v/d/s), named captures, and the lastIndex trap.
  • /sections/windows/findstr — the minimal regex dialect built into cmd.exe, plus PowerShell Select-String as the modern alternative.

Sources

References consulted while writing this concept page.

  • Wikipedia — Regular expression — Canonical definition, Kleene/Thompson history, and the formal-vs-modern distinction that frames the Definition section.
  • Wikipedia — ReDoS — Authoritative summary of catastrophic backtracking, the three conditions that produce it, and the standard mitigations (RE2, atomic groups, timeouts).
  • PCRE2 — pcre2pattern specification — The reference for the PCRE2 dialect cited throughout the dialect table.
  • PCRE2 — news.txt (10.43 → 10.47 release history) — Authoritative release notes for the recent feature additions (scan-substring, UTS#18 class operators, recursion-with-returns, pcre2_next_match) called out in the "How it works" PCRE2 paragraph.
  • GNU grep manual (3.12) — Source for the 3.11 ASCII-only revert under -P and the (*UCP) opt-in that pitfall #7 documents.
  • ripgrep CHANGELOG — Tracks the 14.x hyperlink-format work and 15.x Jujutsu (.jj) ignore discovery referenced under "Where to go next".
  • Russ Cox — Regular Expression Matching Can Be Simple And Fast — The classic write-up on Thompson NFA simulation and why backtracking engines blow up; underpins the engine-families paragraph in "How it works".
  • ast-grep — Tool comparison — Why structural AST search complements (rather than replaces) regex-line tools; source for pitfall #9's "wrong tool entirely" framing.