Regular Expressions
A pattern-matching mini-language for searching, validating, and rewriting text — implemented (with subtly different dialects) by every modern language and CLI tool.
Definition
A regular expression (regex, regexp) is a compact mini-language for describing a set of strings: a pattern that an engine can use to test whether text matches, find where it matches, extract sub-parts, or rewrite it. Formally, a regex describes a regular language, the class of string sets recognisable by a finite automaton. In practice, modern "regex" dialects (PCRE, ECMAScript, Python's `re`, .NET, Go's RE2-style `regexp`) extend that core with some or all of shorthand classes, anchors, lookarounds, named captures, and backreferences. Most of these are conveniences that stay within the regular languages; backreferences are the notable exception and push the language beyond what is strictly regular.
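The four uses named in the definition (test, find, extract, rewrite) can be sketched with Python's standard `re` module. The sample text and the date pattern below are invented for illustration only.

```python
import re

# Hypothetical sample text and pattern, chosen only for illustration.
line = "2024-05-01 ERROR disk full; 2024-05-02 INFO ok"
date = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# Test: does the text contain a match anywhere?
has_date = date.search(line) is not None

# Find: where does the first match start and end?
span = date.search(line).span()

# Extract: pull the captured sub-parts out of every match.
parts = date.findall(line)

# Rewrite: reorder the three captures into DD/MM/YYYY.
rewritten = date.sub(r"\3/\2/\1", line)

print(has_date, span, parts)
print(rewritten)
```

Compiling once with `re.compile` and reusing the pattern object keeps the four operations visibly tied to the same pattern.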
Why it matters
Regex is the lingua franca of text manipulation. Every grep, sed, awk, find, log-tail, log-grep, IDE search-and-replace, URL router, JSON-path validator, lexer, and template engine on a modern computer is, somewhere, parsing a regex. Knowing the syntax lets you replace 20 lines of if/split/startswith with one expressive pattern, and reading regex fluently lets you understand the configuration of nginx, Apache, CI pipelines, ESLint rules, ripgrep queries, and SIEM detections without translating each one back into prose. Conversely, not knowing regex pitfalls — most famously catastrophic backtracking — is how a one-line input-validation pattern turns into a denial-of-service vulnerability (ReDoS) that freezes a production server.
How it works
A regex is compiled by an engine into a state machine; matching then walks that machine over the input. The two engine families behave differently enough that the distinction matters in production:
- Backtracking engines (PCRE2, Perl, Python `re`, JavaScript `RegExp`, .NET, Ruby, Java) build a non-deterministic finite automaton (NFA) and try alternatives recursively, backing up when a branch fails. This buys you backreferences and lookbehinds, but worst-case time is exponential in the input length.
- Linear-time engines (Google's RE2, Rust's `regex` crate, ripgrep, Go's `regexp`) track every possible NFA state in parallel, Thompson-style, often caching the resulting state sets as a lazily built deterministic finite automaton (DFA). They guarantee match time linear in the input length, at the cost of forbidding backreferences and lookarounds.
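As a small illustration of what a backtracking engine buys you, Python's `re` accepts both a backreference and a lookbehind, two constructs that Go's `regexp` rejects at compile time. The sample strings are made up.

```python
import re

# \1 must re-match exactly the text captured by group 1, so this
# pattern finds a word that is immediately repeated.
doubled = re.compile(r"\b(\w+) \1\b")

hit = doubled.search("it was the the best of times")
miss = doubled.search("it was the best of times")

# Lookbehind: match digits only when a dollar sign precedes them,
# without including the dollar sign in the match.
price = re.search(r"(?<=\$)\d+", "cost: $42")

print(hit.group(1))    # the
print(miss)            # None
print(price.group())   # 42
```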
The dialect landscape itself has shifted. PCRE1 (the 8.xx series) reached end-of-life on 15 June 2021 with release 8.45, so when a tool says "PCRE" today it almost always means PCRE2 (10.xx) under the hood. PCRE2 releases on a roughly semi-annual cadence. Recent additions include `(*scan_substring:...)` for re-asserting against an already-captured group (10.45); UTS#18 character-class operators `&&` / `--` / `~~` and Perl-style `(?[...])` extended classes (10.45, under `PCRE2_ALT_EXTENDED_CLASS`); 128-character group names (10.44, up from 32); and, in 10.47 (Oct 2025), pattern recursion with group return values plus a new `pcre2_next_match()` iteration API. If you maintain code linking the old `libpcre`, plan a port to `libpcre2` (`PCRE_*` → `PCRE2_*`, explicit code-unit width, distinct headers).
A pattern is built from a small set of primitives:
| Construct | Meaning | Example |
|---|---|---|
| Literal char | matches itself | a matches a |
| Metachar | special meaning | . ^ $ * + ? ( ) [ ] { } \ \| |
| Character class | one char from a set | [A-Za-z0-9_] |
| Shorthand class | predefined set | \d digit, \w word, \s whitespace |
| Anchor | zero-width position | ^ start, $ end, \b word boundary |
| Quantifier | repetition | * 0+, + 1+, ? 0–1, {n,m} bounded |
| Group | sub-pattern + capture | (abc), (?:abc) non-capturing, (?P<name>abc) named |
| Alternation | A or B | cat\|… |
| Backreference | re-match captured text | \1, (?P=name) |
| Lookaround | zero-width assertion | (?=...) ahead, (?<=...) behind |
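Several of the constructs in the table can be combined in one pattern. The sketch below uses Python's `re` in verbose mode so each construct sits on its own commented line; the `key = 'value'` input format is a hypothetical example, not something the table prescribes.

```python
import re

# Verbose mode (re.X) ignores whitespace and allows # comments in the pattern.
pattern = re.compile(
    r"""
    ^                       # anchor: start of string
    (?P<key>[A-Za-z_]\w*)   # named group: character class + shorthand class
    \s*=\s*                 # literal = with optional whitespace around it
    (?P<quote>["'])         # capture the opening quote character
    (?P<value>.*?)          # non-greedy quantifier
    (?P=quote)              # backreference: the same quote must close
    (?=\s*$)                # lookahead: only whitespace may follow
    """,
    re.X,
)

m = pattern.search("name = 'regex'  ")
print(m.group("key"), m.group("value"))   # name regex

# Mismatched quotes fail because the backreference cannot be satisfied.
bad = pattern.search("name = 'regex\"")
print(bad)                                # None
```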
Dialects you will actually meet, ordered roughly from least to most featureful:
| Dialect | Where you find it | Notes |
|---|---|---|
| Windows findstr /R | cmd.exe | Strict subset of POSIX BRE; no +, ?, {n,m} |
| POSIX BRE | default grep, sed, vi | \+, \?, \(, \) must be escaped |
| POSIX ERE | grep -E, egrep, awk, sed -E | Unescaped + ? ( ); no shorthand classes |
| ECMAScript | JavaScript /pat/flags, TypeScript | u for Unicode, v (ES2024) for unicodeSets + set-notation classes, d (ES2022) for indices, s (ES2018) for dotAll |
| Python re | every import re | Near-PCRE; (?P<name>…) for named groups; atomic groups and possessive quantifiers built in since 3.11. The third-party regex module adds variable-length lookbehind |
| PCRE2 (10.x) | grep -P, ripgrep --pcre2, PHP 7.3+, nginx, pcre2grep | Full feature set: recursion w/ return values (10.47), atomic groups, \K, UTS#18 class operators (10.45) |
| RE2 / Rust regex | Go, ripgrep default, Cloudflare WAF, ugrep | Linear-time DFA; no backreferences |
Flags (or inline `(?i)`-style modifiers) tweak the same pattern: `i` case-insensitive, `m` multiline (`^`/`$` match at every line), `s`/dotAll (`.` matches `\n`), `x`/verbose (whitespace and comments ignored), `g`/global (find all). Most engines also expose non-greedy quantifiers (`*?`, `+?`, `??`) that take as few repetitions as possible instead of as many as possible, which is essential for `<.*?>` patterns over HTML-ish text.
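A minimal sketch of these flags in Python's `re`, using an invented three-line sample string. Python has no `g` flag; `re.findall` plays the "find all" role here.

```python
import re

text = "Error: disk\nerror: net\n<b>one</b><b>two</b>"

# i + m: case-insensitive, and ^/$ match at every line boundary.
per_line = re.findall(r"^error: (\w+)$", text, flags=re.I | re.M)

# Without m, ^ only matches at the very start of the string.
first_only = re.findall(r"^error: (\w+)", text, flags=re.I)

# s (dotAll): . also matches the newline between the first two lines.
with_dotall = re.search(r"disk.error", text, flags=re.S) is not None
without_dotall = re.search(r"disk.error", text) is not None

# Inline modifier: (?i) at the start behaves like passing re.I.
inline = re.findall(r"(?i)^error", text, flags=re.M)

# Greedy vs non-greedy quantifier on tag-like text.
greedy = re.findall(r"<b>.*</b>", text)
lazy = re.findall(r"<b>.*?</b>", text)

print(per_line, first_only, with_dotall, without_dotall)
print(inline, greedy, lazy)
```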
Common pitfalls
- Catastrophic backtracking → ReDoS. Patterns with nested quantifiers over an overlapping subexpression, such as `(a+)+$`, `(a|a)*b`, or `(.*)+$`, explode to exponential time on near-matching input. Fix: rewrite to remove the ambiguity (`a+$`), use atomic groups `(?>...)` or possessive `++`, or switch to RE2/Rust `regex` for untrusted input.
- Greedy `.*` over-matches. `<h1>.*</h1>` against `<h1>A</h1><h1>B</h1>` captures the whole line. Use `.*?` (non-greedy), a negated class (`[^<]*`), or, for real markup, an actual parser.
- Dialect drift. `\d` works in PCRE/Python/JS but is undefined in POSIX BRE/ERE; `(?<=...)` lookbehind is unsupported in Go's RE2 and in old JavaScript (< ES2018). Always know which engine the tool actually runs.
- Forgetting to escape user input. Concatenating an untrusted string into a regex turns user-supplied dots and brackets into wildcards. Use `re.escape()` (Python), `RegExp.escape()` (TC39 Stage-4), or `\Q…\E` (PCRE/Java) to quote a literal.
- JavaScript `lastIndex` state with `/g` or `/y`. A `RegExp` object with the global or sticky flag mutates on each `.exec()`/`.test()`, so calling it twice on the same input gives different answers. Reset `re.lastIndex = 0` between unrelated calls, or use `str.match(re)` which does not stash state.
- Anchors and multiline mode. `^` and `$` match string boundaries by default; in many engines you need the `m` flag to make them match every line. `\A` and `\z` (PCRE; Python's `re` spells the end anchor `\Z`) are the unambiguous string-only anchors.
- Unicode is opt-in, and `grep -P` reverted to ASCII-only. JavaScript needs the `u` (or modern `v`) flag for proper surrogate-pair handling and `\p{...}` properties; Python uses `re.UNICODE` (the default on `str`, off on `bytes`); `grep` needs a UTF-8 locale. Since GNU grep 3.11 (May 2023), `-P` shorthand classes `\w`/`\d`/`\s` and the `\b` boundary are ASCII-only by default; to get Unicode behaviour you must opt in with `(*UCP)` at the start of the pattern (e.g. `grep -P '(*UCP)\w+'`). POSIX classes like `[[:alpha:]]` remain locale-aware and are the portable alternative. Without any of these, `.` may match half a code point.
- Parsing what you should not parse. Regex cannot match balanced structures (HTML tags, JSON, source code): those are context-free, not regular. PCRE's recursive subroutines fake it, but the right answer is a real parser.
- Reaching for the wrong tool entirely. For codebase-wide search, line-oriented regex tools (`grep`, `rg`, `ugrep`) are the wrong abstraction when you actually want to match syntactic structures such as function calls, class declarations, or JSX nodes. `ast-grep` matches tree-sitter ASTs across two dozen languages and can rewrite structurally rather than textually; reach for it when a regex would need lookarounds plus negative-lookahead gymnastics to express "any call to `foo` that isn't already wrapped in `try`". For grep-flag muscle memory plus boolean queries, search inside zip/7z/tar archives and PDFs, or an interactive TUI, `ugrep` is the modern drop-in worth knowing.
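Four of the pitfalls above can be reproduced in a few lines of Python's `re`. This is a sketch under stated assumptions: `timed_search` is a hypothetical helper, the inputs are invented, and the run of 18 characters is kept deliberately short because each extra character roughly doubles the work for the nested pattern. Note that Python's `re` spells the end-only anchor `\Z`.

```python
import re
import time

def timed_search(pattern, subject):
    """Hypothetical helper: return (match, elapsed_seconds) for one search."""
    start = time.perf_counter()
    match = re.search(pattern, subject)
    return match, time.perf_counter() - start

# Catastrophic backtracking: the trailing "b" defeats $, so the nested
# quantifier tries every way of splitting the run of a's before failing.
subject = "a" * 18 + "b"
slow_match, slow_time = timed_search(r"(a+)+$", subject)
fast_match, fast_time = timed_search(r"a+$", subject)   # ambiguity removed
print(f"nested: {slow_time:.4f}s  flat: {fast_time:.4f}s")

# Greedy .* over-matches; a negated class stops at the next tag.
html = "<h1>A</h1><h1>B</h1>"
greedy = re.findall(r"<h1>.*</h1>", html)
bounded = re.findall(r"<h1>[^<]*</h1>", html)

# Escaping user input: an unescaped dot acts as a wildcard.
user_input = "1.5"
unescaped = re.search(user_input, "version 125") is not None
escaped = re.search(re.escape(user_input), "version 125") is not None

# Anchors: $ also matches just before a trailing newline, \Z does not.
loose = re.search(r"^\d+$", "42\n") is not None
strict = re.search(r"\A\d+\Z", "42\n") is not None

print(greedy, bounded, unescaped, escaped, loose, strict)
```

The timings are printed rather than asserted because they depend on the machine; the point is the gap between the nested and the flat pattern, which widens quickly as the input grows.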
Where to go next
Companion cheat sheets that put this concept to work:
- /sections/linux/pcre: the canonical PCRE2 syntax reference, covering every class, quantifier, anchor, group, and assertion, plus the 10.43 → 10.47 feature timeline (scan-substring, UTS#18 class operators, recursion-with-returns).
- /sections/linux/grep: three regex engines in one tool (BRE / ERE / PCRE) and how the `-E`/`-P` flags swap dialects; includes the GNU grep 3.11 UCP/Unicode quirk under `-P`.
- /sections/linux/ripgrep: Rust-regex by default for linear-time guarantees, `-P` to opt into PCRE2 for lookaround/backreferences, plus ripgrep 14's `--hyperlink-format` for clickable paths and 15.x's first-class `.jj` (Jujutsu) ignore-discovery.
- /sections/linux/sed: the `s/pattern/replacement/` workhorse, with addressing, capture-group backreferences (`\1`–`\9`), and in-place edits.
- /sections/python/re: Python's near-PCRE module, covering `compile`, `match` vs `search` vs `fullmatch`, named groups, and verbose mode.
- /sections/javascript/regex: ECMAScript regex, including literal vs constructor, flags (`g`/`y`/`u`/`v`/`d`/`s`), named captures, and the `lastIndex` trap.
- /sections/windows/findstr: the minimal regex dialect built into `cmd.exe`, plus PowerShell `Select-String` as the modern alternative.
Sources
References consulted while writing this concept page.
- Wikipedia, Regular expression: canonical definition, Kleene/Thompson history, and the formal-vs-modern distinction that frames the Definition section.
- Wikipedia, ReDoS: authoritative summary of catastrophic backtracking, the three conditions that produce it, and the standard mitigations (RE2, atomic groups, timeouts).
- PCRE2, pcre2pattern specification: the reference for the PCRE2 dialect cited throughout the dialect table.
- PCRE2, news.txt (10.43 → 10.47 release history): authoritative release notes for the recent feature additions (scan-substring, UTS#18 class operators, recursion-with-returns, `pcre2_next_match`) called out in the "How it works" PCRE2 paragraph.
- GNU grep manual (3.12): source for the 3.11 ASCII-only revert under `-P` and the `(*UCP)` opt-in that pitfall #7 documents.
- ripgrep CHANGELOG: tracks the 14.x hyperlink-format work and 15.x Jujutsu (`.jj`) ignore discovery referenced under "Where to go next".
- Russ Cox, Regular Expression Matching Can Be Simple And Fast: the classic write-up on Thompson NFA simulation and why backtracking engines blow up; underpins the engine-families paragraph in "How it works".
- ast-grep, Tool comparison: why structural AST search complements (rather than replaces) regex-line tools; source for pitfall #9's "wrong tool entirely" framing.