cheat sheet
charset-normalizer
Package-level reference for charset-normalizer on PyPI — what it does, install, integration with requests, version policy, and alternatives.
What it is
charset-normalizer is a pure-Python library for guessing the character encoding of arbitrary bytes. It has been the detector that requests depends on by default since requests 2.26, where it replaced the long-running chardet dependency. The library decodes the payload with candidate encodings, scores how noisy each result is (chaos) and how much it resembles a real language (coherence), and returns a ranked list of plausible encodings.
Reach for charset-normalizer when you need to: process a file whose encoding is unknown or untrustworthy (legacy logs, scraped HTML, user uploads); replace chardet for licensing reasons (it's MIT vs chardet's LGPL); or inspect ranked candidates with their scores instead of a single guess. Very short samples (<100 bytes) are hard for any heuristic detector, so compare against chardet on your own data before relying on either.
Install
pip install charset-normalizer
Output: pip's download and install log; exits 0 on success
uv add charset-normalizer
Output: dependency resolved + added to pyproject.toml
poetry add charset-normalizer
Output: updated lockfile + virtualenv install
pip install "charset-normalizer[unicode_backport]"
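Output: on the 2.x series, additionally installs unicodedata2 for newer Unicode tables than the stdlib unicodedata module of older Python versions provides. The 3.x series dropped unicodedata2 support, so there the extra adds nothing; check the changelog for your version.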
Versioning & Python support
- Current line is the 3.x series. Semantic versioning is observed: minor releases are backwards-compatible.
- Supports Python 3.7+ on recent releases (the 2.x series also ran on older interpreters such as Python 3.6).
- The 2.x → 3.x boundary was the largest API cleanup; 3.x has been stable since 2022.
- requests depends on charset_normalizer>=2,<4, so the package usually arrives transitively and pinning is rarely a problem.
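To see which release line is actually installed in an environment, the standard library can report it. A minimal sketch, assuming Python 3.8+ for importlib.metadata; the helper name installed_major is hypothetical and not part of charset-normalizer:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_major(dist_name: str = "charset-normalizer") -> Optional[int]:
    """Return the installed major version of a distribution, or None if absent."""
    try:
        return int(version(dist_name).split(".")[0])
    except PackageNotFoundError:
        return None


major = installed_major()
if major is None:
    print("charset-normalizer is not installed")
elif major < 3:
    print(f"legacy {major}.x line installed; consider upgrading to 3.x")
else:
    print(f"{major}.x line installed")
```

Checking the major version only is deliberate: it matches the 2.x vs 3.x split described above without tying code to a specific minor release.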
Package metadata
- Maintainer: Ahmed TAHRI (@Ousret)
- Project home: github.com/jawah/charset_normalizer
- Docs: charset-normalizer.readthedocs.io
- PyPI: pypi.org/project/charset-normalizer
- License: MIT
- Governance: single-maintainer; under the jawah GitHub org
- First released: 2019
- Downloads: consistently top 10 on PyPI (transitive via requests)
Optional dependencies & extras
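- charset-normalizer[unicode_backport]: on the 2.x series this extra installed unicodedata2 for richer Unicode tables where the stdlib table is older; 3.x dropped unicodedata2 support, so the extra no longer adds anything there.
- No required runtime deps: the library is pure Python.
- An optional compiled speedup is provided via the mypyc build (mypyc compiles the Python modules to C extensions, not Rust) for roughly 2× faster detection on big payloads. It is transparent when a compiled wheel is installed.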
Alternatives
| Package | Trade-off |
|---|---|
| chardet | The original. LGPL-licensed; slower; the requests default before 2.26. |
| cchardet (or faust-cchardet) | Cython wrapper over a C library. Fastest, but it is a binary dependency and maintenance has been patchy. |
| ftfy | Solves a different problem: it fixes already-broken mojibake. Pair with charset-normalizer for full-stack recovery. |
| Standard bytes.decode("utf-8", errors="replace") | Use when you're confident the encoding is UTF-8. Avoid errors="ignore". |
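For code that currently calls chardet.detect(), the package exports a chardet-style detect() helper, which can make a trial migration cheap. A minimal sketch; the sample sentence is made up for illustration, and the reported name is normalized through the codec registry because its spelling can differ between detectors and versions:

```python
import codecs

from charset_normalizer import detect  # chardet-style helper exported by the package

payload = "Ceci est un café très fréquenté à Orléans, même l'été.".encode("utf-8")

result = detect(payload)  # dict with "encoding", "language" and "confidence" keys
print(result)

# Compare codec identities rather than raw strings ("utf-8" vs "utf_8").
codec_name = codecs.lookup(result["encoding"]).name
print(codec_name)
```

New code is usually better served by from_bytes(), which exposes the ranked candidates rather than a single dictionary.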
Common gotchas
- Detection is heuristic, not authoritative. Short inputs (<50 bytes) or inputs that are valid under multiple encodings (CP1252 vs ISO-8859-1 is the classic case) produce uncertain results. Always check the chaos and coherence fields.
- from_bytes() returns a CharsetMatches collection. Call .best() or index [0] for the best match; iterate for ranked alternatives. An empty collection means nothing plausible was found: best() returns None and [0] raises.
- Encoding names are Python codec names, not IANA labels. from_bytes(...).best().encoding returns names like cp1252, utf_8, iso8859_5, which match Python codec aliases but not HTTP-style labels such as windows-1252 or UTF-8.
- .output() returns bytes, not text: the payload re-encoded, UTF-8 by default. Use str(match) to get the decoded text.
- from_path() reads the entire file into memory. steps= and chunk_size= only limit how much of the payload is analyzed; for huge files, read a sample and call from_bytes() yourself.
- Detection-vs-coercion confusion. charset-normalizer decides what an encoding most likely IS; it does NOT repair text that was already decoded with the wrong encoding (that's ftfy's job).
- The CLI is shipped as normalizer (not charset-normalizer). If the command is not found, make sure pip's script directory is on PATH, or use python -m charset_normalizer.
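Several of the gotchas above are easier to see on a concrete result. A minimal sketch, assuming a made-up UTF-8 sample; it shows the None guard, the difference between str(match) and output(), and iteration over ranked candidates:

```python
from charset_normalizer import from_bytes

raw = "Всеки човек има право на образование. Образованието е безплатно.".encode("utf-8")

matches = from_bytes(raw)
best = matches.best()  # None when the collection is empty

if best is None:
    print("no plausible encoding found")
else:
    print(best.encoding, best.language, best.chaos, best.coherence)
    text = str(best)         # decoded text (str)
    as_utf8 = best.output()  # bytes, re-encoded (UTF-8 by default)
    print(type(text).__name__, type(as_utf8).__name__)

# Ranked alternatives, best first.
for candidate in matches:
    print(candidate.encoding, candidate.chaos)
```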
Real-world recipes
The recipes lean toward the common content-pipeline use cases: detecting unknown files, integrating with requests, and benchmarking against chardet.
Recipe 1 — Detect the encoding of a file on disk.
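from charset_normalizer import from_path

results = from_path("./mystery.csv")
best = results.best()  # None when no plausible encoding was found
if best is None:
    raise SystemExit("no plausible encoding found")
print(best.encoding, best.language, best.chaos, best.coherence)
print(str(best)[:200])  # str(match) is the decoded text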
Output:
cp1252 English 0.013 0.951
<first 200 chars of the decoded file>
chaos is a noise score (lower is better); coherence is a confidence score (higher is better). Both range 0-1.
Recipe 2 — Force a fallback when detection is uncertain.
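from charset_normalizer import from_bytes

# Illustrative cutoff; tune it on your data. With the default threshold (0.2)
# the detector already drops noisier candidates, so a cutoff above that never fires.
MAX_CHAOS = 0.1

with open("./short.bin", "rb") as fh:
    raw = fh.read()

best = from_bytes(raw).best()  # None when nothing plausible was found
if best is None or best.chaos > MAX_CHAOS:
    text = raw.decode("utf-8", errors="replace")  # fallback
else:
    text = str(best)
print(text[:100])
Output: uses the detector when confident; falls back to UTF-8 with replace (which yields � for invalid bytes) when detection is shaky or returns nothing.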
Recipe 3 — Integration with requests to recover encoding.
import requests
from charset_normalizer import from_bytes
r = requests.get("https://example.com/article", timeout=5)
# requests normally guesses from headers; override with the detector for unreliable servers:
match = from_bytes(r.content).best()
text = str(match) if match else r.text
print(match.encoding if match else r.encoding)
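Output: the detected encoding when the detector finds one, otherwise the encoding requests derived from the headers. Useful for legacy sites whose Content-Type charset is wrong or missing (requests assumes ISO-8859-1 for text/* responses that declare no charset).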
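A variation on this recipe keeps the declared charset when it can be verified and only detects otherwise. A minimal sketch that works on plain bytes so it runs without network access; decode_body is a hypothetical helper, not part of requests or charset-normalizer. It skips ISO-8859-1 because that codec maps every byte and therefore can never be disproved by a strict decode:

```python
import codecs
from typing import Optional, Tuple

from charset_normalizer import from_bytes


def decode_body(content: bytes, declared: Optional[str]) -> Tuple[str, str]:
    """Return (text, encoding_used) for a response body."""
    # 1. Trust the declared charset only if a strict decode can confirm it.
    if declared:
        try:
            if codecs.lookup(declared).name != "iso8859-1":
                return content.decode(declared), declared
        except (LookupError, UnicodeDecodeError):
            pass  # unknown or wrong label: fall through to detection

    # 2. Ask the detector.
    best = from_bytes(content).best()
    if best is not None:
        return str(best), best.encoding

    # 3. Last resort: lossy, but the replacement characters stay visible.
    return content.decode("utf-8", errors="replace"), "utf-8"


body = "Всеки човек има право на образование.".encode("utf-8")
print(decode_body(body, "iso-8859-1")[1])
```

With requests, the call would look like decode_body(r.content, r.encoding).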
Recipe 4 — Benchmark charset-normalizer vs chardet.
import time

import chardet  # pip install chardet
from charset_normalizer import from_bytes

with open("./sample.txt", "rb") as f:
    data = f.read()

start = time.perf_counter()
match = from_bytes(data).best()  # may be None
t1 = time.perf_counter() - start
start = time.perf_counter()
cd = chardet.detect(data)["encoding"]
t2 = time.perf_counter() - start
cn = match.encoding if match else None
print(f"charset-normalizer: {cn} ({t1*1000:.2f} ms)")
print(f"chardet: {cd} ({t2*1000:.2f} ms)")
Output:
charset-normalizer: utf_8 (1.40 ms)
chardet: utf-8 (4.12 ms)
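The figures above are illustrative. This guide's rough expectation is a 2-4× speedup on small payloads, but a single timed call is noisy, so repeat the measurement on your own samples. Both detect UTF-8 reliably.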
Recipe 5 — CLI usage for quick file inspection.
normalizer ./mystery.csv
Output: a JSON report for the file, with fields such as path, encoding, language, chaos, and coherence.
Pipe-friendly: pass --with-alternative to include ranked candidates; --minimal prints only the detected encoding.
Performance tuning
- from_path(..., steps=10) controls how many chunks of the payload are analyzed (chunk_size= sets their size); the file itself is still read in full. Tune both to match your latency budget.
- from_bytes(..., explain=False) disables the detailed analysis log (already the default). Don't enable it in production.
- Cache the result per file if you re-process the same file repeatedly.
- Use the compiled (mypyc) wheel. pip install -U charset-normalizer typically pulls one; verify with normalizer --version, which reports whether the speedup is ON or OFF.
- For very small inputs (<50 bytes), chardet may actually be more accurate. Heuristic detectors all struggle with short text, so fall back to an explicit encoding hint where possible.
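Reading a bounded sample yourself is one way to cap both I/O and analysis time on large files. A minimal sketch under stated assumptions: detect_from_sample and SAMPLE_SIZE are hypothetical names, and the UTF-8 fast path is this sketch's own addition, not library behavior. The fast path exists because a fixed-size cut can split a multi-byte UTF-8 character, which would make an otherwise valid sample fail a strict decode. Pure-ASCII files are reported as utf_8 here, since ASCII is a subset of UTF-8.

```python
import codecs
from typing import Optional

from charset_normalizer import from_bytes

SAMPLE_SIZE = 64 * 1024  # hypothetical budget; tune to your latency target


def detect_from_sample(path: str, sample_size: int = SAMPLE_SIZE) -> Optional[str]:
    """Guess a file's encoding from its first `sample_size` bytes."""
    with open(path, "rb") as fh:
        sample = fh.read(sample_size)

    # An incremental decoder with final=False tolerates an incomplete
    # trailing sequence, so truncated but valid UTF-8 is accepted cheaply.
    # NUL bytes hint at UTF-16/32, so those samples go to the detector.
    if b"\x00" not in sample:
        try:
            codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
            return "utf_8"
        except UnicodeDecodeError:
            pass

    best = from_bytes(sample).best()
    return best.encoding if best is not None else None
```

Wrapping the function in a cache keyed on path and modification time would cover the "cache the result per file" advice as well.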
Version migration guide
- 2.x → 3.0: minimum Python 3.7; CharsetNormalizerMatches renamed CharsetMatches; some helper functions moved into the legacy shim and are scheduled for removal.
- 3.0 → 3.1: from_path semantics around very small files tightened; chaos scoring rebalanced.
- 3.2 → 3.3: wheel build picks up mypyc acceleration on common platforms.
- 3.3 → 3.4: minor scoring improvements; behavior for empty input unified across from_* functions.
# Pre-3.0
from charset_normalizer import CharsetNormalizerMatches as match
# 3.0+
from charset_normalizer import CharsetMatches as match
Output: same data class, new canonical name.
Security considerations
- Decoded text from untrusted sources is still untrusted. Detection picks an encoding; it doesn't sanitize the content. Validate / escape downstream.
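- Avoid errors="ignore" when decoding: silently dropping bytes can mask attacks (e.g. directory-traversal sequences split by an invalid byte).
- DoS via huge inputs: from_path will read the whole file. Cap input size before detection; steps= limits analysis, not I/O.
- Permissive matching: chaos > 0.7 outputs are unreliable; treat them as "unknown" rather than a confident detection. Scores that high only appear if you raise threshold=, because the default (0.2) already discards noisier candidates, so in practice a missing match (best() returns None) is the usual signal.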
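The errors="ignore" risk can be shown with the standard library alone. A minimal sketch with a made-up payload: an invalid byte hides a traversal sequence from a check on the raw bytes, and "ignore" then reassembles it.

```python
# Attacker-controlled bytes: the invalid byte 0xFF splits "../".
payload = b".\xff./etc/passwd"

ignored = payload.decode("utf-8", errors="ignore")
replaced = payload.decode("utf-8", errors="replace")

print(ignored)         # ../etc/passwd
print(repr(replaced))  # the U+FFFD marker stays visible between the dots

# A naive check on the raw bytes passes, yet the "ignore" result is a traversal.
print(b"../" in payload, "../" in ignored, "../" in replaced)  # False True False
```

Validating after decoding, and using "replace" or strict decoding, keeps the anomaly visible to downstream checks.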
Testing & CI
The detector is deterministic for a given input and library version, so testing is straightforward: pin sample bytes and assert on the result.
from charset_normalizer import from_bytes

def test_detects_cp1252():
    # Windows-1252 smart quotes around an English phrase
    raw = b"He said \x93hi\x94, then left."
    best = from_bytes(raw).best()
    assert best is not None
    assert best.encoding in {"cp1252", "windows_1252"}
    assert "hi" in str(best)
Output: expected to pass, but fixtures this short are sensitive to scoring changes between releases; prefer longer samples for stable tests.
For CI, also lint that no errors="ignore" calls slipped in:
grep -rE 'decode\("[^"]+",\s*errors\s*=\s*"ignore"\)' src/ && exit 1 || true
Output: exits 1 if any file in src/ uses errors="ignore" on a decode call.
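The grep pattern only catches one spelling of the call. As an alternative, a small AST walk handles single quotes, extra whitespace, and multi-line calls. A minimal sketch, assuming Python 3.8+; find_ignore_decodes is a hypothetical helper. It flags any call that passes errors="ignore" as a keyword (including open()), and it does not catch the positional form decode("utf-8", "ignore").

```python
import ast
from typing import List, Tuple


def find_ignore_decodes(source: str, filename: str = "<string>") -> List[Tuple[str, int]]:
    """Return (filename, line) for each call passing errors="ignore"."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if not isinstance(node, ast.Call):
            continue
        for kw in node.keywords:
            if (
                kw.arg == "errors"
                and isinstance(kw.value, ast.Constant)
                and kw.value.value == "ignore"
            ):
                hits.append((filename, node.lineno))
    return hits


sample = '''
ok = raw.decode("utf-8", errors="replace")
bad = raw.decode(
    "utf-8",
    errors = 'ignore',
)
'''
print(find_ignore_decodes(sample, "example.py"))  # [('example.py', 3)]
```

In CI, the same function could be run over every .py file under src/ and the job failed when the list is non-empty.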
Ecosystem integrations
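- requests: depends on charset-normalizer as its default character-set detector since 2.26 (replaced chardet).
- httpx: older releases used charset-normalizer for response.text; newer ones default to UTF-8 and let you plug a detector in through the client's default_encoding callable. Check the behavior of your installed version.
- pandas: no built-in integration, but useful for choosing the read_csv encoding= value on unknown files.
- ftfy: pair with charset-normalizer when content is known mojibake (for example, text that was encoded twice).
- CLI: normalizer, the bundled CLI for quick inspection from a shell.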
Compatibility matrix
| Python | charset-normalizer | Notes |
|---|---|---|
| 3.6 | 2.x (frozen) | Final supported line for 3.6. |
| 3.7 | 3.x | Floor for 3.x. |
| 3.8 | 3.x | Stable. |
| 3.9 | 3.x | Stable. |
| 3.10 | 3.x | Stable. |
| 3.11 | 3.x | Best perf. |
| 3.12 | 3.x | Stable. |
| 3.13 | 3.x | Wheel available; mypyc accelerated. |
Production deployment
- Constrain the major line in libraries (charset-normalizer>=3.3,<4) and pin exact versions in application lockfiles. The detector occasionally changes scoring between minors; flaky tests can result.
- Don't disable acceleration. Install from a compiled wheel where one exists; falling back to pure Python in production means a slowdown of 2× or more on large payloads.
- Health-check encoding sensitivity. If your service processes user-uploaded text, log detection failures (best() returned None, or chaos above your own cutoff); they often indicate malformed clients.
- Bound input size. For files >10 MB, read a fixed-size sample and pass it to from_bytes() rather than loading the whole file through from_path().
- Audit errors= parameters across the codebase: replace is safe, ignore masks bugs.
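The size bound and the failure logging from the list above can live in one small entry point. A minimal sketch; decode_upload, the logger name, and the source label are hypothetical, and the limit mirrors the 10 MB guideline:

```python
import logging
from typing import Optional

from charset_normalizer import from_bytes

log = logging.getLogger("encoding_health")  # hypothetical logger name
MAX_BYTES = 10 * 1024 * 1024  # mirrors the 10 MB guideline above


def decode_upload(raw: bytes, source: str, max_bytes: int = MAX_BYTES) -> Optional[str]:
    """Decode an upload, logging instead of guessing when detection fails."""
    if len(raw) > max_bytes:
        log.warning("refusing %s: %d bytes exceeds limit of %d", source, len(raw), max_bytes)
        return None

    best = from_bytes(raw).best()
    if best is None:
        log.warning("no plausible encoding for %s (%d bytes)", source, len(raw))
        return None

    log.debug("%s decoded as %s (chaos=%.3f)", source, best.encoding, best.chaos)
    return str(best)
```

Returning None rather than a lossy decode leaves the policy decision (reject, quarantine, or fall back) to the caller.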
When NOT to use this
- You know the encoding. If the file/header says UTF-8, decode it as UTF-8. Don't paper over real bugs with detection.
- You need code-page-specific behavior (Asian legacy encodings) where commercial tools or cchardet historically did better. Test against your data before committing.
- Tight resource budgets. Running detection over a 100 MB file is wasteful when you can sample the first 4 KB.
- You're decoding network protocols with strict encoding rules (HTTP headers must be ASCII; DNS must be ASCII via IDNA). Use the protocol's defined encoding instead.
Troubleshooting common errors
| Error / Symptom | Likely cause | Fix |
|---|---|---|
| Wrong encoding detected on short input | Heuristics need more data | Sample more bytes; restrict the candidates with cp_isolation=. |
| MemoryError on a huge file | from_path read everything | Read a fixed-size sample and pass it to from_bytes(). |
| UnicodeDecodeError when decoding output() | output() returns bytes re-encoded to UTF-8 by default, not bytes in the detected encoding | Use str(match) for text, or decode output() as UTF-8. |
| normalizer: command not found | Bin not on PATH | Use python -m charset_normalizer or fix PATH. |
| Slower than expected | Pure-Python wheel installed | Reinstall (pip install -U --force-reinstall charset-normalizer) to get the accelerated build. |
| Conflicting transitive pin | requests and another lib disagree on range | Resolve via pip install "charset-normalizer~=3.3". |
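The cp_isolation hint from the first row restricts detection to encodings you consider possible. A minimal sketch; the candidate list and the sample sentence are assumptions made for illustration:

```python
from charset_normalizer import from_bytes

# Assumption for this sketch: the data source only ever emits one of these.
CANDIDATES = ["utf_8", "cp1252", "latin_1"]

raw = "Les élèves étudient à l'école française depuis l'été dernier.".encode("cp1252")

best = from_bytes(raw, cp_isolation=CANDIDATES).best()
if best is None:
    print("none of the candidate encodings fit")
else:
    print(best.encoding)
```

Narrowing the candidates does not make short inputs unambiguous (cp1252 and latin_1 still overlap), but it prevents unrelated code pages from winning.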
See also
- Concept: HTTP — protocol fundamentals
- Packages: pip-requests — primary consumer
- Python: requests — high-level HTTP usage
- Official charset-normalizer repo