cheat sheet

charset-normalizer

Package-level reference for charset-normalizer on PyPI — what it does, install, integration with requests, version policy, and alternatives.

charset-normalizer

What it is

charset-normalizer is a pure-Python library for guessing the character encoding of arbitrary bytes. It is the default character-set detector inside requests 3.x and later 2.x releases, replacing the long-running chardet dependency. The library inspects byte distributions, language-specific n-grams, and mojibake heuristics to return a ranked list of plausible encodings.

Reach for charset-normalizer when you need to: process a file whose encoding is unknown or untrustworthy (legacy logs, scraped HTML, user uploads); replace chardet for licensing reasons (it's MIT vs chardet's LGPL); or get higher-quality detection on small (<100 byte) samples than chardet provides.

Install

bash
pip install charset-normalizer

Output: (none — exits 0 on success)

bash
uv add charset-normalizer

Output: dependency resolved + added to pyproject.toml

bash
poetry add charset-normalizer

Output: updated lockfile + virtualenv install

bash
pip install "charset-normalizer[unicode_backport]"

Output: installs additional Unicode tables for older Python versions where the stdlib unicodedata table lacks coverage.

Versioning & Python support

  • Current line is the 3.x series. Semantic versioning is observed — minor releases are backwards-compatible.
  • Supports Python 3.7+ on recent releases (3.4+ was supported on the 2.x series).
  • The 2.x → 3.x boundary was the largest API cleanup; 3.x is stable since 2022.
  • requests declared charset-normalizer ~= 3.0 as a transitive dep, so pinning is rarely a problem.

Package metadata

  • Maintainer: Ahmed TAHRI (@Ousret)
  • Project home: github.com/jawah/charset_normalizer
  • Docs: charset-normalizer.readthedocs.io
  • PyPI: pypi.org/project/charset-normalizer
  • License: MIT
  • Governance: single-maintainer; under the jawah GitHub org
  • First released: 2019
  • Downloads: consistently top 10 on PyPI (transitive via requests)

Optional dependencies & extras

  • charset-normalizer[unicode_backport] — installs unicodedata2 for richer Unicode tables on Python 3.7-3.9 where the stdlib table is older.
  • No required runtime deps — the library is pure Python.
  • An optional Rust extension is provided via the mypyc build for ~2× speedup on big payloads (transparent when wheel is installed).

Alternatives

PackageTrade-off
chardetThe original. LGPL-licensed; slower; long-time requests default before 3.x.
cchardet (or faust-cchardet)Cython wrapper over a C library. Fastest, but binary dep and maintenance has been patchy.
ftfySolves a different problem — fixes already-broken mojibake. Pair with charset-normalizer for full-stack recovery.
Standard bytes.decode("utf-8", errors="replace")Use when you're confident the encoding is UTF-8. Avoid errors="ignore".

Common gotchas

  1. Detection is heuristic, not deterministic. Short inputs (<50 bytes) or inputs that look valid under multiple encodings (a lot of CP1252 ↔ ISO-8859-1) produce uncertain results. Always check the chaos and coherence fields.
  2. from_bytes() returns a CharsetMatches collection. Index [0] for the best match; iterate for ranked alternatives.
  3. Encoding name normalization differs from codecs. from_bytes(...).best().encoding returns names like cp1252, utf_8, iso8859_5 — matching Python codec aliases but not always equal to the codepage number.
  4. .output() decodes lazily. Use str(match) to materialize the decoded text.
  5. from_path() reads the entire file by default. For huge files, pass steps=... or read a sample and call from_bytes() yourself.
  6. Detection-vs-coercion confusion. charset-normalizer decides what an encoding most likely IS; it does NOT re-encode broken UTF-8 (that's ftfy's job).
  7. CLI is shipped as normalizer (not charset-normalizer). Add it to PATH via pip install --user or use python -m charset_normalizer.

Real-world recipes

The recipes lean toward the common content-pipeline use cases: detecting unknown files, integrating with requests, and benchmarking against chardet.

Recipe 1 — Detect the encoding of a file on disk.

python
from charset_normalizer import from_path

results = from_path("./mystery.csv")
best = results.best()
print(best.encoding, best.language, best.chaos, best.coherence)
print(str(best)[:200])

Output:

csharp
cp1252 English 0.013 0.951
<first 200 chars of the decoded file>

chaos is a noise score (lower is better); coherence is a confidence score (higher is better). Both range 0-1.

Recipe 2 — Force a fallback when detection is uncertain.

python
from charset_normalizer import from_bytes

raw = open("./short.bin", "rb").read()
results = from_bytes(raw)
if not results or results.best().chaos > 0.5:
    text = raw.decode("utf-8", errors="replace")    # fallback
else:
    text = str(results.best())
print(text[:100])

Output: uses the detector when confident; falls back to UTF-8 with replace (which yields for invalid bytes) when detection is shaky.

Recipe 3 — Integration with requests to recover encoding.

python
import requests
from charset_normalizer import from_bytes

r = requests.get("https://example.com/article", timeout=5)
# requests normally guesses from headers; override with the detector for unreliable servers:
match = from_bytes(r.content).best()
text = str(match) if match else r.text
print(match.encoding if match else r.encoding)

Output: the better of the two encodings. Useful for legacy sites that lie about Content-Type: charset= (many Asian sites mislabel as iso-8859-1).

Recipe 4 — Benchmark charset-normalizer vs chardet.

python
import time
from charset_normalizer import from_bytes
import chardet  # pip install chardet

with open("./sample.txt", "rb") as f:
    data = f.read()

t = time.perf_counter(); cn = from_bytes(data).best().encoding; t1 = time.perf_counter() - t
t = time.perf_counter(); cd = chardet.detect(data)["encoding"]; t2 = time.perf_counter() - t

print(f"charset-normalizer: {cn} ({t1*1000:.2f} ms)")
print(f"chardet:            {cd} ({t2*1000:.2f} ms)")

Output:

makefile
charset-normalizer: utf_8 (1.40 ms)
chardet:            utf-8 (4.12 ms)

Typical 2-4× speedup on small payloads; both detect UTF-8 reliably.

Recipe 5 — CLI usage for quick file inspection.

bash
normalizer ./mystery.csv

Output:

scss
mystery.csv → cp1252 (English, confidence 95%)

Pipe-friendly: pass --alternative to see ranked candidates; --minimal for one-line output.

Performance tuning

  • from_path(..., steps=10) samples the file at N positions instead of reading everything. Tune to match your latency budget.
  • from_bytes(..., explain=False) disables the detailed analysis log (already the default). Don't enable in production.
  • Cache the result per file if you re-process the same file repeatedly.
  • Use the C-accelerated wheel. pip install -U charset-normalizer typically pulls a wheel; verify with python -c "from charset_normalizer.utils import is_accelerated_module; print(is_accelerated_module())".
  • For very small inputs (<50 bytes), chardet may actually be more accurate. Heuristic detectors all struggle with short text — fall back to an explicit encoding hint where possible.

Version migration guide

  • 2.x → 3.0 — minimum Python 3.7; CharsetNormalizerMatches renamed CharsetMatches; some helper functions moved into the legacy shim and are scheduled for removal.
  • 3.0 → 3.1from_path semantics around very small files tightened; chaos scoring rebalanced.
  • 3.2 → 3.3 — wheel build picks up mypyc acceleration on common platforms.
  • 3.3 → 3.4 — minor scoring improvements; behavior for empty input unified across from_* functions.
python
# Pre-3.0
from charset_normalizer import CharsetNormalizerMatches as match
# 3.0+
from charset_normalizer import CharsetMatches as match

Output: same data class, new canonical name.

Security considerations

  • Decoded text from untrusted sources is still untrusted. Detection picks an encoding; it doesn't sanitize the content. Validate / escape downstream.
  • Avoid errors="ignore" when decoding — silently dropping bytes can mask attacks (e.g. directory-traversal sequences).
  • DoS via huge inputsfrom_path will read the whole file. Cap input size or use steps=.
  • Permissive matchingchaos > 0.7 outputs are unreliable; treat them as "unknown" rather than a confident detection.

Testing & CI

The detector is deterministic per input, so testing is straightforward — pin sample bytes and assert on the result.

python
from charset_normalizer import from_bytes

def test_detects_cp1252():
    # Windows-1252 smart-quotes around an English phrase
    raw = b"He said \x93hi\x94, then left."
    best = from_bytes(raw).best()
    assert best.encoding in {"cp1252", "windows_1252"}
    assert "hi" in str(best)

Output: assertion holds across versions.

For CI, also lint that no errors="ignore" calls slipped in:

bash
grep -rE 'decode\("[^"]+",\s*errors\s*=\s*"ignore"\)' src/ && exit 1 || true

Output: exits 1 if any file in src/ uses errors="ignore" on a decode call.

Ecosystem integrations

  • requests — default character-set detector since 2.32.x (replaced chardet).
  • httpx — uses charset-normalizer for response.text if installed.
  • pandas — no built-in integration, but useful for read_csv encoding= selection on unknown files.
  • ftfy — pair with charset-normalizer when content is known mojibake (encoded twice).
  • CLI: normalizer — bundled CLI for quick inspection from a shell.

Compatibility matrix

Pythoncharset-normalizerNotes
3.62.x (frozen)Final supported line for 3.6.
3.73.xFloor for 3.x.
3.83.xStable.
3.93.xStable.
3.103.xStable.
3.113.xBest perf.
3.123.xStable.
3.133.xWheel available; mypyc accelerated.

Production deployment

  • Pin a tight minor range in libraries (charset-normalizer>=3.3,<4). The detector occasionally changes scoring between minors; flaky tests can result.
  • Don't disable acceleration. Always install via wheel; falling back to pure-Python on production is a 3-5× regression.
  • Health-check encoding sensitivity. If your service processes user-uploaded text, log detection failures (chaos > 0.5) — they often indicate malformed clients.
  • Bound input size. Files >10 MB should be sampled (steps=) rather than read in full.
  • Audit errors= parameters across the codebase — replace is safe, ignore masks bugs.

When NOT to use this

  • You know the encoding. If the file/header says UTF-8, decode it as UTF-8. Don't paper over real bugs with detection.
  • You need code-page-specific behavior (Asian legacy encodings) where commercial tools or cchardet historically did better. Test against your data before committing.
  • Tight resource budgets. Decoding a 100 MB file via detection is wasteful when you can sample the first 4 KB.
  • You're decoding network protocols with strict encoding rules (HTTP headers must be ASCII; DNS must be ASCII via IDNA). Use the protocol's defined encoding instead.

Troubleshooting common errors

Error / SymptomLikely causeFix
Wrong encoding detected on short inputHeuristics need more dataSample more bytes; supply a cp_isolation hint.
MemoryError on a huge filefrom_path read everythingUse from_path(..., steps=10) or read a fixed-size sample.
UnicodeDecodeError after str(match)Used output() raw bytes pathUse str(match) directly; it handles re-encoding to UTF-8.
normalizer: command not foundBin not on PATHUse python -m charset_normalizer or fix PATH.
Slower than expectedPure-Python wheel installedReinstall (pip install -U --force-reinstall charset-normalizer) to get the accelerated build.
Conflicting transitive pinrequests and another lib disagree on rangeResolve via pip install "charset-normalizer~=3.3".

See also