cheat sheet

charset-normalizer

Package-level reference for charset-normalizer on PyPI — what it does, install, integration with requests, version policy, and alternatives.

updated 05-31-2026

charset-normalizer

What it is

charset-normalizer is a pure-Python library for guessing the character encoding of arbitrary bytes. It is the default character-set detector inside requests 3.x and later 2.x releases, replacing the long-running chardet dependency. The library inspects byte distributions, language-specific n-grams, and mojibake heuristics to return a ranked list of plausible encodings.

Reach for charset-normalizer when you need to: process a file whose encoding is unknown or untrustworthy (legacy logs, scraped HTML, user uploads); replace chardet for licensing reasons (it's MIT vs chardet's LGPL); or get higher-quality detection on small (<100 byte) samples than chardet provides.

Install

bash

pip install charset-normalizer

Output: (none — exits 0 on success)

bash

uv add charset-normalizer

Output: dependency resolved + added to pyproject.toml

bash

poetry add charset-normalizer

Output: updated lockfile + virtualenv install

bash

pip install "charset-normalizer[unicode_backport]"

Output: installs additional Unicode tables for older Python versions where the stdlib unicodedata table lacks coverage.

Versioning & Python support

Current line is the 3.x series. Semantic versioning is observed — minor releases are backwards-compatible.
Supports Python 3.7+ on recent releases (3.4+ was supported on the 2.x series).
The 2.x → 3.x boundary was the largest API cleanup; 3.x is stable since 2022.
requests declared charset-normalizer ~= 3.0 as a transitive dep, so pinning is rarely a problem.

Package metadata

Maintainer: Ahmed TAHRI (@Ousret)
Project home: github.com/jawah/charset_normalizer
Docs: charset-normalizer.readthedocs.io
PyPI: pypi.org/project/charset-normalizer
License: MIT
Governance: single-maintainer; under the jawah GitHub org
First released: 2019
Downloads: consistently top 10 on PyPI (transitive via requests)

Optional dependencies & extras

charset-normalizer[unicode_backport] — installs unicodedata2 for richer Unicode tables on Python 3.7-3.9 where the stdlib table is older.
No required runtime deps — the library is pure Python.
An optional Rust extension is provided via the mypyc build for ~2× speedup on big payloads (transparent when wheel is installed).

Alternatives

Package	Trade-off
`chardet`	The original. LGPL-licensed; slower; long-time `requests` default before `3.x`.
`cchardet` (or `faust-cchardet`)	Cython wrapper over a C library. Fastest, but binary dep and maintenance has been patchy.
`ftfy`	Solves a different problem — fixes already-broken mojibake. Pair with `charset-normalizer` for full-stack recovery.
Standard `bytes.decode("utf-8", errors="replace")`	Use when you're confident the encoding is UTF-8. Avoid `errors="ignore"`.

Common gotchas

Detection is heuristic, not deterministic. Short inputs (<50 bytes) or inputs that look valid under multiple encodings (a lot of CP1252 ↔ ISO-8859-1) produce uncertain results. Always check the chaos and coherence fields.
from_bytes() returns a CharsetMatches collection. Index [0] for the best match; iterate for ranked alternatives.
Encoding name normalization differs from codecs. from_bytes(...).best().encoding returns names like cp1252, utf_8, iso8859_5 — matching Python codec aliases but not always equal to the codepage number.
.output() decodes lazily. Use str(match) to materialize the decoded text.
from_path() reads the entire file by default. For huge files, pass steps=... or read a sample and call from_bytes() yourself.
Detection-vs-coercion confusion. charset-normalizer decides what an encoding most likely IS; it does NOT re-encode broken UTF-8 (that's ftfy's job).
CLI is shipped as normalizer (not charset-normalizer). Add it to PATH via pip install --user or use python -m charset_normalizer.

Real-world recipes

The recipes lean toward the common content-pipeline use cases: detecting unknown files, integrating with requests, and benchmarking against chardet.

Recipe 1 — Detect the encoding of a file on disk.

python

from charset_normalizer import from_path

results = from_path("./mystery.csv")
best = results.best()
print(best.encoding, best.language, best.chaos, best.coherence)
print(str(best)[:200])

Output:

csharp

cp1252 English 0.013 0.951
<first 200 chars of the decoded file>

chaos is a noise score (lower is better); coherence is a confidence score (higher is better). Both range 0-1.

Recipe 2 — Force a fallback when detection is uncertain.

python

from charset_normalizer import from_bytes

raw = open("./short.bin", "rb").read()
results = from_bytes(raw)
if not results or results.best().chaos > 0.5:
    text = raw.decode("utf-8", errors="replace")    # fallback
else:
    text = str(results.best())
print(text[:100])

Output: uses the detector when confident; falls back to UTF-8 with replace (which yields � for invalid bytes) when detection is shaky.

Recipe 3 — Integration with requests to recover encoding.

python

import requests
from charset_normalizer import from_bytes

r = requests.get("https://example.com/article", timeout=5)
# requests normally guesses from headers; override with the detector for unreliable servers:
match = from_bytes(r.content).best()
text = str(match) if match else r.text
print(match.encoding if match else r.encoding)

Output: the better of the two encodings. Useful for legacy sites that lie about Content-Type: charset= (many Asian sites mislabel as iso-8859-1).

Recipe 4 — Benchmark charset-normalizer vs chardet.

python

import time
from charset_normalizer import from_bytes
import chardet  # pip install chardet

with open("./sample.txt", "rb") as f:
    data = f.read()

t = time.perf_counter(); cn = from_bytes(data).best().encoding; t1 = time.perf_counter() - t
t = time.perf_counter(); cd = chardet.detect(data)["encoding"]; t2 = time.perf_counter() - t

print(f"charset-normalizer: {cn} ({t1*1000:.2f} ms)")
print(f"chardet:            {cd} ({t2*1000:.2f} ms)")

Output:

makefile

charset-normalizer: utf_8 (1.40 ms)
chardet:            utf-8 (4.12 ms)

Typical 2-4× speedup on small payloads; both detect UTF-8 reliably.

Recipe 5 — CLI usage for quick file inspection.

bash

normalizer ./mystery.csv

Output:

scss

mystery.csv → cp1252 (English, confidence 95%)

Pipe-friendly: pass --alternative to see ranked candidates; --minimal for one-line output.

Performance tuning

from_path(..., steps=10) samples the file at N positions instead of reading everything. Tune to match your latency budget.
from_bytes(..., explain=False) disables the detailed analysis log (already the default). Don't enable in production.
Cache the result per file if you re-process the same file repeatedly.
Use the C-accelerated wheel. pip install -U charset-normalizer typically pulls a wheel; verify with python -c "from charset_normalizer.utils import is_accelerated_module; print(is_accelerated_module())".
For very small inputs (<50 bytes), chardet may actually be more accurate. Heuristic detectors all struggle with short text — fall back to an explicit encoding hint where possible.

Version migration guide

2.x → 3.0 — minimum Python 3.7; CharsetNormalizerMatches renamed CharsetMatches; some helper functions moved into the legacy shim and are scheduled for removal.
3.0 → 3.1 — from_path semantics around very small files tightened; chaos scoring rebalanced.
3.2 → 3.3 — wheel build picks up mypyc acceleration on common platforms.
3.3 → 3.4 — minor scoring improvements; behavior for empty input unified across from_* functions.

python

# Pre-3.0
from charset_normalizer import CharsetNormalizerMatches as match
# 3.0+
from charset_normalizer import CharsetMatches as match

Output: same data class, new canonical name.

Security considerations

Decoded text from untrusted sources is still untrusted. Detection picks an encoding; it doesn't sanitize the content. Validate / escape downstream.
Avoid errors="ignore" when decoding — silently dropping bytes can mask attacks (e.g. directory-traversal sequences).
DoS via huge inputs — from_path will read the whole file. Cap input size or use steps=.
Permissive matching — chaos > 0.7 outputs are unreliable; treat them as "unknown" rather than a confident detection.

Testing & CI

The detector is deterministic per input, so testing is straightforward — pin sample bytes and assert on the result.

python

from charset_normalizer import from_bytes

def test_detects_cp1252():
    # Windows-1252 smart-quotes around an English phrase
    raw = b"He said \x93hi\x94, then left."
    best = from_bytes(raw).best()
    assert best.encoding in {"cp1252", "windows_1252"}
    assert "hi" in str(best)

Output: assertion holds across versions.

For CI, also lint that no errors="ignore" calls slipped in:

bash

grep -rE 'decode\("[^"]+",\s*errors\s*=\s*"ignore"\)' src/ && exit 1 || true

Output: exits 1 if any file in src/ uses errors="ignore" on a decode call.

Ecosystem integrations

requests — default character-set detector since 2.32.x (replaced chardet).
httpx — uses charset-normalizer for response.text if installed.
pandas — no built-in integration, but useful for read_csv encoding= selection on unknown files.
ftfy — pair with charset-normalizer when content is known mojibake (encoded twice).
CLI: normalizer — bundled CLI for quick inspection from a shell.

Compatibility matrix

Python	`charset-normalizer`	Notes
3.6	`2.x` (frozen)	Final supported line for 3.6.
3.7	`3.x`	Floor for 3.x.
3.8	`3.x`	Stable.
3.9	`3.x`	Stable.
3.10	`3.x`	Stable.
3.11	`3.x`	Best perf.
3.12	`3.x`	Stable.
3.13	`3.x`	Wheel available; mypyc accelerated.

Production deployment

Pin a tight minor range in libraries (charset-normalizer>=3.3,<4). The detector occasionally changes scoring between minors; flaky tests can result.
Don't disable acceleration. Always install via wheel; falling back to pure-Python on production is a 3-5× regression.
Health-check encoding sensitivity. If your service processes user-uploaded text, log detection failures (chaos > 0.5) — they often indicate malformed clients.
Bound input size. Files >10 MB should be sampled (steps=) rather than read in full.
Audit errors= parameters across the codebase — replace is safe, ignore masks bugs.

When NOT to use this

You know the encoding. If the file/header says UTF-8, decode it as UTF-8. Don't paper over real bugs with detection.
You need code-page-specific behavior (Asian legacy encodings) where commercial tools or cchardet historically did better. Test against your data before committing.
Tight resource budgets. Decoding a 100 MB file via detection is wasteful when you can sample the first 4 KB.
You're decoding network protocols with strict encoding rules (HTTP headers must be ASCII; DNS must be ASCII via IDNA). Use the protocol's defined encoding instead.

Troubleshooting common errors

Error / Symptom	Likely cause	Fix
Wrong encoding detected on short input	Heuristics need more data	Sample more bytes; supply a `cp_isolation` hint.
`MemoryError` on a huge file	`from_path` read everything	Use `from_path(..., steps=10)` or read a fixed-size sample.
`UnicodeDecodeError` after `str(match)`	Used `output()` raw bytes path	Use `str(match)` directly; it handles re-encoding to UTF-8.
`normalizer: command not found`	Bin not on PATH	Use `python -m charset_normalizer` or fix `PATH`.
Slower than expected	Pure-Python wheel installed	Reinstall (`pip install -U --force-reinstall charset-normalizer`) to get the accelerated build.
Conflicting transitive pin	`requests` and another lib disagree on range	Resolve via `pip install "charset-normalizer~=3.3"`.

charset-normalizer

What it is

Install

Versioning & Python support

Package metadata

Optional dependencies & extras

Alternatives

Common gotchas

Real-world recipes

Performance tuning

Version migration guide

Security considerations

Testing & CI

Ecosystem integrations

Compatibility matrix

Production deployment

When NOT to use this

Troubleshooting common errors

See also