cheat sheet
beautifulsoup4
Package-level reference for beautifulsoup4 on PyPI — install variants, parser-backend selection (lxml/html5lib/html.parser), and alternatives.
What it is
beautifulsoup4 (PyPI name; imported as bs4) is a Python library by Leonard Richardson for parsing HTML and XML into a navigable tree, then searching and mutating it. It does not fetch pages and does not execute JavaScript — pair it with requests or httpx to fetch, and with playwright or selenium when the page needs JS to render.
The library is a façade over one of three parser backends: the stdlib html.parser, the C-based lxml, or the spec-compliant html5lib. Picking the right backend matters more than picking BeautifulSoup itself.
The PyPI distribution is beautifulsoup4. The import name is bs4. The older BeautifulSoup (no 4) package is the abandoned 3.x line; do not install it.
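As a minimal sketch of the parse, search, mutate cycle described above: the markup here is made up for illustration, and html.parser is used so that nothing beyond beautifulsoup4 itself needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, standing in for a page fetched with requests or httpx.
html = '<div class="card"><h1>Title</h1><a href="/old">link</a></div>'

soup = BeautifulSoup(html, "html.parser")  # parse into a navigable tree

heading = soup.find("h1")                  # search: first <h1>, or None
heading_text = heading.get_text() if heading else ""

link = soup.find("a")
if link is not None:
    link["href"] = "/new"                  # mutate an attribute in place

result = str(soup)                         # serialise the modified tree
```

Note that the install name (beautifulsoup4) never appears in code; only the bs4 import does.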
Install
pip install beautifulsoup4
Output: pip's usual progress lines; exits 0 on success. Installs bs4 and soupsieve. No third-party parser is installed, so BeautifulSoup falls back to the stdlib html.parser.
pip install beautifulsoup4 lxml
Output: installs the fast C-backed parser. Recommended for production scraping.
pip install beautifulsoup4 html5lib
Output: installs the spec-compliant Python parser. Slower but handles malformed HTML the way browsers do.
uv add beautifulsoup4 lxml
Output: added to pyproject.toml
poetry add beautifulsoup4 lxml
Output: updated lockfile + virtualenv install
Versioning & Python support
- Current line is 4.x; the 4 in the package name is the major version. Has been on 4.x since 2012.
- Minor-release cadence is irregular: a few releases per year.
- Recent releases support Python 3.7+; the project drops roughly one Python minor version per minor release.
- Loose semver: 4.13 (2025) shipped some find_all behaviour tweaks; check the changelog before upgrading on a production scraper.
- The BeautifulSoup 3.x line is abandoned; do not install the BeautifulSoup (no-4) package.
Package metadata
- Maintainer: Leonard Richardson (leonardr), original author and still primary maintainer
- Project home: crummy.com/software/BeautifulSoup
- Source: code.launchpad.net/beautifulsoup (Git on Launchpad; formerly Bazaar). Unusual: not GitHub-hosted
- Docs: crummy.com/software/BeautifulSoup/bs4/doc
- PyPI: pypi.org/project/beautifulsoup4
- License: MIT
- Governance: single maintainer, very long-running project
- First released: 2004 (3.x line), 2012 (current 4.x line)
- Downloads: tens of millions per month
Optional dependencies & extras
Beautifulsoup4 declares no PyPI extras — you install parser backends as separate packages. The choice matters:
| Parser | Install | Speed | Lenient? | Notes |
|---|---|---|---|---|
| html.parser | stdlib | Slow | Moderate | Default fallback if no other parser is installed. Stricter on malformed HTML. |
| lxml | pip install lxml | Fast (C) | Yes | Recommended for production. Requires libxml2 system libraries; wheels usually ship them, but exotic platforms can require apt install libxml2-dev libxslt-dev first. |
| html5lib | pip install html5lib | Slow (pure Python) | Most lenient | Parses the way modern browsers do; best for the worst-broken HTML. |
| lxml-xml / xml | pip install lxml | Fast | n/a | Use for actual XML, not HTML. |
soupsieve is a hard dependency (pulled in automatically); it provides the CSS-selector engine for soup.select(...). SoupSieve was added in BeautifulSoup 4.7 (released January 2019); before that, CSS selectors were partially supported in-tree.
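A short sketch of what the soupsieve-backed selector calls look like in practice. The list markup and class names are invented for the example; html.parser is used only to keep the snippet dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<ul>
  <li class="item"><a href="/a">A</a></li>
  <li class="item">no link</li>
  <li class="item featured"><a href="/c">C</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# select() returns a list of every match.
hrefs = [a["href"] for a in soup.select("li.item > a")]

# :has() is one of the selectors handled by soupsieve.
with_links = soup.select("li:has(> a)")

# select_one() returns the first match, or None when nothing matches.
featured = soup.select_one("li.featured > a")
missing = soup.select_one("li.absent")
```

Because select_one can return None, the missing case needs a check before any attribute access.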
Alternatives
| Package | Trade-off |
|---|---|
| lxml (direct) | Use lxml.html directly when you need raw speed and don't need BeautifulSoup's API. ~2× faster on large documents. |
| selectolax | Modern, very fast C-backed HTML parser. CSS-selector first, no tree-mutation API. Use for read-only scraping at scale. |
| parsel | The Scrapy team's selector library. Wraps lxml with CSS + XPath. Use inside Scrapy pipelines. |
| pyquery | jQuery-style API over lxml. Fading; pick parsel or selectolax instead. |
| html5lib (direct) | Spec-compliant tokeniser. Slower; use only when you need exact browser behaviour. |
| playwright / selenium | For JS-rendered pages: fetch the rendered HTML and then feed it to BeautifulSoup. |
Common gotchas
- pip install BeautifulSoup (no 4) installs the abandoned 3.x line. The correct package name is beautifulsoup4 and the correct import is from bs4 import BeautifulSoup. The wrong package still resolves on PyPI but hasn't shipped a release in years.
- No parser specified → warning + guessed parser. BeautifulSoup(html) emits a warning, then uses the best parser it finds installed (html.parser if nothing else is available). Always pass features="lxml" explicitly to get deterministic behaviour across environments.
- lxml requires libxml2/libxslt. Wheels ship for common platforms (x86_64/arm64 Linux, macOS, Windows). On Alpine (musl), some BSDs, or older ARM platforms you fall back to source and need the system libraries pre-installed.
- html.parser is stricter than html5lib. Malformed HTML that browsers render fine may parse differently: closing tags may be inserted at unexpected points, missing tags may not be inferred. If your scraper works in a browser but not in BeautifulSoup, try features="html5lib".
- .find_all returns a list, .select returns a list, .find returns first match (or None). Forgetting the None case crashes scrapers on the one page where the element is missing. Always guard or use .select_one() + if.
- Pickling a parsed tree doesn't round-trip cleanly. The tree holds back-references to the parser. Serialise with str(soup) and re-parse instead.
- The SoupSieve CSS engine was added in 4.7 (2019). Code targeting older BS4 that uses select() for complex selectors may behave differently; pin beautifulsoup4>=4.7 if you rely on :has() or pseudo-classes.
- Source is on Launchpad, not GitHub. Filing issues requires a Launchpad account, not the usual GitHub Issues flow. PRs are accepted via email patches or Launchpad merge proposals.
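The None gotcha above is the one that bites most often, so here is one possible guard. The helper name first_text and the sample markup are hypothetical, not part of bs4.

```python
from bs4 import BeautifulSoup

def first_text(root, selector: str, default: str = "") -> str:
    """Return stripped text of the first match, or default when nothing matches."""
    node = root.select_one(selector)
    return node.get_text(strip=True) if node is not None else default

# Hypothetical markup: the article has a title link but no .date element.
html = '<article><h2><a href="/p/1">Post</a></h2></article>'
soup = BeautifulSoup(html, "html.parser")  # parser named explicitly

title = first_text(soup, "h2 > a")
date = first_text(soup, ".date", default="unknown")  # element is absent here
```

Routing every lookup through one helper keeps the guard in a single place instead of scattering if checks across the scraper.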
Real-world recipes
Paginated scraping with rate limiting
import time

import httpx
from bs4 import BeautifulSoup

BASE = "https://example.com/articles"

def scrape_listing(page: int) -> list[dict]:
    r = httpx.get(BASE, params={"page": page}, timeout=10.0)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "lxml")
    return [
        {
            "title": a.get_text(strip=True),
            "url": a["href"],
            # Guard the date too: select_one returns None on a miss
            "date": date.get_text(strip=True) if date else None,
        }
        for item in soup.select("article.post")
        for a in [item.select_one("h2 > a")]
        for date in [item.select_one(".date")]
        if a is not None
    ]

results = []
for page in range(1, 11):
    results.extend(scrape_listing(page))
    time.sleep(1.0)  # respect rate limit
The for a in [item.select_one("h2 > a")] clause, paired with if a is not None, is a one-line guard against missing children: select_one returns None when the selector matches nothing, and evaluating a["href"] on None crashes the scraper. Always guard.
Structured data extraction (JSON-LD)
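import json
from bs4 import BeautifulSoup

# html is the page source as a str, fetched elsewhere
soup = BeautifulSoup(html, "lxml")
for script in soup.find_all("script", type="application/ld+json"):
    try:
        data = json.loads(script.string or "")
    except json.JSONDecodeError:
        continue  # skip empty or malformed blocks
    # A block may hold a single object or a list of objects
    for entry in data if isinstance(data, list) else [data]:
        if isinstance(entry, dict) and entry.get("@type") == "Article":
            print(entry.get("headline"), entry.get("author"))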
JSON-LD blocks are the cleanest data source on most modern sites — they sidestep DOM scraping entirely. Parse the wrapping <script> text as JSON; the schema follows schema.org conventions.
Sitemap parsing
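from bs4 import BeautifulSoup
with open("sitemap.xml", "rb") as f:  # bytes, so the XML encoding declaration is honoured
    soup = BeautifulSoup(f, "xml")  # note: "xml" parser
urls = [loc.text for loc in soup.find_all("loc")]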
Pass "xml" to BeautifulSoup() to use lxml's XML mode (preserves case-sensitive tag names, doesn't treat tags as HTML). "lxml" would lowercase tag names and treat <loc> as HTML.
Modifying and writing back
for img in soup.find_all("img", src=lambda v: v and v.startswith("http://")):
    img["src"] = img["src"].replace("http://", "https://")
html_out = str(soup)
soup.find_all accepts callables for attribute filters. After mutation, str(soup) serialises back to HTML without adding whitespace, whereas prettify() re-indents and inserts newlines; use soup.encode("utf-8") when you need bytes.
Pairing with httpx.AsyncClient
import asyncio
import httpx
from bs4 import BeautifulSoup

async def fetch_and_parse(client, url):
    r = await client.get(url)
    return BeautifulSoup(r.text, "lxml")

async def main(urls):
    async with httpx.AsyncClient(timeout=10) as client:
        # Fetches run concurrently; each parse happens once its response arrives
        soups = await asyncio.gather(*(fetch_and_parse(client, u) for u in urls))
    for soup in soups:
        print(soup.title.string if soup.title else "(no title)")
BeautifulSoup is sync — the parse happens after the fetch. For pure-async scraping, gather the fetches concurrently, then parse sequentially or in a thread pool (asyncio.to_thread).
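The thread-pool option mentioned above might look like the following sketch. It assumes Python 3.9+ (for asyncio.to_thread), uses inline HTML strings in place of fetched responses, and parse_title and parse_all are illustrative names.

```python
import asyncio
from bs4 import BeautifulSoup

def parse_title(html: str) -> str:
    """Synchronous parse step; safe to run in a worker thread."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else "(no title)"

async def parse_all(pages: list[str]) -> list[str]:
    # Each parse runs in the default thread pool so the event loop stays responsive.
    return await asyncio.gather(*(asyncio.to_thread(parse_title, p) for p in pages))

# Stand-ins for response bodies that were already fetched.
pages = ["<html><head><title>One</title></head></html>", "<p>no title here</p>"]
titles = asyncio.run(parse_all(pages))
```

Offloading to threads mainly keeps the event loop free for more fetches; it should not be expected to make the parsing itself much faster.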
Performance tuning
Parsing dominates the cost of scraping. Levers:
- Pick lxml over html.parser. ~5× faster on typical HTML. The C extension is the single biggest performance win.
- html5lib is the slowest. Use only when you need exact browser behaviour on broken HTML.
- SoupStrainer parses only matching subtrees. Pass parse_only=SoupStrainer("article") to BeautifulSoup(); the parser builds only the matching parts of the tree. Major savings on large pages where you need a small slice.
- select() (CSS) vs find_all(). select uses soupsieve, which is generally faster for complex queries. find_all is faster for simple name="tag" lookups.
- Don't re-parse. Parsing is expensive; reuse the soup across many queries on the same document.
- features="lxml-xml" for XML. XML parsing is faster than HTML because the rules are simpler.
- Stream parsing isn't supported. BeautifulSoup loads the full document. For huge documents (>50 MB), use lxml.etree.iterparse directly.
For pure read-only scraping at scale, selectolax (Modest engine, C) outperforms BeautifulSoup by 3–10×, but lacks the mutation API.
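To make the SoupStrainer lever concrete, here is a small sketch with invented markup. One caveat worth knowing: parse_only is not supported by the html5lib backend, so this applies to html.parser and lxml.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page: navigation and footer are noise, articles are the target.
html = (
    "<html><body>"
    "<nav><a href='/'>Home</a></nav>"
    "<article><h2>One</h2></article>"
    "<article><h2>Two</h2></article>"
    "<footer>site footer</footer>"
    "</body></html>"
)

only_articles = SoupStrainer("article")
soup = BeautifulSoup(html, "html.parser", parse_only=only_articles)

titles = [h2.get_text() for h2 in soup.find_all("h2")]
nav = soup.find("nav")        # never built, so this lookup finds nothing
footer = soup.find("footer")  # same for the footer
```

The resulting soup contains only the article subtrees, so later queries also have less tree to walk.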
Version migration guide
beautifulsoup4 has been on 4.x since 2012; bumps are small and infrequent. Notable:
4.13 (2025)
- find_all behaviour tweaks around attribute matching.
- Better handling of namespace-prefixed tags in XML mode.
4.12 (2023)
- Improved support for HTML5 spec edge cases.
- soupsieve>=2.4 required (modern CSS-selector features like :has()).
4.7 (2019), historical
- Added soupsieve as a hard dep for select() CSS support.
- Code targeting older BS4 may use a partial in-tree CSS selector implementation; behaviour diverges on complex selectors.
Upgrade pattern:
- Pin beautifulsoup4>=4.13 in pyproject.toml.
- soupsieve and lxml floor versions matter too; pin them defensively if you rely on advanced CSS or XML features.
- Test against the exact HTML you scrape; minor changes in tag-inference behaviour can shift the tree shape for malformed inputs.
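Before and after an upgrade it can help to log which versions are actually installed. This sketch uses only the standard library (importlib.metadata, Python 3.8+); the function name and the default list of distributions are assumptions chosen for this page.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(names=("beautifulsoup4", "soupsieve", "lxml")) -> dict:
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None  # not installed in this environment
    return found

versions = installed_versions()
```

Printing this dict at scraper start-up makes it easier to correlate a behaviour change with a dependency bump.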
Plugin & rule ecosystem
BeautifulSoup has no plugin API — extension is by parser choice and soupsieve selectors:
| Component | Role |
|---|---|
| lxml | Fast C-backed parser. Recommended for HTML and XML. |
| html5lib | Pure-Python spec-compliant parser. Slow but most lenient. |
| html.parser (stdlib) | Default fallback. Strictest. |
| soupsieve | CSS-selector engine (select, select_one). Pulled in automatically. |
| cchardet / chardet | Encoding detection. BS4 falls back to these for byte-input. cchardet is the C-based fast version. |
The features= argument selects parser per parse. Common values: "lxml", "html5lib", "html.parser", "lxml-xml", "xml" (alias for lxml-xml).
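To see why the features= choice matters, the following sketch parses one malformed fragment with every backend that happens to be installed. The helper name parse_with_each is hypothetical; backends that are missing are skipped by catching FeatureNotFound.

```python
from bs4 import BeautifulSoup, FeatureNotFound

BROKEN = "<a></p>"  # stray closing tag; each parser repairs it differently

def parse_with_each(markup: str, parsers=("html.parser", "lxml", "html5lib")) -> dict:
    """Return {parser name: serialised tree} for every parser that is installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            continue  # backend not installed in this environment
    return results

results = parse_with_each(BROKEN)
for name, tree in results.items():
    print(f"{name:12} {tree}")
```

With all three backends present, the printed trees typically differ in whether html and body wrappers are added and in how the stray closing tag is handled, which is the reason for pinning one parser explicitly.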
Configuration & layout patterns
BeautifulSoup itself has no global config. Conventions for keeping a scraper maintainable:
- Always pass features= explicitly. Without it, BS4 picks whichever parser is installed, which makes behaviour env-dependent. BeautifulSoup(html, "lxml") is the safe form.
- Wrap fetching + parsing in a function. Don't open files or HTTP responses inline with parsing; separation makes each layer testable.
- Selector constants at module top. ARTICLE_SEL = "article.post", TITLE_SEL = "h2 > a". Easier to update when site markup shifts.
- Use select_one over find for new code. CSS selectors are more readable and composable than the find(name=, class_=, id=) keyword soup.
- Guard None everywhere. select_one and find return None on miss. elem.get_text() if elem else "" is the canonical pattern.
- Centralise encoding: pass from_encoding="utf-8" if you know the source encoding; otherwise BS4 sniffs (using cchardet if available).
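Put together, the conventions above could look like this sketch of a parsing module. The constant names follow the bullets; PARSER, parse_listing, and the sample markup are illustrative assumptions, and html.parser stands in for lxml so the snippet runs without extra installs.

```python
from bs4 import BeautifulSoup

PARSER = "html.parser"          # swap for "lxml" when it is installed
ARTICLE_SEL = "article.post"
TITLE_SEL = "h2 > a"

def parse_listing(html: str) -> list[dict]:
    """Pure parsing step: takes markup, returns plain dicts, does no I/O."""
    soup = BeautifulSoup(html, PARSER)
    rows = []
    for item in soup.select(ARTICLE_SEL):
        link = item.select_one(TITLE_SEL)
        if link is None:
            continue            # skip articles without a title link
        rows.append({"title": link.get_text(strip=True), "url": link.get("href", "")})
    return rows

# Hypothetical markup: the second article has no title link and is skipped.
sample = (
    '<article class="post"><h2><a href="/p/1"> First </a></h2></article>'
    '<article class="post"></article>'
)
rows = parse_listing(sample)
```

Because parse_listing takes a string, the same function serves the live scraper and fixture-based tests.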
Troubleshooting common errors
| Symptom | Cause | Fix |
|---|---|---|
| FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml | lxml not installed | pip install lxml. |
| GuessedAtParserWarning (a UserWarning): No parser was explicitly specified | No features= argument, so BS4 guessed a parser | Pass features="lxml" (or another) explicitly. |
| AttributeError: 'NoneType' object has no attribute 'get_text' | find / select_one returned None | Guard with if elem:. |
| select() returns nothing for a selector that looks correct | Selector mismatch on case, namespace, or whitespace | Inspect with print(soup.prettify()[:500]) to verify the tree. Try a simpler selector first. |
| Encoding shows as mojibake | Source page mis-declared encoding | Pass from_encoding="utf-8" explicitly; or fetch with r.text (httpx/requests handle declared encoding) before passing to BS4. |
| XML parsing treats tags as case-insensitive | Used features="lxml" instead of features="lxml-xml" | Switch to "lxml-xml" or "xml". |
| select(":has(...)") doesn't work | Old soupsieve version | Upgrade to soupsieve>=2.4. |
| BS4 modifies whitespace on serialisation | prettify() adds newlines and indentation | Use str(soup) or soup.decode(formatter="minimal") instead of prettify(); use soup.encode() for bytes. |
| Tag found in browser but not in BS4 | JavaScript rendered the element after page load | BS4 can't execute JS; fetch via Playwright/Selenium first, then parse the rendered HTML. |
| RecursionError on deep trees | Python's default recursion limit | sys.setrecursionlimit(5000) (cautiously); or use lxml.etree directly, which is iterative. |
The prettify() method is the diagnostic — print(soup.prettify()) shows the parsed tree exactly as BS4 sees it, surfacing missing elements or unexpected nesting from a malformed input.
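For the parser-guessing warning in the table, one option is to promote it to an error in tests so an implicit parser never slips through. This sketch assumes a bs4 version that exports GuessedAtParserWarning (older releases raise a plain UserWarning instead); parses_without_guessing is a hypothetical helper name.

```python
import warnings
from bs4 import BeautifulSoup, GuessedAtParserWarning

def parses_without_guessing(make_soup) -> bool:
    """Return False if the callable triggers BS4's parser-guessing warning."""
    with warnings.catch_warnings():
        # Inside this block the warning is raised as an exception.
        warnings.simplefilter("error", GuessedAtParserWarning)
        try:
            make_soup()
        except GuessedAtParserWarning:
            return False
    return True

explicit_ok = parses_without_guessing(lambda: BeautifulSoup("<p>x</p>", "html.parser"))
implicit_ok = parses_without_guessing(lambda: BeautifulSoup("<p>x</p>"))
```

When run from a script file, implicit_ok is expected to come back False; BS4 may skip the warning in some interactive contexts, so treat this as a test-suite aid rather than a guarantee.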
Ecosystem integrations
BeautifulSoup is the parsing layer. The fetch and (sometimes) render layers below it have many options:
- requests: synchronous HTTP, the original pairing. Mature, stable.
- httpx: modern, both sync and async, HTTP/2 support. The preferred choice in 2026.
- aiohttp: async-only HTTP client. Use when the whole stack is async.
- playwright / selenium: for JS-rendered pages. Fetch the rendered DOM, feed page.content() to BeautifulSoup.
- scrapy: full crawling framework. Uses parsel (similar API) instead of BeautifulSoup. Choose Scrapy for thousands of pages; BS4 for one-off or small jobs.
- pandas.read_html(): tries lxml first and falls back to BS4 + html5lib. Useful for table extraction; reads all <table> tags on a page.
- mechanicalsoup: session-aware browser-like wrapper. Less popular; consider httpx.AsyncClient(follow_redirects=True) instead.
- bleach: HTML sanitisation, built on html5lib rather than BS4. Use for cleaning untrusted HTML.
CI integration
Scrapers in CI typically run against recorded fixtures, not the live target. Patterns:
- Record-replay with pytest-vcr, responses, or pytest-recording. Capture real responses once, commit to repo, replay in CI.
- Static fixtures: check in HTML files under tests/fixtures/ and parse them in tests. Cheaper than VCR but go stale faster.
- Live-fetch nightly: a scheduled job hits the live target, surfaces breakage early. Separate from per-commit CI to avoid coupling PR-merge to upstream uptime.
name: scraper-tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e ".[test]"
      - run: pytest tests/  # uses VCR cassettes, no live calls
For live-target verification on a cron:
on:
  schedule:
    - cron: "0 2 * * *"  # 02:00 UTC daily
Failures here open an issue (actions-ecosystem/action-create-issue) rather than blocking the build.
When NOT to use this
BeautifulSoup is excellent for HTML scraping but a wrong fit when:
- Pure XML processing. Use lxml.etree directly: faster and more XPath-friendly. BS4's XML mode wraps lxml and adds overhead.
- JS-rendered pages. BS4 can't execute JavaScript. Fetch with Playwright/Selenium first; then pass the rendered HTML to BS4 if you want its ergonomics.
- Very large documents (50+ MB). BS4 loads the whole tree. Use lxml.etree.iterparse for streaming.
- Pure CSS-selector reads at scale. selectolax is 3–10× faster than BS4 for read-only queries. BS4's value is the mutation API and the consistent abstraction over multiple parsers.
- Inside a Scrapy pipeline. Scrapy ships with parsel, which is the same idea. Don't mix.
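For the large-document case, a streaming pass with iterparse might look like this sketch. It prefers lxml.etree and, as a swapped-in fallback so the snippet still runs without lxml, uses the standard library's xml.etree.ElementTree, whose iterparse accepts the same arguments used here. The sitemap bytes and the function name iter_locs are made up for the example.

```python
import io

try:
    from lxml import etree               # preferred: C-backed
except ImportError:                      # fallback so the sketch still runs
    import xml.etree.ElementTree as etree

def iter_locs(source):
    """Yield the text of every <loc> element without keeping whole subtrees."""
    for _event, elem in etree.iterparse(source, events=("end",)):
        local_name = elem.tag.rsplit("}", 1)[-1]   # strip any "{namespace}" prefix
        if local_name == "loc" and elem.text:
            yield elem.text.strip()
        elif local_name == "url":
            elem.clear()                 # release the finished <url> subtree

# Tiny stand-in for a sitemap that would normally be streamed from disk.
SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""

locs = list(iter_locs(io.BytesIO(SITEMAP)))
```

In real use the source would be an open binary file, and the generator would be consumed lazily instead of being collected into a list.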
See also
- Python: BeautifulSoup — API tutorial, navigation, CSS selectors, scraping recipes
- Concept: HTTP: what requests/httpx deliver before BeautifulSoup parses it
- Packages: pip-requests: the canonical fetch-then-parse pairing