cheat sheet

beautifulsoup4

Package-level reference for beautifulsoup4 on PyPI — install variants, parser-backend selection (lxml/html5lib/html.parser), and alternatives.

What it is

beautifulsoup4 (PyPI name; imported as bs4) is a Python library by Leonard Richardson for parsing HTML and XML into a navigable tree, then searching and mutating it. It does not fetch pages and does not execute JavaScript — pair it with requests or httpx to fetch, and with playwright or selenium when the page needs JS to render.

The library is a façade over one of three parser backends: the stdlib html.parser, the C-based lxml, or the spec-compliant html5lib. Picking the right backend matters more than picking BeautifulSoup itself.

The PyPI distribution is beautifulsoup4. The import name is bs4. The older BeautifulSoup (no 4) package is the abandoned 3.x line — do not install it.

Install

bash
pip install beautifulsoup4

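Output: pip's normal install log; exits 0 on success. Installs bs4 plus soupsieve (4.13+ also pulls in typing-extensions). No third-party parser is included, so BeautifulSoup falls back to the stdlib html.parser.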

bash
pip install beautifulsoup4 lxml

Output: installs the fast C-backed parser. Recommended for production scraping.

bash
pip install beautifulsoup4 html5lib

Output: installs the spec-compliant Python parser. Slower but handles malformed HTML the way browsers do.

bash
uv add beautifulsoup4 lxml

Output: added to pyproject.toml

bash
poetry add beautifulsoup4 lxml

Output: updated lockfile + virtualenv install
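After any of the install variants above, it can help to confirm which parser backends the environment can actually import. The following is a minimal sketch using only the standard library; the helper name available_backends is hypothetical, not part of bs4.

```python
from importlib.util import find_spec

# features= name -> module that has to be importable for that backend to work.
BACKENDS = {"html.parser": "html.parser", "lxml": "lxml", "html5lib": "html5lib"}


def available_backends() -> dict[str, bool]:
    """Map each backend name to whether its module can be found."""
    return {name: find_spec(module) is not None for name, module in BACKENDS.items()}


if __name__ == "__main__":
    for name, ok in available_backends().items():
        print(f"{name:12} {'installed' if ok else 'missing'}")
```

html.parser should always report as installed because it ships with Python; the other two depend on which install command was used.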

Versioning & Python support

  • Current line is 4.x — the 4 in the package name is the major version. Has been on 4.x since 2012.
  • Minor-release cadence is irregular — a few releases per year.
  • Recent releases support Python 3.7+; roughly one old Python minor version is dropped per minor release.
  • Loose semver — 4.13 (2025) shipped some find_all behaviour tweaks; check the changelog before upgrading on a production scraper.
  • The BeautifulSoup 3.x line is abandoned — do not install the BeautifulSoup (no-4) package.

Package metadata

  • Maintainer: Leonard Richardson (leonardr) — original author, still primary maintainer
  • Project home: crummy.com/software/BeautifulSoup
  • Source: code.launchpad.net/beautifulsoup (Git on Launchpad, previously Bazaar); unusual in not being GitHub-hosted
  • Docs: crummy.com/software/BeautifulSoup/bs4/doc
  • PyPI: pypi.org/project/beautifulsoup4
  • License: MIT
  • Governance: single maintainer, very long-running project
  • First released: 2004 (original release), 2012 (current 4.x line)
  • Downloads: tens of millions per month

Optional dependencies & extras

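beautifulsoup4 declares lxml and html5lib extras (pip install "beautifulsoup4[lxml]"), but each one only pulls in the matching parser package, so installing the backend as a separate package is equivalent. The choice of backend matters: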

Parser | Install | Speed | Lenient? | Notes
html.parser | stdlib | Slow | Moderate | Default fallback if no other parser is installed. Stricter on malformed HTML.
lxml | pip install lxml | Fast (C) | Yes | Recommended for production. Requires libxml2 system libraries; wheels usually ship them, but exotic platforms can require apt install libxml2-dev libxslt-dev first.
html5lib | pip install html5lib | Slow (pure Python) | Most lenient | Parses the way modern browsers do; best for the worst-broken HTML.
lxml-xml / xml | pip install lxml | Fast | n/a | Use for actual XML, not HTML.

soupsieve is a hard dependency (pulled in automatically); it provides the CSS-selector engine for soup.select(...). SoupSieve was added in BeautifulSoup 4.7 (January 2019); before that, CSS selectors were partially supported in-tree.
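To see why the backend choice matters, the same malformed fragment can be run through every backend that is installed. This is a sketch, assuming only that bs4 is importable; the helper name parse_with_each is hypothetical, and backends that are missing are reported rather than treated as errors.

```python
from bs4 import BeautifulSoup, FeatureNotFound

BROKEN = "<a></p>"  # stray closing tag with no matching <p>


def parse_with_each(markup: str) -> dict[str, str]:
    """Serialise the same markup through every backend that is installed."""
    results = {}
    for features in ("html.parser", "lxml", "html5lib"):
        try:
            results[features] = str(BeautifulSoup(markup, features))
        except FeatureNotFound:
            # Raised by bs4 when the requested backend is not installed.
            results[features] = "(not installed)"
    return results


for features, rendered in parse_with_each(BROKEN).items():
    print(f"{features:12} {rendered}")
```

html.parser returns the bare fragment <a></a>. According to the Beautiful Soup documentation, lxml wraps the fragment in <html> and <body>, and html5lib additionally adds <head> and an inferred <p>, so the three trees differ for the same input.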

Alternatives

Package | Trade-off
lxml (direct) | Use lxml.html directly when you need raw speed and don't need BeautifulSoup's API. ~2× faster on large documents.
selectolax | Modern, very fast C-backed HTML parser. CSS-selector first, no tree-mutation API. Use for read-only scraping at scale.
parsel | The Scrapy team's selector library. Wraps lxml with CSS + XPath. Use inside Scrapy pipelines.
pyquery | jQuery-style API over lxml. Fading; pick parsel or selectolax instead.
html5lib (direct) | Spec-compliant tokeniser. Slower; use only when you need exact browser behaviour.
playwright / selenium | For JS-rendered pages: fetch the rendered HTML and then feed it to BeautifulSoup.

Common gotchas

  1. pip install BeautifulSoup (no 4) installs the abandoned 3.x line. The correct package name is beautifulsoup4 and the correct import is from bs4 import BeautifulSoup. The wrong package still resolves on PyPI but hasn't shipped a release in years.
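  2. No parser specified → warning + environment-dependent parser choice. BeautifulSoup(html) emits GuessedAtParserWarning and picks the best parser that happens to be installed: lxml, then html5lib, then the stdlib html.parser. The same code can therefore build different trees on different machines. Always pass a parser explicitly (for example features="lxml") to get deterministic behaviour across environments.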
  3. lxml requires libxml2 / libxslt. Wheels ship for common platforms (x86_64/arm64 Linux, macOS, Windows). On Alpine (musl), some BSDs, or older ARM platforms you fall back to source and need the system libraries pre-installed.
  4. html.parser is stricter than html5lib. Malformed HTML that browsers render fine may parse differently — closing tags may be inserted at unexpected points, missing tags may not be inferred. If your scraper works in a browser but not in BeautifulSoup, try features="html5lib".
  5. .find_all returns a list, .select returns a list, .find returns first match (or None). Forgetting the None case crashes scrapers on the one page where the element is missing. Always guard or use .select_one() + if.
  6. Pickling a parsed tree doesn't round-trip cleanly. The tree holds back-references to the parser. Serialise with str(soup) and re-parse instead.
  7. The SoupSieve CSS engine was added in 4.7 (January 2019). Code targeting older BS4 that uses select() for complex selectors may behave differently; pin beautifulsoup4>=4.7 if you rely on :has() or pseudo-classes.
  8. Source is on Launchpad, not GitHub. Filing issues requires a Launchpad account — not the usual GitHub Issues flow. PRs are accepted via email patches or Launchpad merge proposals.
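The parser warning from gotcha 2 can be observed directly. This is a minimal sketch; the helper name parser_warnings is hypothetical. One assumption to note: Beautiful Soup may skip the warning when it cannot determine a calling file (for example in an interactive shell), so the implicit list can legitimately be empty there.

```python
import warnings

from bs4 import BeautifulSoup


def parser_warnings(markup: str, features=None) -> list[str]:
    """Return the warning messages raised while constructing the soup."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # do not let earlier filters hide anything
        if features is None:
            BeautifulSoup(markup)  # deliberately omits the parser
        else:
            BeautifulSoup(markup, features)
    return [str(w.message) for w in caught]


implicit = parser_warnings("<p>hello</p>")
explicit = parser_warnings("<p>hello</p>", "html.parser")
print(len(implicit), len(explicit))
```

Naming the parser removes the warning and, more importantly, removes the dependence on whatever backends the host has installed.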

Real-world recipes

Paginated scraping with rate limiting

python
import time
import httpx
from bs4 import BeautifulSoup

BASE = "https://example.com/articles"

def scrape_listing(page: int) -> list[dict]:
    r = httpx.get(BASE, params={"page": page}, timeout=10.0)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "lxml")
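    return [
        {
            "title": a.get_text(strip=True),
            "url": a["href"],
            # .date may be missing on some posts; store None instead of crashing
            "date": date.get_text(strip=True) if date is not None else None,
        }
        for item in soup.select("article.post")
        for a in [item.select_one("h2 > a")]
        for date in [item.select_one(".date")]
        if a is not None
    ]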

results = []
for page in range(1, 11):
    results.extend(scrape_listing(page))
    time.sleep(1.0)  # respect rate limit

The for a in [item.select_one("h2 > a")] clause, combined with if a is not None, binds the lookup to a name inside the comprehension and filters out misses. select_one returns None when the selector matches nothing, and dereferencing a["href"] on None crashes the scraper. Always guard.

Structured data extraction (JSON-LD)

python
import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
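for script in soup.find_all("script", type="application/ld+json"):
    try:
        # script.string is None for an empty tag; "" turns that into a JSONDecodeError
        data = json.loads(script.string or "")
    except json.JSONDecodeError:
        continue
    # JSON-LD can also be a list or an @graph wrapper; only plain objects are handled here
    if isinstance(data, dict) and data.get("@type") == "Article":
        print(data.get("headline"), data.get("author"))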

JSON-LD blocks are the cleanest data source on most modern sites — they sidestep DOM scraping entirely. Parse the wrapping <script> text as JSON; the schema follows schema.org conventions.

Sitemap parsing

python
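from bs4 import BeautifulSoup
with open("sitemap.xml", "rb") as f:  # close the file; bytes let the parser read the XML encoding declaration
    soup = BeautifulSoup(f, "xml")  # note: "xml" parser (needs lxml installed)
urls = [loc.text for loc in soup.find_all("loc")]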

Pass "xml" to BeautifulSoup() to use lxml's XML mode (preserves case-sensitive tag names, doesn't treat tags as HTML). "lxml" would lowercase tag names and treat <loc> as HTML.

Modifying and writing back

python
for img in soup.find_all("img", src=lambda v: v and v.startswith("http://")):
    img["src"] = img["src"].replace("http://", "https://")

html_out = str(soup)

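soup.find_all accepts callables for attribute filters. After mutation, str(soup) serialises back to HTML without pretty-printing, but the result is not guaranteed to be byte-identical to the input: the parser normalises attribute quoting and entities and keeps any repairs it made to malformed markup. Use soup.encode("utf-8") when you need bytes in a known encoding.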

Pairing with httpx.AsyncClient

python
import asyncio
import httpx
from bs4 import BeautifulSoup

async def fetch_and_parse(client, url):
    r = await client.get(url)
    return BeautifulSoup(r.text, "lxml")

async def main(urls):
    async with httpx.AsyncClient(timeout=10) as client:
        soups = await asyncio.gather(*(fetch_and_parse(client, u) for u in urls))
    for soup in soups:
        print(soup.title.string if soup.title else "(no title)")

BeautifulSoup is sync — the parse happens after the fetch. For pure-async scraping, gather the fetches concurrently, then parse sequentially or in a thread pool (asyncio.to_thread).
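The thread-pool variant mentioned above could look like the following sketch. It parses in-memory strings so that it runs without network access; extract_title and parse_all are hypothetical helper names, and html.parser is used only to avoid assuming lxml is installed.

```python
import asyncio

from bs4 import BeautifulSoup


def extract_title(html: str) -> str:
    """Blocking parse step; intended to run in a worker thread."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else "(no title)"


async def parse_all(pages: list[str]) -> list[str]:
    # asyncio.to_thread (Python 3.9+) keeps the event loop free while parsing.
    return await asyncio.gather(*(asyncio.to_thread(extract_title, p) for p in pages))


pages = ["<html><head><title>One</title></head></html>", "<p>no title here</p>"]
titles = asyncio.run(parse_all(pages))
print(titles)
```

Worker threads keep the loop responsive for pending fetches. They do not by themselves make a pure-Python parse faster, so treat this as a responsiveness measure rather than a throughput one.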

Performance tuning

Parsing dominates the cost of scraping. Levers:

  • Pick lxml over html.parser. ~5× faster on typical HTML. The C extension is the single biggest performance win.
  • html5lib is the slowest. Use only when you need exact browser behaviour on broken HTML.
  • SoupStrainer parses only matching subtrees. Pass parse_only=SoupStrainer("article") to BeautifulSoup() — the parser builds only the matching parts of the tree. Major savings on large pages where you need a small slice.
  • select() (CSS) vs find_all(). select uses soupsieve which is generally faster for complex queries. find_all is faster for simple name="tag" lookups.
  • Don't re-parse. Parsing is expensive; reuse the soup across many queries on the same document.
  • features="lxml-xml" for XML — XML parsing is faster than HTML because the rules are simpler.
  • Stream parsing isn't supported. BeautifulSoup loads the full document. For huge documents (>50 MB), use lxml.etree.iterparse directly.
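As an illustration of the SoupStrainer lever, the sketch below keeps only the article elements of a small page. The sample markup is made up for the example, and html.parser is used so that nothing beyond bs4 is assumed; the Beautiful Soup documentation notes that parse_only is not supported by html5lib.

```python
from bs4 import BeautifulSoup, SoupStrainer

HTML = """
<html><body>
  <nav><a href="/">Home</a></nav>
  <article class="post"><h2>First</h2></article>
  <article class="post"><h2>Second</h2></article>
  <footer>contact</footer>
</body></html>
"""

# Only <article> elements (and their children) are built into the tree.
only_articles = SoupStrainer("article")
soup = BeautifulSoup(HTML, "html.parser", parse_only=only_articles)

headings = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(headings)
print(soup.find("nav"))  # the nav element was never added to the tree
```

Anything outside the strained elements is simply absent from the soup, so queries for it return None or an empty list.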

For pure read-only scraping at scale, selectolax (Modest engine, C) outperforms BeautifulSoup by 3–10×, but lacks the mutation API.

Version migration guide

beautifulsoup4 has been on 4.x since 2012; bumps are small and infrequent. Notable:

4.13 (2025)

  • find_all behaviour tweaks around attribute matching.
  • Better handling of namespace-prefixed tags in XML mode.

4.12 (2023)

  • Improved support for HTML5 spec edge cases.
  • soupsieve>=2.4 required (modern CSS-selector features like :has()).

4.7 (2019, historical)

  • Added soupsieve as a hard dep for select() CSS support.
  • Code targeting older BS4 may use a partial in-tree CSS selector implementation; behaviour diverges on complex selectors.

Upgrade pattern:

  • Require beautifulsoup4>=4.13 in pyproject.toml and pin the exact version in your lockfile.
  • soupsieve and lxml floor versions matter too — pin them defensively if you rely on advanced CSS or XML features.
  • Test against the exact HTML you scrape — minor changes in tag-inference behaviour can shift the tree shape for malformed inputs.
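Before and after an upgrade it is useful to record which versions are actually installed. The following sketch uses only importlib.metadata from the standard library; installed_versions is a hypothetical helper name and the default package list is an assumption about what a typical scraper depends on.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("beautifulsoup4", "soupsieve", "lxml", "html5lib")):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name:16} {ver or 'not installed'}")
```

Logging this at scraper start-up makes it easier to correlate a change in tree shape with a dependency bump.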

Plugin & rule ecosystem

BeautifulSoup has no plugin API — extension is by parser choice and soupsieve selectors:

Component | Role
lxml | Fast C-backed parser. Recommended for HTML and XML.
html5lib | Pure-Python spec-compliant parser. Slow but most lenient.
html.parser (stdlib) | Default fallback. Strictest.
soupsieve | CSS-selector engine (select, select_one). Pulled in automatically.
cchardet / chardet | Encoding detection. BS4 falls back to these for byte-input. cchardet is the C-based fast version.

The features= argument selects parser per parse. Common values: "lxml", "html5lib", "html.parser", "lxml-xml", "xml" (alias for lxml-xml).

Configuration & layout patterns

BeautifulSoup itself has no global config. Conventions for keeping a scraper maintainable:

  • Always pass features= explicitly. Without it, BS4 picks the best parser that happens to be installed, which makes behaviour env-dependent. BeautifulSoup(html, "lxml") is the safe form.
  • Wrap fetching + parsing in a function. Don't open files or HTTP responses inline with parsing — separation makes each layer testable.
  • Selector constants at module top. ARTICLE_SEL = "article.post", TITLE_SEL = "h2 > a". Easier to update when site markup shifts.
  • Use select_one over find for new code. CSS selectors are more readable and composable than the find(name=, class_=, id=) keyword soup.
  • Guard None everywhere. select_one and find return None on miss. elem.get_text() if elem else "" is the canonical pattern.
  • Centralise encoding — pass from_encoding="utf-8" when you hand BS4 bytes and know the source encoding (it has no effect on str input); otherwise BS4 sniffs (using cchardet if available).
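Several of these conventions fit together in a few lines. The sketch below is one possible layout, not a prescribed one: text_or_default and parse_titles are hypothetical helpers, the selectors are the example values from the list above, and html.parser is used so that only bs4 is assumed.

```python
from bs4 import BeautifulSoup

# Selector constants at module top: one place to edit when the markup shifts.
ARTICLE_SEL = "article.post"
TITLE_SEL = "h2 > a"


def text_or_default(parent, selector: str, default: str = "") -> str:
    """Return the stripped text of the first match, or default on a miss."""
    elem = parent.select_one(selector)
    return elem.get_text(strip=True) if elem is not None else default


def parse_titles(html: str) -> list[str]:
    """Parsing only; fetching lives elsewhere so this stays testable."""
    soup = BeautifulSoup(html, "html.parser")  # parser named explicitly
    return [text_or_default(item, TITLE_SEL, "(untitled)") for item in soup.select(ARTICLE_SEL)]


SAMPLE = """
<article class="post"><h2><a href="/a">Alpha</a></h2></article>
<article class="post"><p>no heading</p></article>
"""
print(parse_titles(SAMPLE))
```

Routing every lookup through one guarded helper means a missing element produces a default value instead of an AttributeError deep inside a comprehension.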

Troubleshooting common errors

Symptom | Cause | Fix
FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml | lxml not installed | pip install lxml.
GuessedAtParserWarning: No parser was explicitly specified | BS4 guessed the best installed parser | Pass features="lxml" (or another) explicitly.
AttributeError: 'NoneType' object has no attribute 'get_text' | find / select_one returned None | Guard with if elem:.
select() returns nothing for a selector that looks correct | Selector mismatch on case, namespace, or whitespace | Inspect with print(soup.prettify()[:500]) to verify the tree. Try a simpler selector first.
Encoding shows as mojibake | Source page mis-declared encoding | Pass the raw bytes (r.content) with from_encoding="utf-8"; or, if the declared encoding is correct, pass r.text (httpx/requests decode it for you).
XML parsing treats tags as case-insensitive | Used features="lxml" instead of features="lxml-xml" | Switch to "lxml-xml" or "xml".
select(":has(...)") doesn't work | Old soupsieve version | Upgrade to soupsieve>=2.4.
BS4 modifies whitespace on serialisation | prettify() adds newlines and indentation | Serialise with str(soup) or soup.decode(formatter="minimal"); keep prettify() for debugging.
Tag found in browser but not in BS4 | JavaScript rendered the element after page load | BS4 can't execute JS; fetch via Playwright/Selenium first, then parse the rendered HTML.
RecursionError on deep trees | Python's default recursion limit | sys.setrecursionlimit(5000) (cautiously); or use lxml.etree directly which is iterative.

The prettify() method is the diagnostic — print(soup.prettify()) shows the parsed tree exactly as BS4 sees it, surfacing missing elements or unexpected nesting from a malformed input.
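The mojibake row is easiest to understand with a concrete case. The sketch below reproduces the failure by decoding UTF-8 bytes with the wrong codec, then shows the fix of handing BS4 the raw bytes together with the known encoding. The sample string is made up for the example.

```python
from bs4 import BeautifulSoup

raw = "<p>café</p>".encode("utf-8")  # bytes as they would arrive off the wire

# Decoding with the wrong codec before parsing is what produces mojibake.
mojibake = BeautifulSoup(raw.decode("latin-1"), "html.parser").p.get_text()

# Passing bytes plus the known encoding lets BS4 decode them correctly.
fixed = BeautifulSoup(raw, "html.parser", from_encoding="utf-8").p.get_text()

print(repr(mojibake), repr(fixed))
```

from_encoding only matters for bytes input; once text has been decoded incorrectly upstream, the parser cannot repair it.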

Ecosystem integrations

BeautifulSoup is the parsing layer. The fetch and (sometimes) render layers below it have many options:

  • requests — synchronous HTTP, the original pairing. Mature, stable.
  • httpx — modern, both sync and async, HTTP/2 support. The preferred choice in 2026.
  • aiohttp — async-only HTTP client. Use when the whole stack is async.
  • playwright / selenium — for JS-rendered pages. Fetch the rendered DOM, feed page.content() to BeautifulSoup.
  • scrapy — full crawling framework. Uses parsel (similar API) instead of BeautifulSoup. Choose Scrapy for thousands of pages; BS4 for one-off or small jobs.
  • pandas.read_html() — table extraction; reads all <table> tags on a page. Uses lxml by default and falls back to BS4 + html5lib.
  • mechanicalsoup — session-aware browser-like wrapper. Less popular; consider httpx.AsyncClient(follow_redirects=True) instead.
  • bleach — HTML sanitisation for untrusted HTML. Built on html5lib, not BS4; the project has been deprecated since 2023.

CI integration

Scrapers in CI typically run against recorded fixtures, not the live target. Patterns:

  • Record-replay with pytest-vcr, responses, or pytest-recording. Capture real responses once, commit to repo, replay in CI.
  • Static fixtures — check in HTML files under tests/fixtures/ and parse them in tests. Cheaper than VCR but stale faster.
  • Live-fetch nightly — a scheduled job hits the live target, surfaces breakage early. Separate from per-commit CI to avoid coupling PR-merge to upstream uptime.

yaml
name: scraper-tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e ".[test]"
      - run: pytest tests/  # uses VCR cassettes, no live calls

For live-target verification on a cron:

yaml
on:
  schedule:
    - cron: "0 2 * * *"  # 02:00 UTC daily

Failures here open an issue (actions-ecosystem/action-create-issue) rather than blocking the build.
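The static-fixture pattern from the list above might be tested as in this sketch. parse_listing and its selectors are hypothetical stand-ins for a real scraper's parsing function; the test writes its fixture to pytest's tmp_path so the example is self-contained, whereas a real suite would read a checked-in file under tests/fixtures/.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def parse_listing(html: str) -> list[dict]:
    """Extract title/url pairs from a listing page (selectors are assumptions)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("article.post"):
        link = item.select_one("h2 > a")
        if link is not None:  # skip posts without a heading link
            rows.append({"title": link.get_text(strip=True), "url": link.get("href")})
    return rows


def test_parse_listing(tmp_path: Path) -> None:
    # Stand-in for a recorded page that would normally be committed to the repo.
    fixture = tmp_path / "listing.html"
    fixture.write_text(
        '<article class="post"><h2><a href="/a">Alpha</a></h2></article>',
        encoding="utf-8",
    )
    html = fixture.read_text(encoding="utf-8")
    assert parse_listing(html) == [{"title": "Alpha", "url": "/a"}]
```

Because the test never touches the network, it can run on every push, leaving the scheduled job to detect markup changes on the live site.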

When NOT to use this

BeautifulSoup is excellent for HTML scraping but a wrong fit when:

  • Pure XML processing. Use lxml.etree directly — faster and more XPath-friendly. BS4's XML mode wraps lxml and adds overhead.
  • JS-rendered pages. BS4 can't execute JavaScript. Fetch with Playwright/Selenium first; then pass the rendered HTML to BS4 if you want its ergonomics.
  • Very large documents (50+ MB). BS4 loads the whole tree. Use lxml.etree.iterparse for streaming.
  • Pure CSS-selector reads at scale. selectolax is 3–10× faster than BS4 for read-only queries. BS4's value is the mutation API and the consistent abstraction over multiple parsers.
  • Inside a Scrapy pipeline. Scrapy ships with parsel, which is the same idea. Don't mix.

See also