Parse, search, and mutate HTML/XML with BeautifulSoup 4. Covers parser choice (html.parser/lxml/html5lib), find/find_all/select, tree navigation, attribute access, and pairing with requests/httpx/playwright for end-to-end scraping.

BeautifulSoup — HTML Parsing & Scraping

What it is

BeautifulSoup (package name beautifulsoup4, import bs4) is a Python library by Leonard Richardson for parsing HTML and XML into a navigable tree. It does not fetch pages — pair it with requests, httpx, or playwright for that — and it does not execute JavaScript, so JS-rendered pages need a headless browser before the soup. Its sweet spot is screen-scraping static HTML, fixing broken markup, and re-emitting cleaned-up XML/HTML. For pure speed on well-formed documents reach for lxml directly; BeautifulSoup is the friendlier API that calls into lxml (or html.parser / html5lib) under the hood.

Install

bash
# Core library
pip install beautifulsoup4

# Recommended parser (C speed, lenient)
pip install lxml

# HTML5-spec parser (slow, but builds the same tree a browser would)
pip install html5lib

# HTTP clients you'll pair it with
pip install requests
pip install httpx

Output: (pip prints its install log; exits 0 on success)

Syntax

The library is a single class — BeautifulSoup(markup, parser) — plus the tree of Tag / NavigableString objects it returns. Pass an HTML/XML string or a file-like object, name the parser explicitly, and search the tree with find, find_all, or CSS selectors via select.

python
from bs4 import BeautifulSoup

soup = BeautifulSoup(markup, "lxml")           # or "html.parser", "html5lib", "xml"
tag = soup.find("a", class_="primary")
tags = soup.find_all("li", limit=10)
hits = soup.select("div.article > h2 a[href]")


Quick example

python
# quick.py
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Hello, World</h1>
  <ul class="links">
    <li><a href="/a">First</a></li>
    <li><a href="/b">Second</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.h1.string)
for a in soup.select("ul.links a"):
    print(a.get_text(), "->", a["href"])
bash
python quick.py

Output:

text
Hello, World
First -> /a
Second -> /b

Parser choice

BeautifulSoup is a facade: the actual parsing is delegated to one of four backends. They differ in speed, leniency, and how they handle malformed HTML. Always pass the parser name explicitly; if you omit it, BeautifulSoup picks the best parser it finds installed and emits a GuessedAtParserWarning, so the same code can build different trees on different machines.

| Parser | Speed | Leniency | Notes |
| --- | --- | --- | --- |
| "html.parser" | medium | medium | Stdlib, no install needed. Fine for most jobs. |
| "lxml" | fast | lenient | Best default: install lxml and use it for HTML. |
| "lxml-xml" or "xml" | fast | strict | XML only. Preserves namespaces, case-sensitive. |
| "html5lib" | slow | maximal | Follows the HTML5 parsing algorithm, so the tree matches what a browser builds; use for badly broken pages. |

python
# parser_choice.py
from bs4 import BeautifulSoup

broken = "<p>one<p>two<p>three"

for parser in ("html.parser", "lxml", "html5lib"):
    soup = BeautifulSoup(broken, parser)
    print(parser, "->", [p.get_text() for p in soup.find_all("p")])
bash
python parser_choice.py

Output:

text
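html.parser -> ['onetwothree', 'twothree', 'three']
lxml -> ['one', 'two', 'three']
html5lib -> ['one', 'two', 'three']

html.parser leaves the unclosed <p> tags nested inside one another, so each outer paragraph's text includes the inner ones; lxml and html5lib close each <p> before opening the next.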

find vs find_all vs select

find(name, attrs) returns the first matching tag or None. find_all(name, attrs, limit=) returns a list (default: all matches). select(css) runs a CSS selector and returns a list; select_one(css) returns the first match or None. CSS selectors are usually the most readable; reach for find / find_all when you need keyword args (string=, recursive=False, custom callables).

python
# search.py
from bs4 import BeautifulSoup

html = """
<article class="post" data-id="42">
  <h2>Title</h2>
  <p class="lede">Lead paragraph.</p>
  <p>Body paragraph.</p>
  <a class="cta primary" href="/buy">Buy</a>
  <a class="cta" href="/learn">Learn</a>
</article>
"""

soup = BeautifulSoup(html, "lxml")

print(soup.find("h2").get_text())                              # first match by tag name
print(soup.find("a", class_="primary")["href"])                # keyword match
print(soup.find("p", string="Body paragraph.").get("class"))    # exact text match

print([a["href"] for a in soup.find_all("a")])                  # every link
print([a["href"] for a in soup.find_all("a", limit=1)])         # cap results

print(soup.select_one("article > h2").text)                    # CSS direct-child
print([a.text for a in soup.select("a.cta")])                  # CSS class
print(soup.select_one('[data-id="42"]').name)                  # CSS attribute selector
print([a["href"] for a in soup.select("a[href^='/']")])        # CSS prefix match
bash
python search.py

Output:

text
Title
/buy
None
['/buy', '/learn']
['/buy']
Title
['Buy', 'Learn']
article
['/buy', '/learn']

class_ (with trailing underscore) is the keyword in find / find_all because class is a Python reserved word. With select you write the normal CSS .classname.
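A minimal sketch of recursive=False, which restricts find_all to direct children, next to the equivalent > child combinator in select. The nested menu markup is invented for illustration, and html.parser is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a menu with one nested sub-list.
html = "<ul id='menu'><li>top<ul><li>nested</li></ul></li><li>second</li></ul>"
soup = BeautifulSoup(html, "html.parser")
menu = soup.find("ul", id="menu")

every_li = menu.find_all("li")                      # searches all descendants
direct_li = menu.find_all("li", recursive=False)    # direct children only
css_direct = soup.select("#menu > li")              # CSS child combinator, same idea

print(len(every_li), len(direct_li), len(css_direct))   # 3 2 2

# find and select_one both return None when nothing matches.
print(soup.find("table"), soup.select_one("table"))     # None None
```

Because a miss returns None rather than raising, chaining straight onto the result (soup.find("table").get_text()) fails with AttributeError; check for None first.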

Searching with regex, lists, and callables

The name=, attrs=, and string= arguments to find / find_all accept more than plain strings — they take regexes, lists, dicts, booleans, and callables. This is how you express "any heading", "any link whose href starts with https://example.com/", or "any tag with both classes".

python
# advanced_search.py
import re
from bs4 import BeautifulSoup

html = """
<div>
  <h1>Top</h1>
  <h2>Mid</h2>
  <h3>Lower</h3>
  <a href="https://example.com/a">A</a>
  <a href="mailto:alice@example.com">Email</a>
  <a href="/local">Local</a>
</div>
"""

soup = BeautifulSoup(html, "lxml")

# Regex tag name — any heading h1..h6
print([t.name for t in soup.find_all(re.compile(r"^h[1-6]$"))])

# List — match any tag in the list
print([t.name for t in soup.find_all(["a", "h1"])])

# Boolean True — all tags with this attribute
print([a["href"] for a in soup.find_all("a", href=True)])

# Callable — link only if it goes off-site
def offsite(tag):
    return tag.name == "a" and tag.get("href", "").startswith("http")

print([a["href"] for a in soup.find_all(offsite)])

# Dict attrs: values may be strings or regexes; add more keys for more constraints
print(soup.find("a", attrs={"href": re.compile(r"^mailto:")}).get_text())
bash
python advanced_search.py

Output:

text
['h1', 'h2', 'h3']
['h1', 'a', 'a', 'a']
['https://example.com/a', 'mailto:alice@example.com', '/local']
['https://example.com/a']
Email
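The "tag with both classes" and "href starts with a given prefix" cases mentioned above can be sketched as follows. The markup and the helper name has_both_classes are made up for this example; html.parser is used so it runs without lxml.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this sketch.
html = """
<a class="cta primary" href="https://example.com/buy">Buy</a>
<a class="cta" href="https://example.com/learn">Learn</a>
<a class="primary" href="https://other.example/">Other</a>
"""
soup = BeautifulSoup(html, "html.parser")

def has_both_classes(tag):
    # class is multi-valued, so tag.get returns a list (or the [] default).
    classes = tag.get("class", [])
    return "cta" in classes and "primary" in classes

both = [a.get_text() for a in soup.find_all(has_both_classes)]

# Regex on an attribute value: href must start with https://example.com/
on_site = [a.get_text()
           for a in soup.find_all("a", href=re.compile(r"^https://example\.com/"))]

# string= also takes a regex; the matches are the text nodes themselves.
starts_with_l = [str(s) for s in soup.find_all(string=re.compile(r"^L"))]

print(both)            # ['Buy']
print(on_site)         # ['Buy', 'Learn']
print(starts_with_l)   # ['Learn']
```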

Tree navigation

Every node exposes parent, sibling, and descendant accessors. Use them when search-by-attribute isn't enough, e.g. "the <dd> that follows this <dt>" or "the table row containing the cell with this text".

| Accessor | What it returns |
| --- | --- |
| tag.parent | The immediate parent tag |
| tag.parents | An iterator from the parent up to the root |
| tag.next_sibling / previous_sibling | The next/previous sibling (may be a whitespace NavigableString) |
| tag.next_element / previous_element | The next/previous anything (siblings, children, text) |
| tag.children | Direct children iterator |
| tag.descendants | All descendants, depth-first |
| tag.contents | Direct children as a list |
| tag.strings / stripped_strings | All text nodes (whitespace-stripped variant) |
| tag.find_next("p") | Next matching tag, document order |
| tag.find_previous("p") | Previous matching tag |

python
# navigate.py
from bs4 import BeautifulSoup

html = """
<dl>
  <dt>name</dt><dd>alice</dd>
  <dt>email</dt><dd>alice@example.com</dd>
  <dt>role</dt><dd>admin</dd>
</dl>
"""

soup = BeautifulSoup(html, "lxml")

# Build a dict from a <dl> by pairing each <dt> with its next <dd>
data = {dt.get_text(): dt.find_next("dd").get_text() for dt in soup.find_all("dt")}
print(data)

# Walk all stripped strings under the <dl>
print(list(soup.dl.stripped_strings))
bash
python navigate.py

Output:

text
{'name': 'alice', 'email': 'alice@example.com', 'role': 'admin'}
['name', 'alice', 'email', 'alice@example.com', 'role', 'admin']
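The other case named above, "the table row containing the cell with this text", might look like this. It is a sketch over an invented two-row table, using find_parent to climb from the cell to its row; html.parser is used so no <tbody> is inserted.

```python
from bs4 import BeautifulSoup

# Hypothetical table, invented for this sketch.
html = """
<table>
  <tr><td>alice</td><td>admin</td></tr>
  <tr><td>bob</td><td>viewer</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

cell = soup.find("td", string="bob")        # locate the cell by its exact text
row = cell.find_parent("tr")                # climb to the enclosing row
cells = [td.get_text() for td in row.find_all("td")]
ancestors = [p.name for p in cell.parents]  # walks up to the soup object itself

print(cells)                                       # ['bob', 'viewer']
print(ancestors)                                   # ['tr', 'table', '[document]']
print(cell.find_next_sibling("td").get_text())     # viewer
```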

Attribute access and text

A Tag behaves like a dict for its attributes — tag["href"] raises KeyError if missing, tag.get("href", default) doesn't. Text is read via tag.string (only when the tag has a single string child), tag.get_text(sep, strip) (the workhorse — joins all descendants), or tag.stripped_strings (an iterator of trimmed strings).

python
# attrs.py
from bs4 import BeautifulSoup

html = """
<a href="/a" class="btn primary" data-id="42" aria-label="Open">
  Open <span>now</span>
</a>
"""

soup = BeautifulSoup(html, "lxml")
a = soup.a

print(a["href"])                            # required attr
print(a.get("missing", "default"))           # optional with fallback
print(a["class"])                            # multi-valued attrs become lists
print(a.get("data-id"))                      # data-* attributes work the same
print(a.attrs)                                # whole attr dict

print(a.string)                              # None — multiple children
print(a.get_text(strip=True))                # "Opennow"
print(a.get_text(" ", strip=True))           # "Open now" — separator joins
print(list(a.stripped_strings))
bash
python attrs.py

Output:

text
/a
default
['btn', 'primary']
42
{'href': '/a', 'class': ['btn', 'primary'], 'data-id': '42', 'aria-label': 'Open'}
None
Opennow
Open now
['Open', 'now']

Multi-valued HTML attributes (class, rel, rev, accept-charset, headers, archive) are returned as lists, not strings. Test with "primary" in tag["class"], not tag["class"] == "primary".
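A short sketch of the membership test, plus a safe default for tags that have no class at all. The markup is invented; note that id is not in the multi-valued set, so it stays a plain string even when it contains a space.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one styled link, one bare link.
html = ('<a class="btn primary" rel="nofollow noopener" id="x y" href="/a">go</a>'
        '<a href="/b">plain</a>')
soup = BeautifulSoup(html, "html.parser")
styled, plain = soup.find_all("a")

print(styled["class"])                       # ['btn', 'primary']
print(styled["rel"])                         # ['nofollow', 'noopener']
print(styled["id"])                          # x y
print("primary" in styled["class"])          # True
print(styled["class"] == "primary")          # False: a list never equals a str
print("primary" in plain.get("class", []))   # False, and no KeyError
```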

Mutating and serialising

The tree is mutable. Edit attributes, replace text, insert new tags built with soup.new_tag, and call decompose() to delete a tag and everything under it. Serialise back to a string with str(soup) (compact) or soup.prettify() (indented).

python
# mutate.py
from bs4 import BeautifulSoup

html = "<article><h1>Old</h1><p>body</p><script>tracker()</script></article>"
soup = BeautifulSoup(html, "lxml")

# Rewrite the heading
soup.h1.string = "New"

# Add an attribute
soup.article["data-version"] = "2"

# Insert a new tag before <p>
note = soup.new_tag("aside", **{"class": "note"})
note.string = "Important!"
soup.p.insert_before(note)

# Strip every <script>
for s in soup.find_all("script"):
    s.decompose()

print(soup.prettify())
bash
python mutate.py

Output:

text
<html>
 <body>
  <article data-version="2">
   <h1>
    New
   </h1>
   <aside class="note">
    Important!
   </aside>
   <p>
    body
   </p>
  </article>
 </body>
</html>
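To compare the two serialisers directly, here is a small sketch on a one-line fragment. It assumes html.parser, which (unlike lxml) does not wrap a fragment in <html> and <body>, so the compact form stays short.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
soup.b.string = "there"          # replace the text inside <b>
soup.p["class"] = "greeting"     # add an attribute

compact = str(soup)              # single line, no added whitespace
pretty = soup.prettify()         # one node per line, indented

print(compact)   # <p class="greeting">Hello <b>there</b></p>
print(pretty)
```

prettify() adds whitespace between nodes, so it suits inspection and diffs more than output that will be re-parsed or rendered.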

Pairing with requests

Most scraping flows are fetch → parse → extract. requests (or httpx) handles the HTTP side; BeautifulSoup handles the HTML side. Always pass an explicit timeout, raise on 4xx/5xx, and feed the bytes (not response.text) to BeautifulSoup so it can detect the encoding from the document itself.

python
# fetch_parse.py
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://httpbin.org/html", timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.content, "lxml")   # .content (bytes), not .text
print(soup.h1.get_text(strip=True))
bash
python fetch_parse.py

Output:

text
Herman Melville - Moby-Dick
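The bytes-versus-text advice can be checked offline. This sketch builds a hypothetical page that declares ISO-8859-1 in a <meta> tag, encodes it that way, and compares what BeautifulSoup reports for bytes and for an already-decoded str.

```python
from bs4 import BeautifulSoup

# Hypothetical page, invented for this sketch.
page = ('<html><head><meta charset="iso-8859-1"></head>'
        '<body><p>café</p></body></html>')
raw = page.encode("iso-8859-1")            # what .content would hand you

from_bytes = BeautifulSoup(raw, "html.parser")
print(from_bytes.p.get_text())             # café
print(from_bytes.original_encoding)        # the encoding BeautifulSoup settled on

# A str is already decoded, so there is nothing left to detect.
from_text = BeautifulSoup(page, "html.parser")
print(from_text.original_encoding)         # None

# If detection picks the wrong codec, from_encoding overrides it.
forced = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-1")
print(forced.p.get_text())                 # café
```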

Pairing with httpx (async)

For high-throughput scraping or pages that need to be fetched concurrently, httpx.AsyncClient pairs naturally with asyncio.gather. BeautifulSoup itself is synchronous and CPU-bound, so the actual parsing won't parallelise — but the network waits will.

python
# async_scrape.py
import asyncio
import httpx
from bs4 import BeautifulSoup

URLS = [
    "https://httpbin.org/html",
    "https://example.com",
]

async def fetch(client, url):
    r = await client.get(url, timeout=10)
    r.raise_for_status()
    soup = BeautifulSoup(r.content, "lxml")
    return url, (soup.h1 or soup.title).get_text(strip=True)

async def main():
    async with httpx.AsyncClient() as client:   # http2=True also needs: pip install "httpx[http2]"
        return await asyncio.gather(*(fetch(client, u) for u in URLS))

for url, title in asyncio.run(main()):
    print(url, "->", title)
bash
python async_scrape.py

Output:

text
https://httpbin.org/html -> Herman Melville - Moby-Dick
https://example.com -> Example Domain
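Unbounded gather can hit a site with every request at once. One common way to cap concurrency is an asyncio.Semaphore, sketched here with a stand-in coroutine (fake_get and the in-memory PAGES dict are hypothetical) in place of client.get so it runs without network access.

```python
import asyncio
from bs4 import BeautifulSoup

# Hypothetical in-memory "site" standing in for real HTTP responses.
PAGES = {
    "/a": "<html><head><title>Page A</title></head></html>",
    "/b": "<html><head><title>Page B</title></head></html>",
    "/c": "<html><head><title>Page C</title></head></html>",
}

async def fake_get(path: str) -> str:
    await asyncio.sleep(0.01)          # simulate network latency
    return PAGES[path]

async def fetch_title(sem: asyncio.Semaphore, path: str) -> tuple[str, str]:
    async with sem:                    # at most max_in_flight fetches at a time
        html = await fake_get(path)
    soup = BeautifulSoup(html, "html.parser")
    return path, soup.title.get_text(strip=True)

async def crawl(paths: list[str], max_in_flight: int = 2) -> list[tuple[str, str]]:
    sem = asyncio.Semaphore(max_in_flight)
    # gather returns results in the order the coroutines were passed in.
    return await asyncio.gather(*(fetch_title(sem, p) for p in paths))

results = asyncio.run(crawl(list(PAGES)))
print(results)   # [('/a', 'Page A'), ('/b', 'Page B'), ('/c', 'Page C')]
```

With a real client, the body of fake_get would be replaced by the client.get call from the script above; the semaphore logic stays the same.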

Pairing with playwright for JS-rendered pages

BeautifulSoup can't run JavaScript. If view-source: of a page shows <div id="app"></div> and nothing else, the content was rendered client-side and requests will only see the empty shell. Run the page through playwright (or selenium) first, snapshot the rendered HTML, then hand that to BeautifulSoup.

bash
pip install playwright
playwright install chromium

Output: (install and browser-download logs; exits 0 on success)

python
# js_page.py
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com", wait_until="networkidle")
    rendered_html = page.content()                 # post-JavaScript DOM
    browser.close()

soup = BeautifulSoup(rendered_html, "lxml")
print(soup.title.string)
bash
python js_page.py

Output:

text
Example Domain
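Before reaching for a browser, it can help to check whether the plain HTTP response is just an empty shell. The helper below is a rough heuristic, not a standard API: looks_client_rendered and its min_chars threshold are assumptions made for this sketch.

```python
from bs4 import BeautifulSoup

def looks_client_rendered(html: str, min_chars: int = 20) -> bool:
    """Guess whether a page has almost no server-rendered text (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Script and style contents are not visible text, so drop them first.
    for tag in soup.find_all(["script", "style", "noscript"]):
        tag.extract()
    body = soup.body or soup
    return len(body.get_text(strip=True)) < min_chars

shell = '<html><body><div id="app"></div><script src="/bundle.js"></script></body></html>'
static = "<html><body><h1>Hello</h1><p>Plenty of server-rendered text here.</p></body></html>"

print(looks_client_rendered(shell))    # True
print(looks_client_rendered(static))   # False
```

A True result only suggests that a headless browser is worth trying; short static pages will also trip the threshold.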

BeautifulSoup vs lxml direct

lxml is a C library; BeautifulSoup is a Python wrapper that can use lxml under the hood. Choose by ergonomics first, performance second.

| Concern | BeautifulSoup | lxml direct |
| --- | --- | --- |
| API | Pythonic, forgiving | XPath-centric |
| Selectors | find / CSS / regex | XPath + CSS (via cssselect) |
| Speed | Slower, even with the lxml backend (it builds its own Python object tree) | Fastest |
| Broken HTML | Excellent (especially html5lib) | Good (libxml2 recovery) |
| XML namespaces | Use "xml" parser | First-class |
| Memory | Whole-tree | Streaming via iterparse |

For one-off scraping or scripts where the markup is messy, BeautifulSoup wins. For multi-gig XML feeds or schema-heavy workflows, go straight to lxml.

Common pitfalls

  1. Implicit parser selection: BeautifulSoup(html) (no parser) picks whichever installed backend it ranks best. Always pass the parser name explicitly so behaviour doesn't change when lxml is or isn't installed.
  2. Feeding response.text instead of response.content: .text decodes using the HTTP client's guess at the encoding; .content is bytes and lets BeautifulSoup sniff the document's own declaration (<meta charset>). Many encoding bugs trace back to this.
  3. class_ vs class: write find(class_="foo") (with underscore) in Python; in CSS selectors it's .foo. Forgetting the underscore yields a SyntaxError.
  4. Multi-class matching: find("div", class_="a b") matches only when the class attribute is exactly the string "a b", in that order. To match a tag with both classes in any order, use the CSS selector form: soup.select("div.a.b").
  5. tag.string returns None for multi-child tags: for <p>hi <b>there</b></p>, p.string is None. Use get_text() for the whole subtree.
  6. Whitespace-only siblings: tag.next_sibling often returns a NavigableString of whitespace. Use tag.find_next_sibling("name") to skip them, or filter with isinstance(sib, Tag).
  7. Mutating during iteration: calling .decompose() inside a for tag in soup.find_all(...) loop is safe (find_all returns a list), but mutating inside an iterator from .descendants is not.
  8. Robots.txt and rate limits: BeautifulSoup will happily parse anything; the server is the constraint. Respect robots.txt, throttle (time.sleep), set a User-Agent that identifies you, and honour Retry-After.
  9. Anti-scraping defences: JavaScript challenges, Cloudflare interstitials, fingerprinting. If you hit a wall with requests + BeautifulSoup, the next stop is playwright (or curl_cffi for TLS fingerprint impersonation), not more soup.
  10. Encoding mojibake: soup.original_encoding reports what BeautifulSoup decided to use. If it's wrong, pass from_encoding="utf-8" (or the correct codec) explicitly.
  11. select is CSS, not XPath: select supports a few non-standard pseudo-classes through soupsieve, such as :-soup-contains(...) (the older spelling :contains(...) is deprecated). XPath (//div[@class='foo']) is not supported; switch to lxml for that.
  12. Memory on huge pages: BeautifulSoup loads the entire DOM. For multi-megabyte HTML/XML, use lxml.etree.iterparse to stream instead.
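Pitfalls 4 to 6 are easy to reproduce in a few lines. The sketch below uses invented markup and html.parser (which keeps the whitespace text nodes between tags) to show each one next to its workaround.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup, invented for this sketch.
html = """
<div class="a b">both</div>
<div class="b a">reversed</div>
<p>hi <b>there</b></p>
<ul>
  <li>one</li>
  <li>two</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Pitfall 4: class_="a b" is an exact-string match; the CSS form ignores order.
exact = [d.get_text() for d in soup.find_all("div", class_="a b")]
any_order = [d.get_text() for d in soup.select("div.a.b")]
print(exact)        # ['both']
print(any_order)    # ['both', 'reversed']

# Pitfall 5: .string is None once a tag has more than one child.
print(soup.p.string)        # None
print(soup.p.get_text())    # hi there

# Pitfall 6: the next sibling of the first <li> is whitespace, not the next <li>.
first = soup.li
print(isinstance(first.next_sibling, Tag))         # False
print(first.find_next_sibling("li").get_text())    # two
```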

Real-world recipes

Scrape a paginated table into a pandas DataFrame

A common end-to-end flow: pull every page of an HTML table, normalise rows, and hand the result to pandas for analysis. pandas.read_html works for trivial tables but fails on dynamic markup; pairing BeautifulSoup with pandas.DataFrame is more flexible.

python
# scrape_table.py
import time
import requests
from bs4 import BeautifulSoup
import pandas as pd

BASE = "https://example.com/items?page={page}"

def parse_page(html: bytes) -> list[dict[str, str]]:
    soup = BeautifulSoup(html, "lxml")
    rows = []
    for tr in soup.select("table#items tbody tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 3:
            rows.append({"id": cells[0], "name": cells[1], "price": cells[2]})
    return rows

all_rows: list[dict[str, str]] = []
session = requests.Session()
session.headers["User-Agent"] = "jockey-scraper/1.0 (alice@example.com)"

for page in range(1, 6):
    resp = session.get(BASE.format(page=page), timeout=10)
    resp.raise_for_status()
    all_rows.extend(parse_page(resp.content))
    time.sleep(1.0)                    # be polite

df = pd.DataFrame(all_rows)
df["price"] = df["price"].str.replace("$", "", regex=False).astype(float)  # "$" is a literal, not a regex anchor
print(df.head())
bash
python scrape_table.py

Output:

text
   id        name  price
0   1  Widget A   9.99
1   2  Widget B  19.99
2   3  Widget C  29.99
...
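The row-parsing step can be exercised without a live site or pandas. This sketch runs a variant of parse_page against an inline sample table; the markup, the parse_rows name, and the numeric conversions are assumptions for illustration. It uses html.parser, which does not insert <tbody>, so the sample includes one explicitly.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, invented for this sketch.
SAMPLE = """
<table id="items">
  <thead><tr><th>ID</th><th>Name</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>1</td><td> Widget A </td><td>$9.99</td></tr>
    <tr><td>2</td><td>Widget B</td><td>$19.99</td></tr>
    <tr><td colspan="3">no data</td></tr>
  </tbody>
</table>
"""

def parse_rows(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("table#items tbody tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 3:                      # skip filler rows such as "no data"
            rows.append({
                "id": int(cells[0]),
                "name": cells[1],
                "price": float(cells[2].lstrip("$")),
            })
    return rows

rows = parse_rows(SAMPLE)
print(rows)
```

Keeping the parser a pure function of the HTML makes it easy to test against saved pages before pointing it at real traffic.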

Extract every outbound link

A one-screen helper that prints every external link on a page. Useful for link audits and broken-link checking.

python
# outbound.py
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

URL = "https://example.com"
resp = requests.get(URL, timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.content, "lxml")
host = urlparse(URL).netloc

for a in soup.find_all("a", href=True):
    full = urljoin(URL, a["href"])
    if urlparse(full).netloc and urlparse(full).netloc != host:
        print(a.get_text(strip=True), "->", full)
bash
python outbound.py

Output:

text
IANA -> https://www.iana.org/domains/example
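The same filter can be tried offline by wrapping it in a function. The outbound_links name and the sample markup are made up for this sketch; it shows how relative, mailto:, and protocol-relative links are treated.

```python
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def outbound_links(html: str, page_url: str) -> list[tuple[str, str]]:
    """Return (text, absolute URL) for links that leave page_url's host (hypothetical helper)."""
    host = urlparse(page_url).netloc
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for a in soup.find_all("a", href=True):
        full = urljoin(page_url, a["href"])
        netloc = urlparse(full).netloc       # empty for mailto:, tel:, etc.
        if netloc and netloc != host:
            found.append((a.get_text(strip=True), full))
    return found

# Hypothetical markup, invented for this sketch.
html = """
<a href="/about">About</a>
<a href="https://other.example/x">Other</a>
<a href="mailto:alice@example.com">Mail</a>
<a href="//cdn.example.net/lib.js">CDN</a>
"""
links = outbound_links(html, "https://example.com/docs/")
print(links)
# [('Other', 'https://other.example/x'), ('CDN', 'https://cdn.example.net/lib.js')]
```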

Clean an HTML email body

You receive HTML email and want plain-text body, with scripts, styles, and tracking pixels stripped. BeautifulSoup is perfect for this — walk the tree, kill the dangerous bits, return the text.

python
# clean_email.py
from bs4 import BeautifulSoup

def html_to_text(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    for tag in soup(["script", "style", "head", "title", "meta"]):
        tag.decompose()
    # Convert <br> and block elements to newlines
    for br in soup.find_all("br"):
        br.replace_with("\n")
    for blk in soup.find_all(["p", "div", "li", "tr"]):
        blk.append("\n")
    text = soup.get_text()
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())

print(html_to_text("""
<html><head><title>x</title><style>.a{}</style></head>
<body>
  <p>Hi alice,</p>
  <p>Your <b>order #1234</b> has shipped.<br>Tracking: ABC.</p>
  <script>track();</script>
</body></html>
"""))
bash
python clean_email.py

Output:

text
Hi alice,
Your order #1234 has shipped.
Tracking: ABC.

Rewrite relative URLs to absolute

You're saving a page offline, so every relative <img src> and <a href> needs to become an absolute URL.

python
# absolutise.py
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests

BASE = "https://example.com/article"
resp = requests.get(BASE, timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.content, "lxml")

for tag, attr in (("a", "href"), ("img", "src"), ("link", "href"), ("script", "src")):
    for t in soup.find_all(tag, **{attr: True}):
        t[attr] = urljoin(BASE, t[attr])

with open("/tmp/article.html", "w", encoding="utf-8") as f:
    f.write(str(soup))
print("saved /tmp/article.html")
bash
python absolutise.py

Output:

text
saved /tmp/article.html
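To see what the rewrite does to individual attributes, here is an offline sketch on a three-tag fragment. The absolutise function and the sample URLs are assumptions for illustration; only <a href> and <img src> are handled to keep it short.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def absolutise(html: str, base: str) -> BeautifulSoup:
    """Rewrite href/src attributes against base (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    for name, attr in (("a", "href"), ("img", "src")):
        # attr=True matches only tags that actually carry the attribute.
        for tag in soup.find_all(name, **{attr: True}):
            tag[attr] = urljoin(base, tag[attr])
    return soup

fragment = '<a href="b.html">b</a><img src="/i.png"><a name="top">anchor</a>'
soup = absolutise(fragment, "https://example.com/docs/a.html")

print(soup.a["href"])                  # https://example.com/docs/b.html
print(soup.img["src"])                 # https://example.com/i.png
print(soup.find_all("a")[1].attrs)     # {'name': 'top'}  (no href, left untouched)
```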

Parse an RSS / Atom feed

For XML feeds use the "xml" parser, which requires lxml to be installed; it preserves case and namespaces. feedparser is the dedicated library, but BeautifulSoup is fine for one-off parsing.

python
# feed.py
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/feed.atom", timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.content, "xml")

for entry in soup.find_all("entry"):
    title = entry.title.get_text(strip=True)
    link = entry.find("link", rel="alternate")["href"]
    print(title, "->", link)
bash
python feed.py

Output:

text
Article one -> https://example.com/a
Article two -> https://example.com/b

Respect robots.txt before scraping

urllib.robotparser is in the stdlib. Check it once per host and cache the result; refuse to fetch paths the site disallows.

python
# polite.py
from urllib.robotparser import RobotFileParser
import requests
from bs4 import BeautifulSoup

UA = "jockey-scraper/1.0 (alice@example.com)"

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

target = "https://example.com/items"
if not rp.can_fetch(UA, target):
    raise SystemExit(f"robots.txt forbids fetching {target}")

resp = requests.get(target, headers={"User-Agent": UA}, timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.content, "lxml")
print(soup.title.get_text(strip=True))
bash
python polite.py

Output:

text
Items — Example
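RobotFileParser can also be fed rules directly through parse(), which makes the check easy to try offline. The robots.txt content and the demo-bot user agent below are invented for this sketch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this sketch.
ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

UA = "demo-bot/1.0"

rp = RobotFileParser()
rp.parse(ROBOTS.splitlines())          # same parser as read(), minus the HTTP fetch

print(rp.can_fetch(UA, "https://example.com/items"))        # True
print(rp.can_fetch(UA, "https://example.com/private/x"))    # False
print(rp.crawl_delay(UA))                                   # 2
```

crawl_delay returns None when the file sets no delay for the agent, so fall back to your own default in that case.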

Scrape a JS-rendered SPA with playwright + soup

Many modern sites render content client-side. The pattern is: drive a headless browser to fully render, snapshot the resulting HTML, then run BeautifulSoup against that snapshot.

python
# spa.py
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/dashboard", wait_until="networkidle")
    page.wait_for_selector(".widget", timeout=10_000)
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, "lxml")
for w in soup.select(".widget"):
    print(w.select_one(".title").get_text(strip=True))
bash
python spa.py

Output:

text
Active users
Revenue today
Errors

Extract links from piped HTML

A throwaway script that becomes a reusable shell tool when you pipe into it: curl ... | python ...

python
# links.py
import sys
from bs4 import BeautifulSoup

soup = BeautifulSoup(sys.stdin.buffer.read(), "lxml")   # raw bytes, so encoding detection still works
for a in soup.find_all("a", href=True):
    print(a["href"])
bash
curl -s https://example.com | python links.py

Output:

text
https://www.iana.org/domains/example