cheat sheet

BeautifulSoup

Parse, search, and mutate HTML/XML with BeautifulSoup 4. Covers parser choice (html.parser/lxml/html5lib), find/find_all/select, tree navigation, attribute access, and pairing with requests/httpx/playwright for end-to-end scraping.

updated 05-25-2026

BeautifulSoup — HTML Parsing & Scraping

What it is

BeautifulSoup (package name beautifulsoup4, import bs4) is a Python library by Leonard Richardson for parsing HTML and XML into a navigable tree. It does not fetch pages — pair it with requests, httpx, or playwright for that — and it does not execute JavaScript, so JS-rendered pages need a headless browser before the soup. Its sweet spot is screen-scraping static HTML, fixing broken markup, and re-emitting cleaned-up XML/HTML. For pure speed on well-formed documents reach for lxml directly; BeautifulSoup is the friendlier API that calls into lxml (or html.parser / html5lib) under the hood.

Install

bash

# Core library
pip install beautifulsoup4

# Recommended parser (C speed, lenient)
pip install lxml

# Browser-grade parser (slow; implements the HTML5 parsing algorithm that browsers use)
pip install html5lib

# HTTP clients you'll pair it with
pip install requests
pip install httpx

Output: pip's download and install log (exits 0 on success)

Syntax

The library is a single class — BeautifulSoup(markup, parser) — plus the tree of Tag / NavigableString objects it returns. Pass an HTML/XML string or a file-like object, name the parser explicitly, and search the tree with find, find_all, or CSS selectors via select.

python

from bs4 import BeautifulSoup

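markup = '<ul><li><a class="primary" href="/">Home</a></li></ul>'   # any HTML/XML string or file-like object
soup = BeautifulSoup(markup, "lxml")           # or "html.parser", "html5lib", "xml"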
tag = soup.find("a", class_="primary")
tags = soup.find_all("li", limit=10)
hits = soup.select("div.article > h2 a[href]")

Output: (none — exits 0 on success)
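The file-like form mentioned above can be sketched as follows. This is a minimal illustration: io.BytesIO stands in for a file opened with open(path, "rb"), the markup is made up, and html.parser is used only to keep the sketch free of extra installs.

```python
import io
from bs4 import BeautifulSoup

# Stand-in for open("page.html", "rb"); the markup is illustrative.
fake_file = io.BytesIO(
    b"<html><body><a class='primary' href='/home'>Home</a></body></html>"
)

# BeautifulSoup reads the file-like object itself; no .read() call is needed.
soup = BeautifulSoup(fake_file, "html.parser")
link = soup.find("a", class_="primary")
print(link["href"])  # /home
```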

Quick example

python

# quick.py
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Hello, World</h1>
  <ul class="links">
    <li><a href="/a">First</a></li>
    <li><a href="/b">Second</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.h1.string)
for a in soup.select("ul.links a"):
    print(a.get_text(), "->", a["href"])

bash

python quick.py

Output:

text

Hello, World
First -> /a
Second -> /b

Parser choice

BeautifulSoup is a facade: the actual parsing is delegated to one of four backends. They differ in speed, leniency, and how they repair malformed HTML. Always pass the parser name explicitly; if you omit it, BeautifulSoup picks the highest-ranked parser it finds installed (lxml, then html5lib, then html.parser) and emits a warning, so the same code can build different trees on different machines.

Parser	Speed	Leniency	Notes
`"html.parser"`	medium	medium	Stdlib — no install. Fine for most jobs.
`"lxml"`	fast	lenient	Best default — install `lxml` and use it for HTML.
`"lxml-xml"` or `"xml"`	fast	strict	XML only — preserves namespaces, case-sensitive.
`"html5lib"`	slow	maximal	Builds the tree the way a browser does (HTML5 parsing algorithm); use for badly broken pages.

python

# parser_choice.py
from bs4 import BeautifulSoup

broken = "<p>one<p>two<p>three"

for parser in ("html.parser", "lxml", "html5lib"):
    soup = BeautifulSoup(broken, parser)
    print(parser, "->", [p.get_text() for p in soup.find_all("p")])

bash

python parser_choice.py

Output:

text

html.parser -> ['onetwothree', 'twothree', 'three']
lxml -> ['one', 'two', 'three']
html5lib -> ['one', 'two', 'three']
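To see what omitting the parser name does in practice, the sketch below records the warnings raised while building a soup with and without an explicit parser. The helper name parser_warnings is hypothetical, and the check simply looks for the word "parser" in the warning text; recent bs4 releases report this as GuessedAtParserWarning, but treat the exact class name as an assumption about your installed version.

```python
import warnings
from bs4 import BeautifulSoup

def parser_warnings(*args):
    """Return parser-related warnings raised while building a soup."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")      # record every warning, even repeats
        BeautifulSoup(*args)
    return [w for w in caught if "parser" in str(w.message).lower()]

implicit = parser_warnings("<p>hi</p>")                  # no parser named: bs4 guesses
explicit = parser_warnings("<p>hi</p>", "html.parser")   # parser named: nothing to guess

print(len(implicit), len(explicit))
for w in implicit:
    print(w.category.__name__)
```

In a test suite, warnings.simplefilter("error") around the constructor call is one way to turn an accidental implicit parser into a hard failure.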

find vs find_all vs select

find(name, attrs) returns the first matching tag or None. find_all(name, attrs, limit=) returns a list (default: all matches). select(css) runs a CSS selector and returns a list; select_one(css) returns the first match or None. CSS selectors are usually the most readable; reach for find / find_all when you need keyword args (string=, recursive=False, custom callables).

python

# search.py
from bs4 import BeautifulSoup

html = """
<article class="post" data-id="42">
  <h2>Title</h2>
  <p class="lede">Lead paragraph.</p>
  <p>Body paragraph.</p>
  <a class="cta primary" href="/buy">Buy</a>
  <a class="cta" href="/learn">Learn</a>
</article>
"""

soup = BeautifulSoup(html, "lxml")

print(soup.find("h2").get_text())                              # first match by tag name
print(soup.find("a", class_="primary")["href"])                # keyword match
print(soup.find("p", string="Body paragraph.").get("class"))    # exact text match

print([a["href"] for a in soup.find_all("a")])                  # every link
print([a["href"] for a in soup.find_all("a", limit=1)])         # cap results

print(soup.select_one("article > h2").text)                    # CSS direct-child
print([a.text for a in soup.select("a.cta")])                  # CSS class
print(soup.select_one('[data-id="42"]').name)                  # CSS attribute selector
print([a["href"] for a in soup.select("a[href^='/']")])        # CSS prefix match

bash

python search.py

Output:

text

Title
/buy
None
['/buy', '/learn']
['/buy']
Title
['Buy', 'Learn']
article
['/buy', '/learn']

class_ (with trailing underscore) is the keyword in find / find_all because class is a Python reserved word. With select you write the normal CSS .classname.

Searching with regex, lists, and callables

The name=, attrs=, and string= arguments to find / find_all accept more than plain strings — they take regexes, lists, dicts, booleans, and callables. This is how you express "any heading", "any link whose href starts with https://example.com/", or "any tag with both classes".

python

# advanced_search.py
import re
from bs4 import BeautifulSoup

html = """
<div>
  <h1>Top</h1>
  <h2>Mid</h2>
  <h3>Lower</h3>
  <a href="https://example.com/a">A</a>
  <a href="mailto:alice@example.com">Email</a>
  <a href="/local">Local</a>
</div>
"""

soup = BeautifulSoup(html, "lxml")

# Regex tag name — any heading h1..h6
print([t.name for t in soup.find_all(re.compile(r"^h[1-6]$"))])

# List — match any tag in the list
print([t.name for t in soup.find_all(["a", "h1"])])

# Boolean True — all tags with this attribute
print([a["href"] for a in soup.find_all("a", href=True)])

# Callable — link only if it goes off-site
def offsite(tag):
    return tag.name == "a" and tag.get("href", "").startswith("http")

print([a["href"] for a in soup.find_all(offsite)])

# Dict attrs: for attribute names that aren't valid keywords (data-*, name) or several constraints at once
print(soup.find("a", attrs={"href": re.compile(r"^mailto:")}).get_text())

bash

python advanced_search.py

Output:

text

['h1', 'h2', 'h3']
['h1', 'a', 'a', 'a']
['https://example.com/a', 'mailto:alice@example.com', '/local']
['https://example.com/a']
Email

Navigating the tree

Every node exposes parent, sibling, and descendant accessors. Use them when search-by-attribute isn't enough — e.g. "the <dd> that follows this <dt>", "the table row containing the cell with this text".

Accessor	What it returns
`tag.parent`	The immediate parent tag
`tag.parents`	An iterator from the parent up to the root
`tag.next_sibling` / `previous_sibling`	The next/previous sibling (may be whitespace `NavigableString`)
`tag.next_element` / `previous_element`	The next/previous anything (siblings, children, text)
`tag.children`	Direct children iterator
`tag.descendants`	All descendants, depth-first
`tag.contents`	Direct children as a list
`tag.strings` / `stripped_strings`	All text nodes (whitespace-stripped variant)
`tag.find_next("p")`	Next matching tag, document order
`tag.find_previous("p")`	Previous matching tag

python

# navigate.py
from bs4 import BeautifulSoup

html = """
<dl>
  <dt>name</dt><dd>alice</dd>
  <dt>email</dt><dd>alice@example.com</dd>
  <dt>role</dt><dd>admin</dd>
</dl>
"""

soup = BeautifulSoup(html, "lxml")

# Build a dict from a <dl> by pairing each <dt> with its next <dd>
data = {dt.get_text(): dt.find_next("dd").get_text() for dt in soup.find_all("dt")}
print(data)

# Walk all stripped strings under the <dl>
print(list(soup.dl.stripped_strings))

bash

python navigate.py

Output:

text

{'name': 'alice', 'email': 'alice@example.com', 'role': 'admin'}
['name', 'alice', 'email', 'alice@example.com', 'role', 'admin']
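The other case mentioned above, "the table row containing the cell with this text", can be sketched with find_parent. The table markup is made up for illustration, and html.parser is used only to avoid extra installs.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td>alice</td><td>admin</td></tr>
  <tr><td>bob</td><td>viewer</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the cell by its exact text, then climb to the enclosing row.
cell = soup.find("td", string="bob")
row = cell.find_parent("tr")
row_values = [td.get_text(strip=True) for td in row.find_all("td")]
print(row_values)  # ['bob', 'viewer']

# find_next_sibling returns the next <td> tag, ignoring any text nodes in between.
role = cell.find_next_sibling("td").get_text()
print(role)  # viewer
```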

Attribute access and text

A Tag behaves like a dict for its attributes: tag["href"] raises KeyError if the attribute is missing, while tag.get("href", default) returns the default instead. Text is read via tag.string (only when the tag has a single string child), tag.get_text(separator, strip=True) (the workhorse, which joins the text of all descendants), or tag.stripped_strings (an iterator of trimmed strings).

python

# attrs.py
from bs4 import BeautifulSoup

html = """
<a href="/a" class="btn primary" data-id="42" aria-label="Open">
  Open <span>now</span>
</a>
"""

soup = BeautifulSoup(html, "lxml")
a = soup.a

print(a["href"])                            # required attr
print(a.get("missing", "default"))           # optional with fallback
print(a["class"])                            # multi-valued attrs become lists
print(a.get("data-id"))                      # data-* attributes work the same
print(a.attrs)                                # whole attr dict

print(a.string)                              # None — multiple children
print(a.get_text(strip=True))                # "Opennow"
print(a.get_text(" ", strip=True))           # "Open now" — separator joins
print(list(a.stripped_strings))

bash

python attrs.py

Output:

text

/a
default
['btn', 'primary']
42
{'href': '/a', 'class': ['btn', 'primary'], 'data-id': '42', 'aria-label': 'Open'}
None
Opennow
Open now
['Open', 'now']

Multi-valued HTML attributes (class, rel, rev, accept-charset, headers, archive) are returned as lists, not strings. Test with "primary" in tag["class"], not tag["class"] == "primary".
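A small sketch of that membership test, using made-up markup. It also shows that single-valued attributes such as href stay plain strings, and that get("class", []) keeps the test safe on tags without a class.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a class="btn primary" href="/a" rel="nofollow noopener">Go</a><span>x</span>',
    "html.parser",
)
a = soup.a

is_primary = "primary" in a["class"]          # membership test on the list
naive_check = a["class"] == "primary"         # a list never equals a str
print(is_primary, naive_check)                # True False

print(a["rel"])                               # ['nofollow', 'noopener']
print(a["href"])                              # /a

# A tag with no class attribute: fall back to an empty list instead of a KeyError.
span_is_primary = "primary" in soup.span.get("class", [])
print(span_is_primary)                        # False
```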

Mutating and serialising

The tree is mutable. Edit attributes, replace text, insert new tags built with soup.new_tag, and call decompose() to delete a tag and everything under it. Serialise back to a string with str(soup) (compact) or soup.prettify() (indented).

python

# mutate.py
from bs4 import BeautifulSoup

html = "<article><h1>Old</h1><p>body</p><script>tracker()</script></article>"
soup = BeautifulSoup(html, "lxml")

# Rewrite the heading
soup.h1.string = "New"

# Add an attribute
soup.article["data-version"] = "2"

# Insert a new tag before <p>
note = soup.new_tag("aside", attrs={"class": "note"})
note.string = "Important!"
soup.p.insert_before(note)

# Strip every <script>
for s in soup.find_all("script"):
    s.decompose()

print(soup.prettify())

bash

python mutate.py

Output:

text

<html>
 <body>
  <article data-version="2">
   <h1>
    New
   </h1>
   <aside class="note">
    Important!
   </aside>
   <p>
    body
   </p>
  </article>
 </body>
</html>
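The example above only prints the prettify() form. As a rough sketch of the difference between the two serialisers, the snippet below compares str(soup) with prettify() on a made-up fragment and re-parses the pretty output to show that the added indentation changes the text nodes. In other words, prettify() is best treated as a display aid rather than a round-trip format.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hi <b>there</b>!</p>", "html.parser")

compact = str(soup)          # markup as stored, no added whitespace
pretty = soup.prettify()     # one node per line, indented

print(compact)               # <p>Hi <b>there</b>!</p>
print(pretty)

# Re-parsing the pretty form yields different text, because the
# indentation and newlines become part of the text nodes.
reparsed = BeautifulSoup(pretty, "html.parser")
print(repr(soup.p.get_text()))       # 'Hi there!'
print(repr(reparsed.p.get_text()))
```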

Pairing with requests

Most scraping flows are fetch → parse → extract. requests (or httpx) handles the HTTP side; BeautifulSoup handles the HTML side. Always pass an explicit timeout, raise on 4xx/5xx, and feed the bytes (not response.text) to BeautifulSoup so it can detect the encoding from the document itself.

python

# fetch_parse.py
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://httpbin.org/html", timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.content, "lxml")   # .content (bytes), not .text
print(soup.h1.get_text(strip=True))

bash

python fetch_parse.py

Output:

text

Herman Melville - Moby-Dick
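The reason for passing bytes can be sketched offline. The bytes below are a made-up Latin-1 page that declares its own charset; in a real flow they would come from resp.content. The sketch assumes html.parser only to avoid extra installs.

```python
from bs4 import BeautifulSoup

# Illustrative bytes: Latin-1 encoded, with a matching <meta charset>.
page = '<html><head><meta charset="iso-8859-1"></head><body><p>caf\u00e9</p></body></html>'
raw = page.encode("iso-8859-1")

# Bytes in: BeautifulSoup reads the declared charset and decodes correctly.
soup = BeautifulSoup(raw, "html.parser")
detected = soup.original_encoding
good_text = soup.p.get_text()
print(detected, good_text)

# Text decoded with the wrong codec before parsing is already damaged.
wrong = raw.decode("utf-8", errors="replace")
bad_text = BeautifulSoup(wrong, "html.parser").p.get_text()
print(repr(bad_text))                 # 'caf\ufffd'

# If detection ever picks the wrong codec, state it explicitly.
forced = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-1")
forced_text = forced.p.get_text()
```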

Pairing with httpx (async)

For high-throughput scraping or pages that need to be fetched concurrently, httpx.AsyncClient pairs naturally with asyncio.gather. BeautifulSoup itself is synchronous and CPU-bound, so the actual parsing won't parallelise — but the network waits will.

python

# async_scrape.py
import asyncio
import httpx
from bs4 import BeautifulSoup

URLS = [
    "https://httpbin.org/html",
    "https://example.com",
]

async def fetch(client, url):
    r = await client.get(url, timeout=10)
    r.raise_for_status()
    soup = BeautifulSoup(r.content, "lxml")
    return url, (soup.h1 or soup.title).get_text(strip=True)

async def main():
    async with httpx.AsyncClient() as client:   # http2=True also needs: pip install "httpx[http2]"
        return await asyncio.gather(*(fetch(client, u) for u in URLS))

for url, title in asyncio.run(main()):
    print(url, "->", title)

bash

python async_scrape.py

Output:

text

https://httpbin.org/html -> Herman Melville - Moby-Dick
https://example.com -> Example Domain
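Because parsing is synchronous, a large document parsed inside a coroutine holds up the event loop while it runs. One possible arrangement, sketched below, caps concurrent fetches with a semaphore and hands the parse to a worker thread via asyncio.to_thread (Python 3.9+). The fake_fetch coroutine, the PAGES dict, and the example.test URLs are hypothetical stand-ins so the sketch runs offline; in real code the fetch would be the httpx call shown above. Threads keep the loop responsive but do not make the parse itself faster.

```python
import asyncio
from bs4 import BeautifulSoup

# Hypothetical stand-in for httpx responses so the sketch runs offline.
PAGES = {
    "https://example.test/a": "<html><head><title>Page A</title></head></html>",
    "https://example.test/b": "<html><body><h1>Page B</h1></body></html>",
}

async def fake_fetch(url: str) -> str:
    await asyncio.sleep(0.01)              # simulates network latency
    return PAGES[url]

def parse_title(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    node = soup.h1 or soup.title
    return node.get_text(strip=True) if node else ""

async def fetch_and_parse(url: str, limit: asyncio.Semaphore) -> tuple[str, str]:
    async with limit:                      # cap the number of in-flight fetches
        html = await fake_fetch(url)
    # Run the synchronous parse in a worker thread so other fetches keep going.
    title = await asyncio.to_thread(parse_title, html)
    return url, title

async def main() -> list[tuple[str, str]]:
    limit = asyncio.Semaphore(2)
    return await asyncio.gather(*(fetch_and_parse(u, limit) for u in PAGES))

results = asyncio.run(main())
for url, title in results:
    print(url, "->", title)
```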

Pairing with playwright for JS-rendered pages

BeautifulSoup can't run JavaScript. If view-source: of a page shows <div id="app"></div> and nothing else, the content was rendered client-side and requests will only see the empty shell. Run the page through playwright (or selenium) first, snapshot the rendered HTML, then hand that to BeautifulSoup.

bash

pip install playwright
playwright install chromium

Output: install and browser-download progress (exits 0 on success)

python

# js_page.py
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com", wait_until="networkidle")
    rendered_html = page.content()                 # post-JavaScript DOM
    browser.close()

soup = BeautifulSoup(rendered_html, "lxml")
print(soup.title.string)

bash

python js_page.py

Output:

text

Example Domain

BeautifulSoup vs lxml direct

lxml is a Python binding to the C libraries libxml2 and libxslt; BeautifulSoup is a pure-Python wrapper that can use lxml as its parser. Choose by ergonomics first, performance second.

Concern	BeautifulSoup	lxml direct
API	Pythonic, forgiving	XPath-centric
Selectors	`find` / CSS / regex	XPath + CSS (via `cssselect`)
Speed (with `lxml` backend)	Slower (builds a Python object tree on top of the parse)	Fastest (baseline)
Broken HTML	Excellent (especially `html5lib`)	Good (libxml2 recovery)
XML namespaces	Use `"xml"` parser	First-class
Memory	Whole-tree	Streaming via `iterparse`

For one-off scraping or scripts where the markup is messy, BeautifulSoup wins. For multi-gig XML feeds or schema-heavy workflows, go straight to lxml.
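For the streaming row in the table, a minimal sketch of lxml.etree.iterparse looks like this. It assumes lxml is installed; the tiny in-memory feed and its item/id element names are made up, and a real feed would be opened from disk or a response stream.

```python
import io
from lxml import etree

# Illustrative feed built in memory; real input would be a file or stream.
xml = b"<feed>" + b"".join(b"<item><id>%d</id></item>" % i for i in range(3)) + b"</feed>"

ids = []
# Only "end" events for <item> are delivered, so each element is complete when seen.
for _event, elem in etree.iterparse(io.BytesIO(xml), events=("end",), tag="item"):
    ids.append(elem.findtext("id"))
    elem.clear()          # drop the processed subtree to keep memory flat

print(ids)  # ['0', '1', '2']
```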

Common pitfalls

Implicit parser selection — BeautifulSoup(html) (no parser) picks the highest-ranked backend that happens to be installed (lxml, then html5lib, then html.parser) and warns about it. Always pass the parser name explicitly so behaviour doesn't change when lxml is or isn't installed.
Feeding response.text instead of response.content — .text is decoded by requests using the charset from the Content-Type header, falling back to ISO-8859-1 or a statistical guess; .content is bytes and lets BeautifulSoup sniff the document's own encoding (<meta charset>). Many encoding bugs trace back to this.
class_ vs class — find(class_="foo") (with underscore) in Python; in CSS selectors it's .foo. Forgetting the underscore yields a SyntaxError.
Multi-class matching — find("div", class_="a b") looks for the literal class string "a b". To match a tag with both classes use the CSS selector form: soup.select("div.a.b").
tag.string returns None for multi-child tags — <p>hi <b>there</b></p>.string is None. Use get_text() for the whole subtree.
Whitespace-only siblings — tag.next_sibling often returns a NavigableString of whitespace. Use tag.find_next_sibling("name") to skip them, or filter if isinstance(sib, Tag).
Mutating during iteration — calling .decompose() inside a for tag in soup.find_all(...) loop is safe (find_all returns a list), but mutating inside an iterator from .descendants is not.
Robots.txt and rate limits — BeautifulSoup will happily parse anything; the server is the constraint. Respect robots.txt, throttle (time.sleep), set a User-Agent that identifies you, and honour Retry-After.
Anti-scraping defences — JavaScript challenges, Cloudflare interstitials, fingerprinting. If you hit a wall with requests + BeautifulSoup, the next stop is playwright (or curl_cffi for TLS fingerprint impersonation), not more soup.
Encoding mojibake — soup.original_encoding reports what BeautifulSoup decided to use. If it's wrong, pass from_encoding="utf-8" explicitly.
select is CSS, not XPath — soupsieve adds non-standard pseudo-classes such as :-soup-contains(...) (the older :contains(...) spelling is deprecated). XPath (//div[@class='foo']) is not supported; switch to lxml for that.
Memory on huge pages — BeautifulSoup loads the entire document into a tree of Python objects. For very large inputs (hundreds of megabytes and up), use lxml.etree.iterparse to stream instead.
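Three of the pitfalls above (multi-class matching, tag.string on multi-child tags, and whitespace siblings) can be reproduced in a few lines. The markup is made up for illustration and html.parser is used only to avoid extra installs.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """
<div class="a b">both</div>
<div class="b a">reversed</div>
<div class="a">only a</div>
<p>hi <b>there</b></p>
"""
soup = BeautifulSoup(html, "html.parser")

# Multi-class matching: the string form is order-sensitive, the CSS form is not.
exact = [d.get_text() for d in soup.find_all("div", class_="a b")]
css = [d.get_text() for d in soup.select("div.a.b")]
print(exact)   # ['both']
print(css)     # ['both', 'reversed']

# tag.string is None when a tag has more than one child.
print(soup.p.string)       # None
print(soup.p.get_text())   # hi there

# next_sibling may be a whitespace text node; find_next_sibling skips to a tag.
first = soup.div
print(repr(first.next_sibling))                      # '\n'
print(first.find_next_sibling("div").get_text())     # reversed
tag_siblings = [s.get_text() for s in first.next_siblings if isinstance(s, Tag)]
print(tag_siblings)        # ['reversed', 'only a', 'hi there']
```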

Real-world recipes

Scrape a paginated table into a pandas DataFrame

A common end-to-end flow: pull every page of an HTML table, normalise rows, and hand the result to pandas for analysis. pandas.read_html works for trivial tables but fails on dynamic markup; pairing BeautifulSoup with pandas.DataFrame is more flexible.

python

# scrape_table.py
import time
import requests
from bs4 import BeautifulSoup
import pandas as pd

BASE = "https://example.com/items?page={page}"

def parse_page(html: bytes) -> list[dict[str, str]]:
    soup = BeautifulSoup(html, "lxml")
    rows = []
    for tr in soup.select("table#items tbody tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 3:
            rows.append({"id": cells[0], "name": cells[1], "price": cells[2]})
    return rows

all_rows: list[dict[str, str]] = []
session = requests.Session()
session.headers["User-Agent"] = "jockey-scraper/1.0 (alice@example.com)"

for page in range(1, 6):
    resp = session.get(BASE.format(page=page), timeout=10)
    resp.raise_for_status()
    all_rows.extend(parse_page(resp.content))
    time.sleep(1.0)                    # be polite

df = pd.DataFrame(all_rows)
df["price"] = df["price"].str.replace("$", "", regex=False).astype(float)   # literal "$", not a regex anchor
print(df.head())

bash

python scrape_table.py

Output:

text

   id        name  price
0   1  Widget A   9.99
1   2  Widget B  19.99
2   3  Widget C  29.99
...
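Since the recipe above targets a placeholder URL, here is an offline sketch of the row-normalising step on literal markup. It is a variation, not the recipe itself: the hypothetical table_to_rows helper takes column names from the <th> cells instead of hard-coding positions, and skips rows whose cell count does not match the header (spacer or colspan rows). The table contents are made up.

```python
from bs4 import BeautifulSoup

# Illustrative markup; the id "items" matches the selector used in the recipe.
html = """
<table id="items">
  <thead><tr><th>ID</th><th>Name</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Widget A</td><td>$9.99</td></tr>
    <tr><td>2</td><td>Widget B</td><td>$19.99</td></tr>
    <tr><td colspan="3">No more items</td></tr>
  </tbody>
</table>
"""

def table_to_rows(markup: str) -> list[dict[str, str]]:
    soup = BeautifulSoup(markup, "html.parser")
    headers = [th.get_text(strip=True).lower()
               for th in soup.select("table#items thead th")]
    rows = []
    for tr in soup.select("table#items tbody tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == len(headers):          # skip spacer or colspan rows
            rows.append(dict(zip(headers, cells)))
    return rows

rows = table_to_rows(html)
prices = [float(r["price"].lstrip("$")) for r in rows]
print(rows)
print(prices)   # [9.99, 19.99]
```

The resulting list of dicts can be passed to pd.DataFrame(rows) exactly as in the recipe.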

Extract all outbound links from a page

A one-screen helper that prints every external link on a page. Useful for link audits and broken-link checking.

python

# outbound.py
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

URL = "https://example.com"
resp = requests.get(URL, timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.content, "lxml")
host = urlparse(URL).netloc

for a in soup.find_all("a", href=True):
    full = urljoin(URL, a["href"])
    if urlparse(full).netloc and urlparse(full).netloc != host:
        print(a.get_text(strip=True), "->", full)

bash

python outbound.py

Output:

text

IANA -> https://www.iana.org/domains/example

Clean an HTML email body

You receive HTML email and want the plain-text body, with scripts, styles, and tracking pixels stripped. BeautifulSoup suits this well: walk the tree, drop the unwanted elements, return the text.

python

# clean_email.py
from bs4 import BeautifulSoup

def html_to_text(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    for tag in soup(["script", "style", "head", "title", "meta"]):
        tag.decompose()
    # Convert <br> and block elements to newlines
    for br in soup.find_all("br"):
        br.replace_with("\n")
    for blk in soup.find_all(["p", "div", "li", "tr"]):
        blk.append("\n")
    text = soup.get_text()
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())

print(html_to_text("""
<html><head><title>x</title><style>.a{}</style></head>
<body>
  <p>Hi alice,</p>
  <p>Your <b>order #1234</b> has shipped.<br>Tracking: ABC.</p>
  <script>track();</script>
</body></html>
"""))

bash

python clean_email.py

Output:

text

Hi alice,
Your order #1234 has shipped.
Tracking: ABC.

Rewrite every image URL to absolute

You're saving a page offline, so every relative <img src> and <a href> needs to become an absolute URL.

python

# absolutise.py
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests

BASE = "https://example.com/article"
resp = requests.get(BASE, timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.content, "lxml")

for tag, attr in (("a", "href"), ("img", "src"), ("link", "href"), ("script", "src")):
    for t in soup.find_all(tag, **{attr: True}):
        t[attr] = urljoin(BASE, t[attr])

with open("/tmp/article.html", "w", encoding="utf-8") as f:
    f.write(str(soup))
print("saved /tmp/article.html")

bash

python absolutise.py

Output:

text

saved /tmp/article.html

Parse an RSS / Atom feed

For XML feeds use the xml parser — it preserves case and namespaces. feedparser is the dedicated library, but BeautifulSoup is fine for one-off parsing.

python

# feed.py
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/feed.atom", timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.content, "xml")

for entry in soup.find_all("entry"):
    title = entry.title.get_text(strip=True)
    link_tag = entry.find("link", rel="alternate") or entry.find("link")   # rel is optional in Atom
    link = link_tag["href"]
    print(title, "->", link)

bash

python feed.py

Output:

text

Article one -> https://example.com/a
Article two -> https://example.com/b
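The example above handles Atom. For RSS 2.0 the link is element text rather than an href attribute, which is also where the choice of the xml parser matters: the HTML parsers treat <link> as a void element and lose its text. The sketch below uses a made-up inline feed and assumes lxml is installed (the "xml" parser depends on it).

```python
import warnings
from bs4 import BeautifulSoup

# Illustrative RSS 2.0 document; a real one would come from resp.content.
rss = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>Article one</title><link>https://example.com/a</link></item>
    <item><title>Article two</title><link>https://example.com/b</link></item>
  </channel>
</rss>"""

soup = BeautifulSoup(rss, "xml")
items = [
    (item.title.get_text(strip=True), item.link.get_text(strip=True))
    for item in soup.find_all("item")
]
for title, link in items:
    print(title, "->", link)

# Same bytes through an HTML parser: <link> is closed immediately, so its text is lost.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")     # bs4 may warn about XML parsed as HTML
    html_soup = BeautifulSoup(rss, "html.parser")
html_link_text = html_soup.find("item").link.get_text()
print(repr(html_link_text))             # ''
```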

Respect robots.txt before scraping

urllib.robotparser is in the stdlib. Check it once per host and cache the result; refuse to fetch paths the site disallows.

python

# polite.py
from urllib.robotparser import RobotFileParser
import requests
from bs4 import BeautifulSoup

UA = "jockey-scraper/1.0 (alice@example.com)"

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

target = "https://example.com/items"
if not rp.can_fetch(UA, target):
    raise SystemExit(f"robots.txt forbids fetching {target}")

resp = requests.get(target, headers={"User-Agent": UA}, timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.content, "lxml")
print(soup.title.get_text(strip=True))

bash

python polite.py

Output:

text

Items — Example
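The script above reads robots.txt once for a single target. The "check once per host and cache" advice could be sketched as below. The helper names robots_for and allowed, the module-level cache, and the demo-bot user agent are hypothetical; the demonstration seeds the cache from literal rules via RobotFileParser.parse so it runs offline, whereas real use would go through the rp.read() branch.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

_robots_cache: dict[str, RobotFileParser] = {}   # one parser per scheme://host

def robots_for(url: str) -> RobotFileParser:
    """Return the cached parser for the URL's host, fetching robots.txt on first use."""
    parts = urlsplit(url)
    origin = f"{parts.scheme}://{parts.netloc}"
    if origin not in _robots_cache:
        rp = RobotFileParser(f"{origin}/robots.txt")
        rp.read()                                  # single network fetch per host
        _robots_cache[origin] = rp
    return _robots_cache[origin]

def allowed(url: str, user_agent: str) -> bool:
    return robots_for(url).can_fetch(user_agent, url)

# Offline demonstration: seed the cache from literal rules instead of the network.
seeded = RobotFileParser()
seeded.parse(["User-agent: *", "Disallow: /private/"])
_robots_cache["https://example.com"] = seeded

print(allowed("https://example.com/items", "demo-bot"))       # True
print(allowed("https://example.com/private/x", "demo-bot"))   # False
```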

Scrape a JS-rendered SPA with playwright + soup

Many modern sites render content client-side. The pattern is: drive a headless browser to fully render, snapshot the resulting HTML, then run BeautifulSoup against that snapshot.

python

# spa.py
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/dashboard", wait_until="networkidle")
    page.wait_for_selector(".widget", timeout=10_000)
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, "lxml")
for w in soup.select(".widget"):
    print(w.select_one(".title").get_text(strip=True))

bash

python spa.py

Output:

text

Active users
Revenue today
Errors

One-line CLI: extract every link from stdin

A throwaway script that becomes a reusable shell tool with curl ... | python ....

python

# links.py
import sys
from bs4 import BeautifulSoup

soup = BeautifulSoup(sys.stdin.buffer.read(), "lxml")   # raw bytes, so bs4 can detect the encoding
for a in soup.find_all("a", href=True):
    print(a["href"])

bash

curl -s https://example.com | python links.py

Output:

text

https://www.iana.org/domains/example