cheat sheet
BeautifulSoup
Parse, search, and mutate HTML/XML with BeautifulSoup 4. Covers parser choice (html.parser/lxml/html5lib), find/find_all/select, tree navigation, attribute access, and pairing with requests/httpx/playwright for end-to-end scraping.
BeautifulSoup — HTML Parsing & Scraping
What it is
BeautifulSoup (package name beautifulsoup4, import bs4) is a Python library by Leonard Richardson for parsing HTML and XML into a navigable tree. It does not fetch pages — pair it with requests, httpx, or playwright for that — and it does not execute JavaScript, so JS-rendered pages need a headless browser before the soup. Its sweet spot is screen-scraping static HTML, fixing broken markup, and re-emitting cleaned-up XML/HTML. For pure speed on well-formed documents reach for lxml directly; BeautifulSoup is the friendlier API that calls into lxml (or html.parser / html5lib) under the hood.
Install
# Core library
pip install beautifulsoup4
# Recommended parser (C speed, lenient)
pip install lxml
# Browser-grade parser (slow, but builds the tree the way the HTML5 spec and browsers do)
pip install html5lib
# HTTP clients you'll pair it with
pip install requests
pip install httpx
Output: (pip prints its install log; exits 0 on success)
Syntax
The library is a single class — BeautifulSoup(markup, parser) — plus the tree of Tag / NavigableString objects it returns. Pass an HTML/XML string or a file-like object, name the parser explicitly, and search the tree with find, find_all, or CSS selectors via select.
from bs4 import BeautifulSoup
soup = BeautifulSoup(markup, "lxml") # or "html.parser", "html5lib", "xml"
tag = soup.find("a", class_="primary")
tags = soup.find_all("li", limit=10)
hits = soup.select("div.article > h2 a[href]")
Quick example
# quick.py
from bs4 import BeautifulSoup
html = """
<html><body>
<h1>Hello, World</h1>
<ul class="links">
<li><a href="/a">First</a></li>
<li><a href="/b">Second</a></li>
</ul>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")
print(soup.h1.string)
for a in soup.select("ul.links a"):
    print(a.get_text(), "->", a["href"])
python quick.py
Output:
Hello, World
First -> /a
Second -> /b
Parser choice
BeautifulSoup is a facade: the actual parsing is delegated to one of four parser options backed by three libraries. They differ in speed, leniency, and how they handle malformed HTML. Always pass the parser name explicitly; if you omit it, BeautifulSoup picks the best parser it finds installed (lxml, then html5lib, then html.parser) and warns about the guess, so the same script can build a different tree on another machine.
| Parser | Speed | Leniency | Notes |
|---|---|---|---|
| "html.parser" | medium | medium | Stdlib, no install. Fine for most jobs. |
| "lxml" | fast | lenient | Best default: install lxml and use it for HTML. |
| "lxml-xml" or "xml" | fast | strict | XML only; preserves namespaces, case-sensitive. |
| "html5lib" | slow | maximal | Follows the HTML5 parsing algorithm that browsers use; use for badly broken pages. |
# parser_choice.py
from bs4 import BeautifulSoup
broken = "<p>one<p>two<p>three"
for parser in ("html.parser", "lxml", "html5lib"):
    soup = BeautifulSoup(broken, parser)
    print(parser, "->", [p.get_text() for p in soup.find_all("p")])
python parser_choice.py
Output:
html.parser -> ['onetwothree', 'twothree', 'three']
lxml -> ['one', 'two', 'three']
html5lib -> ['one', 'two', 'three']
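To see the implicit choice surface in practice, here is a minimal sketch that records the warning BeautifulSoup raises when no parser is named. It assumes a bs4 release that exports GuessedAtParserWarning; the helper name parser_warnings is made up for illustration. Whether the warning fires can depend on how the script is run, so the sketch prints counts rather than promising them.

```python
import warnings

from bs4 import BeautifulSoup, GuessedAtParserWarning


def parser_warnings(markup, parser=None):
    """Parse markup and return the 'guessed parser' warning messages, if any."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is filtered out
        if parser is None:
            BeautifulSoup(markup)  # deliberately no parser argument
        else:
            BeautifulSoup(markup, parser)
    return [
        str(w.message)
        for w in caught
        if issubclass(w.category, GuessedAtParserWarning)
    ]


implicit = parser_warnings("<p>hi</p>")
explicit = parser_warnings("<p>hi</p>", "html.parser")
print("warnings without a parser name:", len(implicit))
print("warnings with a parser name:", len(explicit))
```

Naming the parser is the fix; silencing the warning would only hide the environment dependency.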
find vs find_all vs select
find(name, attrs) returns the first matching tag or None. find_all(name, attrs, limit=) returns a list (default: all matches). select(css) runs a CSS selector and returns a list; select_one(css) returns the first match or None. CSS selectors are usually the most readable; reach for find / find_all when you need keyword args (string=, recursive=False, custom callables).
# search.py
from bs4 import BeautifulSoup
html = """
<article class="post" data-id="42">
<h2>Title</h2>
<p class="lede">Lead paragraph.</p>
<p>Body paragraph.</p>
<a class="cta primary" href="/buy">Buy</a>
<a class="cta" href="/learn">Learn</a>
</article>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.find("h2").get_text()) # first match by tag name
print(soup.find("a", class_="primary")["href"]) # keyword match
print(soup.find("p", string="Body paragraph.").get("class")) # exact text match
print([a["href"] for a in soup.find_all("a")]) # every link
print([a["href"] for a in soup.find_all("a", limit=1)]) # cap results
print(soup.select_one("article > h2").text) # CSS direct-child
print([a.text for a in soup.select("a.cta")]) # CSS class
print(soup.select_one('[data-id="42"]').name) # CSS attribute selector
print([a["href"] for a in soup.select("a[href^='/']")]) # CSS prefix match
python search.py
Output:
Title
/buy
None
['/buy', '/learn']
['/buy']
Title
['Buy', 'Learn']
article
['/buy', '/learn']
class_ (with trailing underscore) is the keyword in find / find_all because class is a Python reserved word. With select you write the normal CSS .classname.
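The keyword arguments mentioned earlier, recursive=False and string=, are easiest to grasp side by side. The sketch below uses a made-up snippet and the stdlib html.parser backend so it runs without extra installs.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<div id="root">'
    "<p>top</p>"
    "<section><p>nested</p></section>"
    "<p>price: 42</p>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
root = soup.find("div", id="root")

# Default search walks every descendant of root
all_p = [p.get_text() for p in root.find_all("p")]

# recursive=False only looks at root's direct children
direct_p = [p.get_text() for p in root.find_all("p", recursive=False)]

# string= accepts a regex and matches against the tag's own text
priced = root.find("p", string=re.compile(r"\d+"))

print(all_p)
print(direct_p)
print(priced.get_text())
```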
Searching with regex, lists, and callables
The name=, attrs=, and string= arguments to find / find_all accept more than plain strings — they take regexes, lists, dicts, booleans, and callables. This is how you express "any heading", "any link whose href starts with https://example.com/", or "any tag with both classes".
# advanced_search.py
import re
from bs4 import BeautifulSoup
html = """
<div>
<h1>Top</h1>
<h2>Mid</h2>
<h3>Lower</h3>
<a href="https://example.com/a">A</a>
<a href="mailto:alice@example.com">Email</a>
<a href="/local">Local</a>
</div>
"""
soup = BeautifulSoup(html, "lxml")
# Regex tag name — any heading h1..h6
print([t.name for t in soup.find_all(re.compile(r"^h[1-6]$"))])
# List — match any tag in the list
print([t.name for t in soup.find_all(["a", "h1"])])
# Boolean True — all tags with this attribute
print([a["href"] for a in soup.find_all("a", href=True)])
# Callable: match a link only if its href is an absolute http(s) URL
def offsite(tag):
    return tag.name == "a" and tag.get("href", "").startswith("http")
print([a["href"] for a in soup.find_all(offsite)])
# Dict attrs: constraints passed as a dict; values may be regexes
print(soup.find("a", attrs={"href": re.compile(r"^mailto:")}).get_text())
python advanced_search.py
Output:
['h1', 'h2', 'h3']
['h1', 'a', 'a', 'a']
['https://example.com/a', 'mailto:alice@example.com', '/local']
['https://example.com/a']
Email
Navigating the tree
Every node exposes parent, sibling, and descendant accessors. Use them when search-by-attribute isn't enough — e.g. "the <dd> that follows this <dt>", "the table row containing the cell with this text".
| Accessor | What it returns |
|---|---|
| tag.parent | The immediate parent tag |
| tag.parents | An iterator from the parent up to the root |
| tag.next_sibling / previous_sibling | The next/previous sibling (may be whitespace NavigableString) |
| tag.next_element / previous_element | The next/previous anything (siblings, children, text) |
| tag.children | Direct children iterator |
| tag.descendants | All descendants, depth-first |
| tag.contents | Direct children as a list |
| tag.strings / stripped_strings | All text nodes (whitespace-stripped variant) |
| tag.find_next("p") | Next matching tag, document order |
| tag.find_previous("p") | Previous matching tag |
# navigate.py
from bs4 import BeautifulSoup
html = """
<dl>
<dt>name</dt><dd>alice</dd>
<dt>email</dt><dd>alice@example.com</dd>
<dt>role</dt><dd>admin</dd>
</dl>
"""
soup = BeautifulSoup(html, "lxml")
# Build a dict from a <dl> by pairing each <dt> with its next <dd>
data = {dt.get_text(): dt.find_next("dd").get_text() for dt in soup.find_all("dt")}
print(data)
# Walk all stripped strings under the <dl>
print(list(soup.dl.stripped_strings))
python navigate.py
Output:
{'name': 'alice', 'email': 'alice@example.com', 'role': 'admin'}
['name', 'alice', 'email', 'alice@example.com', 'role', 'admin']
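The sibling accessors deserve a closer look, because next_sibling often lands on whitespace rather than a tag. A small sketch on made-up markup, again using html.parser so the tree has no added html/body wrapper:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = """<ul>
<li id="first">one</li>
<li>two</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")
first = soup.find("li", id="first")

# The raw sibling is the newline between the two <li> tags
raw = first.next_sibling
print(type(raw).__name__, repr(str(raw)))

# find_next_sibling skips text nodes and returns the next matching tag
real = first.find_next_sibling("li")
print(real.get_text())

# .parents walks upward; the last entry is the BeautifulSoup object itself
ancestors = [p.name for p in first.parents]
print(ancestors)
```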
Attribute access and text
A Tag behaves like a dict for its attributes: tag["href"] raises KeyError if the attribute is missing, while tag.get("href", default) returns the default instead. Text is read via tag.string (set only when the tag has exactly one child), tag.get_text(separator, strip) (the workhorse, which joins the text of all descendants), or tag.stripped_strings (an iterator of trimmed strings).
# attrs.py
from bs4 import BeautifulSoup
html = """
<a href="/a" class="btn primary" data-id="42" aria-label="Open">
Open <span>now</span>
</a>
"""
soup = BeautifulSoup(html, "lxml")
a = soup.a
print(a["href"]) # required attr
print(a.get("missing", "default")) # optional with fallback
print(a["class"]) # multi-valued attrs become lists
print(a.get("data-id")) # data-* attributes work the same
print(a.attrs) # whole attr dict
print(a.string) # None — multiple children
print(a.get_text(strip=True)) # "Opennow"
print(a.get_text(" ", strip=True)) # "Open now" — separator joins
print(list(a.stripped_strings))
python attrs.py
Output:
/a
default
['btn', 'primary']
42
{'href': '/a', 'class': ['btn', 'primary'], 'data-id': '42', 'aria-label': 'Open'}
None
Opennow
Open now
['Open', 'now']
Multi-valued HTML attributes (class, rel, rev, accept-charset, headers, archive) are returned as lists, not strings. Test with "primary" in tag["class"], not tag["class"] == "primary".
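A short sketch of that membership test, on a made-up tag parsed with html.parser. It also shows that single-valued attributes such as id stay plain strings.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="btn primary" id="go" href="/a">Go</a>', "html.parser")
a = soup.a

is_primary = "primary" in a["class"]    # membership test on the list
wrong_check = a["class"] == "primary"   # list compared to str: never equal
class_string = " ".join(a["class"])     # rebuild the original attribute text
id_value = a["id"]                      # single-valued attribute: a plain str
no_classes = soup.new_tag("span").get("class", [])  # safe default when absent

print(is_primary, wrong_check)
print(class_string, id_value, no_classes)
```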
Mutating and serialising
The tree is mutable. Edit attributes, replace text, insert new tags built with soup.new_tag, and call decompose() to delete a tag and everything under it. Serialise back to a string with str(soup) (compact) or soup.prettify() (indented).
# mutate.py
from bs4 import BeautifulSoup
html = "<article><h1>Old</h1><p>body</p><script>tracker()</script></article>"
soup = BeautifulSoup(html, "lxml")
# Rewrite the heading
soup.h1.string = "New"
# Add an attribute
soup.article["data-version"] = "2"
# Insert a new tag before <p>
note = soup.new_tag("aside", **{"class": "note"})
note.string = "Important!"
soup.p.insert_before(note)
# Strip every <script>
for s in soup.find_all("script"):
    s.decompose()
print(soup.prettify())
python mutate.py
Output:
<html>
 <body>
  <article data-version="2">
   <h1>
    New
   </h1>
   <aside class="note">
    Important!
   </aside>
   <p>
    body
   </p>
  </article>
 </body>
</html>
Pairing with requests
Most scraping flows are fetch → parse → extract. requests (or httpx) handles the HTTP side; BeautifulSoup handles the HTML side. Always pass an explicit timeout, raise on 4xx/5xx, and feed the bytes (not response.text) to BeautifulSoup so it can detect the encoding from the document itself.
# fetch_parse.py
import requests
from bs4 import BeautifulSoup
resp = requests.get("https://httpbin.org/html", timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.content, "lxml") # .content (bytes), not .text
print(soup.h1.get_text(strip=True))
python fetch_parse.py
Output:
Herman Melville - Moby-Dick
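The bytes-versus-text advice can be checked offline. This sketch feeds BeautifulSoup a hand-written ISO-8859-1 document (made up for illustration) as bytes, then repeats the parse after decoding it with the wrong codec, which is roughly what happens when an HTTP client guesses the charset badly.

```python
from bs4 import BeautifulSoup

# A Latin-1 encoded page that declares its own charset; \xe9 is "é" in Latin-1
raw = (
    b'<html><head><meta charset="iso-8859-1"></head>'
    b"<body><p>caf\xe9</p></body></html>"
)

# Bytes in: BeautifulSoup reads the <meta charset> and decodes accordingly
soup = BeautifulSoup(raw, "html.parser")
detected = soup.original_encoding
good_text = soup.p.get_text()
print(detected, good_text)

# Text in: the damage is done before BeautifulSoup ever sees the document
already_decoded = raw.decode("utf-8", errors="replace")
soup_wrong = BeautifulSoup(already_decoded, "html.parser")
bad_text = soup_wrong.p.get_text()
print(soup_wrong.original_encoding, repr(bad_text))
```

original_encoding is None in the second parse because the input was already a str, so there was nothing left to detect.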
Pairing with httpx (async)
For high-throughput scraping or pages that need to be fetched concurrently, httpx.AsyncClient pairs naturally with asyncio.gather. BeautifulSoup itself is synchronous and CPU-bound, so the actual parsing won't parallelise — but the network waits will.
# async_scrape.py
import asyncio
import httpx
from bs4 import BeautifulSoup
URLS = [
"https://httpbin.org/html",
"https://example.com",
]
async def fetch(client, url):
    r = await client.get(url, timeout=10)
    r.raise_for_status()
    soup = BeautifulSoup(r.content, "lxml")
    return url, (soup.h1 or soup.title).get_text(strip=True)
async def main():
    # http2=True needs the optional extra: pip install "httpx[http2]"
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*(fetch(client, u) for u in URLS))
for url, title in asyncio.run(main()):
    print(url, "->", title)
python async_scrape.py
Output:
https://httpbin.org/html -> Herman Melville - Moby-Dick
https://example.com -> Example Domain
Pairing with playwright for JS-rendered pages
BeautifulSoup can't run JavaScript. If view-source: of a page shows <div id="app"></div> and nothing else, the content was rendered client-side and requests will only see the empty shell. Run the page through playwright (or selenium) first, snapshot the rendered HTML, then hand that to BeautifulSoup.
pip install playwright
playwright install chromium
Output: (install and browser-download logs; exits 0 on success)
# js_page.py
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com", wait_until="networkidle")
    rendered_html = page.content()  # post-JavaScript DOM
    browser.close()
soup = BeautifulSoup(rendered_html, "lxml")
print(soup.title.string)
python js_page.py
Output:
Example Domain
BeautifulSoup vs lxml direct
lxml is a Python binding to the C libraries libxml2 and libxslt; BeautifulSoup is a pure-Python layer that can use lxml as its parser. Choose by ergonomics first, performance second.
| Concern | BeautifulSoup | lxml direct |
|---|---|---|
| API | Pythonic, forgiving | XPath-centric |
| Selectors | find / CSS / regex | XPath + CSS (via cssselect) |
| Speed (with lxml backend) | Slower: builds a Python object for every node | Baseline |
| Broken HTML | Excellent (especially html5lib) | Good (libxml2 recovery) |
| XML namespaces | Use "xml" parser | First-class |
| Memory | Whole-tree | Streaming via iterparse |
For one-off scraping or scripts where the markup is messy, BeautifulSoup wins. For multi-gig XML feeds or schema-heavy workflows, go straight to lxml.
Common pitfalls
- Implicit parser selection: BeautifulSoup(html) (no parser) picks whichever backend happens to be installed. Always pass the parser name explicitly so behaviour doesn't change when lxml is or isn't installed.
- Feeding response.text instead of response.content: .text decodes using the HTTP header charset or, failing that, a guess; .content is bytes and lets BeautifulSoup sniff the document's own encoding (<meta charset>). Encoding bugs often trace back to this.
- class_ vs class: find(class_="foo") (with underscore) in Python; in CSS selectors it's .foo. Forgetting the underscore yields a SyntaxError.
- Multi-class matching: find("div", class_="a b") looks for the literal class string "a b". To match a tag with both classes use the CSS selector form: soup.select("div.a.b").
- tag.string returns None for multi-child tags: <p>hi <b>there</b></p>.string is None. Use get_text() for the whole subtree.
- Whitespace-only siblings: tag.next_sibling often returns a NavigableString of whitespace. Use tag.find_next_sibling("name") to skip them, or filter with if isinstance(sib, Tag).
- Mutating during iteration: calling .decompose() inside a for tag in soup.find_all(...) loop is safe (find_all returns a list), but mutating while iterating the generator from .descendants is not.
- Robots.txt and rate limits: BeautifulSoup will happily parse anything; the server is the constraint. Respect robots.txt, throttle (time.sleep), set a User-Agent that identifies you, and honour Retry-After.
- Anti-scraping defences: JavaScript challenges, Cloudflare interstitials, fingerprinting. If you hit a wall with requests + BeautifulSoup, the next stop is playwright (or curl_cffi for TLS fingerprint impersonation), not more soup.
- Encoding mojibake: soup.original_encoding reports what BeautifulSoup decided to use. If it's wrong, pass from_encoding="utf-8" explicitly.
- select is CSS, not XPath: the non-standard text-matching pseudo-class is supported through soupsieve, spelled :-soup-contains(...) in current releases (the older :contains(...) spelling is deprecated). XPath (//div[@class='foo']) is not supported; switch to lxml for that.
- Memory on huge pages: BeautifulSoup loads the entire DOM. For multi-megabyte HTML/XML, use lxml.etree.iterparse to stream instead.
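Three of these pitfalls (multi-class matching, tag.string on multi-child tags, and mutating while looping) are easy to reproduce in a few lines. The markup below is made up for illustration and parsed with html.parser.

```python
from bs4 import BeautifulSoup

html = """
<div class="a b">both, in order</div>
<div class="b a">both, reversed</div>
<div class="a">only a</div>
<p>hi <b>there</b></p>
<span class="ad">x</span><span class="ad">y</span><em>keep</em>
"""
soup = BeautifulSoup(html, "html.parser")

# class_="a b" matches the literal attribute string, so order matters
exact = [d.get_text() for d in soup.find_all("div", class_="a b")]
# The CSS form matches any tag carrying both classes, in any order
both = [d.get_text() for d in soup.select("div.a.b")]
print(exact)
print(both)

# .string is None when a tag has more than one child; get_text() still works
p_string = soup.p.string
p_text = soup.p.get_text()
print(p_string, "|", p_text)

# find_all returns a list, so deleting inside the loop is safe
for ad in soup.find_all("span", class_="ad"):
    ad.decompose()
remaining_spans = soup.find_all("span")
print(len(remaining_spans), soup.em.get_text())
```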
Real-world recipes
Scrape a paginated table into a pandas DataFrame
A common end-to-end flow: pull every page of an HTML table, normalise rows, and hand the result to pandas for analysis. pandas.read_html works for trivial tables but fails on dynamic markup; pairing BeautifulSoup with pandas.DataFrame is more flexible.
# scrape_table.py
import time
import requests
from bs4 import BeautifulSoup
import pandas as pd
BASE = "https://example.com/items?page={page}"
def parse_page(html: bytes) -> list[dict[str, str]]:
    soup = BeautifulSoup(html, "lxml")
    rows = []
    for tr in soup.select("table#items tbody tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 3:
            rows.append({"id": cells[0], "name": cells[1], "price": cells[2]})
    return rows
all_rows: list[dict[str, str]] = []
session = requests.Session()
session.headers["User-Agent"] = "jockey-scraper/1.0 (alice@example.com)"
for page in range(1, 6):
    resp = session.get(BASE.format(page=page), timeout=10)
    resp.raise_for_status()
    all_rows.extend(parse_page(resp.content))
    time.sleep(1.0)  # be polite
df = pd.DataFrame(all_rows)
# regex=False: treat "$" as a literal character, not the end-of-string anchor
df["price"] = df["price"].str.replace("$", "", regex=False).astype(float)
print(df.head())
python scrape_table.py
Output:
id name price
0 1 Widget A 9.99
1 2 Widget B 19.99
2 3 Widget C 29.99
...
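The row-extraction step can be tried without a network or pandas. The sketch below is a variant, not the recipe's exact code: it derives the dict keys from the table's own header row, and the SAMPLE markup and the table_to_rows helper are made up for illustration.

```python
from bs4 import BeautifulSoup

SAMPLE = """
<table id="items">
  <thead><tr><th>ID</th><th>Name</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Widget A</td><td>$9.99</td></tr>
    <tr><td>2</td><td>Widget B</td><td>$19.99</td></tr>
    <tr><td colspan="3">No more rows</td></tr>
  </tbody>
</table>
"""


def table_to_rows(html: str, selector: str = "table#items") -> list[dict[str, str]]:
    """Turn one HTML table into a list of dicts keyed by lower-cased header text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one(selector)
    if table is None:
        return []
    headers = [th.get_text(strip=True).lower() for th in table.select("thead th")]
    rows = []
    for tr in table.select("tbody tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == len(headers):  # skip spacer or colspan rows
            rows.append(dict(zip(headers, cells)))
    return rows


rows = table_to_rows(SAMPLE)
prices = [float(r["price"].lstrip("$")) for r in rows]
print(rows)
print(prices)
```

A list of dicts like this can be passed straight to pandas.DataFrame, as the recipe does.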
Extract all outbound links from a page
A one-screen helper that prints every external link on a page. Useful for link audits and broken-link checking.
# outbound.py
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup
URL = "https://example.com"
resp = requests.get(URL, timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.content, "lxml")
host = urlparse(URL).netloc
for a in soup.find_all("a", href=True):
    full = urljoin(URL, a["href"])
    if urlparse(full).netloc and urlparse(full).netloc != host:
        print(a.get_text(strip=True), "->", full)
python outbound.py
Output:
IANA -> https://www.iana.org/domains/example
Clean an HTML email body
You receive HTML email and want plain-text body, with scripts, styles, and tracking pixels stripped. BeautifulSoup is perfect for this — walk the tree, kill the dangerous bits, return the text.
# clean_email.py
from bs4 import BeautifulSoup
def html_to_text(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    for tag in soup(["script", "style", "head", "title", "meta"]):
        tag.decompose()
    # Convert <br> and block elements to newlines
    for br in soup.find_all("br"):
        br.replace_with("\n")
    for blk in soup.find_all(["p", "div", "li", "tr"]):
        blk.append("\n")
    text = soup.get_text()
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())
print(html_to_text("""
<html><head><title>x</title><style>.a{}</style></head>
<body>
<p>Hi alice,</p>
<p>Your <b>order #1234</b> has shipped.<br>Tracking: ABC.</p>
<script>track();</script>
</body></html>
"""))
python clean_email.py
Output:
Hi alice,
Your order #1234 has shipped.
Tracking: ABC.
Rewrite every image URL to absolute
You're saving a page offline, so every relative <img src> and <a href> needs to become an absolute URL.
# absolutise.py
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests
BASE = "https://example.com/article"
resp = requests.get(BASE, timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.content, "lxml")
for tag, attr in (("a", "href"), ("img", "src"), ("link", "href"), ("script", "src")):
    for t in soup.find_all(tag, **{attr: True}):
        t[attr] = urljoin(BASE, t[attr])
with open("/tmp/article.html", "w", encoding="utf-8") as f:
    f.write(str(soup))
print("saved /tmp/article.html")
python absolutise.py
Output:
saved /tmp/article.html
Parse an RSS / Atom feed
For XML feeds use the xml parser — it preserves case and namespaces. feedparser is the dedicated library, but BeautifulSoup is fine for one-off parsing.
# feed.py
import requests
from bs4 import BeautifulSoup
resp = requests.get("https://example.com/feed.atom", timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.content, "xml")
for entry in soup.find_all("entry"):
    title = entry.title.get_text(strip=True)
    link = entry.find("link", rel="alternate")["href"]
    print(title, "->", link)
python feed.py
Output:
Article one -> https://example.com/a
Article two -> https://example.com/b
Respect robots.txt before scraping
urllib.robotparser is in the stdlib. Check it once per host and cache the result; refuse to fetch paths the site disallows.
# polite.py
from urllib.robotparser import RobotFileParser
import requests
from bs4 import BeautifulSoup
UA = "jockey-scraper/1.0 (alice@example.com)"
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
target = "https://example.com/items"
if not rp.can_fetch(UA, target):
    raise SystemExit(f"robots.txt forbids fetching {target}")
resp = requests.get(target, headers={"User-Agent": UA}, timeout=10)
resp.raise_for_status()
soup = BeautifulSoup(resp.content, "lxml")
print(soup.title.get_text(strip=True))
python polite.py
Output:
Items — Example
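The "check once per host and cache" advice could be sketched as follows. RobotsCache, fetch_lines, and the demo-bot user agent are hypothetical names introduced here; only RobotFileParser.parse and can_fetch come from the stdlib. The fetcher is injected so the example runs offline with canned rules. In real use it would download the robots.txt file and split it into lines.

```python
from typing import Callable
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Keep one parsed robots.txt per scheme://host (hypothetical helper)."""

    def __init__(self, user_agent: str, fetch_lines: Callable[[str], list[str]]):
        self.user_agent = user_agent
        self._fetch_lines = fetch_lines  # assumed: returns robots.txt as lines
        self._parsers: dict[str, RobotFileParser] = {}

    def allowed(self, url: str) -> bool:
        parts = urlparse(url)
        host = f"{parts.scheme}://{parts.netloc}"
        parser = self._parsers.get(host)
        if parser is None:
            # First request for this host: fetch and parse the rules once
            parser = RobotFileParser()
            parser.parse(self._fetch_lines(f"{host}/robots.txt"))
            self._parsers[host] = parser
        return parser.can_fetch(self.user_agent, url)


# Offline stand-in for a real download, recording how often it is called
calls: list[str] = []


def fake_fetch(robots_url: str) -> list[str]:
    calls.append(robots_url)
    return ["User-agent: *", "Disallow: /private/"]


cache = RobotsCache("demo-bot/0.1", fake_fetch)
public_ok = cache.allowed("https://example.com/items")
private_ok = cache.allowed("https://example.com/private/report")
print(public_ok, private_ok, len(calls))
```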
Scrape a JS-rendered SPA with playwright + soup
Many modern sites render content client-side. The pattern is: drive a headless browser to fully render, snapshot the resulting HTML, then run BeautifulSoup against that snapshot.
# spa.py
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/dashboard", wait_until="networkidle")
    page.wait_for_selector(".widget", timeout=10_000)
    html = page.content()
    browser.close()
soup = BeautifulSoup(html, "lxml")
for w in soup.select(".widget"):
    print(w.select_one(".title").get_text(strip=True))
python spa.py
Output:
Active users
Revenue today
Errors
One-line CLI: extract every link from stdin
A throwaway script that becomes a reusable shell tool with curl ... | python ....
# links.py
import sys
from bs4 import BeautifulSoup
soup = BeautifulSoup(sys.stdin.read(), "lxml")
for a in soup.find_all("a", href=True):
    print(a["href"])
curl -s https://example.com | python links.py
Output:
https://www.iana.org/domains/example