cheat sheet
unstructured
Extract structured text from PDFs, Word docs, HTML, images, and more with the unstructured library. Covers partitioning, chunking, cleaning, metadata, and pipeline integrations.
unstructured — Document Parsing & Ingestion
What it is
unstructured is a Python library for extracting and normalising text from a wide range of document formats: PDFs, Word documents, PowerPoint files, HTML, Markdown, images (via OCR), email, and more. It converts heterogeneous raw files into a list of typed Element objects (Title, NarrativeText, Table, ListItem, etc.) that can be cleaned, chunked, and fed into RAG pipelines. The Unstructured loaders in LangChain and LlamaIndex call it under the hood.
Install
pip install unstructured # core: plain text, HTML, XML, JSON, and .eml email
pip install "unstructured[pdf]" # adds pdfminer.six and the other PDF dependencies
pip install "unstructured[docx,xlsx,pptx]" # MS Office formats
pip install "unstructured[all-docs]" # everything, including OCR (tesseract required)
Output: (none — exits 0 on success)
For image/scanned-PDF OCR (optional):
# macOS
brew install tesseract poppler
# Debian/Ubuntu
apt-get install tesseract-ocr poppler-utils
Output: (none — exits 0 on success)
Quick example
from unstructured.partition.auto import partition
elements = partition("report.pdf")
for el in elements[:5]:
    print(el.category, "|", str(el)[:80])
Output:
Title | Quarterly Financial Report — Q1 2026
NarrativeText | Revenue increased by 12% year-over-year, driven by…
Table | | Revenue | Q1 2025 | Q1 2026 |...
NarrativeText | Operating expenses remained stable at 45% of revenue.
ListItem | Expanded into three new markets: APAC, LATAM, and EMEA.
When / why to use it
- Ingesting a corpus of mixed-format documents (PDFs, Word, HTML, email) into a RAG pipeline without writing per-format parsers.
- Extracting structured data (tables, headings, lists) from PDFs where layout matters.
- Preprocessing documents before embedding: clean, chunk, and normalise in one library.
- Building ETL pipelines that consume documents from S3, SharePoint, Google Drive, or local disk.
- Replacing fragile format-specific parsers (pdfplumber, python-docx, BeautifulSoup) with a single unified API.
Common pitfalls
- OCR is slow: strategy="hi_res" runs a layout-detection model and Tesseract OCR on every page. For large PDF batches, use strategy="fast" (text extraction only, no OCR) unless the PDF is scanned, or strategy="auto" to let the library decide.
- Table text is a flat string by default: str(el) on a Table element returns the cell text with no row or column structure. The HTML version of the table lives in el.metadata.text_as_html; pass it to pandas.read_html() to get a DataFrame.
- partition() requires optional dependencies: partition("file.pdf") raises ImportError if pdfminer.six isn't installed. Install "unstructured[pdf]" or "unstructured[all-docs]" to avoid missing-dependency errors.
- chunk_by_title() is a good default chunking strategy for RAG: it splits on section headings so each chunk maps to a coherent topic, rather than splitting mid-sentence at a fixed character count.
- Every element carries rich metadata: el.metadata.filename, el.metadata.page_number, el.metadata.last_modified, el.metadata.url. These are ideal for source citation in RAG responses.
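The metadata fields above lend themselves to a small citation helper. The following is a minimal sketch: format_citation is a hypothetical helper name, and SimpleNamespace objects stand in for real elements so the snippet runs without any documents. Elements returned by partition() expose the same metadata.filename and metadata.page_number attributes.

```python
from types import SimpleNamespace


def format_citation(el) -> str:
    """Build a short source label from an element's metadata (hypothetical helper)."""
    filename = getattr(el.metadata, "filename", None) or "unknown source"
    page = getattr(el.metadata, "page_number", None)
    # page_number is None for formats without pages (e.g. HTML), so it is optional
    return f"{filename}, p. {page}" if page is not None else filename


# Stand-ins for elements returned by partition(); only the metadata shape matters here
pdf_el = SimpleNamespace(metadata=SimpleNamespace(filename="report.pdf", page_number=2))
html_el = SimpleNamespace(metadata=SimpleNamespace(filename="article.html", page_number=None))

print(format_citation(pdf_el))   # report.pdf, p. 2
print(format_citation(html_el))  # article.html
```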
Supported file types
unstructured dispatches to the correct parser based on file extension or MIME type.
| Format | Function | Extra install |
|---|---|---|
| PDF | partition_pdf | unstructured[pdf] |
| Word (.docx) | partition_docx | unstructured[docx] |
| PowerPoint (.pptx) | partition_pptx | unstructured[pptx] |
| Excel (.xlsx) | partition_xlsx | unstructured[xlsx] |
| HTML | partition_html | none |
| Markdown | partition_md | unstructured[md] |
| Plain text | partition_text | none |
| Email (.eml / .msg) | partition_email / partition_msg | none for .eml; unstructured[msg] for .msg |
| Image (PNG/JPG) | partition_image | unstructured[all-docs] + tesseract |
| EPUB | partition_epub | unstructured[epub] |
| RST / Org Mode | partition_rst / partition_org | unstructured[rst] / unstructured[org] |
| CSV | partition_csv | unstructured[csv] |
from unstructured.partition.auto import partition
# Auto-detect format from extension
elements = partition("slides.pptx")
# Or call the format-specific function directly
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("report.pdf", strategy="hi_res")
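Because a missing optional dependency only surfaces when a file is partitioned, it can help to check up front which pip extra a file will need. The following is a minimal sketch based on the table above; EXTRA_BY_SUFFIX and required_extra are illustrative names, not part of unstructured, and the mapping covers only a handful of formats.

```python
from pathlib import Path
from typing import Optional

# Suffix -> pip extra, copied from the table above (None means the core install suffices)
EXTRA_BY_SUFFIX = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".pptx": "pptx",
    ".xlsx": "xlsx",
    ".epub": "epub",
    ".html": None,
    ".txt": None,
}


def required_extra(path: str) -> Optional[str]:
    """Return the pip requirement needed for this file, or None if core is enough."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRA_BY_SUFFIX:
        raise ValueError(f"no mapping for {suffix!r}")
    extra = EXTRA_BY_SUFFIX[suffix]
    return f"unstructured[{extra}]" if extra else None


print(required_extra("slides.PPTX"))  # unstructured[pptx]
print(required_extra("index.html"))   # None
```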
Partition strategies
A partition strategy controls how aggressively unstructured analyses each page. Choose based on the trade-off between accuracy and speed.
from unstructured.partition.pdf import partition_pdf
# fast — pdfminer text extraction, no OCR, best for text-native PDFs
elements_fast = partition_pdf("report.pdf", strategy="fast")
# hi_res — layout-aware model + Tesseract OCR, best for scanned/complex PDFs
elements_hi = partition_pdf("report.pdf", strategy="hi_res")
# auto — picks fast if text is extractable, hi_res for scanned pages
elements_auto = partition_pdf("report.pdf", strategy="auto")
print(f"fast: {len(elements_fast)} elements")
print(f"hi_res: {len(elements_hi)} elements")
Output:
fast: 42 elements
hi_res: 58 elements # hi_res finds more elements by analysing layout
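Element counts are only half of the trade-off; elapsed time is the other. A small harness like the one below can measure both. It is a sketch under stated assumptions: compare_strategies is a hypothetical helper that accepts any callable with the same (path, strategy=...) shape as partition_pdf, and the demo uses a fake partitioner so it runs without unstructured installed.

```python
import time
from typing import Callable, Dict, Iterable, Tuple


def compare_strategies(
    partition_fn: Callable[..., list],
    path: str,
    strategies: Iterable[str] = ("fast", "hi_res"),
) -> Dict[str, Tuple[int, float]]:
    """Run partition_fn once per strategy; return {strategy: (element_count, seconds)}."""
    results = {}
    for strategy in strategies:
        start = time.perf_counter()
        elements = partition_fn(path, strategy=strategy)
        results[strategy] = (len(elements), time.perf_counter() - start)
    return results


# Fake partitioner for demonstration only; pass partition_pdf here in real use
def fake_partition(path: str, strategy: str = "auto") -> list:
    return ["element"] * (58 if strategy == "hi_res" else 42)


for name, (count, seconds) in compare_strategies(fake_partition, "report.pdf").items():
    print(f"{name}: {count} elements in {seconds:.3f}s")
```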
Element types
Every element has a category and rich metadata. Understanding element types lets you filter and route content correctly.
from unstructured.partition.auto import partition
from unstructured.documents.elements import Title, NarrativeText, Table, ListItem, Image
elements = partition("report.pdf", strategy="auto")
titles = [e for e in elements if isinstance(e, Title)]
tables = [e for e in elements if isinstance(e, Table)]
body_text = [e for e in elements if isinstance(e, NarrativeText)]
list_items = [e for e in elements if isinstance(e, ListItem)]
print(f"Titles: {len(titles)}, Tables: {len(tables)}, Body: {len(body_text)}")
# Access metadata
for el in elements[:3]:
    print(f"[{el.category}] page={el.metadata.page_number} | {str(el)[:60]}")
Output:
Titles: 12, Tables: 5, Body: 38
[Title] page=1 | Quarterly Financial Report — Q1 2026
[NarrativeText] page=1 | Revenue increased by 12% year-over-year, driv
[Table] page=2 | Revenue Q1 2025 Q1 2026 12.4M 13.9M
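Counting elements by category, or grouping them by page, gives a quick profile of a document before deciding what to filter. A minimal sketch follows; summarize and fake_element are hypothetical helpers, and the SimpleNamespace objects mimic only the category and metadata.page_number attributes used above.

```python
from collections import Counter, defaultdict
from types import SimpleNamespace


def summarize(elements):
    """Return (count per category, categories grouped by page number)."""
    by_category = Counter(el.category for el in elements)
    by_page = defaultdict(list)
    for el in elements:
        by_page[el.metadata.page_number].append(el.category)
    return by_category, dict(by_page)


def fake_element(category, page):
    # Stand-in for an unstructured element; real ones come from partition()
    return SimpleNamespace(category=category, metadata=SimpleNamespace(page_number=page))


elements = [
    fake_element("Title", 1),
    fake_element("NarrativeText", 1),
    fake_element("Table", 2),
    fake_element("NarrativeText", 2),
]
by_category, by_page = summarize(elements)
print(by_category.most_common())
print(by_page)
```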
Cleaning elements
unstructured provides a cleaners module (unstructured.cleaners.core) to normalise extracted text: remove extra whitespace, strip bullets, normalise dashes, rejoin broken paragraphs, etc.
from unstructured.cleaners.core import (
    clean,
    clean_extra_whitespace,
    clean_bullets,
    clean_dashes,
    group_broken_paragraphs,
    bytes_string_to_string,
)
raw = " • Revenue increased by 12% \n\n year-over-year. "
cleaned = clean(
    raw,
    extra_whitespace=True,
    bullets=True,
    dashes=False,
    trailing_punctuation=False,
    lowercase=False,
)
print(repr(cleaned))
Output:
'Revenue increased by 12% year-over-year.'
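# Group broken paragraphs (common in PDFs with hard line-breaks)
# Note: a paragraph whose lines all have fewer than five words is kept line by line, not merged
broken = "Operating expenses remained stable at\n45% of revenue.\n\nExpanded into three new markets:\nAPAC, LATAM, and EMEA."
grouped = group_broken_paragraphs(broken)
print(grouped)
Output:
Operating expenses remained stable at 45% of revenue.

Expanded into three new markets: APAC, LATAM, and EMEA.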
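Cleaners are plain functions from string to string, so several can be chained into one callable. The sketch below shows the idea; compose is a hypothetical helper, and the two stand-in cleaners use only the standard library so the snippet runs on its own. They are not the library's implementations. In real use, functions such as clean_extra_whitespace from unstructured.cleaners.core could be passed instead.

```python
import re
from functools import reduce
from typing import Callable

Cleaner = Callable[[str], str]


def compose(*cleaners: Cleaner) -> Cleaner:
    """Chain cleaners left to right into a single str -> str function."""
    return lambda text: reduce(lambda acc, fn: fn(acc), cleaners, text)


# Standard-library stand-ins, not the library's implementations
def collapse_whitespace(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()


def strip_leading_bullet(text: str) -> str:
    return re.sub(r"^[\u2022\-\*]\s*", "", text)


pipeline = compose(collapse_whitespace, strip_leading_bullet)
print(repr(pipeline("  \u2022 Revenue increased by 12%  \n\n year-over-year. ")))
```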
Chunking for RAG
Chunking splits a document's elements into manageable pieces before embedding. unstructured provides several strategies:
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title
from unstructured.chunking.basic import chunk_elements
elements = partition("report.pdf")
# chunk_by_title — keeps sections together, splits on heading boundaries
chunks = chunk_by_title(
    elements,
    max_characters=1500,  # max chars per chunk
    new_after_n_chars=1200,  # prefer to split earlier than max_characters
    combine_text_under_n_chars=500,  # merge tiny elements into previous chunk
)
print(f"Chunks: {len(chunks)}")
for chunk in chunks[:2]:
    print(f"[{chunk.category}] chars={len(str(chunk))} | {str(chunk)[:80]}")
Output:
Chunks: 9
[CompositeElement] chars=1243 | Quarterly Financial Report — Q1 2026 Revenue increased by 12%…
[CompositeElement] chars=987 | Operating expenses remained stable at 45% of revenue. Expanded…
# chunk_elements — simple character-count splitting, ignores headings
basic_chunks = chunk_elements(elements, max_characters=800, new_after_n_chars=600)
print(f"Basic chunks: {len(basic_chunks)}")
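To make the interaction between heading boundaries and max_characters concrete, here is a deliberately simplified, standard-library illustration of heading-based chunking. It is not the library's algorithm: toy_chunk_by_title is a hypothetical function operating on (category, text) tuples, it ignores new_after_n_chars and combine_text_under_n_chars, and it does not split a single oversized element.

```python
from typing import List, Tuple


def toy_chunk_by_title(items: List[Tuple[str, str]], max_characters: int = 100) -> List[str]:
    """Start a new chunk at each Title, or when adding text would exceed max_characters."""
    chunks: List[str] = []
    current = ""
    for category, text in items:
        candidate = f"{current} {text}".strip()
        starts_section = category == "Title" and bool(current)
        if starts_section or (current and len(candidate) > max_characters):
            chunks.append(current)  # close the running chunk
            current = text          # and begin a new one with this element
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


items = [
    ("Title", "Results"),
    ("NarrativeText", "Revenue increased by 12%."),
    ("Title", "Costs"),
    ("NarrativeText", "Operating expenses remained stable."),
]
for chunk in toy_chunk_by_title(items):
    print(repr(chunk))
```

Each Title closes the running chunk, so the two sections above end up in separate chunks even though both would fit within max_characters together.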
Table extraction
Tables are extracted as Table elements. Access the raw HTML representation for structured parsing.
from unstructured.partition.pdf import partition_pdf
from unstructured.documents.elements import Table
import io
import pandas as pd
elements = partition_pdf("report.pdf", strategy="hi_res")
tables = [e for e in elements if isinstance(e, Table)]
for i, table in enumerate(tables):
    print(f"Table {i+1} — page {table.metadata.page_number}")
    print("Text:", str(table)[:120])
    # Parse HTML table into a DataFrame
    html = table.metadata.text_as_html
    if html:
        dfs = pd.read_html(io.StringIO(html))
        print(dfs[0].head())
    print()
Output:
Table 1 — page 2
Text: Revenue Q1 2025 Q1 2026 12.4M 13.9M
Metric Q1 2025 Q1 2026
0 Revenue 12.4M 13.9M
1 OpEx 5.6M 6.2M
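When pandas is not available, the HTML string in text_as_html can be parsed with the standard library instead. The following is a minimal sketch, assuming a simple table without nested tables or merged cells; TableRows and html_table_to_rows are hypothetical names, and the sample HTML mirrors the output above rather than coming from a real document.

```python
from html.parser import HTMLParser
from typing import List


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def html_table_to_rows(html: str) -> List[List[str]]:
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = (
    "<table><tr><th>Metric</th><th>Q1 2025</th><th>Q1 2026</th></tr>"
    "<tr><td>Revenue</td><td>12.4M</td><td>13.9M</td></tr></table>"
)
print(html_table_to_rows(sample))
```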
Loading from URLs and S3
# Load from a public URL
from unstructured.partition.html import partition_html
elements = partition_html(url="https://example.com/article")
print(len(elements), "elements extracted")
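# Load from S3: partition(url=...) fetches over HTTP(S) only, so open the object with s3fs and pass the file handle
import s3fs
from unstructured.partition.auto import partition
fs = s3fs.S3FileSystem()  # uses the default AWS credential chain
with fs.open("my-bucket/documents/report.pdf", "rb") as f:
    elements = partition(file=f, strategy="fast")
print(f"Loaded {len(elements)} elements from S3")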
LangChain integration
LangChain's UnstructuredFileLoader and UnstructuredPDFLoader call unstructured under the hood and return Document objects.
from langchain_community.document_loaders import UnstructuredFileLoader, UnstructuredPDFLoader
# Generic loader — auto-detects format
loader = UnstructuredFileLoader("report.pdf", mode="elements") # one Document per element
docs = loader.load()
print(f"Loaded {len(docs)} documents")
print(docs[0].page_content[:120])
print(docs[0].metadata)
Output:
Loaded 42 documents
Quarterly Financial Report — Q1 2026
{'source': 'report.pdf', 'filename': 'report.pdf', 'page_number': 1, 'category': 'Title'}
# Paged mode: combine elements into one Document per page
loader = UnstructuredFileLoader(
    "report.pdf",
    mode="paged",  # one Document per page
    strategy="fast",
)
paged_docs = loader.load()
LlamaIndex integration
LlamaIndex's UnstructuredReader wraps the unstructured library as a BaseReader.
from llama_index.readers.file import UnstructuredReader
from pathlib import Path
reader = UnstructuredReader()
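# split_documents=True returns one Document per element instead of a single merged Document
docs = reader.load_data(file=Path("report.pdf"), split_documents=True)
print(f"Loaded {len(docs)} nodes")
for doc in docs[:2]:
    print(doc.text[:80])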
Output:
Loaded 42 nodes
Quarterly Financial Report — Q1 2026
Revenue increased by 12% year-over-year, driven by…
Processing a directory of mixed files
from pathlib import Path
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title
docs_dir = Path("./docs")
all_chunks = []
for path in docs_dir.rglob("*"):
    if path.suffix.lower() not in {".pdf", ".docx", ".pptx", ".html", ".md", ".txt"}:
        continue
    try:
        elements = partition(str(path), strategy="auto")
        chunks = chunk_by_title(elements, max_characters=1500)
        for chunk in chunks:
            all_chunks.append({
                "text": str(chunk),
                "source": str(path),
                "page": chunk.metadata.page_number,
                "category": chunk.category,
            })
    except Exception as exc:
        print(f"Skipped {path}: {exc}")
print(f"Total chunks: {len(all_chunks)}")
Output:
Total chunks: 247
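The all_chunks list holds plain dictionaries, so it can be persisted as JSON Lines for a later embedding step. A minimal sketch using only the standard library; write_jsonl and the chunks.jsonl file name are illustrative choices, and the sample records only mimic the shape built in the loop above.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable


def write_jsonl(records: Iterable[Dict], path: Path) -> int:
    """Write one JSON object per line; return the number of records written."""
    count = 0
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Sample records with the same keys as all_chunks
sample_chunks = [
    {"text": "Quarterly report", "source": "docs/report.pdf", "page": 1, "category": "CompositeElement"},
    {"text": "Release notes", "source": "docs/notes.md", "page": None, "category": "CompositeElement"},
]
out_path = Path(tempfile.mkdtemp()) / "chunks.jsonl"
written = write_jsonl(sample_chunks, out_path)
print(f"Wrote {written} records to {out_path.name}")
```

One record per line keeps the file appendable and lets a downstream job stream it without loading everything into memory.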
Quick reference
| Task | Code |
|---|---|
| Partition any file | partition("file.pdf") |
| PDF (fast) | partition_pdf("f.pdf", strategy="fast") |
| PDF (hi-res OCR) | partition_pdf("f.pdf", strategy="hi_res") |
| HTML from URL | partition_html(url="https://...") |
| Filter by type | [e for e in els if isinstance(e, Title)] |
| Get page number | el.metadata.page_number |
| Get HTML table | el.metadata.text_as_html |
| Clean text | clean(text, extra_whitespace=True, bullets=True) |
| Chunk by heading | chunk_by_title(elements, max_characters=1500) |
| Basic chunk | chunk_elements(elements, max_characters=800) |
| LangChain loader | UnstructuredFileLoader("f.pdf", mode="elements") |
| LlamaIndex reader | UnstructuredReader().load_data(file=Path("f.pdf")) |
| All element types | Title, NarrativeText, Table, ListItem, Image, Header, Footer |