cheat sheet

unstructured

Extract structured text from PDFs, Word docs, HTML, images, and more with the unstructured library. Covers partitioning, chunking, cleaning, metadata, and pipeline integrations.

unstructured — Document Parsing & Ingestion

What it is

unstructured is a Python library for extracting and normalising text from a wide range of document formats: PDFs, Word documents, PowerPoint files, HTML, Markdown, images (via OCR), email, and more. It converts heterogeneous raw files into a list of typed Element objects (Title, NarrativeText, Table, ListItem, etc.) that can be cleaned, chunked, and fed into RAG pipelines. LangChain and LlamaIndex both ship document loaders that call unstructured under the hood.

Install

bash
pip install unstructured                     # core, handles HTML/text/email (Markdown needs "unstructured[md]")
pip install "unstructured[pdf]"              # adds pdfminer.six and the other PDF dependencies
pip install "unstructured[docx,xlsx,pptx]"  # MS Office formats
pip install "unstructured[all-docs]"         # everything, including OCR (tesseract required)

Output: (none — exits 0 on success)

For image/scanned-PDF OCR (optional):

bash
# macOS
brew install tesseract poppler

# Debian/Ubuntu
apt-get install tesseract-ocr poppler-utils

Output: (none — exits 0 on success)

Quick example

python
from unstructured.partition.auto import partition

elements = partition("report.pdf")

for el in elements[:5]:
    print(el.category, "|", str(el)[:80])

Output:

text
Title | Quarterly Financial Report — Q1 2026
NarrativeText | Revenue increased by 12% year-over-year, driven by…
Table | | Revenue | Q1 2025 | Q1 2026 |...
NarrativeText | Operating expenses remained stable at 45% of revenue.
ListItem | Expanded into three new markets: APAC, LATAM, and EMEA.

When / why to use it

  • Ingesting a corpus of mixed-format documents (PDFs, Word, HTML, email) into a RAG pipeline without writing per-format parsers.
  • Extracting structured data (tables, headings, lists) from PDFs where layout matters.
  • Preprocessing documents before embedding: clean, chunk, and normalise in one library.
  • Building ETL pipelines that consume documents from S3, SharePoint, Google Drive, or local disk.
  • Replacing fragile format-specific parsers (pdfplumber, python-docx, BeautifulSoup) with a single unified API.

Common pitfalls

OCR is slow: strategy="hi_res" runs a layout-detection model and Tesseract OCR on every page. For large PDF batches, use strategy="fast" (embedded text only, no OCR) unless the PDFs are scanned, or strategy="auto" to let the library decide per document.

Tables come back as flat text by default: str(el) on a Table element is the flattened cell text. For structured data, read el.metadata.text_as_html (populated when table structure inference is enabled) and pass it to pandas.read_html().

partition() needs optional dependencies: partition("file.pdf") raises ImportError if the PDF dependencies (such as pdfminer.six) aren't installed. Install "unstructured[pdf]" or "unstructured[all-docs]" up front to avoid missing-dependency errors.
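
One way to surface a missing dependency before a long batch starts is to probe for the modules first. The sketch below uses only the standard library; the `missing_modules` helper and the module list are illustrative assumptions, not part of unstructured.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level modules from module_names that cannot be found."""
    return [name for name in module_names if find_spec(name) is None]


# "pdfminer" is the import name of the pdfminer.six package.
# Adjust the list to the formats you actually ingest.
missing = missing_modules(["unstructured", "pdfminer"])
if missing:
    print(f"Missing modules: {missing}; try: pip install 'unstructured[pdf]'")
```

Running the check once at start-up gives a clear message instead of an ImportError halfway through a directory.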

chunk_by_title() is usually a good default chunking strategy for RAG: it starts a new chunk at each section heading, so chunks map to coherent topics instead of being cut purely by character count.

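Every element carries metadata such as el.metadata.filename, el.metadata.page_number, el.metadata.last_modified, and el.metadata.url. A field is None when the source does not provide it (url, for example, is only set when partitioning from a URL). These fields are well suited to source citation in RAG responses.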
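
As a sketch of the citation idea, the helper below builds a short source label from a plain metadata dict of the kind `el.metadata.to_dict()` returns. `format_citation` is a hypothetical name, and the fallback order (url, then filename) is an assumption you may want to change.

```python
def format_citation(meta: dict) -> str:
    """Build a short source label from whichever metadata fields are present."""
    source = meta.get("url") or meta.get("filename") or "unknown source"
    page = meta.get("page_number")
    return f"{source}, p. {page}" if page is not None else source


# In real use: meta = el.metadata.to_dict()
print(format_citation({"filename": "report.pdf", "page_number": 2}))
print(format_citation({"url": "https://example.com/article"}))
```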

Supported file types

unstructured dispatches to the correct parser based on file extension or MIME type.

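| Format | Function | Extra install |
| --- | --- | --- |
| PDF | partition_pdf | unstructured[pdf] |
| Word (.docx) | partition_docx | unstructured[docx] |
| PowerPoint (.pptx) | partition_pptx | unstructured[pptx] |
| Excel (.xlsx) | partition_xlsx | unstructured[xlsx] |
| HTML | partition_html | none |
| Markdown | partition_md | unstructured[md] |
| Plain text | partition_text | none |
| Email (.eml) | partition_email | none |
| Email (.msg) | partition_msg | unstructured[msg] |
| Image (PNG/JPG) | partition_image | unstructured[image] + tesseract |
| EPUB | partition_epub | unstructured[epub] |
| RST | partition_rst | unstructured[rst] |
| Org Mode | partition_org | unstructured[org] |
| CSV | partition_csv | unstructured[csv] |
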
python
from unstructured.partition.auto import partition

# Auto-detect format from extension
elements = partition("slides.pptx")

# Or call the format-specific function directly
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("report.pdf", strategy="hi_res")

Partition strategies

A partition strategy controls how aggressively unstructured analyses each page. Choose based on the trade-off between accuracy and speed.

python
from unstructured.partition.pdf import partition_pdf

# fast — pdfminer text extraction, no OCR, best for text-native PDFs
elements_fast = partition_pdf("report.pdf", strategy="fast")

# hi_res — layout-aware model + Tesseract OCR, best for scanned/complex PDFs
elements_hi = partition_pdf("report.pdf", strategy="hi_res")

# auto — picks fast if text is extractable, hi_res for scanned pages
elements_auto = partition_pdf("report.pdf", strategy="auto")

print(f"fast: {len(elements_fast)} elements")
print(f"hi_res: {len(elements_hi)} elements")

Output:

text
fast: 42 elements
hi_res: 58 elements   # hi_res finds more elements by analysing layout
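
When you want explicit control over the speed/accuracy trade-off, one pattern is to try fast first and rerun with hi_res only if little text came back. This is a minimal sketch under stated assumptions: `partition_with_fallback`, the `min_chars` threshold, and the stand-in `fake_partition` are all hypothetical. In real use you would pass `partition_pdf` as `partition_fn`.

```python
def partition_with_fallback(path, partition_fn, min_chars=200):
    """Try the fast strategy; rerun with hi_res if little text was extracted."""
    elements = partition_fn(path, strategy="fast")
    extracted = sum(len(str(el)) for el in elements)
    if extracted >= min_chars:
        return elements, "fast"
    return partition_fn(path, strategy="hi_res"), "hi_res"


# Stand-in partitioner so the sketch runs without a PDF or the library.
def fake_partition(path, strategy):
    return ["x" * 500] if strategy == "hi_res" else ["short"]


elements, used = partition_with_fallback("scan.pdf", fake_partition)
print(used)
```

The threshold is a crude proxy for "this PDF has no usable text layer"; tune it against a sample of your own documents.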

Element types

Every element has a category and rich metadata. Understanding element types lets you filter and route content correctly.

python
from unstructured.partition.auto import partition
from unstructured.documents.elements import Title, NarrativeText, Table, ListItem, Image

elements = partition("report.pdf", strategy="auto")

titles     = [e for e in elements if isinstance(e, Title)]
tables     = [e for e in elements if isinstance(e, Table)]
body_text  = [e for e in elements if isinstance(e, NarrativeText)]
list_items = [e for e in elements if isinstance(e, ListItem)]

print(f"Titles: {len(titles)}, Tables: {len(tables)}, Body: {len(body_text)}")

# Access metadata
for el in elements[:3]:
    print(f"[{el.category}] page={el.metadata.page_number} | {str(el)[:60]}")

Output:

text
Titles: 12, Tables: 5, Body: 38
[Title] page=1 | Quarterly Financial Report — Q1 2026
[NarrativeText] page=1 | Revenue increased by 12% year-over-year, driv
[Table] page=2 | Revenue Q1 2025 Q1 2026 12.4M 13.9M
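
Beyond filtering, element categories can be used to rebuild document structure. The sketch below groups body elements under the most recent Title. It relies only on `el.category` and `str(el)`, and uses a stand-in `FakeElement` class so it runs without the library; `group_by_section` is a hypothetical helper.

```python
def group_by_section(elements):
    """Return (heading, [body texts]) pairs in document order."""
    sections = []
    for el in elements:
        if el.category == "Title":
            sections.append((str(el), []))
        else:
            if not sections:
                # Body text that appears before any heading
                sections.append(("(no heading)", []))
            sections[-1][1].append(str(el))
    return sections


class FakeElement:
    """Minimal stand-in exposing the two attributes the helper uses."""

    def __init__(self, category, text):
        self.category, self.text = category, text

    def __str__(self):
        return self.text


sample = [
    FakeElement("Title", "Intro"),
    FakeElement("NarrativeText", "a"),
    FakeElement("ListItem", "b"),
    FakeElement("Title", "Results"),
    FakeElement("NarrativeText", "c"),
]
sections = group_by_section(sample)
print(sections)
```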

Cleaning elements

unstructured provides a cleaners module (unstructured.cleaners.core) to normalise extracted text: remove extra whitespace, strip bullets, normalise dashes, rejoin broken paragraphs, etc.

python
from unstructured.cleaners.core import (
    clean,
    clean_extra_whitespace,
    clean_bullets,
    clean_dashes,
    group_broken_paragraphs,
    bytes_string_to_string,
)

raw = "  •  Revenue  increased   by  12%  \n\n year-over-year.  "

cleaned = clean(
    raw,
    extra_whitespace=True,
    bullets=True,
    dashes=False,
    trailing_punctuation=False,
    lowercase=False,
)
print(repr(cleaned))

Output:

text
'Revenue increased by 12% year-over-year.'
python
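# Group broken paragraphs (common in PDFs with hard line-breaks).
# Paragraphs whose lines are all under five words are treated as intentional
# short lines (e.g. an address block) and are not joined.
broken = "Revenue increased by 12% year-over-year,\ndriven by strong demand.\n\nOperating expenses remained stable at\n45% of revenue."
grouped = group_broken_paragraphs(broken)
print(grouped)

Output:

text
Revenue increased by 12% year-over-year, driven by strong demand.

Operating expenses remained stable at 45% of revenue.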
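
Cleaners are plain functions from string to string, so several can be chained. The sketch below shows that composition pattern with two hypothetical stdlib helpers (`collapse_whitespace`, `strip_leading_bullet`) standing in for the library's cleaners; they only approximate what the real ones do.

```python
import re
from functools import reduce


def collapse_whitespace(text: str) -> str:
    """Replace any run of whitespace with one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def strip_leading_bullet(text: str) -> str:
    """Remove one leading bullet character (and the spaces around it)."""
    return re.sub(r"^\s*[\u2022\u25E6\u2023*-]\s+", "", text)


def apply_cleaners(text: str, *cleaners) -> str:
    """Apply each cleaner in order, feeding the result into the next one."""
    return reduce(lambda acc, fn: fn(acc), cleaners, text)


result = apply_cleaners(
    "  •  Revenue   increased \n by 12%  ",
    strip_leading_bullet,
    collapse_whitespace,
)
print(repr(result))
```

Any cleaner with a str-to-str signature, including those imported from unstructured.cleaners.core, can be passed to `apply_cleaners` in the same way.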

Chunking for RAG

Chunking splits a document's elements into manageable pieces before embedding. unstructured provides several strategies:

python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title
from unstructured.chunking.basic import chunk_elements

elements = partition("report.pdf")

# chunk_by_title — keeps sections together, splits on heading boundaries
chunks = chunk_by_title(
    elements,
    max_characters=1500,    # max chars per chunk
    new_after_n_chars=1200, # prefer to split earlier than max_characters
    combine_text_under_n_chars=500,  # merge tiny elements into previous chunk
)

print(f"Chunks: {len(chunks)}")
for chunk in chunks[:2]:
    print(f"[{chunk.category}] chars={len(str(chunk))} | {str(chunk)[:80]}")

Output:

text
Chunks: 9
[CompositeElement] chars=1243 | Quarterly Financial Report — Q1 2026 Revenue increased by 12%…
[CompositeElement] chars=987  | Operating expenses remained stable at 45% of revenue. Expanded…
python
# chunk_elements — simple character-count splitting, ignores headings
basic_chunks = chunk_elements(elements, max_characters=800, new_after_n_chars=600)
print(f"Basic chunks: {len(basic_chunks)}")
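
To make the two size parameters concrete, here is a toy chunker over plain strings. It is an illustration of a hard limit (`max_characters`) and a soft limit (`new_after_n_chars`), not the library's algorithm; unlike the real chunkers, it never splits a single oversized text and knows nothing about headings.

```python
def toy_chunk(texts, max_characters=50, new_after_n_chars=30):
    """Greedily pack texts into chunks, respecting a soft and a hard limit."""
    chunks, current = [], ""
    for text in texts:
        candidate = f"{current} {text}".strip()
        too_long = len(candidate) > max_characters    # hard limit would be exceeded
        full_enough = len(current) >= new_after_n_chars  # soft limit already reached
        if current and (too_long or full_enough):
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = toy_chunk(["a" * 20, "b" * 15, "c" * 10, "d" * 40])
print([len(c) for c in chunks])
```

The third text would still fit under the hard limit, but the first chunk has already passed the soft limit, so a new chunk starts there. That is the effect of setting the soft limit below the hard one.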

Table extraction

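Tables are extracted as Table elements. str(table) gives the flattened cell text; the HTML representation in table.metadata.text_as_html preserves rows and columns. For PDFs, text_as_html is only populated when table structure inference is enabled.

python
from unstructured.partition.pdf import partition_pdf
from unstructured.documents.elements import Table
import io
import pandas as pd  # read_html also needs lxml (or bs4 + html5lib)

# infer_table_structure=True asks hi_res to populate metadata.text_as_html
elements = partition_pdf("report.pdf", strategy="hi_res", infer_table_structure=True)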
tables = [e for e in elements if isinstance(e, Table)]

for i, table in enumerate(tables):
    print(f"Table {i+1} — page {table.metadata.page_number}")
    print("Text:", str(table)[:120])
    # Parse HTML table into a DataFrame
    html = table.metadata.text_as_html
    if html:
        dfs = pd.read_html(io.StringIO(html))
        print(dfs[0].head())
    print()

Output:

text
Table 1 — page 2
Text: Revenue Q1 2025 Q1 2026 12.4M 13.9M
     Metric  Q1 2025  Q1 2026
0   Revenue   12.4M    13.9M
1   OpEx       5.6M     6.2M
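
If pandas is not available, the HTML string can be turned into rows with the standard library. This is a minimal sketch: `TableRows` and `html_table_to_rows` are hypothetical names, the sample HTML is made up, and nested tables and colspan/rowspan are not handled.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html: str):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


sample_html = (
    "<table><tr><th>Metric</th><th>Q1 2025</th></tr>"
    "<tr><td>Revenue</td><td>12.4M</td></tr></table>"
)
rows = html_table_to_rows(sample_html)
print(rows)
```

In real use, pass `table.metadata.text_as_html` instead of `sample_html`, after checking that it is not None.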

Loading from URLs and S3

python
# Load from a public URL
from unstructured.partition.html import partition_html

elements = partition_html(url="https://example.com/article")
print(len(elements), "elements extracted")

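# Load from S3: partition(url=...) only fetches http(s) URLs, so open the
# object with s3fs (pip install s3fs) and pass the file handle instead
import s3fs
from unstructured.partition.auto import partition

fs = s3fs.S3FileSystem()  # uses your default AWS credentials
with fs.open("s3://my-bucket/documents/report.pdf", "rb") as f:
    elements = partition(file=f, metadata_filename="report.pdf", strategy="fast")
print(f"Loaded {len(elements)} elements from S3")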

LangChain integration

LangChain's UnstructuredFileLoader and UnstructuredPDFLoader call unstructured under the hood and return Document objects.

python
from langchain_community.document_loaders import UnstructuredFileLoader, UnstructuredPDFLoader

# Generic loader — auto-detects format
loader = UnstructuredFileLoader("report.pdf", mode="elements")  # one Document per element
docs = loader.load()

print(f"Loaded {len(docs)} documents")
print(docs[0].page_content[:120])
print(docs[0].metadata)

Output:

text
Loaded 42 documents
Quarterly Financial Report — Q1 2026
{'source': 'report.pdf', 'filename': 'report.pdf', 'page_number': 1, 'category': 'Title'}
python
# Paged mode: combine elements into one Document per page
loader = UnstructuredFileLoader(
    "report.pdf",
    mode="paged",   # one Document per page
    strategy="fast",
)
paged_docs = loader.load()

LlamaIndex integration

LlamaIndex's UnstructuredReader wraps the unstructured library as a BaseReader.

python
from llama_index.readers.file import UnstructuredReader
from pathlib import Path

reader = UnstructuredReader()
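# split_documents=True returns one Document per element; the default
# (False) joins the whole file into a single Document
docs = reader.load_data(file=Path("report.pdf"), split_documents=True)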

print(f"Loaded {len(docs)} nodes")
for doc in docs[:2]:
    print(doc.text[:80])

Output:

text
Loaded 42 nodes
Quarterly Financial Report — Q1 2026
Revenue increased by 12% year-over-year, driven by…

Processing a directory of mixed files

python
from pathlib import Path
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

docs_dir = Path("./docs")
all_chunks = []

for path in docs_dir.rglob("*"):
    if path.suffix.lower() not in {".pdf", ".docx", ".pptx", ".html", ".md", ".txt"}:
        continue
    try:
        elements = partition(str(path), strategy="auto")
        chunks = chunk_by_title(elements, max_characters=1500)
        for chunk in chunks:
            all_chunks.append({
                "text":     str(chunk),
                "source":   str(path),
                "page":     chunk.metadata.page_number,
                "category": chunk.category,
            })
    except Exception as exc:
        print(f"Skipped {path}: {exc}")

print(f"Total chunks: {len(all_chunks)}")

Output:

text
Total chunks: 247
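
A common next step is to persist the chunk dicts so embedding can run as a separate job. The sketch below writes and reads JSON Lines with the standard library; `write_jsonl`, `read_jsonl`, the file name, and the sample record are illustrative assumptions shaped like the dicts built in the loop above.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, out_path):
    """Write one JSON object per line; returns the output path."""
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


sample_chunks = [
    {"text": "Revenue increased by 12%.", "source": "docs/report.pdf",
     "page": 1, "category": "CompositeElement"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = write_jsonl(sample_chunks, Path(tmp) / "chunks.jsonl")
    restored = read_jsonl(path)
print(restored == sample_chunks)
```

JSON Lines keeps each chunk independent, so a failed embedding run can resume from a given line rather than reparsing the documents.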

Quick reference

| Task | Code |
| --- | --- |
| Partition any file | partition("file.pdf") |
| PDF (fast) | partition_pdf("f.pdf", strategy="fast") |
| PDF (hi-res OCR) | partition_pdf("f.pdf", strategy="hi_res") |
| HTML from URL | partition_html(url="https://...") |
| Filter by type | [e for e in els if isinstance(e, Title)] |
| Get page number | el.metadata.page_number |
| Get HTML table | el.metadata.text_as_html |
| Clean text | clean(text, extra_whitespace=True, bullets=True) |
| Chunk by heading | chunk_by_title(elements, max_characters=1500) |
| Basic chunk | chunk_elements(elements, max_characters=800) |
| LangChain loader | UnstructuredFileLoader("f.pdf", mode="elements") |
| LlamaIndex reader | UnstructuredReader().load_data(file=Path("f.pdf")) |
| Common element types | Title, NarrativeText, Table, ListItem, Image, Header, Footer |