cheat sheet

unstructured

Extract structured text from PDFs, Word docs, HTML, images, and more with the unstructured library. Covers partitioning, chunking, cleaning, metadata, and pipeline integrations.

updated 04-27-2026

unstructured — Document Parsing & Ingestion

What it is

unstructured is a Python library for extracting and normalising text from a wide range of document formats: PDFs, Word documents, PowerPoint files, HTML, Markdown, images (via OCR), email, and more. It converts heterogeneous raw files into a list of typed Element objects (Title, NarrativeText, Table, ListItem, etc.) that can be cleaned, chunked, and fed into RAG pipelines. Both LangChain and LlamaIndex ship loaders that wrap it.

Install

bash

pip install unstructured                     # core: plain text, HTML, XML, JSON, email
pip install "unstructured[pdf]"              # adds PDF dependencies (pdfminer.six and others)
pip install "unstructured[docx,xlsx,pptx]"  # MS Office formats
pip install "unstructured[all-docs]"         # every document type; OCR also needs the tesseract binary

Output: pip's usual download and install log; each command exits 0 on success.

For image/scanned-PDF OCR (optional):

bash

# macOS
brew install tesseract poppler

# Debian/Ubuntu
apt-get install tesseract-ocr poppler-utils

Output: the package manager's install log; exits 0 on success.

Quick example

python

from unstructured.partition.auto import partition

elements = partition("report.pdf")

for el in elements[:5]:
    print(el.category, "|", str(el)[:80])

Output:

text

Title | Quarterly Financial Report — Q1 2026
NarrativeText | Revenue increased by 12% year-over-year, driven by…
Table | | Revenue | Q1 2025 | Q1 2026 |...
NarrativeText | Operating expenses remained stable at 45% of revenue.
ListItem | Expanded into three new markets: APAC, LATAM, and EMEA.

When / why to use it

Ingesting a corpus of mixed-format documents (PDFs, Word, HTML, email) into a RAG pipeline without writing per-format parsers.
Extracting structured data (tables, headings, lists) from PDFs where layout matters.
Preprocessing documents before embedding: clean, chunk, and normalise in one library.
Building ETL pipelines that consume documents from S3, SharePoint, Google Drive, or local disk.
Replacing fragile format-specific parsers (pdfplumber, python-docx, BeautifulSoup) with a single unified API.

Common pitfalls

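OCR is slow: strategy="hi_res" runs a layout-detection model plus OCR and is far slower than strategy="fast", which reads the embedded text layer and does no OCR. For large batches of text-native PDFs use "fast", and reserve "hi_res" for scanned or layout-heavy files. el.metadata.page_number tells you which page an element came from, not whether OCR ran on that page.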

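Tables need an extra step for structured data: str(el) on a Table element is the flattened cell text. The HTML version lives in el.metadata.text_as_html; for PDFs it is populated by the hi_res strategy (depending on the version, only when infer_table_structure=True is passed). Feed that HTML to pandas.read_html() to get a DataFrame.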

partition() needs optional dependencies: partition("file.pdf") fails with an import error if the PDF extras are missing. Install "unstructured[pdf]" or "unstructured[all-docs]" before partitioning PDFs.

chunk_by_title() is a sensible default chunking strategy for RAG: it starts a new chunk at each section heading so every chunk maps to a coherent topic, rather than cutting at a fixed character count regardless of structure.

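Elements carry metadata such as el.metadata.filename, el.metadata.page_number, el.metadata.last_modified, and el.metadata.url. Which fields are populated depends on the file type and how the file was loaded (page_number is None for formats without pages). These fields are well suited to source citations in RAG responses.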
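The metadata fields listed above can be turned into a source citation. A minimal sketch, assuming elements expose metadata.filename and metadata.page_number as described; format_citation is a hypothetical helper, and the SimpleNamespace objects are stand-ins for real elements so the snippet runs without any documents on disk.

```python
from types import SimpleNamespace


def format_citation(element):
    """Build a short 'file, p. N' citation from an element's metadata.

    Falls back to the filename alone when there is no page number
    (for example, HTML documents have no pages).
    """
    meta = element.metadata
    filename = getattr(meta, "filename", None) or "unknown source"
    page = getattr(meta, "page_number", None)
    return f"{filename}, p. {page}" if page is not None else filename


# Stand-in elements (hypothetical data); in practice pass partition() output.
pdf_el = SimpleNamespace(
    metadata=SimpleNamespace(filename="report.pdf", page_number=3)
)
html_el = SimpleNamespace(
    metadata=SimpleNamespace(filename="article.html", page_number=None)
)

print(format_citation(pdf_el))   # report.pdf, p. 3
print(format_citation(html_el))  # article.html
```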

Supported file types

unstructured detects the file type (from the file contents via libmagic when it is available, otherwise from the extension) and dispatches to the matching parser.

Format	Function	Extra install
PDF	`partition_pdf`	`unstructured[pdf]`
Word (.docx)	`partition_docx`	`unstructured[docx]`
PowerPoint (.pptx)	`partition_pptx`	`unstructured[pptx]`
Excel (.xlsx)	`partition_xlsx`	`unstructured[xlsx]`
HTML	`partition_html`	none
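Markdown	`partition_md`	`unstructured[md]`
Plain text	`partition_text`	none
Email (.eml)	`partition_email`	none
Outlook (.msg)	`partition_msg`	`unstructured[msg]`
Image (PNG/JPG)	`partition_image`	`unstructured[image]` + tesseract
EPUB	`partition_epub`	`unstructured[epub]` + pandoc
RST	`partition_rst`	`unstructured[rst]` + pandoc
Org mode	`partition_org`	`unstructured[org]` + pandoc
CSV	`partition_csv`	`unstructured[csv]`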

python

from unstructured.partition.auto import partition

# Auto-detect format from extension
elements = partition("slides.pptx")

# Or call the format-specific function directly
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("report.pdf", strategy="hi_res")

Partition strategies

A partition strategy controls how aggressively unstructured analyses each page. Choose based on the trade-off between accuracy and speed.

python

from unstructured.partition.pdf import partition_pdf

# fast — pdfminer text extraction, no OCR, best for text-native PDFs
elements_fast = partition_pdf("report.pdf", strategy="fast")

# hi_res — layout-aware model + Tesseract OCR, best for scanned/complex PDFs
elements_hi = partition_pdf("report.pdf", strategy="hi_res")

# auto — picks fast if text is extractable, hi_res for scanned pages
elements_auto = partition_pdf("report.pdf", strategy="auto")

print(f"fast: {len(elements_fast)} elements")
print(f"hi_res: {len(elements_hi)} elements")

Output:

text

fast: 42 elements
hi_res: 58 elements

hi_res typically finds more elements because it analyses page layout.
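When a batch mixes text-native and scanned PDFs, one option is to try "fast" first and rerun with "hi_res" only when little text comes back. The sketch below is an assumption-laden illustration, not part of the library: partition_with_fallback, its min_chars threshold, and fake_partition are hypothetical names. The partition function is passed in so the logic can be exercised without a real PDF; in practice you would pass partition_pdf.

```python
def partition_with_fallback(path, partition_fn, min_chars=200):
    """Try the cheap strategy first; rerun with hi_res if little text came back.

    partition_fn is any callable shaped like partition_fn(path, strategy=...)
    that returns a list of elements, for example unstructured's partition_pdf.
    Returns (elements, strategy_used).
    """
    elements = partition_fn(path, strategy="fast")
    extracted = sum(len(str(el)) for el in elements)
    if extracted >= min_chars:
        return elements, "fast"
    # Too little text: the file is probably scanned, so pay for OCR.
    return partition_fn(path, strategy="hi_res"), "hi_res"


# Hypothetical stand-in for partition_pdf so the sketch runs without a PDF.
def fake_partition(path, strategy):
    if path == "scanned.pdf" and strategy == "fast":
        return []  # a scanned PDF has no embedded text layer
    return ["x" * 500]


print(partition_with_fallback("native.pdf", fake_partition)[1])   # fast
print(partition_with_fallback("scanned.pdf", fake_partition)[1])  # hi_res
```

strategy="auto" is described above as making a similar choice; an explicit wrapper mainly makes the threshold visible and easy to log.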

Element types

Every element has a category and rich metadata. Understanding element types lets you filter and route content correctly.

python

from unstructured.partition.auto import partition
from unstructured.documents.elements import Title, NarrativeText, Table, ListItem, Image

elements = partition("report.pdf", strategy="auto")

titles     = [e for e in elements if isinstance(e, Title)]
tables     = [e for e in elements if isinstance(e, Table)]
body_text  = [e for e in elements if isinstance(e, NarrativeText)]
list_items = [e for e in elements if isinstance(e, ListItem)]

print(f"Titles: {len(titles)}, Tables: {len(tables)}, Body: {len(body_text)}")

# Access metadata
for el in elements[:3]:
    print(f"[{el.category}] page={el.metadata.page_number} | {str(el)[:60]}")

Output:

text

Titles: 12, Tables: 5, Body: 38
[Title] page=1 | Quarterly Financial Report — Q1 2026
[NarrativeText] page=1 | Revenue increased by 12% year-over-year, driv
[Table] page=2 | Revenue Q1 2025 Q1 2026 12.4M 13.9M
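Because each element exposes category and text, routing can be as simple as grouping body content under the most recent Title. A minimal sketch under that assumption: group_by_section is a hypothetical helper, and the SimpleNamespace objects stand in for real elements so it runs without a document.

```python
from types import SimpleNamespace


def group_by_section(elements, skip=("Header", "Footer", "PageBreak")):
    """Attach each element's text to the most recent Title seen before it."""
    sections = {}
    current = "(no title)"
    for el in elements:
        if el.category in skip:
            continue  # page furniture rarely belongs in a section body
        if el.category == "Title":
            current = el.text
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(el.text)
    return sections


# Stand-in elements (hypothetical data) exposing .category and .text
elements = [
    SimpleNamespace(category="Header", text="ACME Corp"),
    SimpleNamespace(category="Title", text="Overview"),
    SimpleNamespace(category="NarrativeText", text="Revenue grew."),
    SimpleNamespace(category="ListItem", text="New markets"),
    SimpleNamespace(category="Title", text="Outlook"),
    SimpleNamespace(category="NarrativeText", text="Costs are stable."),
]

print(group_by_section(elements))
# {'Overview': ['Revenue grew.', 'New markets'], 'Outlook': ['Costs are stable.']}
```

Sections that share the same title text are merged by this sketch; key on (page, title) instead if that matters for your documents.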

Cleaning elements

unstructured provides a cleaners module (unstructured.cleaners.core) to normalise extracted text: collapse extra whitespace, strip bullets, replace dashes, and so on.

python

from unstructured.cleaners.core import (
    clean,
    clean_extra_whitespace,
    clean_bullets,
    clean_dashes,
    group_broken_paragraphs,
    bytes_string_to_string,
)

raw = "  •  Revenue  increased   by  12%  \n\n year-over-year.  "

cleaned = clean(
    raw,
    extra_whitespace=True,
    bullets=True,
    dashes=False,
    trailing_punctuation=False,
    lowercase=False,
)
print(repr(cleaned))

Output:

text

'Revenue increased by 12% year-over-year.'

python

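# Group broken paragraphs (common in PDFs with hard line-breaks).
# Paragraphs whose lines are all very short (fewer than five words each) are
# treated as intentional line breaks and left split, so this sample uses longer lines.
broken = (
    "Revenue increased by 12% year-over-year,\n"
    "driven by growth in new markets.\n\n"
    "Operating expenses remained stable at\n"
    "45% of revenue for the quarter."
)
grouped = group_broken_paragraphs(broken)
print(grouped)

Output:

text

Revenue increased by 12% year-over-year, driven by growth in new markets.

Operating expenses remained stable at 45% of revenue for the quarter.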

Chunking for RAG

Chunking splits a document's elements into manageable pieces before embedding. unstructured provides several strategies:

python

from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title
from unstructured.chunking.basic import chunk_elements

elements = partition("report.pdf")

# chunk_by_title — keeps sections together, splits on heading boundaries
chunks = chunk_by_title(
    elements,
    max_characters=1500,    # max chars per chunk
    new_after_n_chars=1200, # prefer to split earlier than max_characters
    combine_text_under_n_chars=500,  # merge small sections with neighbours until they reach this size
)

print(f"Chunks: {len(chunks)}")
for chunk in chunks[:2]:
    print(f"[{chunk.category}] chars={len(str(chunk))} | {str(chunk)[:80]}")

Output:

text

Chunks: 9
[CompositeElement] chars=1243 | Quarterly Financial Report — Q1 2026 Revenue increased by 12%…
[CompositeElement] chars=987 | Operating expenses remained stable at 45% of revenue. Expanded…

python

# chunk_elements — simple character-count splitting, ignores headings
basic_chunks = chunk_elements(elements, max_characters=800, new_after_n_chars=600)
print(f"Basic chunks: {len(basic_chunks)}")
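The interplay of max_characters (hard limit) and new_after_n_chars (soft limit) is easier to see on a toy packer. This is a simplified stand-in written for illustration, not the library's algorithm: pack is a hypothetical function that greedily joins text pieces, closes a chunk once it has reached the soft limit, and never lets a chunk grow past the hard limit.

```python
def pack(pieces, max_characters=50, new_after_n_chars=30, sep=" "):
    """Greedily pack text pieces into chunks.

    - A chunk that has reached new_after_n_chars (soft limit) is closed
      before the next piece is added.
    - A piece is never appended if the result would exceed
      max_characters (hard limit).
    Pieces longer than max_characters are not split in this sketch.
    """
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        too_full = len(current) >= new_after_n_chars
        would_overflow = len(candidate) > max_characters
        if current and (too_full or would_overflow):
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


pieces = ["w" * 10] * 6  # six 10-character pieces

# Soft limit below the hard limit: chunks close early, sizes are even.
print([len(c) for c in pack(pieces, max_characters=50, new_after_n_chars=30)])
# [32, 32]

# Soft limit equal to the hard limit: chunks fill up, the last one is a remainder.
print([len(c) for c in pack(pieces, max_characters=50, new_after_n_chars=50)])
# [43, 21]
```

Setting the soft limit somewhat below the hard limit tends to produce more evenly sized chunks, which is the reason the example above uses 1200 against 1500.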

Table extraction

Tables are extracted as Table elements. Access the raw HTML representation for structured parsing.

python

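import io

import pandas as pd
from unstructured.documents.elements import Table
from unstructured.partition.pdf import partition_pdf

# infer_table_structure=True asks hi_res to populate metadata.text_as_html
elements = partition_pdf("report.pdf", strategy="hi_res", infer_table_structure=True)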
tables = [e for e in elements if isinstance(e, Table)]

for i, table in enumerate(tables):
    print(f"Table {i+1} — page {table.metadata.page_number}")
    print("Text:", str(table)[:120])
    # Parse HTML table into a DataFrame
    html = table.metadata.text_as_html
    if html:
        dfs = pd.read_html(io.StringIO(html))
        print(dfs[0].head())
    print()

Output:

text

Table 1 — page 2
Text: Revenue Q1 2025 Q1 2026 12.4M 13.9M
     Metric  Q1 2025  Q1 2026
0   Revenue   12.4M    13.9M
1   OpEx       5.6M     6.2M
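pandas.read_html needs an HTML parser such as lxml installed. Where pandas is not available, the text_as_html string can also be read with the standard library. A minimal sketch: html_table_to_rows is a hypothetical helper, the sample HTML is illustrative data shaped like the output above, and nested tables or spanned cells are not handled.

```python
from html.parser import HTMLParser


class _TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate cells that appear without a <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def html_table_to_rows(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = _TableRows()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]


# Illustrative HTML, standing in for table.metadata.text_as_html
sample_html = (
    "<table>"
    "<tr><th>Metric</th><th>Q1 2025</th><th>Q1 2026</th></tr>"
    "<tr><td>Revenue</td><td>12.4M</td><td>13.9M</td></tr>"
    "<tr><td>OpEx</td><td>5.6M</td><td>6.2M</td></tr>"
    "</table>"
)

rows = html_table_to_rows(sample_html)
header, *body = rows
records = [dict(zip(header, row)) for row in body]
print(records[0])
# {'Metric': 'Revenue', 'Q1 2025': '12.4M', 'Q1 2026': '13.9M'}
```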

Loading from URLs and S3

python

# Load from a public URL
from unstructured.partition.html import partition_html

elements = partition_html(url="https://example.com/article")
print(len(elements), "elements extracted")

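# Load from S3. partition(url=...) fetches over HTTP(S) only, so read the
# object yourself (here with s3fs) and pass the bytes as a file object.
import io

import s3fs
from unstructured.partition.auto import partition

fs = s3fs.S3FileSystem()  # uses the default AWS credential chain
with fs.open("s3://my-bucket/documents/report.pdf", "rb") as remote:
    data = io.BytesIO(remote.read())

elements = partition(file=data, metadata_filename="report.pdf", strategy="fast")
print(f"Loaded {len(elements)} elements from S3")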

LangChain integration

LangChain's UnstructuredFileLoader and UnstructuredPDFLoader call unstructured under the hood and return Document objects.

python

from langchain_community.document_loaders import UnstructuredFileLoader, UnstructuredPDFLoader

# Generic loader — auto-detects format
loader = UnstructuredFileLoader("report.pdf", mode="elements")  # one Document per element
docs = loader.load()

print(f"Loaded {len(docs)} documents")
print(docs[0].page_content[:120])
print(docs[0].metadata)

Output:

text

Loaded 42 documents
Quarterly Financial Report — Q1 2026
{'source': 'report.pdf', 'filename': 'report.pdf', 'page_number': 1, 'category': 'Title'}

python

# Paged mode: combine elements into one Document per page
loader = UnstructuredFileLoader(
    "report.pdf",
    mode="paged",   # one Document per page
    strategy="fast",
)
paged_docs = loader.load()

LlamaIndex integration

LlamaIndex's UnstructuredReader wraps the unstructured library as a BaseReader.

python

from llama_index.readers.file import UnstructuredReader
from pathlib import Path

reader = UnstructuredReader()
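# split_documents=True returns one Document per element; the default merges everything into one
docs = reader.load_data(file=Path("report.pdf"), split_documents=True)

print(f"Loaded {len(docs)} documents")
for doc in docs[:2]:
    print(doc.text[:80])

Output:

text

Loaded 42 documents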
Quarterly Financial Report — Q1 2026
Revenue increased by 12% year-over-year, driven by…

Processing a directory of mixed files

python

from pathlib import Path
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

docs_dir = Path("./docs")
all_chunks = []

for path in docs_dir.rglob("*"):
    if path.suffix.lower() not in {".pdf", ".docx", ".pptx", ".html", ".md", ".txt"}:
        continue
    try:
        elements = partition(str(path), strategy="auto")
        chunks = chunk_by_title(elements, max_characters=1500)
        for chunk in chunks:
            all_chunks.append({
                "text":     str(chunk),
                "source":   str(path),
                "page":     chunk.metadata.page_number,
                "category": chunk.category,
            })
    except Exception as exc:
        print(f"Skipped {path}: {exc}")

print(f"Total chunks: {len(all_chunks)}")

Output:

text

Total chunks: 247
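A list of chunk records like all_chunks is usually persisted before embedding. One common choice is JSON Lines, one record per line. The sketch below is illustrative only: write_jsonl is a hypothetical helper, the sample records mimic the dictionaries built above, and a temporary directory is used so nothing is left on disk.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, out_path):
    """Write one JSON object per line; return the number of records written."""
    count = 0
    with Path(out_path).open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Hypothetical records with the same keys as all_chunks above
sample_chunks = [
    {"text": "Revenue increased by 12%.", "source": "docs/report.pdf",
     "page": 1, "category": "CompositeElement"},
    {"text": "Setup guide", "source": "docs/guide.md",
     "page": None, "category": "CompositeElement"},
]

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "chunks.jsonl"
    written = write_jsonl(sample_chunks, out_file)
    # Read the file back to confirm the records survive the round trip.
    lines = out_file.read_text(encoding="utf-8").splitlines()
    reloaded = [json.loads(line) for line in lines]

print(written, reloaded == sample_chunks)  # 2 True
```

A page value of None is stored as JSON null, so records from formats without pages round-trip unchanged.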

Quick reference

Task	Code
Partition any file	`partition("file.pdf")`
PDF (fast)	`partition_pdf("f.pdf", strategy="fast")`
PDF (hi-res OCR)	`partition_pdf("f.pdf", strategy="hi_res")`
HTML from URL	`partition_html(url="https://...")`
Filter by type	`[e for e in els if isinstance(e, Title)]`
Get page number	`el.metadata.page_number`
Get HTML table	`el.metadata.text_as_html`
Clean text	`clean(text, extra_whitespace=True, bullets=True)`
Chunk by heading	`chunk_by_title(elements, max_characters=1500)`
Basic chunk	`chunk_elements(elements, max_characters=800)`
LangChain loader	`UnstructuredFileLoader("f.pdf", mode="elements")`
LlamaIndex reader	`UnstructuredReader().load_data(file=Path("f.pdf"))`
Common element types	`Title, NarrativeText, Table, ListItem, Image, Header, Footer`