cheat sheet

unstructured

Package-level reference for unstructured on PyPI — install variants, the huge extras tree, system-level dependencies, and alternative parsers.

What it is

unstructured is a Python library from Unstructured.io that turns heterogeneous documents — PDF, DOCX, PPTX, HTML, EML, images, scanned pages — into a stream of Element objects (Title, NarrativeText, Table, Image, ListItem, …) suitable for ingestion into a RAG pipeline or vector store. It wraps a long list of upstream parsers (poppler, pdfminer.six, pikepdf, python-docx, python-pptx, BeautifulSoup, Tesseract, layout-detection models) behind a single partition() entry point.

Reach for unstructured when you have a mixed bag of file formats and want one consistent partitioning API instead of writing per-format glue. Reach for pymupdf / pdfplumber if you only need PDF, or docling / marker / llama-parse if you specifically want layout-aware modern PDF parsing.

Install

bash
pip install unstructured

Output: (none — exits 0 on success)

bash
uv add unstructured

Output: dependency resolved + added to pyproject.toml

bash
poetry add unstructured

Output: updated lockfile + virtualenv install

bash
pip install "unstructured[all-docs]"     # every format-specific Python dep

Output: installs the full document-parsing dependency tree (large)

Versioning & Python support

  • The package iterates frequently; format-handlers, OCR strategies, and table-extraction heuristics change across minors. Pin a tight version range in production ingestion pipelines.
  • Recent versions support Python 3.9+. The core package is pure Python, but the format extras pull in heavy dependencies, several with binary wheels (pillow, opencv-python-headless, pdf2image, pdfminer.six, unstructured.pytesseract, optionally torch for layout detection).
  • A separate hosted API exists at api.unstructured.io with a thin client (unstructured-client). The Python unstructured library is the self-hosted code path.
  • The partition_pdf function has had multiple strategy reshuffles — fast, hi_res, ocr_only, and (when supported) auto. Read the changelog when upgrading PDF-heavy pipelines.

Package metadata

  • Maintainer: Unstructured.io and community contributors
  • Project home: github.com/Unstructured-IO/unstructured
  • Docs: docs.unstructured.io
  • PyPI: pypi.org/project/unstructured
  • License: Apache-2.0 (with commercial-API service alongside)
  • Governance: company-led with open contributions; hosted Unstructured API for high-volume use
  • First released: 2022
  • Downloads: millions per month

Optional dependencies & extras

unstructured's extras tree is unusually large because each file format has its own parser dependencies.

  • unstructured[all-docs] — every format-specific Python dependency. The most common choice for general ingestion pipelines.
  • unstructured[pdf]: pdfminer.six, pdf2image, pikepdf, image deps. Use for PDF-only workloads.
  • unstructured[docx], unstructured[pptx], unstructured[xlsx], unstructured[odt] — Office-format parsers.
  • unstructured[html], unstructured[md], unstructured[epub], unstructured[rtf], unstructured[org], unstructured[rst] — markup parsers.
  • unstructured[image] — Pillow + image preprocessing for image inputs.
  • unstructured[csv], unstructured[tsv] — tabular parsers.
  • unstructured[email] — RFC 822 / EML parsing.

System-level dependencies are not handled by pip. Many parsers shell out to native binaries:

  • poppler (pdftoppm, pdftotext) — required for pdf2image and parts of the PDF path. Install via apt-get install poppler-utils, brew install poppler, or include in your Docker image.
  • tesseract-ocr — required for OCR strategies (hi_res, ocr_only). apt-get install tesseract-ocr, brew install tesseract.
  • libreoffice — used to convert some legacy Office formats to PDF before parsing.
  • For the hi_res strategy that uses a layout-detection model, torch and a GPU-capable runtime accelerate processing dramatically.

Plan Docker image sizes accordingly — unstructured[all-docs] plus system deps is comfortably > 2 GB.
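
Because pip cannot verify these binaries, a pre-flight check can report what is missing before any document is parsed. The sketch below uses only the standard library. The executable names (pdftoppm, tesseract, soffice) are assumptions about how a typical Linux install exposes poppler, Tesseract, and LibreOffice, and missing_binaries is a hypothetical helper, not part of unstructured.

```python
import shutil

# Assumed executable names mapped to the package that usually provides them;
# adjust for your platform or packaging.
REQUIRED_BINARIES = {
    "pdftoppm": "poppler-utils",
    "tesseract": "tesseract-ocr",
    "soffice": "libreoffice",
}


def missing_binaries(required=REQUIRED_BINARIES, which=shutil.which):
    """Return {binary: package} for every binary that is not found on PATH."""
    return {name: pkg for name, pkg in required.items() if which(name) is None}


if __name__ == "__main__":
    for name, pkg in missing_binaries().items():
        print(f"missing {name}: install {pkg}")
```

Running this at container start, or as a Docker build step, turns a mid-pipeline subprocess error into an immediate, readable failure.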

Alternatives

| Package | Trade-off |
| --- | --- |
| pymupdf (a.k.a. fitz) | Fast PDF parsing with text and image extraction. Use for PDF-only with no OCR. |
| pdfplumber | Excellent table extraction from text PDFs. Use when tables matter and pages are not scanned. |
| marker-pdf | Modern PDF-to-markdown via layout models. Use when you want clean markdown output. |
| docling | IBM Research's document parser with strong layout. Use for production document AI without going to a hosted API. |
| llama-parse (LlamaIndex Cloud) | Hosted PDF parser with layout. Use when you do not want to host parsers. |
| mineru / surya-ocr | Specialised research-grade layout/OCR models. Use when accuracy on edge-case documents trumps everything. |
| tika-python | Apache Tika bindings: JVM-backed parser for many formats. Use when you can run a Java sidecar. |

Common gotchas

  1. Install size explodes with [all-docs]. The Python deps alone add hundreds of MB; with system binaries on top, multi-GB images are routine. Use the narrowest extras you can — unstructured[pdf] is far smaller.
  2. partition_pdf strategies behave very differently. fast is text-extraction only and skips images; hi_res runs a layout-detection model (and needs torch); ocr_only runs Tesseract over rasterised pages. Choosing a strategy that does not match the document (for example fast on a scanned PDF) silently returns empty or low-quality elements rather than raising an error.
  3. GPU expectations for hi_res. The default layout model is CPU-tolerable for one-off use but slow at batch scale. Production hi_res pipelines effectively need a GPU.
  4. Missing system deps fail at runtime, not install time. pip install succeeds even when poppler or tesseract is not on the system; you only learn when partition_pdf raises a subprocess error mid-pipeline. Add a startup smoke test.
  5. Element schema is not stable across versions. New element types appear; existing ones occasionally have their attributes renamed. Downstream code that pickles Element objects must re-process after upgrades.
  6. Encoding edge cases on emails and legacy DOC files. partition_email is sensitive to malformed MIME; .doc (not .docx) requires LibreOffice for reliable conversion.
  7. Hosted API vs self-hosted parity. Some features (notably high-accuracy table extraction with the proprietary layout model) live in the hosted Unstructured API and are not reproducible locally with the open-source unstructured alone.

Real-world recipes

The recipes below show the install footprint, strategy choice, and system-dep requirements each pattern implies — the sections/ai/unstructured companion covers element types and partitioning APIs in more depth.

Single PDF, text-only, no system deps: the lightest possible call. partition detects the file type and dispatches to the matching format-specific handler.

python
from unstructured.partition.auto import partition

elements = partition(filename="report.pdf", strategy="fast")
for el in elements[:5]:
    print(type(el).__name__, "—", el.text[:80])

Output: a stream of Title, NarrativeText, ListItem, etc. objects; fast strategy uses pdfminer.six only (no OCR, no layout model) and is suitable for clean text PDFs

High-accuracy PDF with layout model + tables — the hi_res strategy runs a layout-detection model and is the only path that reliably extracts tables. Requires poppler + tesseract system deps and torch Python dep.

python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="scientific.pdf",
    strategy="hi_res",
    hi_res_model_name="yolox",
    infer_table_structure=True,
    extract_image_block_types=["Image", "Table"],
    extract_image_block_output_dir="./figures",
)
tables = [el for el in elements if type(el).__name__ == "Table"]
if tables:  # a document may yield no Table elements at all
    print(tables[0].metadata.text_as_html)

Output: elements include Table instances with structured HTML; figures are saved as image files; the layout model runs on CPU (slow) or GPU (fast)

Scanned-document OCR pipeline: the ocr_only strategy rasterises every page and runs Tesseract. Use for scanned PDFs and image inputs.

python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="scanned.pdf",
    strategy="ocr_only",
    languages=["eng", "fra"],     # tesseract language packs must be installed
    # ocr_languages="eng+fra" is the older spelling of the same setting;
    # pass one or the other, not both
)

Output: OCR-extracted text grouped into elements; tesseract language packs (apt install tesseract-ocr-fra) must be present on the system

Chunking for RAG ingest: chunk_by_title walks the element stream and groups under each Title, respecting a max character target. This is the canonical pre-vector-store step.

python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

elements = partition(filename="report.pdf", strategy="hi_res")
chunks = chunk_by_title(
    elements,
    max_characters=1500,
    new_after_n_chars=1200,
    combine_text_under_n_chars=200,
)
for chunk in chunks[:3]:
    print(len(chunk.text), "—", chunk.text[:120])

Output: a list of CompositeElement objects roughly 1.2–1.5k characters each, broken at title boundaries; each chunk carries metadata (page numbers, source filename, parent element) for citation

Bulk ingest a directory with partition + multiprocessing: unstructured is not async, but its parsing and OCR work is CPU-bound, so multiprocessing.Pool parallelises well.

python
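from multiprocessing import Pool
from pathlib import Path
from unstructured.partition.auto import partition

def process_one(path):
    # Runs in a worker process; the elements are pickled back to the parent.
    return path, partition(filename=str(path), strategy="fast")

def ingest_to_vector_store(path, elements):
    # Placeholder: replace with your own indexing code.
    print(path, len(elements))

if __name__ == "__main__":  # needed where workers are spawned (Windows, macOS)
    paths = sorted(Path("docs").glob("*.pdf"))
    with Pool(8) as pool:
        for path, elements in pool.imap_unordered(process_one, paths):
            ingest_to_vector_store(path, elements)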

Output: 8 parallel partition workers; for hi_res, drop to 2–4 workers (each loads the layout model into memory)
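
A bulk run is usually repeated as the directory grows, so it helps to skip files that were already ingested. The following is a minimal sketch of content-hash idempotency using only the standard library. The manifest file (a JSON list of digests) and the helper names file_digest, pending_files, and mark_done are assumptions for illustration, not part of unstructured.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's bytes, read in blocks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def _load_manifest(manifest_path):
    manifest = Path(manifest_path)
    return set(json.loads(manifest.read_text())) if manifest.exists() else set()


def pending_files(paths, manifest_path):
    """Yield (path, digest) for files whose digest is not yet in the manifest."""
    seen = _load_manifest(manifest_path)
    for path in paths:
        digest = file_digest(path)
        if digest not in seen:
            yield path, digest


def mark_done(digest, manifest_path):
    """Record a digest once its file has been ingested successfully."""
    seen = _load_manifest(manifest_path)
    seen.add(digest)
    Path(manifest_path).write_text(json.dumps(sorted(seen)))
```

Hashing content rather than file names means a renamed file is skipped and an edited file is re-processed. With several writer processes, the manifest would need a lock or a real database instead of a JSON file.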

HTML cleanup pipeline: partition_html strips boilerplate (nav, footer, scripts) and returns content elements only.

python
from unstructured.partition.html import partition_html

elements = partition_html(url="https://example.com/post")
text = "\n\n".join(el.text for el in elements if el.text)

Output: a single concatenated string of body content; much cleaner than BeautifulSoup.get_text() for typical article pages

Production deployment

unstructured is an ETL library, not a service. Production usage means wrapping it in a job scheduler (Airflow, Prefect, Dagster) or a queue-backed worker pool.

Topology checklist:

| Concern | Approach |
| --- | --- |
| Where parsing runs | dedicated worker pool with system deps installed |
| System deps | poppler-utils, tesseract-ocr + language packs, libreoffice (legacy DOC) |
| GPU for hi_res | optional but transformative: 10–50× faster at batch |
| Image size | base image with extras is 2+ GB; consider multi-stage builds |
| Failure handling | retry per-file with smaller strategy; skip on persistent failure |
| Idempotency | hash input files; skip already-processed |
| Output | element JSON to S3/object store, then index into vector DB |
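
The "retry per-file with smaller strategy" approach can be sketched as a small wrapper. This is an illustration under assumptions: partition_with_fallback is a hypothetical helper, the strategy order is a guess at a sensible default, and the partition function is passed in as an argument so the wrapper does not depend on a particular unstructured version.

```python
def partition_with_fallback(path, partition_fn, strategies=("hi_res", "fast")):
    """Try each strategy in order.

    Returns (strategy, elements, errors). strategy is None when every
    attempt failed, which is the signal to skip the file.
    """
    errors = []
    for strategy in strategies:
        try:
            elements = partition_fn(filename=str(path), strategy=strategy)
        except Exception as exc:  # parsers raise many different exception types
            errors.append((strategy, repr(exc)))
            continue
        if elements:
            return strategy, elements, errors
        # An empty result is treated as a failure so the next strategy runs.
        errors.append((strategy, "no elements returned"))
    return None, [], errors
```

A worker would call it as partition_with_fallback("report.pdf", partition) with partition imported from unstructured.partition.auto, log the errors list, and skip the file when the returned strategy is None.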

Docker base image. A working hi_res image needs:

dockerfile
FROM python:3.11-slim
RUN apt-get update && apt-get install -y \
    poppler-utils tesseract-ocr tesseract-ocr-eng \
    libgl1 libglib2.0-0 \
    libreoffice \
    && rm -rf /var/lib/apt/lists/*
RUN pip install "unstructured[all-docs]"

Output: ~2.5 GB image with all extras; build sizes drop dramatically with narrower extras (unstructured[pdf] only, no LibreOffice).

Worker pool sizing. fast strategy is CPU-light — scale workers to CPU count. hi_res loads a layout model (hundreds of MB), so each worker is RAM-heavy. Rule of thumb: 1 worker per 2 GB of available RAM for hi_res on CPU; 1 worker per GPU for hi_res on GPU.
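
The rule of thumb above translates into a few lines of arithmetic. The sketch below is only that rule written as code; hi_res_workers is a hypothetical helper, and the 2 GB default comes from the rule of thumb, not from a measurement.

```python
import os


def hi_res_workers(available_ram_gb, cpu_count=None, gpu_count=0,
                   ram_per_worker_gb=2.0):
    """Suggest a worker count for hi_res partitioning.

    On GPU: one worker per GPU. On CPU: one worker per ram_per_worker_gb of
    available RAM, capped at the CPU count, and never fewer than one.
    """
    if gpu_count > 0:
        return gpu_count
    cpus = cpu_count or os.cpu_count() or 1
    by_ram = int(available_ram_gb // ram_per_worker_gb)
    return max(1, min(cpus, by_ram))


if __name__ == "__main__":
    print(hi_res_workers(available_ram_gb=16, cpu_count=4))  # CPU-bound case
    print(hi_res_workers(available_ram_gb=5, cpu_count=8))   # RAM-bound case
```

Treat the result as a starting point and measure peak memory per worker on your own documents before fixing the pool size.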

Failure modes worth catching.

  • poppler not installed → pdf2image raises subprocess error mid-file.
  • tesseract not installed → ocr_only raises on first page.
  • libreoffice not installed → .doc (not .docx) fails to convert.
  • Layout model download → first hi_res call downloads several hundred MB from HuggingFace.

A startup smoke test (partition_pdf("test.pdf", strategy="hi_res") on a small sample at boot) surfaces all of these before real work arrives.

Schema stability. Element types are not perfectly stable across versions: new types appear and attributes are occasionally renamed. If you pickle Element objects, re-process after upgrades; serialising to JSON (unstructured.staging.base.elements_to_json) is safer.

Performance tuning

| Lever | Mechanism | When it helps |
| --- | --- | --- |
| strategy="fast" | text extraction only | clean text PDFs; throughput priority |
| strategy="hi_res" + GPU | layout model | scientific docs, tables, scans |
| strategy="ocr_only" | rasterise + Tesseract | true scanned PDFs |
| multiprocessing.Pool | CPU parallelism | many small files |
| extract_image_block_types=[...] | only extract what you need | smaller working set |
| Narrower extras | [pdf] instead of [all-docs] | leaner images |
| Hosted API | offload to managed service | when you don't want to maintain workers |

Strategy decision tree.

  1. Is the PDF text-based and well-structured? → fast.
  2. Does it have tables you need? → hi_res with infer_table_structure=True.
  3. Is it scanned (no extractable text)? → ocr_only.
  4. Mixed bag? → auto (where supported) lets the library pick per-file.
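
The decision tree can be written as a small function, which makes the precedence between the questions explicit. This is a sketch, not library behaviour: choose_strategy and its boolean inputs are hypothetical, and how you determine whether a PDF has a text layer is left to the caller. Tables are checked before the scanned case because the table above lists hi_res as covering scans as well.

```python
def choose_strategy(has_text_layer, needs_tables=False, mixed_batch=False):
    """Map the decision-tree questions onto a partition_pdf strategy name."""
    if mixed_batch:
        return "auto"      # let the library pick per file, where supported
    if needs_tables:
        return "hi_res"    # pair with infer_table_structure=True
    if not has_text_layer:
        return "ocr_only"  # scanned pages, nothing to extract directly
    return "fast"          # clean, text-based PDF


if __name__ == "__main__":
    print(choose_strategy(has_text_layer=True))
    print(choose_strategy(has_text_layer=False))
    print(choose_strategy(has_text_layer=True, needs_tables=True))
```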

GPU acceleration. hi_res ships with a layout model that runs via torch. With CUDA available, batch throughput improves 10–50×. The model is downloaded on first use; pre-bake it into the image to avoid cold-start delays.

Hosted API for high-accuracy tables. The proprietary layout model behind the hosted Unstructured API extracts complex tables better than the open-source yolox model. For workloads where table fidelity is critical, evaluate the hosted API or docling/marker-pdf as alternatives.

Embeddings & chunking strategy

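unstructured produces elements; chunking decides how elements become embedding inputs. The two built-in strategies are by_title (chunk_by_title, which respects section boundaries) and basic (chunk_elements in unstructured.chunking.basic, which fills each chunk up to the size limit regardless of sections).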

chunk_by_title — recommended. Groups elements under each Title element, splitting only when the running chunk exceeds max_characters. Preserves semantic boundaries.

python
from unstructured.chunking.title import chunk_by_title

chunks = chunk_by_title(
    elements,
    max_characters=1500,         # hard cap
    new_after_n_chars=1200,      # soft target
    combine_text_under_n_chars=200,
    overlap=100,                 # character overlap between adjacent chunks
)

Output: chunks averaging ~1.2k chars with 100-char overlap; metadata records source file, page numbers, and parent element type

Hierarchical embedding pattern. Embed chunks for retrieval, but keep the parent Document as the source of truth. At retrieval time, return the full parent (or a wider window around the chunk) — known as parent-document retrieval.
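
The bookkeeping for parent-document retrieval is small enough to sketch without a vector store. Everything here is illustrative: the Chunk dataclass, the id scheme, and the two helpers are assumptions, and the list of hit ids stands in for whatever your retriever returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    parent_id: str   # id of the document (or section) the chunk came from
    text: str


def build_parent_index(chunks):
    """Map each chunk id to the id of its parent document."""
    return {c.chunk_id: c.parent_id for c in chunks}


def expand_to_parents(hit_chunk_ids, parent_index, parents):
    """Turn ranked chunk hits into de-duplicated parent texts, keeping rank order."""
    seen, out = set(), []
    for chunk_id in hit_chunk_ids:
        parent_id = parent_index[chunk_id]
        if parent_id not in seen:
            seen.add(parent_id)
            out.append(parents[parent_id])
    return out
```

Only the chunk texts are embedded; the parents dictionary (or an object store keyed by parent id) holds the full text that is handed to the model.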

Chunk-size selection. Roughly:

| Chunk size (chars) | Use case |
| --- | --- |
| 300–500 | dense retrieval, narrow factual questions |
| 800–1500 | general RAG, modern long-context LMs |
| 2000–4000 | summarisation, reference lookup |

Smaller chunks improve recall (more focused matches); larger chunks improve generation quality (more context). The "right" size depends on the embedder's context window and the questions you expect.

Element-aware chunking. Beyond chunk_by_title, you can write custom chunking that respects Table boundaries (don't split tables across chunks), keeps lists together, or merges short Title elements with the following paragraph.
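
As one possible shape for such custom chunking, the sketch below groups elements by section and never splits a single element, so a table always stays in one chunk. It assumes each element exposes .category and .text, as unstructured Element objects do; the El namedtuple is a stand-in used so the example runs without the library, and chunk_keep_tables is a hypothetical name.

```python
from collections import namedtuple

# Stand-in for unstructured elements: only the two attributes used below.
El = namedtuple("El", "category text")


def chunk_keep_tables(elements, max_chars=1500):
    """Group elements into chunks (lists of elements).

    A new chunk starts at every Title and whenever adding the next element
    would exceed max_chars. Elements are never split, so a Table larger than
    max_chars becomes its own oversized chunk.
    """
    chunks, current, size = [], [], 0
    for el in elements:
        length = len(el.text)
        starts_section = el.category == "Title"
        overflows = bool(current) and size + length > max_chars
        if current and (starts_section or overflows):
            chunks.append(current)
            current, size = [], 0
        current.append(el)
        size += length
    if current:
        chunks.append(current)
    return chunks
```

Merging a short Title with its following paragraph falls out of this design, since a Title opens a chunk and the next element joins it unless the size limit is hit.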

Version migration guide

unstructured iterates frequently. Three axes change across versions:

Partition strategy reshuffles.

  • partition_pdf strategies have changed across versions: fast, hi_res, ocr_only, and (where supported) auto. Strategy behaviour also shifts; for example, the default hi_res model has changed (detectron2 → yolox).
  • extract_images_in_pdf=True (older) became extract_image_block_types=[...] (newer).
  • infer_table_structure=True is the current way to ask for table HTML; older flags varied.

Element schema. New element types appear, attribute names occasionally rename. Code that pickles Element objects across versions is fragile; prefer the JSON staging format (elements_to_json / elements_from_json) for durable storage.
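
One way to make stored output survive upgrades is to stamp each file with the library version that produced it, so stale files can be found and re-processed later. The sketch below assumes each element exposes to_dict() with JSON-serialisable output, as unstructured Element objects do; the envelope layout and the names dump_elements and load_elements are hypothetical.

```python
import json


def dump_elements(elements, library_version, path):
    """Write element dicts plus the producing library version to a JSON file."""
    payload = {
        "unstructured_version": library_version,
        "elements": [el.to_dict() for el in elements],
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False)


def load_elements(path, expected_version):
    """Return (element_dicts, stale); stale is True on a version mismatch."""
    with open(path, encoding="utf-8") as f:
        payload = json.load(f)
    stale = payload["unstructured_version"] != expected_version
    return payload["elements"], stale
```

The version string can be read with importlib.metadata.version("unstructured") at write time; files flagged as stale go back into the ingestion queue.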

Extras tree. The set of available extras has expanded — [all-docs] is the catch-all; format-specific extras are now finer-grained.

Hosted client split. unstructured-client is the SDK for api.unstructured.io; the self-hosted code path uses the main unstructured package. Don't confuse the two — the hosted API exposes features (proprietary layout model) that the open-source library does not.

Pinning strategy. Pin a tight version range in production ingestion pipelines:

text
unstructured[all-docs]>=0.16,<0.17

Re-process previously ingested documents when upgrading minors, especially if hi_res or table extraction is in your pipeline.
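
A pin in a requirements file only helps if the running environment matches it, so a worker can verify the installed version at startup. This is a minimal sketch using the standard library; parse_version is a deliberately simple numeric parser written for this example (it ignores pre-release suffixes), and the bounds mirror the pin shown above.

```python
import re
from importlib import metadata


def parse_version(text):
    """Leading numeric components as a tuple, e.g. '0.16.11' -> (0, 16, 11)."""
    return tuple(int(part) for part in re.findall(r"\d+", text)[:3])


def in_pinned_range(installed, lower, upper):
    """True when lower <= installed < upper, compared numerically."""
    return parse_version(lower) <= parse_version(installed) < parse_version(upper)


def installed_version(package="unstructured"):
    """Installed version string, or None when the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None or not in_pinned_range(found, "0.16", "0.17"):
        raise SystemExit(f"unexpected unstructured version: {found}")
```

Comparing tuples rather than strings matters here: as strings, "0.16.11" sorts before "0.16.2".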

Troubleshooting common errors

  • PDFInfoNotInstalledError / pdftoppm not found: poppler-utils is missing. apt install poppler-utils or brew install poppler.
  • TesseractNotFoundError: tesseract-ocr is missing. Install the binary and the language packs you need (tesseract-ocr-eng, etc.).
  • Empty element list from partition_pdf(strategy="fast") — the PDF has no extractable text (scanned). Switch to ocr_only or hi_res.
  • OSError: cannot identify image file — image format unsupported or corrupted. Many older Office documents use embedded images Pillow can't read; convert via LibreOffice first.
  • hi_res very slow on CPU — expected. Use GPU or fall back to fast for large batches.
  • libreoffice not found for .doc files — install LibreOffice. .docx works without it via python-docx.
  • Memory blow-up on huge PDFs: pre-split the file into page ranges with pikepdf and partition each part separately, passing starting_page_number= (where supported) so page numbers in the metadata still match the original document.
  • Layout model download stalls: the first hi_res run pulls hundreds of MB from HuggingFace. Pre-bake the model into the Docker image or pre-populate the HuggingFace cache directory (by default under ~/.cache/huggingface/).

Ecosystem integrations

  • LangChain: UnstructuredFileLoader, UnstructuredPDFLoader, etc., wrap partition as document loaders.
  • LlamaIndex: UnstructuredReader in llama-index-readers-file.
  • Haystack: UnstructuredFileConverter component via haystack-integrations.
  • Marker / Docling / LlamaParse: alternative parsers worth evaluating for PDF-heavy workloads. Different strengths: Marker for clean Markdown, Docling for IBM-research-grade layout, LlamaParse for hosted ease.
  • Dagster / Airflow / Prefect: unstructured plugs in as a Python operator/task in any of these; the ETL pattern is the typical home.

Security considerations

unstructured shells out to many native binaries and downloads layout models from HuggingFace on first use. Each is a security touchpoint.

  • Native binary surface. poppler, tesseract, and libreoffice are large attack surfaces. Track CVEs for each and pin base image versions; poppler and LibreOffice in particular have a long history of parser vulnerabilities in older releases.
  • Model downloads on first use. hi_res downloads a layout model (yolox / detectron2) from HuggingFace on first call. Pre-bake into images to (a) avoid runtime downloads, (b) verify model provenance ahead of time.
  • Malicious documents. PDFs can contain JavaScript, embedded files, and malformed structures that crash parsers. pikepdf and pdfminer.six have CVE history; sandbox the parsing process.
  • OCR-stage memory bombs. Crafted images at extreme dimensions can blow up Pillow. Cap input file sizes before invoking partition.
  • PII in extracted text. Outputs land in your vector store, logs, and downstream LM context. Apply PII redaction (e.g. Presidio) before indexing if regulated.
  • Hosted API data exposure. The hosted Unstructured API sees every uploaded document. For sensitive data, self-host or use the on-prem version of the API.
  • Subprocess isolation. libreoffice runs as a full office suite to convert .doc files; consider running parsing workers in a container/microVM rather than alongside other code.
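
Several of the points above reduce to "reject suspicious input before any parser sees it". The sketch below shows a minimal pre-parse gate using only the standard library. The size limit, the suffix allow-list, and the name check_input are assumptions to adapt; it complements sandboxing and does not replace it.

```python
import os


def check_input(path, max_bytes=50 * 1024 * 1024,
                allowed_suffixes=(".pdf", ".docx", ".html")):
    """Return a rejection reason, or None when the file may be parsed."""
    suffix = os.path.splitext(path)[1].lower()
    if suffix not in allowed_suffixes:
        return f"unsupported suffix {suffix!r}"
    size = os.path.getsize(path)
    if size > max_bytes:
        return f"file is {size} bytes, limit is {max_bytes}"
    return None
```

A worker would call check_input first and send rejected files to a quarantine location with the reason attached. A file-size cap does not bound the decoded dimensions of an image, so a separate pixel-count limit is still worth setting in the imaging layer.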

When NOT to use this

unstructured is the right tool for heterogeneous document ingestion at modest-to-large scale. It's the wrong tool when:

  • You only need clean text PDFs. pymupdf (faster) or pdfplumber (better tables on text PDFs) are leaner.
  • You need pristine Markdown out of PDFs. marker-pdf produces cleaner Markdown for layout-rich documents.
  • You want a hosted API. llama-parse (LlamaIndex Cloud) and the Unstructured hosted API both offload parsing.
  • Your inputs are already structured JSON / clean HTML. A 5-line BeautifulSoup script does the job — unstructured is overkill.
  • You can't afford the install footprint. [all-docs] plus system deps is multi-GB. For a single format, install only that extra.
  • Real-time low-latency parsing. unstructured is batch-oriented; for sub-second response, pre-parse and cache.

See also