cheat sheet
unstructured
Package-level reference for unstructured on PyPI — install variants, the huge extras tree, system-level dependencies, and alternative parsers.
What it is
unstructured is a Python library from Unstructured.io that turns heterogeneous documents — PDF, DOCX, PPTX, HTML, EML, images, scanned pages — into a stream of Element objects (Title, NarrativeText, Table, Image, ListItem, …) suitable for ingestion into a RAG pipeline or vector store. It wraps a long list of upstream parsers (poppler, pdfminer.six, pikepdf, python-docx, python-pptx, BeautifulSoup, Tesseract, layout-detection models) behind a single partition() entry point.
Reach for unstructured when you have a mixed bag of file formats and want one consistent partitioning API instead of writing per-format glue. Reach for pymupdf / pdfplumber if you only need PDF, or docling / marker / llama-parse if you specifically want layout-aware modern PDF parsing.
Install
pip install unstructured
Output: (none — exits 0 on success)
uv add unstructured
Output: dependency resolved + added to pyproject.toml
poetry add unstructured
Output: updated lockfile + virtualenv install
pip install "unstructured[all-docs]" # every format-specific Python dep
Output: installs the full document-parsing dependency tree (large)
Versioning & Python support
- The package iterates frequently; format-handlers, OCR strategies, and table-extraction heuristics change across minors. Pin a tight version range in production ingestion pipelines.
- Recent versions support Python 3.9+. The core package is pure Python, but the format extras pull in heavy dependencies (pillow, opencv-python-headless, pdf2image, pdfminer.six, unstructured.pytesseract, and optionally torch for layout detection).
- A separate hosted API exists at api.unstructured.io with a thin client (unstructured-client). The Python unstructured library is the self-hosted code path.
- The partition_pdf function has had multiple strategy reshuffles: fast, hi_res, ocr_only, and (when supported) auto. Read the changelog when upgrading PDF-heavy pipelines.
Package metadata
- Maintainer: Unstructured.io and community contributors
- Project home: github.com/Unstructured-IO/unstructured
- Docs: docs.unstructured.io
- PyPI: pypi.org/project/unstructured
- License: Apache-2.0 (with commercial-API service alongside)
- Governance: company-led with open contributions; hosted Unstructured API for high-volume use
- First released: 2022
- Downloads: millions per month
Optional dependencies & extras
unstructured's extras tree is unusually large because each file format has its own parser dependencies.
- unstructured[all-docs]: every format-specific Python dependency. The most common choice for general ingestion pipelines.
- unstructured[pdf]: pdfminer.six, pdf2image, pikepdf, image deps. Use for PDF-only workloads.
- unstructured[docx], unstructured[pptx], unstructured[xlsx], unstructured[odt]: Office-format parsers.
- unstructured[html], unstructured[md], unstructured[epub], unstructured[rtf], unstructured[org], unstructured[rst]: markup parsers.
- unstructured[image]: Pillow + image preprocessing for image inputs.
- unstructured[csv], unstructured[tsv]: tabular parsers.
- unstructured[email]: RFC 822 / EML parsing.
System-level dependencies are not handled by pip. Many parsers shell out to native binaries:
- poppler (pdftoppm, pdftotext): required for pdf2image and parts of the PDF path. Install via apt-get install poppler-utils, brew install poppler, or include in your Docker image.
- tesseract-ocr: required for OCR strategies (hi_res, ocr_only). Install via apt-get install tesseract-ocr or brew install tesseract.
- libreoffice: used to convert some legacy Office formats (for example .doc to .docx) before parsing.
- For the hi_res strategy that uses a layout-detection model, torch and a GPU-capable runtime accelerate processing dramatically.
Plan Docker image sizes accordingly — unstructured[all-docs] plus system deps is comfortably > 2 GB.
Alternatives
| Package | Trade-off |
|---|---|
| pymupdf (a.k.a. fitz) | Fast PDF parsing with text and image extraction. Use for PDF-only with no OCR. |
| pdfplumber | Excellent table extraction from text PDFs. Use when tables matter and pages are not scanned. |
| marker-pdf | Modern PDF-to-markdown via layout models. Use when you want clean markdown output. |
| docling | IBM Research's document parser with strong layout. Use for production document AI without going to a hosted API. |
| llama-parse (LlamaIndex Cloud) | Hosted PDF parser with layout. Use when you do not want to host parsers. |
| mineru / surya-ocr | Specialised research-grade layout/OCR models. Use when accuracy on edge-case documents trumps everything. |
| tika-python | Apache Tika bindings: a JVM-backed parser for many formats. Use when you can run a Java sidecar. |
Common gotchas
- Install size explodes with [all-docs]. The Python deps alone add hundreds of MB; with system binaries on top, multi-GB images are routine. Use the narrowest extras you can; unstructured[pdf] is far smaller.
- partition_pdf strategies behave very differently. fast is text-extraction only and skips images; hi_res runs a layout-detection model (and needs torch); ocr_only runs Tesseract over rasterised pages. The wrong strategy on the wrong document silently returns empty or low-quality elements.
- GPU expectations for hi_res. The default layout model is CPU-tolerable for one-off use but slow at batch scale. Production hi_res pipelines effectively need a GPU.
- Missing system deps fail at runtime, not install time. pip install succeeds even when poppler or tesseract is not on the system; you only learn when partition_pdf raises a subprocess error mid-pipeline. Add a startup smoke test.
- Element schema is not stable across versions. New element types appear; existing ones occasionally have their attributes renamed. Downstream code that pickles Element objects must re-process after upgrades.
- Encoding edge cases on emails and legacy DOC files. partition_email is sensitive to malformed MIME; .doc (not .docx) requires LibreOffice for reliable conversion.
- Hosted API vs self-hosted parity. Some features (notably high-accuracy table extraction with the proprietary layout model) live in the hosted Unstructured API and are not reproducible locally with the open-source unstructured alone.
Real-world recipes
The recipes below show the install footprint, strategy choice, and system-dep requirements each pattern implies — the sections/ai/unstructured companion covers element types and partitioning APIs in more depth.
Single PDF, text-only, no system deps — the lightest possible call. partition dispatches by file extension to the appropriate format-specific handler.
from unstructured.partition.auto import partition
elements = partition(filename="report.pdf", strategy="fast")
for el in elements[:5]:
    print(type(el).__name__, "-", el.text[:80])
Output: a stream of Title, NarrativeText, ListItem, etc. objects; fast strategy uses pdfminer.six only (no OCR, no layout model) and is suitable for clean text PDFs
High-accuracy PDF with layout model + tables — the hi_res strategy runs a layout-detection model and is the only path that reliably extracts tables. Requires poppler + tesseract system deps and torch Python dep.
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(
    filename="scientific.pdf",
    strategy="hi_res",
    hi_res_model_name="yolox",
    infer_table_structure=True,  # populates metadata.text_as_html on Table elements
    extract_image_block_types=["Image", "Table"],
    extract_image_block_output_dir="./figures",
)
tables = [el for el in elements if type(el).__name__ == "Table"]
if tables:  # a document may contain no detected tables
    print(tables[0].metadata.text_as_html)
Output: elements include Table instances with structured HTML; figures are saved as image files; the layout model runs on CPU (slow) or GPU (fast)
Scanned-document OCR pipeline — ocr_only strategy rasterises every page and runs Tesseract. Use for scanned PDFs and image inputs.
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(
    filename="scanned.pdf",
    strategy="ocr_only",
    # Tesseract language packs for each code must be installed.
    # Pass languages only: ocr_languages is the older, deprecated spelling of the same option.
    languages=["eng", "fra"],
)
Output: OCR-extracted text grouped into elements; tesseract language packs (apt install tesseract-ocr-fra) must be present on the system
Chunking for RAG ingest — chunk_by_title walks the element stream and groups under each Title, respecting a max character target. This is the canonical pre-vector-store step.
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title
elements = partition(filename="report.pdf", strategy="hi_res")
chunks = chunk_by_title(
    elements,
    max_characters=1500,
    new_after_n_chars=1200,
    combine_text_under_n_chars=200,
)
for chunk in chunks[:3]:
    print(len(chunk.text), "-", chunk.text[:120])
Output: a list of CompositeElement objects roughly 1.2–1.5k characters each, broken at title boundaries; each chunk carries metadata (page numbers, source filename, parent element) for citation
Bulk ingest a directory with partition + multiprocessing: unstructured is not async, but its work is CPU-bound (parsing, layout inference, OCR), so multiprocessing.Pool parallelises well.
from multiprocessing import Pool
from pathlib import Path
from unstructured.partition.auto import partition
def process_one(path):
    # Runs in a worker process; the returned elements are pickled back to the parent.
    return path, partition(filename=str(path), strategy="fast")
if __name__ == "__main__":  # required where workers are spawned (Windows, macOS)
    with Pool(8) as pool:
        for path, elements in pool.imap_unordered(process_one, Path("docs").glob("*.pdf")):
            ingest_to_vector_store(path, elements)  # placeholder for your own indexing function
Output: 8 parallel partition workers; for hi_res, drop to 2–4 workers (each loads the layout model into memory)
HTML cleanup pipeline — partition_html strips boilerplate (nav, footer, scripts) and returns content elements only.
from unstructured.partition.html import partition_html
elements = partition_html(url="https://example.com/post")
text = "\n\n".join(el.text for el in elements if el.text)
Output: a single concatenated string of body content; much cleaner than BeautifulSoup.get_text() for typical article pages
Production deployment
unstructured is an ETL library, not a service. Production usage means wrapping it in a job scheduler (Airflow, Prefect, Dagster) or a queue-backed worker pool.
Topology checklist:
| Concern | Approach |
|---|---|
| Where parsing runs | dedicated worker pool with system deps installed |
| System deps | poppler-utils, tesseract-ocr + language packs, libreoffice (legacy DOC) |
| GPU for hi_res | optional but transformative: 10–50× faster at batch |
| Image size | base image with extras is 2+ GB; consider multi-stage builds |
| Failure handling | retry per-file with smaller strategy; skip on persistent failure |
| Idempotency | hash input files; skip already-processed |
| Output | element JSON to S3/object store, then index into vector DB |
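The idempotency row above can be sketched with a content hash and a record of hashes already processed. This is a minimal illustration using only the standard library; file_digest, should_process, and the in-memory seen set are hypothetical names, and a real pipeline would persist the set (for example in a database table) rather than keep it in memory.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def should_process(path: Path, seen: set) -> bool:
    """Record the file's hash and report whether it is new.

    Keyed on content, so a renamed copy of an already-ingested file is skipped.
    """
    key = file_digest(path)
    if key in seen:
        return False
    seen.add(key)
    return True
```

Hashing content rather than file names means re-uploads under a new name are skipped, at the cost of reading each file once before deciding.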
Docker base image. A working hi_res image needs:
FROM python:3.11-slim
RUN apt-get update && apt-get install -y \
poppler-utils tesseract-ocr tesseract-ocr-eng \
libgl1 libglib2.0-0 \
libreoffice \
&& rm -rf /var/lib/apt/lists/*
RUN pip install "unstructured[all-docs]"
Output: ~2.5 GB image with all extras; build sizes drop dramatically with narrower extras (unstructured[pdf] only, no LibreOffice).
Worker pool sizing. fast strategy is CPU-light — scale workers to CPU count. hi_res loads a layout model (hundreds of MB), so each worker is RAM-heavy. Rule of thumb: 1 worker per 2 GB of available RAM for hi_res on CPU; 1 worker per GPU for hi_res on GPU.
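The rule of thumb above reduces to a small calculation. The sketch below is illustrative only: workers_for is a hypothetical helper, the 2 GB figure is this section's rule of thumb rather than a measured requirement, capping CPU hi_res workers at the core count is an added assumption, and available RAM and GPU count are passed in by the caller rather than detected.

```python
import os
from typing import Optional


def workers_for(
    strategy: str,
    available_ram_gb: float,
    gpu_count: int = 0,
    cpu_count: Optional[int] = None,
) -> int:
    """Suggest a worker count using the rule of thumb in this section."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    if strategy != "hi_res":
        # fast is CPU-light: one worker per core.
        return max(1, cpus)
    if gpu_count > 0:
        # hi_res on GPU: one worker per device.
        return gpu_count
    # hi_res on CPU: one worker per 2 GB of RAM, capped at the core count (assumption).
    return max(1, min(cpus, int(available_ram_gb // 2)))
```

For example, workers_for("hi_res", 16, cpu_count=16) suggests 8 workers, while the same machine running fast would use all 16 cores.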
Failure modes worth catching.
- poppler not installed → pdf2image raises subprocess error mid-file.
- tesseract not installed → ocr_only raises on first page.
- libreoffice not installed → .doc (not .docx) fails to convert.
- Layout model download → first hi_res call downloads several hundred MB from HuggingFace.
A startup smoke test (partition_pdf("test.pdf", strategy="hi_res") on a small sample at boot) surfaces all of these before real work arrives.
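Before running the heavier hi_res smoke test, a cheaper first check is whether the native binaries are on PATH at all. The sketch below uses only shutil.which. The executable names (pdftoppm, tesseract, soffice) are the usual ones for poppler-utils, Tesseract, and LibreOffice, which is an assumption about your platform's packaging, and missing_binaries is a hypothetical helper, not part of unstructured.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Usual executable names; adjust if your distribution packages them differently.
REQUIRED_BINARIES = ("pdftoppm", "tesseract", "soffice")


def missing_binaries(
    names: Iterable[str] = REQUIRED_BINARIES,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the executables from names that cannot be found on PATH."""
    return [name for name in names if which(name) is None]
```

A worker could call missing_binaries() at start-up and refuse to take jobs when the list is non-empty. The which parameter exists only so the lookup can be replaced in tests.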
Schema stability. Element types are not perfectly stable across versions — new types appear, occasional attribute renames happen. If you pickle Element objects, re-process after upgrades; if you serialise to dict (unstructured.staging.base.elements_to_json), you're safer.
Performance tuning
| Lever | Mechanism | When it helps |
|---|---|---|
| strategy="fast" | text extraction only | clean text PDFs; throughput priority |
| strategy="hi_res" + GPU | layout model | scientific docs, tables, scans |
| strategy="ocr_only" | rasterise + Tesseract | true scanned PDFs |
| multiprocessing.Pool | CPU parallelism | many small files |
| extract_image_block_types=[...] | only extract what you need | smaller working set |
| Narrower extras | [pdf] instead of [all-docs] | leaner images |
| Hosted API | offload to managed | when you don't want to maintain workers |
Strategy decision tree.
- Is the PDF text-based and well-structured? → fast.
- Does it have tables you need? → hi_res with infer_table_structure=True.
- Is it scanned (no extractable text)? → ocr_only.
- Mixed bag? → auto (where supported) lets the library pick per-file.
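The decision tree above can be written down as a small function so the choice is explicit and testable. This is a sketch: choose_strategy and its two inputs are hypothetical, and how you determine whether a file has a text layer or needs table structure is left to the caller. The order of checks is one reasonable reading of the list, with an unknown text layer falling through to auto.

```python
from typing import Optional


def choose_strategy(has_text_layer: Optional[bool], needs_tables: bool) -> str:
    """Map two facts about a PDF to a partition_pdf strategy name.

    has_text_layer: True for text PDFs, False for scans, None when unknown.
    """
    if has_text_layer is None:
        return "auto"      # mixed or unknown input: let the library pick
    if needs_tables:
        return "hi_res"    # pair with infer_table_structure=True
    if not has_text_layer:
        return "ocr_only"  # scanned pages with nothing to extract
    return "fast"          # clean text PDF, throughput priority
```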
GPU acceleration. hi_res ships with a layout model that runs via torch. With CUDA available, batch throughput improves 10–50×. The model is downloaded on first use; pre-bake it into the image to avoid cold-start delays.
Hosted API for high-accuracy tables. The proprietary layout model behind the hosted Unstructured API extracts complex tables better than the open-source yolox model. For workloads where table fidelity is critical, evaluate the hosted API or docling/marker-pdf as alternatives.
Embeddings & chunking strategy
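unstructured produces elements; chunking decides how elements become embedding inputs. The two built-in strategies are by_title (chunk_by_title, which respects section boundaries) and basic (chunk_elements in unstructured.chunking.basic, which fills each chunk up to the size limit regardless of section boundaries).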
chunk_by_title — recommended. Groups elements under each Title element, splitting only when the running chunk exceeds max_characters. Preserves semantic boundaries.
from unstructured.chunking.title import chunk_by_title
chunks = chunk_by_title(
    elements,
    max_characters=1500,  # hard cap
    new_after_n_chars=1200,  # soft target
    combine_text_under_n_chars=200,
    overlap=100,  # characters carried over when an oversized element is split
    overlap_all=True,  # also apply the overlap between ordinary adjacent chunks
)
Output: chunks averaging ~1.2k chars with 100-char overlap; metadata records source file, page numbers, and parent element type
Hierarchical embedding pattern. Embed chunks for retrieval, but keep the parent Document as the source of truth. At retrieval time, return the full parent (or a wider window around the chunk) — known as parent-document retrieval.
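A minimal sketch of the bookkeeping behind parent-document retrieval, using plain dictionaries in place of a vector store. Everything here is illustrative: ParentIndex, its methods, and the chunk and parent ids are hypothetical, and the similarity search itself is assumed to happen elsewhere and return chunk ids.

```python
from typing import Dict, Iterable, List


class ParentIndex:
    """Maps chunk ids to the parent document they were cut from."""

    def __init__(self) -> None:
        self._parents: Dict[str, str] = {}          # parent_id -> full text
        self._chunk_to_parent: Dict[str, str] = {}  # chunk_id -> parent_id

    def add(self, parent_id: str, full_text: str, chunk_ids: Iterable[str]) -> None:
        self._parents[parent_id] = full_text
        for chunk_id in chunk_ids:
            self._chunk_to_parent[chunk_id] = parent_id

    def expand(self, hit_chunk_ids: Iterable[str]) -> List[str]:
        """Turn retrieved chunk ids into parent texts, deduplicated, in hit order."""
        seen = set()
        parents = []
        for chunk_id in hit_chunk_ids:
            parent_id = self._chunk_to_parent[chunk_id]
            if parent_id not in seen:
                seen.add(parent_id)
                parents.append(self._parents[parent_id])
        return parents
```

Deduplicating by parent keeps the generation context from filling up with the same document when several of its chunks match.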
Chunk-size selection. Roughly:
| Chunk size (chars) | Use case |
|---|---|
| 300–500 | dense retrieval, narrow factual questions |
| 800–1500 | general RAG, modern long-context LMs |
| 2000–4000 | summarisation, reference lookup |
Smaller chunks improve recall (more focused matches); larger chunks improve generation quality (more context). The "right" size depends on the embedder's context window and the questions you expect.
Element-aware chunking. Beyond chunk_by_title, you can write custom chunking that respects Table boundaries (don't split tables across chunks), keeps lists together, or merges short Title elements with the following paragraph.
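As an illustration of element-aware chunking, the sketch below packs elements into chunks up to a character budget but never splits a Table: a table always becomes a chunk of its own. It relies only on each element exposing .category and .text, as unstructured elements do. The Item dataclass is a stand-in so the example runs without the library, and pack_elements is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Item:
    """Stand-in for an unstructured element: only category and text are used."""
    category: str
    text: str


def pack_elements(elements: Iterable[Item], max_characters: int = 1500) -> List[List[Item]]:
    """Group elements into chunks without ever splitting or merging a Table."""
    chunks: List[List[Item]] = []
    current: List[Item] = []
    size = 0
    for el in elements:
        if el.category == "Table":
            if current:                 # close the running chunk first
                chunks.append(current)
                current, size = [], 0
            chunks.append([el])         # the table is its own chunk, whatever its size
            continue
        if current and size + len(el.text) > max_characters:
            chunks.append(current)
            current, size = [], 0
        current.append(el)
        size += len(el.text)
    if current:
        chunks.append(current)
    return chunks
```

A table larger than max_characters is kept whole here by design; whether that is acceptable depends on the embedder's input limit.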
Version migration guide
unstructured iterates frequently. Three axes change across versions:
Partition strategy reshuffles.
- partition_pdf strategies have changed across versions: fast, hi_res, ocr_only, and (where supported) auto. Strategy behaviour also shifts: the hi_res default model has changed (detectron2 → yolox).
- extract_images_in_pdf=True (older) became extract_image_block_types=[...] (newer).
- infer_table_structure=True is the current way to ask for table HTML; older flags varied.
Element schema. New element types appear, attribute names occasionally rename. Code that pickles Element objects across versions is fragile; prefer the JSON staging format (elements_to_json / elements_from_json) for durable storage.
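A sketch of the JSON round trip recommended above, using elements_to_json and elements_from_json from unstructured.staging.base. The imports sit inside the functions so the module loads even where unstructured is not installed. save_elements and load_elements are hypothetical wrapper names, and the filename keyword is assumed to match the version you have pinned.

```python
def save_elements(elements, path: str) -> None:
    """Write elements to a JSON file instead of pickling them."""
    from unstructured.staging.base import elements_to_json

    elements_to_json(elements, filename=path)


def load_elements(path: str):
    """Rebuild Element objects from a JSON file written by save_elements."""
    from unstructured.staging.base import elements_from_json

    return elements_from_json(filename=path)
```

Storing the JSON alongside the source file's hash makes it straightforward to decide later which documents need re-processing after an upgrade.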
Extras tree. The set of available extras has expanded — [all-docs] is the catch-all; format-specific extras are now finer-grained.
Hosted client split. unstructured-client is the SDK for api.unstructured.io; the self-hosted code path uses the main unstructured package. Don't confuse the two — the hosted API exposes features (proprietary layout model) that the open-source library does not.
Pinning strategy. Pin a tight version range in production ingestion pipelines:
unstructured[all-docs]>=0.16,<0.17
Re-process previously ingested documents when upgrading minors, especially if hi_res or table extraction is in your pipeline.
Troubleshooting common errors
- PDFInfoNotInstalledError / pdftoppm not found: poppler-utils is missing. Run apt install poppler-utils or brew install poppler.
- TesseractNotFoundError: tesseract-ocr is missing. Install the binary and the language packs you need (tesseract-ocr-eng, etc.).
- Empty element list from partition_pdf(strategy="fast"): the PDF has no extractable text (scanned). Switch to ocr_only or hi_res.
- "OSError: cannot identify image file": image format unsupported or corrupted. Many older Office documents use embedded images Pillow can't read; convert via LibreOffice first.
- hi_res very slow on CPU: expected. Use GPU or fall back to fast for large batches.
- libreoffice not found for .doc files: install LibreOffice. .docx works without it via python-docx.
- Memory blow-up on huge PDFs: pre-split the file with pikepdf and partition one page range at a time; starting_page_number= keeps the page numbers in element metadata aligned with the original document.
- Layout model download stalls: first hi_res run pulls hundreds of MB. Pre-bake into the Docker image or pre-populate the HuggingFace cache (~/.cache/huggingface/ by default).
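The empty-element-list case above lends itself to an automatic fallback: try the cheap strategy first and escalate only when it returns nothing. The sketch below takes the partition function as an argument so it can be exercised without the library; in real use you would pass unstructured.partition.pdf.partition_pdf. partition_with_fallback and the default strategy order are hypothetical choices, not library behaviour.

```python
from typing import Any, Callable, List, Sequence, Tuple


def partition_with_fallback(
    partition_fn: Callable[..., List[Any]],
    filename: str,
    strategies: Sequence[str] = ("fast", "ocr_only"),
) -> Tuple[str, List[Any]]:
    """Try each strategy in order; return the first that yields elements.

    Returns (strategy_used, elements). If every strategy comes back empty,
    the last strategy name is returned with an empty list.
    """
    if not strategies:
        raise ValueError("at least one strategy is required")
    elements: List[Any] = []
    for strategy in strategies:
        elements = partition_fn(filename=filename, strategy=strategy)
        if elements:
            return strategy, elements
    return strategies[-1], elements
```

Recording which strategy finally succeeded is useful when auditing ingestion quality, since OCR output is usually noisier than text extraction.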
Ecosystem integrations
- LangChain: UnstructuredFileLoader, UnstructuredPDFLoader, etc., wrap partition as document loaders.
- LlamaIndex: UnstructuredReader in llama-index-readers-file.
- Haystack: UnstructuredFileConverter component via haystack-integrations.
- Marker / Docling / LlamaParse: alternative parsers worth evaluating for PDF-heavy workloads. Different strengths: Marker for clean Markdown, Docling for IBM-research-grade layout, LlamaParse for hosted ease.
- Dagster / Airflow / Prefect: unstructured plugs in as a Python operator/task in any of these; the ETL pattern is the typical home.
Security considerations
unstructured shells out to many native binaries and downloads layout models from HuggingFace on first use. Each is a security touchpoint.
- Native binary surface. poppler, tesseract, and libreoffice are large attack surfaces. Track CVEs for each and pin base image versions; historic poppler releases in particular have had vulnerabilities triggered by crafted PDFs.
- Model downloads on first use. hi_res downloads a layout model (yolox / detectron2) from HuggingFace on first call. Pre-bake into images to (a) avoid runtime downloads, (b) verify model provenance ahead of time.
- Malicious documents. PDFs can contain JavaScript, embedded files, and malformed structures that crash parsers. pikepdf and pdfminer.six have CVE history; sandbox the parsing process.
- OCR-stage memory bombs. Crafted images at extreme dimensions can blow up Pillow. Cap input file sizes before invoking partition.
- PII in extracted text. Outputs land in your vector store, logs, and downstream LM context. Apply PII redaction (e.g. Presidio) before indexing if regulated.
- Hosted API data exposure. The hosted Unstructured API sees every uploaded document. For sensitive data, self-host or use the on-prem version of the API.
- Subprocess isolation. libreoffice runs as a full office suite to convert .doc files; consider running parsing workers in a container/microVM rather than alongside other code.
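The size cap mentioned in the list above can be enforced before a file ever reaches partition. This is a minimal sketch with the standard library only; check_input_size, InputTooLarge, and the 50 MB default are hypothetical and should be tuned to your documents. It checks the size on disk, not decoded image dimensions, so it complements sandboxing and does not replace it.

```python
import os


class InputTooLarge(ValueError):
    """Raised when a file exceeds the configured size cap."""


def check_input_size(path: str, max_bytes: int = 50 * 1024 * 1024) -> int:
    """Return the file size in bytes, or raise InputTooLarge if it exceeds max_bytes."""
    size = os.path.getsize(path)
    if size > max_bytes:
        raise InputTooLarge(f"{path}: {size} bytes exceeds cap of {max_bytes}")
    return size
```

Raising a dedicated exception lets the worker route oversized files to a quarantine queue instead of treating them as ordinary parse failures.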
When NOT to use this
unstructured is the right tool for heterogeneous document ingestion at modest-to-large scale. It's the wrong tool when:
- You only need clean text PDFs. pymupdf (faster) or pdfplumber (better tables on text PDFs) are leaner.
- You need pristine Markdown out of PDFs. marker-pdf produces cleaner Markdown for layout-rich documents.
- You want a hosted API. llama-parse (LlamaIndex Cloud) and the Unstructured hosted API both offload parsing.
- Your inputs are already structured JSON / clean HTML. A 5-line BeautifulSoup script does the job; unstructured is overkill.
- You can't afford the install footprint. [all-docs] plus system deps is multi-GB. For a single format, install only that extra.
- Real-time low-latency parsing. unstructured is batch-oriented; for sub-second response, pre-parse and cache.
See also
- AI: unstructured — partitioning APIs, element types, RAG ingestion
- Concept: RAG — retrieval-augmented generation patterns
- Concept: filesystem — file I/O and paths