cheat sheet

xidel Web Scraping & Data Extraction

Extract data from HTML, XML, and JSON using XPath, CSS selectors, pattern matching, and JSONiq from the command line.

What it is

xidel is an open-source command-line tool for downloading web pages and extracting structured data from HTML, XML, and JSON sources, maintained at videlibri.de. It supports XPath 2.0/3.0, CSS selectors, custom template-based pattern matching, and JSONiq, making it one of the most expressive scraping tools available without writing a full script. Reach for xidel when you need to query deeply nested HTML or XML documents with XPath from a shell pipeline, or when grep/sed are too brittle for structured markup.

Install: apt-get install xidel (Debian/Ubuntu), brew install xidel (macOS), or download from videlibri.de/xidel.html

Release status (May 2026): the last tagged stable release on GitHub remains 0.9.8 (April 2022). Development version 0.9.9 is published irregularly as preview binaries for Windows, Linux, macOS, and Android — it ships ~99.6% XPath/XQuery 3.1 coverage, partial XPath 4.0 syntax, --json-mode, --in-place, and new extension functions (inner-text, x:request-decode, matched-text). The project is low-velocity but not abandoned. If a feature below appears missing on 0.9.8, grab a 0.9.9 preview build from videlibri.de.
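
Because features differ between 0.9.8 and the 0.9.9 previews, it can help to check which build is on the PATH before relying on a newer flag. A minimal sketch, assuming xidel's --version flag prints a version banner; the banner layout is not assumed, so it is printed verbatim rather than parsed:

```shell
#!/bin/sh
# Report which xidel build is on PATH. The banner format may differ
# between builds, so it is shown as-is instead of being parsed.
if command -v xidel >/dev/null 2>&1; then
  xidel_status="installed"
  xidel --version 2>&1 | head -n 3
else
  xidel_status="missing"
  echo "xidel not found on PATH" >&2
fi
echo "xidel: $xidel_status"
```

If the banner reports 0.9.8, treat the 0.9.9-only items listed above as unavailable.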

Extract with XPath

XPath is a query language for navigating the tree structure of HTML and XML documents; xidel supports XPath 2.0/3.0, which adds functions, sequences, and regular expressions beyond what most tools support. Use --extract with an XPath expression when you need to traverse nested elements, filter by attribute value, or apply string functions that CSS selectors cannot express.

bash
# Extract all link href attributes from a page
xidel https://example.org --extract "//a/@href"

Output:

text
/
/about
/contact
https://docs.example.org/
https://github.com/example/repo
bash
# Extract the result URLs from a Google search page
xidel "https://www.google.com/search?q=linux+tips" \
  --extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"

Output:

text
https://www.linuxcommand.org/
https://linuxjourney.com/
https://tldr.sh/
https://cheat.sh/
bash
# Extract all image sources
xidel https://example.org --extract "//img/@src"

Output:

text
/assets/logo.png
/assets/hero.jpg
/assets/icons/arrow.svg
bash
# Extract text content of all headings
xidel https://example.org --extract "//h1|//h2|//h3"

Output:

text
Welcome to Example
Getting Started
Installation
Configuration
API Reference

Extract with CSS selectors

CSS selectors are a concise alternative to XPath for element selection by tag name, class, ID, or attribute — the same syntax used in browser DevTools. Use --css when the query is simple and familiar from web development; switch to XPath when you need axis traversal, positional predicates, or string operations.

bash
# Extract text of all paragraphs
xidel https://example.org --css "p"

Output:

text
This is the first paragraph describing the product.
Use it to simplify your workflow and automate tasks.
See the documentation for full details.
bash
# Extract href from all nav links (css() selects the elements, /@href reads the attribute)
xidel https://example.org --extract "css('nav a')/@href"

Output:

text
/
/docs
/api
/blog
/contact
bash
# Combine: follow CSS-selected links and extract their titles
xidel https://example.org --follow "css('a')" --css title

Output:

text
Example - Home
Example - Documentation
Example - API Reference
Example - Blog

Pattern matching (template syntax)

Pattern matching lets you describe the data you want as an annotated copy of the markup. Write the surrounding tags literally and put a {.} placeholder where the wanted value sits:

bash
# Extract whatever is between <title> and </title>
xidel https://example.org --extract "<title>{.}</title>"

Output:

text
Example Domain
bash
# Follow all <a> links and extract each page's title
xidel https://example.org \
  --follow "<a>{.}</a>*" \
  --extract "<title>{.}</title>"

Output:

text
Example Domain
Example - About
Example - Contact
Example - Documentation
bash
# Extract a specific nested value — also validates structure is present
xidel path/to/example.xml \
  --extract "<x><foo>ood</foo><bar>{.}</bar></x>"

Output:

text
the bar value

--follow takes an XPath or CSS expression that selects URLs, fetches each one, and applies the --extract expression to the resulting pages. This turns xidel into a single-command crawler — useful for scraping paginated sites or downloading all assets linked from a page.

bash
# Follow all <a> tags on a page and print each linked page's title
xidel https://example.org --follow //a --extract //title

Output:

text
Example Domain
Example - Getting Started
Example - API Reference
Example - Changelog
bash
# Follow Google result links, print titles, download pages into host-named dirs
xidel "https://www.google.com/search?q=test" \
  --follow "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']" \
  --extract //title \
  --download '{$host}/'

Output:

text
Test - Wikipedia
Software testing - MDN
Pytest documentation
…

JSON APIs

xidel parses JSON responses and exposes them as an XPath-navigable tree, so the same --extract "//field" syntax works for JSON as it does for XML. This makes it a lightweight alternative to curl | jq when the extraction logic is straightforward.

bash
# Extract a field from a JSON API response
xidel https://api.github.com/repos/octocat/Hello-World --extract "//name"

Output:

text
Hello-World
bash
# Extract a field nested inside each array item
xidel https://api.example.com/data.json --extract "//items/title"

Output:

text
First Article
Second Article
Third Article

Structured output from RSS / Atom

RSS and Atom feeds are well-formed XML, making them ideal targets for xidel's template syntax. Named variable assignments (field:=.) let you pair related values from different elements in a single pass, producing structured records rather than a flat list of values.

bash
# Extract title + URL from every Stack Overflow question in the RSS feed
xidel http://stackoverflow.com/feeds \
  --extract "<entry><title>{title:=.}</title><link>{uri:=@href}</link></entry>+"

Output:

text
title: How do I reverse a list in Python?
uri: https://stackoverflow.com/questions/3940128/

title: Difference between append and extend in Python
uri: https://stackoverflow.com/questions/252703/

title: How to check if a file exists in Python?
uri: https://stackoverflow.com/questions/82831/
…

The + at the end means "repeat this pattern one or more times." Named variables (title:=, uri:=) pair related fields.

Form automation & login

xidel can submit HTML forms by wrapping a CSS selector for the form element with form() and a dictionary of field values. It maintains cookies across requests, enabling login flows where subsequent --follow and --extract calls run as the authenticated session.

bash
# Log in to Reddit and check unread mail count
# Combines CSS selectors, XPath, JSONiq, and form evaluation
xidel https://reddit.com \
  --follow "form(css('form.login-form')[1], {'user': 'myuser', 'passwd': 'mypassword'})" \
  --extract "css('#mail')/@title"

Output:

text
3 messages

Output formats

--output-format controls how extracted values are serialized. The default adhoc prints one result per line, json produces a JSON array suitable for piping into jq, and xml wraps results in a <result> element. Choose json or xml when passing xidel output to another tool that expects structured data.

bash
# Output as JSON array
xidel https://example.org --extract "//a/@href" --output-format json

Output:

text
["/","\/about","\/contact","https:\/\/docs.example.org\/","https:\/\/github.com\/example\/repo"]
bash
# Output as XML
xidel https://example.org --extract "//a" --output-format xml

Output:

text
<result>
  <a href="/">Home</a>
  <a href="/about">About</a>
  <a href="/contact">Contact</a>
</result>
bash
# Wrap each result on its own line (default)
xidel https://example.org --extract "//a/@href" --output-format adhoc

Output:

text
/
/about
/contact
https://docs.example.org/
https://github.com/example/repo

Query language comparison

Query | XPath | CSS | Pattern
All links | //a | css('a') | <a>{.}</a>*
Link href | //a/@href | css('a')/@href | <a href="{.}">
Page title | //title | css('title') | <title>{.}</title>
First h1 | //h1[1] | css('h1:first-of-type') |

Combine with shell pipelines

bash
# Save all scraped URLs to a file for aria2c batch download
xidel https://example.org/downloads --extract "//a[contains(@href,'.iso')]/@href" \
  > iso-urls.txt
aria2c --input-file=iso-urls.txt -c -d ~/Downloads

Contents of iso-urls.txt:

text
https://example.org/downloads/ubuntu-24.04-desktop-amd64.iso
https://example.org/downloads/ubuntu-24.04-server-amd64.iso
https://example.org/downloads/debian-12.5.0-amd64-netinst.iso

XPath 3.0 essentials

XPath 3.0 is the query language xidel uses by default; it extends XPath 2.0 with higher-order functions, the simple map operator (!), and the string concatenation operator (||), while map and array types arrive with XPath 3.1. Understanding the four primary building blocks (path steps, predicates, axes, and functions) turns xidel into a precise surgical tool rather than a guess-and-check scraper.

Path expressions

A path expression is a sequence of steps separated by / (one level down) or // (any descendant), where each step yields a sequence of nodes that the next step traverses. A leading / anchors at the document root; a leading // searches every descendant of the root, so //foo finds every foo element anywhere in the tree.

bash
xidel page.html --extract "/html/body//p"        # absolute path to paragraphs
xidel page.html --extract "//div//span"          # spans nested under any div
xidel page.html --extract "//article/h1"         # direct h1 children of article
xidel page.html --extract "//*[@id='main']"      # any element with id="main"

Output:

text
Welcome to the article
Section one heading
A paragraph in section one
Section two heading

Predicates

A predicate is a […] filter appended to a step that keeps only nodes matching its condition. Numeric predicates select positionally ([1] is first, [last()] is last); boolean predicates filter by attribute, text content, or function results.

bash
# First link, last link, links with class "external"
# (//a)[1] is the first link in the document; //a[1] would match the first <a> under each parent
xidel page.html --extract "(//a)[1]"
xidel page.html --extract "(//a)[last()]"
xidel page.html --extract "//a[@class='external']"
xidel page.html --extract "//a[contains(@href, 'github')]"
xidel page.html --extract "//li[position() <= 3]"     # first three items of each list
xidel page.html --extract "//tr[td[3] > 100]"         # rows where col 3 > 100

Output:

text
https://example.org/page1
https://github.com/example/repo
Item one
Item two
Item three
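
One subtlety with numeric predicates: //a[1] selects every a that is the first a child of its parent, while (//a)[1] selects the first a in the whole document. A small sketch against a throwaway local file; the file name and markup are made up for illustration, and the xidel calls only run if xidel is installed:

```shell
#!/bin/sh
# Build a tiny fixture: two list items, each holding two links.
tmpdir=$(mktemp -d)
cat > "$tmpdir/page.html" <<'EOF'
<html><body>
<ul><li><a href="/a">A</a> <a href="/b">B</a></li></ul>
<ul><li><a href="/c">C</a> <a href="/d">D</a></li></ul>
</body></html>
EOF

if command -v xidel >/dev/null 2>&1; then
  # First <a> under each parent: matches the links A and C
  xidel --silent "$tmpdir/page.html" --extract "//a[1]"
  # First <a> in the whole document: matches only A
  xidel --silent "$tmpdir/page.html" --extract "(//a)[1]"
fi
```

The parenthesised form is usually what "the first link on the page" means; the unparenthesised form is the right one for "the first link in every list item".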

Axes

An axis defines the direction of traversal from a context node — most queries use the default child:: (implicit before each step) but explicit axes unlock parent/sibling/ancestor lookups. Use parent::, following-sibling::, and ancestor:: when CSS selectors hit a dead end.

bash
# Parent of the first h1 in the document
xidel page.html --extract "(//h1)[1]/parent::*"

# All <p> siblings after the first h2 (this does not stop at the next heading)
xidel page.html --extract "(//h2)[1]/following-sibling::p"

# Nearest ancestor div of an element
xidel page.html --extract "//a[@id='link']/ancestor::div[1]"

# Every paragraph that precedes the first table in document order
xidel page.html --extract "(//table)[1]/preceding::p"

Output:

text
The first paragraph after section one
The second paragraph after section one

XPath functions

XPath 3.0 ships with a rich function library covering strings, numbers, sequences, and dates. The most-used in scraping are normalize-space(), tokenize(), matches(), substring-before/after(), lower-case(), and concat().

bash
xidel page.html --extract "normalize-space(//h1)"
xidel page.html --extract "lower-case(//title)"
xidel page.html --extract "tokenize(//meta[@name='keywords']/@content, ',\s*')"
xidel page.html --extract "//a[matches(@href, '^https://github\.com')]/@href"
xidel page.html --extract "substring-after(//meta[@property='og:url']/@content, '://')"
xidel page.html --extract "concat(//h1, ' — ', //h2[1])"

Output:

text
welcome to example
linux, cli, tutorial
https://github.com/example/repo
example.org/article/123
Welcome — Getting Started

CSS selector deep dive

xidel's CSS engine implements most of Selectors Level 3 (descendant, child, attribute, pseudo-class), invoked via --css for whole-document selection or inside an expression with css('…') for use alongside XPath. Selectors are less expressive than XPath: they cannot walk up the tree and have no built-in string functions, but they are usually shorter for common scraping queries.

bash
# Descendant, child, adjacent sibling
xidel page.html --css "article p"          # descendant
xidel page.html --css "article > p"        # direct child
xidel page.html --css "h2 + p"             # p immediately following an h2

# Attribute selectors
xidel page.html --css "a[href^='https://']"     # starts-with
xidel page.html --css "a[href$='.pdf']"          # ends-with
xidel page.html --css "a[href*='github']"        # contains
xidel page.html --css "input[type='hidden']"     # exact match

# Pseudo-classes
xidel page.html --css "li:first-child"
xidel page.html --css "li:nth-child(odd)"
xidel page.html --css "tr:not(.header)"

Output:

text
First paragraph in the article
Direct child paragraph
https://example.org/intro.pdf
https://github.com/example/repo

Mixing CSS and XPath

css('selector') is a function inside any expression — combine it with XPath axes when CSS picks the starting element but you need to walk relatives.

bash
# Use CSS to find articles, then XPath to grab their first paragraph
xidel page.html --extract "css('article')/p[1]"

# CSS-selected nav links followed by parent <li> for context
xidel page.html --extract "css('nav a')/parent::li"

Output:

text
First paragraph from article one
First paragraph from article two

JSON / JSONiq queries

When xidel reads JSON, the document becomes a tree of maps (objects) and arrays accessible by both XPath-flavoured //field syntax and JSONiq's dot/bracket notation. JSONiq is a JSON query language derived from XQuery (it is not a W3C standard) and is closer in feel to jq, while XPath syntax is consistent with everything else xidel does. Pick whichever reads better for the task.

bash
# Plain XPath against JSON
xidel api.json --extract "//items/title"
xidel api.json --extract "//user/email"

# JSONiq dot notation with () array unboxing, strongly recommended for nested arrays
xidel api.json --extract '$json.items().title'
xidel api.json --extract '$json.items()[$$.score > 80].name'
xidel api.json --extract 'count($json.items())'

# Multiple fields paired
xidel api.json --extract '$json.items()!{name: .name, score: .score}'

Output:

text
First item
Second item
Third item
3

When to prefer xidel over jq for JSON

jq is the default JSON tool and is faster for pure JSON pipelines; xidel becomes attractive when the same scraping job mixes HTML/XML pages and JSON APIs, when you want a single expression language across all formats, or when you need to follow links discovered inside JSON responses.

bash
# Read a JSON index, follow each item's URL, scrape the HTML title
xidel https://api.example.com/articles.json \
  --follow '$json.articles().url' \
  --extract "//title"

Output:

text
First Article Title
Second Article Title
Third Article Title

Variables and bindings

--variable NAME=value (or -v short form) binds a variable accessible as $NAME inside any expression — useful for parameterising base URLs, query strings, or output prefixes without rewriting the expression. Variables can also be JSON-encoded objects for richer data passing.

bash
# Pass a base URL into the extract expression
# (single quotes stop the shell from expanding $host before xidel sees it)
xidel "https://example.org/page" \
  --variable "host=example.org" \
  --extract 'concat($host, ": ", //title)'

# Multiple variables
xidel page.html \
  -v "year=2026" \
  -v "section=docs" \
  --extract 'concat($section, "/", $year, "/", //h1)'

# JSON-typed variable
xidel page.html \
  --var-json 'filters={"min": 10, "max": 100}' \
  --extract '//item[price >= $filters.min and price <= $filters.max]'

Output:

text
example.org: Welcome
docs/2026/Getting Started
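
A quoting pitfall applies to any expression that references a $variable: inside double quotes the shell substitutes its own variable of that name before xidel runs, so the expression should be single-quoted. The sketch below uses only the shell to show what each quoting style hands to a program; host is an ordinary shell variable chosen for illustration:

```shell
#!/bin/sh
# Show what the shell passes along for each quoting style.
# "host" is a shell variable here, unrelated to any xidel variable.
host="shell-value"

# Double quotes: the shell substitutes $host before the program runs.
expanded="concat($host, ': ', //title)"

# Single quotes: $host is passed through untouched for xidel to resolve.
literal='concat($host, ": ", //title)'

printf '%s\n' "$expanded" "$literal"
```

The first printed line contains shell-value in place of the variable reference; the second still contains $host, which is the form xidel needs to receive.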

Recursive crawling with --follow

--follow accepts the same selectors as --extract but uses each result as a URL to fetch. Add --follow-level N to cap recursion depth, and --follow-from to control which page the follow expression runs against. xidel maintains its own visited-URL set, so cycles are detected automatically.

bash
# Crawl two levels deep, extract every page's title
xidel https://example.org \
  --follow "//a[contains(@href, 'example.org')]/@href" \
  --follow-level 2 \
  --extract "//title"

# Stay within one domain by filtering links in the follow expression
# (matches absolute links only)
xidel https://example.org \
  --follow "//a[starts-with(@href, 'https://example.org/')]/@href" \
  --extract "//h1"

# Follow only Atom feed entries
xidel https://example.org/feed.atom \
  --follow "//entry/link/@href" \
  --extract "//article/h1"

Output:

text
Welcome to Example
Example - About
Example - Documentation
Example - API Reference
Example - Changelog
Example - Contributing

HTTP options — headers, cookies, auth

xidel exposes curl-style HTTP options: user-agent, cookies, custom headers, basic auth, POST data, and proxy flags. Use these when the target site differentiates between browsers, requires login, or rate-limits anonymous traffic.

bash
# Custom user-agent (some sites block default xidel UA)
xidel https://example.org \
  --user-agent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36" \
  --extract "//h1"

# Persist cookies across runs (load the saved file, then write it back)
xidel https://example.org/dashboard \
  --load-cookies ~/.cache/xidel-cookies.txt \
  --save-cookies ~/.cache/xidel-cookies.txt \
  --extract "//span[@class='username']"

# Custom header (API tokens)
xidel "https://api.example.com/v1/me" \
  --header "Authorization: Bearer YOUR_TOKEN" \
  --extract "//email"

# POST a form body
xidel https://example.org/login \
  --post "user=alicedev&pass=secret" \
  --extract "//div[@id='status']"

# Basic auth
xidel "https://admin.example.org/stats" \
  --user "alicedev:secret" \
  --extract "//table//td"

Output:

text
Welcome
alicedev
alice@example.com
Logged in successfully
42 active sessions

Local files and standard input

xidel accepts file paths and - for stdin as readily as URLs. This makes it scriptable against archived pages, downloaded snapshots, and pipe-driven workflows where curl or wget handles the fetching and xidel handles the parsing.

bash
# Local file
xidel path/to/example.html --extract "//title"

# stdin
curl -s https://example.org | xidel - --extract "//title"

# Glob across many local files
xidel ./archive/*.html --extract "//h1"

# Pipe gzipped HTML through gunzip
gunzip -c snapshot.html.gz | xidel - --extract "//meta[@property='og:title']/@content"

Output:

text
Example Domain
Welcome to Example
Page Title from Archived Snapshot

Comparison with sibling tools

xidel sits at the intersection of HTML parsing (htmlq, pup), XML query (xmllint), JSON query (jq), and the newer format-agnostic crop (dasel, yq). The table below shows where each tool wins; reach for xidel when one job spans more than one of these formats and needs XPath/XQuery expressiveness.

Need | Best tool | Why
Pure JSON pipelines | jq | Smaller, faster, ubiquitous
CSS-selector-only HTML extraction | htmlq (Rust) or pup (Go) | Lighter syntax, focused scope, single static binary
Cross-format query/edit (JSON + YAML + TOML + XML + CSV) without XPath | dasel | One path-style selector across all formats; small Go binary
XML schema validation, namespaces | xmllint | Mature XML stack, XSD support
XPath against HTML | xidel | Tolerates malformed HTML, XPath 3.0 (3.1 in 0.9.9 dev)
Mixed HTML + JSON crawl in one expression | xidel | One query language, built-in --follow
Login flows with cookies | xidel | Cookie handling + form submission built-in
JavaScript-rendered pages | Playwright / Puppeteer | xidel does not execute JS

When to pick something else

  • You only need CSS selectors on HTML: htmlq or pup start faster, have shorter syntax, and are easier to install on minimal images. pup went dormant for a stretch; htmlq is the more actively maintained Rust replacement.
  • You're slicing JSON/YAML/TOML/XML configs with no HTML in the loop: dasel gives one path syntax across all of them and is a static Go binary with no Pascal runtime. xidel's XPath wins as soon as the structure gets deep or you need predicates / functions.
  • You want pure-XML rigor (namespaces, XSD, XSLT): xmllint and xmlstarlet remain the canonical choice.

xidel does not run JavaScript, so single-page apps that render content client-side return empty results. For those, render with a headless browser first (Playwright, Puppeteer, or chromium --headless --dump-dom) and pipe the rendered HTML into xidel.
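
A sketch of that render-then-parse pipeline, assuming a Chromium binary named chromium on the PATH (the name varies by platform, for example chromium-browser or google-chrome) and a placeholder URL. Both are illustrative assumptions, and the pipeline only runs when both tools are present:

```shell
#!/bin/sh
# Render a client-side page with headless Chromium, then query the
# resulting DOM with xidel. Binary name and URL are placeholders.
url="https://example.org/app"
browser="chromium"

if command -v "$browser" >/dev/null 2>&1 && command -v xidel >/dev/null 2>&1; then
  "$browser" --headless --dump-dom "$url" 2>/dev/null \
    | xidel --silent - --extract "//h1" \
    || echo "render or extraction failed" >&2
else
  echo "need both $browser and xidel on PATH" >&2
fi
```

Content that appears only after user interaction or delayed requests may still be missing from the dumped DOM; a scripted browser (Playwright, Puppeteer) is the heavier fallback for that case.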

0.9.9 dev-build extras

The 0.9.9 preview adds a handful of conveniences worth knowing about if you grab the development binary from videlibri.de. They fix small annoyances in 0.9.8 — especially around visible-text extraction and in-place file edits — without changing day-to-day query syntax.

bash
# inner-text: skip <script>/<style>/hidden, collapse whitespace,
# return what a browser actually renders
xidel article.html --extract "inner-text(//article)"

# --json-mode picks the JSON dialect explicitly
xidel api.json --json-mode=standard --extract '?items?*?title'   # XPath 3.1 syntax
xidel api.json --json-mode=jsoniq --extract '$json.items().title'

# --in-place overwrites the input file with the result —
# useful for batch rewrites of local snapshots
xidel ./snapshots/*.html --in-place --extract "//main"

# matched-text exposes what a pattern match actually captured
xidel page.html --extract "<a href='{link:=.}'>{matched-text()}</a>*"

# x:request-decode parses an application/x-www-form-urlencoded body
# (no input source is needed, so xidel does not wait on stdin)
xidel --extract "x:request-decode('a=1&b=hello%20world')"

Output:

text
Welcome to the article.
First paragraph in context.
…

{"a": "1", "b": "hello world"}

Recipes

A grab bag of complete one-liners that combine the features above for real scraping problems. Each recipe is copy-paste runnable against a site that exposes the relevant markup.

Extract OpenGraph metadata

bash
xidel "https://example.org/article" --extract '
  {
    "title":       //meta[@property="og:title"]/@content,
    "description": //meta[@property="og:description"]/@content,
    "image":       //meta[@property="og:image"]/@content,
    "url":         //meta[@property="og:url"]/@content
  }
' --output-format json

Output:

text
{
  "title": "How to scrape with xidel",
  "description": "A practical walkthrough of XPath and CSS selectors.",
  "image": "https://example.org/images/og-card.png",
  "url": "https://example.org/article"
}

Sitemap → flat URL list

bash
xidel "https://example.org/sitemap.xml" --extract "//*[local-name()='loc']"

Output:

text
https://example.org/
https://example.org/about
https://example.org/blog/post-1
https://example.org/blog/post-2

RSS feed → JSON

bash
xidel "https://example.org/feed.rss" \
  --extract '//item ! { "title": title, "link": link, "date": pubDate }' \
  --output-format json

Output:

text
[
  { "title": "New release 1.4", "link": "https://example.org/blog/1.4", "date": "Wed, 22 Apr 2026 12:00:00 GMT" },
  { "title": "Roadmap update",   "link": "https://example.org/blog/roadmap", "date": "Mon, 15 Apr 2026 09:30:00 GMT" }
]

Crawl a blog index and dump articles to JSON

bash
xidel "https://example.org/blog" \
  --follow "//article//a/@href" \
  --extract '{
    "title":   //h1,
    "author":  //meta[@name="author"]/@content,
    "date":    //time/@datetime,
    "content": string-join(//article//p, "\n\n")
  }' --output-format json

Output:

text
[
  { "title": "Post One", "author": "Alice Dev", "date": "2026-04-22", "content": "First paragraph...\n\nSecond paragraph..." },
  { "title": "Post Two", "author": "Alice Dev", "date": "2026-04-15", "content": "Intro line.\n\nMore text..." }
]

Combine with jq for post-processing

bash
xidel "https://example.org/products" \
  --extract '//product ! { "name": name, "price": number(price) }' \
  --output-format json \
  | jq '[.[] | select(.price < 50)] | sort_by(.price)'

Output:

text
[
  { "name": "Pen", "price": 2.5 },
  { "name": "Notebook", "price": 8.99 },
  { "name": "Mug", "price": 12.0 }
]

Add --silent to suppress xidel's progress and informational messages, leaving only the extracted data — useful when piping into another command or capturing in a variable.
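
A minimal sketch of capturing one extracted value in a shell variable. The fixture file is made up for illustration, and the xidel call only runs if xidel is installed:

```shell
#!/bin/sh
# Create a one-line HTML fixture to query.
tmpdir=$(mktemp -d)
printf '<html><head><title>Example Domain</title></head><body></body></html>\n' \
  > "$tmpdir/page.html"

if command -v xidel >/dev/null 2>&1; then
  # --silent suppresses xidel's status messages so only the value is printed
  title=$(xidel --silent "$tmpdir/page.html" --extract "//title")
  echo "title is: $title"
fi
```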

If an expression returns nothing, run xidel with --printed-node-format=html (or use --extract "//*") to inspect the actual parsed tree. Browser DevTools selectors sometimes target post-JavaScript DOM that xidel cannot see.
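
One way to make that inspection concrete is to list the element names xidel actually parsed and count the children of the container you expected to be populated. A sketch using a made-up single-page-app shell as the fixture; the id root is an assumption for illustration, and the xidel calls only run if xidel is installed:

```shell
#!/bin/sh
# Fixture: a page whose visible content would be filled in by JavaScript.
tmpdir=$(mktemp -d)
cat > "$tmpdir/spa.html" <<'EOF'
<html><body><div id="root"></div><script>/* content rendered client-side */</script></body></html>
EOF

if command -v xidel >/dev/null 2>&1; then
  # Which element names exist in the tree xidel sees?
  xidel --silent "$tmpdir/spa.html" --extract "distinct-values(//*/name())"
  # How many child elements does the container have? 0 means nothing was rendered.
  children=$(xidel --silent "$tmpdir/spa.html" --extract "count(//*[@id='root']/*)")
  echo "children of #root: $children"
fi
```

A count of 0 for a container that looks full in the browser is a strong hint that the content is injected by JavaScript.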
