Skip to content
  • jsoup
  • News
  • Bugs
  • Discussion
  • Download
  • API Reference
  • Cookbook
  • Try jsoup
jsoup » Cookbook » Extracting data » Use DOM methods to navigate a document

Use DOM methods to navigate a document

Jan 22, 2010

Problem

You have a HTML document that you want to extract data from. You know generally the structure of the HTML document.

Solution

Use the DOM-like methods available after parsing HTML into a Document.

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "https://example.com/");

Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
  String linkHref = link.attr("href");
  String linkText = link.text();
}

Description

Elements provide a range of DOM-like methods to find elements, and extract and manipulate their data. The DOM getters are contextual: called on a parent Document they find matching elements under the document; called on a child element they find elements under that child. In this way you can winnow in on the data you want.

Finding elements

  • getElementById(String id)
  • getElementsByTag(String tag)
  • getElementsByClass(String className)
  • getElementsByAttribute(String key) (and related methods)
  • Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(); nextElementSibling(), previousElementSibling()
  • Graph: parent(), children(), child(int index)

Element data

  • attr(String key) to get and attr(String key, String value) to set attributes
  • attributes() to get all attributes
  • id(), className() and classNames()
  • text() to get and text(String value) to set the text content
  • html() to get and html(String value) to set the inner HTML content
  • outerHtml() to get the outer HTML value
  • data() to get data content (e.g. of script and style tags)
  • tag() and tagName()

Manipulating HTML and text

  • append(String html), prepend(String html)
  • appendText(String text), prependText(String text)
  • appendElement(String tagName), prependElement(String tagName)
  • html(String value)

Cookbook

Introduction

  1. Parsing and traversing a Document

Input

  1. Parse a document from a String
  2. Parsing a body fragment
  3. Load a Document from a URL
  4. Load a Document from a File
  5. Parse large documents efficiently with StreamParser

Extracting data

  1. Use DOM methods to navigate a document
  2. Use CSS selectors to find elements
  3. Use XPath selectors to find elements and nodes
  4. Extract attributes, text, and HTML from elements
  5. Working with relative and absolute URLs
  6. Example program: list links

Modifying data

  1. Set attribute values
  2. Set the HTML of an element
  3. Setting the text content of elements

Cleaning HTML

  1. Sanitize untrusted HTML (to prevent XSS)

Working with the web

  1. Maintaining a request session
jsoup HTML parser © 2009 - 2026 Jonathan Hedley