Parsing and traversing a Document

Jan 22, 2010

To parse an HTML document:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

(See parsing a document from a string for more info.)

The parser makes every attempt to create a clean parse from the HTML you provide, whether or not that HTML is well-formed. It handles:

  • unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
  • implicit tags (e.g. a naked <td>Table data</td> is wrapped into a <table><tr><td>...)
  • reliably creating the document structure (html containing a head and body, and only appropriate elements within the head)
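
As a rough illustration of the lenient parsing described above, the following sketch feeds the parser input that has no html, head, or body tags and leaves both <p> tags unclosed. The class name LenientParseSketch, the helper parseMessy, and the sample markup are made up for this example; only Jsoup.parse, Document.title, Document.head, Document.body, and Element.select are jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class LenientParseSketch {
    // Hypothetical helper: parses deliberately malformed input
    // (no html/head/body wrapper, two unclosed <p> tags).
    static Document parseMessy() {
        String messy = "<title>Messy</title><p>Lorem <p>Ipsum";
        return Jsoup.parse(messy);
    }

    public static void main(String[] args) {
        Document doc = parseMessy();

        // The title was placed in a head element the input never declared.
        System.out.println(doc.title());

        // Each unclosed <p> became its own paragraph element.
        Elements paragraphs = doc.select("p");
        for (int i = 0; i < paragraphs.size(); i++) {
            System.out.println(i + ": " + paragraphs.get(i).text());
        }

        // Print the normalized body markup the parser produced.
        System.out.println(doc.body().html());
    }
}
```

Printing doc.body().html() is a convenient way to see what structure the parser actually built from a given input, which may differ between jsoup versions for some edge cases.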

The object model of a document

  • Documents consist of Elements and TextNodes (and a few other miscellaneous node types: see the nodes package tree).
  • The inheritance chain is: Document extends Element extends Node. TextNode extends LeafNode extends Node.
  • An Element contains a list of child Nodes, and has one parent Element. It also provides a filtered list of its child Elements only.
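
The object model above can be sketched as follows: a minimal example, assuming a single paragraph that mixes text and one nested element. The class name ObjectModelSketch, the helper firstParagraph, and the sample markup are hypothetical; childNodes(), children(), childNodeSize(), and parent() are the jsoup accessors for the full child list, the Element-only child list, the child count, and the parent.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class ObjectModelSketch {
    // Hypothetical helper: returns the <p> from a small sample document.
    static Element firstParagraph() {
        Document doc = Jsoup.parse("<p>Hello <b>there</b> now</p>");
        return doc.select("p").first();
    }

    public static void main(String[] args) {
        Element p = firstParagraph();

        // childNodes() returns every child Node: text and elements alike.
        for (Node node : p.childNodes()) {
            if (node instanceof TextNode) {
                System.out.println("text: '" + ((TextNode) node).text() + "'");
            } else if (node instanceof Element) {
                System.out.println("element: " + ((Element) node).tagName());
            }
        }

        System.out.println(p.childNodeSize());    // 3: text, <b>, text
        System.out.println(p.children().size());  // 1: only the <b> Element
        System.out.println(p.parent().tagName()); // body
    }
}
```

Because Document extends Element, the same accessors work on the document itself, so traversal code does not need a special case for the root.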

See also

  • Extracting data: DOM navigation
  • Extracting data: Selector syntax

jsoup HTML parser © 2009 - 2026 Jonathan Hedley