Use DOM methods to navigate a document
Problem
You have an HTML document that you want to extract data from, and you know its general structure.
Solution
Use the DOM-like methods available after parsing HTML into a Document.
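import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

File input = new File("/tmp/input.html");
// Parse the file as UTF-8; the base URI is used to resolve relative links.
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Element content = doc.getElementById("content"); // null if no element has this id
Elements links = content.getElementsByTag("a");  // only links under #content
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}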
Description
Elements provide a range of DOM-like methods to find elements, and to extract and manipulate their data. The DOM getters are contextual: called on a parent Document they find matching elements under the document; called on a child element they find elements under that child. In this way you can narrow in on the data you want.
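The contextual behavior described above can be sketched as follows. This is a minimal illustration, not part of the original recipe: the markup, the ids "nav" and "content", and the class and method names are all made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ContextualGetters {
    // Hypothetical markup, invented for this illustration.
    static final String HTML =
        "<div id='nav'><a href='/home'>Home</a></div>"
      + "<div id='content'><a href='/one'>One</a><a href='/two'>Two</a></div>";

    // Called on the Document: matches every <a> in the document.
    static int linksInDocument() {
        Document doc = Jsoup.parse(HTML);
        return doc.getElementsByTag("a").size();
    }

    // Called on a child element: matches only <a> elements under #content.
    static int linksInContent() {
        Document doc = Jsoup.parse(HTML);
        Element content = doc.getElementById("content");
        return content.getElementsByTag("a").size();
    }

    public static void main(String[] args) {
        System.out.println(linksInDocument()); // prints 3
        System.out.println(linksInContent());  // prints 2
    }
}
```

The same getter returns a different result depending on where it is called, which is what lets you step down from the document to the region you care about.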
Finding elements
- getElementById(String id)
- getElementsByTag(String tag)
- getElementsByClass(String className)
- getElementsByAttribute(String key) (and related methods)
- Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(); nextElementSibling(), previousElementSibling()
- Graph: parent(), children(), child(int index)
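As a rough sketch of how the sibling and graph methods listed above relate to each other, the example below walks a small list. The markup, the id "menu", and the class and method names are assumptions made for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SiblingNavigation {
    // Hypothetical markup, invented for this illustration.
    static final String HTML =
        "<ul id='menu'><li>One</li><li>Two</li><li>Three</li></ul>";

    // Returns the second <li>: child(int index) is zero-based.
    static Element secondItem() {
        Document doc = Jsoup.parse(HTML);
        Element menu = doc.getElementById("menu");
        return menu.child(1);
    }

    public static void main(String[] args) {
        Element second = secondItem();
        System.out.println(second.text());                          // Two
        System.out.println(second.previousElementSibling().text()); // One
        System.out.println(second.nextElementSibling().text());     // Three
        System.out.println(second.firstElementSibling().text());    // One
        System.out.println(second.lastElementSibling().text());     // Three
        // siblingElements() does not include the element itself.
        System.out.println(second.siblingElements().size());        // 2
        // parent() moves up; children() lists the element children.
        System.out.println(second.parent().id());                   // menu
        System.out.println(second.parent().children().size());      // 3
    }
}
```

Note that nextElementSibling() and previousElementSibling() return null at the ends of the list, so real code should check for that before dereferencing.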
Element data
- attr(String key) to get and attr(String key, String value) to set attributes
- attributes() to get all attributes
- id(), className() and classNames()
- text() to get and text(String value) to set the text content
- html() to get and html(String value) to set the inner HTML content
- outerHtml() to get the outer HTML value
- data() to get data content (e.g. of script and style tags)
- tag() and tagName()
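The data accessors listed above might be exercised as in the sketch below. The markup, the id "intro", the attribute values, and the class and method names are hypothetical, chosen only to show each getter and setter once.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ElementData {
    // Hypothetical markup, invented for this illustration.
    static final String HTML =
        "<p id='intro' class='lead note'>Hello <b>world</b></p>"
      + "<script>var x = 1;</script>";

    static Document parse() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = parse();
        Element p = doc.getElementById("intro");

        System.out.println(p.id());                // intro
        System.out.println(p.tagName());           // p
        System.out.println(p.className());         // lead note
        System.out.println(p.classNames().size()); // 2
        System.out.println(p.text());              // Hello world
        System.out.println(p.html());      // inner HTML, including the <b> tag
        System.out.println(p.outerHtml()); // the <p> element itself plus its content

        // attr(key, value) sets an attribute; attr(key) reads it back.
        p.attr("title", "greeting");
        System.out.println(p.attr("title"));       // greeting
        // A missing attribute yields an empty string, not null.
        System.out.println(p.attr("missing").isEmpty()); // true

        // Script content is data, not text, so read it with data().
        Element script = doc.getElementsByTag("script").first();
        System.out.println(script.data());         // var x = 1;
    }
}
```

The distinction between text(), html(), and data() matters most for script and style elements, whose content is only reachable through data().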
Manipulating HTML and text