Use DOM methods to navigate a document
Problem
You have an HTML document that you want to extract data from, and you know its general structure.
Solution
Use the DOM-like methods available after parsing HTML into a Document.
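import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Parse the file; the base URI is used to resolve relative links
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
// getElementById returns null if no element has the given id
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href"); // value of the href attribute
    String linkText = link.text();       // text content of the link
}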
Description
Elements provide a range of DOM-like methods to find elements, and to extract and manipulate their data. The DOM getters are contextual: called on a parent Document they find matching elements under the document; called on a child element they find elements under that child. In this way you can narrow in on the data you want.
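To illustrate the contextual behaviour described above, here is a minimal sketch. The sample markup, the class name ContextualGetters, and the helper method names are made up for this example; only the jsoup calls come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ContextualGetters {
    // Hypothetical sample markup, invented for this sketch.
    static final String HTML =
        "<div id=\"nav\"><a href=\"/home\">Home</a></div>"
      + "<div id=\"content\"><a href=\"/one\">One</a><a href=\"/two\">Two</a></div>";

    // Called on the Document: matches every <a> in the document.
    static int linksInDocument() {
        Document doc = Jsoup.parse(HTML);
        return doc.getElementsByTag("a").size();
    }

    // Called on a child element: matches only the <a> elements under it.
    static int linksInContent() {
        Document doc = Jsoup.parse(HTML);
        Element content = doc.getElementById("content");
        return content.getElementsByTag("a").size();
    }

    public static void main(String[] args) {
        System.out.println(linksInDocument()); // prints 3
        System.out.println(linksInContent());  // prints 2
    }
}
```

The same getter returns a narrower result when it is called further down the tree, which is what lets you work from the document towards the element you care about.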
Finding elements
- getElementById(String id)
- getElementsByTag(String tag)
- getElementsByClass(String className)
- getElementsByAttribute(String key) (and related methods)
- Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(); nextElementSibling(), previousElementSibling()
- Graph: parent(), children(), child(int index)
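As a rough illustration of the finder, sibling, and graph methods listed above, the following sketch looks up an element by class and then walks to its neighbours and parent. The markup, the class name SiblingNavigation, and the summary() helper are assumptions made for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SiblingNavigation {
    // Hypothetical sample markup, invented for this sketch.
    static final String HTML =
        "<ul id=\"menu\">"
      + "<li class=\"item\">A</li>"
      + "<li class=\"item current\">B</li>"
      + "<li class=\"item\">C</li>"
      + "</ul>";

    // Finds the "current" item, then reports what surrounds it.
    static String summary() {
        Document doc = Jsoup.parse(HTML);
        Element current = doc.getElementsByClass("current").first();
        Element parent = current.parent();
        return String.join("|",
            current.previousElementSibling().text(),          // sibling before
            current.nextElementSibling().text(),              // sibling after
            parent.id(),                                      // id of the parent <ul>
            String.valueOf(current.siblingElements().size()), // siblings, excluding itself
            parent.child(0).text(),                           // first child of the parent
            String.valueOf(parent.children().size()));        // all child elements
    }

    public static void main(String[] args) {
        System.out.println(summary()); // prints A|C|menu|2|A|3
    }
}
```

In real documents the sibling and parent getters can return null (for example at the start or end of a list), so production code would normally check for that.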
Element data
- attr(String key) to get and attr(String key, String value) to set attributes
- attributes() to get all attributes
- id(), className() and classNames()
- text() to get and text(String value) to set the text content
- html() to get and html(String value) to set the inner HTML content
- outerHtml() to get the outer HTML value
- data() to get data content (e.g. of script and style tags)
- tag() and tagName()
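The accessors above could be exercised along these lines. This is only a sketch: the markup, the class name ElementData, and the read() and write() helpers are hypothetical. Pretty-printing is switched off here so that html() returns the markup without added whitespace.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ElementData {
    // Hypothetical sample markup, invented for this sketch.
    static final String HTML =
        "<p id=\"intro\" class=\"lead note\">Hello <b>world</b></p>"
      + "<script>var x = 1;</script>";

    static Document parse() {
        Document doc = Jsoup.parse(HTML);
        // Assumption for this example: disable pretty-printing so the
        // serialized HTML is not re-indented.
        doc.outputSettings().prettyPrint(false);
        return doc;
    }

    // Reads several pieces of data from the paragraph and the script tag.
    static String read() {
        Document doc = parse();
        Element p = doc.getElementById("intro");
        Element script = doc.getElementsByTag("script").first();
        return String.join("|",
            p.tagName(),    // tag name
            p.id(),         // id attribute
            p.className(),  // raw class attribute
            p.text(),       // text content, tags stripped
            p.html(),       // inner HTML
            script.data()); // data content of the script tag
    }

    // Uses the setter overloads, then reads the values back.
    static String write() {
        Document doc = parse();
        Element p = doc.getElementById("intro");
        p.attr("title", "greeting"); // set an attribute
        p.text("Bye");               // replace the content with plain text
        return p.attr("title") + "|" + p.text() + "|" + p.html();
    }

    public static void main(String[] args) {
        System.out.println(read());  // prints p|intro|lead note|Hello world|Hello <b>world</b>|var x = 1;
        System.out.println(write()); // prints greeting|Bye|Bye
    }
}
```

Note that text() on the script element would not return its source; data() is the getter for that kind of content.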
Manipulating HTML and text