Use DOM methods to navigate a document
You have an HTML document that you want to extract data from, and you know its general structure.
Use the DOM-like methods available after parsing HTML into a Document:
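import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "");
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href"); // value of the href attribute
    String linkText = link.text();       // text content of the link
}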
Elements provide a range of DOM-like methods to find elements, and to extract and manipulate their data. The DOM getters are contextual: called on a parent Document they find matching elements under the document; called on a child element they find elements under that child. In this way you can narrow in on the data you want.
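The contextual behaviour described above can be illustrated with a small sketch. The markup, the class name ContextualGetters, and the helper method names are hypothetical and exist only for this example; the jsoup calls are the same ones used in the snippet above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ContextualGetters {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<div id=\"nav\"><a href=\"/home\">Home</a></div>"
      + "<div id=\"content\"><a href=\"/a\">A</a><a href=\"/b\">B</a></div>";

    // Counts <a> elements anywhere in the document.
    static int linksInDocument() {
        Document doc = Jsoup.parse(HTML);
        return doc.getElementsByTag("a").size();
    }

    // Counts <a> elements under the element with id "content" only.
    static int linksInContent() {
        Document doc = Jsoup.parse(HTML);
        Element content = doc.getElementById("content");
        return content.getElementsByTag("a").size();
    }

    public static void main(String[] args) {
        System.out.println(linksInDocument()); // 3
        System.out.println(linksInContent());  // 2
    }
}
```

The same getter returns a different result depending on the element it is called on, which is what lets you narrow a search step by step.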
Finding elements
- getElementById(String id)
- getElementsByTag(String tag)
- getElementsByClass(String className)
- getElementsByAttribute(String key) (and related methods)
- Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(); nextElementSibling(), previousElementSibling()
- Graph: parent(), children(), child(int index)
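As a rough sketch of how these finders combine, the example below locates one list item by class and then moves to its siblings and parent. The markup and the names FindingElements and current() are assumptions made for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FindingElements {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<ul id=\"menu\">"
      + "<li class=\"item\">One</li>"
      + "<li class=\"item current\" data-pos=\"2\">Two</li>"
      + "<li class=\"item\">Three</li>"
      + "</ul>";

    // Returns the first element that carries the class "current".
    static Element current() {
        Document doc = Jsoup.parse(HTML);
        return doc.getElementsByClass("current").first();
    }

    // Counts elements in the document that have a data-pos attribute.
    static int withDataPos() {
        return Jsoup.parse(HTML).getElementsByAttribute("data-pos").size();
    }

    public static void main(String[] args) {
        Element li = current();
        System.out.println(li.previousElementSibling().text()); // One
        System.out.println(li.nextElementSibling().text());     // Three
        System.out.println(li.parent().id());                   // menu
        System.out.println(li.parent().children().size());      // 3
        System.out.println(li.parent().child(0).text());        // One
        System.out.println(withDataPos());                      // 1
    }
}
```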
Element data
- attr(String key) to get and attr(String key, String value) to set attributes
- attributes() to get all attributes
- id(), className() and classNames()
- text() to get and text(String value) to set the text content
- html() to get and html(String value) to set the inner HTML content
- outerHtml() to get the outer HTML value
- data() to get data content (e.g. of script and style tags)
- tag() and tagName()
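The data accessors can be tried on a small fragment, as in the sketch below. The markup and the names ElementData, intro() and scriptData() are hypothetical. The exact whitespace of html() and outerHtml() output depends on jsoup's output settings, so no expected output is shown for those two calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ElementData {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<p id=\"intro\" class=\"lead note\">Hello <b>world</b></p>"
      + "<script>var x = 1;</script>";

    // Returns the paragraph with id "intro".
    static Element intro() {
        return Jsoup.parse(HTML).getElementById("intro");
    }

    // Returns the raw data content of the first <script> element.
    static String scriptData() {
        Document doc = Jsoup.parse(HTML);
        return doc.getElementsByTag("script").first().data();
    }

    public static void main(String[] args) {
        Element p = intro();
        System.out.println(p.id());                // intro
        System.out.println(p.tagName());           // p
        System.out.println(p.className());         // lead note
        System.out.println(p.classNames().contains("note")); // true
        System.out.println(p.attr("class"));       // lead note
        System.out.println(p.attributes().size()); // 2
        System.out.println(p.text());              // Hello world
        System.out.println(p.html());              // inner HTML of the paragraph
        System.out.println(p.outerHtml());         // the paragraph including its own tag
        System.out.println(scriptData());          // var x = 1;
    }
}
```

Note that text() returns only the combined text of the element and its children, while data() is what carries the content of script elements.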
Manipulating HTML and text
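A minimal sketch of modifying an element in place follows. It uses jsoup's Element methods append(String html), prepend(String html), appendElement(String tagName), text(String value) and attr(String key, String value). The markup and the names Manipulate and build() are assumptions made for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Manipulate {
    // Builds a small document and modifies the element with id "box".
    static Element build() {
        // Hypothetical sample markup, used only for illustration.
        Document doc = Jsoup.parse("<div id=\"box\"><p>Middle</p></div>");
        Element box = doc.getElementById("box");
        box.prepend("<p>First</p>");            // parsed HTML becomes the first child
        box.append("<p>Last</p>");              // parsed HTML becomes the last child
        box.appendElement("span").text("tail"); // new <span> child, then set its text
        box.attr("title", "demo");              // set an attribute
        return box;
    }

    public static void main(String[] args) {
        Element box = build();
        System.out.println(box.children().size()); // 4
        System.out.println(box.child(0).text());   // First
        System.out.println(box.child(3).tagName()); // span
        System.out.println(box.attr("title"));     // demo
    }
}
```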