Parsing a body fragment
Problem
You have a fragment of body HTML (e.g. a div containing a couple of p tags, as opposed to a full HTML document) that you want to parse. Perhaps it was provided by a user submitting a comment, or editing the body of a page in a CMS.
Solution
Use the Jsoup.parseBodyFragment(String html) method.
String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();
Description
The parseBodyFragment method creates an empty shell document and inserts the parsed HTML into the body element. If you used the normal Jsoup.parse(String html) method, you would generally get the same result, but explicitly treating the input as a body fragment ensures that any bozo HTML provided by the user is parsed into the body element.
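To make the "shell document" idea concrete, here is a minimal sketch. The class name FragmentShellExample and the helper parseFragment are made up for illustration; only Jsoup.parseBodyFragment, Document.head(), Document.body() and Element.select() are jsoup API calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FragmentShellExample {
    // Hypothetical helper: wraps the fragment in a shell document.
    static Document parseFragment(String html) {
        return Jsoup.parseBodyFragment(html);
    }

    public static void main(String[] args) {
        // The div is deliberately left unclosed; the parser closes it.
        Document doc = parseFragment("<div><p>Lorem ipsum.</p>");

        // The shell has html, head and body elements around the fragment.
        System.out.println(doc.outerHtml());

        // The fragment itself ends up as the content of body.
        System.out.println(doc.body().html());
    }
}
```

Printing doc.outerHtml() should show the html, head and body wrappers that were not present in the input string.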
The Document.body() method returns the document's body element, from which you can reach the parsed fragment via its children; it is roughly equivalent to doc.getElementsByTag("body").first().
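The following sketch compares the two lookups. It assumes that, in current jsoup versions, body() returns the body element itself while getElementsByTag("body") returns an Elements collection whose first entry is that same element. The class name BodyLookupExample and its helper methods are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BodyLookupExample {
    // Hypothetical helper: true if both lookups yield the same element instance.
    static boolean sameBody(Document doc) {
        Element viaBody = doc.body();
        Element viaTag = doc.getElementsByTag("body").first();
        return viaBody == viaTag;
    }

    // Hypothetical helper: number of element children directly under body.
    static int topLevelCount(Document doc) {
        return doc.body().children().size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parseBodyFragment("<p>One</p><p>Two</p>");
        System.out.println(sameBody(doc));
        System.out.println(topLevelCount(doc));
    }
}
```

To work with the parsed fragment rather than the wrapper, call doc.body().children() for the top-level elements or doc.body().html() for the markup.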
Stay safe
If you are going to accept HTML input from a user, you need to be careful to avoid cross-site scripting attacks. See the documentation for the Safelist-based cleaner, and clean the input with Jsoup.clean(String bodyHtml, Safelist safelist).
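A minimal sketch of cleaning user input, assuming a jsoup version that ships org.jsoup.safety.Safelist and using the built-in Safelist.basic() preset. The class name CleanExample, the helper sanitize and the sample input are made up for illustration; pick the preset that matches what your application should allow.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanExample {
    // Hypothetical helper: keeps only the tags and attributes that
    // Safelist.basic() permits and drops everything else.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String untrusted =
            "<p>Hello <a href=\"http://example.com/\" onclick=\"steal()\">link</a>"
            + "<script>alert(1)</script></p>";

        // The script element and the onclick attribute should be removed,
        // while the paragraph and the link itself are kept.
        System.out.println(sanitize(untrusted));
    }
}
```

Jsoup.clean returns a String of body HTML, so the result can be stored or re-parsed with Jsoup.parseBodyFragment as needed.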