Parsing a body fragment

Jan 21, 2010

Problem

You have a fragment of body HTML (e.g. a div containing a couple of p tags; as opposed to a full HTML document) that you want to parse. Perhaps it was provided by a user submitting a comment, or editing the body of a page in a CMS.

Solution

Use the Jsoup.parseBodyFragment(String html) method.

String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();

Description

The parseBodyFragment method creates an empty shell document and inserts the parsed HTML into the body element. If you used the normal Jsoup.parse(String html) method, you would generally get the same result, but explicitly treating the input as a body fragment ensures that any malformed HTML provided by the user is parsed into the body element.
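To make the shell document concrete, here is a minimal sketch that parses the same unclosed fragment and inspects the result. The class name FragmentShellExample and the helper parseFragment are illustrative names only, not part of jsoup, and the sketch assumes the jsoup library is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FragmentShellExample {
    // Hypothetical helper: wraps the fragment in a shell document.
    static Document parseFragment(String html) {
        return Jsoup.parseBodyFragment(html);
    }

    public static void main(String[] args) {
        // The closing </div> is missing on purpose; the parser closes it.
        Document doc = parseFragment("<div><p>Lorem ipsum.</p>");

        // The shell supplies a head element even though the input had none.
        System.out.println("head children: " + doc.head().children().size());

        // The fragment's top-level elements end up as children of body.
        System.out.println("body children: " + doc.body().children().size());
        System.out.println(doc.body().html());
    }
}
```

Printing doc.body().html() is a quick way to check how the parser repaired the user's markup before you store or display it.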

The Document.body() method returns the document’s body element itself, from which you can reach the parsed fragment via methods such as children(); it is roughly equivalent to doc.getElementsByTag("body").first().
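As a rough illustration of working with the body element, the sketch below lists the tag names of a fragment's top-level elements. Note that body() yields a single Element, and its children() are the elements that came from the fragment. The class BodyChildrenExample and the method topLevelTags are made-up names for this example, and jsoup is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class BodyChildrenExample {
    // Hypothetical helper: tag names of the fragment's top-level elements.
    static List<String> topLevelTags(String fragmentHtml) {
        Document doc = Jsoup.parseBodyFragment(fragmentHtml);
        Element body = doc.body(); // a single Element, not a list

        List<String> tags = new ArrayList<>();
        for (Element child : body.children()) {
            tags.add(child.tagName());
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(topLevelTags("<p>One</p><p>Two</p><ul><li>Three</li></ul>"));
    }
}
```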

Stay safe

If you are going to accept HTML input from a user, you need to be careful to avoid cross-site scripting attacks. See the documentation for the Safelist-based cleaner, and clean the input with Jsoup.clean(String bodyHtml, Safelist safelist).
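A minimal sketch of cleaning a user-submitted fragment might look like the following. It assumes a jsoup version that ships the Safelist class and uses the built-in Safelist.basic() preset; whether that preset suits your site is a decision for you to make. The class CleanCommentExample and the method sanitize are illustrative names, not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanCommentExample {
    // Hypothetical helper: keeps only the tags and attributes that
    // Safelist.basic() permits and drops everything else.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String unsafe = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>"
                + "<script>alert(1)</script>";

        // The onclick attribute and the script element are not on the
        // basic safelist, so they do not appear in the cleaned output.
        System.out.println(sanitize(unsafe));
    }
}
```

Clean the input before storing or rendering it, and treat the safelist as an allow-list: start from a restrictive preset and add only the tags you actually need.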