Parsing a body fragment

Problem

You have a fragment of body HTML (e.g. a div containing a couple of p tags; as opposed to a full HTML document) that you want to parse. Perhaps it was provided by a user submitting a comment, or editing the body of a page in a CMS.

Solution

Use the Jsoup.parseBodyFragment(String html) method.

String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();

Description

The parseBodyFragment method creates an empty shell document, and inserts the parsed HTML into the body element. If you used the normal Jsoup.parse(String html) method, you would generally get the same result, but explicitly treating the input as a body fragment ensures that any bozo HTML provided by the user is parsed into the body element.
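The shell-document behaviour described above can be checked with a minimal sketch. The class name BodyFragmentShell and the helper parseFragment are hypothetical names chosen for illustration; the jsoup calls themselves (Jsoup.parseBodyFragment, Document.head, Document.body) are part of the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BodyFragmentShell {
    // Hypothetical helper: parse a fragment into a shell document.
    static Document parseFragment(String html) {
        return Jsoup.parseBodyFragment(html);
    }

    public static void main(String[] args) {
        // Same input as the Solution; note the <div> is never closed.
        Document doc = parseFragment("<div><p>Lorem ipsum.</p>");

        // The shell supplies html, head and body; the fragment lands in body.
        System.out.println("head children: " + doc.head().children().size());
        System.out.println("body children: " + doc.body().children().size());
        System.out.println("first body child: " + doc.body().child(0).tagName());

        // Print the whole document to see the generated shell around the fragment.
        System.out.println(doc.outerHtml());
    }
}
```

Printing doc.outerHtml() is a quick way to see which wrapper elements the parser added around the input.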

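The Document.body() method returns the document's body element itself; for a document like this one it is the same element you would get from doc.getElementsByTag("body").first(). To get the element children of the body, call doc.body().children().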
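As a rough illustration of how Document.body() relates to getElementsByTag("body"), the sketch below compares the two. The class name BodyAccessor and the helper bodyMatchesTagLookup are hypothetical; note that body() yields a single Element, while getElementsByTag returns an Elements collection.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class BodyAccessor {
    // Hypothetical helper: true if body() is the first <body> found by tag lookup.
    static boolean bodyMatchesTagLookup(Document doc) {
        Element body = doc.body();                        // a single Element
        Elements matches = doc.getElementsByTag("body");  // a collection of Elements
        return matches.size() == 1 && matches.first() == body;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parseBodyFragment("<div><p>Lorem ipsum.</p></div>");
        System.out.println("same element: " + bodyMatchesTagLookup(doc));

        // The element children of the body are reached via children().
        for (Element child : doc.body().children()) {
            System.out.println("body child: " + child.tagName());
        }
    }
}
```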

Stay safe

If you are going to accept HTML input from a user, you need to be careful to avoid cross-site scripting (XSS) attacks. See the documentation for the Safelist-based cleaner, and clean the input with Jsoup.clean(String bodyHtml, Safelist safelist).
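A minimal sketch of that cleaning step follows, assuming a jsoup version that provides the Safelist class and that the preset Safelist.basic() suits your content. The class name FragmentSanitizer, the helper sanitize, and the sample input are made up for illustration; choose or build a safelist that matches what your application should allow.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class FragmentSanitizer {
    // Hypothetical helper: keep only the tags and attributes allowed by Safelist.basic().
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        // Made-up user input with an inline event handler and a script element.
        String untrusted = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>"
                + "<script>alert(1)</script>";

        // The event handler and the script element are not on the safelist,
        // so they should be absent from the cleaned output.
        System.out.println(sanitize(untrusted));
    }
}
```

The cleaned string is itself body HTML, so it can be stored or passed to Jsoup.parseBodyFragment for further processing.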
