Skip to content
  • jsoup
  • News
  • Bugs
  • Discussion
  • Download
  • API Reference
  • Cookbook
  • Try jsoup
jsoup » Cookbook » Input » Parse a document from a String

Parse a document from a String

Jan 17, 2010

Problem

You have HTML in a Java String, and you want to parse that HTML to get at its contents, or to make sure it’s well formed, or to modify it. The String may have come from user input, a file, or from the web.

Solution

Use the static Jsoup.parse(String html) method, or Jsoup.parse(String html, String baseUri) if the page came from the web, and you want to get at absolute URLs (see Working with URLs).

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

Description

The parse(String html, String baseUri) method parses the input HTML into a new Document. The base URI argument is used to resolve relative URLs into absolute URLs, and should be set to the URL where the document was fetched from. If that’s not applicable, or if you know the HTML has a base element, you can use the parse(String html) method.

As long as you pass in a non-null string, you’re guaranteed to have a successful, sensible parse, with a Document containing (at least) a head and a body element.

Once you have a Document, you can get at the data using the appropriate methods in Document and its supers Element and Node.

Cookbook

Introduction

  1. Parsing and traversing a Document

Input

  1. Parse a document from a String
  2. Parsing a body fragment
  3. Load a Document from a URL
  4. Load a Document from a File
  5. Parse large documents efficiently with StreamParser

Extracting data

  1. Use DOM methods to navigate a document
  2. Use CSS selectors to find elements
  3. Use XPath selectors to find elements and nodes
  4. Extract attributes, text, and HTML from elements
  5. Working with relative and absolute URLs
  6. Example program: list links

Modifying data

  1. Set attribute values
  2. Set the HTML of an element
  3. Setting the text content of elements

Cleaning HTML

  1. Sanitize untrusted HTML (to prevent XSS)

Working with the web

  1. Maintaining a request session
jsoup HTML parser © 2009 - 2026 Jonathan Hedley