Parse large documents efficiently with StreamParser
Problem
You need to parse an HTML or XML document that is too large to fit entirely into memory, or you want to process elements progressively as they are encountered. A typical use case is extracting specific elements from a large document, or handling streamed HTML from a network source efficiently.
Traditional Java SAX parsers offer efficient streaming parsing for XML and HTML, but they lack an ergonomic way to traverse or manipulate elements like a DOM parser does. Meanwhile, standard DOM parsers, such as Jsoup.parse(), require loading the entire document into memory, which may be inefficient for large files.
Solution
Use the StreamParser, which allows you to parse an HTML or XML document in an event-driven, hybrid DOM + SAX style. Elements are emitted as they are completed, enabling efficient memory use and incremental processing. This hybrid approach allows you to process elements as they arrive, including their children and ancestors, while still leveraging jsoup's intuitive API.
This makes StreamParser a viable alternative to traditional SAX parsers while providing a more ergonomic and familiar API. And jsoup's robust handling of malformed HTML and XML ensures that real-world documents can be processed effectively.
try (StreamParser streamer = Jsoup.connect("https://example.com/large.html")
        .execute()
        .streamParser()) {
    Element el;
    while ((el = streamer.selectNext("article")) != null) {
        // Will include the children of <article>
        System.out.println("Processing article: " + el.text());
        el.remove(); // Keep memory usage low by discarding processed elements
    }
}
Description
Unlike the default jsoup parse() method, which constructs a full DOM tree in memory, StreamParser allows for progressive parsing:
- Elements are fully formed and emitted as they are completed.
- The parser can run in an iterator-like fashion with selectNext(query) to fetch elements as needed.
- The DOM tree can be pruned during parsing to save memory.
- The document() method provides access to the partially built document.
- Parsing can be stopped early with stop() if only a portion of the document is needed (see the sketch below).
- The backing input (a URL connection, or a file) is read incrementally as the parse proceeds, reducing buffer bloat.
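As a minimal sketch of the early-stop case (the file path and "article" selector here are assumed, not part of the library), selectNext() reads only as far as the first match, and stop() halts any further parsing:
static void findFirst(Path path) throws IOException {
    try (StreamParser streamer = DataUtil.streamParser(
            path, StandardCharsets.UTF_8, "https://example.com/", Parser.htmlParser())) {
        Element first = streamer.selectNext("article"); // parses only as far as needed
        if (first != null)
            System.out.println("Found: " + first.text());
        streamer.stop(); // halt the parse; no further input will be read
    }
}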
A StreamParser can be reused via a new parse(), but it is not thread-safe for concurrent inputs; use a new parser in each thread.
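For instance, a minimal sketch (with made-up inputs) of reusing one parser sequentially:
static void reuseParser() throws IOException {
    StreamParser streamer = new StreamParser(Parser.htmlParser());
    for (String html : List.of("<p>One</p>", "<p>Two</p>")) {
        streamer.parse(html, ""); // re-initializes the parser for the new input
        Element p = streamer.selectNext("p");
        if (p != null)
            System.out.println(p.text());
    }
}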
If created via Connection.Response#streamParser(), or from another Reader that is I/O backed, the iterator and stream consumers will throw a java.io.UncheckedIOException if the underlying Reader errors during read.
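A minimal sketch (against a hypothetical URL) of catching such a failure while consuming the stream:
static void streamSafely(String url) throws IOException {
    try (StreamParser streamer = Jsoup.connect(url).execute().streamParser()) {
        streamer.stream().forEach(el -> System.out.println(el.tagName()));
    } catch (UncheckedIOException e) {
        System.err.println("Read failed mid-parse: " + e.getCause());
    }
}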
The StreamParser wraps an underlying HTML or XML parser, so the same configuration options can be used as with the standard Jsoup.parse() method.
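For example, a sketch of passing a configured parser through (here preserving tag case; the input is made up):
static void preserveCaseStream() throws IOException {
    Parser parser = Parser.htmlParser().settings(ParseSettings.preserveCase);
    try (StreamParser streamer = new StreamParser(parser).parse("<DIV>Hello</DIV>", "")) {
        Element div = streamer.selectNext("div");
        if (div != null)
            System.out.println(div.tagName()); // "DIV" - original case retained
    }
}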
Examples
Process a file in chunks
Let's say we have an XML file with a bunch of <book> chunks, each with many <chapter> elements, and loading it all into the DOM at once might run out of memory. Parse the file incrementally using DataUtil.streamParser(). Then process the file in chunks by iterating on selectNext():
static void streamChunks(Path path) throws IOException {
    try (StreamParser streamer = DataUtil.streamParser(
            path, StandardCharsets.UTF_8, "https://example.com", Parser.xmlParser())) {
        Element el;
        var seenChunks = 0;
        while ((el = streamer.selectNext("book")) != null) {
            // do something more useful! The element will have all its children elements
            Elements chapters = el.select("chapter");
            // remove this chunk once used to keep DOM light and not run out of memory
            el.remove();
            seenChunks++;
        }
        Document doc = streamer.document(); // the completed doc, will just be a shell
        log("Title", doc.expectFirst("title"));
        log("Seen chunks", seenChunks);
    }
}
Parse just the metadata of a website
Assume we are building a link preview tool. All the data we need is in the head section of a page, so there's no need to fetch and parse the complete page. Make the request using Jsoup.connect(), and stream-parse it via Response.streamParser().
This example will fetch a given URL, parse only the <head>
contents and use those, and then cleanly close the request:
static void selectMeta(String url) throws IOException {
    try (StreamParser streamer = Jsoup.connect(url).execute().streamParser()) {
        Element head = streamer.selectFirst("head");
        if (head == null) return;
        log("Title", head.select("title").text());
        log("Description", head.select("meta[name=description]").attr("content"));
        log("Image", head.select("meta[name=twitter:image]").attr("content"));
    }
}
Minify the loaded DOM by removing empty text nodes
This example shows a way to progressively parse an input and remove redundant empty text nodes during the parse, resulting in a (somewhat) minified DOM:
static void minifyDocument() {
    String html = "<table><tr> <td>a</td> <td>a</td> <td>a</td> <td>a</td> </tr>";
    StreamParser streamer = new StreamParser(Parser.htmlParser()).parse(html, "");

    streamer.stream()
        .filter(Element::isBlock)
        .forEach(el -> {
            List<TextNode> textNodes = el.textNodes();
            for (TextNode textNode : textNodes) {
                if (textNode.isBlank())
                    textNode.remove();
            }
        });

    Document minified = streamer.document();
    System.out.println(minified.body());
}
Conclusion
The StreamParser provides a practical solution for handling large or streamed XML and HTML documents efficiently, balancing the benefits of both SAX and DOM parsing. Whether you need to extract elements incrementally, reduce memory consumption, or selectively parse content, StreamParser offers a flexible alternative to traditional Java SAX parsers while maintaining the familiar API and robust parsing capabilities of jsoup.