Parse large documents efficiently with StreamParser
Problem
You need to parse an HTML or XML document that is too large to fit entirely into memory, or you want to process elements progressively as they are encountered. A typical use case is extracting specific elements from a large document, or handling streamed HTML from a network source efficiently.
Traditional Java SAX parsers offer efficient streaming parsing for XML and HTML, but they lack an ergonomic way to traverse or manipulate elements like a DOM parser does. Meanwhile, standard DOM parsers, such as Jsoup.parse(), require loading the entire document into memory, which may be inefficient for large files.
Solution
Use the StreamParser, which allows you to parse an HTML or XML document in an event-driven, hybrid DOM + SAX style. Elements are emitted as they are completed, enabling efficient memory use and incremental processing. This hybrid approach allows you to process elements as they arrive, including their children and ancestors, while still leveraging jsoup's intuitive API.
This makes StreamParser a viable alternative to traditional SAX parsers while providing a more ergonomic and familiar API. And jsoup's robust handling of malformed HTML and XML ensures that real-world documents can be processed effectively.
try (StreamParser streamer = Jsoup.connect("https://example.com/large.html")
        .execute()
        .streamParser()) {
    Element el;
    while ((el = streamer.selectNext("article")) != null) {
        // Will include the children of <article>
        System.out.println("Processing article: " + el.text());
        el.remove(); // Keep memory usage low by discarding processed elements
    }
}
Description
Unlike the default jsoup parse() method, which constructs a full DOM tree in memory, StreamParser allows for progressive parsing:
- Elements are fully formed and emitted as they are completed.
- The parser can run in an iterator-like fashion with selectNext(query) to fetch elements as needed.
- The DOM tree can be pruned during parsing to save memory.
- The document() method provides access to the partially built document.
- Parsing can be stopped early with stop() if only a portion of the document is needed (see the sketch below).
- The backing input (a URL connection, or a file) is read incrementally as the parse proceeds, reducing buffer bloat.
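As a minimal sketch of the early-stop case (the file path and "article" selector here are assumed, not part of the library), selectNext() reads only as far as the first match, and stop() halts any further parsing:
static void findFirst(Path path) throws IOException {
    try (StreamParser streamer = DataUtil.streamParser(
            path, StandardCharsets.UTF_8, "https://example.com/", Parser.htmlParser())) {
        Element first = streamer.selectNext("article"); // parses only as far as needed
        if (first != null)
            System.out.println("Found: " + first.text());
        streamer.stop(); // halt the parse; no further input will be read
    }
}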
A StreamParser can be reused via a new parse(), but it is not thread-safe for concurrent inputs; use a new parser in each thread.
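For instance, a minimal sketch (with made-up inputs) of reusing one parser sequentially:
static void reuseParser() throws IOException {
    StreamParser streamer = new StreamParser(Parser.htmlParser());
    for (String html : List.of("<p>One</p>", "<p>Two</p>")) {
        streamer.parse(html, ""); // re-initializes the parser for the new input
        Element p = streamer.selectNext("p");
        if (p != null)
            System.out.println(p.text());
    }
}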
If created via Connection.Response#streamParser(), or from another Reader that is I/O backed, the iterator and stream consumers will throw a java.io.UncheckedIOException if the underlying Reader errors during read.
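A minimal sketch (against a hypothetical URL) of catching such a failure while consuming the stream:
static void streamSafely(String url) throws IOException {
    try (StreamParser streamer = Jsoup.connect(url).execute().streamParser()) {
        streamer.stream().forEach(el -> System.out.println(el.tagName()));
    } catch (UncheckedIOException e) {
        System.err.println("Read failed mid-parse: " + e.getCause());
    }
}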
The StreamParser wraps an underlying HTML or XML parser, so the same configuration options can be used as with the standard Jsoup.parse() method.
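For example, a sketch of passing a configured parser through (here preserving tag case; the input is made up):
static void preserveCaseStream() throws IOException {
    Parser parser = Parser.htmlParser().settings(ParseSettings.preserveCase);
    try (StreamParser streamer = new StreamParser(parser).parse("<DIV>Hello</DIV>", "")) {
        Element div = streamer.selectNext("div");
        if (div != null)
            System.out.println(div.tagName()); // "DIV" - original case retained
    }
}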
Examples
Process a file in chunks
Let's say we have an XML file with a bunch of <book> chunks, each with many <chapter> elements, and loading it all into the DOM at once might run out of memory. Parse the file incrementally using DataUtil.streamParser(). Then process the file in chunks by iterating on selectNext():
static void streamChunks(Path path) throws IOException {
    try (StreamParser streamer = DataUtil.streamParser(
            path, StandardCharsets.UTF_8, "https://example.com", Parser.xmlParser())) {
        Element el;
        var seenChunks = 0;
        while ((el = streamer.selectNext("book")) != null) {
            // do something more useful! The element will have all its children elements
            Elements chapters = el.select("chapter");
            // remove this chunk once used to keep DOM light and not run out of memory
            el.remove();
            seenChunks++;
        }
        Document doc = streamer.document(); // the completed doc, will just be a shell
        log("Title", doc.expectFirst("title"));
        log("Seen chunks", seenChunks);
    }
}
Parse just the metadata of a website
Assume we are building a link preview tool. All the data we need is in the head section of a page, so there's no need to fetch and parse the complete page. Make the request using Jsoup.connect(), and stream-parse it via Response.streamParser().
This example will fetch a given URL, parse only the <head>
contents and use those, and then cleanly close the request:
static void selectMeta(String url) throws IOException {
    try (StreamParser streamer = Jsoup.connect(url).execute().streamParser()) {
        Element head = streamer.selectFirst("head");
        if (head == null) return;
        log("Title", head.select("title").text());
        log("Description", head.select("meta[name=description]").attr("content"));
        log("Image", head.select("meta[name=twitter:image]").attr("content"));
    }
}
Minify the loaded DOM by removing empty text nodes
This example shows a way to progressively parse an input and remove redundant empty text nodes during the parse, resulting in a (somewhat) minified DOM:
static void minifyDocument() {
    String html = "<table><tr> <td>a</td> <td>a</td> <td>a</td> <td>a</td> </tr>";
    StreamParser streamer = new StreamParser(Parser.htmlParser()).parse(html, "");

    streamer.stream()
        .filter(Element::isBlock)
        .forEach(el -> {
            List<TextNode> textNodes = el.textNodes();
            for (TextNode textNode : textNodes) {
                if (textNode.isBlank())
                    textNode.remove();
            }
        });

    Document minified = streamer.document();
    System.out.println(minified.body());
}
Conclusion
The StreamParser provides a practical solution for handling large or streamed XML and HTML documents efficiently, balancing the benefits of both SAX and DOM parsing. Whether you need to extract elements incrementally, reduce memory consumption, or selectively parse content, StreamParser offers a flexible alternative to traditional Java SAX parsers while maintaining the familiar API and robust parsing capabilities of jsoup.