Package org.jsoup.parser

Class StreamParser

java.lang.Object
org.jsoup.parser.StreamParser
All Implemented Interfaces:
Closeable, AutoCloseable

public class StreamParser extends Object implements Closeable
A StreamParser provides a progressive parse of its input. As each Element is completed, it is emitted via a Stream or Iterator interface. Elements returned will be complete with all their children, and an (empty) next sibling, if applicable.

Elements (or their children) may be removed from the DOM during the parse, for example to conserve memory. This makes it possible to parse an input document that would otherwise be too large to fit into memory, while still providing a DOM interface to the document and its elements.
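
As a minimal sketch of that pattern, the following streams a small list and detaches each li once it has been handled. The class name PruneWhileStreaming, the method drainItems, and the sample markup are illustrative assumptions, not part of jsoup.

```java
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;

import java.util.ArrayList;
import java.util.List;

class PruneWhileStreaming {
    /**
     * Adds the text of every li to sink, removing each li from the DOM once handled.
     * Returns the number of li elements still attached to the Document afterwards.
     */
    static int drainItems(String html, List<String> sink) {
        try (StreamParser parser = new StreamParser(Parser.htmlParser()).parse(html, "")) {
            parser.stream()
                .filter(el -> el.normalName().equals("li"))
                .forEach(el -> {
                    sink.add(el.text());
                    el.remove(); // detach, so the partial Document does not keep growing
                });
            return parser.document().select("li").size();
        }
    }

    public static void main(String[] args) {
        List<String> texts = new ArrayList<>();
        int remaining = drainItems("<ul><li>a</li><li>b</li><li>c</li></ul>", texts);
        System.out.println(texts + ", li elements left in DOM: " + remaining);
    }
}
```

Removing the element inside the consumer keeps only the not-yet-processed part of the tree in memory; the enclosing ul, body and html elements are still emitted and remain in the Document.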

Additionally, the parser provides the selectFirst(String query) and selectNext(String query) methods, which run the parser until a match is found, at which point the parse is suspended. The parse can be resumed via another select call, or via the stream() or iterator() methods.

Once the input has been fully read, the input Reader will be closed. If the whole document does not need to be read, call stop() and close() instead.

The document() method will return the Document being parsed into, which will be only partially complete until the input is fully consumed.

A StreamParser can be reused via a new parse(Reader, String) call, but it is not thread-safe for concurrent inputs; use a separate parser in each thread.
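
A small sketch of sequential reuse on a single thread follows. The class ReuseParser, its titles method, and the inputs are hypothetical names chosen for illustration; the first parse is abandoned early with stop() and close() before the second input is supplied.

```java
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;

import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReuseParser {
    /** Runs two document parses back to back on one StreamParser and returns both titles. */
    static String[] titles(String first, String second) {
        StreamParser parser = new StreamParser(Parser.htmlParser());
        try {
            String a = parser.parse(new StringReader(first), "").expectFirst("title").text();
            parser.stop().close(); // the rest of the first input is not needed
            String b = parser.parse(new StringReader(second), "").expectFirst("title").text();
            return new String[] {a, b};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            parser.close(); // release the second reader as well
        }
    }

    public static void main(String[] args) {
        String[] t = titles("<title>One</title><p>a</p>", "<title>Two</title><p>b</p>");
        System.out.println(t[0] + ", " + t[1]);
    }
}
```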

If created via Connection.Response.streamParser(), or supplied with another Reader that is I/O backed, the iterator and stream consumers will throw an UncheckedIOException if the underlying Reader fails during a read.
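
To illustrate the failure mode without a network connection, the sketch below feeds the parser a Reader that fails on every read. BrokenReader and FailingInput are hypothetical helpers invented for this example; a real failure would more likely come from a socket-backed Reader.

```java
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;

import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

class FailingInput {
    /** A hypothetical Reader that fails on every read, standing in for a dropped connection. */
    static final class BrokenReader extends Reader {
        @Override public int read(char[] buf, int off, int len) throws IOException {
            throw new IOException("simulated read failure");
        }
        @Override public void close() { /* nothing to release */ }
    }

    /** Returns the message of the I/O failure seen while streaming, or null if none occurred. */
    static String streamFailure() {
        try (StreamParser parser = new StreamParser(Parser.htmlParser()).parse(new BrokenReader(), "")) {
            parser.stream().forEach(el -> { /* consuming the stream drives the read */ });
            return null;
        } catch (UncheckedIOException e) {
            return e.getCause().getMessage(); // the original IOException is the cause
        }
    }

    public static void main(String[] args) {
        System.out.println("stream failed with: " + streamFailure());
    }
}
```

Callers that prefer checked exceptions can use the select and complete methods instead, which declare IOException.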

The StreamParser interface is currently in beta and may change in subsequent releases. Feedback on the feature and how you're using it is very welcome via the jsoup discussions.

Since:
1.18.1
  • Constructor Summary

    Constructors
    Constructor
    Description
    StreamParser(Parser parser)
    Construct a new StreamParser, using the supplied base Parser.
  • Method Summary

    Modifier and Type
    Method
    Description
    void
    close()
    Closes the input and releases resources including the underlying parser and reader.
    Document
    complete()
    Runs the parser until the input is fully read, and returns the completed Document.
    List<Node>
    completeFragment()
    When initialized as a fragment parse, runs the parser until the input is fully read, and returns the completed fragment child nodes.
    Document
    document()
    Get the current Document as it is being parsed.
    Element
    expectFirst(String query)
    Just like selectFirst(String), but if there is no match, throws an IllegalArgumentException.
    Element
    expectNext(String query)
    Just like selectNext(String), but if there is no match, throws an IllegalArgumentException.
    Iterator<Element>
    iterator()
    Returns an Iterator of Elements, with the input being parsed as each element is consumed.
    StreamParser
    parse(Reader input, String baseUri)
    Provide the input for a Document parse.
    StreamParser
    parse(String input, String baseUri)
    Provide the input for a Document parse.
    StreamParser
    parseFragment(Reader input, @Nullable Element context, String baseUri)
    Provide the input for a fragment parse.
    StreamParser
    parseFragment(String input, @Nullable Element context, String baseUri)
    Provide the input for a fragment parse.
    @Nullable Element
    selectFirst(String query)
    Finds the first Element that matches the provided query.
    @Nullable Element
    selectFirst(Evaluator eval)
    Finds the first Element that matches the provided query.
    @Nullable Element
    selectNext(String query)
    Finds the next Element that matches the provided query.
    @Nullable Element
    selectNext(Evaluator eval)
    Finds the next Element that matches the provided query.
    StreamParser
    stop()
    Flags that the parse should be stopped; the backing iterator will not return any more Elements.
    Stream<Element>
    stream()
    Creates a Stream of Elements, with the input being parsed as each element is consumed.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • StreamParser

      public StreamParser(Parser parser)
      Construct a new StreamParser, using the supplied base Parser.
      Parameters:
      parser - the configured base parser
  • Method Details

    • parse

      public StreamParser parse(Reader input, String baseUri)
      Provide the input for a Document parse. The input is not read until a consuming operation is called.
      Parameters:
      input - the input to be read.
      baseUri - the URL of this input, for absolute link resolution
      Returns:
      this parser, for chaining
    • parse

      public StreamParser parse(String input, String baseUri)
      Provide the input for a Document parse. The input is not read until a consuming operation is called.
      Parameters:
      input - the input to be read
      baseUri - the URL of this input, for absolute link resolution
      Returns:
      this parser
    • parseFragment

      public StreamParser parseFragment(Reader input, @Nullable Element context, String baseUri)
      Provide the input for a fragment parse. The input is not read until a consuming operation is called.
      Parameters:
      input - the input to be read
      context - the optional fragment context element
      baseUri - the URL of this input, for absolute link resolution
      Returns:
      this parser
      See Also:
      completeFragment()
    • parseFragment

      public StreamParser parseFragment(String input, @Nullable Element context, String baseUri)
      Provide the input for a fragment parse. The input is not read until a consuming operation is called.
      Parameters:
      input - the input to be read
      context - the optional fragment context element
      baseUri - the URL of this input, for absolute link resolution
      Returns:
      this parser
      See Also:
      completeFragment()
    • stream

      public Stream<Element> stream()
      Creates a Stream of Elements, with the input being parsed as each element is consumed. Each Element returned will be complete (that is, all of its children will be included, and if it has a next sibling, that (empty) sibling will exist at Element.nextElementSibling()). The stream will be emitted in document order as each element is closed. That means that child elements will be returned prior to their parents.

      The stream will start from the current position of the backing iterator and the parse.

      When consuming the stream, if the Reader that the Parser is reading throws an I/O exception (for example a SocketTimeoutException), it will be rethrown as an UncheckedIOException.

      Returns:
      a stream of Element objects
      Throws:
      UncheckedIOException - if the underlying Reader throws an exception during a read (in stream consuming methods)
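
The emission order can be observed with a short sketch like the one below, which records tag names as elements come off the stream. EmissionOrder and tagOrder are illustrative names, and the sample markup is an assumption.

```java
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;

import java.util.List;
import java.util.stream.Collectors;

class EmissionOrder {
    /** Returns the normalized tag names in the order the stream emits the elements. */
    static List<String> tagOrder(String html) {
        try (StreamParser parser = new StreamParser(Parser.htmlParser()).parse(html, "")) {
            return parser.stream()
                .map(Element::normalName)
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        // Children are emitted before their parents, so the p elements
        // appear before the enclosing div, and html is emitted last.
        System.out.println(tagOrder("<div><p>One</p><p>Two</p></div>"));
    }
}
```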
    • iterator

      public Iterator<Element> iterator()
      Returns an Iterator of Elements, with the input being parsed as each element is consumed. Each Element returned will be complete (that is, all of its children will be included, and if it has a next sibling, that (empty) sibling will exist at Element.nextElementSibling()). The elements will be emitted in document order as each element is closed. That means that child elements will be returned prior to their parents.

      The iterator will start from the current position of the parse.

      The iterator is backed by this StreamParser, and the resources it holds.

      Returns:
      an iterator of Element objects
    • stop

      public StreamParser stop()
      Flags that the parse should be stopped; the backing iterator will not return any more Elements.
      Returns:
      this parser
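
A sketch of stopping early, assuming only the first paragraph is of interest: once stop() has been called, the iterator's hasNext() reports no further elements, and try-with-resources then closes the unread input. The names StopEarly and paragraphsUntilStop are hypothetical.

```java
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class StopEarly {
    /** Collects paragraph texts, but stops the parse as soon as the first one is seen. */
    static List<String> paragraphsUntilStop(String html) {
        List<String> seen = new ArrayList<>();
        try (StreamParser parser = new StreamParser(Parser.htmlParser()).parse(html, "")) {
            Iterator<Element> it = parser.iterator();
            while (it.hasNext()) {
                Element el = it.next();
                if (el.normalName().equals("p")) {
                    seen.add(el.text());
                    parser.stop(); // the loop condition fails on the next check
                }
            }
        } // close() releases the reader without reading the rest of the input
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(paragraphsUntilStop("<p>One</p><p>Two</p><p>Three</p>"));
    }
}
```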
    • close

      public void close()
      Closes the input and releases resources including the underlying parser and reader.

      The parser will also be closed when the input is fully read.

      The parser can be reused with another call to parse(Reader, String).

      Specified by:
      close in interface AutoCloseable
      Specified by:
      close in interface Closeable
    • document

      public Document document()
      Get the current Document as it is being parsed. It will be only partially complete until the input is fully read. Structural changes (e.g. insert, remove) may be made to the Document contents.
      Returns:
      the (partial) Document
    • complete

      public Document complete() throws IOException
      Runs the parser until the input is fully read, and returns the completed Document.
      Returns:
      the completed Document
      Throws:
      IOException - if an I/O error occurs
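
As an illustrative sketch, a parse that was suspended by selectFirst(String) can be finished with complete() to obtain the whole Document. CompleteAfterSelect and summary are hypothetical names, and the checked IOException is wrapped only to keep the example short.

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;

import java.io.IOException;
import java.io.UncheckedIOException;

class CompleteAfterSelect {
    /** Returns "title:paragraphCount" for the given HTML. */
    static String summary(String html) {
        try (StreamParser parser = new StreamParser(Parser.htmlParser()).parse(html, "")) {
            Element title = parser.selectFirst("title"); // the parse is suspended at the first hit
            Document doc = parser.complete();            // read the remainder of the input
            String titleText = title != null ? title.text() : "";
            return titleText + ":" + doc.select("p").size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(summary("<title>Test</title><p>One</p><p>Two</p>"));
    }
}
```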
    • completeFragment

      public List<Node> completeFragment() throws IOException
      When initialized as a fragment parse, runs the parser until the input is fully read, and returns the completed fragment child nodes.
      Returns:
      the completed child nodes
      Throws:
      IOException - if an I/O error occurs
      See Also:
      parseFragment(Reader, Element, String)
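
A minimal fragment-parse sketch follows, assuming a div as the context element. FragmentParse and fragmentNodeNames are illustrative names; other context elements may change how the input is interpreted.

```java
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class FragmentParse {
    /** Parses html as a fragment in a div context and returns the top-level node names. */
    static List<String> fragmentNodeNames(String html) {
        Element context = new Element("div"); // assumed context for this example
        try (StreamParser parser = new StreamParser(Parser.htmlParser())
                .parseFragment(html, context, "")) {
            List<String> names = new ArrayList<>();
            for (Node node : parser.completeFragment()) {
                names.add(node.nodeName());
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fragmentNodeNames("<p>One</p><p>Two</p>"));
    }
}
```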
    • selectFirst

      public @Nullable Element selectFirst(String query) throws IOException
      Finds the first Element that matches the provided query. If the parsed Document does not already have a match, the input will be parsed until the first match is found, or the input is completely read.
      Parameters:
      query - the Selector query.
      Returns:
      the first matching Element, or null if there's no match
      Throws:
      IOException - if an I/O error occurs
    • expectFirst

      public Element expectFirst(String query) throws IOException
      Just like selectFirst(String), but if there is no match, throws an IllegalArgumentException. This is useful if you want to simply abort processing on a failed match.
      Parameters:
      query - the Selector query.
      Returns:
      the first matching element
      Throws:
      IllegalArgumentException - if no match is found
      IOException - if an I/O error occurs
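
The difference from selectFirst(String) can be sketched as below: a missing match surfaces as an IllegalArgumentException rather than a null return. ExpectOrAbort, titleOrFallback, and the fallback value are assumptions made for the example.

```java
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;

import java.io.IOException;
import java.io.UncheckedIOException;

class ExpectOrAbort {
    /** Returns the document title, or fallback when the input has no title element. */
    static String titleOrFallback(String html, String fallback) {
        try (StreamParser parser = new StreamParser(Parser.htmlParser()).parse(html, "")) {
            return parser.expectFirst("title").text();
        } catch (IllegalArgumentException e) {
            return fallback; // no match: the whole input was read without a hit
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titleOrFallback("<title>Test</title><p>One</p>", "none"));
        System.out.println(titleOrFallback("<p>No title here</p>", "none"));
    }
}
```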
    • selectFirst

      public @Nullable Element selectFirst(Evaluator eval) throws IOException
      Finds the first Element that matches the provided query. If the parsed Document does not already have a match, the input will be parsed until the first match is found, or the input is completely read.
      Parameters:
      eval - the Selector evaluator.
      Returns:
      the first matching Element, or null if there's no match
      Throws:
      IOException - if an I/O error occurs
    • selectNext

      public @Nullable Element selectNext(String query) throws IOException
      Finds the next Element that matches the provided query. The input will be parsed until the next match is found, or the input is completely read.
      Parameters:
      query - the Selector query.
      Returns:
      the next matching Element, or null if there's no match
      Throws:
      IOException - if an I/O error occurs
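
Repeated calls advance through the input one match at a time, which can be sketched as a simple loop. SelectNextLoop and paragraphTexts are hypothetical names.

```java
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class SelectNextLoop {
    /** Collects the text of each p, pulling one match at a time from the parser. */
    static List<String> paragraphTexts(String html) {
        List<String> texts = new ArrayList<>();
        try (StreamParser parser = new StreamParser(Parser.htmlParser()).parse(html, "")) {
            Element p;
            while ((p = parser.selectNext("p")) != null) { // null once the input is exhausted
                texts.add(p.text());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(paragraphTexts("<title>T</title><p>One</p><p>Two</p><div><p>Three</p></div>"));
    }
}
```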
    • expectNext

      public Element expectNext(String query) throws IOException
      Just like selectNext(String), but if there is no match, throws an IllegalArgumentException. This is useful if you want to simply abort processing on a failed match.
      Parameters:
      query - the Selector query.
      Returns:
      the next matching element
      Throws:
      IllegalArgumentException - if no match is found
      IOException - if an I/O error occurs
    • selectNext

      public @Nullable Element selectNext(Evaluator eval) throws IOException
      Finds the next Element that matches the provided query. The input will be parsed until the next match is found, or the input is completely read.
      Parameters:
      eval - the Selector evaluator.
      Returns:
      the next matching Element, or null if there's no match
      Throws:
      IOException - if an I/O error occurs
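
When the same query is run many times, the Evaluator overload avoids re-parsing the selector string on each call. The sketch below assumes the evaluator is built with QueryParser.parse(String) from org.jsoup.select; PrecompiledEvaluator, LINKS, and hrefs are illustrative names.

```java
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;
import org.jsoup.select.Evaluator;
import org.jsoup.select.QueryParser;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class PrecompiledEvaluator {
    // Compiled once, then reused for every selectNext call.
    private static final Evaluator LINKS = QueryParser.parse("a[href]");

    /** Returns the href value of each anchor that has one, in document order. */
    static List<String> hrefs(String html) {
        List<String> found = new ArrayList<>();
        try (StreamParser parser = new StreamParser(Parser.htmlParser()).parse(html, "")) {
            Element a;
            while ((a = parser.selectNext(LINKS)) != null) {
                found.add(a.attr("href"));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(hrefs("<a href=\"/one\">1</a><p>x</p><a href=\"/two\">2</a><a>none</a>"));
    }
}
```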