jsoup Java HTML Parser release 1.11.1

2017-Nov-4

This one goes to eleven! jsoup 1.11.1 features a 30% lower DOM memory footprint, streaming network HTML parsing, faster HTML generation, and a bunch of other improvements and bug fixes.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

Download jsoup now.

Improvements

When loading content from a URL or a file, the content is now parsed as it streams in from the network or disk, rather than being fully buffered before parsing. This substantially reduces memory consumption and the creation of large garbage objects when loading large files. Note that this change means a response can be parsed only once from the same response object, unless you call Connection.Response.bufferUp() first, which buffers the full response into memory.
Updated language level to Java 7 from Java 5. To maintain Android support (of minversion 8), try-with-resources are not used.
Added Connection.Response.bodyStream(), a method to get the response body as an input stream. This is useful for saving a large response straight to a file, without buffering fully into memory first.
Performance improvements in text and HTML generation (through less GC).
Reduced memory consumption of text, scripts, and comments in the DOM by 40%, by refactoring the node hierarchy to not track child nodes or attributes by default for leaf nodes. For the average document, that's about a 30% memory reduction.
Reduced memory consumption of Elements by refactoring their Attributes to be a simple pair of arrays, vs a LinkedHashSet.
Added support for Element.selectFirst(), to efficiently find the first matching element.
Added Element.appendTo(parent) to simplify slinging elements about.
Added support for multiple headers with the same name in Jsoup.Connect.
Added Element.shallowClone() and Node.shallowClone(), to allow cloning nodes without getting all their children.
Updated Element.text() and the :contains(text) selector to consider &nbsp; characters as spaces.
Updated Jsoup.connect().timeout() to implement a total connect + combined read timeout. Previously it specified connect and buffer read times only, so to implement a combined total timeout, you had to have another thread send an interrupt.
Improved performance of Node.addChildren() (was quadratic).
Added missing support for template tags in tables.
In Jsoup.Connect file uploads, added the ability to set the uploaded files' mimetype.
Improved Node traversal, including less object creation, and partial and filtering traversor support.
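
As a rough illustration of a few of the additions listed above, here is a minimal sketch that exercises Element.selectFirst(), Element.appendTo(parent), Element.shallowClone(), and Connection.Response.bodyStream(). It assumes jsoup 1.11.1 or later on the classpath. The class name, the sample HTML, and the helper method names are made up for this example and are not part of jsoup; the URL and target path passed to saveBody() are whatever the caller supplies.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical demo class; only the org.jsoup calls come from the library.
public class ReleaseNotesDemo {
    // Hypothetical sample markup used by the offline examples below.
    static final String SAMPLE =
            "<div id='a'><p>One</p><p>Two</p></div><div id='b'></div>";

    // selectFirst() returns the first match (or null if nothing matches).
    static String firstParagraphText(String html) {
        Document doc = Jsoup.parse(html);
        Element first = doc.selectFirst("p");
        return first.text();
    }

    // appendTo() moves the element under a new parent.
    // Returns the text of #a and #b after the move, joined with '|'.
    static String moveFirstParagraph(String html) {
        Document doc = Jsoup.parse(html);
        doc.selectFirst("p").appendTo(doc.selectFirst("#b"));
        return doc.selectFirst("#a").text() + "|" + doc.selectFirst("#b").text();
    }

    // shallowClone() copies the element itself but none of its children.
    static int shallowCloneChildCount(String html) {
        Element original = Jsoup.parse(html).selectFirst("#a");
        Element copy = original.shallowClone();
        return copy.childNodeSize();
    }

    // bodyStream() exposes the response body as a stream, so a large
    // download can be written to disk without buffering it all in memory.
    // Not called from main(), because it needs network access.
    static void saveBody(String url, Path target) throws IOException {
        Connection.Response response =
                Jsoup.connect(url).ignoreContentType(true).execute();
        try (InputStream in = response.bodyStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstParagraphText(SAMPLE));     // One
        System.out.println(moveFirstParagraph(SAMPLE));     // Two|One
        System.out.println(shallowCloneChildCount(SAMPLE)); // 0
    }
}
```

Because the body is now consumed as a stream, a response read this way cannot also be parsed afterwards; if you need both, call Connection.Response.bufferUp() before reading, at the cost of holding the full body in memory.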

Bug Fixes

Bugfix: if a document was redecoded after character set detection, the HTML parser was not reset correctly, which could lead to an incorrect DOM.
Bugfix: attributes with the same name but different case would be incorrectly treated as different attributes.
Bugfix: self-closing tags for known empty elements were incorrectly treated as errors.
Bugfix: fixed an issue where a self-closing title, noframes, or style tag would cause the rest of the page to be incorrectly parsed as data or text.
Bugfix: fixed an issue with unknown mixed-case tags.
Bugfix: fixed an issue where the entity resources were left open after startup, causing a warning.
Bugfix: fixed an issue where Element.getElementsByIndexLessThan(index) would incorrectly provide the root element.
Improved parse time for pages with exceptionally deeply nested tags.
Improvement / workaround: modified the Entities implementation to load its data from a .class vs from a jar resource. Faster, and safer on Android.

Many thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch via the mailing list or contact me directly.