jsoup Java HTML Parser release 1.11.1
2017-Nov-4
This one goes to eleven! jsoup 1.11.1 features a 30% lower DOM memory footprint, streaming network HTML parsing, faster HTML generation, and a bunch of other improvements and bug fixes.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
Download jsoup now.
Improvements
- When loading content from a URL or a file, the content is now parsed as it streams in from the network or disk, rather than being fully buffered before parsing. This substantially reduces memory consumption and large garbage objects when loading large files. Note that this change means that a response, once parsed, may not be parsed again from the same response object unless you call Connection.Response.bufferUp() first, which will buffer the full response into memory.
- Updated language level to Java 7 from Java 5. To maintain Android support (of minversion 8), try-with-resources is not used.
- Added Connection.Response.bodyStream(), a method to get the response body as an input stream. This is useful for saving a large response straight to a file, without buffering fully into memory first.
- Performance improvements in text and HTML generation (through less GC).
- Reduced memory consumption of text, scripts, and comments in the DOM by 40%, by refactoring the node hierarchy to not track child nodes or attributes by default for leaf nodes. For the average document, that's about a 30% memory reduction.
- Reduced memory consumption of Elements by refactoring their Attributes to be a simple pair of arrays, vs a LinkedHashSet.
- Added support for Element.selectFirst(), to efficiently find the first matching element.
- Added Element.appendTo(parent) to simplify slinging elements about.
- Added support for multiple headers with the same name in Jsoup.Connect.
- Added Element.shallowClone() and Node.shallowClone(), to allow cloning nodes without getting all their children.
- Updated Element.text() and the :contains(text) selector to consider &nbsp; characters as spaces.
- Updated Jsoup.connect().timeout() to implement a total connect + combined read timeout. Previously it specified connect and buffer read times only, so to implement a combined total timeout, you had to have another thread send an interrupt.
- Improved performance of Node.addChildren() (was quadratic).
- Added missing support for template tags in tables.
- In Jsoup.Connect file uploads, added the ability to set the uploaded files' mimetype.
- Improved Node traversal, including less object creation, and partial and filtering traversor support.
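As a rough illustration of the streaming changes listed above, the sketch below fetches a page, calls bufferUp() so the same response can be parsed twice, and then uses bodyStream() to copy a second response straight to a file. It assumes jsoup 1.11.1 or later is on the classpath. The class name StreamingExample, the helpers startServer, parseTwice and saveToFile, and the throwaway local server (built on the JDK's com.sun.net.httpserver) are illustrative assumptions, not part of jsoup.

```java
import com.sun.net.httpserver.HttpServer;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class StreamingExample {

    // Hypothetical helper: serves one fixed HTML page on a free local port,
    // so the sketch does not depend on any external site.
    static HttpServer startServer(String html) {
        try {
            byte[] body = html.getBytes(StandardCharsets.UTF_8);
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                exchange.getResponseHeaders().add("Content-Type", "text/html; charset=utf-8");
                exchange.sendResponseHeaders(200, body.length);
                exchange.getResponseBody().write(body);
                exchange.close();
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Buffers the response first, so that parse() can be called more than once.
    static String[] parseTwice(String url) {
        try {
            Connection.Response res = Jsoup.connect(url).timeout(5000).execute();
            res.bufferUp(); // reads the whole body into memory
            String first = res.parse().title();
            String second = res.parse().title();
            return new String[] { first, second };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Copies the response body to a file without parsing it; returns bytes written.
    static long saveToFile(String url, Path target) {
        try {
            Connection.Response res = Jsoup.connect(url).timeout(5000).execute();
            try (InputStream in = res.bodyStream()) {
                return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startServer(
            "<html><head><title>Hello</title></head><body><p>Hi</p></body></html>");
        try {
            String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/";
            System.out.println(String.join(", ", parseTwice(url)));
            Path tmp = Files.createTempFile("jsoup-body", ".html");
            System.out.println(saveToFile(url, tmp) + " bytes saved to " + tmp);
        } finally {
            server.stop(0);
        }
    }
}
```

The timeout(5000) call here relies on the new behaviour described above, where the value bounds the whole connect and read rather than each buffer read.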
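The new DOM helpers in the list above can be tried without any network access. This is a minimal sketch, assuming jsoup 1.11.1 or later on the classpath; the class name DomAdditionsExample, the summarize method and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DomAdditionsExample {

    // Sample markup, invented for this sketch.
    static final String HTML =
        "<div id='src'><p class='note'>One</p><p class='note'>Two</p></div><div id='dest'></div>";

    static String summarize() {
        Document doc = Jsoup.parse(HTML);

        // selectFirst() returns the first match, or null if nothing matches.
        Element first = doc.selectFirst("p.note");
        Element src = doc.selectFirst("#src");
        Element dest = doc.selectFirst("#dest");

        // appendTo() moves the element so it becomes the last child of dest.
        first.appendTo(dest);

        // shallowClone() copies the tag and attributes, but none of the children.
        Element copy = src.shallowClone();

        return first.text() + "|" + dest.children().size() + "|" + src.children().size()
            + "|" + copy.id() + "|" + copy.childNodeSize();
    }

    public static void main(String[] args) {
        System.out.println(summarize());
    }
}
```

After the move, #dest holds one paragraph and #src keeps the other, while the shallow clone of #src carries its id but has no child nodes.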
Bug Fixes
- Bugfix: if a document was redecoded after character set detection, the HTML parser was not reset correctly, which could lead to an incorrect DOM.
- Bugfix: attributes with the same name but different case would be incorrectly treated as different attributes.
- Bugfix: self-closing tags for known empty elements were incorrectly treated as errors.
- Bugfix: fixed an issue where a self-closing title, noframes, or style tag would cause the rest of the page to be incorrectly parsed as data or text.
- Bugfix: fixed an issue with unknown mixed-case tags
- Bugfix: fixed an issue where the entity resources were left open after startup, causing a warning.
- Bugfix: fixed an issue where Element.getElementsByIndexLessThan(index) would incorrectly provide the root element.
- Improved parse time for pages with exceptionally deeply nested tags.
- Improvement / workaround: modified the Entities implementation to load its data from a .class vs from a jar resource. Faster, and safer on Android.
Many thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch via the mailing list or to me directly.