jsoup 1.6.2 released

2012-Mar-27

I am very happy to announce that jsoup 1.6.2 has been released and is now available for download. It's been a long time between releases, and as some level of recompense I'm launching a great big bag of bug fixes, a relaxed XML parse mode, functionality tweaks, and memory improvements.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

Relaxed XML parser

Sometimes you want the convenience of the jsoup parser API, but for XML content. To date, jsoup hasn't supported that use-case, as it always enforces a valid HTML parse and tree. New in 1.6.2 is a relaxed XML parser, which ignores HTML parsing rules and creates a simple tree from the input.
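As a rough sketch of how the relaxed XML parser can be used: the snippet below passes Parser.xmlParser() to Jsoup.parse() and then queries the resulting tree with a CSS selector. The class name, helper method, and sample feed are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

// Hypothetical helper class, not part of jsoup.
class XmlParseSketch {
    // Parses the input with the relaxed XML parser and returns the text of the
    // first element matching the CSS query, or "" if nothing matches.
    static String firstText(String xml, String cssQuery) {
        // Empty base URI: this sample has no relative links to resolve.
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Elements matches = doc.select(cssQuery);
        return matches.isEmpty() ? "" : matches.first().text();
    }

    public static void main(String[] args) {
        // Made-up sample document.
        String xml = "<feed><entry><title>First</title><id>42</id></entry></feed>";
        System.out.println(firstText(xml, "entry > title"));
        System.out.println(firstText(xml, "entry > id"));
    }
}
```

No html, head, or body elements are added around the input, so selectors run against the tree exactly as it was written.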

Improvements

Added a simplified XML parsing mode, which can usefully parse valid and invalid XML, but does not enforce any HTML document structure or special tag behaviour.

Added the optional ability to track errors when tokenising and parsing.
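Error tracking is off by default. A minimal sketch of switching it on, assuming you create your own Parser instance and hand it to Jsoup.parse(); the class name, helper, and sample markup are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.parser.ParseError;
import org.jsoup.parser.Parser;

// Hypothetical helper class, not part of jsoup.
class ParseErrorSketch {
    // Parses the HTML with error tracking enabled and returns up to maxErrors
    // messages, each prefixed with the input position it was reported at.
    static List<String> collectErrors(String html, int maxErrors) {
        Parser parser = Parser.htmlParser();
        parser.setTrackErrors(maxErrors);
        Jsoup.parse(html, "", parser);

        List<String> messages = new ArrayList<>();
        for (ParseError error : parser.getErrors()) {
            messages.add(error.getPosition() + ": " + error.getErrorMessage());
        }
        return messages;
    }

    public static void main(String[] args) {
        // Attributes on an end tag are a parse error.
        for (String message : collectErrors("<p>One</p href='no'>", 10)) {
            System.out.println(message);
        }
    }
}
```

The cap passed to setTrackErrors() bounds how many errors are retained, which keeps memory use predictable on badly broken input.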

Added a Connection.cookies(Map) method (used via Jsoup.connect()), to set multiple cookies at once, possibly from a prior request.
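A small sketch of setting several cookies in one call. The URL, cookie names, and values are placeholders. Nothing is sent over the network here; the connection is only configured and then inspected.

```java
import java.util.HashMap;
import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

// Hypothetical helper class, not part of jsoup.
class CookieSketch {
    // Placeholder cookies. In practice these might come from an earlier
    // response, e.g. the map returned by Connection.Response.cookies().
    static Map<String, String> sampleCookies() {
        Map<String, String> cookies = new HashMap<>();
        cookies.put("session", "abc123");
        cookies.put("lang", "en");
        return cookies;
    }

    // Builds a connection with every cookie in the map attached.
    static Connection withCookies(String url, Map<String, String> cookies) {
        return Jsoup.connect(url).cookies(cookies);
    }

    public static void main(String[] args) {
        Connection conn = withCookies("http://example.com/", sampleCookies());
        // Inspect the pending request; call conn.get() to actually fetch.
        System.out.println(conn.request().cookies());
    }
}
```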

Added Element.textNodes() and Element.dataNodes(), to easily access an element's child text nodes and data nodes.
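To illustrate the difference between the two accessors, here is a sketch with made-up markup and helper names: textNodes() returns only the text that is a direct child of the element, and dataNodes() returns the raw contents of elements such as script.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.DataNode;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

// Hypothetical helper class, not part of jsoup.
class TextNodesSketch {
    // Trimmed text of each direct child text node of the first match.
    static List<String> ownText(String html, String cssQuery) {
        List<String> parts = new ArrayList<>();
        Element el = Jsoup.parse(html).select(cssQuery).first();
        if (el == null) return parts;
        for (TextNode node : el.textNodes()) {
            parts.add(node.text().trim());
        }
        return parts;
    }

    // Concatenated raw data of the first script element, or "" if none.
    static String scriptData(String html) {
        Element script = Jsoup.parse(html).select("script").first();
        if (script == null) return "";
        StringBuilder data = new StringBuilder();
        for (DataNode node : script.dataNodes()) {
            data.append(node.getWholeData());
        }
        return data.toString();
    }

    public static void main(String[] args) {
        // "Two" sits inside the span, so it is not a child text node of p.
        System.out.println(ownText("<p>One <span>Two</span> Three <br> Four</p>", "p"));
        System.out.println(scriptData("<script>var x = 1;</script>"));
    }
}
```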

Added an example program that demonstrates how to format HTML as plain-text, and the use of the NodeVisitor interface.

Added Node.traverse() and Elements.traverse() methods, to iterate through a node's descendants.
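A minimal sketch of a depth-first walk with traverse() and a NodeVisitor. The visitor's head() callback fires when a node is first reached and tail() fires after its descendants have been visited. The class name and the outline format are my own invention for this example.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeVisitor;

// Hypothetical helper class, not part of jsoup.
class TraverseSketch {
    // Returns "depth:tagName" for every element under body, in document order.
    static List<String> elementOutline(String html) {
        final List<String> lines = new ArrayList<>();
        Jsoup.parse(html).body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                // Skip text nodes and the like; record elements only.
                if (node instanceof Element) {
                    lines.add(depth + ":" + node.nodeName());
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node in this sketch.
            }
        });
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(elementOutline("<div><p>Hi <b>there</b></p></div>"));
    }
}
```

Depth is counted from the node traverse() was called on, so body is at depth 0 here.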

Updated Jsoup.connect() so that when requests made as POSTs are redirected, the redirect is followed as a GET.

Updated the Cleaner and whitelists to optionally preserve relative links in elements, instead of converting them to absolute links.

Updated the Cleaner to support custom allowed protocols such as "cid:" and "data:".

Updated handling of base href tags, to act on only the first one seen when parsing, to align with modern browsers.
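To see the effect of the base href change, here is a sketch using invented host names: when a document carries two base tags, relative links should resolve against the first one.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup.
class BaseHrefSketch {
    // Parses the HTML and returns the absolute URL of the first link,
    // or "" if the document has no link with an href.
    static String resolveFirstLink(String html, String fallbackBaseUri) {
        Document doc = Jsoup.parse(html, fallbackBaseUri);
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        // Made-up markup with two competing base tags.
        String html = "<head><base href='http://first.example/'>"
                + "<base href='http://second.example/'></head>"
                + "<body><a href='page.html'>Page</a></body>";
        System.out.println(resolveFirstLink(html, "http://fallback.example/"));
    }
}
```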

Updated Node.setBaseUri(), to recursively set on all the node's descendants.
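A short sketch of what the recursive behaviour means in practice, with placeholder markup and URLs: setting the base URI on body is enough for a link nested inside it to resolve to an absolute URL.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup.
class SetBaseUriSketch {
    // Parses HTML without a base URI, sets one on body afterwards, and
    // returns the absolute URL of the first link ("" if there is none).
    static String rebase(String html, String newBaseUri) {
        Document doc = Jsoup.parse(html);
        doc.body().setBaseUri(newBaseUri);
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        System.out.println(rebase("<div><a href='/docs'>Docs</a></div>", "http://example.com/app/"));
    }
}
```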

Bug fixes

Fixed an issue where all HTML parse errors were being tracked as new objects, creating high memory pressure on low-memory devices.

Fixed handling of null characters within comments.

Tweaked escaped entity detection in attributes to not treat &entity_... as an entity form.

Fixed doctype tokeniser to allow whitespace between name and public identifier.

Fixed issue where comments within a table tag would be duplicate-fostered into body.

Fixed an issue where a spurious byte-order-mark at the start of a document would cause the parser to miss head contents.

Fixed an issue where content after a frameset could cause a NullPointerException. The parser now follows the spec and ignores the trailing content.

Tweaked whitespace checks to align with HTML spec.

Tweaked HTML output of closing script and style tags to not add an extraneous newline when pretty-printing.

Substantially reduced default memory allocation within Node.outerHtml, to reduce memory pressure when serialising smaller DOMs.

Many thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch via the mailing list or to me directly.