jsoup release 1.7.2

2013-Jan-27

jsoup 1.7.2 introduces selectors for structural CSS pseudo-classes, full support for international supplementary characters, and a raft of improvements and bug fixes.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

Improvements

Added support for supplementary characters outside of the Basic Multilingual Plane.
Added support for structural CSS pseudo-class selectors, including :first-child, :last-child, :nth-child, :nth-last-child, :first-of-type, :last-of-type, :nth-of-type, :nth-last-of-type, :only-child, :only-of-type, :empty, and :root.
Added a maximum body response size to Jsoup.Connection, to prevent running out of memory when trying to read extremely large documents. The default is 1 MB.
Refactored the HTML Cleaner to traverse rather than recurse child nodes, to avoid the risk of overflowing the stack.
Added Element.insertChildren(int, java.util.Collection), to easily insert a list of child nodes at a specific index.
Added Node.childNodesCopy(), to create an independent copy of a Node's children.
When parsing in XML mode, preserve XML declarations (<?xml ... ?>).
Introduced Parser.parseXmlFragment(), to allow easy parsing of XML fragments.
Allow Whitelist test methods to be extended.
Added an outline mode to Document.OutputSettings, to aid HTML debugging by printing the document as an outline, similar to browser HTML inspectors.
When parsing, allow all tags to self-close. Tags that aren't expected to self-close will get an end tag.
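
A few of the additions above can be sketched in Java as follows. This is a minimal illustration and not taken from the jsoup documentation: the class name, method names, and sample markup are hypothetical, and it assumes jsoup 1.7.2 or later is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;

import java.util.List;

// Illustrative sketch only; the class name and sample markup are hypothetical.
class ReleaseNotesSketch {
    static final String HTML = "<ul><li>one</li><li>two</li><li>three</li></ul>";

    // Structural pseudo selector: text of the <li> that is its parent's first child.
    static String firstItem() {
        return Jsoup.parse(HTML).select("li:first-child").text();
    }

    // Structural pseudo selector: text of the <li> that is its parent's last child.
    static String lastItem() {
        return Jsoup.parse(HTML).select("li:last-child").text();
    }

    // :nth-child(odd) matches the 1st and 3rd items of the sample list.
    static int oddItemCount() {
        return Jsoup.parse(HTML).select("li:nth-child(odd)").size();
    }

    // childNodesCopy() plus insertChildren(): copy the list items and
    // insert the copies at index 0, ahead of the originals.
    static int duplicatedItemCount() {
        Document doc = Jsoup.parse(HTML);
        Element list = doc.select("ul").first();
        List<Node> copies = list.childNodesCopy();
        list.insertChildren(0, copies);
        return list.children().size();
    }

    // Parser.parseXmlFragment(): parse an XML fragment into its top-level nodes.
    // The empty string is used as the base URI here.
    static int xmlFragmentNodeCount() {
        List<Node> nodes = Parser.parseXmlFragment("<item>a</item><item>b</item>", "");
        return nodes.size();
    }

    public static void main(String[] args) {
        System.out.println(firstItem());            // one
        System.out.println(lastItem());             // three
        System.out.println(oddItemCount());         // 2
        System.out.println(duplicatedItemCount());  // 6
        System.out.println(xmlFragmentNodeCount()); // 2
    }
}
```

Because childNodesCopy() returns copies rather than the original nodes, inserting them leaves the three original items in place and the list ends up with six children.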

Bug Fixes

Fixed an issue where the trailing > was dropped when parsing <textarea>/RCDATA tags containing unescaped closing tags.
When cloning an Element, reset the classnames set so that the clone does not hold a reference to the source's set.
Corrected the javadoc for Element#child() to note that it can throw IndexOutOfBoundsException.
Limit how far up the stack the formatting adoption agency algorithm will travel, to prevent the chance of a run-away parse when the HTML stack is hopelessly deep.
Modified Element.text() to build text by traversing child nodes rather than recursing. This avoids stack-overflow errors when the DOM is very deep and the VM stack-size is low.
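
The clone and Element.text() fixes above can be exercised with a small check like the one below. It is a sketch under stated assumptions, not a test from the jsoup source tree: the class and method names are hypothetical, the nesting depth is an arbitrary choice, and jsoup 1.7.2 or later is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Illustrative sketch only; class and method names are hypothetical.
class BugFixSketch {
    // After cloning, adding a class to the copy should not affect the original.
    static boolean cloneIsIndependent() {
        Element original = Jsoup.parse("<div class=\"a\"></div>").select("div").first();
        original.classNames(); // read the class names before cloning
        Element copy = original.clone();
        copy.addClass("b");
        return copy.hasClass("b") && !original.hasClass("b");
    }

    // Build a chain of nested <div> elements programmatically, put text in the
    // innermost one, and read it back from the body with Element.text().
    static String deepText(int depth) {
        Document doc = Jsoup.parse("");
        Element body = doc.body();
        Element current = body;
        for (int i = 0; i < depth; i++) {
            current = current.appendElement("div");
        }
        current.text("deep");
        return body.text();
    }

    public static void main(String[] args) {
        System.out.println(cloneIsIndependent()); // true
        System.out.println(deepText(10000));      // deep
    }
}
```

Whether a recursive text() would actually overflow at a given depth depends on the VM stack size, so the depth here only illustrates the shape of the problem.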

Many thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch via the mailing list or contact me directly.