jsoup release 1.7.2
2013-Jan-27
jsoup 1.7.2 introduces selectors for structural pseudo CSS classes, full support for international supplementary characters, and a raft of improvements and bug fixes.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
Improvements
- Added support for supplementary characters outside of the Basic Multilingual Plane.
- Added support for structural pseudo CSS selectors, including :first-child, :last-child, :nth-child, :nth-last-child, :first-of-type, :last-of-type, :nth-of-type, :nth-last-of-type, :only-child, :only-of-type, :empty, and :root.
- Added a maximum body response size to Jsoup.Connection, to prevent running out of memory when trying to read extremely large documents. The default is 1MB.
- Refactored the HTML Cleaner to traverse rather than recurse child nodes, to avoid the risk of overflowing the stack.
- Added Element.insertChildren(int, java.util.Collection), to easily insert a list of child nodes at a specific index.
- Added Node.childNodesCopy(), to create an independent copy of a Node's children.
- When parsing in XML mode, preserve XML declarations (<?xml ... ?>).
- Introduced Parser.parseXmlFragment(), to allow easy parsing of XML fragments.
- Allow Whitelist test methods to be extended.
- Added Document.OutputSettings.outline mode, to aid HTML debugging by printing out in outline mode, similar to browser HTML inspectors.
- When parsing, allow all tags to self-close. Tags that aren't expected to self-close will get an end tag.
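For background on the supplementary character item above: in Java, a code point outside the Basic Multilingual Plane is stored as two char values (a surrogate pair), so code that walks a string one char at a time can split such a character. The following is a minimal plain-Java sketch of code-point-aware handling. It is not jsoup's implementation, and the class and method names (SupplementaryDemo, escapeNonAscii) are hypothetical, chosen only for illustration.

```java
final class SupplementaryDemo {
    // Counts Unicode code points rather than UTF-16 char units.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // Illustrative only: writes every non-ASCII code point as a hex
    // numeric character reference, stepping by code point so that a
    // surrogate pair is never split in two.
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the Basic Multilingual Plane.
        String s = new String(Character.toChars(0x1F600));
        System.out.println(s.length());               // 2 (char units)
        System.out.println(codePointCount(s));        // 1 (code point)
        System.out.println(escapeNonAscii("a" + s));  // a&#x1f600;
    }
}
```

Stepping with Character.charCount rather than i++ is the design choice that keeps the pair together.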
Bug Fixes
- Fixed an issue when parsing <textarea>/RCData tags containing unescaped closing tags that would drop the trailing >.
- When cloning an Element, reset the classnames set so as not to hold a pointer to the source's.
- Corrected the javadoc for Element#child() to note that it can throw IndexOutOfBounds.
- Limit how far up the stack the formatting adoption agency algorithm will travel, to prevent the chance of a run-away parse when the HTML stack is hopelessly deep.
- Modified Element.text() to build text by traversing child nodes rather than recursing. This avoids stack-overflow errors when the DOM is very deep and the VM stack-size is low.
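The general idea behind "traversing rather than recursing" can be sketched in plain Java as follows. This is an illustration of the technique under simplifying assumptions, not jsoup's actual code: SimpleNode is a hypothetical stand-in for a DOM node, and an explicit stack on the heap replaces the call stack, so the depth of the tree no longer limits the traversal.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class IterativeText {
    // Hypothetical minimal node type; not jsoup's Node class.
    static final class SimpleNode {
        final String text;
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String text) { this.text = text; }

        SimpleNode add(SimpleNode child) {
            children.add(child);
            return this;
        }
    }

    // Collects text in document (pre-order) order using an explicit
    // stack instead of recursive calls.
    static String text(SimpleNode root) {
        StringBuilder sb = new StringBuilder();
        Deque<SimpleNode> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            SimpleNode node = stack.pop();
            sb.append(node.text);
            // Push children in reverse so the first child is popped first.
            for (int i = node.children.size() - 1; i >= 0; i--) {
                stack.push(node.children.get(i));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        SimpleNode root = new SimpleNode("a");
        root.add(new SimpleNode("b").add(new SimpleNode("c")))
            .add(new SimpleNode("d"));
        System.out.println(text(root)); // abcd
    }
}
```

A recursive version of text() would use one call frame per level of nesting, which is what can overflow on a very deep tree with a small VM stack. Here the pending nodes live in the ArrayDeque instead.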
Many thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch via the mailing list or contact me directly.