jsoup release 1.7.2
2013-Jan-27
jsoup 1.7.2 introduces selectors for structural pseudo CSS classes, full support for international supplementary characters, and a raft of improvements and bug fixes.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
Improvements
- Added support for supplementary characters outside of the Basic Multilingual Plane.
- Added support for structural pseudo CSS selectors, including :first-child, :last-child, :nth-child, :nth-last-child, :first-of-type, :last-of-type, :nth-of-type, :nth-last-of-type, :only-child, :only-of-type, :empty, and :root.
- Added a maximum body response size to Jsoup.Connection, to prevent running out of memory when trying to read extremely large documents. The default is 1MB.
- Refactored the HTML Cleaner to traverse rather than recurse child nodes, to avoid the risk of overflowing the stack.
- Added Element.insertChildren(int, java.util.Collection), to easily insert a list of child nodes at a specific index.
- Added Node.childNodesCopy(), to create an independent copy of a Node's children.
- When parsing in XML mode, preserve XML declarations (<?xml ... ?>).
- Introduced Parser.parseXmlFragment(), to allow easy parsing of XML fragments.
- Allow Whitelist test methods to be extended.
- Added Document.OutputSettings.outline mode, to aid HTML debugging by printing out in outline mode, similar to browser HTML inspectors.
- When parsing, allow all tags to self-close. Tags that aren't expected to self-close will get an end tag.
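As a rough illustration of how a few of these additions fit together, here is a minimal sketch. It assumes jsoup 1.7.2 or later is on the classpath. The class name Jsoup172Tour, the sample markup, and the example.com URL are made up for this example, and no network request is actually sent.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;

import java.util.List;

public class Jsoup172Tour {
    // Sample markup, invented for this sketch.
    static final String HTML = "<ul><li>one</li><li>two</li><li>three</li></ul>";

    /** Runs a CSS query against the sample list and returns the combined text. */
    static String select(String css) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(css).text();
    }

    /** Parses an XML fragment and inserts the resulting nodes at index 0 of the list. */
    static String insertAtFront() {
        Document doc = Jsoup.parse(HTML);
        Element list = doc.select("ul").first();
        List<Node> parsed = Parser.parseXmlFragment("<li>zero</li>", "");
        list.insertChildren(0, parsed);
        return list.child(0).text();
    }

    /** Renders the sample document with outline mode switched on. */
    static String outlined() {
        Document doc = Jsoup.parse(HTML);
        doc.outputSettings().outline(true);
        return doc.body().html();
    }

    /** Configures a larger body limit (2MB here); the URL is a placeholder and is never fetched. */
    static int configuredLimit() {
        Connection conn = Jsoup.connect("http://example.com/").maxBodySize(2 * 1024 * 1024);
        return conn.request().maxBodySize();
    }

    public static void main(String[] args) {
        System.out.println(select("li:first-child"));
        System.out.println(select("li:nth-child(2)"));
        System.out.println(select("li:last-child"));
        System.out.println(insertAtFront());
        System.out.println(configuredLimit());
        System.out.println(outlined());
    }
}
```

The structural selectors combine with ordinary tag selectors, so li:nth-child(2) only matches an li that is the second child of its parent.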
Bug Fixes
- Fixed an issue when parsing <textarea>/RCData tags containing unescaped closing tags that would drop the trailing >.
- When cloning an Element, reset the classnames set so as not to hold a pointer to the source's.
- Corrected the javadoc for Element#child() to note that it can throw IndexOutOfBounds.
- Limit how far up the stack the formatting adoption agency algorithm will travel, to prevent the chance of a run-away parse when the HTML stack is hopelessly deep.
- Modified Element.text() to build text by traversing child nodes rather than recursing. This avoids stack-overflow errors when the DOM is very deep and the VM stack-size is low.
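To show why traversal helps with very deep trees, here is a small sketch of the general technique: walking a tree with an explicit stack on the heap instead of recursive method calls. It is not jsoup's actual implementation. SimpleNode, IterativeText, and deepChain are hypothetical names invented for this example, and it uses only the Java standard library.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class IterativeText {
    /** Hypothetical minimal tree node; this is not jsoup's Node class. */
    static final class SimpleNode {
        final String text;
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String text) { this.text = text; }

        SimpleNode add(SimpleNode child) {
            children.add(child);
            return this;
        }
    }

    /** Collects text in document (pre-order) order using an explicit stack, with no recursion. */
    static String text(SimpleNode root) {
        StringBuilder sb = new StringBuilder();
        Deque<SimpleNode> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            SimpleNode node = stack.pop();
            sb.append(node.text);
            // Push children in reverse so the first child is popped first.
            for (int i = node.children.size() - 1; i >= 0; i--) {
                stack.push(node.children.get(i));
            }
        }
        return sb.toString();
    }

    /** Builds a chain of empty nodes of the given depth with a single "leaf" at the bottom. */
    static SimpleNode deepChain(int depth) {
        SimpleNode root = new SimpleNode("");
        SimpleNode current = root;
        for (int i = 0; i < depth; i++) {
            SimpleNode child = new SimpleNode("");
            current.add(child);
            current = child;
        }
        current.add(new SimpleNode("leaf"));
        return root;
    }

    public static void main(String[] args) {
        SimpleNode root = new SimpleNode("a")
                .add(new SimpleNode("b").add(new SimpleNode("c")))
                .add(new SimpleNode("d"));
        System.out.println(text(root)); // prints "abcd"
        // A chain this deep would be likely to overflow the call stack of a naive recursive walk.
        System.out.println(text(deepChain(200_000))); // prints "leaf"
    }
}
```

Because the pending nodes live in a heap-allocated deque, the depth of the tree is limited by available memory rather than by the VM stack size.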
Many thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch via the mailing list or to me directly.