jsoup release 1.7.2

2013-Jan-27

jsoup 1.7.2 introduces selectors for structural CSS pseudo-classes, full support for international supplementary characters, and a raft of improvements and bug fixes.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

Improvements

  • Added support for supplementary characters outside of the Basic Multilingual Plane.
  • Added support for structural CSS pseudo-class selectors, including :first-child, :last-child, :nth-child, :nth-last-child, :first-of-type, :last-of-type, :nth-of-type, :nth-last-of-type, :only-child, :only-of-type, :empty, and :root.
  • Added a maximum body response size to Jsoup.Connection, to prevent running out of memory when trying to read extremely large documents. The default is 1MB.
  • Refactored the HTML Cleaner to traverse rather than recurse child nodes, to avoid the risk of overflowing the stack.
  • Added Element.insertChildren(int, java.util.Collection), to easily insert a list of child nodes at a specific index.
  • Added Node.childNodesCopy(), to create an independent copy of a Node's children.
  • When parsing in XML mode, preserve XML declarations (<?xml ... ?>).
  • Introduced Parser.parseXmlFragment(), to allow easy parsing of XML fragments.
  • Allow Whitelist test methods to be extended.
  • Added Document.OutputSettings.outline mode, which aids HTML debugging by printing the output in an outline form similar to browser HTML inspectors.
  • When parsing, allow all tags to self-close. Tags that aren't expected to self-close will get an end tag.
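
To illustrate what "outside of the Basic Multilingual Plane" means in the first item above: Java strings are sequences of UTF-16 code units, so a supplementary character occupies two char values (a surrogate pair). The sketch below uses only the Java standard library. The class and method names are hypothetical, and it is not jsoup's own parsing or escaping code. It only shows why text handling has to iterate by code point rather than by char.

```java
public class SupplementaryDemo {
    // Builds a String from a single Unicode code point (may yield two chars).
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Counts Unicode code points rather than UTF-16 code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Hypothetical helper: writes every non-ASCII code point as a numeric
    // character reference, stepping by code point so surrogate pairs stay whole.
    static String toNumericEntities(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp > 0x7F) {
                sb.append("&#x").append(Integer.toHexString(cp)).append(';');
            } else {
                sb.append((char) cp);
            }
            i += Character.charCount(cp); // advance 1 or 2 chars
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String clef = fromCodePoint(0x1D11E); // MUSICAL SYMBOL G CLEF
        System.out.println(clef.length());                 // 2 UTF-16 code units
        System.out.println(codePoints(clef));              // 1 code point
        System.out.println(toNumericEntities("a" + clef)); // a&#x1d11e;
    }
}
```

A loop that steps one char at a time would see the two surrogate halves as separate values, which is the kind of error that support for supplementary characters has to avoid.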

Bug Fixes

  • Fixed an issue where parsing <textarea>/RCData tags containing unescaped closing tags would drop the trailing >.
  • When cloning an Element, reset the class names set so that the clone does not hold a reference to the source's.
  • Corrected the javadoc for Element#child() to note that it can throw IndexOutOfBoundsException.
  • Limit how far up the stack the formatting adoption agency algorithm will travel, to prevent a runaway parse when the HTML stack is hopelessly deep.
  • Modified Element.text() to build text by traversing child nodes rather than recursing. This avoids stack-overflow errors when the DOM is very deep and the VM stack-size is low.
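
The traverse-rather-than-recurse change mentioned for the HTML Cleaner and Element.text() can be pictured with a small sketch. This is not jsoup's implementation: SimpleNode, text(), and deepChain() are hypothetical names, and the code uses only the Java standard library. It shows the general technique of replacing recursion with an explicit stack, so that tree depth is limited by heap space rather than by the VM's call stack.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class IterativeTextDemo {
    // Hypothetical minimal node type; jsoup's Node class is not used here.
    static final class SimpleNode {
        final String text; // this node's own text, possibly empty
        final List<SimpleNode> children = new ArrayList<>();
        SimpleNode(String text) { this.text = text; }
    }

    // Collects text in document order (pre-order) with an explicit stack.
    static String text(SimpleNode root) {
        StringBuilder sb = new StringBuilder();
        Deque<SimpleNode> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            SimpleNode node = stack.pop();
            sb.append(node.text);
            // Push children in reverse so the first child is visited first.
            for (int i = node.children.size() - 1; i >= 0; i--) {
                stack.push(node.children.get(i));
            }
        }
        return sb.toString();
    }

    // Builds a chain `depth` levels deep with one text leaf at the bottom.
    static SimpleNode deepChain(int depth, String leafText) {
        SimpleNode root = new SimpleNode("");
        SimpleNode current = root;
        for (int i = 0; i < depth; i++) {
            SimpleNode child = new SimpleNode("");
            current.children.add(child);
            current = child;
        }
        current.children.add(new SimpleNode(leafText));
        return root;
    }

    public static void main(String[] args) {
        // A recursive walk over a tree this deep could exhaust a small VM stack.
        SimpleNode root = deepChain(200_000, "hello");
        System.out.println(text(root)); // hello
    }
}
```

The trade-off is that the pending nodes live in a heap-allocated deque instead of in stack frames, which is why very deep documents no longer depend on the configured thread stack size.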

Many thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch via the mailing list or contact me directly.