jsoup pep release 1.7.1

2012-Sep-23 Full of vim and vigour: parsing HTML 2.3x faster, jsoup 1.7.1 is now available for download. I have profiled the parse execution of thousands of documents, optimised every hotspot to streamline the parser, and significantly reduced node memory consumption. Along the way, I've also trimmed the heap memory retained when retrieving data from parsed documents, reduced garbage collection when selecting elements, and removed lock contention so that jsoup can run concurrently on as many threads as are available.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

This release of jsoup brings a number of functional improvements in addition to its speed and stability changes.

Improvements

  • Improved parse time, now 2.3x faster than the previous release, with lower memory consumption.
  • Reduced memory consumption and garbage collection when selecting elements.
  • Removed an unnecessary synchronisation in Tag.valueOf, allowing multi-threaded parsing to run faster.
  • The output of document.title() is now whitespace-normalised.
  • In Jsoup.connect, fail faster if the return content type is not supported.
  • Made entity decoding less greedy, so that non-entities are less likely to be incorrectly treated as entities.
  • In Jsoup.connect, enforce a connection disconnect after every connect. This precludes keep-alive connections to the same host, but in practice many implementations will leak connections, particularly on error.
  • If a server doesn't specify a content-type header, treat that as OK.
  • If a server returns an unsupported character-set header, attempt to decode the content with the default charset (UTF-8), instead of bailing with an unsupported charset exception.
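
To illustrate the charset fallback described in the last item, here is a minimal sketch using only the standard java.nio.charset API. It is not jsoup's actual code: the class CharsetFallback and its methods resolve and decode are hypothetical names chosen for this example, and it assumes that a missing, malformed, or unsupported charset name should all fall back to UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of jsoup.
final class CharsetFallback {

    // Returns the declared charset if this JVM supports it, otherwise UTF-8.
    static Charset resolve(String declaredName) {
        if (declaredName == null || declaredName.trim().isEmpty()) {
            return StandardCharsets.UTF_8; // no charset declared
        }
        String name = declaredName.trim();
        try {
            return Charset.isSupported(name)
                    ? Charset.forName(name)
                    : StandardCharsets.UTF_8; // legal name, but unknown to this JVM
        } catch (IllegalCharsetNameException e) {
            return StandardCharsets.UTF_8; // name contains illegal characters
        }
    }

    // Decodes a response body with the resolved charset instead of throwing.
    static String decode(byte[] body, String declaredName) {
        return new String(body, resolve(declaredName));
    }

    public static void main(String[] args) {
        byte[] body = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // An unsupported charset name no longer aborts decoding.
        System.out.println(decode(body, "no-such-charset"));
        // A supported charset name is honoured as declared.
        System.out.println(resolve("ISO-8859-1"));
    }
}
```

The trade-off in this approach is that content in a genuinely exotic encoding may decode with some wrong characters, but the caller still gets a document to work with rather than an exception.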

Bug fixes

  • Fixed an issue when determining the Windows-1254 character-set from a meta tag when run in the Turkish locale.
  • Fixed whitespace preservation in textarea tags.
  • Fixed an issue that prevented frameset documents from being cleaned by the Cleaner.
  • Fixed an issue when normalising whitespace for strings containing high-surrogate characters.
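
For readers unfamiliar with the class of bug behind the Turkish-locale and high-surrogate fixes, the following sketch reproduces both pitfalls with plain JDK calls. It is an assumption about the general mechanism, not a copy of jsoup's code: the class BugFixSketches and its methods are hypothetical names used only here.

```java
import java.util.Locale;

// Hypothetical illustration only; not jsoup's implementation.
final class BugFixSketches {

    // Locale-sensitive upper-casing: in the Turkish locale, 'i' becomes
    // dotted capital I (U+0130), so the charset name stops matching.
    static String naiveUpper(String charsetName, Locale locale) {
        return charsetName.toUpperCase(locale);
    }

    // Locale-independent upper-casing gives the same result everywhere.
    static String safeUpper(String charsetName) {
        return charsetName.toUpperCase(Locale.ENGLISH);
    }

    // Collapses runs of whitespace to a single space, walking the string by
    // code point so that surrogate pairs are never split.
    static String normaliseWhitespace(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        boolean lastWasWhite = false;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
            if (Character.isWhitespace(cp)) {
                if (!lastWasWhite) sb.append(' ');
                lastWasWhite = true;
            } else {
                sb.appendCodePoint(cp);
                lastWasWhite = false;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // The naive result contains U+0130 and no longer equals "WINDOWS-1254".
        System.out.println(naiveUpper("windows-1254", turkish).equals("WINDOWS-1254")); // false
        System.out.println(safeUpper("windows-1254")); // WINDOWS-1254
        // U+1F600 is encoded as the surrogate pair D83D DE00.
        System.out.println(normaliseWhitespace("a \n\t\uD83D\uDE00  b"));
    }
}
```

Passing an explicit locale such as Locale.ENGLISH is the usual remedy whenever case conversion is applied to protocol identifiers rather than to human-readable text.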

Many thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch via the mailing list or contact me directly.