HTML5 parser out of beta: jsoup 1.6.1 released

2011-Jul-02

I am very happy to announce that jsoup 1.6.1 has been released and is now available for download.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
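To give a feel for that API, here is a minimal sketch of extracting and then manipulating data. It assumes the jsoup jar is on the classpath; the class name `QuickTour`, its helper methods, and the sample HTML are made up for illustration and are not part of jsoup.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class; only the org.jsoup calls belong to the library.
class QuickTour {

    // Extraction: collect the href of every link inside a paragraph, using a CSS selector.
    static List<String> paragraphLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<String>();
        for (Element link : doc.select("p a[href]")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    // Manipulation: set rel="nofollow" on every link, then serialise the body again.
    static String addNofollow(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a[href]").attr("rel", "nofollow"); // jQuery-style bulk attribute setter
        return doc.body().html();
    }

    public static void main(String[] args) {
        // Sample input invented for this sketch.
        String html = "<p>See <a href=\"/docs\">the docs</a> and "
                + "<a href=\"/download\">downloads</a>.</p>";
        System.out.println(paragraphLinks(html));
        System.out.println(addNofollow(html));
    }
}
```

The selector syntax follows CSS, while chained calls such as `select(...).attr(...)` are where the jQuery flavour shows.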

HTML5 parser out of beta

This release of jsoup includes a completely re-implemented parser, based on the WHATWG HTML5 specification. jsoup now parses HTML the same way modern browsers such as Chrome, Firefox, and Safari do, so the document it builds matches what those browsers would build from the same markup. This makes scraping data more reliable and improves HTML tidying.
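As a rough illustration of what browser-style parsing means in practice, the sketch below feeds jsoup two pieces of sloppy markup and counts the elements that result. It assumes jsoup is on the classpath; `LenientParsing` and `countMatches` are hypothetical names used only here.

```java
import org.jsoup.Jsoup;

// Hypothetical example class; only Jsoup.parse and select are library calls.
class LenientParsing {

    // Parses the HTML and counts the elements matching the CSS query.
    static int countMatches(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Unclosed <p> tags: under the HTML5 rules a new <p> implicitly closes
        // the previous one, so both paragraphs end up as siblings in <body>.
        System.out.println(countMatches("<p>One<p>Two", "body > p"));

        // A table without <tbody>: the HTML5 rules insert one, as browsers do,
        // so the cell is reachable through table > tbody > tr > td.
        System.out.println(countMatches("<table><tr><td>cell</table>",
                "table > tbody > tr > td"));
    }
}
```

Selectors written against the tree a browser shows in its developer tools should therefore also match the tree jsoup produces.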

This release is a stabilised version of the 1.6.0 beta release.

Improvements and bug fixes since 1.6.0

  • Fixed Java 1.5 (and Android 2.2) compatibility.
  • Fixed an issue when parsing <script> tags in the body, where the tokeniser wouldn't switch to the InScript state, which meant that the script data wasn't parsed correctly.
  • Fixed an issue with a missing quote when serialising DocumentType nodes.
  • Fixed an issue where a single 0 character was lexed incorrectly as a null character.
  • Fixed normalisation of carriage returns to newlines on input HTML.
  • Disabled memory mapped files when loading files from disk, to improve compatibility in Windows environments.
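
Two of the fixes above are easy to check from user code: script contents in the body should be kept as data rather than parsed as markup, and HTML files should load from disk. The sketch below is one way to try both, assuming jsoup is on the classpath. The class `ScriptAndFileCheck`, its methods, and the sample HTML are invented for this example, and it says nothing about how the fixes are implemented inside jsoup.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical example class; only the org.jsoup calls belong to the library.
class ScriptAndFileCheck {

    // Sample input invented for this sketch: a script in the body whose
    // string literal looks like markup, followed by one real paragraph.
    static final String HTML = "<html><body>"
            + "<script>var s = \"<p>not markup</p>\";</script>"
            + "<p>Real paragraph</p></body></html>";

    // Returns the raw contents of the first <script> element.
    static String scriptData(String html) {
        return Jsoup.parse(html).select("script").first().data();
    }

    // Counts <p> elements; text inside the script should not add to this.
    static int paragraphCount(String html) {
        return Jsoup.parse(html).select("p").size();
    }

    // Writes the HTML to a temporary file, parses it back from disk,
    // and returns the text of the paragraphs found.
    static String paragraphTextFromDisk(String html) throws IOException {
        File file = File.createTempFile("jsoup-example", ".html");
        file.deleteOnExit();
        Writer out = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
        try {
            out.write(html);
        } finally {
            out.close();
        }
        Document doc = Jsoup.parse(file, "UTF-8");
        return doc.select("p").text();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(scriptData(HTML));
        System.out.println(paragraphCount(HTML));
        System.out.println(paragraphTextFromDisk(HTML));
    }
}
```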

If you have any suggestions for the next release, I would love to hear them; please get in touch via the mailing list, or contact me directly.