Brand new HTML5 parser: jsoup 1.6.0 released

2011-Jun-13

I am very happy to announce that jsoup 1.6.0 has been released and is now available for download.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
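For readers new to the library, here is a minimal sketch of what those three styles look like together. The sample markup, the example.com base URI, and the class and method names are invented for illustration; it assumes the jsoup jar is on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupTour {
    // Hypothetical sample markup, made up for this sketch.
    static final String SAMPLE_HTML =
        "<html><head><title>Example page</title></head><body>"
      + "<div id='content'><a href='/news'>News</a> <a href='/about'>About</a></div>"
      + "</body></html>";

    // DOM lookup by id, then a CSS selector, then absolute URL resolution.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<String>();
        Element content = doc.getElementById("content");
        if (content == null) {
            return urls;
        }
        for (Element link : content.select("a[href]")) {
            // The "abs:" prefix resolves the attribute against the base URI.
            urls.add(link.attr("abs:href"));
        }
        return urls;
    }

    // jQuery-like chaining: set an attribute and add a class on every match.
    static String markLinks(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a[href]").attr("rel", "nofollow").addClass("external");
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println(absoluteLinks(SAMPLE_HTML, "http://example.com/"));
        System.out.println(markLinks(SAMPLE_HTML));
    }
}
```

The selector call returns a collection of elements, so the same chain works whether the document has one matching link or hundreds.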

New HTML5 parser

This release of jsoup includes a completely re-implemented parser, based on the WHATWG HTML5 specification. jsoup now parses HTML the same way modern browsers such as Chrome, Firefox, and Safari do. This helps users scrape data more readily, and improves HTML tidying.
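A small sketch of what browser-style parsing means in practice: unclosed tags and missing html, head, and body elements are repaired into a complete tree. The input string and class name here are made up for illustration, and the exact serialised output may vary between jsoup versions, so none is shown.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class TagSoupDemo {
    // Counts the <p> elements found after parsing a fragment with unclosed tags.
    static int paragraphCount(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("p").size();
    }

    public static void main(String[] args) {
        // Hypothetical tag soup: no html/head/body, and neither <p> is closed.
        String soup = "<p>One<p>Two";
        Document doc = Jsoup.parse(soup);

        // The second <p> start tag implicitly closes the first, as in a browser.
        System.out.println("paragraphs: " + paragraphCount(soup));

        // The parser also supplies the document shell, so this prints tidied HTML.
        System.out.println(doc.html());
    }
}
```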

As this is such a large change since the previous 1.5.2 release, jsoup is being released as version 1.6.0. The .0 denotes a beta release: if you run into problems parsing documents, please file a bug, particularly if parsing under 1.5.2 worked OK.

Other improvements and bug fixes

  • When parsing files from disk, files are loaded via memory mapping, to increase parse speed.
  • Reduced memory overhead and lowered garbage collector pressure with Attribute, Node and Element model optimisations.
  • Fixed cookie handling issue in jsoup.Connect where empty cookies would cause a validation exception.
  • Added jsoup.Connect configuration options to allow HTTP errors to be ignored, and the content-type to be ignored.
  • Now handles unclosed <title> tags in document by breaking out of the title at the next start tag, instead of eating up to the end of the document.
  • Added OSGi bundle support to the jsoup package jar.

If you have any suggestions for the next release, I would love to hear them; please get in touch via the mailing list or contact me directly.
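For anyone curious about the memory-mapped file loading mentioned in the list of improvements, the general technique can be sketched with the standard java.nio classes. This is not jsoup's internal code: the class and method names are hypothetical, and the sketch assumes Java 7 or later and a file smaller than 2 GB.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedFileReader {
    // Maps the whole file into memory read-only and decodes it to a String.
    // Hypothetical helper for illustration; files over Integer.MAX_VALUE bytes
    // cannot be mapped in a single call and are not handled here.
    static String readMapped(Path path, Charset charset) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = channel.size();
            if (size == 0) {
                return ""; // nothing to map
            }
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            // decode() reads from the mapped region without an intermediate byte[] copy.
            return charset.decode(buffer).toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary file, then read it back through the mapping.
        Path tmp = Files.createTempFile("mapped-demo", ".html");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, "<p>Hello, mapped file</p>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readMapped(tmp, StandardCharsets.UTF_8));
    }
}
```

The resulting string could then be handed to a parser. The benefit of mapping is that the operating system pages the file in directly, rather than the program copying it through a stream buffer.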