HTML5 and international support: jsoup version 1.2.3

2010-Aug-04

I am delighted to announce that jsoup version 1.2.3 is now available for download.

jsoup is a Java library for working with real-world HTML. It provides a convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

HTML5 support

While jsoup has always included implicit support for HTML5 tags, this release introduces explicit tag definitions. This ensures that when out-of-spec HTML5 is found (e.g. badly nested, or incorrectly parented), jsoup will create an in-spec parse tree.

HTML5 datasets (data-* attributes) are now supported through the Element.dataset() method, which provides a convenient map view of an element's dataset.
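To make the idea of a dataset map concrete, here is a minimal sketch using only the JDK. It is not jsoup's implementation: the class DatasetSketch and its dataset method are hypothetical names, the attributes are modelled as a plain Map, and the result is a copy rather than a live view. It only illustrates the mapping from data-* attribute names to dataset keys with the "data-" prefix removed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; jsoup's own Element.dataset() works on real elements.
final class DatasetSketch {
    static final String PREFIX = "data-";

    // Collects every "data-xyz" attribute into a map keyed by "xyz".
    static Map<String, String> dataset(Map<String, String> attributes) {
        Map<String, String> data = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            String key = e.getKey();
            // Skip non-data attributes and a bare "data-" with no name after it.
            if (key.startsWith(PREFIX) && key.length() > PREFIX.length()) {
                data.put(key.substring(PREFIX.length()), e.getValue());
            }
        }
        return data;
    }
}
```

For an element such as <div id="a" data-name="jsoup" data-package="jar">, this sketch yields a map with the keys name and package, and the id attribute is left out.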

Improved international support

When parsing HTML from a file or a URL, jsoup will now automatically detect the document's character set, and decode the input appropriately before parsing.
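The announcement does not describe how the detection works, so the following is only a sketch of one common approach, not jsoup's actual code. It assumes the charset is declared in a <meta> tag, decodes the bytes provisionally to find that declaration, and falls back to UTF-8 when none is found. The class CharsetSniffer and its methods are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of meta-tag charset sniffing.
final class CharsetSniffer {
    // Matches charset=... inside a <meta> tag, quoted or unquoted, e.g.
    // <meta charset="utf-8"> or <meta http-equiv=... content="text/html; charset=gb2312">.
    private static final Pattern META_CHARSET =
            Pattern.compile("(?i)<meta[^>]*charset\\s*=\\s*[\"']?([\\w-]+)");

    static Charset detect(byte[] html) {
        // ISO-8859-1 maps each byte to exactly one char, so ASCII markup survives intact.
        String provisional = new String(html, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(provisional);
        if (m.find()) {
            String name = m.group(1);
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name);
                }
            } catch (IllegalArgumentException e) {
                // Illegal charset name: fall through to the default.
            }
        }
        return StandardCharsets.UTF_8; // assumed default when nothing is declared
    }

    // Decodes the input using whatever charset was detected.
    static String decode(byte[] html) {
        return new String(html, detect(html));
    }
}
```

A real implementation would also consult the HTTP Content-Type header when fetching from a URL; this sketch only looks at the document bytes.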

You can also define the document's output character set with the Document.outputSettings().charset(String) method. This controls which characters will be HTML escaped on output, and which will be kept as-is. The output charset defaults to the input charset.
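The effect of the output charset on escaping can be sketched with the JDK alone. This is an illustration under assumptions, not jsoup's escaping code: the class EscapeSketch and its escape method are hypothetical, and it emits numeric character references where a real serializer may prefer named entities. The idea is that a character is kept as-is only if the chosen charset can encode it.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical illustration of charset-aware HTML escaping.
final class EscapeSketch {
    static String escape(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '&') {
                out.append("&amp;");
            } else if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '>') {
                out.append("&gt;");
            } else if (encoder.canEncode(new String(Character.toChars(cp)))) {
                // The output charset can represent this character: keep it as-is.
                out.appendCodePoint(cp);
            } else {
                // Not representable: fall back to a numeric character reference.
                out.append("&#").append(cp).append(';');
            }
        }
        return out.toString();
    }
}
```

With this sketch, escape("café", "US-ASCII") gives "caf&#233;", while escape("café", "UTF-8") leaves the text unchanged.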

Other improvements and bug fixes

I've added two new selectors:

  • namespace|element finds elements by tagname in a namespace
  • [^attributePrefix] finds elements that have an attribute name starting with a prefix

Also:

  • Added support for namespaced elements (<fb:name>) and selectors to find them (fb|name)
  • Improved implicit table element handling (particularly around thead, tbody, and tfoot)
  • Improved HTML output format for empty elements and auto-detected self closing tags
  • Changed DT & DD tags to block-mode tags, to follow practice over spec
  • Added support for tag names with - and _ (<abc_foo>, <abc-foo>)
  • Handle tags with internal trailing space (<foo >)
  • Fixed support for character class regular expressions in the [attr~=regex] selector

All told, this is a big release for jsoup. Many thanks to everyone who has contributed by sending in your suggestions, questions, and bugs; by writing about jsoup on community sites and blogs; and simply by using it.

If you have any suggestions for the next release, I would love to hear them; please get in touch via the mailing list or contact me directly.