HTML5 and international support: jsoup version 1.2.3
2010-Aug-04
I am delighted to announce that jsoup version 1.2.3 is now available for download.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
While jsoup has always included implicit support for HTML5 tags, this release introduces explicit tag definitions. This ensures that when out-of-spec HTML5 is found (e.g. badly nested, or incorrectly parented), jsoup will create an in-spec parse tree.
HTML5 datasets are now supported with the Element.dataset() method, which provides a convenient map view of an element's dataset.
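As a rough illustration of what a dataset map contains, here is a minimal, stdlib-only sketch. It is not jsoup's implementation: the class name DatasetSketch, the helper dataset(Map), and the sample attributes are all hypothetical, and it returns a snapshot copy rather than a view. It assumes the HTML5 convention that dataset entries are the attributes named data-*, keyed without the data- prefix.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the idea behind a dataset map; not jsoup's code.
final class DatasetSketch {
    private static final String PREFIX = "data-";

    // Copies every attribute whose name starts with "data-" into a new map,
    // keyed by the attribute name with the prefix removed.
    static Map<String, String> dataset(Map<String, String> attributes) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> attr : attributes.entrySet()) {
            String name = attr.getKey();
            if (name.startsWith(PREFIX) && name.length() > PREFIX.length()) {
                result.put(name.substring(PREFIX.length()), attr.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up attributes, as if read from
        // <div id="post" data-author="alice" data-comments="12">
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("id", "post");
        attrs.put("data-author", "alice");
        attrs.put("data-comments", "12");
        System.out.println(dataset(attrs)); // prints {author=alice, comments=12}
    }
}
```

With jsoup itself, the corresponding call is Element.dataset() on a parsed element, so none of this filtering needs to be written by hand.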
Improved international support
When parsing HTML from a file or a URL, jsoup will now automatically detect the document's character set, and decode the input appropriately before parsing.
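The announcement does not spell out how detection works, so the following is only a stdlib-only sketch of one plausible approach: look for a charset declared in a meta tag and fall back to a default. The class CharsetSniffSketch, its methods, the regular expression, and the UTF-8 fallback are assumptions made for illustration, not jsoup's actual code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of meta-tag charset sniffing; not jsoup's implementation.
final class CharsetSniffSketch {
    // Finds charset=NAME inside a <meta ...> tag, for example
    // <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9][\\w.:-]*)",
        Pattern.CASE_INSENSITIVE);

    static Charset detect(byte[] html) {
        // ISO-8859-1 maps every byte to one char, so the ASCII markup
        // survives this provisional decode whatever the real encoding is.
        String provisional = new String(html, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(provisional);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return StandardCharsets.UTF_8; // assumed fallback when nothing is declared
    }

    // Decodes the bytes with whichever charset detect() settled on.
    static String decode(byte[] html) {
        return new String(html, detect(html));
    }

    public static void main(String[] args) {
        String page = "<meta http-equiv=\"Content-Type\" "
            + "content=\"text/html; charset=ISO-8859-1\"><p>caf\u00e9</p>";
        byte[] bytes = page.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect(bytes)); // prints ISO-8859-1
    }
}
```

A real detector would also consider other signals, such as the charset in an HTTP Content-Type header when fetching from a URL; this sketch leaves that out.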
You can also define the document's output character set with the Document.outputSettings().charset(String) method. This controls which characters will be HTML escaped on output, and which will be kept as-is. The output charset defaults to the input charset.
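To make the effect of the output charset concrete, here is a minimal stdlib-only sketch of the underlying idea: a character is kept as-is when the chosen charset can encode it, and escaped otherwise. The class OutputCharsetSketch and its escape method are hypothetical, and the sketch always emits numeric character references, which is an assumption; jsoup's real output may use named entities instead.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical sketch of charset-driven escaping; not jsoup's implementation.
final class OutputCharsetSketch {
    // Escapes markup characters always, and any other character only when
    // the named output charset cannot represent it.
    static String escape(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            String ch = new String(Character.toChars(codePoint));
            if (codePoint == '&') {
                out.append("&amp;");
            } else if (codePoint == '<') {
                out.append("&lt;");
            } else if (codePoint == '>') {
                out.append("&gt;");
            } else if (encoder.canEncode(ch)) {
                out.append(ch); // representable in the output charset: keep as-is
            } else {
                // not representable: fall back to a numeric character reference
                out.append("&#x").append(Integer.toHexString(codePoint)).append(';');
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00e9 and \u00e8 are accented letters that US-ASCII cannot encode
        System.out.println(escape("caf\u00e9 & cr\u00e8me", "US-ASCII"));
        // prints caf&#xe9; &amp; cr&#xe8;me
    }
}
```

Passing "UTF-8" instead of "US-ASCII" leaves the accented letters untouched and escapes only the ampersand, which mirrors the trade-off the output charset setting controls.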
Other improvements and bug fixes
I've added two new selectors:
- namespace|element: finds elements by tag name in a namespace
- [^attributePrefix]: finds elements that have an attribute name starting with a prefix
- Added support for namespaced elements (<fb:name>) and selectors to find them (
- Implemented the
- Improved implicit table element handling (particularly around
- Improved HTML output format for empty elements and auto-detected self closing tags
- Changed DT & DD tags to block-mode tags, to follow practice over spec
- Added support for tag names with
- Handle tags with internal trailing space (
- Fixed support for character class regular expressions in the
All told, this is a big release for jsoup. Many thanks to everyone who has contributed by sending in your suggestions, questions, and bugs; by writing about jsoup on community sites and blogs; and simply by using it.