jsoup release 1.10.1

2016-Oct-23

jsoup 1.10.1 adds optional support for preserving the case of tag and attribute names, a much-requested feature. The defaults are unchanged: the HTML parser still lower-cases both tag and attribute names, and the XML parser still preserves their case. These settings are controlled with the new ParseSettings class, which is provided to the Parser. Selectors remain case-insensitive.
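As a minimal sketch of how the new setting can be used, the example below parses the same markup twice with the HTML parser: once with the defaults and once with ParseSettings.preserveCase. The class name, method names, and sample markup are illustrative assumptions, not part of the release.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.ParseSettings;
import org.jsoup.parser.Parser;

public class PreserveCaseExample {

    // Default HTML parsing: tag and attribute names are lower-cased.
    static Document parseDefault(String html) {
        return Jsoup.parse(html, "", Parser.htmlParser());
    }

    // HTML parsing with case preservation switched on via ParseSettings.
    static Document parsePreservingCase(String html) {
        Parser parser = Parser.htmlParser().settings(ParseSettings.preserveCase);
        return Jsoup.parse(html, "", parser);
    }

    public static void main(String[] args) {
        String html = "<DIV ID=Main>Hello</DIV>"; // sample input, assumed for illustration

        Document lowered = parseDefault(html);
        Document preserved = parsePreservingCase(html);

        // Tag name as stored in each document.
        System.out.println(lowered.body().child(0).tagName());
        System.out.println(preserved.body().child(0).tagName());

        // Selectors stay case-insensitive, so "div" still finds the element.
        System.out.println(preserved.select("div").size());
    }
}
```

For finer control, ParseSettings also has a two-argument constructor that takes separate flags for tag case and attribute case.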

This release also includes improved HTML specification compliance, better handling of real-world HTML and XML, lower memory use on Android, and a range of bug-fixes.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

Download jsoup now.

Improvements

  • Improved support for extended HTML entities, including supplemental characters and multiple character references. Also reduced memory consumption of the entity tables.
  • Added support for *|E wildcard namespace selectors.
  • Added support for setting multiple connection headers in Jsoup.connect at once with Connection.headers(Map).
  • Added support for setting/overriding the response character set in Connection.Response, for cases where the charset is not defined by the server, or is defined incorrectly.
  • Improved the performance of class selectors by reducing memory allocation and garbage collection.
  • Improved performance of HTML output by reducing the creation of temporary attribute list iterators.
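
A rough sketch of three of the additions above: wildcard namespace selectors, Connection.headers(Map), and overriding the response charset. The class name, method names, header values, and URL are hypothetical and only serve the illustration. The fetch method needs network access and is not called here.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class ImprovementsExample {

    // "*|E" selects elements named E under any namespace prefix, e.g. <abc:def>.
    static Elements selectAnyNamespace(Document doc, String localName) {
        return doc.select("*|" + localName);
    }

    // Set several request headers in one call instead of chaining header(...).
    static Connection withHeaders(String url) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Accept-Language", "en");
        headers.put("X-Example", "demo"); // hypothetical header, for illustration only
        return Jsoup.connect(url).headers(headers);
    }

    // Override the charset before parsing, for servers that omit or misreport it.
    static Document fetchWithCharset(Connection connection, String charset) throws IOException {
        Connection.Response response = connection.execute();
        response.charset(charset);
        return response.parse();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
            "<div><abc:def id=1>Hello</abc:def></div> <abc:def id=2>There</abc:def>");
        System.out.println(selectAnyNamespace(doc, "def").size());

        // Building the connection does not send a request yet.
        Connection connection = withHeaders("http://example.com/");
        System.out.println(connection.request().header("X-Example"));
    }
}
```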

Fixes

  • Fixed an issue when converting to the W3CDom XML, where valid (but ugly) HTML attribute names containing characters like " could not be converted into valid XML attribute names. These attribute names are now normalized if possible, or not added to the XML DOM.
  • Fixed an out-of-bounds exception when loading an empty-body URL and parsing with the XML parser.
  • Fixed an issue where attribute names starting with a slash would be parsed incorrectly.
  • Charset encoders are no longer reused from OutputSettings, which makes output thread-safe.
  • Fixed an issue in connections with a requestBody where a custom content-type header could be ignored.

Many thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch via the mailing list or contact me directly.