jsoup Java HTML Parser release 1.11.3

2018-Apr-15

jsoup 1.11.3 is out now, with a range of bug fixes and improvements for interoperability with hopeless HTML and substandard servers.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

Download jsoup now.

Improvements

  • CDATA sections are now treated as whitespace preserving (regardless of the containing element), and are round-tripped into output HTML.
  • Added support for Deflate encoding.
  • When parsing <pre> tags, skip the first newline if present.
  • Support nested quotes for attribute selection queries.
  • Character references from Windows-1252 that are not valid Unicode are mapped to the appropriate Unicode replacement.
  • Accept a custom SSL socket factory in Jsoup.Connection. Note that Connection.validateTLSCertificates() will be removed in the next release; Connection.sslSocketFactory(SSLSocketFactory sslSocketFactory) provides a path to implement a workaround if you need to keep using a similar approach.
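
As a rough illustration of the last item, the sketch below passes a custom SSLSocketFactory to a connection. It assumes jsoup 1.11.3 is on the classpath. The class name, the connectWith helper, and the example.com URL are placeholders for illustration, and the JVM's default SSLContext stands in for whatever factory you would actually build. No request is sent.

```java
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;

import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class CustomSslFactoryExample {

    // Hypothetical helper: builds (but does not execute) a connection
    // that will use the supplied socket factory for HTTPS requests.
    static Connection connectWith(String url, SSLSocketFactory factory) {
        return Jsoup.connect(url).sslSocketFactory(factory);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Assumption: the JVM default factory is used here only as a stand-in;
        // a real workaround would build its own SSLContext and trust managers.
        SSLSocketFactory factory = SSLContext.getDefault().getSocketFactory();
        Connection connection = connectWith("https://example.com/", factory);

        // The request object reports the factory that was configured.
        System.out.println(connection.request().sslSocketFactory() == factory);
    }
}
```

Calling get() or execute() on the returned connection would then open HTTPS sockets through the supplied factory.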

Bug Fixes

  • Bugfix: a "Mark has been invalidated" exception was thrown when parsing some URLs on Android <= 6.
  • Bugfix: The Element.text() for <div>One</div>Two was OneTwo, not One Two.
  • Bugfix: boolean attributes with empty string values were not collapsing in HTML output.
  • Bugfix: when using the XML Parser set to normalize tags to lowercase, uppercase closing tags were not correctly handled.
  • Bugfix: when parsing from a URL, an end tag could be read incorrectly if it started on a buffer boundary.
  • Bugfix: when parsing from a URL, if the remote server failed to complete its write (i.e. it wrote less than the Content-Length header promised on a gzipped stream), the parse method would incorrectly throw an unchecked exception. It now throws the declared IOException.
  • Bugfix: leaf nodes (such as text nodes) were throwing an unsupported operation exception on childNodes(), instead of just returning an empty list.
  • Bugfix: documents with a leading UTF-8 BOM did not have that BOM consumed, so it acted as a zero width no-break space, which could impact the parse tree.
  • Bugfix: when parsing an invalid XML declaration, the parse would fail.
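
Two of the fixes above are easy to check locally. The following is a minimal sketch, assuming jsoup 1.11.3 or later is on the classpath; the class and method names are made up for illustration.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

public class BugFixExamples {

    // Text of a document where a block element is followed directly by text.
    static String blockThenText() {
        Document doc = Jsoup.parse("<div>One</div>Two");
        return doc.text();
    }

    // Child nodes of a leaf node: here, the text node inside the <p>.
    static List<Node> leafChildren() {
        Document doc = Jsoup.parse("<p>Hello</p>");
        Node textNode = doc.select("p").first().childNode(0);
        return textNode.childNodes();
    }

    public static void main(String[] args) {
        // With the fix, a space separates the block text from the trailing text.
        System.out.println(blockThenText());
        // With the fix, this is an empty list rather than an exception.
        System.out.println(leafChildren().isEmpty());
    }
}
```

On earlier releases, the first call returned OneTwo and the second threw an unsupported operation exception, as described in the list above.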

Many thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch via the mailing list or contact me directly.