jsoup Java HTML Parser release 1.11.3
2018-Apr-15
jsoup 1.11.3 is out now, with a range of bug fixes and improvements for interoperability with hopeless HTML and substandard servers.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
Download jsoup now.
Improvements
- CDATA sections are now treated as whitespace preserving (regardless of the containing element), and are round-tripped into output HTML.
- Added support for Deflate encoding.
- When parsing <pre> tags, skip the first newline if present.
- Support nested quotes for attribute selection queries.
- Character references from Windows-1252 that are not valid Unicode are mapped to the appropriate Unicode replacement.
- Accept a custom SSL socket factory in Jsoup.Connection. Note that Connection.validateTLSCertificates() will be removed in the next release; Connection.sslSocketFactory(SSLSocketFactory sslSocketFactory) provides a path to implement a workaround if you need to keep using a similar approach.
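As a rough illustration of the custom socket factory route, the sketch below builds an SSLSocketFactory from a trust store using only JDK classes. The class name CustomSslFactory and the method fromTrustStore are hypothetical names chosen for this example. The jsoup call appears only in a comment and assumes jsoup 1.11.3 is on the classpath.

```java
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManagerFactory;

// Hypothetical helper: not part of jsoup, just one way to obtain a factory.
final class CustomSslFactory {

    // Builds a socket factory that trusts the certificates in trustStore.
    // Passing null falls back to the JDK's default trust store.
    static SSLSocketFactory fromTrustStore(KeyStore trustStore) {
        try {
            TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
            tmf.init(trustStore);

            SSLContext context = SSLContext.getInstance("TLS");
            // No client key managers; default source of randomness.
            context.init(null, tmf.getTrustManagers(), null);
            return context.getSocketFactory();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not build SSL socket factory", e);
        }
    }

    public static void main(String[] args) {
        SSLSocketFactory factory = fromTrustStore(null);
        System.out.println(factory.getSupportedCipherSuites().length + " cipher suites supported");

        // With jsoup on the classpath, the factory could then be handed to a connection:
        // Document doc = Jsoup.connect("https://example.com/").sslSocketFactory(factory).get();
    }
}
```

Loading a KeyStore that contains only the certificates you intend to trust (for example, an internal CA) keeps certificate validation switched on while still reaching servers that the default trust store rejects.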
Bug Fixes
- Bugfix: A "Mark has been invalidated" exception was thrown when parsing some URLs on Android <= 6.
- Bugfix: The Element.text() for <div>One</div>Two was "OneTwo", not "One Two".
- Bugfix: boolean attributes with empty string values were not collapsing in HTML output.
- Bugfix: when using the XML Parser set to lowercase normalize tags, uppercase closing tags were not correctly handled.
- Bugfix: when parsing from a URL, an end tag could be read incorrectly if it started on a buffer boundary.
- Bugfix: when parsing from a URL, if the remote server failed to complete its write (i.e. it writes less than the Content Length header promised on a gzipped stream), the parse method would incorrectly throw an unchecked exception. It now throws the declared IOException.
- Bugfix: leaf nodes (such as text nodes) were throwing an unsupported operation exception on childNodes(), instead of just returning an empty list.
- Bugfix: documents with a leading UTF-8 BOM did not have that BOM consumed, so it acted as a zero width no-break space, which could impact the parse tree.
- Bugfix: when parsing an invalid XML declaration, the parse would fail.
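To show why an unconsumed BOM matters, here is a minimal JDK-only sketch. It is not jsoup's internal code, and BomDemo, decodeUtf8, and stripLeadingBom are hypothetical names. Java's UTF-8 decoder keeps the BOM as the character U+FEFF, so unless something strips it, a parser sees a stray character before the first tag.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; illustrates the symptom, not jsoup's actual fix.
final class BomDemo {

    // Plain UTF-8 decoding: a leading BOM survives as the character U+FEFF.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Removes a single leading U+FEFF, if present.
    static String stripLeadingBom(String text) {
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        // Encoding U+FEFF as UTF-8 yields the BOM bytes EF BB BF.
        byte[] input = "\uFEFF<p>Hi</p>".getBytes(StandardCharsets.UTF_8);

        String decoded = decodeUtf8(input);
        System.out.println(String.format("first char: U+%04X", (int) decoded.charAt(0)));
        System.out.println("after strip: " + stripLeadingBom(decoded));
    }
}
```

From 1.11.3 on, jsoup consumes the BOM itself, so a manual step like stripLeadingBom should only be needed on earlier versions.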
Many thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch via the mailing list or contact me directly.