jsoup Java HTML Parser release 1.12.1

2019-May-12

jsoup 1.12.1 is out now, with a great set of usability improvements, speed and memory efficiency improvements, and bug fixes.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
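
As a rough illustration of that API, the sketch below parses a string of HTML and pulls an attribute out with a CSS selector. The class name SelectSketch, the helper firstLinkHref, and the sample markup are made up for this example; it assumes jsoup 1.11.1 or later is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class; not part of jsoup.
class SelectSketch {
    // Returns the href of the first link in the HTML, or null if there is none.
    static String firstLinkHref(String html) {
        Document doc = Jsoup.parse(html);            // parse with the HTML parser
        Element link = doc.selectFirst("a[href]");   // CSS selector: first <a> with an href
        return link == null ? null : link.attr("href");
    }

    public static void main(String[] args) {
        String html = "<p>See <a href='https://jsoup.org/'>jsoup</a></p>";
        System.out.println(firstLinkHref(html));
    }
}
```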

Download jsoup now.

Changes

  • Change: removed the deprecated Connection.validateTLSCertificates() method, which was used to disable TLS certificate checking.
  • Change: some internal methods have been rearranged; if you extended any of the Jsoup internals you may need to make updates.
  • Updated jetty-server (which is used for integration tests) to latest 9.2 series (9.2.28).

Improvements

  • Improvement: documents now remember their parser, so when later manipulating them, the correct HTML or XML tree builder is reused, as are the parser settings like case preservation.
  • Improvement: Jsoup now detects the character set of the input if specified in an XML Declaration, when using the HTML parser. Previously that only happened when the XML parser was specified.
  • Improvement: if the document's input character set does not support encoding, flip it to one that does.
  • Improvement: if a start tag is missing a > and a new tag is seen with a <, treat that as a new tag. (This differs from the HTML5 spec, which would make an attribute with a name beginning with <, but in practice this impacts too many pages.)
  • Improvement: performance tweaks when parsing start tags, data, tables.
  • Improvement: added Element.nextElementSiblings() and Element.previousElementSiblings().
  • Improvement: treat center tags as block tags.
  • Improvement: allow forms to be submitted with Content-Type=multipart/form-data without requiring a file upload; automatically set the mime boundary.
  • Improvement: Jsoup will now detect if an input file or URL is binary, and will refuse to attempt to parse it, throwing an IOException. This prevents runaway processing time and wasted effort creating meaningless parsed DOM trees.
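
The new sibling accessors from the list above can be tried with a minimal sketch like the following. The class SiblingSketch, its helper methods, and the sample list markup are hypothetical; only Element.nextElementSiblings() and Element.previousElementSiblings() come from this release, so jsoup 1.12.1 or later is assumed.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class; not part of jsoup.
class SiblingSketch {
    // Collects the text of each element, in iteration order.
    static List<String> texts(Iterable<Element> elements) {
        List<String> out = new ArrayList<>();
        for (Element el : elements) {
            out.add(el.text());
        }
        return out;
    }

    // Text of every element sibling that follows the element with the given id.
    static List<String> textsAfter(Document doc, String id) {
        Element anchor = doc.getElementById(id);
        return texts(anchor.nextElementSiblings());
    }

    // Text of every element sibling that precedes the element with the given id.
    static List<String> textsBefore(Document doc, String id) {
        Element anchor = doc.getElementById(id);
        return texts(anchor.previousElementSiblings());
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
            "<ul><li>a</li><li id=mid>b</li><li>c</li><li>d</li></ul>");
        System.out.println(textsAfter(doc, "mid"));
        System.out.println(textsBefore(doc, "mid"));
    }
}
```

Both methods return an Elements collection, so the results can be filtered or iterated like any other selection.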

Bug Fixes

  • Bugfix: when using the tag case preserving parsing settings, certain HTML tree building rules were not followed for upper case tags.
  • Bugfix: when converting a Jsoup document to a W3C DOM, if an element is namespaced but not in a defined namespace, set it to the global namespace.
  • Bugfix: attributes created with the Attribute constructor with just spaces for names would incorrectly pass validation.
  • Bugfix: some pseudo XML Declarations were incorrectly handled when using the XML Parser, leading to an index out of bounds (IOOB) exception when parsing.
  • Bugfix: when parsing URL parameter names in an attribute that is not correctly HTML encoded, and near the end of the current buffer, those parameters may be incorrectly dropped. (Improved CharacterReader mark/reset support.)
  • Bugfix: boolean attribute values would be returned as null, rather than an empty string, when accessed via the Attribute#getValue() method.
  • Bugfix: orphan Attribute objects (i.e. created outside of a parse or an Element) would throw an NPE on Attribute#setValue(val).
  • Bugfix: Element.shallowClone() was not making a clone of its attributes.
  • Bugfix: fixed an ArrayIndexOutOfBoundsException in HttpConnection.looksLikeUtf8() when testing small strings in specific character ranges.
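
Two of the fixes above, the boolean attribute value and the Element.shallowClone() attribute copy, can be checked with a small sketch along these lines. The class AttributeSketch, its helpers, and the sample markup are hypothetical, and the behaviour described assumes jsoup 1.12.1 or later.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class; not part of jsoup.
class AttributeSketch {
    // Looks up an attribute by name and reads it through Attribute#getValue().
    // Returns null only when the attribute is not present at all.
    static String valueViaAttribute(Element el, String name) {
        for (Attribute attr : el.attributes()) {
            if (attr.getKey().equals(name)) {
                return attr.getValue();
            }
        }
        return null;
    }

    // Edits the class attribute on a shallow clone, then reports the original's
    // class. With independent attribute copies, the original is unaffected.
    static String originalClassAfterCloneEdit(Element original) {
        Element copy = original.shallowClone();
        copy.attr("class", "changed");
        return original.attr("class");
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
            "<div class=box><input disabled><p>text</p></div>");
        Element input = doc.selectFirst("input");
        Element div = doc.selectFirst("div");

        System.out.println("[" + valueViaAttribute(input, "disabled") + "]");
        System.out.println(originalClassAfterCloneEdit(div));
        System.out.println(div.shallowClone().childNodeSize());
    }
}
```

On a fixed version, the first line should show empty brackets rather than a null, and the original div should keep its class after the clone is modified.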

Many thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch via the mailing list or contact me directly.