jsoup Java HTML Parser release 1.14.2

2021-Aug-15

Caught by the fuzz! jsoup 1.14.2 is out now, and includes a set of parser bug fixes and improvements for handling rough HTML and XML, as identified by the Jazzer JVM fuzzer. This release also includes other fixes and improvements.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.

Guided fuzzing is a testing method that, starting from a defined corpus, generates millions of different input files, using the instrumented codebase to steer the test harness. It attempts to adversarially create input content that leads to slow performance or unexpected exceptions. This approach finds areas in the parser that can be improved, leading to a faster and more robust implementation of jsoup.
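As a rough illustration of the loop described above, here is a minimal sketch of a mutation fuzzer in plain Java. It is deliberately simplified: it mutates seeds from a corpus at random and records inputs that throw, but it has none of the coverage feedback that a guided fuzzer such as Jazzer uses to steer input generation, and it does not measure slow parses. The class name, the `mutate` and `fuzz` helpers, and the `Consumer<String>` target are all hypothetical and are not part of jsoup or Jazzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.function.Consumer;

// Hypothetical sketch: an unguided mutation fuzzer (no coverage instrumentation).
class MutationFuzzSketch {

    /** Inserts, replaces, or deletes one random ASCII character in the seed. */
    static String mutate(String seed, Random rnd) {
        StringBuilder sb = new StringBuilder(seed);
        int pos = sb.length() == 0 ? 0 : rnd.nextInt(sb.length());
        char c = (char) rnd.nextInt(128); // includes NUL and other control characters
        switch (rnd.nextInt(3)) {
            case 0:
                sb.insert(pos, c);
                break;
            case 1:
                if (sb.length() > 0) sb.setCharAt(pos, c);
                break;
            default:
                if (sb.length() > 0) sb.deleteCharAt(pos);
                break;
        }
        return sb.toString();
    }

    /** Feeds mutated corpus entries to the target and collects inputs that made it fail. */
    static List<String> fuzz(List<String> corpus, Consumer<String> target,
                             int iterations, long seed) {
        Random rnd = new Random(seed); // fixed seed keeps a run reproducible
        List<String> failures = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            String input = mutate(corpus.get(rnd.nextInt(corpus.size())), rnd);
            try {
                target.accept(input);
            } catch (RuntimeException | StackOverflowError e) {
                failures.add(input); // unexpected exception: keep the input for triage
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList("<p>One</p>", "<table><tr><td>x");
        // Stand-in target; a real harness would call the parser under test here.
        Consumer<String> target = html -> {
            if (html.indexOf('\0') >= 0) throw new IllegalStateException("null char");
        };
        System.out.println(fuzz(corpus, target, 10_000, 42L).size() + " failing inputs");
    }
}
```

A real guided fuzzer adds the step this sketch leaves out: inputs that reach new code paths are added back to the corpus, so later mutations build on them.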

This testing has identified particular content that could result in longer than usual parse times, or could result in unexpected exceptions including Stack Overflow, Null Pointer, and Index out of Bounds exceptions. Depending on how the parser was used, that could potentially contribute to denial of service attacks. Versions of jsoup before 1.14.2 are susceptible. We recommend that all users upgrade to this new version.

Download jsoup now.

Improvements

  • Improvement: support Pattern.quote \Q and \E escapes in the selector regex matchers. #1536
  • Improvement: Element.absUrl() now supports tel: URLs, and other URLs that are already absolute but that Java does not have input stream handlers for. #1610
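
To illustrate the first improvement, the sketch below uses only `java.util.regex` to show what `Pattern.quote` produces and why it matters when a selector regex should match literal text. The `literalTextSelector` helper and the class name are hypothetical. The resulting string uses jsoup's `:matchesOwn(regex)` pseudo-selector and is assumed to be passed to a `select` call with jsoup 1.14.2 or later on the classpath; that call is not shown here.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: building a selector whose regex part is quoted literally.
class QuotedSelectorSketch {

    /** Returns a selector such as p:matchesOwn(\Q1.5+\E) for the given literal text. */
    static String literalTextSelector(String tag, String literal) {
        return tag + ":matchesOwn(" + Pattern.quote(literal) + ")";
    }

    public static void main(String[] args) {
        String literal = "1.5+";
        String quoted = Pattern.quote(literal); // wraps the text in \Q ... \E

        // Unquoted, '.' matches any character and '+' is a quantifier.
        boolean unquotedHit = Pattern.compile(literal).matcher("build 1055").find();
        // Quoted, only the exact characters "1.5+" match.
        boolean quotedHit = Pattern.compile(quoted).matcher("build 1055").find();

        System.out.println(quoted);
        System.out.println(unquotedHit + " / " + quotedHit);
        System.out.println(literalTextSelector("p", literal));
    }
}
```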

Bug Fixes

  • Bugfix: when serializing output, escape characters that are in the < 0x20 range (control characters). This improves XML output compatibility, and makes HTML output with these characters easier to read (as they're otherwise invisible). #1556
  • Bugfix: the *|el wildcard namespace selector now also matches elements with no namespace. #1565
  • Bugfix: corrected a potential case of the parser input stream not being closed immediately on a read exception.
  • Bugfix: when making an HTTP POST, if the request write fails, make sure the connection is immediately cleaned up.
  • Bugfix: in the XML parser, XML processing instructions without attributes would be serialized as if they did. #770
  • Bugfix: updated the HtmlTreeParser resetInsertionMode to the current spec for supported elements. #1491
  • Bugfix: fixed an NPE when parsing fragment HTML into a standalone table element. #1603
  • Bugfix: fixed an NPE when parsing fragment heading HTML into a standalone p element. #1601
  • Bugfix: fixed an IOOB when parsing a formatting fragment into a standalone p element. #1602
  • Bugfix: tag names must start with an ascii-alpha character. #1006
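
The idea behind the first fix in this list, escaping control characters on output, can be sketched in a few lines of plain Java. This is not jsoup's implementation: the `escapeControlChars` helper is hypothetical, and both the choice of hexadecimal numeric character references and the decision to pass tab, line feed, and carriage return through unchanged are assumptions made for the example.

```java
// Hypothetical sketch: making invisible control characters visible in serialized output.
class ControlCharEscapeSketch {

    /** Replaces characters below 0x20 (except tab, LF, CR) with numeric references. */
    static String escapeControlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean commonWhitespace = c == '\t' || c == '\n' || c == '\r';
            if (c < 0x20 && !commonWhitespace) {
                // e.g. the ESC character (0x1B) becomes &#x1b;
                out.append("&#x").append(Integer.toHexString(c)).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeControlChars("a\u001Bb\tc"));
    }
}
```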

Fuzz Fixes

  • Bugfix [Fuzz]: fixed a slow parse when a tag or an attribute name has thousands of null characters in it. #1580
  • Bugfix [Fuzz]: the adoption agency algorithm could have an incorrect bookmark position. #1576
  • Bugfix [Fuzz]: malformed HTML could result in null elements on the stack. #1579
  • Bugfix [Fuzz]: malformed deeply nested table elements could create a stack overflow. #1577
  • Bugfix [Fuzz]: Speed improvement for malformed HTML that creates elements with thousands of attributes: the attribute count per element is now limited to 512 when parsing (in real-world HTML, P99 is ~ 8). #1578
  • Bugfix [Fuzz]: Speed improvement for the foster formatting elements algo, by limiting how far up a crafted stack to scan. #1593
  • Bugfix [Fuzz]: Speed improvement when parsing crafted HTML when transferring form attributes. #1595
  • Bugfix [Fuzz]: Speed improvement when the stack was thousands of items deep and non-matching close tags were sent. #1596
  • Bugfix [Fuzz]: Speed improvement when an attribute name is 600K of quote characters or otherwise needs accumulation vs being able to read in one hit. #1605
  • Bugfix [Fuzz]: Speed improvement when closing missing empty tags (in an XML comment processed as HTML) when the stack is thousands of items deep. #1606
  • Bugfix [Fuzz]: Fix a potential stack-overflow in the parser given crafted HTML, when the parser looped in the InSelectInTable state.
  • Bugfix [Fuzz]: Fix an IOOB when the HTML root was cleared from the stack and then attributes were merged onto it. #1611
  • Bugfix [Fuzz]: Improved the speed of parsing when crafted HTML contains hundreds of active formatting elements that were copied for all new elements (similar to an amplification attack). The number of considered active formatting elements that will be cloned when mis-nested is now capped to 12. #1613

My sincere thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch with me directly.

You can also follow me (@jhy) on Twitter to receive occasional notes about jsoup releases.