jsoup Java HTML Parser release 1.14.3

2021-Sep-30

jsoup 1.14.3 is out now. It adds native XPath selector support and includes a number of bug fixes, improvements, and performance enhancements.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.

Download jsoup now.

Improvements

  • Added native XPath support with Element.selectXpath(String) #1629
  • Added full support for the <template> tag, in line with the HTML5 parser spec. #1634
  • Added support in CharacterReader to track newlines, so that parse errors can be reported more intuitively. #1624
  • Tracked parse errors now have more details, including the erroneous token, to help clarify the errors.
  • Speed and memory optimizations for the :has(subquery) selector.
  • The :contains(text) and :containsOwn(text) selectors are now whitespace normalized, aligning to the document text that they are matching against. #876
  • In Element, optimized the speed of adopting all of an element's child nodes into a currently empty element. This improves the HTML adoption agency algorithm when adopting elements with many children. #1638
  • Increased parse speed when unescaped <tag> tokens are found in RCData (e.g. <title>), by memoizing the </title> scan and reducing GC. #1644
  • When parsing custom tags (in HTML or XML), added a flyweight cache on Tag.valueOf(String) to reduce memory overhead when many tags are repeated. Also tuned other areas of the parser for documents with many very deeply stacked custom elements. #1646
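The flyweight cache mentioned in the last item can be pictured with a small standalone sketch. This is not jsoup's code: SketchTag, its valueOf method, and cacheSize are hypothetical names used only to illustrate the general technique of returning one shared instance per repeated tag name instead of allocating a new object each time.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a parser's tag type; NOT jsoup's actual Tag class.
final class SketchTag {
    // Flyweight pool: one shared instance per distinct tag name.
    private static final Map<String, SketchTag> CACHE = new ConcurrentHashMap<>();

    private final String name;

    private SketchTag(String name) {
        this.name = name;
    }

    // Returns the shared instance for this name, creating it on first use.
    static SketchTag valueOf(String name) {
        return CACHE.computeIfAbsent(name, SketchTag::new);
    }

    String getName() {
        return name;
    }

    // Number of distinct tag names seen so far.
    static int cacheSize() {
        return CACHE.size();
    }

    public static void main(String[] args) {
        SketchTag a = valueOf("my-widget");
        SketchTag b = valueOf("my-widget");
        System.out.println(a == b);      // true: both refer to the same object
        System.out.println(cacheSize()); // 1: only one instance was allocated
    }
}
```

In a document that repeats the same custom tag thousands of times, a pool like this keeps the number of tag objects proportional to the number of distinct names. The sketch never evicts entries, which a real parser would likely need to consider.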

Bug Fixes

  • The OSGi bundle meta-data incorrectly set a version on the import of javax.annotation (used as a build-time dependency for nullability assertions). #1616
  • When tracking errors or checking for validity in the Cleaner, errors were incorrectly raised for missing optional closing tags.
  • The Attributes.equals() method was sensitive to the order of its contents, but it should not be. #1492
  • When the HTML parser was configured to preserve case, Element text methods would miss adding whitespace for BR tags.
  • Attribute names are now normalized and validated correctly for the specific output syntax (HTML or XML). Previously, syntactically invalid attribute names could be output by the html() methods. Such attributes are still available in the DOM, and will be normalized if possible on output. #1474
  • Bugfix [Fuzz]: fixed an IndexOutOfBoundsException (IOOB) when an empty select tag was followed by a body tag that needed reparenting. #1639
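To illustrate what order-insensitive equality means for the Attributes.equals() fix above, here is a minimal standalone sketch. It does not reproduce jsoup's implementation: SketchAttributes and its put method are hypothetical, and the sketch simply delegates to Map.equals, which compares entries without regard to insertion order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical attribute container; NOT jsoup's Attributes class.
final class SketchAttributes {
    // LinkedHashMap keeps insertion order, which is useful for serialized output.
    private final Map<String, String> values = new LinkedHashMap<>();

    SketchAttributes put(String key, String value) {
        values.put(key, value);
        return this;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof SketchAttributes)) return false;
        // Map.equals compares the sets of entries, so insertion order is ignored.
        return values.equals(((SketchAttributes) o).values);
    }

    @Override
    public int hashCode() {
        // Map.hashCode sums the entry hashes, so it is order-independent too.
        return values.hashCode();
    }

    public static void main(String[] args) {
        SketchAttributes a = new SketchAttributes().put("id", "main").put("class", "wide");
        SketchAttributes b = new SketchAttributes().put("class", "wide").put("id", "main");
        System.out.println(a.equals(b)); // true: same pairs, different order
    }
}
```

Keeping hashCode consistent with equals matters here: two containers that compare equal must also hash equally, or they will misbehave as keys in hash-based collections.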

Build Improvements

My sincere thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch with me directly.

You can also follow me (@jhy) on Twitter to receive occasional notes about jsoup releases.