jsoup Java HTML Parser release 1.17.2

2023-Dec-29

jsoup 1.17.2 is out now, with improvements around attribute source position tracking, and a range of bug fixes.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.

Deprecation note: over the last few releases, a number of mostly internal methods have been deprecated. Please review your use of any of these methods and migrate away from them now if applicable. These will be removed in a following release.

Download jsoup now.

Improvements

Bug Fixes

  • Mixed-cased source position: When tracking the source position of attributes, if the source attribute name was mix-cased but the parser was lower-case normalizing attribute names, the source position for that attribute was not tracked correctly. 2067
  • Source position NPE: When tracking the source position of a body fragment parse, a null pointer exception was thrown. 2068
  • Multi-point emoji entity: A multi-point encoded emoji entity may be incorrectly decoded to the replacement character. 2074
  • Selector sub-expressions: (Regression) in a selector like parent [attr=va], other, the , OR was binding to [attr=va] instead of parent [attr=va], causing incorrect selections. The fix includes a EvaluatorDebug class that generates a sexpr to represent the query, allowing simpler and more thorough query parse tests. 2073
  • XML CData output: When generating XML-syntax output from parsed HTML, script nodes containing (pseudo) CData sections would have an extraneous CData section added, causing script execution errors. Now, the data content is emitted in a HTML/XML/XHTML polyglot format, if the data is not already within a CData section. 2078
  • Thread safety: The :has evaluator held a non-thread-safe Iterator, and so if an Evaluator object was shared across multiple concurrent threads, a NoSuchElement exception may be thrown, and the selected results may be incorrect. Now, the iterator object is a thread-local. 2088