jsoup Java HTML Parser release 1.17.2
2023-Dec-29
jsoup 1.17.2 is out now, with improvements around attribute source position tracking, and a range of bug fixes.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Deprecation note: over the last few releases, a number of mostly internal methods have been deprecated. Please review your use of any of these methods and migrate away from them now if applicable. These will be removed in a following release.
Download jsoup now.
Improvements
- Attribute object accessors: Added Element.attribute(String) and Attributes.attribute(String) to more simply obtain an Attribute object. 2069
- Attribute source tracking: If source tracking is on, and an Attribute's key is changed (via Attribute.setKey(String)), the source range is now still tracked in Attribute.sourceRange(). 2070
- Wildcard attribute selector: Added support for the [*] element with any attribute selector. And also restored support for selecting by an empty attribute name prefix ([^]). 2079
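As a rough illustration of the new accessors and the wildcard selector, here is a minimal sketch. The sample markup, the class name AttributeExamples, and the helper methods lookup and withAnyAttribute are made up for this example; it assumes jsoup 1.17.2 or later is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class AttributeExamples {
    // Hypothetical sample markup, used only for this sketch.
    static final String HTML =
        "<div id=main><p class=intro>One</p><p>Two</p><span title=tip>Three</span></div>";

    // Element.attribute(String) returns the Attribute object, or null if the key is absent.
    static Attribute lookup(Element el, String key) {
        return el.attribute(key);
    }

    // [*] matches elements that carry at least one attribute.
    static Elements withAnyAttribute(Document doc) {
        return doc.select("[*]");
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        Element div = doc.expectFirst("div");

        // Direct accessor on the element.
        Attribute id = lookup(div, "id");
        System.out.println(id.getKey() + "=" + id.getValue());

        // The same lookup through the Attributes collection.
        Attribute viaAttributes = div.attributes().attribute("id");
        System.out.println(viaAttributes != null);

        // A missing attribute yields null, so check before dereferencing.
        System.out.println(lookup(div, "class") == null);

        // Elements without any attributes (such as the second p) are skipped.
        for (Element el : withAnyAttribute(doc)) {
            System.out.println(el.tagName());
        }
    }
}
```

Because both accessors can return null, callers that previously relied on attr(String) returning an empty string for a missing key need an explicit null check when switching to the Attribute object.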
Bug Fixes
- Mixed-case source position: When tracking the source position of attributes, if the source attribute name was mixed-case but the parser was lower-case normalizing attribute names, the source position for that attribute was not tracked correctly. 2067
- Source position NPE: When tracking the source position of a body fragment parse, a null pointer exception was thrown. 2068
- Multi-point emoji entity: A multi-point encoded emoji entity may be incorrectly decoded to the replacement character. 2074
- Selector sub-expressions: (Regression) In a selector like "parent [attr=va], other", the "," OR combinator was binding to "[attr=va]" instead of "parent [attr=va]", causing incorrect selections. The fix includes an EvaluatorDebug class that generates a sexpr to represent the query, allowing simpler and more thorough query parse tests. 2073
- XML CData output: When generating XML-syntax output from parsed HTML, script nodes containing (pseudo) CData sections would have an extraneous CData section added, causing script execution errors. Now, the data content is emitted in an HTML/XML/XHTML polyglot format, if the data is not already within a CData section. 2078
- Thread safety: The :has evaluator held a non-thread-safe Iterator, and so if an Evaluator object was shared across multiple concurrent threads, a NoSuchElement exception may be thrown, and the selected results may be incorrect. Now, the iterator object is a thread-local. 2088
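To make the selector grouping fix and the shared Evaluator case more concrete, here is a small sketch. The markup, the class name SelectorExamples, and the helper methods grouped and withHas are hypothetical and exist only for illustration; it assumes jsoup 1.17.2 or later on the classpath, and it does not reproduce jsoup's internal implementation of either fix.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.jsoup.select.Evaluator;
import org.jsoup.select.QueryParser;

public class SelectorExamples {
    // Hypothetical sample markup, used only for this sketch.
    static final String HTML =
        "<div id=parent><a id=one attr=va>1</a><a id=two>2</a></div>"
        + "<p id=three attr=va>3</p><span id=four>4</span>";

    // The comma separates whole selectors: (div [attr=va]) OR (span).
    // With the fix, [attr=va] stays scoped to descendants of div.
    static Elements grouped(Document doc) {
        return doc.select("div [attr=va], span");
    }

    // Runs a pre-parsed Evaluator, which can be parsed once and reused.
    static Elements withHas(Document doc, Evaluator eval) {
        return doc.select(eval);
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);

        // The p element has attr=va but is not inside a div, so it is not selected.
        for (String id : grouped(doc).eachAttr("id")) {
            System.out.println(id);
        }

        // Parse the :has query once; the resulting Evaluator is the kind of
        // object that might be shared between threads.
        Evaluator hasLink = QueryParser.parse("div:has(a)");
        for (String id : withHas(doc, hasLink).eachAttr("id")) {
            System.out.println(id);
        }
    }
}
```

Parsing a query once with QueryParser.parse and reusing the Evaluator avoids re-parsing the selector on every call, which is the usage pattern the thread-safety fix is aimed at.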