jsoup Java HTML Parser release 1.14.3
2021-Sep-30
jsoup 1.14.3 is out now. It adds native XPath selector support, along with a number of bug fixes, improvements, and performance enhancements.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Download jsoup now.
Improvements
- Added native XPath support with Element.selectXpath(String). #1629
- Added full support for the <template> tag, up to the HTML5 parser spec. #1634
- Added support in CharacterReader to track newlines, so that parse errors can be reported more intuitively. #1624
- Tracked parse errors now have more details, including the erroneous token, to help clarify the errors.
- Speed and memory optimizations for the :has(subquery) selector.
- The :contains(text) and :containsOwn(text) selectors are now whitespace normalized, aligning to the document text that they are matching against. #876
- In Element, optimized the speed of adopting all of an element's child nodes into a currently empty element. This improves the HTML adoption agency algorithm when adopting elements with many children. #1638
- Increased the parse speed when in RCData (e.g. <title>) and unescaped <tag> tokens are found, by memoizing the </title> scan and reducing GC. #1644
- When parsing custom tags (in HTML or XML), added a flyweight cache on Tag.valueOf(String) to reduce memory overhead when many tags are repeated. Also tuned other areas of the parser when many very deeply stacked custom elements were present. #1646
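As a rough illustration of the new XPath support listed above, here is a minimal sketch that runs the same lookup once with selectXpath and once with a CSS selector. The class name, method names, and sample markup are hypothetical and exist only for this example; it assumes jsoup 1.14.3 or later is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.util.List;

public class XpathSketch {
    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
        "<nav><a href=\"/one\">One</a><a href=\"/two\">Two</a></nav>"
        + "<p><a href=\"/three\">Three</a></p>";

    /** Returns the text of every link inside a nav element, found with an XPath query. */
    static List<String> navLinksByXpath(String html) {
        Document doc = Jsoup.parse(html);
        // selectXpath is the method added in this release (requires jsoup 1.14.3+).
        Elements links = doc.selectXpath("//nav//a");
        return links.eachText();
    }

    /** The same lookup written as a CSS selector, for comparison. */
    static List<String> navLinksByCss(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("nav a").eachText();
    }

    public static void main(String[] args) {
        System.out.println(navLinksByXpath(HTML));
        System.out.println(navLinksByCss(HTML));
    }
}
```

Both methods should pick out the two links inside the nav element and skip the one in the paragraph. Which style to use is mostly a matter of taste; XPath may be convenient when porting existing queries from other tools.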
Bug Fixes
- The OSGi bundle meta-data incorrectly set a version on the import of javax.annotation (used as a build-time dependency for nullability assertions). #1616
- When tracking errors or checking for validity in the Cleaner, errors were incorrectly raised for missing optional closing tags.
- The Attributes.equals() method was sensitive to the order of its contents, but it should not be. #1492
- When the HTML parser was configured to preserve case, Element text methods would miss adding whitespace for BR tags.
- Attribute names are now normalized & validated correctly for the specific output syntax (HTML or XML). Previously, syntactically invalid attribute names could be output by the html() methods. Such attributes are still available in the DOM, and will be normalized if possible on output. #1474
- Bugfix [Fuzz]: fixed an IOOB when an empty select tag was followed by a body tag that needed reparenting. #1639
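To make the Attributes.equals() fix above concrete, the following sketch builds two attribute sets with the same pairs in a different insertion order and compares them. The class and method names are hypothetical, chosen only for this example; the result described assumes jsoup 1.14.3 or later, since earlier versions compared in insertion order.

```java
import org.jsoup.nodes.Attributes;

public class AttributesEqualitySketch {
    /** Compares two attribute sets that hold the same pairs, added in opposite order. */
    static boolean sameRegardlessOfOrder() {
        Attributes first = new Attributes().put("id", "main").put("class", "wide");
        Attributes second = new Attributes().put("class", "wide").put("id", "main");
        // Per the release note, order should no longer matter on 1.14.3+.
        return first.equals(second);
    }

    /** Attribute sets with a differing value should still compare as unequal. */
    static boolean differentValuesAreEqual() {
        Attributes first = new Attributes().put("id", "main");
        Attributes second = new Attributes().put("id", "other");
        return first.equals(second);
    }

    public static void main(String[] args) {
        System.out.println(sameRegardlessOfOrder());
        System.out.println(differentValuesAreEqual());
    }
}
```

This matters mostly for code that compares elements or deduplicates them by their attributes, where the source order of attributes is usually incidental.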
Build Improvements
- Fixed nullability annotations for Node.equals(Object) and other equals methods. #1628
- Added JDK 17 to the CI builds. #1641
My sincere thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch with me directly.
You can also follow me (@jhy) on Twitter to receive occasional notes about jsoup releases.