jsoup Java HTML Parser release 1.18.2
2024-Nov-27
jsoup 1.18.2 is out now, with significant performance gains when parsing HTML inputs, plus a range of other improvements and fixes.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Download jsoup now.
Improvements
- Optimized throughput and memory use throughout the input read and parse flows, with heap allocations and GC down by between 6% and 89%, and throughput improved by up to 143% for small inputs. Most input sizes will see throughput increases of around 20%. These performance improvements come through recycling the backing `byte[]` and `char[]` arrays used to read and parse the input. 2186
- Speed optimized `html()` and `Entities.escape()` when the input contains UTF characters in a supplementary plane, by around 49%. 2183
- The form associated elements returned by `FormElement.elements()` now reflect changes made to the DOM after the original parse. 2140
- In the `TreeBuilder`, the `onNodeInserted()` and `onNodeClosed()` events are now also fired for the outermost / root `Document` node. This enables source position tracking on the Document node (which was previously unset). It also enables the node traversor to see the outer Document node. 2182
- Selected Elements can now be position swapped inline using `Elements#set()`. 2212
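To illustrate the array recycling idea behind the first improvement above, here is a minimal sketch of a per-thread buffer pool. It is not jsoup's internal code: the class name `CharBufferPool`, the buffer size, and the idle cap are all assumptions made for the example.

```java
import java.util.ArrayDeque;

// Hypothetical sketch of buffer recycling; NOT jsoup's internal implementation.
final class CharBufferPool {
    static final int BUFFER_SIZE = 8 * 1024; // assumed buffer length
    static final int MAX_IDLE = 4;           // assumed cap on idle buffers kept per thread

    // Each thread keeps its own stack of idle buffers, so no locking is needed.
    private static final ThreadLocal<ArrayDeque<char[]>> IDLE =
        ThreadLocal.withInitial(() -> new ArrayDeque<char[]>());

    // Hand out a previously released buffer if one is idle, else allocate a new one.
    // A recycled buffer still holds stale data, so callers must track the valid length.
    static char[] borrow() {
        char[] buffer = IDLE.get().pollFirst();
        return buffer != null ? buffer : new char[BUFFER_SIZE];
    }

    // Return a buffer for reuse; extras beyond MAX_IDLE are left for the GC.
    static void release(char[] buffer) {
        ArrayDeque<char[]> idle = IDLE.get();
        if (buffer.length == BUFFER_SIZE && idle.size() < MAX_IDLE) {
            idle.addFirst(buffer);
        }
    }

    public static void main(String[] args) {
        char[] first = borrow();
        release(first);
        char[] second = borrow();
        System.out.println("reused: " + (first == second)); // prints "reused: true"
        release(second);
    }
}
```

Reusing a released array avoids a fresh allocation on each parse. That matters most for small inputs, where a fixed-size buffer would otherwise dominate the per-parse cost.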
Bug Fixes
- `Element.cssSelector()` would fail if the element's class contained a `*` character. 2169
- When tracking source ranges, a text node following an invalid self-closing element may be left untracked. 2175
- When a document has no doctype, or a doctype not named `html`, it should be parsed in Quirks Mode. 2197
- With a selector like `div:has(span + a)`, the `has()` component was not working correctly, as the inner combining query caused the evaluator to match those against the outer's siblings, not children. 2187
- A selector query that included multiple `:has()` components in a nested `:has()` might incorrectly execute. 2131
- When cookie names in a response are duplicated, the simple view of cookies available via `Connection.Response#cookies()` will provide the last one set. Generally it is better to use the Jsoup.newSession method to maintain a cookie jar, as that applies appropriate path selection on cookies when making requests. 1831
- When parsing named HTML entities, base entities should resolve if they are a prefix of the input token (and not in an attribute). 2207
- Fixed incorrect tracking of source ranges for attributes merged from late-occurring elements that were implicitly created (`html` or `body`). 2204
- Follow the current HTML specification in the tokenizer to allow `<` as part of a tag name, instead of emitting it as a character node. 2230
- Similarly, allow a `<` as the start of an attribute name, vs creating a new element. The previous behavior was intended to parse closer to what we anticipated the author's intent to be, but that does not align to the spec or to how browsers behave. 1483
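The named entity fix above (2207) can be pictured with a small standalone sketch of prefix resolution. This is only an illustration under stated assumptions, not jsoup's tokenizer: the class `BaseEntityPrefix`, its `decode` method, and the six-entry table are hypothetical, and a real parser would consult the full named entity table before falling back to a base entity prefix.

```java
import java.util.Map;

// Hypothetical sketch of base-entity prefix resolution; NOT jsoup's tokenizer.
final class BaseEntityPrefix {
    // Illustrative subset of "base" entities, which may appear without a trailing ';'.
    static final Map<String, String> BASE = Map.of(
        "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "not", "\u00AC", "copy", "\u00A9");

    /**
     * Decodes the name that followed an '&' in the input.
     * @param token the letters after '&', excluding any ';'
     * @param inAttribute true if the reference occurs inside an attribute value
     * @return the decoded text, or null if the reference should be left as literal text
     */
    static String decode(String token, boolean inAttribute) {
        String exact = BASE.get(token);
        if (exact != null) return exact;
        if (inAttribute) return null; // prefix matching is not applied inside attributes
        // Try the longest prefix first; the unmatched tail stays as plain text.
        for (int end = token.length() - 1; end > 0; end--) {
            String base = BASE.get(token.substring(0, end));
            if (base != null) return base + token.substring(end);
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(decode("notit", false)); // "not" resolves, "it" is kept as text
        System.out.println(decode("notit", true));  // prints "null": left as literal text
    }
}
```

Under these assumptions, `&notit` in body text becomes the not sign followed by `it`, while the same sequence inside an attribute value is left untouched.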
My sincere thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch via jsoup discussions, or with me directly.
You can also follow me (@jhy@tilde.zone) on Mastodon / Fediverse to receive occasional notes about jsoup releases.