jsoup Java HTML Parser release 1.16.2
2023-Oct-20
jsoup 1.16.2 is out now with faster CSS selector execution via a cost-based query planner, better support for math and svg elements, and a bunch of other improvements and bug fixes.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Download jsoup now.
Improvements
- Optimized the performance of complex CSS selectors, by adding a cost-based query planner. Evaluators are sorted by their relative execution cost, and executed in order of lower to higher cost. This speeds the matching process by ensuring that simpler evaluations (such as a tag name match) are conducted prior to more complex evaluations (such as an attribute regex, or a deep child scan with a :has).
- Added support for <svg> and <math> tags (and their children). This includes tag namespaces and case preservation on applicable tags and attributes. #2008
- When converting jsoup Documents to W3C Documents in W3CDom, HTML documents will be placed in the http://www.w3.org/1999/xhtml namespace by default, per the HTML5 spec. This can be controlled by setting W3CDom#namespaceAware(boolean false). #1848
- Speed optimized the Structural Evaluators by memoizing previous evaluations. Particularly the ~ (any preceding sibling) and :nth-of-type selectors are improved. #1956
- Tweaked the performance of the Element nextElementSibling, previousElementSibling, firstElementSibling, lastElementSibling, firstElementChild, and lastElementChild methods. They now filter/skip in place in the child-node list, rather than having to allocate and scan a complete Element filtered list.
- Optimized internal methods that previously called Element.children() to use filter/skip child-node list accessors instead, reducing new Element List allocations.
- Tweaked the performance of parsing :pseudo selectors.
- When using the :empty pseudo-selector, blank text nodes are now considered empty. Previously, an element containing any whitespace was not considered empty. #1976
- In forms, <input type="image"> is now excluded from Element.formData() (and hence from form submissions). #2010
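The cost-based ordering described in the first item above can be illustrated with a small, standalone sketch. This is not jsoup's implementation: the class name CostOrderedMatcher, the Step type, and the integer cost values are all hypothetical, chosen only to show why running cheap checks first lets expensive ones be skipped.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only; none of these names are jsoup classes.
final class CostOrderedMatcher<T> {
    // A test paired with an assumed relative execution cost.
    private static final class Step<E> {
        final int cost;
        final Predicate<E> test;

        Step(int cost, Predicate<E> test) {
            this.cost = cost;
            this.test = test;
        }
    }

    private final List<Step<T>> steps = new ArrayList<>();

    // Registers a test and keeps the list ordered from cheapest to costliest.
    CostOrderedMatcher<T> add(int cost, Predicate<T> test) {
        steps.add(new Step<>(cost, test));
        steps.sort(Comparator.comparingInt(s -> s.cost));
        return this;
    }

    // All tests must pass; the first failure stops evaluation,
    // so costlier tests never run for candidates a cheap test rejects.
    boolean matches(T candidate) {
        for (Step<T> step : steps) {
            if (!step.test.test(candidate)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] expensiveCalls = {0};
        CostOrderedMatcher<String> matcher = new CostOrderedMatcher<String>()
            .add(10, s -> { expensiveCalls[0]++; return s.matches(".*\\d+.*"); }) // costly regex
            .add(1, s -> s.startsWith("div"));                                    // cheap prefix check

        System.out.println(matcher.matches("span42")); // false
        System.out.println(expensiveCalls[0]);         // 0: the regex was never run
    }
}
```

Although the regex test is registered first, it is evaluated last, so a candidate rejected by the cheap prefix check never pays for it.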
Bug Fixes
- Bugfix: form elements and empty elements (such as img) did not have their attributes de-duplicated. #1950
- If Document.OutputSettings was cloned from a clone, an NPE would be thrown when used. #1964
- In Jsoup.connect(String url), URL paths containing a %2B were incorrectly recoded to a '+', or a '+' was recoded to a ' '. Fixed by reverting to the previous behavior of not encoding supplied paths, other than normalizing to ASCII. #1952
- In Jsoup.connect(String url), strings containing supplemental characters (e.g. emoji) were not URL escaped correctly.
- In Jsoup.connect(String url), the ConstrainableInputStream would clear Thread interrupts when reading the body. This precluded callers from spawning a thread, running a number of requests for a length of time, then joining that thread after interrupting it. #1991
- When tracking HTML source positions, the closing tags for H1...H6 elements were not tracked correctly. #1987
- In Jsoup.connect(), a DELETE method request did not support a request body. #1972
- When calling Element.cssSelector() on an extremely deeply nested element, a StackOverflowError could occur. Further, a StackOverflowError may occur when running the query. #2001
- Appending a node back to its original Element after empty() would throw an index out of bounds exception. Also, the child nodes that were removed now have their parent node cleared, fully detaching them from the original parent. #2013
- In Connection when adding headers, the value may have been assumed to be an incorrectly decoded ISO_8859_1 string, and re-encoded as UTF-8. The value is now left as-is.
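The interrupt-then-join pattern mentioned in the ConstrainableInputStream fix above depends on blocking code preserving the thread's interrupt status. The following is a minimal sketch of that pattern using only the JDK. The simulatedRequest method is a hypothetical stand-in for a blocking fetch and is not a jsoup call; the timings are arbitrary.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; simulatedRequest stands in for a blocking network read.
final class InterruptibleWorker {
    static void simulatedRequest() {
        try {
            Thread.sleep(20);
        } catch (InterruptedException e) {
            // Restore the interrupt flag instead of swallowing it,
            // so the calling loop can observe the interrupt and stop.
            Thread.currentThread().interrupt();
        }
    }

    // Runs requests on a worker thread for a while, then interrupts and joins it.
    // Returns the number of requests run, or -1 if the worker failed to stop.
    static int runForAWhile(long millis) throws InterruptedException {
        AtomicInteger completed = new AtomicInteger();
        Thread worker = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                simulatedRequest();
                completed.incrementAndGet();
            }
        });
        worker.start();
        Thread.sleep(millis);
        worker.interrupt();
        worker.join(2000); // bounded wait; a cleared interrupt would leave the worker running
        return worker.isAlive() ? -1 : completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("requests run: " + runForAWhile(100));
    }
}
```

If simulatedRequest cleared the interrupt without restoring it, the loop condition would never become true and the join would time out, which is the kind of failure the fix above addresses.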
Changes
- Removed previously deprecated methods Document.normalise(), Element.forEach(org.jsoup.helper.Consumer<>), Node.forEach(org.jsoup.helper.Consumer<>), and the org.jsoup.helper.Consumer interface; the latter being a previously required compatibility shim prior to Android's de-sugaring support.
- The previous compatibility shim org.jsoup.UncheckedIOException is deprecated in favor of the now supported java.io.UncheckedIOException. If you are catching the former, modify your code to catch the latter instead. #1989
- Blocked noscript tags from being added to Safelists, due to incompatibilities between parsers with and without script-mode enabled.
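For the UncheckedIOException change above, the migration is essentially a change of the caught type. The sketch below uses only the JDK to show the shape of a catch block written against java.io.UncheckedIOException. The readBody method is a hypothetical stand-in for any call that may surface an I/O failure as an unchecked exception; it is not a jsoup method.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration; readBody stands in for a call that may fail while reading.
final class UncheckedIoExample {
    static String readBody(boolean fail) {
        if (fail) {
            throw new UncheckedIOException(new IOException("connection reset"));
        }
        return "<p>ok</p>";
    }

    static String readOrFallback(boolean fail) {
        try {
            return readBody(fail);
        } catch (UncheckedIOException e) { // java.io, not org.jsoup
            // getCause() is typed as IOException on java.io.UncheckedIOException.
            IOException cause = e.getCause();
            return "failed: " + cause.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(readOrFallback(false)); // <p>ok</p>
        System.out.println(readOrFallback(true));  // failed: connection reset
    }
}
```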
My sincere thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch with me directly.
You can also follow me (@jhy@tilde.zone) on Mastodon / Fediverse to receive occasional notes about jsoup releases.