jsoup Java HTML Parser release 1.19.1
2025-Mar-04
jsoup 1.19.1 is out now, with support for http/2 network requests, performance improvements, some new API methods, and a host of other improvements and bug fixes.
jsoup is a Java library for working with real-world HTML and XML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Changes
- Added support for http/2 requests in Jsoup.connect(), when running on Java 11+, via the Java HttpClient implementation. #2257
  - In this version of jsoup, the default is to make requests via the HttpURLConnection implementation: use System.setProperty("jsoup.useHttpClient", "true"); to enable making requests via the HttpClient (if available), which will enable http/2 support. This will become the default in a later version of jsoup, so now is a good time to validate it.
  - If you are repackaging the jsoup jar in your deployment (i.e. creating a shaded- or a fat-jar), make sure to specify that as a Multi-Release JAR.
  - If the HttpClient impl is not available in your JRE, requests will continue to be made via HttpURLConnection (in http/1.1 mode).
- Updated the minimum Android API Level validation from 10 to 21. As with previous jsoup versions, Android developers need to enable core library desugaring. The minimum Java version remains Java 8. #2173
- Removed previously deprecated class: org.jsoup.UncheckedIOException (replace with java.io.UncheckedIOException); moved previously deprecated method Element Element#forEach(Consumer) to void Element#forEach(Consumer). #2246
- Deprecated the methods Document#updateMetaCharsetElement(boolean) and Document#updateMetaCharsetElement(), as the setting had no effect. When Document#charset(Charset) is called, the document's meta charset or XML encoding instruction is always set. #2247
Improvements
- When cleaning HTML with a Safelist that preserves relative links, the isValid() method will now consider these links valid. Additionally, the enforced attribute rel=nofollow will only be added to external links when configured in the safelist. #2245
- Added Element#selectStream(String query) and Element#selectStream(Evaluator) methods that return a Stream of matching elements. Elements are evaluated and returned as they are found, and the stream can be terminated early. #2092
- Element objects now implement Iterable, enabling them to be used in enhanced for loops.
- Added support for fragment parsing from a Reader via Parser#parseFragmentInput(Reader, Element, String). #1177
- Reintroduced CLI executable examples, in jsoup-examples.jar. #1702
- Optimized performance of selectors like #id .class (and other similar descendant queries) by around 4.6x, by better balancing the Ancestor evaluator's cost function in the query planner. #2254
- Removed the legacy parsing rules for <isindex> tags, which would autovivify a form element with labels. This is no longer in the spec.
- Added Elements.selectFirst(String cssQuery) and Elements.expectFirst(String cssQuery), to select the first matching element from an Elements list. #2263
- When parsing with the XML parser, XML Declarations and Processing Instructions are directly handled, rather than bouncing through the HTML parser's bogus comment handler. Serialization for non-doctype declarations no longer ends with a spurious !. #2275
- When converting parsed HTML to XML or the W3C DOM, element names containing < are normalized to _ to ensure valid XML. For example, <foo<bar> becomes <foo_bar>, as XML does not allow < in element names, but HTML5 does. #2276
- Reimplemented the HTML5 Adoption Agency Algorithm to the current spec. This handles mis-nested formatting / structural elements. #2278
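The selectStream item above says matches are returned as they are found and that the stream can be terminated early. The sketch below illustrates that behaviour with a plain JDK Stream; it does not call jsoup. Plain strings stand in for elements, and the class LazyStreamSketch and method firstMatch are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in: strings play the role of elements.
final class LazyStreamSketch {
    /** Finds the first name equal to target, counting how many names were tested. */
    static Optional<String> firstMatch(List<String> names, String target, AtomicInteger tested) {
        return names.stream()
            .filter(name -> {
                tested.incrementAndGet(); // record each evaluation
                return name.equals(target);
            })
            .findFirst(); // short-circuits: later names are never tested
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("div", "p", "a", "span", "a");
        AtomicInteger tested = new AtomicInteger();
        Optional<String> hit = firstMatch(names, "a", tested);
        // prints: a after testing 3 of 5
        System.out.println(hit.orElse("none") + " after testing " + tested.get() + " of " + names.size());
    }
}
```

With jsoup on the classpath, the analogous call would be along the lines of element.selectStream(query).findFirst(), which can stop at the first match rather than collecting every match first.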
Bug Fixes
- If an element has a ; in an attribute name, it could not be converted to a W3C DOM element, and so subsequent XPath queries could miss that element. Now, the attribute name is more completely normalized. #2244
- For backwards compatibility, reverted the internal attribute key for doctype names to "name". #2241
- In Connection, skip cookies that have no name, rather than throwing a validation exception. #2242
- When running on JDK 1.8, the error java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer; could be thrown when calling Response#body() after parsing from a URL and the buffer size was exceeded. #2250
- For backwards compatibility, allow null InputStream inputs to Jsoup.parse(InputStream stream, ...), by returning an empty Document. #2252
- A template tag containing an li within an open li would be parsed incorrectly, as it was not recognized as a "special" tag (which have additional processing rules). Also, added the SVG and MathML namespace tags to the list of special tags. #2258
- A template tag containing a button within an open button would be parsed incorrectly, as the "in button scope" check was not aware of the template element. Corrected other instances including MathML and SVG elements, also. #2271
- An :nth-child selector with a negative digit-less step, such as :nth-child(-n+2), would be parsed incorrectly as a positive step, and so would not match as expected. #1147
- Calling doc.charset(charset) on an empty XML document would throw an IndexOutOfBoundsException. #2266
- Fixed a memory leak when reusing a nested StructuralEvaluator (e.g., a selector ancestor chain like A B C) by ensuring cache reset calls cascade to inner members. #2277
- Concurrent calls to doc.clone().append(html) were not supported. When a document was cloned, its Parser was not cloned but was a shallow copy of the original parser. #2281
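To see why the :nth-child fix above matters, here is a small worked example of the an+b rule from CSS selectors: a 1-based child position p matches when a*n + b = p for some integer n >= 0. This is a standalone sketch, not jsoup's implementation; the class NthChildSketch and its methods are hypothetical names. It compares the intended step (a = -1, from -n+2) with the misparsed positive step (a = 1).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of an+b matching; does not use jsoup.
final class NthChildSketch {
    /** True if position == a*n + b for some integer n >= 0 (position is 1-based). */
    static boolean matches(int a, int b, int position) {
        if (a == 0) return position == b;
        int diff = position - b;
        // diff must be a multiple of a, and the resulting n must be non-negative
        return diff % a == 0 && diff / a >= 0;
    }

    /** Lists the matching positions among childCount siblings. */
    static List<Integer> matchingPositions(int a, int b, int childCount) {
        List<Integer> out = new ArrayList<>();
        for (int p = 1; p <= childCount; p++) {
            if (matches(a, b, p)) out.add(p);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(matchingPositions(-1, 2, 5)); // -n+2 prints [1, 2]
        System.out.println(matchingPositions(1, 2, 5));  //  n+2 prints [2, 3, 4, 5]
    }
}
```

With five children, -n+2 selects the first two, while the misparsed n+2 selects everything from the second child onward, so the two readings overlap only at position 2.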
My sincere thanks to everyone who contributed to this release! If you have any suggestions for the next release, I would love to hear them; please get in touch via jsoup discussions, or with me directly.
You can also follow me (@jhy@tilde.zone) on Mastodon / Fediverse to receive occasional notes about jsoup releases.
Download jsoup now.