jsoup Java HTML Parser release 1.21.1
2025-Jun-23
jsoup 1.21.1 is out now, featuring powerful new node selection capabilities that let you target specific DOM nodes like comments and text nodes using CSS selectors, dynamic tag customization through the new TagSet callback system, and improved defense against mutation XSS attacks with simplified attribute escaping. This release also brings HTTP/2 support by default, numerous API improvements for better developer experience, and fixes for several edge-case parsing issues.
jsoup is a Java library for working with real-world HTML and XML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Changes
- Removed previously deprecated methods. #2317
- Deprecated the `:matchText` pseudo-selector due to its side effects on the DOM; use the new `::textnode` selector and the `Element#selectNodes(String css, Class<T> type)` method instead. #2343
- Deprecated `Connection.Response#bufferUp()` in lieu of `Connection.Response#readFully()`, which can throw a checked IOException.
- Deprecated internal methods `Validate#ensureNotNull(Object)` (replaced by typed `Validate#expectNotNull(T)`); protected HTML appenders from Attribute and Node.
- If you happen to be using any of the deprecated methods, please take the opportunity now to migrate away from them, as they will be removed in a future release.
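As a rough migration sketch for the `:matchText` deprecation above, the typed `selectNodes` overload can return text nodes directly, without modifying the DOM. This assumes the `::text` node type from this release's list of supported selector types is the one that matches text nodes; the class name, helper name, and sample HTML are illustrative only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

public class TextNodeSelectSketch {
    // Hypothetical helper: returns the text of each text node found in the parsed document.
    static List<String> textNodeValues(String html) {
        Document doc = Jsoup.parse(html);
        List<String> values = new ArrayList<>();
        // Assumption: "::text" selects TextNode instances; the Class argument types the result.
        for (TextNode node : doc.selectNodes("::text", TextNode.class)) {
            values.add(node.text());
        }
        return values;
    }

    public static void main(String[] args) {
        // Two text nodes separated by a <br>, the kind of case :matchText was often used for.
        System.out.println(textNodeValues("<p>One<br>Two</p>"));
    }
}
```

Unlike `:matchText`, this approach reads the existing text nodes and leaves the document tree untouched.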
Improvements
- Enhanced the `Selector` to support direct matching against nodes such as comments and text nodes. For example, you can now find an element that follows a specific comment: `::comment:contains(prices) + p` will select `p` elements immediately after a `<!-- prices: -->` comment. Supported types include `::node`, `::leafnode`, `::comment`, `::text`, `::data`, and `::cdata`. Node contextual selectors like `::node:contains(text)`, `:matches(regex)`, and `:blank` are also supported. Introduced `Element#selectNodes(String css)` and `Element#selectNodes(String css, Class<T> nodeType)` for direct node selection. #2324
- Added `TagSet#onNewTag(Consumer<Tag> customizer)`: register a callback that’s invoked for each new or cloned Tag when it’s inserted into the set. Enables dynamic tweaks of tag options (for example, marking all custom tags as self-closing, or everything in a given namespace as preserving whitespace). #2330
- Made `TokenQueue` and `CharacterReader` autocloseable, to ensure that they will release their buffers back to the buffer pool, for later reuse.
- Added `Selector#evaluatorOf(String css)`, as a clearer way to obtain an Evaluator from a CSS query. An alias of `QueryParser.parse(String css)`.
- Custom tags (defined via the `TagSet`) in a foreign namespace (e.g. SVG) can be configured to parse as data tags.
- Added `NodeVisitor#traverse(Node)` to simplify node traversal calls (vs. importing `NodeTraversor`).
- Updated the default user-agent string to improve compatibility. #2341
- The HTML parser now allows the specific text-data type (Data, RcData) to be customized for known tags. (Previously, that was only supported on custom tags.) #2326
- Added `Connection.Response#readFully()` as a replacement for `Connection.Response#bufferUp()` with an explicit IOException. Similarly, added `Connection.Response#readBody()` over `Connection.Response#body()`. Deprecated `Connection.Response#bufferUp()`. #2327
- When serializing HTML, the `<` and `>` characters are now escaped in attributes. This helps prevent a class of mutation XSS attacks. #2337
- Changed `Connection` to prefer using the JDK's HttpClient over HttpUrlConnection, if available, to enable HTTP/2 support by default. Users can disable via `-Djsoup.useHttpClient=false`. #2340
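The comment-based selection described in the list above could look roughly like the following. The selector string and the `selectNodes` overload come from the release notes; the class name, helper names, and sample HTML are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.util.List;

public class CommentSelectSketch {
    // Illustrative input: a comment immediately followed by a paragraph.
    static final String HTML = "<div><!-- prices: --><p>$10</p><p>$20</p></div>";

    // Hypothetical helper: p elements that directly follow a comment containing "prices".
    static Elements paragraphsAfterPricesComment(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("::comment:contains(prices) + p");
    }

    // Hypothetical helper: the comment nodes themselves, typed as Comment.
    static List<Comment> comments(String html) {
        return Jsoup.parse(html).selectNodes("::comment", Comment.class);
    }

    public static void main(String[] args) {
        for (org.jsoup.nodes.Element p : paragraphsAfterPricesComment(HTML)) {
            System.out.println(p.text());
        }
        for (Comment c : comments(HTML)) {
            System.out.println(c.getData());
        }
    }
}
```

Because the node type selector takes part in an ordinary combinator, `select` still returns elements here, while `selectNodes` is the entry point when the nodes themselves are wanted.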
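For the `TagSet#onNewTag` callback mentioned in the list above, a minimal sketch of the "mark all custom tags as self-closing" case might look like this. It assumes `Parser#tagSet()`, `Tag#isKnownTag()`, `Tag#set(int)`, and the `Tag.SelfClose` option behave as described in the jsoup API docs; the class name, helper name, and `my-widget` tag are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.parser.Tag;

public class TagSetCallbackSketch {
    // Hypothetical helper: parses HTML with every unknown (custom) tag allowed to self-close.
    static Document parseWithSelfClosingCustomTags(String html) {
        Parser parser = Parser.htmlParser();
        // The callback runs for each new or cloned Tag added to this parser's tag set.
        parser.tagSet().onNewTag(tag -> {
            if (!tag.isKnownTag()) {
                tag.set(Tag.SelfClose); // assumption: option constant for self-closing tags
            }
        });
        return Jsoup.parse(html, "", parser);
    }

    public static void main(String[] args) {
        Document doc = parseWithSelfClosingCustomTags("<my-widget /><p>Hi</p>");
        System.out.println(doc.body().html());
    }
}
```

The `isKnownTag()` check keeps the tweak limited to custom tags, since the callback also fires for the standard HTML tags as they are cloned into the set.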
Bug Fixes
- The contents of a `script` in an `svg` foreign context should be parsed as script data, not text. #2320
- `Tag#isFormSubmittable()` was updating the Tag's options. #2323
- The HTML pretty-printer would incorrectly trim whitespace when text followed an inline element in a block element. #2325
- Custom tags with hyphens or other non-letter characters in their names now work correctly as Data or RcData tags. Their closing tags are now tokenized properly. #2332
- When cloning an Element, the clone would retain the source's cached child Element list (if any), which could lead to incorrect results when modifying the clone's child elements. #2334
My sincere thanks to everyone who contributed to this release! If you have any suggestions for the next release, I would love to hear them; please get in touch via jsoup discussions, or with me directly.
You can also follow me (@jhy@tilde.zone) on Mastodon / Fediverse to receive occasional notes about jsoup releases.
Download jsoup now.