jsoup Java HTML Parser release 1.20.1
2025-Apr-29
jsoup 1.20.1 is out now, featuring improved HTML parse rules to align with modern browsers, improved XML namespace handling, and a redesigned HTML pretty-printer for better consistency and customizability. This release also delivers performance optimizations, new API enhancements such as flexible tag definitions via `TagSet`, concise CSS selectors, and parser thread-safety improvements. Additionally, multiple bug fixes enhance XML serialization and W3C DOM interoperability.
jsoup is a Java library for working with real-world HTML and XML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Changes
- To better follow the HTML5 spec and current browsers, the HTML parser no longer allows self-closing tags (`<foo />`) to close HTML elements by default. Foreign content (SVG, MathML), and content parsed with the XML parser, still supports self-closing tags. If you need specific HTML tags to support self-closing, you can register a custom tag via the `TagSet` configured in `Parser.tagSet()`, using `Tag#set(Tag.SelfClose)`. Standard void tags (such as `<img>`, `<br>`, etc.) continue to behave as usual and are not affected by this change. #2300.
- The following internal components have been deprecated. If you do happen to be using any of these, please take the opportunity now to migrate away from them, as they will be removed in jsoup 1.21.1. `ChangeNotifyingArrayList`, `Document.updateMetaCharsetElement()`, `Document.updateMetaCharsetElement(boolean)`, `HtmlTreeBuilder.isContentForTagData(String)`, `Parser.isContentForTagData(String)`, `Parser.setTreeBuilder(TreeBuilder)`, `Tag.formatAsBlock()`, `Tag.isFormListed()`, `TokenQueue.addFirst(String)`, `TokenQueue.chompTo(String)`, `TokenQueue.chompToIgnoreCase(String)`, `TokenQueue.consumeToIgnoreCase(String)`, `TokenQueue.consumeWord()`, `TokenQueue.matchesAny(String...)`
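As a rough illustration of the self-closing change above, the sketch below opts a single custom tag back into self-closing behavior. The class name `SelfClosingTagExample` and the helper `parseWithSelfClosing` are hypothetical names for this example, and it assumes that `TagSet.valueOf(String, String)` creates and registers the tag when it is not already known.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.parser.Tag;
import org.jsoup.parser.TagSet;

public class SelfClosingTagExample {
    // Hypothetical helper: parses html with an HTML parser whose TagSet
    // allows the named tag to be closed with a trailing slash.
    static Document parseWithSelfClosing(String html, String tagName) {
        Parser parser = Parser.htmlParser();
        TagSet tags = parser.tagSet();
        // Assumption: valueOf returns the existing tag, or creates and adds a new one.
        tags.valueOf(tagName, Parser.NamespaceHtml).set(Tag.SelfClose);
        return Jsoup.parse(html, "", parser);
    }

    public static void main(String[] args) {
        Document doc = parseWithSelfClosing("<custom />after", "custom");
        // With SelfClose set, the text should be a sibling of <custom>, not a child.
        System.out.println(doc.body().html());
    }
}
```

Because the option is set on a per-parser `TagSet`, other parsers keep the default behavior.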
Functional Improvements
- Rebuilt the HTML pretty-printer, to simplify and consolidate the implementation, improve consistency, support custom Tags, and provide a cleaner path for ongoing improvements. The specific HTML produced by the pretty-printer may be different from previous versions. #2286.
- Added the ability to define custom tags, and to modify properties of known tags, via the `TagSet` tag collection. Their properties can impact both the parse and how content is serialized (output as HTML or XML). #2285.
- `Element.cssSelector()` will prefer to return shorter selectors by using ancestor IDs when available and unique. E.g. `#id > div > p` instead of `html > body > div > div > p`. #2283.
- Added `Elements.deselect(int index)`, `Elements.deselect(Object o)`, and `Elements.deselectAll()` methods to remove elements from the `Elements` list without removing them from the underlying DOM. Also added `Elements.asList()` method to get a modifiable list of elements without affecting the DOM. (Individual Elements remain linked to the DOM.) #2100.
- Added support for sending a request body from an InputStream with `Connection.requestBodyStream(InputStream stream)`. #1122.
- The XML parser now supports scoped xmlns: prefix namespace declarations, and applies the correct namespace to Tags and Attributes. Also, added `Tag#prefix()`, `Tag#localName()`, `Attribute#prefix()`, `Attribute#localName()`, and `Attribute#namespace()` to retrieve these. #2299.
- CSS identifiers are now escaped and unescaped correctly to the CSS spec. `Element#cssSelector()` will emit appropriately escaped selectors, and the QueryParser supports those. Added `Selector.escapeCssIdentifier()` and `Selector.unescapeCssIdentifier()`. #2297, #2305.
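To make the `Elements.deselect` behavior described above concrete, here is a minimal sketch. The class `DeselectExample` and the method `deselectFirst` are hypothetical names for this example. It drops the first match from a selection and then re-queries the document to show that the DOM itself is unchanged.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class DeselectExample {
    // Hypothetical helper: returns { size of the selection after deselect,
    // number of matches still present in the DOM }.
    static int[] deselectFirst(String html, String query) {
        Document doc = Jsoup.parse(html);
        Elements selected = doc.select(query);
        // Removes the element from the list only; it stays attached to the document.
        selected.deselect(0);
        return new int[] { selected.size(), doc.select(query).size() };
    }

    public static void main(String[] args) {
        int[] sizes = deselectFirst("<p>One</p><p>Two</p><p>Three</p>", "p");
        System.out.println("selection: " + sizes[0] + ", DOM: " + sizes[1]);
    }
}
```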
Structure and Performance Improvements
- Refactored the CSS `QueryParser` into a clearer recursive descent parser. #2310.
- CSS selectors with consecutive combinators (e.g. `div >> p`) will throw an explicit parse exception. #2311.
- Performance: reduced the shallow size of an Element from 40 to 32 bytes, and the NodeList from 32 to 24. #2307.
- Performance: reduced GC load of new StringBuilders when tokenizing input HTML. #2304.
- Made `Parser` instances threadsafe, so that inadvertent use of the same instance across threads will not lead to errors. For actual concurrency, use `Parser#newInstance()` per thread. #2314.
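One way to follow the per-thread advice above is sketched below: a shared, configured base parser is copied with `Parser#newInstance()` inside each worker task. The class `ParserPerThreadExample`, the method `parseTitles`, and the pool size of 4 are assumptions made for this example and are not part of jsoup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.jsoup.parser.Parser;

public class ParserPerThreadExample {
    // Hypothetical helper: parses each page on a worker thread and returns
    // the document titles in input order.
    static List<String> parseTitles(List<String> pages) {
        Parser base = Parser.htmlParser(); // shared configuration
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String html : pages) {
                futures.add(pool.submit(() -> {
                    // Each task works on its own copy of the parser.
                    Parser parser = base.newInstance();
                    return parser.parseInput(html, "").title();
                }));
            }
            List<String> titles = new ArrayList<>();
            for (Future<String> future : futures) {
                titles.add(future.get());
            }
            return titles;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Parsing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseTitles(List.of("<title>A</title>", "<title>B</title>")));
    }
}
```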
Bug Fixes
- Element names containing characters invalid in XML are now normalized to valid XML names when serializing. #1496.
- When serializing to XML, characters that are invalid in XML 1.0 are now removed (not encoded). #1743.
- When converting a `Document` to the W3C DOM in `W3CDom`, elements with an attribute in an undeclared namespace now get a declaration of `xmlns:prefix="undefined"`. This allows subsequent serialization to XML via `W3CDom.asString()` to succeed. #2087.
- The `StreamParser` could emit the final elements of a document twice, due to how `onNodeCompleted` was fired when closing out the stack. #2295.
- When parsing with the XML parser and error tracking enabled, the trailing `?` in `<?xml version="1.0"?>` would incorrectly emit an error. #2298.
- Calling `Element#cssSelector()` on an element with combining characters in the class or ID now produces the correct output. #1984.
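The W3C DOM fix above can be exercised with a sketch along these lines. The class `W3cNamespaceExample`, the method `toXml`, and the `epub:type` attribute are illustrative choices for this example. The input uses a prefixed attribute whose namespace is never declared, which is the case the fix addresses.

```java
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;

public class W3cNamespaceExample {
    // Hypothetical helper: parses HTML with jsoup, converts it to a W3C DOM,
    // and serializes that DOM as XML.
    static String toXml(String html) {
        org.jsoup.nodes.Document jsoupDoc = Jsoup.parse(html);
        org.w3c.dom.Document w3cDoc = W3CDom.convert(jsoupDoc);
        return W3CDom.asString(w3cDoc, W3CDom.OutputXml());
    }

    public static void main(String[] args) {
        // The epub prefix has no xmlns:epub declaration in the input.
        System.out.println(toXml("<p epub:type=\"cover\">Hi</p>"));
    }
}
```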
My sincere thanks to everyone who contributed to this release! If you have any suggestions for the next release, I would love to hear them; please get in touch via jsoup discussions, or with me directly.
You can also follow me (@jhy@tilde.zone) on Mastodon / Fediverse to receive occasional notes about jsoup releases.
Download jsoup now.