jsoup Java HTML Parser release 1.20.1

2025-Apr-29

jsoup 1.20.1 is out now, featuring improved HTML parse rules to align with modern browsers, improved XML namespace handling, and a redesigned HTML pretty-printer for better consistency and customizability. This release also delivers performance optimizations, new API enhancements such as flexible tag definitions via TagSet, concise CSS selectors, and parser thread-safety improvements. Additionally, multiple bug fixes enhance XML serialization and W3C DOM interoperability.

jsoup is a Java library for working with real-world HTML and XML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
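As a quick illustration of that API, here is a minimal sketch that parses an HTML string and pulls out links with a CSS selector. The class name LinkExtractor and the method linkTexts are hypothetical names chosen for this example; the jsoup calls used (Jsoup.parse, Element#select, Element#text, Element#attr) are the library's long-standing public API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
public class LinkExtractor {
    /** Returns "text -> href" for every anchor that carries an href attribute. */
    public static List<String> linkTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> out = new ArrayList<>();
        // "a[href]" matches <a> elements that have an href attribute.
        for (Element link : doc.select("a[href]")) {
            out.add(link.text() + " -> " + link.attr("href"));
        }
        return out;
    }
}
```

Anchors without an href are skipped by the selector, so no null or empty-attribute check is needed in the loop.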

Changes

Functional Improvements

Structure and Performance Improvements

  • Refactored the CSS QueryParser into a clearer recursive descent parser. #2310.
  • CSS selectors with consecutive combinators (e.g. div >> p) now throw an explicit parse exception. #2311.
  • Performance: reduced the shallow size of an Element from 40 to 32 bytes, and the NodeList from 32 to 24. #2307.
  • Performance: reduced GC load of new StringBuilders when tokenizing input HTML. #2304.
  • Made Parser instances thread-safe, so that inadvertent use of the same instance across threads will not lead to errors. For actual concurrency, use Parser#newInstance() per thread. #2314.
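One way to follow the Parser#newInstance() advice is to keep a configured template parser and hand each worker thread its own copy. The sketch below does this with a ThreadLocal. The class PerThreadParser, the pool size of 4, and the titles method are assumptions made for the example, not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class: one Parser copy per worker thread.
public class PerThreadParser {
    // Shared template; configure it once (error tracking, settings, ...) if needed.
    private static final Parser TEMPLATE = Parser.htmlParser();

    // Each thread lazily receives its own copy of the template.
    private static final ThreadLocal<Parser> PARSER =
        ThreadLocal.withInitial(TEMPLATE::newInstance);

    static String titleOf(String html) {
        Document doc = Jsoup.parse(html, "", PARSER.get());
        return doc.title();
    }

    /** Parses the pages concurrently and returns their titles in input order. */
    public static List<String> titles(List<String> pages) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // pool size is arbitrary
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String page : pages) {
                futures.add(pool.submit(() -> titleOf(page)));
            }
            List<String> out = new ArrayList<>();
            for (Future<String> f : futures) {
                out.add(f.get());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Sharing a single instance would no longer produce errors, per the note above, but per-thread copies are what the release notes recommend for real concurrency.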

Bug Fixes

  • Element names containing characters invalid in XML are now normalized to valid XML names when serializing. #1496.
  • When serializing to XML, characters that are invalid in XML 1.0 are now removed (not encoded). #1743.
  • When converting a Document to the W3C DOM in W3CDom, elements with an attribute in an undeclared namespace now get a declaration of xmlns:prefix="undefined". This allows subsequent serialization to XML via W3CDom.asString() to succeed. #2087.
  • The StreamParser could emit the final elements of a document twice, due to how onNodeCompleted was fired when closing out the stack. #2295.
  • When parsing with the XML parser and error tracking enabled, the trailing ? in <?xml version="1.0"?> would incorrectly emit an error. #2298.
  • Calling Element#cssSelector() on an element with combining characters in the class or ID now produces the correct output. #1984.
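To make "characters that are invalid in XML 1.0" concrete: the XML 1.0 specification defines the allowed set in its Char production (tab, line feed, carriage return, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF). The following is a standalone sketch of removing everything outside that set. It uses only the JDK and is not jsoup's implementation; XmlChars, isValidXml10, and stripInvalid are hypothetical names.

```java
// Hypothetical illustration of the XML 1.0 Char production; not jsoup's code.
public class XmlChars {
    /** True if the code point is allowed by the XML 1.0 Char production. */
    public static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every code point that is invalid in XML 1.0 dropped. */
    public static String stripInvalid(String in) {
        StringBuilder out = new StringBuilder(in.length());
        // codePoints() reports unpaired surrogates as their own values,
        // which fall outside the allowed ranges and are dropped as well.
        in.codePoints()
          .filter(XmlChars::isValidXml10)
          .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

For example, XmlChars.stripInvalid("a\0b\bc") drops the NUL and backspace characters and returns "abc", while tabs and newlines pass through unchanged. Removal is the only option for such characters because XML 1.0 also forbids them as numeric character references.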

My sincere thanks to everyone who contributed to this release! If you have any suggestions for the next release, I would love to hear them; please get in touch via jsoup discussions, or with me directly.

You can also follow me (@jhy@tilde.zone) on Mastodon / Fediverse to receive occasional notes about jsoup releases.

Download jsoup now.