jsoup Java HTML Parser release 1.15.4

2023-Feb-18

jsoup 1.15.4 is out now. It includes a number of improvements, particularly to HTML pretty-printing, along with several bug fixes.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.

Download jsoup now.

Improvements

  • Added the ability to escape CSS selectors (tags, IDs, classes) to match elements that don't follow regular CSS syntax. For example, to match by classname <p class="one.two">, use document.select("p.one\\.two"); #838
  • When pretty-printing, wrap text that follows a <br> tag. #1858
  • When pretty-printing, normalize newlines that follow self-closing tags in custom tags. #1852
  • When pretty-printing, collapse non-significant whitespace between a block and an inline tag. #1802
  • In Element.forEach() and Node.forEachNode(), use java.util.function.Consumer instead of the previous Android compatibility shim org.jsoup.helper.Consumer. As a result, the latter has been deprecated. #1870
  • Added a new method Document.forms(), to conveniently retrieve a List<FormElement> containing the <form> elements in a document.
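
A rough sketch of three of the improvements above (escaped selectors, Element.forEach() with a java.util.function.Consumer, and Document.forms()), assuming jsoup 1.15.4 or later is on the classpath. The class name, the sample HTML, and the helper methods are made up for illustration and are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.FormElement;

import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ImprovementsSketch {
    // Hypothetical sample markup, for illustration only.
    static final String HTML =
        "<p class=\"one.two\">Dotted</p>"
        + "<p class=\"one\">Plain</p>"
        + "<form id=\"login\" action=\"/login\"><input name=\"user\" value=\"jhy\"></form>"
        + "<form id=\"search\" action=\"/search\"><input name=\"q\"></form>";

    // Counts the <p> elements whose class is literally "one.two".
    // The dot is escaped in the selector, and the backslash is
    // doubled because it sits inside a Java string literal.
    static int countEscapedMatches() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("p.one\\.two").size();
    }

    // Visits the body and each of its descendant elements,
    // using a standard java.util.function.Consumer.
    static List<String> tagNames() {
        Document doc = Jsoup.parse(HTML);
        List<String> names = new ArrayList<>();
        Consumer<Element> collector = el -> names.add(el.tagName());
        doc.body().forEach(collector);
        return names;
    }

    // Collects the id of each <form> returned by Document.forms().
    static List<String> formIds() {
        Document doc = Jsoup.parse(HTML);
        List<String> ids = new ArrayList<>();
        for (FormElement form : doc.forms()) {
            ids.add(form.id());
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println("Escaped matches: " + countEscapedMatches());
        System.out.println("Tags visited: " + tagNames());
        System.out.println("Form ids: " + formIds());
    }
}
```

Declaring the Consumer explicitly, rather than passing a bare lambda, makes it plain that the java.util.function type is the one in use.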

Bug Fixes

  • URLs containing characters such as [ and ] were not escaped correctly, and would throw a MalformedURLException when fetched. #1873
  • Element.cssSelector() would create invalid selectors for elements where the tag name, ID, or class names needed to be escaped (e.g. if a class name contained a : or .). #1742
  • If a Node or an Element was replaced with itself, that node would incorrectly be orphaned. #1843
  • Form data on a previous request was copied to a new request in newRequest(), so form data accumulated across multi-step form submissions, or was incorrectly sent with later requests. Now, newRequest() copies only session-related settings (cookies, proxy settings, user-agent, etc.), and not the request data or the body. #1778
  • Fixed an issue in Safelist.removeAttributes() which could throw a ConcurrentModificationException when using the :all pseudo-attribute.
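
To make a few of these fixes concrete, here is a minimal sketch that checks the corrected behaviour without touching the network, assuming jsoup 1.15.4 or later is on the classpath. The class name, sample HTML, and method names are hypothetical. Earlier versions may behave differently, as described in the list above.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class BugFixSketch {
    // Hypothetical sample markup, for illustration only.
    static final String HTML =
        "<div><p class=\"one.two\">Dotted</p><p class=\"plain\">Plain</p></div>";

    // Each selector generated by cssSelector() should select
    // exactly the element it was generated from (#1742).
    static boolean selectorRoundTrips() {
        Document doc = Jsoup.parse(HTML);
        for (Element p : doc.select("p")) {
            Elements found = doc.select(p.cssSelector());
            if (found.size() != 1 || found.first() != p) {
                return false;
            }
        }
        return true;
    }

    // Replacing a node with itself should leave it attached
    // to its parent (#1843).
    static boolean stillAttachedAfterSelfReplace() {
        Document doc = Jsoup.parse(HTML);
        Element p = doc.selectFirst("p");
        p.replaceWith(p);
        return p.parent() != null;
    }

    // A request created with newRequest() should start with no
    // form data, even if the earlier request had some (#1778).
    static int carriedOverDataCount() {
        Connection session = Jsoup.newSession();
        Connection first = session.newRequest().data("user", "jhy");
        Connection next = first.newRequest();
        return next.request().data().size();
    }

    public static void main(String[] args) {
        System.out.println("Selectors round-trip: " + selectorRoundTrips());
        System.out.println("Still attached: " + stillAttachedAfterSelfReplace());
        System.out.println("Carried-over data items: " + carriedOverDataCount());
    }
}
```

The selector check compares round-trip results rather than the exact selector text, since the string that cssSelector() produces may vary between jsoup versions.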

Changes

  • Deprecated the unused Document.normalise() method. Normalization occurs during the HTML tree construction, and no longer as a distinct phase.


My sincere thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch with me directly.

You can also follow me (@jhy@tilde.zone) on Mastodon / Fediverse to receive occasional notes about jsoup releases.