jsoup Java HTML Parser release 1.15.1

2022-May-15

jsoup 1.15.1 is out now, and includes a number of bug fixes, improvements, and performance enhancements.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.

Download jsoup now.

Changes

  • Change: removed previously deprecated methods and classes (including org.jsoup.safety.Whitelist; use org.jsoup.safety.Safelist instead).
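
For code that still references the removed class, the migration is usually a one-line rename. The sketch below assumes untrusted body HTML is being cleaned with the basic safelist; the class name SafelistMigration and the sample input are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class SafelistMigration {
    // Before 1.15.1 this might have read: Jsoup.clean(untrusted, Whitelist.basic());
    // Whitelist has been removed, so Safelist is used instead.
    static String clean(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    public static void main(String[] args) {
        // Hypothetical input: the script element is not in the basic safelist.
        System.out.println(clean("<p>Hello <script>alert(1)</script><b>world</b></p>"));
    }
}
```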

Improvements

  • Improvement: when converting jsoup Documents to W3C Documents in W3CDom, preserve attribute names that are valid in HTML if the input document is using the HTML syntax. (Previously, names were always coerced using the more restrictive XML syntax.) #1648
  • Improvement: added the :containsWholeText(text) selector, to match against non-normalized Element text. That can be useful when elements can only be distinguished by e.g. specific case, or leading whitespace, etc. #1636
  • Improvement: added Element#wholeOwnText() to retrieve the original (non-normalized) ownText of an Element. Also added the :containsWholeOwnText(text) selector, to match against that. BR elements are now treated as newlines in the wholeText methods. #1636
  • Improvement: added the :matchesWholeText(regex) and :matchesWholeOwnText(regex) selectors, to match against whole (non-normalized, case sensitive) element text and own text, respectively. #1636
  • Improvement: when evaluating an XPath query against a context element, the complete document is now visible to the query, rather than only the context element's sub-tree. This enables queries that reach outside the element (to its parents or siblings), e.g. ancestor-or-self::*. #1652
  • Improvement: allow a maxPaddingWidth on the indent level in OutputSettings when pretty printing. This defaults to 30 to limit the indent level for very deeply nested elements, and may be disabled by setting to -1. #1655
  • Improvement: when cloning a Node or an Element, the clone gets a cloned OwnerDocument containing only that clone, so as to preserve applicable settings, such as the Pretty Print settings. #763
  • Improvement: added a convenience method Jsoup.parse(File). #1693
  • Improvement: in the NodeTraversor, added default implementations for NodeVisitor.tail() and NodeFilter.tail(), so that code using only head() methods can be written as lambdas.
  • Improvement: in NodeTraversor, added support for removing nodes via Node.remove() during NodeVisitor.head(). #1699
  • Improvement: added Node.forEachNode(Consumer<Node>) and Element.forEach(Consumer<Element>) methods, to efficiently traverse the DOM with a functional interface. #1700
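
A few of the additions above can be tried together. The following is a minimal sketch, not an exhaustive tour: the class name ImprovementsSketch, its helper methods, and the sample markup are all hypothetical. It exercises the :containsWholeText selector, Element#wholeOwnText(), Element.forEach with a lambda, and a NodeVisitor written as a lambda that only implements head().

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

public class ImprovementsSketch {
    // Hypothetical sample markup, not taken from the release notes.
    static final String HTML =
        "<div><p>Hello</p><p>hello</p><p id=lines>One<br>Two</p></div>";

    // Counts <p> elements whose non-normalized (case-sensitive) text contains the given text.
    static int countWholeTextMatches(String text) {
        Document doc = Jsoup.parse(HTML);
        return doc.select("p:containsWholeText(" + text + ")").size();
    }

    // Returns the original (non-normalized) own text of the #lines paragraph.
    static String linesWholeOwnText() {
        Document doc = Jsoup.parse(HTML);
        Element lines = doc.getElementById("lines");
        return lines.wholeOwnText();
    }

    // Collects the tag names of body and its descendants via Element.forEach.
    static List<String> tagNames() {
        Document doc = Jsoup.parse(HTML);
        List<String> names = new ArrayList<>();
        doc.body().forEach(el -> names.add(el.tagName()));
        return names;
    }

    // Counts text nodes with a NodeVisitor lambda; tail() uses its default implementation.
    static int countTextNodes() {
        Document doc = Jsoup.parse(HTML);
        int[] count = {0};
        doc.body().traverse((node, depth) -> {
            if (node instanceof TextNode) count[0]++;
        });
        return count[0];
    }

    public static void main(String[] args) {
        System.out.println(countWholeTextMatches("Hello"));
        System.out.println(linesWholeOwnText());
        System.out.println(tagNames());
        System.out.println(countTextNodes());
    }
}
```

In this sketch the case-insensitive :contains(Hello) would match both of the first two paragraphs, whereas :containsWholeText(Hello) distinguishes them by case.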

Bug Fixes

  • Bugfix: boolean attribute names should be case-insensitive, but were not when the parser was configured to preserve case. #1656
  • Bugfix: when reading from SequenceInputStreams across the buffer, the input stream was closed too early, resulting in missed content. #1671
  • Bugfix: a comment with all dashes (<!----->) should not emit a parse error. #1667
  • Bugfix: when throwing a SelectorParseException for an invalid selector, don't try to String.format the input, as that could throw an IllegalFormatException. #1691
  • Bugfix: when serializing HTML with Pretty Print enabled, extraneous whitespace could be added on closing tags, or extra newlines could be added at the end of script blocks. #1688 #1689
  • Bugfix: when copy-creating a Safelist from another, perform a deep-copy of the original's settings, so that changes to the original after creation do not affect the copy. #1763
  • Bugfix [Fuzz]: speed improvement when parsing constructed HTML containing very deep, incorrectly stacked formatting elements with many attributes. #1695
  • Bugfix [Fuzz]: during parsing, a StackOverflowError was possible given crafted HTML with hundreds of nested table elements followed by invalid formatting elements. #1697
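
The Safelist deep-copy fix can be observed with a small experiment. This is only a sketch: the class name SafelistCopySketch and the sample markup are hypothetical. It copies a safelist, then changes the original, and checks that the copy still applies the settings it was created with.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class SafelistCopySketch {
    // Cleans the input with a copy of a safelist whose original is modified after copying.
    static String cleanWithCopy(String html) {
        Safelist original = Safelist.none().addAttributes("a", "title");
        Safelist copy = new Safelist(original);

        // Modify the original after the copy was made; with the deep copy,
        // this should not widen what the copy allows.
        original.addAttributes("a", "class");

        return Jsoup.clean(html, copy);
    }

    public static void main(String[] args) {
        System.out.println(cleanWithCopy("<a title=t class=c>x</a>"));
    }
}
```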

My sincere thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch with me directly.

You can also follow me (@jhy) on Twitter to receive occasional notes about jsoup releases.