jsoup Java HTML Parser release 1.15.1
2022-May-15
jsoup 1.15.1 is out now, and includes a number of bug fixes, improvements, and performance enhancements.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Download jsoup now.
Changes
- Change: removed previously deprecated methods and classes (including org.jsoup.safety.Whitelist; use org.jsoup.safety.Safelist instead).
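For code that still imports the removed class, the migration is usually a straight rename. The following is a minimal sketch, assuming the caller previously passed Whitelist.basic() to Jsoup.clean; the class name SafelistMigration and the sample markup are illustrative and not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class SafelistMigration {
    // Before 1.15.1 this call might have read: Jsoup.clean(untrustedHtml, Whitelist.basic())
    // (assumed prior usage). Safelist offers the same factory methods under the new name.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        // Hypothetical input: the script element is not in the basic safelist.
        String dirty = "<p>Hello <b>world</b><script>alert(1)</script></p>";
        System.out.println(sanitize(dirty));
    }
}
```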
Improvements
- Improvement: when converting jsoup Documents to W3C Documents in W3CDom, preserve HTML valid attribute names if the input document is using the HTML syntax. (Previously, would always coerce using the more restrictive XML syntax.) #1648
- Improvement: added the :containsWholeText(text) selector, to match against non-normalized Element text. That can be useful when elements can only be distinguished by e.g. specific case, or leading whitespace, etc. #1636
- Improvement: added Element#wholeOwnText() to retrieve the original (non-normalized) ownText of an Element. Also added the :containsWholeOwnText(text) selector, to match against that. BR elements are now treated as newlines in the wholeText methods. #1636
- Improvement: added the :matchesWholeText(regex) and :matchesWholeOwnText(regex) selectors, to match against whole (non-normalized, case sensitive) element text and own text, respectively. #1636
- Improvement: when evaluating an XPath query against a context element, the complete document is now visible to the query, vs only the context element's sub-tree. This enables support for queries outside (parent or sibling) the element, e.g. ancestor-or-self::*. #1652
- Improvement: allow a maxPaddingWidth on the indent level in OutputSettings when pretty printing. This defaults to 30 to limit the indent level for very deeply nested elements, and may be disabled by setting to -1. #1655
- Improvement: when cloning a Node or an Element, the clone gets a cloned OwnerDocument containing only that clone, so as to preserve applicable settings, such as the Pretty Print settings. #763
- Improvement: added a convenience method Jsoup.parse(File). #1693
- Improvement: in the NodeTraversor, added default implementations for NodeVisitor.tail() and NodeFilter.tail(), so that code using only head() methods can be written as lambdas.
- Improvement: in NodeTraversor, added support for removing nodes via Node.remove() during NodeVisitor.head(). #1699
- Improvement: added Node.forEachNode(Consumer<Node>) and Element.forEach(Consumer<Element>) methods, to efficiently traverse the DOM with a functional interface. #1700
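Several of the improvements above are easiest to see side by side. The following is a rough sketch, not taken from the jsoup documentation: the class name ImprovementsSketch, its helper methods, and the sample markup are all made up for illustration. It contrasts the case-insensitive :contains selector with the new case-sensitive :containsWholeText, reads wholeOwnText(), and uses the lambda-friendly NodeVisitor and Element.forEach.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

public class ImprovementsSketch {
    // Hypothetical sample markup: two paragraphs whose text differs by case and a line break.
    static final String HTML =
        "<div><p id=one>Hello<br>World</p><p id=two>hello world</p></div>";

    // Number of elements matched by a CSS query against the sample markup.
    static int countMatches(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    // wholeOwnText() returns the non-normalized own text; <br> is rendered as a newline.
    static String wholeOwnTextOf(String id) {
        Element el = Jsoup.parse(HTML).getElementById(id);
        return el.wholeOwnText();
    }

    // NodeVisitor.tail() has a default implementation, so a head-only visitor fits in a lambda.
    static List<String> textNodesViaVisitor() {
        List<String> found = new ArrayList<>();
        Jsoup.parse(HTML).body().traverse((node, depth) -> {
            if (node instanceof TextNode) found.add(((TextNode) node).getWholeText());
        });
        return found;
    }

    // Element.forEach visits the element itself and then its descendant elements.
    static List<String> tagNamesViaForEach() {
        List<String> names = new ArrayList<>();
        Jsoup.parse(HTML).body().forEach(el -> names.add(el.tagName()));
        return names;
    }

    // The pretty-print indent cap on a fresh OutputSettings.
    static int defaultMaxPaddingWidth() {
        return new Document.OutputSettings().maxPaddingWidth();
    }

    public static void main(String[] args) {
        System.out.println(countMatches("p:contains(hello)"));          // case-insensitive
        System.out.println(countMatches("p:containsWholeText(Hello)")); // case-sensitive
        System.out.println(countMatches("p:matchesWholeOwnText(^hello)"));
        System.out.println(wholeOwnTextOf("one"));
        System.out.println(textNodesViaVisitor());
        System.out.println(tagNamesViaForEach());
        System.out.println(defaultMaxPaddingWidth());
    }
}
```

To lift the indent cap for deeply nested output, a call along the lines of doc.outputSettings().maxPaddingWidth(-1) should work, per the note above.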
Bug Fixes
- Bugfix: boolean attribute names should be case-insensitive, but were not when the parser was configured to preserve case. #1656
- Bugfix: when reading from SequenceInputStreams across the buffer, the input stream was closed too early, resulting in missed content. #1671
- Bugfix: a comment with all dashes (<!----->) should not emit a parse error. #1667
- Bugfix: when throwing a SelectorParseException for an invalid selector, don't try to String.format the input, as that could throw an IllegalFormatException. #1691
- Bugfix: when serializing HTML with Pretty Print enabled, extraneous whitespace could be added on closing tags, or extra newlines could be added at the end of script blocks. #1688 #1689
- Bugfix: when copy-creating a Safelist from another, perform a deep-copy of the original's settings, so that changes to the original after creation do not affect the copy. #1763
- Bugfix [Fuzz]: speed improvement when parsing constructed HTML containing very deeply incorrectly stacked formatting elements with many attributes. #1695
- Bugfix [Fuzz]: during parsing, a StackOverflowError was possible given crafted HTML with hundreds of nested table elements followed by invalid formatting elements. #1697
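The Safelist deep-copy fix can be checked with a small experiment. The following is a sketch under assumptions: the class name SafelistCopySketch, the helper methods, and the sample anchor markup are hypothetical. It copies a Safelist, then widens the original, and the copy is expected to keep its narrower rules.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class SafelistCopySketch {
    // Hypothetical input with one attribute allowed up front and one added later.
    static final String HTML = "<a href=\"/x\" title=\"t\">link</a>";

    // Returns { cleaned with the copy, cleaned with the mutated original }.
    static String[] cleanWithCopyAndOriginal() {
        Safelist original = new Safelist().addTags("a").addAttributes("a", "href");
        Safelist copy = new Safelist(original);   // copy-create from the original
        original.addAttributes("a", "title");     // mutate the original after copying

        return new String[] {
            Jsoup.clean(HTML, copy),
            Jsoup.clean(HTML, original)
        };
    }

    public static void main(String[] args) {
        String[] results = cleanWithCopyAndOriginal();
        System.out.println("copy:     " + results[0]);
        System.out.println("original: " + results[1]);
    }
}
```

With the fix in place, only the output cleaned with the original should retain the title attribute.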
My sincere thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch with me directly.
You can also follow me (@jhy) on Twitter to receive occasional notes about jsoup releases.