jsoup Java HTML Parser release 1.15.1
2022-May-15
jsoup 1.15.1 is out now, and includes a number of bug fixes, improvements, and performance enhancements.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Download jsoup now.
Changes
- Change: removed previously deprecated methods and classes (including org.jsoup.safety.Whitelist; use org.jsoup.safety.Safelist instead).
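For code that still referenced the removed Whitelist class, the migration is usually a rename. The following is a minimal sketch, assuming the calling code used one of the preset lists; the class name SafelistMigration and the sample input are illustrative only.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class SafelistMigration {
    // Cleans untrusted HTML with Safelist, the replacement for the removed Whitelist.
    // Safelist.basic() permits simple text formatting tags such as <b> and <p>.
    static String clean(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        // The script element is not on the basic safelist, so it is dropped.
        System.out.println(clean("<p>Hello <script>alert(1)</script><b>world</b></p>"));
    }
}
```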
Improvements
- Improvement: when converting jsoup Documents to W3C Documents in W3CDom, preserve HTML-valid attribute names if the input document is using the HTML syntax. (Previously, names were always coerced using the more restrictive XML syntax.) #1648
- Improvement: added the :containsWholeText(text) selector, to match against non-normalized Element text. That can be useful when elements can only be distinguished by e.g. specific case, or leading whitespace, etc. #1636
- Improvement: added Element#wholeOwnText() to retrieve the original (non-normalized) ownText of an Element. Also added the :containsWholeOwnText(text) selector, to match against that. BR elements are now treated as newlines in the wholeText methods. #1636
- Improvement: added the :matchesWholeText(regex) and :matchesWholeOwnText(regex) selectors, to match against whole (non-normalized, case sensitive) element text and own text, respectively. #1636
- Improvement: when evaluating an XPath query against a context element, the complete document is now visible to the query, rather than only the context element's sub-tree. This enables queries that reach outside the element (to a parent or sibling), e.g. ancestor-or-self::*. #1652
- Improvement: allow a maxPaddingWidth on the indent level in OutputSettings when pretty printing. This defaults to 30 to limit the indent level for very deeply nested elements, and may be disabled by setting to -1. #1655
- Improvement: when cloning a Node or an Element, the clone gets a cloned OwnerDocument containing only that clone, so as to preserve applicable settings, such as the Pretty Print settings. #763
- Improvement: added a convenience method Jsoup.parse(File). #1693
- Improvement: in the NodeTraversor, added default implementations for NodeVisitor.tail() and NodeFilter.tail(), so that code using only head() methods can be written as lambdas.
- Improvement: in NodeTraversor, added support for removing nodes via Node.remove() during NodeVisitor.head(). #1699
- Improvement: added Node.forEachNode(Consumer<Node>) and Element.forEach(Consumer<Element>) methods, to efficiently traverse the DOM with a functional interface. #1700
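As a rough illustration of a few of the improvements above, the sketch below contrasts the existing :contains selector with the new :containsWholeText selector, reads wholeOwnText(), and walks the DOM with forEachNode and a lambda. The class name WholeTextDemo and the sample HTML are assumptions made for this example, not part of the release.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.TextNode;

import java.util.concurrent.atomic.AtomicInteger;

public class WholeTextDemo {
    // Illustrative input: two paragraphs that differ only by case and leading whitespace.
    static final String HTML = "<div><p>  Hello</p><p>hello<br>there</p></div>";

    static Document parse() {
        return Jsoup.parse(HTML);
    }

    // :contains() works on normalized text and ignores case, so both paragraphs match.
    static int normalizedMatches() {
        return parse().select("p:contains(hello)").size();
    }

    // :containsWholeText() compares against the raw text, so it is case sensitive.
    static int wholeTextMatches() {
        return parse().select("p:containsWholeText(Hello)").size();
    }

    // wholeOwnText() keeps the original whitespace; ownText() would normalize and trim it.
    static String firstWholeOwnText() {
        return parse().selectFirst("p").wholeOwnText();
    }

    // forEachNode() takes a Consumer, so a traversal can be written as a lambda.
    static int countTextNodes() {
        AtomicInteger count = new AtomicInteger();
        parse().body().forEachNode(node -> {
            if (node instanceof TextNode) count.incrementAndGet();
        });
        return count.get();
    }

    public static void main(String[] args) {
        System.out.println("contains matches: " + normalizedMatches());
        System.out.println("containsWholeText matches: " + wholeTextMatches());
        System.out.println("wholeOwnText: [" + firstWholeOwnText() + "]");
        System.out.println("text nodes: " + countTextNodes());
    }
}
```

The whole-text selectors are most useful when the normalized text of two elements is identical and only case or whitespace tells them apart, as in the two paragraphs here.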
Bug Fixes
- Bugfix: boolean attribute names should be case-insensitive, but were not when the parser was configured to preserve case. #1656
- Bugfix: when reading from SequenceInputStreams across the buffer, the input stream was closed too early, resulting in missed content. #1671
- Bugfix: a comment with all dashes (<!----->) should not emit a parse error. #1667
- Bugfix: when throwing a SelectorParseException for an invalid selector, don't try to String.format the input, as that could throw an IllegalFormatException. #1691
- Bugfix: when serializing HTML with Pretty Print enabled, extraneous whitespace may be added on closing tags, or extra newlines may be added at the end of script blocks. #1688 #1689
- Bugfix: when copy-creating a Safelist from another, perform a deep-copy of the original's settings, so that changes to the original after creation do not affect the copy. #1763
- Bugfix [Fuzz]: speed improvement when parsing constructed HTML containing very deeply incorrectly stacked formatting elements with many attributes. #1695
- Bugfix [Fuzz]: during parsing, a StackOverflowError was possible given crafted HTML with hundreds of nested table elements followed by invalid formatting elements. #1697
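The Safelist deep-copy fix above can be checked with a small sketch like the following. It assumes a jsoup version that includes the fix; the class name SafelistCopyDemo, the chosen tags and attributes, and the sample link are illustrative only.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class SafelistCopyDemo {
    static String cleanWithCopy(String html) {
        // Start from an empty safelist that allows only <a href>.
        Safelist original = Safelist.none().addAttributes("a", "href");

        // Copy-create a second safelist from the first.
        Safelist copy = new Safelist(original);

        // Loosen the original after the copy was made. With the deep copy in place,
        // this later change should not leak into the copy.
        original.addAttributes("a", "title");

        return Jsoup.clean(html, copy);
    }

    public static void main(String[] args) {
        // The title attribute is expected to be stripped, because only the original allows it.
        System.out.println(cleanWithCopy("<a href=\"https://example.com/\" title=\"x\">link</a>"));
    }
}
```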
My sincere thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch with me directly.
You can also follow me (@jhy) on Twitter to receive occasional notes about jsoup releases.