jsoup Java HTML Parser release 1.13.1
2020-Feb-29
jsoup 1.13.1 is out now, with significantly improved parse speed over 1.12.x, new features in Selectors, an important bugfix for those experiencing Mark Invalid exceptions, and many other improvements.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Download jsoup now.
Improvements
- Improvement: added Element.closest(), which walks up the tree to find the nearest element matching the selector.
- Improvement: memory optimizations, reducing the retained size of a Document by ~39%, and allocations by ~9%:
  1. The Attributes holder in each Element is only created if the element has attributes
  2. Only track the baseUri in an element when it is set via DOM to a new value for a given tree
  3. After parsing, do not retain the input character reader (and associated buffers) in the Document.parser
- Improvement: substantial parse speed improvements vs 1.12.x (bringing performance back on par with earlier releases).
- Improvement: when pretty-printing, comments in inline tags are not pushed to a newline
- Improvement: added Attributes.hasDeclaredValueForKey() and Attributes.hasDeclaredValueForKeyIgnoreCase(), to check if an attribute is set but has no value. Useful in place of the deprecated and removed BooleanAttribute class and instanceof test.
- Improvement: removed old methods and classes that were marked deprecated in previous releases.
- Improvement: added Element.select(Evaluator) and Element.selectFirst(Evaluator), to allow re-use of a parsed CSS selector if using the same evaluator many times.
- Improvement: added Elements.forms(), Elements.textNodes(), Elements.dataNodes(), and Elements.comments(), as a convenient way to get access to these node types directly from an element selection.
- Improvement: preserve whitespace before html and head tag, if pretty-printing is off.
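As a rough illustration of a few of the additions above, here is a minimal sketch that exercises Element.closest(), a reused Evaluator, Attributes.hasDeclaredValueForKey(), and Elements.comments(). The class name SelectorSketch, its helper methods, and the sample markup are hypothetical and exist only for this example. The Evaluator is built with QueryParser.parse(), which is one way to obtain a parsed selector.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Evaluator;
import org.jsoup.select.QueryParser;

public class SelectorSketch {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<div class=card><ul>"
      + "<li><a href='/a'>One</a></li>"
      + "<li><a href='/b' download>Two</a></li>"
      + "</ul><!-- note --></div>";

    static Document doc() {
        return Jsoup.parse(HTML);
    }

    // closest(): start at the link and walk up to the nearest match.
    static String closestCardTag() {
        Element link = doc().selectFirst("a");
        Element card = link.closest("div.card");
        return card == null ? null : card.tagName();
    }

    // Parse the selector once, then reuse the Evaluator for every list item.
    static int countLinksPerItem() {
        Evaluator links = QueryParser.parse("a[href]");
        int total = 0;
        for (Element li : doc().select("li")) {
            total += li.select(links).size();
        }
        return total;
    }

    // The download attribute is present but was written without a value.
    static boolean downloadHasValue() {
        Element link = doc().selectFirst("a[download]");
        return link.attributes().hasDeclaredValueForKey("download");
    }

    // comments(): comment nodes that are direct children of the selection.
    static int commentCount() {
        return doc().select("div.card").comments().size();
    }

    public static void main(String[] args) {
        System.out.println(closestCardTag());
        System.out.println(countLinksPerItem());
        System.out.println(downloadHasValue());
        System.out.println(commentCount());
    }
}
```

Reusing one Evaluator avoids re-parsing the selector string on each call, which is where the benefit would show up when the same query runs across many elements or documents.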
Bug Fixes
- Bugfix: in a <select> tag, a second <optgroup> would not automatically close an earlier open <optgroup>
- Bugfix: in CharacterReader when parsing an input stream, could throw a Mark Invalid exception if the reader was marked, a bufferUp occurred, and then the reader was rewound.
- Bugfix: empty tags and form tags did not have their attributes normalized (lower-cased by default)
- Bugfix: when preserve case was set to on, the HTML pretty-print formatter didn't indent capitalized tags correctly.
- Bugfix: ensure that script and style contents are parsed into DataNodes, not TextNodes, when in case-sensitive parse mode.
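Two of these fixes can be checked with a short sketch like the one below, assuming jsoup 1.13.1 or later is on the classpath. The class name BugfixSketch, its methods, and the input markup are hypothetical. The case-sensitive parser is configured here with ParseSettings.preserveCase, which is one way to turn on case preservation.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.ParseSettings;
import org.jsoup.parser.Parser;

public class BugfixSketch {
    // The second <optgroup> should close the first, so both end up
    // as direct children of the <select>.
    static int topLevelOptgroups() {
        String html = "<select>"
            + "<optgroup label=A><option>1"
            + "<optgroup label=B><option>2"
            + "</select>";
        Document doc = Jsoup.parse(html);
        return doc.select("select > optgroup").size();
    }

    // In case-preserving mode, script contents should be held in a DataNode.
    static int scriptDataNodes() {
        Parser parser = Parser.htmlParser().settings(ParseSettings.preserveCase);
        Document doc = Jsoup.parse("<SCRIPT>var a = 1 < 2;</SCRIPT>", "", parser);
        Element script = doc.selectFirst("script");
        return script.dataNodes().size();
    }

    public static void main(String[] args) {
        System.out.println(topLevelOptgroups());
        System.out.println(scriptDataNodes());
    }
}
```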
My sincere thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch with me directly.
You can also follow me (@jhy) on Twitter to receive occasional notes about jsoup releases.