jsoup Java HTML Parser release 1.13.1

2020-Feb-29

jsoup 1.13.1 is out now, with significantly improved parse speed over 1.12.x, new features in Selectors, an important bugfix for those experiencing Mark Invalid exceptions, and many other improvements.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.

Download jsoup now.

Improvements

  • Improvement: added Element.closest(), which walks up the tree to find the nearest element matching the selector.
  • Improvement: memory optimizations, reducing the retained size of a Document by ~39% and allocations by ~9%:
      1. The Attributes holder in an Element is only created if the element has attributes.
      2. The baseUri is only tracked in an element when it is set via the DOM to a new value for a given tree.
      3. After parsing, the input character reader (and associated buffers) is no longer retained in the Document.parser.
  • Improvement: substantial parse speed improvements vs 1.12.x (bringing performance back on par with earlier releases).
  • Improvement: when pretty-printing, comments in inline tags are no longer pushed to a new line.
  • Improvement: added Attributes.hasDeclaredValueForKey() and Attributes.hasDeclaredValueForKeyIgnoreCase(), to check if an attribute is set but has no value. Useful in place of the deprecated and removed BooleanAttribute class and instanceof test.
  • Improvement: removed old methods and classes that were marked deprecated in previous releases.
  • Improvement: added Element.select(Evaluator) and Element.selectFirst(Evaluator), to allow re-use of a parsed CSS selector if using the same evaluator many times.
  • Improvement: added Elements.forms(), Elements.textNodes(), Elements.dataNodes(), and Elements.comments(), as a convenient way to get access to these node types directly from an element selection.
  • Improvement: preserve whitespace before the html and head tags, if pretty-printing is off.
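
A minimal sketch of how some of the new methods listed above might be used together, assuming jsoup 1.13.1 or later is on the classpath. The class name NewApiSketch, the sample markup, and the element ids are hypothetical and chosen only for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
import org.jsoup.select.Evaluator;
import org.jsoup.select.QueryParser;

import java.util.List;

public class NewApiSketch {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<div class=\"card\"><p>Hello <b id=\"x\">there</b></p>"
        + "<input id=\"bare\" checked><input id=\"valued\" checked=\"yes\"></div>";

    static Document parse() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = parse();

        // Element.closest(): walk up from <b> to the nearest element matching the selector.
        Element card = doc.getElementById("x").closest("div.card");
        System.out.println(card.className());

        // Parse the selector once, then reuse the Evaluator across calls.
        Evaluator checkedInputs = QueryParser.parse("input[checked]");
        Elements checked = doc.select(checkedInputs);
        Element first = doc.selectFirst(checkedInputs);
        System.out.println(checked.size() + " matches, first is #" + first.id());

        // Attributes.hasDeclaredValueForKey(): tells `checked` apart from `checked="yes"`.
        for (Element input : checked) {
            boolean hasValue = input.attributes().hasDeclaredValueForKey("checked");
            System.out.println(input.id() + " has a declared value: " + hasValue);
        }

        // Elements.textNodes(): text nodes that are direct children of the selected elements.
        List<TextNode> texts = doc.select("p").textNodes();
        System.out.println(texts.size() + " direct text node(s) in <p>");
    }
}
```

Reusing a parsed Evaluator is mainly worthwhile when the same query runs against many elements or documents; for one-off queries the String overloads are simpler.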

Bug Fixes

  • Bugfix: in a <select> tag, a second <optgroup> would not automatically close an earlier open <optgroup>.
  • Bugfix: CharacterReader could throw a Mark Invalid exception when parsing an input stream, if the reader was marked, a bufferUp occurred, and then the reader was rewound.
  • Bugfix: empty tags and form tags did not have their attributes normalized (lower-cased by default).
  • Bugfix: when preserve case was set to on, the HTML pretty-print formatter didn't indent capitalized tags correctly.
  • Bugfix: ensure that script and style contents are parsed into DataNodes, not TextNodes, when in case-sensitive parse mode.
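
Two of the fixes above can be checked with a short sketch like the following, assuming jsoup 1.13.1 or later. The class name BugfixSketch and the sample markup are hypothetical; the results described in the comments are what the fixed behavior should produce, not output captured from a run.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.DataNode;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.ParseSettings;
import org.jsoup.parser.Parser;

public class BugfixSketch {
    // Counts <optgroup> elements that are direct children of <select>.
    // With the fix, the second <optgroup> closes the first, so both are siblings.
    static int optgroupSiblings() {
        String html = "<select><optgroup label=\"a\"><option>1"
                    + "<optgroup label=\"b\"><option>2</select>";
        Document doc = Jsoup.parse(html);
        return doc.select("select > optgroup").size();
    }

    // Parses in case-preserving mode and checks that the script body is a DataNode.
    static boolean scriptIsDataNode() {
        Parser parser = Parser.htmlParser().settings(ParseSettings.preserveCase);
        Document doc = Jsoup.parse("<SCRIPT>var ok = true;</SCRIPT>", "", parser);
        Element script = doc.selectFirst("script");
        return script != null
            && script.childNodeSize() > 0
            && script.childNode(0) instanceof DataNode;
    }

    public static void main(String[] args) {
        System.out.println("optgroup siblings: " + optgroupSiblings());
        System.out.println("script body is a DataNode: " + scriptIsDataNode());
    }
}
```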

My sincere thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch with me directly.

You can also follow me (@jhy) on Twitter to receive occasional notes about jsoup releases.