jsoup Java HTML Parser release 1.15.2
2022-Jul-04
jsoup 1.15.2 is out now, and includes a new ability to track the original input source position through to parsed nodes, a number of bug fixes, other improvements, and performance enhancements.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Download jsoup now.
Improvements
- Improvement: added the ability to track the position (line, column, index) in the original input source from where a given node was parsed. Accessible via Node.sourceRange() and Element.endSourceRange(). #1790
- Improvement: added Element.firstElementChild(), Element.lastElementChild(), Node.firstChild(), and Node.lastChild() as convenient accessors to those child nodes and elements.
- Improvement: added Element.expectFirst(), which is just like Element.selectFirst(), but instead of returning null if there is no match, it throws an IllegalArgumentException. This is useful if you want to simply abort processing when an expected match is not found, such as in test cases.
- Improvement: when pretty-printing HTML, doctypes are emitted on a new line if there is a preceding comment. #1664
- Improvement: when pretty-printing, trim the leading and trailing spaces of text nodes in block tags when possible, so that they are indented correctly. #1798
- Improvement: in Element.selectXpath(), disable namespace awareness. This makes it possible to always select elements by their simple local name, regardless of whether an xmlns attribute was set. #1801
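As a rough illustration of the new accessors listed above, here is a minimal sketch. It assumes jsoup 1.15.2 or later is on the classpath. The class and method names (NewAccessorsSketch, parseTracked, startLine, expectFirstThrows) are made up for this example. Position tracking is enabled here via Parser.setTrackPosition(true), which is not described in the notes above, so treat that line as an assumption and check the API docs.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Range;
import org.jsoup.parser.Parser;

public class NewAccessorsSketch {

    // Parse with position tracking switched on.
    // Assumption: tracking is opt-in and enabled on the Parser instance.
    static Document parseTracked(String html) {
        Parser parser = Parser.htmlParser().setTrackPosition(true);
        return Jsoup.parse(html, "", parser);
    }

    // Line number in the original input where the first match of cssQuery starts.
    static int startLine(String html, String cssQuery) {
        Element el = parseTracked(html).expectFirst(cssQuery);
        Range range = el.sourceRange();
        return range.start().lineNumber();
    }

    // True if expectFirst() aborts with an IllegalArgumentException,
    // which is what happens when nothing matches the query.
    static boolean expectFirstThrows(String html, String cssQuery) {
        try {
            Jsoup.parse(html).expectFirst(cssQuery);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String html = "<div>\n  <p>Hello</p>\n  <p>World</p>\n</div>";
        Document doc = parseTracked(html);

        // Source positions of the start tag and the end tag of the first <p>.
        Element firstP = doc.expectFirst("p");
        System.out.println("start tag: " + firstP.sourceRange());
        System.out.println("end tag:   " + firstP.endSourceRange());

        // Child accessors: element-only versus any node type.
        Element div = doc.expectFirst("div");
        System.out.println(div.firstElementChild().text());
        System.out.println(div.lastElementChild().text());
        System.out.println(div.firstChild().nodeName()); // the whitespace text node before the first <p>

        // selectFirst() returns null on no match; expectFirst() throws instead.
        System.out.println(doc.selectFirst("table") == null);
        System.out.println(expectFirstThrows(html, "table"));
    }
}
```

In a test, expectFirst() lets a missing element fail fast at the point of the lookup, rather than surfacing later as a NullPointerException.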
Bug Fixes
- Bugfix: when using the DataUtil.readToByteBuffer() method, such as in Connection.Response.body(), if the document has not already been parsed and must be read fully, and there is any maximum buffer size being applied, only the default internal buffer size was read. #1774
- Bugfix: when serializing HTML, newlines in elements descending from a pre tag were incorrectly skipped. That caused what should have been preformatted output to instead be a run of text. #1776
- Bugfix: when pretty-print serializing HTML, newlines separating phrasing content (e.g. a <span> tag within a <p> tag) would be incorrectly skipped, instead of normalized to a space. Additionally, improved space normalization between other end of line occurrences, and whitespace handling after a closing </body> tag. #1787
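To see the serialization fixes above in practice, a small sketch like the following can be run against 1.15.2 and an earlier release to compare output. The class and method names (PreSerializationSketch, bodyHtml) are hypothetical. The exact pretty-printed layout may vary between versions, so no expected output is shown.

```java
import org.jsoup.Jsoup;

public class PreSerializationSketch {

    // Parse the fragment as HTML and serialize the body's contents back out.
    static String bodyHtml(String html) {
        return Jsoup.parse(html).body().html();
    }

    public static void main(String[] args) {
        // Newline inside an element descending from <pre>: should be kept as-is.
        System.out.println(bodyHtml("<pre><code>line one\nline two</code></pre>"));

        // Newline between phrasing content inside <p>: should be normalized to a space.
        System.out.println(bodyHtml("<p>Hello\n<span>there</span></p>"));
    }
}
```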
My sincere thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch with me directly.
You can also follow me (@jhy) on Twitter to receive occasional notes about jsoup releases.