jsoup Java HTML Parser release 1.15.2
2022-Jul-04
jsoup 1.15.2 is out now, and includes a new ability to track the original input source position through to parsed nodes, a number of bug fixes, other improvements, and performance enhancements.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Download jsoup now.
Improvements
- Improvement: added the ability to track the position (line, column, index) in the original input source from where a given node was parsed. Accessible via Node.sourceRange() and Element.endSourceRange(). #1790
- Improvement: added Element.firstElementChild(), Element.lastElementChild(), Node.firstChild(), and Node.lastChild() as convenient accessors to those child nodes and elements.
- Improvement: added Element.expectFirst(), which is just like Element.selectFirst(), but instead of returning null if there is no match, it throws an IllegalArgumentException. This is useful if you want to simply abort processing when an expected match is not found, such as in test cases.
- Improvement: when pretty-printing HTML, doctypes are emitted on a new line if there is a preceding comment. #1664
- Improvement: when pretty-printing, trim the leading and trailing spaces of text nodes in block tags when possible, so that they are indented correctly. #1798
- Improvement: in Element.selectXpath(), disable namespace awareness. This makes it possible to always select elements by their simple local name, regardless of whether an xmlns attribute was set. #1801
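
As a rough illustration of the new accessors listed above, here is a minimal sketch. The class name, helper methods, and sample HTML are hypothetical. It also assumes that position tracking has to be switched on with Parser.setTrackPosition(true) before parsing, which these notes do not state.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.Range;
import org.jsoup.parser.Parser;

public class NewAccessorsSketch {
    // Assumption: tracking is off by default, so enable it on the parser first.
    static Document parseTracked(String html) {
        Parser parser = Parser.htmlParser().setTrackPosition(true);
        return Jsoup.parse(html, "", parser);
    }

    // Line number in the original input where the first match's start tag begins.
    static int lineOf(String html, String cssQuery) {
        Element el = parseTracked(html).expectFirst(cssQuery);
        Range range = el.sourceRange();
        return range.start().lineNumber();
    }

    // Text of the first child *element* of the first div, skipping text nodes.
    static String firstElementChildText(String html) {
        Element div = Jsoup.parse(html).expectFirst("div");
        Element first = div.firstElementChild();
        return first == null ? null : first.text();
    }

    // expectFirst() throws instead of returning null when nothing matches.
    static boolean hasMatch(String html, String cssQuery) {
        try {
            Jsoup.parse(html).expectFirst(cssQuery);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String html = "<div>\n  <p>One</p>\n  <p>Two</p>\n</div>";

        Element div = parseTracked(html).expectFirst("div");
        System.out.println("div start tag: " + div.sourceRange());
        System.out.println("div end tag:   " + div.endSourceRange());

        // firstChild() may be a whitespace text node; firstElementChild() is the <p>.
        Node firstNode = div.firstChild();
        System.out.println("first child node type: " + firstNode.nodeName());
        System.out.println("first element child:   " + firstElementChildText(html));
        System.out.println("last element child:    " + div.lastElementChild().text());

        System.out.println("line of first <p>: " + lineOf(html, "p"));
        System.out.println("has <table>? " + hasMatch(html, "table"));
    }
}
```

Element.selectFirst() remains the better fit when a missing match is an ordinary case to handle. Element.expectFirst() suits tests and scripts where a missing match should stop processing.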
Bug Fixes
- Bugfix: when using the DataUtil.readToByteBuffer() method, such as in Connection.Response.body(), if the document has not already been parsed and must be read fully, and there is any maximum buffer size being applied, only the default internal buffer size was read. #1774
- Bugfix: when serializing HTML, newlines in elements descending from a pre tag were incorrectly skipped. That caused what should have been preformatted output to instead be a run of text. #1776
- Bugfix: when pretty-print serializing HTML, newlines separating phrasing content (e.g. a <span> tag within a <p> tag) would be incorrectly skipped, instead of normalized to a space. Additionally, improved space normalization between other end-of-line occurrences, and whitespace handling after a closing </body>. #1787
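
To see the pre serialization fix (#1776) in action, a small check along these lines may help. It is only a sketch: the class name, helper method, and sample markup are made up for illustration. It parses a code element nested inside a pre element and confirms that the newline between the two lines survives serialization.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PreNewlineSketch {
    // Parse the input and serialize the body back to HTML (pretty-printing is on by default).
    static String bodyHtml(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().html();
    }

    public static void main(String[] args) {
        // The text lives in <code>, which descends from <pre>, the case the fix covers.
        String html = "<pre><code>line one\nline two</code></pre>";
        String out = bodyHtml(html);

        System.out.println(out);
        System.out.println("newline preserved: " + out.contains("line one\nline two"));
    }
}
```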
My sincere thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch with me directly.
You can also follow me (@jhy) on Twitter to receive occasional notes about jsoup releases.