jsoup Java HTML Parser release 1.15.2


jsoup 1.15.2 is out now. It adds the ability to track each parsed node's position in the original input source, along with a number of bug fixes, performance enhancements, and other improvements.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.

Improvements

  • Improvement: added the ability to track the position (line, column, index) in the original input source from where a given node was parsed. Accessible via Node.sourceRange() and Element.endSourceRange(). #1790
  • Improvement: added Element.firstElementChild(), Element.lastElementChild(), Node.firstChild(), Node.lastChild(), as convenient accessors to those child nodes and elements.
  • Improvement: added Element.expectFirst(), which is just like Element.selectFirst(), but throws an IllegalArgumentException instead of returning null when there is no match. This is useful if you want to simply abort processing when an expected match is not found, such as in test cases.
  • Improvement: when pretty-printing HTML, doctypes are emitted on a newline if there is a preceding comment. #1664
  • Improvement: when pretty-printing, trim the leading and trailing spaces of textnodes in block tags when possible, so that they are indented correctly. #1798
  • Improvement: in Element.selectXpath(), disable namespace awareness. This makes it possible to always select elements by their simple local name, regardless of whether an xmlns attribute was set. #1801
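
As a rough illustration of the new accessors listed above, here is a minimal sketch. It assumes position tracking has been switched on with Parser.setTrackPosition(true), since tracking is off by default. The class name NewAccessors, the helper methods, and the sample HTML are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Range;
import org.jsoup.parser.Parser;

public class NewAccessors {
    // Position tracking is off by default, so enable it on the parser first.
    static Document parseTracked(String html) {
        Parser parser = Parser.htmlParser().setTrackPosition(true);
        return Jsoup.parse(html, "", parser);
    }

    // Formats the start of a range as "line:column" (hypothetical helper).
    static String lineAndColumn(Range range) {
        Range.Position start = range.start();
        return start.lineNumber() + ":" + start.columnNumber();
    }

    public static void main(String[] args) {
        String html = "<div>\n  <p id=one>Hello</p>\n  <p id=two>There</p>\n</div>";
        Document doc = parseTracked(html);

        // expectFirst() throws IllegalArgumentException instead of returning null.
        Element div = doc.expectFirst("div");

        // These skip over the whitespace text nodes between the tags.
        Element first = div.firstElementChild();
        Element last = div.lastElementChild();

        // sourceRange() covers the start tag, endSourceRange() the end tag.
        System.out.println(first.id() + " start tag at " + lineAndColumn(first.sourceRange()));
        System.out.println(first.id() + " end tag at " + lineAndColumn(first.endSourceRange()));
        System.out.println("last element child: " + last.id());

        try {
            doc.expectFirst("table"); // the sample has no table
        } catch (IllegalArgumentException e) {
            System.out.println("expected match not found: " + e.getMessage());
        }
    }
}
```

Range.isTracked() reports whether a node actually carries position data, which is worth checking when the document may have been parsed without tracking enabled.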

Bug Fixes

  • Bugfix: when using the DataUtil.readToByteBuffer() method, such as in Connection.Response.body(), if the document had not already been parsed and had to be read fully, and a maximum buffer size was being applied, only the default internal buffer size was read. #1774
  • Bugfix: when serializing HTML, newlines in elements descending from a pre tag were incorrectly skipped. That caused what should have been preformatted output to instead be a run of text. #1776
  • Bugfix: when pretty-print serializing HTML, newlines separating phrasing content (e.g. a <span> tag within a <p> tag) would be incorrectly skipped, instead of normalized to a space. Additionally, improved space normalization between other end of line occurrences, and whitespace handling after a closing </body>. #1787
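
One way to check the serialization fixes above on your own input is to round-trip a small fragment and look at the output. This is only a sketch: the class name PreformattedOutput, the bodyHtml helper, and the sample fragments are invented for illustration, and the exact pretty-printed output may differ between jsoup versions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PreformattedOutput {
    // Parses the input and returns the pretty-printed HTML of the body (hypothetical helper).
    static String bodyHtml(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().html();
    }

    public static void main(String[] args) {
        // The newline inside an element descending from <pre> should survive serialization.
        System.out.println(bodyHtml("<pre><code>first line\nsecond line</code></pre>"));

        // The newline before the <span> should be normalized to a space, not dropped.
        System.out.println(bodyHtml("<p>One\n<span>Two</span></p>"));
    }
}
```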

My sincere thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch with me directly.

