jsoup Java HTML Parser release 1.12.2

2020-Feb-08

jsoup 1.12.2 is out now, with a great set of improvements to connections, W3C interoperability, speed, and many bug fixes.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

Download jsoup now.

Improvements

  • Improvement: the :has() selector now supports relative selectors. For example, the query div:has(> a) will select all div elements that have at least one direct child a element.
  • Improvement: added Element chaining methods for various overridden methods on Node.
  • Improvement: ensure HTTP keepalives work when fetching content via Connection.Response.body() and Connection.Response.bodyAsBytes().
  • Improvement: set the default max body size in Connection to 2MB (up from 1MB) so fewer people get trimmed content if they have not set it, but still in sensible bounds. Also updated the default user-agent to improve default compatibility.
  • Improvement: dramatic speed improvement when bulk inserting child nodes into an element (wrapping contents).
  • Improvement: added Element.childrenSize() as a convenience to get the size of an element's element children.
  • Improvement: in W3CDom.asString(Document, Map<String, String>), allow the output mode to be specified as HTML or as XML. By default, it inspects the content and selects the mode automatically.
  • Improvement: added a Document.documentType() method, to get a doc's doctype.
  • Improvement: To DocumentType, added #name(), #publicId(), and #systemId() methods to fetch those fields.
  • Improvement: in W3CDom conversions from jsoup documents, retain the DocumentType, and be able to serialize it.
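
As a rough illustration of a few of the improvements above, here is a minimal sketch that exercises the relative :has() selector, Element.childrenSize(), and Document.documentType(). The class name, the sample markup, and the helper method divsWithDirectLink are made up for this example; it assumes jsoup 1.12.2 or later is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.DocumentType;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ImprovementsSketch {
    // Hypothetical sample markup, invented for this illustration.
    static final String HTML =
        "<!DOCTYPE html><html><head><title>t</title></head><body>"
        + "<div id=\"direct\"><a href=\"/one\">One</a></div>"
        + "<div id=\"nested\"><p><a href=\"/two\">Two</a></p></div>"
        + "</body></html>";

    // Relative selector inside :has(): only divs with an <a> as a direct child.
    static Elements divsWithDirectLink(Document doc) {
        return doc.select("div:has(> a)");
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);

        // The "nested" div is skipped because its <a> sits inside a <p>.
        Elements matches = divsWithDirectLink(doc);
        System.out.println(matches.size() + " match, id=" + matches.first().id());

        // childrenSize() counts element children only, not text nodes.
        Element body = doc.body();
        System.out.println("body element children: " + body.childrenSize());

        // documentType() can return null when the input has no doctype.
        DocumentType doctype = doc.documentType();
        if (doctype != null) {
            System.out.println("doctype name: " + doctype.name());
        }
    }
}
```

With the sample markup, only the div whose a element is a direct child should be selected; the div that wraps its link in a p element should not.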

Bug Fixes

  • Bugfix: on pages fetched by Jsoup.Connection, a Mark Invalid exception might be incorrectly thrown, or the page may miss some data. This occurred on larger pages when the file transfer was chunked, and an invalid HTML entity happened to cross a chunk boundary.
  • Bugfix: if an element has duplicate attributes, retain the first rather than the last attribute with the same name. The comparison is case-aware (HTML names are case-insensitive, XML names are case-sensitive).
  • Bugfix: don't submit input type=button form elements.
  • Bugfix: handle error position reporting correctly and don't blow up in some edge cases.
  • Bugfix: handle the ^= (starts with) selector correctly when the prefix starts with a space.
  • Bugfix: don't strip out zero-width-joiners (or zero-width-non-joiners) when normalizing text. That breaks combined emoji (and other text semantics). 🤦‍♂️
  • Bugfix: Evaluator.TagEndsWith (namespaced elements) and Tag disagreed in case-sensitivity. Now correctly matches case-insensitively.
  • Bugfix: Don't throw an exception if a selector ends in a space, just trim it.
  • Bugfix: the HTML parser added redundant text when parsing a self-closing textarea.
  • Bugfix: Don't add spurious whitespace or newlines to HTML or text for inline tags.
  • Bugfix: TextNode.outerHtml() wouldn't normalize correctly without a parent.
  • Bugfix: Removed binary input detection as it was causing too many false positives.
  • Bugfix: when cloning a TextNode, if .attributes() was hit before the clone() method, the text value would only be a shallow clone.
  • Various code hygiene updates.
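
Two of the fixes above are easy to check from user code. The following is a small sketch, not a test from the jsoup project: the class name, the sample markup, and the two helper methods are hypothetical, and it assumes jsoup 1.12.2 or later is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class BugFixesSketch {
    // Hypothetical markup with the same attribute written twice.
    static final String HTML = "<p class=\"first\" class=\"second\">Hello</p>";

    // With duplicate attributes, the first occurrence is the one retained.
    static String retainedClass(String html) {
        Element p = Jsoup.parse(html).selectFirst("p");
        return p.attr("class");
    }

    // A selector that ends in a space is trimmed instead of throwing.
    static int countWithTrailingSpace(String html) {
        return Jsoup.parse(html).select("p ").size();
    }

    public static void main(String[] args) {
        System.out.println("retained class: " + retainedClass(HTML));
        System.out.println("matches for \"p \": " + countWithTrailingSpace(HTML));
    }
}
```

On releases before 1.12.2, the first helper would be expected to return the last value instead, and the second to throw.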

My sincere thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch via the mailing list or contact me directly.