jsoup Java HTML Parser release 1.14.1

2021-Jul-10

jsoup 1.14.1 is out now, with simple request session management, increased parse robustness, and a ton of other improvements, speed-ups, and bug fixes.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.

Please note the changes indicated below as in some circumstances you may need to modify your build or codebase to upgrade.

Download jsoup now.

Changes

Change: updated the minimum supported Java version from Java 7 to Java 8.
Change: updated the minimum Android API level from 8 to 10.
Change: although Node.childNodes() returns an UnmodifiableList as a view into its children, it was still directly backed by the internal child list. That made some uses, such as looping and moving those children to another element, throw a ConcurrentModificationException. Now this method returns its own list so that they are separated and changes to the parent's contents will not impact the children view. This aligns with similar methods such as Element.children(). If you have code that iterates this list and makes parenting changes to its contents, you may need to make a code update. #1431
Change: the org.jsoup.Connection interface has been modified to introduce new methods for sessions and the cookie store. If you have a custom implementation of this interface, you will need to add implementations of these methods.
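
To illustrate the Node.childNodes() change, here is a minimal sketch of the "loop and move" pattern described above. The class name, method name, and sample HTML are made up for the example; it assumes jsoup 1.14.1 or later is on the classpath, where the loop should complete because the returned list is no longer backed by the parent's internal child list.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

import java.util.List;

public class ChildNodesMove {
    /** Moves every child node of 'from' into 'to' and returns how many were moved. */
    static int moveChildren(Element from, Element to) {
        // childNodes() hands back its own list, so reparenting inside the loop
        // does not disturb the iteration (assumes jsoup 1.14.1+)
        List<Node> children = from.childNodes();
        for (Node child : children) {
            to.appendChild(child); // detaches the node from 'from', then appends it to 'to'
        }
        return children.size();
    }

    public static void main(String[] args) {
        // Hypothetical sample markup
        Document doc = Jsoup.parse("<div id=a><p>One</p><p>Two</p></div><div id=b></div>");
        Element a = doc.getElementById("a");
        Element b = doc.getElementById("b");
        int moved = moveChildren(a, b);
        System.out.println(moved + " moved; a has " + a.childNodeSize() + ", b has " + b.childNodeSize());
    }
}
```

If your code relied on the old live-view behaviour (seeing later changes to the parent through a previously fetched list), it will need to call childNodes() again after modifying the parent.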

Improvements

Improvement: added HTTP request session management support with Jsoup.newSession(). This extends the Connection implementation to support (optional) sessions, which allow request defaults (timeout, proxy, etc) to be set once and then applied to all requests within that session.

Cookies are re-implemented to correctly support path and domain filtering when used within a session. A default in-memory cookie store is used for the session, or a custom implementation (perhaps disk-persistent, or pre-set) can be used instead.

Forms submitted using FormElement.submit() use the same session that was used to fetch the document, and so pass cookies and other defaults appropriately.

The session is multi-thread safe and can execute multiple requests concurrently. If the user accidentally tries to execute the same request object across multiple threads (vs calling Connection.newRequest()), that is detected cleanly and a clear exception is thrown (vs weird blowups in input stream reading, or forcing everything through a synchronized bottleneck). #1476
Improvement: renamed the Whitelist class to Safelist, with the goal of more inclusive language. A shim is provided for backwards compatibility (source and binary). This shim is marked as deprecated and will be removed in the jsoup 1.15.1 release. #1464
Improvement: added support for Internationalized Domain Names (IDNs) in Jsoup.connect. #1300
Improvement: added support for loading and parsing gzipped HTML files in Jsoup.parse(File in, charset, baseUri).
Improvement: reduced thread contention in HttpConnection and Document. #1455
Improvement: better parsing performance when under high thread concurrency. #1402
Improvement: added Element.id(String) ID attribute setter.
Improvement: in Document, #body() and #head() accessors will now automatically create those elements, if they were missing (e.g. if the Document was not parsed from HTML). Additionally, the #body() method returns the frameset element (instead of null) for frameset documents.
Improvement: when cleaning a document, the output settings of the original document are cloned into the cleaned document. #1417
Improvement: when parsing XML, disable pretty-printing by default. #1168
Improvement: much better performance in Node.clone() for large and deeply nested documents. Complexity was O(n^2) or worse, now O(n).
Improvement: during traversal using the NodeTraversor, nodes may now be replaced with Node.replaceWith(Node). #1289
Improvement: added Element.appendChildren and Element.prependChildren, as convenience methods in addition to Element.insertChildren(index, children), for bulk moving nodes.
Improvement: clean up relative URLs with too many .. segments better. #1482
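
As a rough sketch of the session support described above: defaults are set once on the session, and each call to newRequest() yields a fresh request that carries those defaults. The user agent string, header value, and example.com URLs are placeholders invented for this example, and nothing is fetched here; the code only builds the requests and inspects their settings.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class SessionSketch {
    /** Builds a session whose defaults apply to every request created from it. */
    static Connection newConfiguredSession() {
        return Jsoup.newSession()
            .timeout(20_000)                    // milliseconds
            .userAgent("ExampleBot/1.0")        // placeholder user agent
            .header("Accept-Language", "en");   // placeholder default header
    }

    public static void main(String[] args) {
        Connection session = newConfiguredSession();

        // One request object per fetch; do not share a single request across threads
        Connection first = session.newRequest().url("https://example.com/one");
        Connection second = session.newRequest().url("https://example.com/two");

        System.out.println(first.request().timeout());
        System.out.println(second.request().header("User-Agent"));

        // Calling first.get() or second.execute() would perform the fetches,
        // using the session's cookie store and defaults.
    }
}
```

Keeping one session per logical client and creating a new request per fetch matches the threading advice above, since the same request object is never executed from two threads.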
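
A few of the smaller DOM conveniences above can be sketched together: the Element.id(String) setter, Document.body() creating missing elements, and the bulk-move helpers. The class and method names and the sample markup are invented for the example, and I am assuming the end-of-list counterpart to prependChildren is named appendChildren, as in the current jsoup API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DomConveniences {
    /** Starts from a Document that was not parsed from HTML and lets body() create what is missing. */
    static Document emptyDocWithBody() {
        Document doc = new Document("");    // no html, head, or body elements yet
        doc.body()                          // creates html and body on demand
           .appendElement("p")
           .id("intro")                     // the ID attribute setter
           .text("Hello");
        return doc;
    }

    /** Bulk-moves all child elements of 'from' into 'to', at the start or at the end. */
    static void moveChildren(Element from, Element to, boolean atStart) {
        if (atStart) {
            to.prependChildren(from.children());
        } else {
            to.appendChildren(from.children()); // assumed name of the append variant
        }
    }

    public static void main(String[] args) {
        System.out.println(emptyDocWithBody().body().html());

        // Hypothetical sample markup
        Document lists = Jsoup.parse("<ul id=a><li>1</li><li>2</li></ul><ul id=b><li>3</li></ul>");
        moveChildren(lists.getElementById("a"), lists.getElementById("b"), true);
        System.out.println(lists.getElementById("b").children().size());
    }
}
```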

Build Improvements

Build Improvement: integrated jsoup into the OSS Fuzz project, which semi-randomly generates millions of different HTML and XML input files, searching for areas to improve in the parser for increased robustness and throughput. #1502
Build Improvement: integrated with GitHub's CodeQL static code analyzer. #1494
Build Improvement: moved to GitHub Workflows for build verification.
Build Improvement: updated Jetty (used for integration tests; not bundled) to 9.4.42.
Build Improvement: added nullability annotations and initial settings. #1467

Bug Fixes

Bugfix: corrected the adoption agency algorithm, to handle cases where e.g. an a tag incorrectly nests further a tags. #1517 #845
Bugfix: when parsing HTML, could throw NPEs on some tags (isindex or table>input). #1404
Bugfix: in HttpConnection.Request, headers beginning with "sec-" (e.g. Sec-Fetch-Mode) were silently discarded by the underlying Java HttpURLConnection. These are now settable correctly. #1461
Bugfix: when adding child Nodes to a Node, could incorrectly reparent all nodes if the first parent had the same length of children as the incoming node list.
Bugfix: when wrapping an orphaned element, would throw an NPE.
Bugfix: when wrapping an element with HTML that included multiple sibling elements, those siblings were incorrectly added as children of the wrapper instead of siblings.
Bugfix: when setting the content of a script or style tag via the Element#html(String) method, the content is now treated as a DataNode, not a TextNode. This means that characters like '<' will no longer be incorrectly escaped. As a related ergonomic improvement, the same behavior applies for Element#text(String) (i.e. the content will be treated as a DataNode, despite calling the text() method). #1419
Bugfix: when wrapping HTML around an existing element with Element#wrap(String), will now take the content as provided and ignore normal HTML tree-building rules. This allows for e.g. a div tag to be placed inside of p tags.
Bugfix: the Elements#forms() method should return the selected immediate elements that are Forms, not children. #1403
Bugfix: when creating a selector for an element with Element#cssSelector, if the element used a non-unique ID attribute, the returned selector may not match the desired element. #1085
Bugfix: corrected the toString() methods of the Evaluator classes.
Bugfix: when converting a jsoup document to a W3C document (in W3CDom.convert()), if a tag had XML illegal characters, a DOMException would be thrown. Now instead, that tag is represented as a text node. #1093
Bugfix: if an HTML file ended with an open noscript tag, an "EOF" string would appear in the HTML output.
Bugfix: when parsing a document as XML, automatically set the output syntax to XML, and ensure that < characters in attributes are escaped as &lt; (which is not required in HTML as the quoted attribute contents are safe, but is required in XML). #1420
Bugfix: [Fuzz] when parsing an attribute key containing abs:abs, a validation error would be incorrectly thrown. #1541
Bugfix: [Fuzz] could NPE while parsing in resetInsertionMode. #1538
Bugfix: [Fuzz] when parsing XML, could Stack Overflow when parsing XML declarations. #1539
Bugfix: [Fuzz] fixed a potential Stack Overflow when parsing mis-nested tfoot tags, and updated the tree parser for this situation to match the updated HTML5 spec. #1543
Bugfix: [Fuzz] fixed a potentially slow HTML parse when tags are nested extremely deep (e.g. 88K depth), by limiting the formatting tag search depth to 256. In practice, it's generally between 4 and 8. #1544
Bugfix: [Fuzz] when parsing an unterminated RCDATA token (e.g. a title tag), could throw an IO Exception "No buffer left to unconsume" when trying to rewind the buffer. #1542
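
For the script and style content fix (#1419) above, a small sketch of the expected behaviour: content set via Element#html(String) on a script element should be held as data and come back unescaped. The class name, method name, and the JavaScript snippet are made up for the example, and it assumes jsoup 1.14.1 or later.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ScriptContent {
    /** Sets the given code as the content of an empty script element and returns that element. */
    static Element scriptWith(String code) {
        Document doc = Jsoup.parse("<html><head><script></script></head><body></body></html>");
        Element script = doc.selectFirst("script");
        script.html(code); // per the fix above, stored as a DataNode rather than a TextNode
        return script;
    }

    public static void main(String[] args) {
        Element script = scriptWith("if (a < b) run();"); // hypothetical script body
        System.out.println(script.data());      // the raw data content
        System.out.println(script.outerHtml()); // the '<' should not appear as an entity
    }
}
```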

My sincere thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch with me directly.

You can also follow me (@jhy) on Twitter to receive occasional notes about jsoup releases.