jsoup Java HTML Parser release 1.14.1
jsoup 1.14.1 is out now, with simple request session management, increased parse robustness, and a ton of other improvements, speed-ups, and bug fixes.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Please note the changes indicated below as in some circumstances you may need to modify your build or codebase to upgrade.
Download jsoup now.
- Change: updated the minimum supported Java version from Java 7 to Java 8.
- Change: updated the minimum Android API level from 8 to 10.
- Change: although Node.childNodes() returns an UnmodifiableList as a view into its children, it was still directly backed by the internal child list. That made some uses, such as looping and moving those children to another element, throw a ConcurrentModificationException. Now this method returns its own list so that they are separated and changes to the parent's contents will not impact the children view. This aligns with similar methods such as Element.children(). If you have code that iterates this list and makes parenting changes to its contents, you may need to make a code update.
- Change: the org.jsoup.Connection interface has been modified to introduce new methods for sessions and the cookie store. If you have a custom implementation of this interface, you will need to add implementations of these methods.
- Improvement: added HTTP request session management support with Jsoup.newSession(). This extends the Connection implementation to support (optional) sessions, which allow request defaults (timeout, proxy, etc.) to be set once and then applied to all requests within that session.
Cookies are re-implemented to correctly support path and domain filtering when used within a session. A default in-memory cookie store is used for the session, or a custom implementation (perhaps disk-persistent, or pre-set) can be used instead.
Forms submitted using FormElement.submit() use the same session that was used to fetch the document and so pass cookies and other defaults appropriately.
The session is multi-thread safe and can execute multiple requests concurrently. If the user accidentally tries to execute the same request object across multiple threads (vs calling Connection.newRequest()), that is detected cleanly and a clear exception is thrown (vs weird blowups in input stream reading, or forcing everything through a synchronized bottleneck).
- Improvement: renamed the Whitelist class to Safelist, with the goal of more inclusive language. A shim is provided for backwards compatibility (source and binary). This shim is marked as deprecated and will be removed in the jsoup 1.15.1 release.
- Improvement: added support for Internationalized Domain Names (IDNs) in Jsoup.connect().
- Improvement: added support for loading and parsing gzipped HTML files in Jsoup.parse(File in, charset, baseUri).
- Improvement: reduced thread contention in
- Improvement: better parsing performance when under high thread concurrency
- Improvement: added Element.id(String) ID attribute setter.
- Improvement: in Document, #body() and #head() accessors will now automatically create those elements, if they were missing (e.g. if the Document was not parsed from HTML). Additionally, the #body() method returns the frameset element (instead of null) for frameset documents.
- Improvement: when cleaning a document, the output settings of the original document are cloned into the cleaned document.
- Improvement: when parsing XML, disable pretty-printing by default.
- Improvement: much better performance in Node.clone() for large and deeply nested documents. Complexity was O(n^2) or worse, now O(n).
- Improvement: during traversal using the NodeTraversor, nodes may now be replaced with
- Improvement: added Element.prependChildren, as convenience methods in addition to Element.insertChildren(index, children), for bulk moving nodes.
- Improvement: clean up relative URLs with too many
- Build Improvement: integrated jsoup into the OSS Fuzz project, which semi-randomly generates millions of different HTML and XML input files, searching for areas to improve in the parser for increased robustness and throughput.
- Build Improvement: integrated with GitHub's CodeQL static code analyzer.
- Build Improvement: moved to GitHub Workflows for build verification.
- Build Improvement: updated Jetty (used for integration tests; not bundled) to 9.4.42.
- Build Improvement: added nullability annotations and initial settings.
- Bugfix: corrected the adoption agency algorithm, to handle cases where e.g. an a tag incorrectly nests further
- Bugfix: when parsing HTML, could throw NPEs on some tags (isindex or table>input).
- Bugfix: in HttpConnection.Request, headers beginning with "sec-" (e.g. Sec-Fetch-Mode) were silently discarded by the underlying Java HttpURLConnection. These are now settable correctly.
- Bugfix: when adding child Nodes to a Node, could incorrectly reparent all nodes if the first parent had the same length of children as the incoming node list.
- Bugfix: when wrapping an orphaned element, would throw an NPE.
- Bugfix: when wrapping an element with HTML that included multiple sibling elements, those siblings were incorrectly added as children of the wrapper instead of siblings.
- Bugfix: when setting the content of a script or style tag via the Element#html(String) method, the content is now treated as a DataNode, not a TextNode. This means that characters like '<' will no longer be incorrectly escaped. As a related ergonomic improvement, the same behavior applies for Element#text(String) (i.e. the content will be treated as a DataNode, despite calling the text() method).
- Bugfix: when wrapping HTML around an existing element with Element#wrap(String), will now take the content as provided and ignore normal HTML tree-building rules. This allows for e.g. a div tag to be placed inside of p tags.
- Bugfix: the Elements#forms() method should return the selected immediate elements that are Forms, not children.
- Bugfix: when creating a selector for an element with Element#cssSelector, if the element used a non-unique ID attribute, the returned selector may not match the desired element.
- Bugfix: corrected the toString() methods of the Evaluator classes.
- Bugfix: when converting a jsoup document to a W3C document (in W3CDom.convert()), if a tag had XML illegal characters, a DOMException would be thrown. Now instead, that tag is represented as a text node.
- Bugfix: if an HTML file ended with an open noscript tag, an "EOF" string would appear in the HTML output.
- Bugfix: when parsing a document as XML, automatically set the output syntax to XML, and ensure that < characters in attributes are escaped as &lt; (which is not required in HTML as the quoted attribute contents are safe, but is required in XML).
- Bugfix: [Fuzz] when parsing an attribute key containing abs:abs, a validation error would be incorrectly thrown.
- Bugfix: [Fuzz] could NPE while parsing in
- Bugfix: [Fuzz] when parsing XML, could Stack Overflow when parsing XML declarations.
- Bugfix: [Fuzz] fixed a potential Stack Overflow when parsing mis-nested tfoot tags, and updated the tree parser for this situation to match the updated HTML5 spec.
- Bugfix: [Fuzz] fixed a potentially slow HTML parse when tags are nested extremely deep (e.g. 88K depth), by limiting the formatting tag search depth to 256. In practice, it's generally between 4 - 8.
- Bugfix: [Fuzz] when parsing an unterminated RCDATA token (e.g. a title tag), could throw an IO Exception "No buffer left to unconsume" when trying to rewind the buffer.
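
To make the Node.childNodes() change above more concrete, here is a minimal sketch using only java.util collections. The Parent class and its liveView(), snapshot(), and moveTo() methods are hypothetical stand-ins invented for this illustration; they are not jsoup classes and do not reflect jsoup's internal implementation. The sketch only shows the general Java behavior behind the change: iterating a read-only view that is still backed by the parent's list fails once children are moved, while iterating a detached copy does not.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.ConcurrentModificationException;
import java.util.List;

public class ChildListDemo {
    // Hypothetical stand-in for a parent node; NOT a jsoup class.
    static class Parent {
        private final List<String> children = new ArrayList<>();

        Parent(String... names) {
            Collections.addAll(children, names);
        }

        // Old style: read-only view, still backed by the internal list.
        List<String> liveView() {
            return Collections.unmodifiableList(children);
        }

        // New style: read-only snapshot, detached from the internal list.
        List<String> snapshot() {
            return Collections.unmodifiableList(new ArrayList<>(children));
        }

        // Re-parents one child from this parent to the target.
        void moveTo(Parent target, String child) {
            if (children.remove(child)) {
                target.children.add(child);
            }
        }

        int size() {
            return children.size();
        }
    }

    // Moves every child of 'from' into 'to' while iterating 'iterated'.
    // Returns true if the loop completed, false if it hit a
    // ConcurrentModificationException part-way through.
    static boolean moveAll(Parent from, Parent to, List<String> iterated) {
        try {
            for (String child : iterated) {
                from.moveTo(to, child);
            }
            return true;
        } catch (ConcurrentModificationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Parent a = new Parent("p", "div", "span");
        Parent b = new Parent();
        System.out.println("live view ok: " + moveAll(a, b, a.liveView()));

        Parent c = new Parent("p", "div", "span");
        Parent d = new Parent();
        System.out.println("snapshot ok: " + moveAll(c, d, c.snapshot()));
    }
}
```

If your own code relied on the returned list reflecting later changes to the parent, re-fetching the list after each modification is one way to adapt.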
My sincere thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch with me directly.