Brand new HTML5 parser: jsoup 1.6.0 released
2011-Jun-13
I am very happy to announce that jsoup 1.6.0 has been released and is now available for download.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
New HTML5 parser
This release of jsoup includes a completely re-implemented parser, based on the WHATWG HTML5 specification. jsoup now parses HTML exactly like modern browsers such as Chrome, Firefox, and Safari parse HTML. This helps users scrape data more readily, and improves HTML tidying.
As this is such a large change since the previous 1.5.2 release, jsoup is being released as version 1.6.0. The .0 denotes a beta release: if you run into problems parsing documents, please file a bug, particularly if parsing under 1.5.2 worked OK.
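To give a feel for what browser-style parsing means in practice, here is a minimal sketch that feeds the parser some markup with missing end tags and counts the elements it ends up with. The class name ParserSketch and the helper count are made up for illustration, and the inputs are just examples of the kind of error recovery the HTML5 specification describes (implied paragraph ends, an implied tbody).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParserSketch {
    // Hypothetical helper: parse the HTML and count elements matching a CSS query.
    static int count(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // A second <p> start tag implicitly closes the first paragraph.
        System.out.println(count("<p>One<p>Two", "p"));

        // A <tr> placed directly in a <table> gets an implied <tbody> parent.
        System.out.println(count("<table><tr><td>cell</table>", "tbody"));

        // Content with no <html> or <body> still ends up inside a body element.
        System.out.println(count("<div>text", "body > div"));
    }
}
```

If a document that parsed fine under 1.5.2 comes out with a different structure, a small check like this makes a handy reproduction to attach to a bug report.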
Other improvements and bug fixes
- When parsing files from disk, files are loaded via memory mapping, to increase parse speed.
- Reduced memory overhead and lowered garbage collector pressure with Attribute, Node and Element model optimisations.
- Improved abs: absolute URL handling in Elements.attr("abs:href") and Node.hasAttr("abs:href").
- Fixed cookie handling issue in jsoup.Connect where empty cookies would cause a validation exception.
- Added jsoup.Connect configuration options to allow HTTP errors to be ignored, and the content-type to be ignored.
- Added Node.before(Node) and Node.after(Node), to allow existing nodes to be moved, or new nodes to be inserted, into precise DOM positions.
- Added Node.unwrap() and Elements.unwrap(), to remove a node but keep its contents. Useful for e.g. removing unwanted formatting tags.
- Now handles unclosed <title> tags in a document by breaking out of the title at the next start tag, instead of eating up to the end of the document.
- Added OSGi bundle support to the jsoup package jar.

If you have any suggestions for the next release, I would love to hear them; please get in touch via the mailing list or contact me directly.
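For a few of the additions listed above (Node.before, Elements.unwrap, and the abs: attribute prefix), a rough sketch of how they might be used follows. The class name DomAdditionsSketch, the helper method names, the sample markup, and the example.com base URI are all invented for illustration; only the jsoup calls themselves come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DomAdditionsSketch {
    // Moves the element with id "b" so it sits directly before the element
    // with id "a", then returns the id of the first <p> in the document.
    static String moveBefore(String html) {
        Document doc = Jsoup.parse(html);
        Element a = doc.getElementById("a");
        Element b = doc.getElementById("b");
        a.before(b); // b is an existing node, so it is moved rather than copied
        return doc.select("p").first().id();
    }

    // Removes <b> and <i> elements but keeps their contents in place.
    static Document stripFormatting(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("b, i").unwrap();
        return doc;
    }

    // Resolves the first link's href against the base URI given at parse time.
    static String absoluteHref(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        return doc.select("a").attr("abs:href");
    }

    public static void main(String[] args) {
        System.out.println(moveBefore(
            "<div><p id=a>One</p><p id=b>Two</p></div>"));

        Document cleaned = stripFormatting(
            "<p>Some <b>bold</b> and <i>italic</i> text</p>");
        System.out.println(cleaned.body().text());

        System.out.println(absoluteHref(
            "<a href='/download'>Download</a>", "http://example.com/docs/"));
    }
}
```

Note that the abs: prefix only has something to resolve against when a base URI is known, which is why the sketch passes one to Jsoup.parse explicitly.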