Brand new HTML5 parser: jsoup 1.6.0 released
2011-Jun-13
I am very happy to announce that jsoup 1.6.0 has been released and is now available for download.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
New HTML5 parser
This release of jsoup includes a completely re-implemented parser, based on the WHATWG HTML5 specification. jsoup now parses HTML the same way modern browsers such as Chrome, Firefox, and Safari do. This helps users scrape data more readily, and improves HTML tidying.
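To illustrate what browser-style parsing means in practice, here is a minimal sketch that feeds the parser some sloppy markup: a title outside any head element, and paragraphs that are never closed. The class name `Html5ParseSketch` and its helper methods are made up for this example; `Jsoup.parse`, `Document.title()` and `select` are existing jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class, for illustration only.
final class Html5ParseSketch {

    /** Parses an HTML string into a Document using jsoup's parser. */
    static Document parse(String html) {
        return Jsoup.parse(html);
    }

    /** Counts the <p> elements found after parsing. */
    static int paragraphCount(String html) {
        return parse(html).select("p").size();
    }

    /** Returns the document title found after parsing. */
    static String titleOf(String html) {
        return parse(html).title();
    }
}
```

With the input `<title>Demo</title><p>One<p>Two`, `titleOf` should return `Demo` and `paragraphCount` should return 2, because the second `<p>` start tag implicitly closes the first paragraph, as it would in a browser.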
As this is such a large change since the previous 1.5.2 release, jsoup is being released as version 1.6.0. The .0 denotes a beta release: if you run into problems parsing documents, please file a bug, particularly if parsing under 1.5.2 worked OK.
Other improvements and bug fixes
- When parsing files from disk, files are loaded via memory mapping, to increase parse speed.
- Reduced memory overhead and lowered garbage collector pressure with
- Fixed cookie handling issue in jsoup.Connect where empty cookies would cause a validation exception.
- Added jsoup.Connect configuration options to allow HTTP errors to be ignored, and the content-type to be ignored.
- Added Node.after(Node), to allow existing nodes to be moved, or new nodes to be inserted, into precise DOM positions.
- Added Elements.unwrap(), to remove a node but keep its contents. Useful e.g. for removing unwanted formatting tags.
- Now handles unclosed <title> tags in a document by breaking out of the title at the next start tag, instead of eating up to the end of the document.
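For readers curious about the memory-mapping item in the list above, here is a rough sketch of the general technique using only the standard `java.nio` classes. It is not jsoup's actual loading code; the class `MappedFileReader` and both of its methods are hypothetical names used for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper class, for illustration only.
final class MappedFileReader {

    /** Maps the whole file read-only into memory and decodes it with the given charset. */
    static String readMapped(Path file, Charset charset) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            return charset.decode(buffer).toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: writes the text to a temp file, then reads it back through a mapping. */
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("mapped", ".html");
            tmp.toFile().deleteOnExit(); // clean up when the JVM exits
            Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
            return readMapped(tmp, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The decoded string could then be handed to a parser. Mapping avoids copying the file through an intermediate read buffer, which is one plausible source of the speed-up mentioned above.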
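The connection options and DOM methods mentioned in the list above can be sketched as follows. This is a sketch under assumptions rather than a definitive recipe: the class `ReleaseFeaturesSketch`, its method names, and the element ids are invented for the example, and I am assuming the connection options are the `ignoreHttpErrors` and `ignoreContentType` setters on the `Connection` returned by `Jsoup.connect`.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, for illustration only.
final class ReleaseFeaturesSketch {

    /** Builds (but does not execute) a request that tolerates HTTP error
     *  statuses and non-HTML content types. */
    static Connection lenientConnection(String url) {
        return Jsoup.connect(url)
                .ignoreHttpErrors(true)
                .ignoreContentType(true);
    }

    /** Moves the element with id movedId so it becomes the next sibling
     *  of the element with id anchorId. */
    static Document moveAfter(String html, String movedId, String anchorId) {
        Document doc = Jsoup.parse(html);
        Element moved = doc.getElementById(movedId);
        Element anchor = doc.getElementById(anchorId);
        anchor.after(moved); // an existing node is re-parented, not copied
        return doc;
    }

    /** Removes every <font> element but keeps its children in place. */
    static Document stripFontTags(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("font").unwrap();
        return doc;
    }
}
```

For example, calling `moveAfter` on `<div id=a><p id=first>A</p></div><div id=b><p id=second>B</p></div>` with `"first"` and `"second"` leaves the first div empty and the second div holding both paragraphs. Calling `stripFontTags` on `<p>Hello <font color=red>big <b>world</b></font></p>` drops the font tag while the paragraph text still reads `Hello big world`.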