jsoup HTML parser launches

2010-Jan-31 Today, I am announcing the public beta launch of jsoup, an open source Java HTML parser that I have been working on.

jsoup is a Java library for working with real-world HTML:

  • parse HTML from a URL, file, or string
  • find and extract data, using DOM traversal or CSS selectors
  • manipulate the HTML elements, attributes, and text
  • clean user-submitted content against a safe white-list
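
As a rough illustration of the four points above, here is a minimal sketch. It assumes the jsoup jar is on the classpath; the class name `JsoupSketch`, its helper methods, and the sample HTML are made up for this example. It also assumes a recent jsoup release, where the sanitizing list is called `Safelist` (older releases name the same class `Whitelist`).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

public class JsoupSketch {

    // Parse an HTML string (even unclosed tag soup) and return the text
    // of the first link found with a CSS selector, or "" if there is none.
    static String firstLinkText(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.text();
    }

    // Clean untrusted markup against jsoup's built-in basic list.
    // Assumption: Safelist is the name in current releases.
    static String sanitize(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    public static void main(String[] args) {
        // Hypothetical input: the <p> and <a> tags are never closed.
        String html = "<p>Unclosed paragraph <a href='http://example.com/'>a link";

        Document doc = Jsoup.parse(html);
        Element link = doc.select("a[href]").first();

        // Extract data from the selected element.
        System.out.println(link.text());
        System.out.println(link.attr("href"));

        // Manipulate an attribute and the text, then print the resulting body.
        link.attr("rel", "nofollow").text("edited link");
        System.out.println(doc.body().html());

        // Clean user-submitted content.
        System.out.println(sanitize("<b>hi</b><script>alert(1)</script>"));
    }
}
```

Parsing from a URL or a file follows the same pattern through `Jsoup.connect(url).get()` and `Jsoup.parse(file, charsetName)`; both are left out here so the sketch runs without network or disk access.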

jsoup is designed to deal with all varieties of HTML found in the wild: from pristine and validating to invalid tag soup, jsoup will create a sensible parse tree.

jsoup is an open source project distributed under the liberal MIT license. Source code is available on GitHub.

As of this initial launch, jsoup is immediately useful, and it is already in use in several internal projects. Of course, it can be made more useful, so please send me your suggestions and thoughts, either via the project's mailing list or to me directly.

If you would like to contribute code, that would also be welcome.

For more information, and to get started using jsoup, visit the project's website.

-- Jonathan Hedley 2010-Jan-31