jsoup HTML parser launches
2010-Jan-31 Today, I am announcing the public beta launch of
jsoup, an open source Java HTML parser that I have been working on.
jsoup is a Java library for working with real-world HTML:
- parse HTML from a URL, file, or string
- find and extract data, using DOM traversal or CSS selectors
- manipulate the HTML elements, attributes, and text
- clean user-submitted content against a safe white-list
jsoup is designed to deal with all varieties of HTML found in the wild: from pristine and validating to invalid tag soup, jsoup will create a sensible parse tree.
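As a rough illustration of the features listed above, here is a minimal sketch that parses a deliberately unclosed snippet, extracts data with a CSS selector, and manipulates attributes. It assumes a current jsoup release on the classpath, so the API in this beta may differ in detail. The class name JsoupSketch, the helper methods, and the sample HTML are hypothetical and are not taken from the project.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical sketch class; assumes a current jsoup release is available.
public class JsoupSketch {

    // Parse an HTML string and return the text of the first link,
    // or an empty string if the document contains no links.
    static String firstLinkText(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.text();
    }

    // Set rel="nofollow" on every link and return the resulting body HTML.
    static String addNoFollow(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a[href]").attr("rel", "nofollow");
        return doc.body().html();
    }

    public static void main(String[] args) {
        // Tag soup on purpose: the <p> and <a> elements are never closed.
        String html = "<p>Visit <a href='http://example.com/'>the <b>example</b> site";
        System.out.println(firstLinkText(html));
        System.out.println(addNoFollow(html));
    }
}
```

The input is left unclosed to show that the parser still builds a usable tree from it. Cleaning against a white-list is left out of the sketch.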
As of this initial launch, jsoup is immediately useful, and it is in use in several internal projects. But of course it can be made more useful, so please send me your suggestions and thoughts, either to the project's mailing list or to me directly.
If you would like to contribute code, that would also be welcome.
For more information, and to get started using jsoup, visit the project's website.
-- Jonathan Hedley 2010-Jan-31