Sanitize untrusted HTML (to prevent XSS)
You want to allow untrusted users to supply HTML for output on your website (e.g. in comment submissions). You need to clean this HTML to avoid cross-site scripting (XSS) attacks.
String unsafe = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
String safe = Jsoup.clean(unsafe, Safelist.basic());
// now: <p><a href="http://example.com/" rel="nofollow">Link</a></p>
A cross-site scripting attack against your site can really ruin your day, not to mention your users'. Many sites avoid XSS attacks by not allowing HTML in user-submitted content: they enforce plain text only, or use an alternative markup syntax like wiki-text or Markdown. These are seldom optimal solutions for the user, as they lower expressiveness and force the user to learn a new syntax.
The jsoup safelist sanitizer works by parsing the input HTML (in a safe, sandboxed environment), then iterating through the parse tree and allowing only known-safe tags and attributes (and values) through into the cleaned output.
It does not use regular expressions, which are inappropriate for this task.
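The parse-then-filter flow described above can be sketched with the Cleaner class, which takes a parsed Document and returns a new, cleaned Document. This is a minimal sketch, assuming jsoup 1.14.1 or later is on the classpath (earlier versions use a different class name for the safelist); the class name CleanerSketch and the helper sanitize are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist;

public class CleanerSketch {
    // Hypothetical helper: parse the untrusted markup, then copy only
    // safelisted tags and attributes into a fresh Document.
    static Document sanitize(String untrustedHtml) {
        // Step 1: parse the input as a body fragment (no regex involved).
        Document dirty = Jsoup.parseBodyFragment(untrustedHtml);
        // Step 2: walk the parse tree and keep only what the safelist permits.
        Cleaner cleaner = new Cleaner(Safelist.basic());
        return cleaner.clean(dirty);
    }

    public static void main(String[] args) {
        String unsafe = "<p>Hi <script>alert(1)</script><b onclick='x()'>there</b></p>";
        Document clean = sanitize(unsafe);
        // The script element and the onclick attribute are not on the basic safelist.
        System.out.println(clean.body().html());
    }
}
```

Working with a Document rather than a String is handy when you want to run further DOM queries on the cleaned result before serializing it.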
jsoup provides a range of Safelist configurations to suit most requirements; they can be modified if necessary, but take care.
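As one illustration of modifying a canned configuration, the sketch below starts from Safelist.basic() and permits two extra heading tags plus a title attribute on links. The particular tags and attributes chosen, and the names CustomSafelistSketch and commentSafelist, are assumptions for this example, not a recommendation; every addition widens what untrusted users can submit, so add only what you need.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CustomSafelistSketch {
    // Hypothetical safelist for comment bodies: the basic set, plus
    // sub-headings and link titles. Safelist.basic() returns a fresh
    // instance, so these changes do not affect other callers.
    static Safelist commentSafelist() {
        return Safelist.basic()
                .addTags("h2", "h3")
                .addAttributes("a", "title");
    }

    public static void main(String[] args) {
        String unsafe = "<h2 onclick='x()'>Title</h2>"
                + "<p><a href='http://example.com/' title='t' style='color:red'>Link</a></p>"
                + "<iframe src='http://evil.example/'></iframe>";
        // h2 and the title attribute survive; onclick, style, and iframe do not.
        System.out.println(Jsoup.clean(unsafe, commentSafelist()));
    }
}
```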
The cleaner is useful not only for avoiding XSS, but also in limiting the range of elements the user can provide: you may be OK with textual strong elements, but not structural ones.
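To make the textual versus structural distinction concrete, the sketch below cleans the same input with two of the canned safelists: Safelist.simpleText(), which keeps only inline text formatting, and Safelist.relaxed(), which also permits structural markup. The use of div and table as the structural elements, and the helper names, are assumptions chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class LimitElementsSketch {
    // Hypothetical helper: keep inline text formatting only (b, em, i, strong, u).
    static String textOnly(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.simpleText());
    }

    // Hypothetical helper: also permit structural markup such as div and table.
    static String withStructure(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.relaxed());
    }

    public static void main(String[] args) {
        String input = "<div><table><tr><td><strong>Bold</strong> text</td></tr></table></div>";
        // The strong element is kept; the div and table wrappers are dropped.
        System.out.println(textOnly(input));
        // The div and table wrappers are kept as well.
        System.out.println(withStructure(input));
    }
}
```

In both cases the text content is preserved; only the surrounding markup that the chosen safelist does not permit is removed.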
- See the XSS cheat sheet and filter evasion guide for examples of how regular-expression filters fail, and why a safelist, parser-based sanitizer is the correct approach.
- See the Cleaner reference if you want to get a Document instead of a String return
- See the Safelist reference for the different canned options, and to create a custom safelist
- The nofollow link attribute