java.lang.Object
  org.jsoup.Jsoup
public class Jsoup
The core public access point to the jsoup functionality.
| Method Summary | |
|---|---|
| static String | clean(String bodyHtml, String baseUri, Whitelist whitelist) Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes. |
| static String | clean(String bodyHtml, Whitelist whitelist) Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes. |
| static Connection | connect(String url) Creates a new Connection to a URL. |
| static boolean | isValid(String bodyHtml, Whitelist whitelist) Test if the input HTML has only tags and attributes allowed by the Whitelist. |
| static Document | parse(File in, String charsetName) Parse the contents of a file as HTML. |
| static Document | parse(File in, String charsetName, String baseUri) Parse the contents of a file as HTML. |
| static Document | parse(InputStream in, String charsetName, String baseUri) Read an input stream, and parse it to a Document. |
| static Document | parse(String html) Parse HTML into a Document. |
| static Document | parse(String html, String baseUri) Parse HTML into a Document. |
| static Document | parse(URL url, int timeoutMillis) Fetch a URL, and parse it as HTML. |
| static Document | parseBodyFragment(String bodyHtml) Parse a fragment of HTML, with the assumption that it forms the body of the HTML. |
| static Document | parseBodyFragment(String bodyHtml, String baseUri) Parse a fragment of HTML, with the assumption that it forms the body of the HTML. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Method Detail |
|---|
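parse
public static Document parse(String html,
String baseUri)
-
Parse HTML into a Document.
-
-
Parameters:
-
html - HTML to parse
-
baseUri - The URL where the HTML was retrieved from. Used to resolve relative URLs to absolute URLs when they occur before the HTML declares a <base href> tag.
parse
public static Document parse(String html)
-
Parse HTML into a Document. As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag.
-
-
Parameters:
-
html - HTML to parse
-
See Also:
-
parse(String, String)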
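The effect of the base URI on link resolution can be sketched as follows. This is a minimal illustration, assuming jsoup is on the classpath; the class name `BaseUriSketch`, the helper `firstAbsoluteHref`, and the sample markup are hypothetical and not part of the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriSketch {
    /**
     * Hypothetical helper: parses the HTML (with a base URI when one is given)
     * and returns the absolute URL of the first link, or "" when there is no
     * link or the href cannot be made absolute.
     */
    static String firstAbsoluteHref(String html, String baseUri) {
        Document doc = (baseUri == null)
                ? Jsoup.parse(html)            // no base URI: relies on <base href> in the HTML
                : Jsoup.parse(html, baseUri);  // base URI supplied by the caller
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/docs/index.html\">Docs</a></p>";
        System.out.println(firstAbsoluteHref(html, "http://example.com/start/"));
        System.out.println(firstAbsoluteHref(html, null).isEmpty());
    }
}
```

With the base URI supplied, the relative link resolves against `http://example.com/`; without it, and without a `<base href>` tag in the markup, `absUrl` has nothing to resolve against and returns an empty string.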
connect
public static Connection connect(String url)
-
Creates a new Connection to a URL. Use to fetch and parse an HTML page.
Use examples:
Document doc = Jsoup.connect("http://example.com").userAgent("Mozilla").data("name", "jsoup").get();
Document doc = Jsoup.connect("http://example.com").cookie("auth", "token").post();
-
Parameters:
-
url - URL to connect to. The protocol must be http or https.
-
Returns:
-
the connection. You can add data, cookies, and headers; set the user-agent, referrer, method; and then execute.
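Because the returned Connection is only configured, not executed, it can be prepared and inspected before any request is sent. The sketch below assumes jsoup is on the classpath; the class name `ConnectSketch`, the helper `prepare`, and the chosen timeout are hypothetical values for illustration.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class ConnectSketch {
    /**
     * Hypothetical helper: builds a configured Connection without executing it.
     * Nothing is sent over the network until get(), post(), or execute() is called.
     */
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla")      // user-agent header
                .timeout(5000)             // timeout in milliseconds (illustrative value)
                .data("name", "jsoup");    // request parameter
    }

    public static void main(String[] args) {
        Connection conn = prepare("http://example.com");
        // Inspect the pending request; no network access happens here.
        System.out.println(conn.request().timeout());
        // To actually fetch and parse the page: Document doc = conn.get();
    }
}
```

Calling `conn.get()` or `conn.post()` afterwards performs the request and can throw an IOException, so callers would normally wrap that step in their own error handling.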
parse
public static Document parse(File in,
String charsetName,
String baseUri)
throws IOException
-
Parse the contents of a file as HTML.
-
-
Parameters:
-
in - file to load HTML from
-
charsetName - (optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
-
baseUri - The URL where the HTML was retrieved from, to resolve relative links against.
-
Returns:
-
the parsed Document, normalized into sane HTML
-
Throws:
-
IOException - if the file could not be found, or read, or if the charsetName is invalid.
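A file-based parse might look like the sketch below. It assumes jsoup is on the classpath and writes its own temporary file so that it is self-contained; the class name `ParseFileSketch`, the helpers `load` and `sampleTitle`, and the sample markup are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseFileSketch {
    /** Thin wrapper that surfaces the IOException documented for parse(File, ...). */
    static Document load(File in, String charsetName, String baseUri) throws IOException {
        return Jsoup.parse(in, charsetName, baseUri);
    }

    /** Hypothetical demo: writes a small UTF-8 file, parses it, and returns its title. */
    static String sampleTitle() {
        try {
            File file = File.createTempFile("jsoup-sketch", ".html");
            file.deleteOnExit();
            String html = "<html><head><title>Sample page</title></head>"
                    + "<body><p>Hi</p></body></html>";
            Files.write(file.toPath(), html.getBytes(StandardCharsets.UTF_8));
            // The charset is named explicitly here; passing null asks jsoup to detect it.
            return load(file, "UTF-8", "http://example.com/").title();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sampleTitle());
    }
}
```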
parse
public static Document parse(File in,
String charsetName)
throws IOException
-
Parse the contents of a file as HTML. The location of the file is used as the base URI to qualify relative URLs.
-
-
Parameters:
-
in - file to load HTML from
-
charsetName - (optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
-
Returns:
-
the parsed Document, normalized into sane HTML
-
Throws:
-
IOException - if the file could not be found, or read, or if the charsetName is invalid.
-
See Also:
-
parse(File, String, String)
parse
public static Document parse(InputStream in,
String charsetName,
String baseUri)
throws IOException
-
Read an input stream, and parse it to a Document.
-
-
Parameters:
-
in - input stream to read. Make sure to close it after parsing.
-
charsetName - (optional) character set of the stream contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
-
baseUri - The URL where the HTML was retrieved from, to resolve relative links against.
-
Returns:
-
the parsed Document, normalized into sane HTML
-
Throws:
-
IOException - if the stream could not be read, or if the charsetName is invalid.
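Since the caller is responsible for closing the stream, a try-with-resources block is one convenient way to do so. The sketch below assumes jsoup is on the classpath and uses an in-memory stream in place of a real source; the class name `ParseStreamSketch`, its helpers, and the sample bytes are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseStreamSketch {
    /** Parses UTF-8 encoded HTML bytes and returns the text of the first paragraph. */
    static String firstParagraph(byte[] htmlBytes) throws IOException {
        // try-with-resources closes the stream once parsing has finished.
        try (InputStream in = new ByteArrayInputStream(htmlBytes)) {
            Document doc = Jsoup.parse(in, "UTF-8", "http://example.com/");
            return doc.select("p").first().text();
        }
    }

    /** Hypothetical demo with a non-ASCII character to exercise the charset argument. */
    static String demo() {
        try {
            byte[] bytes = "<p>Caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);
            return firstParagraph(bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```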
parseBodyFragment
public static Document parseBodyFragment(String bodyHtml,
String baseUri)
-
Parse a fragment of HTML, with the assumption that it forms the body of the HTML.
-
-
Parameters:
-
bodyHtml - body HTML fragment
-
baseUri - URL to resolve relative URLs against.
-
Returns:
-
sane HTML document
-
See Also:
-
Document.body()
parseBodyFragment
public static Document parseBodyFragment(String bodyHtml)
-
Parse a fragment of HTML, with the assumption that it forms the body of the HTML.
-
-
Parameters:
-
bodyHtml - body HTML fragment
-
Returns:
-
sane HTML document
-
See Also:
-
Document.body()
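Both parseBodyFragment overloads return a full Document whose body holds the parsed fragment, which is why Document.body() is the usual next call. A minimal sketch, assuming jsoup is on the classpath; the class name `BodyFragmentSketch`, the helper `bodyOf`, and the deliberately unclosed sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BodyFragmentSketch {
    /** Parses a fragment and returns the body element that wraps it. */
    static Element bodyOf(String bodyHtml) {
        Document doc = Jsoup.parseBodyFragment(bodyHtml);
        return doc.body();
    }

    public static void main(String[] args) {
        // The fragment leaves <p> and <b> unclosed; the parser balances the tree.
        Element body = bodyOf("<p>Hello <b>world");
        System.out.println(body.html());
        System.out.println(body.text());
    }
}
```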
parse
public static Document parse(URL url,
int timeoutMillis)
throws IOException
-
Fetch a URL, and parse it as HTML. Provided for compatibility; in most cases use connect(String) instead.
The encoding character set is determined by the content-type header or http-equiv meta tag, or falls back to UTF-8.
-
-
Parameters:
-
url - URL to fetch (with a GET). The protocol must be http or https.
-
timeoutMillis - Connection and read timeout, in milliseconds. If exceeded, IOException is thrown.
-
Returns:
-
The parsed HTML.
-
Throws:
-
IOException - If the final server response != 200 OK (redirects are followed), or if there's an error reading the response stream.
-
See Also:
-
connect(String)
clean
public static String clean(String bodyHtml,
String baseUri,
Whitelist whitelist)
-
Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.
-
-
Parameters:
-
bodyHtml - input untrusted HTML
-
baseUri - URL to resolve relative URLs against
-
whitelist - white-list of permitted HTML elements
-
Returns:
-
safe HTML
-
See Also:
-
Cleaner.clean(Document)
clean
public static String clean(String bodyHtml,
Whitelist whitelist)
-
Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.
-
-
Parameters:
-
bodyHtml - input untrusted HTML
-
whitelist - white-list of permitted HTML elements
-
Returns:
-
safe HTML
-
See Also:
-
Cleaner.clean(Document)
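Cleaning untrusted input might be sketched as below. It assumes a jsoup release matching this page, where the `Whitelist` class exists (later releases renamed it `Safelist`), and uses the built-in `Whitelist.basic()` profile; the class name `CleanSketch`, the helper `sanitize`, and the sample input are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanSketch {
    /** Filters untrusted HTML through the basic whitelist of text-level tags. */
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Whitelist.basic());
    }

    public static void main(String[] args) {
        String untrusted = "<p onclick=\"steal()\">Hello <b>there</b>"
                + "<script>alert(1)</script></p>";
        // The script element and the onclick attribute are not on the whitelist,
        // so neither survives; the permitted <p> and <b> tags are kept.
        System.out.println(sanitize(untrusted));
    }
}
```

The three-argument overload takes the same inputs plus a base URI, which matters when the input contains relative links that must be resolved before the whitelist checks their protocols.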
isValid
public static boolean isValid(String bodyHtml,
Whitelist whitelist)
-
Test if the input HTML has only tags and attributes allowed by the Whitelist. Useful for form validation. The input HTML should still be run through the cleaner to set up enforced attributes, and to tidy the output.
-
-
Parameters:
-
bodyHtml - HTML to test
-
whitelist - whitelist to test against
-
Returns:
-
true if no tags or attributes were removed; false otherwise
-
See Also:
-
clean(String, org.jsoup.safety.Whitelist)
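A form-validation flow that checks the input first and still runs it through the cleaner, as the description recommends, could look like this sketch. It assumes a jsoup release matching this page, where `Whitelist` exists (later releases renamed it `Safelist`); the class name `IsValidSketch` and the helpers `isAcceptable` and `accept` are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class IsValidSketch {
    /** True when the input uses only tags and attributes on the basic whitelist. */
    static boolean isAcceptable(String bodyHtml) {
        return Jsoup.isValid(bodyHtml, Whitelist.basic());
    }

    /**
     * Rejects input with disallowed markup; accepted input is still cleaned so
     * that enforced attributes are applied and the output is tidied.
     */
    static String accept(String bodyHtml) {
        if (!isAcceptable(bodyHtml)) {
            throw new IllegalArgumentException("Input contains disallowed markup");
        }
        return Jsoup.clean(bodyHtml, Whitelist.basic());
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("<p>Hi <b>there</b></p>"));
        System.out.println(isAcceptable("<p>Hi <img src=\"x.png\"></p>"));
        System.out.println(accept("<p>Hi <b>there</b></p>"));
    }
}
```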
Copyright © 2009-2011 Jonathan Hedley. All Rights Reserved.