public class Parser extends java.lang.Object implements java.io.Serializable, ConnectionMonitor
Modifier and Type | Field and Description |
---|---|
static ParserFeedback |
DEVNULL
A quiet message sink.
|
static ParserFeedback |
STDOUT
A verbose message sink.
|
static java.lang.String |
VERSION_DATE
The date of the version ("Jun 10, 2006").
|
static double |
VERSION_NUMBER
The floating point version number (1.6).
|
static java.lang.String |
VERSION_STRING
The display version ("1.6 (Release Build Jun 10, 2006)").
|
static java.lang.String |
VERSION_TYPE
The type of version ("Release Build").
|
Constructor and Description |
---|
Parser()
Zero argument constructor.
|
Parser(Lexer lexer)
Construct a parser using the provided lexer.
|
Parser(Lexer lexer,
ParserFeedback fb)
Construct a parser using the provided lexer and feedback object.
|
Parser(java.lang.String resource)
Creates a Parser object with the location of the resource (URL or file).
|
Parser(java.lang.String resource,
ParserFeedback feedback)
Creates a Parser object with the location of the resource (URL or file).
You would typically create a DefaultHTMLParserFeedback object and pass it in.
|
Parser(java.net.URLConnection connection)
Construct a parser using the provided URLConnection.
|
Parser(java.net.URLConnection connection,
ParserFeedback fb)
Constructor for custom HTTP access.
|
Modifier and Type | Method and Description |
---|---|
static Parser |
createParser(java.lang.String html,
java.lang.String charset)
Creates the parser on an input string.
|
NodeIterator |
elements() |
NodeList |
extractAllNodesThatMatch(NodeFilter filter)
Extract all nodes matching the given filter.
|
java.net.URLConnection |
getConnection()
Return the current connection.
|
static ConnectionManager |
getConnectionManager()
Get the connection manager all Parsers use.
|
java.lang.String |
getEncoding()
Get the encoding for the page this parser is reading from.
|
ParserFeedback |
getFeedback()
Returns the current feedback object.
|
Lexer |
getLexer()
Returns the lexer associated with the parser.
|
NodeFactory |
getNodeFactory()
Get the current node factory.
|
java.lang.String |
getURL()
Return the current URL being parsed.
|
static java.lang.String |
getVersion()
Return the version string of this parser.
|
static double |
getVersionNumber()
Return the version number of this parser.
|
static void |
main(java.lang.String[] args)
The main program, which can be executed from the command line.
|
NodeList |
parse(NodeFilter filter) |
void |
postConnect(java.net.HttpURLConnection connection)
Called just after calling connect.
|
void |
preConnect(java.net.HttpURLConnection connection)
Called just prior to calling connect.
|
void |
reset()
Reset the parser to start from the beginning again.
|
void |
setConnection(java.net.URLConnection connection)
Set the connection for this parser.
|
static void |
setConnectionManager(ConnectionManager manager)
Set the connection manager all Parsers use.
|
void |
setEncoding(java.lang.String encoding)
Set the encoding for the page this parser is reading from.
|
void |
setFeedback(ParserFeedback fb)
Sets the feedback object used in scanning.
|
void |
setInputHTML(java.lang.String inputHTML)
Initializes the parser with the given input HTML String.
|
void |
setLexer(Lexer lexer)
Set the lexer for this parser.
|
void |
setNodeFactory(NodeFactory factory)
Set the current node factory.
|
void |
setResource(java.lang.String resource)
Set the html, a url, or a file.
|
void |
setURL(java.lang.String url)
Set the URL for this parser.
|
void |
visitAllNodesWith(NodeVisitor visitor)
Apply the given visitor to the current page.
|
public static final double VERSION_NUMBER
public static final java.lang.String VERSION_TYPE
public static final java.lang.String VERSION_DATE
public static final java.lang.String VERSION_STRING
public static final ParserFeedback DEVNULL
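public static final ParserFeedback STDOUT
A verbose message sink that prints to System.out.
public Parser()
Zero argument constructor. Set the lexer or connection afterwards using setLexer(org.htmlparser.lexer.Lexer) or setConnection(java.net.URLConnection).
public Parser(Lexer lexer, ParserFeedback fb)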
lexer - The lexer to draw characters from.
fb - The object to use when information, warning and error messages are produced. If null no feedback is provided.
public Parser(java.net.URLConnection connection, ParserFeedback fb) throws ParserException
Constructor for custom HTTP access (see ConnectionManager).
connection - A fully conditioned connection. The connect() method will be called so it need not be connected yet.
fb - The object to use for message communication.
ParserException - If the creation of the underlying Lexer cannot be performed.
public Parser(java.lang.String resource, ParserFeedback feedback) throws ParserException
resource - Either a URL, a filename or a string of HTML. The string is considered HTML if the first non-whitespace character is a <. The use of a URL or file is autodetected by first attempting to open the resource as a URL; if that fails, it is assumed to be a file name. A standard HTTP GET is performed to read the content of the URL.
feedback - The HTMLParserFeedback object to use when information, warning and error messages are produced. If null no feedback is provided.
ParserException - If the URL is invalid.
See also: Parser(URLConnection,ParserFeedback)
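The autodetection rule described for the resource parameter can be sketched in plain Java. This is an illustration under stated assumptions, not the library's code: the class `ResourceKindSketch`, the enum `Kind` and the method `classify` are hypothetical names, and the sketch only checks whether the string parses as an absolute URL instead of actually opening it as the documentation describes.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;

public class ResourceKindSketch {
    /** Hypothetical labels for the three kinds of resource string. */
    public enum Kind { HTML, URL, FILE }

    /** Classify a resource string following the documented order of checks. */
    public static Kind classify(String resource) {
        if (resource == null) {
            throw new IllegalArgumentException("resource cannot be null");
        }
        // Rule 1: HTML if the first non-whitespace character is '<'.
        for (int i = 0; i < resource.length(); i++) {
            char c = resource.charAt(i);
            if (!Character.isWhitespace(c)) {
                if (c == '<') {
                    return Kind.HTML;
                }
                break;
            }
        }
        // Rule 2: try it as a URL (simplified here to a syntax check, no network access).
        try {
            new URI(resource).toURL();
            return Kind.URL;
        } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
            // Rule 3: anything that is not a usable URL is assumed to be a file name.
            return Kind.FILE;
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("  <html><body>hi</body></html>"));
        System.out.println(classify("http://example.com/index.html"));
        System.out.println(classify("page.html"));
    }
}
```

Because the real constructor opens the URL, a string that parses as a URL but cannot be reached may be treated differently there than in this simplified check.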
public Parser(java.lang.String resource) throws ParserException
resource - Either HTML, a URL or a filename (autodetects).
ParserException - If the resource argument does not resolve to a valid page or file.
See also: Parser(String,ParserFeedback)
public Parser(Lexer lexer)
A feedback object printing to System.out is used.
This would be used to create a parser for special cases where the normal creation of a lexer on a URLConnection needs to be customized.
lexer - The lexer to draw characters from.
public Parser(java.net.URLConnection connection) throws ParserException
Construct a parser using the provided URLConnection (see ConnectionManager).
A feedback object printing to System.out is used.
connection - A fully conditioned connection. The connect() method will be called so it need not be connected yet.
ParserException - If the creation of the underlying Lexer cannot be performed.
See also: Parser(URLConnection,ParserFeedback)
public static java.lang.String getVersion()
Returns a string of the form "[floating point number] ([build-type] [build-date])".
public static double getVersionNumber()
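The version string format given for getVersion() can be illustrated by composing it from the three version constants listed in the field summary. This is a minimal sketch: the class `VersionStringSketch` and its method `versionString` are hypothetical, the constant values are copied from the field descriptions, and nothing here claims to be how the library builds VERSION_STRING.

```java
public class VersionStringSketch {
    // Values copied from the field summary of this page.
    public static final double VERSION_NUMBER = 1.6;
    public static final String VERSION_TYPE = "Release Build";
    public static final String VERSION_DATE = "Jun 10, 2006";

    /** Compose "[floating point number] ([build-type] [build-date])". */
    public static String versionString() {
        return VERSION_NUMBER + " (" + VERSION_TYPE + " " + VERSION_DATE + ")";
    }

    public static void main(String[] args) {
        // prints: 1.6 (Release Build Jun 10, 2006)
        System.out.println(versionString());
    }
}
```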
public static ConnectionManager getConnectionManager()
See also: setConnectionManager(org.htmlparser.http.ConnectionManager)
public static void setConnectionManager(ConnectionManager manager)
manager - The new connection manager.
See also: getConnectionManager()
public static Parser createParser(java.lang.String html, java.lang.String charset)
html - The string containing HTML.
charset - Optional. The character set encoding that will be reported by getEncoding(). If charset is null the default character set is used.
Returns a parser with the html string as input.
java.lang.IllegalArgumentException - if html is null.
public void setResource(java.lang.String resource) throws ParserException
resource - The resource to use.
java.lang.IllegalArgumentException - if resource is null.
ParserException - if a problem occurs in connecting.
public void setConnection(java.net.URLConnection connection) throws ParserException
Set the connection for this parser, creating a Lexer reading from the connection.
connection - A fully conditioned connection. The connect() method will be called so it need not be connected yet.
ParserException - if the character set specified in the HTTP header is not supported, or an I/O exception occurs creating the lexer.
java.lang.IllegalArgumentException - if connection is null.
ParserException - if a problem occurs in connecting.
See also: setLexer(org.htmlparser.lexer.Lexer), getConnection()
public java.net.URLConnection getConnection()
Return the current connection (see setConnection(java.net.URLConnection)).
See also: setConnection(URLConnection)
public void setURL(java.lang.String url) throws ParserException
url - The new URL for the parser.
ParserException - If the url is invalid or creation of the underlying Lexer cannot be performed.
ParserException - if a problem occurs in connecting.
See also: getURL()
public java.lang.String getURL()
See also: Page.getUrl(), setURL(java.lang.String)
public void setEncoding(java.lang.String encoding) throws ParserException
encoding - The new character set to use.
ParserException - If the encoding change causes characters that have already been consumed to differ from the characters that would have been seen had the new encoding been in force.
See also: EncodingChangeException, getEncoding()
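The failure condition described for setEncoding can be made concrete with a small check on raw bytes. This is only a sketch of the condition itself, assuming the already-consumed bytes are still available; the class `EncodingChangeSketch` and the method `consumedTextDiffers` are hypothetical and do not reflect how the library detects the change.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingChangeSketch {
    /**
     * Returns true if the bytes consumed so far decode to different
     * characters under the new character set than under the old one.
     */
    public static boolean consumedTextDiffers(byte[] consumed, Charset oldCs, Charset newCs) {
        String before = new String(consumed, oldCs);
        String after = new String(consumed, newCs);
        return !before.equals(after);
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        Charset utf8 = StandardCharsets.UTF_8;

        // Pure ASCII decodes identically in both character sets.
        byte[] ascii = "<html><head>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(consumedTextDiffers(ascii, latin1, utf8)); // false

        // An accented character stored as a single ISO-8859-1 byte is not valid UTF-8.
        byte[] accented = "<title>caf\u00e9".getBytes(latin1);
        System.out.println(consumedTextDiffers(accented, latin1, utf8)); // true
    }
}
```

In the second case a switch to the new encoding would change text that was already read, which is the situation the ParserException above describes.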
public java.lang.String getEncoding()
See also: setEncoding(java.lang.String)
public void setLexer(Lexer lexer)
Set the lexer for this parser. This does not adjust the feedback object.
lexer - The lexer object to use.
java.lang.IllegalArgumentException - if lexer is null.
See also: setNodeFactory(org.htmlparser.NodeFactory), getLexer()
public Lexer getLexer()
See also: setLexer(org.htmlparser.lexer.Lexer)
public NodeFactory getNodeFactory()
See also: setNodeFactory(org.htmlparser.NodeFactory)
public void setNodeFactory(NodeFactory factory)
factory - The new node factory for the current lexer.
java.lang.IllegalArgumentException - if factory is null.
See also: getNodeFactory()
public void setFeedback(ParserFeedback fb)
fb - The new feedback object to use. If this is null a silent feedback object is used.
See also: getFeedback()
public ParserFeedback getFeedback()
See also: setFeedback(org.htmlparser.util.ParserFeedback)
public void reset()
Reset the parser to start from the beginning again, using the underlying Source object.
This is cheaper (in terms of time) than resetting the URL, i.e. parser.setURL (parser.getURL ()); because the page is not refetched from the internet.
Note: the nodes returned on the second parse are new nodes and not the same nodes returned on the first parse. If you want the same nodes for re-use, collect them in a NodeList with parse(null) and operate on the NodeList.
public NodeIterator elements() throws ParserException
ParserException
public NodeList parse(NodeFilter filter) throws ParserException
ParserException
public void visitAllNodesWith(NodeVisitor visitor) throws ParserException
The visitor is passed to the accept() method of each node in the page in a depth first traversal. The visitor's beginParsing() method is called prior to processing the page and finishedParsing() is called after the processing.
visitor - The visitor to visit all nodes with.
ParserException - If a parse error occurs while traversing the page with the visitor.
public void setInputHTML(java.lang.String inputHTML) throws ParserException
inputHTML - the input HTML that is to be parsed.
ParserException - If an error occurs in setting up the underlying Lexer.
java.lang.IllegalArgumentException - if inputHTML is null.
public NodeList extractAllNodesThatMatch(NodeFilter filter) throws ParserException
filter - The filter to be applied to the nodes.
Returns the nodes for which the filter's accept method returned true.
ParserException - If a parse error occurs.
See also: Node.collectInto(NodeList, NodeFilter)
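The idea of extracting every node a filter accepts can be sketched as a recursive, depth first collection over a small tree. The types below are stand-ins invented for illustration: `SimpleNode` is a hypothetical node class and a `java.util.function.Predicate` plays the role of the filter, so this shows the general technique and not the library's Node, NodeList or NodeFilter classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class FilterCollectSketch {
    /** Hypothetical minimal tree node: a tag name plus child nodes. */
    public static final class SimpleNode {
        public final String name;
        public final List<SimpleNode> children = new ArrayList<>();

        public SimpleNode(String name, SimpleNode... kids) {
            this.name = name;
            for (SimpleNode kid : kids) {
                children.add(kid);
            }
        }
    }

    /** Add the node to out if the filter accepts it, then recurse into its children. */
    public static void collectInto(SimpleNode node, List<SimpleNode> out, Predicate<SimpleNode> filter) {
        if (filter.test(node)) {
            out.add(node);
        }
        for (SimpleNode child : node.children) {
            collectInto(child, out, filter);
        }
    }

    /** Collect every node in the tree for which the filter returns true. */
    public static List<SimpleNode> extractAll(SimpleNode root, Predicate<SimpleNode> filter) {
        List<SimpleNode> matches = new ArrayList<>();
        collectInto(root, matches, filter);
        return matches;
    }

    public static void main(String[] args) {
        SimpleNode page = new SimpleNode("html",
                new SimpleNode("body",
                        new SimpleNode("a"),
                        new SimpleNode("p", new SimpleNode("a"))));
        List<SimpleNode> links = extractAll(page, n -> n.name.equals("a"));
        System.out.println(links.size()); // 2
    }
}
```

Nested matches are found because the recursion continues into children whether or not the parent itself matched.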
public void preConnect(java.net.HttpURLConnection connection) throws ParserException
Specified by: preConnect in interface ConnectionMonitor
connection - The connection which is about to be connected.
ParserException - Not used.
See also: ConnectionMonitor.preConnect(java.net.HttpURLConnection)
public void postConnect(java.net.HttpURLConnection connection) throws ParserException
Specified by: postConnect in interface ConnectionMonitor
connection - The connection that was just connected.
ParserException - Not used.
See also: ConnectionMonitor.postConnect(java.net.HttpURLConnection)
public static void main(java.lang.String[] args)
args - A URL or file name to parse, and an optional tag name to be used as a filter.