java.lang.Object
  org.apache.nutch.parse.js.JSParseFilter
public class JSParseFilter
This class is a heuristic link extractor for JavaScript files and code snippets. The general idea of two-pass regex matching comes from Heritrix. Parts of the code come from OutlinkExtractor.java by Stephan Strittmatter.
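The two-pass idea mentioned above can be illustrated with a minimal, self-contained sketch. It is not the Nutch or Heritrix code: the class name `TwoPassLinkSketch`, the method `extract`, and both regular expressions are hypothetical and chosen only for illustration. The first pass collects quoted string literals from the script; the second pass keeps only the literals that look like links.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of two-pass regex link extraction; not the Nutch implementation. */
final class TwoPassLinkSketch {

  // Pass 1 (assumed pattern): double- or single-quoted string literals.
  // Group 1 holds the content of a double-quoted literal, group 2 of a single-quoted one.
  private static final Pattern STRING_LITERAL =
      Pattern.compile("\"([^\"]*)\"|'([^']*)'");

  // Pass 2 (assumed pattern): an absolute http(s) URL, or a relative path
  // ending in one of a few common extensions.
  private static final Pattern URL_LIKE =
      Pattern.compile("https?://\\S+|/?[\\w.-]+(/[\\w.-]+)*\\.(html?|js|php)");

  /** Returns the string literals in the script that look like links, in order of appearance. */
  static List<String> extract(String script) {
    List<String> links = new ArrayList<>();
    Matcher literals = STRING_LITERAL.matcher(script);
    while (literals.find()) {
      String candidate = literals.group(1) != null ? literals.group(1) : literals.group(2);
      // The whole literal must match the link pattern, not just a part of it.
      if (URL_LIKE.matcher(candidate).matches()) {
        links.add(candidate);
      }
    }
    return links;
  }
}
```

Calling `TwoPassLinkSketch.extract(...)` on a script fragment returns the candidate links as plain strings. A real extractor would also resolve relative paths against the page's base URL and handle escaped quotes, which this sketch leaves out.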
| Field Summary | |
|---|---|
| static org.slf4j.Logger | LOG |
| Fields inherited from interface org.apache.nutch.parse.ParseFilter |
|---|
| X_POINT_ID |
| Fields inherited from interface org.apache.nutch.parse.Parser |
|---|
| X_POINT_ID |
| Constructor Summary | |
|---|---|
| JSParseFilter() | |
| Method Summary | |
|---|---|
| Parse | filter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc): Scan the JavaScript looking for possible Outlinks. |
| Configuration | getConf(): Get the Configuration object. |
| Collection<WebPage.Field> | getFields(): Gets all the fields for a given WebPage. Many datastores need to set up the MapReduce job by specifying the fields needed. |
| Parse | getParse(String url, WebPage page): Parse the WebPage at the given URL and return a Parse object. |
| static void | main(String[] args): Main method which can be run from the command line with the plugin option. |
| void | setConf(Configuration conf): Set the Configuration object. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public static final org.slf4j.Logger LOG
| Constructor Detail |
|---|
public JSParseFilter()
| Method Detail |
|---|
public Parse filter(String url,
WebPage page,
Parse parse,
HTMLMetaTags metaTags,
DocumentFragment doc)
Scan the JavaScript looking for possible Outlinks.
Specified by:
filter in interface ParseFilter
Parameters:
url - URL of the WebPage to be parsed
page - WebPage object relative to the URL
parse - Parse object holding parse status
metaTags - meta tags within the NutchDocument
doc - the NutchDocument object
Returns:
Parse object
public Parse getParse(String url,
WebPage page)
Parse the WebPage at the given URL.
Specified by:
getParse in interface Parser
Parameters:
url - URL of the WebPage which is parsed
page - WebPage object relative to the URL
Returns:
Parse object
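public static void main(String[] args)
throws Exception
Main method which can be run from the command line with the plugin option.
Parameters:
args - command-line arguments
Throws:
Exception
public void setConf(Configuration conf)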
Set the Configuration object.
Specified by:
setConf in interface Configurable
public Configuration getConf()
Get the Configuration object.
Specified by:
getConf in interface Configurable
public Collection<WebPage.Field> getFields()
Gets all the fields for a given WebPage. Many datastores need to set up the MapReduce job by specifying the fields needed. All extensions that work on WebPage are able to specify what fields they need.
Specified by:
getFields in interface FieldPluggable