Package org.apache.nutch.protocol
Interface Protocol
Field Summary
Fields
Modifier and Type    Field         Description
static String        X_POINT_ID    The name of the extension point.
Method Summary
ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum)
Get the ProtocolOutput for a given url and crawldatum.
crawlercommons.robots.BaseRobotRules getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent)
Retrieve robot rules applicable for this URL.
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
Field Detail
X_POINT_ID
static final String X_POINT_ID
The name of the extension point.
Method Detail
getProtocolOutput
ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum)
Get the ProtocolOutput for a given url and crawldatum.
Parameters:
url - canonical url
datum - associated CrawlDatum
Returns:
the ProtocolOutput
getRobotRules
crawlercommons.robots.BaseRobotRules getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent)
Retrieve robot rules applicable for this URL.
Parameters:
url - URL to check
datum - page datum
robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed nothing is stored.
Returns:
robot rules (specific for this URL or default), never null
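The robotsTxtContent contract described above (responses are appended to a caller-supplied list, null disables capture, and the return value is never null) can be illustrated with a small self-contained sketch. It does not use the Nutch classes: FetchedResponse and Rules are hypothetical stand-ins for Content and BaseRobotRules, the simplified getRobotRules signature is an assumption, and the simulated 404 response and its fallback to default rules are chosen purely for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class RobotsContentSketch {

    /** Hypothetical stand-in for org.apache.nutch.protocol.Content. */
    static final class FetchedResponse {
        final String url;
        final int statusCode;

        FetchedResponse(String url, int statusCode) {
            this.url = url;
            this.statusCode = statusCode;
        }
    }

    /** Hypothetical stand-in for crawlercommons.robots.BaseRobotRules. */
    static final class Rules {
        final boolean isDefault;

        Rules(boolean isDefault) {
            this.isDefault = isDefault;
        }
    }

    /**
     * Mimics the documented contract only: any response seen while looking
     * for robots.txt is appended to the passed list, a null list means
     * nothing is stored, and the result is never null.
     */
    static Rules getRobotRules(String url, List<FetchedResponse> robotsTxtContent) {
        // Derive the robots.txt location from the page URL.
        String robotsUrl = URI.create(url).resolve("/robots.txt").toString();

        // Assumption for this sketch: the fetch came back as a 404 error page.
        FetchedResponse response = new FetchedResponse(robotsUrl, 404);

        // Store the response only when the caller supplied a container.
        if (robotsTxtContent != null) {
            robotsTxtContent.add(response);
        }

        // Fall back to default rules instead of returning null.
        return new Rules(true);
    }

    public static void main(String[] args) {
        // Caller wants the raw responses for debugging or archiving.
        List<FetchedResponse> captured = new ArrayList<>();
        Rules withCapture = getRobotRules("https://example.com/page.html", captured);
        for (FetchedResponse r : captured) {
            System.out.println(r.statusCode + " " + r.url);
        }

        // Caller does not care about the responses: pass null.
        Rules withoutCapture = getRobotRules("https://example.com/page.html", null);
        System.out.println(withCapture != null && withoutCapture != null);
    }
}
```

Because the list is supplied by the caller, the same container can be reused across several calls to accumulate responses, including redirects and error pages, without changing the return type.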