Package org.apache.nutch.crawl
Class DefaultFetchSchedule
- java.lang.Object
  - org.apache.hadoop.conf.Configured
    - org.apache.nutch.crawl.AbstractFetchSchedule
      - org.apache.nutch.crawl.DefaultFetchSchedule
- All Implemented Interfaces: Configurable, FetchSchedule
public class DefaultFetchSchedule extends AbstractFetchSchedule
This class implements the default re-fetch schedule: regardless of whether the page changed, the fetchInterval remains unchanged, and the updated page fetchTime is always set to fetchTime + fetchInterval * 1000.
- Author: Andrzej Bialecki
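The arithmetic in the class description can be sketched as follows. This is an illustration only, not the Nutch source: the class `RefetchTimeSketch` and the helper `nextFetchTime` are hypothetical names, and the sketch assumes that fetchTime is in epoch milliseconds and fetchInterval is in seconds, which the factor of 1000 suggests.

```java
public class RefetchTimeSketch {

    // Hypothetical helper (not part of Nutch): computes the next fetch time
    // as fetchTime + fetchInterval * 1000.
    // Assumption: fetchTime is in milliseconds, fetchInterval in seconds.
    static long nextFetchTime(long fetchTimeMillis, int fetchIntervalSeconds) {
        // Widen to long before multiplying so that large intervals
        // do not overflow int arithmetic.
        return fetchTimeMillis + (long) fetchIntervalSeconds * 1000L;
    }

    public static void main(String[] args) {
        long fetchTime = 1_700_000_000_000L; // example timestamp in milliseconds
        int thirtyDays = 30 * 86_400;        // 30 days expressed in seconds

        // The result is the same whether or not the page changed,
        // because the interval itself is never adjusted.
        System.out.println(nextFetchTime(fetchTime, thirtyDays)); // 1702592000000
    }
}
```

The cast to `long` matters in a sketch like this: an interval near `Integer.MAX_VALUE` multiplied by 1000 would overflow if the product were computed as an `int`.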
Field Summary
-
Fields inherited from class org.apache.nutch.crawl.AbstractFetchSchedule
defaultInterval, maxInterval
-
Fields inherited from interface org.apache.nutch.crawl.FetchSchedule
SECONDS_PER_DAY, STATUS_MODIFIED, STATUS_NOTMODIFIED, STATUS_UNKNOWN
-
-
Constructor Summary
Constructor: DefaultFetchSchedule()
-
Method Summary
Modifier and Type: CrawlDatum
Method: setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
Description: Sets the fetchInterval and fetchTime on a successfully fetched page.
Methods inherited from class org.apache.nutch.crawl.AbstractFetchSchedule
calculateLastFetchTime, forceRefetch, initializeSchedule, setConf, setPageGoneSchedule, setPageRetrySchedule, shouldFetch
-
Methods inherited from class org.apache.hadoop.conf.Configured
getConf
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf
Method Detail
-
setFetchSchedule
public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
Description copied from class: AbstractFetchSchedule
Sets the fetchInterval and fetchTime on a successfully fetched page. NOTE: this implementation resets the retry counter - extending classes should call super.setFetchSchedule() to preserve this behavior.
- Specified by: setFetchSchedule in interface FetchSchedule
- Overrides: setFetchSchedule in class AbstractFetchSchedule
- Parameters:
url - URL of the page.
datum - page description to be adjusted. NOTE: this instance, passed by reference, may be modified inside the method.
prevFetchTime - previous value of fetch time, or 0 if not available.
prevModifiedTime - previous value of modifiedTime, or 0 if not available.
fetchTime - the time at which the page was most recently re-fetched. Most FetchSchedule implementations should update the value in the CrawlDatum to something greater than this value.
modifiedTime - last time the content was modified. This information comes from the protocol implementations, or is set to < 0 if not available. Most FetchSchedule implementations should update the value in the CrawlDatum to this value.
state - if FetchSchedule.STATUS_MODIFIED, then the content is considered to be "changed" before the fetchTime; if FetchSchedule.STATUS_NOTMODIFIED, then the content is known to be unchanged. This information may be obtained by comparing page signatures before and after fetching. If this is set to FetchSchedule.STATUS_UNKNOWN, then it is unknown whether the page was changed; implementations are free to follow a sensible default behavior.
- Returns:
adjusted page information, including all original information. NOTE: this may be a different instance than the CrawlDatum passed in, but implementations should make sure that it contains at least all information from that CrawlDatum.
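The contract described above (call the superclass so the retry counter is reset, keep the interval, then move fetchTime forward) can be sketched with self-contained stand-ins. Nothing here is the Nutch source: `Page`, `BaseSchedule`, `FixedIntervalSchedule`, and the integer state values are hypothetical, the parameter list is shortened for brevity, and the units (milliseconds for times, seconds for the interval) are an assumption.

```java
public class ScheduleContractSketch {

    // Hypothetical stand-in for CrawlDatum, holding only the fields discussed above.
    static class Page {
        long fetchTime;     // assumed: epoch milliseconds
        long modifiedTime;  // assumed: epoch milliseconds
        int fetchInterval;  // assumed: seconds
        int retries;        // retry counter
    }

    // Hypothetical stand-in for the abstract base class: it resets the retry counter.
    static class BaseSchedule {
        Page setFetchSchedule(Page page, long fetchTime, long modifiedTime, int state) {
            page.retries = 0;
            return page;
        }
    }

    // Hypothetical stand-in for a default-style schedule: the interval is never
    // adjusted and the state argument does not influence the result.
    static class FixedIntervalSchedule extends BaseSchedule {
        @Override
        Page setFetchSchedule(Page page, long fetchTime, long modifiedTime, int state) {
            // Call super first so the retry counter reset is preserved.
            page = super.setFetchSchedule(page, fetchTime, modifiedTime, state);
            page.fetchTime = fetchTime + (long) page.fetchInterval * 1000L;
            page.modifiedTime = modifiedTime;
            return page;
        }
    }

    public static void main(String[] args) {
        // Illustrative state values only; the real constants live in FetchSchedule.
        int[] states = {0, 1, 2};
        for (int state : states) {
            Page page = new Page();
            page.fetchInterval = 3600; // one hour
            page.retries = 2;
            Page updated = new FixedIntervalSchedule()
                    .setFetchSchedule(page, 1_000_000L, 900_000L, state);
            // Every state yields the same next fetch time and a cleared retry counter.
            System.out.println(state + " -> fetchTime=" + updated.fetchTime
                    + ", retries=" + updated.retries);
        }
    }
}
```

In this sketch the passed-in object is modified in place and returned, which matches the note on the datum parameter; a real implementation is also allowed to return a different instance.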