Package org.apache.nutch.crawl
Class AdaptiveFetchSchedule
- java.lang.Object
  - org.apache.hadoop.conf.Configured
    - org.apache.nutch.crawl.AbstractFetchSchedule
      - org.apache.nutch.crawl.AdaptiveFetchSchedule
- All Implemented Interfaces:
Configurable, FetchSchedule
- Direct Known Subclasses:
MimeAdaptiveFetchSchedule
public class AdaptiveFetchSchedule extends AbstractFetchSchedule
This class implements an adaptive re-fetch algorithm. It works as follows:
- for pages that have changed since the last fetchTime, decrease their fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
- for pages that haven't changed since the last fetchTime, increase their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
  If the SYNC_DELTA property is true, then:
  - calculate a delta = fetchTime - modifiedTime
  - try to synchronize with the time of change, by shifting the next fetchTime by a fraction of the difference between the last modification time and the last fetch time, i.e. the next fetch time will be set to fetchTime + fetchInterval - delta * SYNC_DELTA_RATE
  - if the adjusted fetch interval is bigger than the delta, then fetchInterval = delta.
- the minimum value of fetchInterval may not be smaller than MIN_INTERVAL (default is 1 minute).
- the maximum value of fetchInterval may not be bigger than MAX_INTERVAL (default is 365 days).
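The rules listed above can be sketched in plain Java. This is a minimal illustration that follows the wording of this description, not the Nutch source: the class `AdaptiveIntervalSketch`, its method names, and the numeric status constants are assumptions made for the example. "Decrease by a factor of DEC_FACTOR" is read here as multiplying by (1 - DEC_FACTOR), and likewise (1 + INC_FACTOR) for an increase. Intervals and delta are in seconds, times in milliseconds.

```java
/** Hypothetical stand-alone sketch of the adaptive rules; not the Nutch implementation. */
final class AdaptiveIntervalSketch {
    // Stand-ins for the FetchSchedule status constants (values are assumptions).
    static final int STATUS_UNKNOWN = 0;
    static final int STATUS_MODIFIED = 1;
    static final int STATUS_NOTMODIFIED = 2;

    /** Shrinks the interval for changed pages and grows it for unchanged pages. */
    static float scale(float interval, int state, float incFactor, float decFactor) {
        switch (state) {
            case STATUS_MODIFIED:
                return interval * (1.0f - decFactor);
            case STATUS_NOTMODIFIED:
                return interval * (1.0f + incFactor);
            default:
                return interval; // unknown state: leave the interval unchanged
        }
    }

    /** SYNC_DELTA rule as worded in the text: an interval bigger than delta becomes delta. */
    static float capAtDelta(float interval, long deltaSeconds) {
        return interval > deltaSeconds ? deltaSeconds : interval;
    }

    /** Keeps the interval within [minInterval, maxInterval]. */
    static float clamp(float interval, float minInterval, float maxInterval) {
        return Math.max(minInterval, Math.min(maxInterval, interval));
    }

    /** Next fetch time in ms: fetchTime + fetchInterval - delta * SYNC_DELTA_RATE. */
    static long nextFetchTime(long fetchTimeMs, float intervalSeconds,
                              long deltaSeconds, float syncDeltaRate) {
        return fetchTimeMs
            + Math.round(intervalSeconds * 1000.0)
            - Math.round(deltaSeconds * syncDeltaRate * 1000.0);
    }
}
```

Splitting the steps into separate methods keeps each rule checkable on its own; a caller would apply `scale`, then optionally `capAtDelta`, then `clamp`, and finally `nextFetchTime`.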
NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize the algorithm, so that the fetch interval either increases or decreases infinitely, with little relevance to the page changes. Please use the main(String[]) method to test the values before applying them in a production system.
The class also allows specifying custom min. and max. re-fetch intervals per hostname, in adaptive-host-specific-intervals.txt. If they are specified, the calculated re-fetch interval for a URL matching the hostname will not be allowed to fall outside of the corresponding range, instead of the default range.
- Author:
- Andrzej Bialecki
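To illustrate the kind of check the NOTE in the class description recommends, the sketch below simulates repeated fetches of a page that changes at a fixed period and reports where the interval ends up for several factor values. It is not the class's own main(String[]) method: the class `FactorSimulation`, the fixed-period change model, and all parameter values are assumptions made for this example.

```java
/** Hypothetical simulation for trying out factor values; not part of Nutch. */
final class FactorSimulation {
    /**
     * Simulates fetching a page that changes every changePeriod seconds and
     * returns the fetch interval (seconds) after the given number of fetches.
     */
    static float simulate(float startInterval, float incFactor, float decFactor,
                          float minInterval, float maxInterval,
                          long changePeriod, int fetches) {
        float interval = startInterval;
        long lastFetch = 0L; // simulated clock, in seconds
        for (int i = 0; i < fetches; i++) {
            long now = lastFetch + Math.max(1L, Math.round((double) interval));
            // The page counts as changed if a change boundary lies in (lastFetch, now].
            boolean changed = (now / changePeriod) > (lastFetch / changePeriod);
            interval = changed
                ? interval * (1.0f - decFactor)
                : interval * (1.0f + incFactor);
            interval = Math.max(minInterval, Math.min(maxInterval, interval));
            lastFetch = now;
        }
        return interval;
    }

    public static void main(String[] args) {
        long day = 86400L;
        // Page changes weekly; start from a 30-day interval and compare factor values.
        for (float f : new float[] {0.1f, 0.2f, 0.4f, 0.8f}) {
            float result = simulate(30 * day, f, f, 60f, 365f * day, 7 * day, 200);
            System.out.printf("factor=%.1f -> interval after 200 fetches: %.1f days%n",
                f, result / day);
        }
    }
}
```

Comparing the final interval against the page's real change period gives a rough sense of whether a given pair of factors tracks the page or drifts toward the minimum or maximum bound.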
-
-
Field Summary
Fields
Modifier and Type    Field       Description
protected float      DEC_RATE
protected float      INC_RATE
-
Fields inherited from class org.apache.nutch.crawl.AbstractFetchSchedule
defaultInterval, maxInterval
-
Fields inherited from interface org.apache.nutch.crawl.FetchSchedule
SECONDS_PER_DAY, STATUS_MODIFIED, STATUS_NOTMODIFIED, STATUS_UNKNOWN
-
-
Constructor Summary
Constructors
Constructor                Description
AdaptiveFetchSchedule()
-
Method Summary
Modifier and Type    Method and Description
Float                getCustomMaxInterval(Text url)
                     Returns the custom max. refetch interval for this URL, if specified for the corresponding hostname.
Float                getCustomMinInterval(Text url)
                     Returns the custom min. refetch interval for this URL, if specified for the corresponding hostname.
static String        getHostName(String url)
                     Strip a URL, leaving only the hostname.
static void          main(String[] args)
void                 setConf(Configuration conf)
CrawlDatum           setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
                     Sets the fetchInterval and fetchTime on a successfully fetched page.
-
Methods inherited from class org.apache.nutch.crawl.AbstractFetchSchedule
calculateLastFetchTime, forceRefetch, initializeSchedule, setPageGoneSchedule, setPageRetrySchedule, shouldFetch
-
Methods inherited from class org.apache.hadoop.conf.Configured
getConf
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf
-
-
-
-
Method Detail
-
setConf
public void setConf(Configuration conf)
- Specified by:
setConf in interface Configurable
- Overrides:
setConf in class AbstractFetchSchedule
-
getHostName
public static String getHostName(String url) throws URISyntaxException
Strip a URL, leaving only the hostname.
- Parameters:
url - the URL for which to get the hostname
- Returns:
hostname
- Throws:
URISyntaxException - if the given string violates RFC 2396
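A hostname can be extracted in the way described above with `java.net.URI`, whose constructor throws URISyntaxException for strings that violate RFC 2396. The sketch below is a stand-alone illustration under that assumption, not the Nutch method body; the class `HostNameSketch` and the helper `hostOrNull` are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical stand-alone sketch; not the Nutch implementation. */
final class HostNameSketch {
    /** Returns the host component of the URL, or null if the URI has none. */
    static String getHostName(String url) throws URISyntaxException {
        return new URI(url).getHost();
    }

    /** Convenience wrapper for callers that prefer null over a checked exception. */
    static String hostOrNull(String url) {
        try {
            return getHostName(url);
        } catch (URISyntaxException e) {
            return null; // malformed URL: no hostname can be derived
        }
    }
}
```

For example, `HostNameSketch.hostOrNull("https://nutch.apache.org:8080/index.html")` yields `nutch.apache.org`, while a URL containing a space yields null.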
-
getCustomMaxInterval
public Float getCustomMaxInterval(Text url)
Returns the custom max. refetch interval for this URL, if specified for the corresponding hostname.
- Parameters:
url - the URL to be scheduled
- Returns:
the configured max. interval or null
-
getCustomMinInterval
public Float getCustomMinInterval(Text url)
Returns the custom min. refetch interval for this URL, if specified for the corresponding hostname.
- Parameters:
url - the URL to be scheduled
- Returns:
the configured min. interval or null
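Because both getters may return null, a caller has to fall back to the default range when no per-host value is configured. The sketch below shows one way to do that; the class `CustomRangeSketch` is hypothetical, and treating the min. and max. bounds independently is an assumption of this example rather than confirmed behavior of the class.

```java
/** Hypothetical sketch of applying nullable per-host bounds; not the Nutch implementation. */
final class CustomRangeSketch {
    /**
     * Clamps the interval (seconds) to the per-host range where a bound is
     * configured, and to the default bound where it is null.
     */
    static float clamp(float interval, Float customMin, Float customMax,
                       float defaultMin, float defaultMax) {
        float min = (customMin != null) ? customMin : defaultMin;
        float max = (customMax != null) ? customMax : defaultMax;
        return Math.max(min, Math.min(max, interval));
    }
}
```

With a per-host range of one hour to one day, a calculated interval of 10 seconds is raised to 3600 seconds; with no per-host range it is raised only to the default minimum.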
-
setFetchSchedule
public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
Description copied from class: AbstractFetchSchedule
Sets the fetchInterval and fetchTime on a successfully fetched page. NOTE: this implementation resets the retry counter - extending classes should call super.setFetchSchedule() to preserve this behavior.
- Specified by:
setFetchSchedule in interface FetchSchedule
- Overrides:
setFetchSchedule in class AbstractFetchSchedule
- Parameters:
url - url of the page
datum - page description to be adjusted. NOTE: this instance, passed by reference, may be modified inside the method.
prevFetchTime - previous value of fetch time, or 0 if not available.
prevModifiedTime - previous value of modifiedTime, or 0 if not available.
fetchTime - the time at which the page was most recently re-fetched. Most FetchSchedule implementations should update the value in the CrawlDatum to something greater than this value.
modifiedTime - last time the content was modified. This information comes from the protocol implementations, or is set to < 0 if not available. Most FetchSchedule implementations should update the value in the CrawlDatum to this value.
state - if FetchSchedule.STATUS_MODIFIED, then the content is considered to be "changed" before the fetchTime; if FetchSchedule.STATUS_NOTMODIFIED, then the content is known to be unchanged. This information may be obtained by comparing page signatures before and after fetching. If this is set to FetchSchedule.STATUS_UNKNOWN, then it is unknown whether the page was changed; implementations are free to follow a sensible default behavior.
- Returns:
adjusted page information, including all original information. NOTE: this may be a different instance than the CrawlDatum passed in, but implementations should make sure that it contains at least all information from that CrawlDatum.
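The state argument can be derived by comparing page signatures before and after fetching, as the parameter description suggests. The sketch below shows that comparison in isolation; the class `FetchStateSketch`, the use of raw byte arrays for signatures, and the numeric values of the status constants are assumptions made for this example.

```java
import java.util.Arrays;

/** Hypothetical sketch of deriving the state argument; not the Nutch implementation. */
final class FetchStateSketch {
    // Stand-ins for the FetchSchedule status constants (values are assumptions).
    static final int STATUS_UNKNOWN = 0;
    static final int STATUS_MODIFIED = 1;
    static final int STATUS_NOTMODIFIED = 2;

    /** Compares the signature stored before the fetch with the one computed after it. */
    static int stateFromSignatures(byte[] previous, byte[] current) {
        if (previous == null || current == null) {
            return STATUS_UNKNOWN; // nothing to compare against
        }
        return Arrays.equals(previous, current) ? STATUS_NOTMODIFIED : STATUS_MODIFIED;
    }
}
```

Returning the unknown status when either signature is missing leaves the decision to the schedule's default behavior instead of guessing.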
-
-