- All Implemented Interfaces:
- Configurable, FetchSchedule
public class AdaptiveFetchSchedule
- extends AbstractFetchSchedule
This class implements an adaptive re-fetch algorithm. This works as follows:
- for pages that have changed since the last fetchTime, decrease their
fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
- for pages that haven't changed since the last fetchTime, increase their
fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
If SYNC_DELTA property is true, then:
- calculate delta = fetchTime - modifiedTime
- try to synchronize with the time of change, by shifting the next fetchTime
by a fraction of the difference between the last modification time and the last
fetch time. I.e. the next fetch time will be set to
fetchTime + fetchInterval - delta * SYNC_DELTA_RATE
- if the adjusted fetch interval is bigger than the delta, then
fetchInterval = delta.
- the minimum value of fetchInterval may not be smaller than MIN_INTERVAL
(default is 1 minute).
- the maximum value of fetchInterval may not be bigger than MAX_INTERVAL
(default is 365 days).
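The update rules above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the Nutch implementation: the class and method names are hypothetical, all times are in seconds, "by a factor of" is read as multiplying by (1 - DEC_FACTOR) or (1 + INC_FACTOR), and the SYNC_DELTA_RATE value is a placeholder because no default is given in this description.

```java
// Hypothetical stand-alone sketch of the rules described above; not Nutch code.
final class AdaptiveIntervalSketch {
    static final double DEC_FACTOR = 0.2;                 // default stated in the text
    static final double INC_FACTOR = 0.2;                 // default stated in the text
    static final double MIN_INTERVAL = 60.0;              // 1 minute, in seconds
    static final double MAX_INTERVAL = 365.0 * 24 * 3600; // 365 days, in seconds
    static final double SYNC_DELTA_RATE = 0.3;            // assumed placeholder value

    /**
     * Returns a two-element array: { new fetchInterval, next fetchTime },
     * both in seconds.
     */
    static double[] adjust(double fetchInterval, boolean changed,
                           boolean syncDelta,
                           double fetchTime, double modifiedTime) {
        // Shrink the interval for changed pages, grow it for unchanged ones.
        double interval = fetchInterval
                * (changed ? (1.0 - DEC_FACTOR) : (1.0 + INC_FACTOR));

        double shift = 0.0;
        if (syncDelta) {
            double delta = fetchTime - modifiedTime;
            // Per the text: an adjusted interval bigger than delta is set to delta.
            if (interval > delta) {
                interval = delta;
            }
            // Move the next fetch towards the observed time of change.
            shift = delta * SYNC_DELTA_RATE;
        }

        // Keep the interval inside [MIN_INTERVAL, MAX_INTERVAL].
        interval = Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, interval));

        return new double[] { interval, fetchTime + interval - shift };
    }
}
```

Calling `AdaptiveIntervalSketch.adjust(1000.0, false, true, 5000.0, 4500.0)` walks through every rule at once: the interval first grows, is then limited to delta, and the next fetch time is pulled earlier by the sync shift.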
NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize the algorithm,
so that the fetch interval either increases or decreases infinitely, with little
relevance to the page changes. Please use the main(String[]) method to
test the values before applying them in a production system.
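To get a feel for how the choice of DEC_FACTOR and INC_FACTOR affects the interval before trying values on a real crawl, a small stand-alone simulation can help. The sketch below is not this class's main method. The names are hypothetical, a single factor is used for both directions, SYNC_DELTA is ignored, and the page is assumed to change at a fixed period.

```java
// Hypothetical simulation sketch; not part of Nutch.
final class FetchIntervalSimulation {
    static final double MIN_INTERVAL = 60.0;              // 1 minute, in seconds
    static final double MAX_INTERVAL = 365.0 * 24 * 3600; // 365 days, in seconds

    /**
     * Simulates `steps` fetches of a page that changes every `changePeriod`
     * seconds and returns the fetch interval (in seconds) after the last fetch.
     */
    static double simulate(double factor, double changePeriod,
                           double startInterval, int steps) {
        double interval = startInterval;
        double now = 0.0;
        for (int i = 0; i < steps; i++) {
            double next = now + interval;
            // The page counts as changed if a multiple of changePeriod
            // falls inside the window (now, next].
            boolean changed =
                Math.floor(next / changePeriod) > Math.floor(now / changePeriod);
            interval *= changed ? (1.0 - factor) : (1.0 + factor);
            interval = Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, interval));
            now = next;
        }
        return interval;
    }

    public static void main(String[] args) {
        // Compare a moderate factor with a larger one for an hourly-changing page.
        for (double factor : new double[] { 0.2, 0.6 }) {
            System.out.println("factor=" + factor + " final interval="
                    + simulate(factor, 3600.0, 600.0, 200));
        }
    }
}
```

Printing the interval at every step instead of only at the end makes oscillation or runaway growth easier to spot.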
- Andrzej Bialecki
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
public void setConf(Configuration conf)
- Specified by:
setConf in interface
setConf in class
public CrawlDatum setFetchSchedule(Text url,
- Description copied from class:
- Sets the
fetchTime on a
successfully fetched page. NOTE: this implementation resets the
retry counter - extending classes should call super.setFetchSchedule() to
preserve this behavior.
- Specified by:
setFetchSchedule in interface
setFetchSchedule in class
url - url of the page
datum - page description to be adjusted. NOTE: this instance, passed by reference,
may be modified inside the method.
prevFetchTime - previous value of fetch time, or 0 if not available.
prevModifiedTime - previous value of modifiedTime, or 0 if not available.
fetchTime - the time at which the page was most recently re-fetched. Most FetchSchedule
implementations should update the value in the CrawlDatum to something greater than this value.
modifiedTime - last time the content was modified. This information comes from
the protocol implementations, or is set to < 0 if not available. Most FetchSchedule
implementations should update the value in the CrawlDatum to this value.
state - if FetchSchedule.STATUS_MODIFIED, then the content is considered to be "changed" before the
fetchTime; if FetchSchedule.STATUS_NOTMODIFIED, then the content is known to be unchanged.
This information may be obtained by comparing page signatures before and after fetching. If this
is set to FetchSchedule.STATUS_UNKNOWN, then it is unknown whether the page was changed; implementations
are free to follow a sensible default behavior.
- Returns:
adjusted page information, including all original information. NOTE: this may
be a different instance than the CrawlDatum passed in, but implementations should make sure that
it contains at least all information from that CrawlDatum.
public static void main(String[] args)
Copyright © 2012 The Apache Software Foundation