- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, FetchSchedule
public class AdaptiveFetchSchedule
- extends AbstractFetchSchedule
This class implements an adaptive re-fetch algorithm. It works as follows:
- for pages that have changed since the last fetchTime, decrease their
fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
- for pages that haven't changed since the last fetchTime, increase their
fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
If the SYNC_DELTA property is true, then:
  - calculate delta = fetchTime - modifiedTime
  - try to synchronize with the time of change by shifting the next fetchTime
    by a fraction of the difference between the last modification time and the last
    fetch time, i.e. the next fetch time will be set to
    fetchTime + fetchInterval - delta * SYNC_DELTA_RATE
  - if the adjusted fetch interval is bigger than the delta, then
    fetchInterval = delta.
- fetchInterval may not be smaller than MIN_INTERVAL (default is 1 minute).
- fetchInterval may not be bigger than MAX_INTERVAL (default is 365 days).
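The rules above can be sketched in a few lines of Java. This is a minimal illustration that follows the description literally, not the Nutch source: the class name AdaptiveIntervalSketch, the Change enum, the Result holder, and the choice to express every time in seconds are assumptions made for the example. SYNC_DELTA_RATE is passed in as a parameter because no default is stated here, and skipping synchronization when modifiedTime is negative is likewise an assumption.

```java
/** Hypothetical sketch of the adaptive interval rules; not the Nutch implementation. */
final class AdaptiveIntervalSketch {
    /** Assumed stand-in for the "changed" / "not changed" page states. */
    enum Change { MODIFIED, NOT_MODIFIED }

    static final double DEC_FACTOR = 0.2;                  // default stated in the description
    static final double INC_FACTOR = 0.2;                  // default stated in the description
    static final long MIN_INTERVAL = 60L;                  // 1 minute, in seconds
    static final long MAX_INTERVAL = 365L * 24L * 3600L;   // 365 days, in seconds

    /** Holds the adjusted interval and the resulting next fetch time (both in seconds). */
    static final class Result {
        final long fetchInterval;
        final long nextFetchTime;

        Result(long fetchInterval, long nextFetchTime) {
            this.fetchInterval = fetchInterval;
            this.nextFetchTime = nextFetchTime;
        }
    }

    static Result adjust(long fetchInterval, long fetchTime, long modifiedTime,
                         Change change, boolean syncDelta, double syncDeltaRate) {
        // Shrink the interval for changed pages, grow it for unchanged ones.
        double factor = (change == Change.MODIFIED) ? 1.0 - DEC_FACTOR : 1.0 + INC_FACTOR;
        long interval = Math.round(fetchInterval * factor);

        long shift = 0L;
        if (syncDelta && modifiedTime >= 0) {   // assumption: skip sync when modifiedTime is unknown
            long delta = fetchTime - modifiedTime;
            shift = Math.round(delta * syncDeltaRate);
            // As described: an adjusted interval bigger than delta is set to delta.
            if (interval > delta) {
                interval = delta;
            }
        }

        // Keep the interval inside [MIN_INTERVAL, MAX_INTERVAL].
        interval = Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, interval));
        return new Result(interval, fetchTime + interval - shift);
    }
}
```

For instance, under these assumptions a page with a 1000-second interval that was found modified gets an 800-second interval, and one found unmodified gets 1200 seconds, before any synchronization or clamping is applied.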
NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize the algorithm,
so that the fetch interval either increases or decreases infinitely, with little
relevance to the page changes. Please use the main method of this class to
test the values before applying them in a production system.
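One way to get a feel for candidate factor values before a production run is a small offline simulation. The sketch below is not this class's main method, whose behavior is not shown here. It is a hypothetical stand-alone helper (FactorSimulationSketch and simulate are made-up names) that assumes a page changing at a fixed period, measures all times in seconds, and ignores SYNC_DELTA.

```java
/** Hypothetical simulation helper; not part of Nutch. */
final class FactorSimulationSketch {
    static final long MIN_INTERVAL = 60L;                  // 1 minute, in seconds
    static final long MAX_INTERVAL = 365L * 24L * 3600L;   // 365 days, in seconds

    /**
     * Simulates re-fetching a page that changes every changePeriod seconds
     * and returns the fetch interval chosen after each fetch.
     */
    static long[] simulate(double incFactor, double decFactor,
                           long startInterval, long changePeriod, int fetches) {
        long[] history = new long[fetches];
        long interval = startInterval;
        long now = 0L;
        long lastFetch = 0L;
        for (int i = 0; i < fetches; i++) {
            now += interval;
            // The page counts as changed if a change instant fell in (lastFetch, now].
            boolean changed = (now / changePeriod) > (lastFetch / changePeriod);
            double factor = changed ? 1.0 - decFactor : 1.0 + incFactor;
            interval = Math.round(interval * factor);
            interval = Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, interval));
            history[i] = interval;
            lastFetch = now;
        }
        return history;
    }
}
```

Printing the returned history for a few factor pairs shows whether the interval settles near the change period or keeps running toward MIN_INTERVAL or MAX_INTERVAL, which is the kind of behavior the note above warns about.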
- Andrzej Bialecki
public void setConf(org.apache.hadoop.conf.Configuration conf)
- Specified by:
setConf in interface
setConf in class
public void setFetchSchedule(String url,
- Description copied from class:
- Sets the fetchTime on a successfully fetched page. NOTE: this implementation
resets the retry counter; extending classes should call super.setFetchSchedule()
to preserve this behavior.
- Specified by:
setFetchSchedule in interface
setFetchSchedule in class
url - url of the page
prevFetchTime - previous value of fetch time, or -1 if not available
prevModifiedTime - previous value of modifiedTime, or -1 if not available
fetchTime - the time at which the page was most recently re-fetched. Most FetchSchedule
implementations should update the stored fetch time to something greater than this value.
modifiedTime - last time the content was modified. This information comes from
the protocol implementations, or is set to < 0 if not available. Most FetchSchedule
implementations should update the stored modified time to this value.
state - if FetchSchedule.STATUS_MODIFIED, then the content is considered to be "changed" before the
fetchTime; if FetchSchedule.STATUS_NOTMODIFIED, then the content is known to be unchanged.
This information may be obtained by comparing page signatures before and after fetching. If this
is set to FetchSchedule.STATUS_UNKNOWN, then it is unknown whether the page was changed; implementations
are free to follow a sensible default behavior.
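As an illustration of deriving the state by comparing page signatures, here is a minimal sketch. It uses a local State enum as a stand-in for the FetchSchedule.STATUS_* constants (whose numeric values are not given here), and the name ChangeStateSketch, the byte[] signature type, and the treatment of a missing signature as "unknown" are all assumptions for the example.

```java
/** Hypothetical helper that maps a signature comparison to a change state. */
final class ChangeStateSketch {
    /** Stand-in for STATUS_UNKNOWN, STATUS_MODIFIED and STATUS_NOTMODIFIED. */
    enum State { UNKNOWN, MODIFIED, NOT_MODIFIED }

    static State fromSignatures(byte[] previous, byte[] current) {
        // Without both signatures there is nothing to compare (assumed behavior).
        if (previous == null || current == null) {
            return State.UNKNOWN;
        }
        // Identical signatures mean the content is known to be unchanged.
        return java.util.Arrays.equals(previous, current)
                ? State.NOT_MODIFIED
                : State.MODIFIED;
    }
}
```

A caller would then translate the resulting State into the matching FetchSchedule constant before passing it as the state argument.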
Copyright © 2013 The Apache Software Foundation