java.lang.Object
- org.apache.hadoop.conf.Configured
  - org.apache.nutch.crawl.AbstractFetchSchedule
    - org.apache.nutch.crawl.AdaptiveFetchSchedule

All Implemented Interfaces:

Configurable, FetchSchedule

Direct Known Subclasses:

MimeAdaptiveFetchSchedule
```
public class AdaptiveFetchSchedule
extends AbstractFetchSchedule
```
This class implements an adaptive re-fetch algorithm. It works as follows:
- for pages that have changed since the last fetchTime, decrease their fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
- for pages that haven't changed since the last fetchTime, increase their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
  If SYNC_DELTA property is true, then:
  - calculate a delta = fetchTime - modifiedTime
  - try to synchronize with the time of change, by shifting the next fetchTime by a fraction of the difference between the last modification time and the last fetch time. I.e. the next fetch time will be set to fetchTime + fetchInterval - delta * SYNC_DELTA_RATE
  - if the adjusted fetch interval is bigger than the delta, then fetchInterval = delta.
- the minimum value of fetchInterval may not be smaller than MIN_INTERVAL (default is 1 minute).
- the maximum value of fetchInterval may not be bigger than MAX_INTERVAL (default is 365 days).
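The steps above can be sketched in a few lines of Java. This is a minimal illustration that follows the prose literally, not the Nutch implementation: the class name, the `Result` holder, the status code values, the `SYNC_DELTA_RATE` value, the order of the steps, and the reading of "by a factor of" as multiplication by `1 - DEC_FACTOR` or `1 + INC_FACTOR` are all assumptions made for the example. Intervals are in seconds and times in epoch milliseconds.

```java
/** Illustrative sketch of the documented steps; not Nutch's actual code. */
final class AdaptiveIntervalSketch {
    // Placeholder status codes for this sketch only.
    static final int STATUS_UNKNOWN = 0;
    static final int STATUS_MODIFIED = 1;
    static final int STATUS_NOTMODIFIED = 2;

    static final double DEC_FACTOR = 0.2;
    static final double INC_FACTOR = 0.2;
    static final double MIN_INTERVAL = 60.0;                  // 1 minute
    static final double MAX_INTERVAL = 365.0 * 24 * 3600;     // 365 days
    static final boolean SYNC_DELTA = true;                   // assumed setting
    static final double SYNC_DELTA_RATE = 0.2;                // assumed value

    /** New interval (seconds) and next fetch time (epoch millis). */
    static final class Result {
        final double intervalSec;
        final long nextFetchTimeMs;

        Result(double intervalSec, long nextFetchTimeMs) {
            this.intervalSec = intervalSec;
            this.nextFetchTimeMs = nextFetchTimeMs;
        }
    }

    static Result adjust(double intervalSec, long fetchTimeMs,
                         long modifiedTimeMs, int state) {
        double interval = intervalSec;
        double shiftSec = 0.0;
        if (state == STATUS_MODIFIED) {
            // Page changed: fetch more often.
            interval *= (1.0 - DEC_FACTOR);
        } else if (state == STATUS_NOTMODIFIED) {
            // Page unchanged: fetch less often.
            interval *= (1.0 + INC_FACTOR);
            // Sync step, nested under the "unchanged" case as in the list above.
            // Skipped when no modification time is known (an assumption).
            if (SYNC_DELTA && modifiedTimeMs > 0) {
                double delta = (fetchTimeMs - modifiedTimeMs) / 1000.0;
                if (interval > delta) {
                    interval = delta;
                }
                shiftSec = delta * SYNC_DELTA_RATE;
            }
        }
        // Keep the interval inside [MIN_INTERVAL, MAX_INTERVAL].
        interval = Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, interval));
        // next fetch time = fetchTime + fetchInterval - delta * SYNC_DELTA_RATE
        long next = fetchTimeMs + Math.round((interval - shiftSec) * 1000.0);
        return new Result(interval, next);
    }
}
```

Calling `adjust` with a one-day interval and `STATUS_MODIFIED` shrinks the interval by 20 percent; with `STATUS_NOTMODIFIED` it grows by 20 percent and the next fetch time is pulled back by a fraction of the page's age. Consult the Nutch source for the exact behavior.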
NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize the algorithm, so that the fetch interval keeps increasing or keeps decreasing, with little relevance to the page changes. Please use the main(String[]) method to test the values before applying them in a production system.
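To get a feel for how the factors behave before trying real settings, a small stand-alone simulation can help. The sketch below is not a reproduction of `main(String[])`; the class name, the `simulate` method, and the model of a page that changes at a fixed period are all hypothetical, and only the factor and clamping steps are modeled.

```java
/** Hypothetical stand-alone simulation; not the class's main(String[]) method. */
final class FactorStabilitySketch {
    static final double MIN_INTERVAL = 60.0;              // 1 minute, in seconds
    static final double MAX_INTERVAL = 365.0 * 86400.0;   // 365 days, in seconds

    /**
     * Simulates repeated fetches of a page that changes every changePeriodSec
     * seconds and returns the fetch interval after each fetch.
     */
    static double[] simulate(double decFactor, double incFactor,
                             double startIntervalSec, double changePeriodSec,
                             int fetches) {
        double interval = startIntervalSec;
        double now = 0.0;
        double lastSeenChange = 0.0;
        double[] history = new double[fetches];
        for (int i = 0; i < fetches; i++) {
            now += interval;
            // The page counts as changed if a change boundary was crossed
            // since the previous fetch.
            double latestChange = Math.floor(now / changePeriodSec) * changePeriodSec;
            boolean changed = latestChange > lastSeenChange;
            lastSeenChange = latestChange;
            interval *= changed ? (1.0 - decFactor) : (1.0 + incFactor);
            interval = Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, interval));
            history[i] = interval;
        }
        return history;
    }

    public static void main(String[] args) {
        // Compare a moderate factor with one above the 0.4 threshold.
        for (double f : new double[] {0.2, 0.6}) {
            double[] h = simulate(f, f, 30 * 86400.0, 86400.0, 50);
            System.out.printf("factor=%.1f final interval=%.0f s%n", f, h[h.length - 1]);
        }
    }
}
```

Printing the whole `history` array for several factor values shows whether the interval settles near the page's change period or swings between the bounds.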
The class also allows custom minimum and maximum re-fetch intervals to be specified per hostname, in adaptive-host-specific-intervals.txt. If they are specified, the calculated re-fetch interval for a URL matching the hostname is kept within that host-specific range instead of the default range.
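The per-host override can be pictured as a clamp whose bounds come from a host lookup, falling back to the defaults when nothing is configured for the host. In the sketch below the class, its maps, and `setHostRange` are hypothetical; it does not parse adaptive-host-specific-intervals.txt and makes no claim about that file's format. A `null` bound means "not configured", mirroring the `Float` or `null` return of `getCustomMinInterval` and `getCustomMaxInterval`.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of host-specific interval bounds. */
final class HostIntervalClampSketch {
    private final Map<String, Float> hostMin = new HashMap<>();
    private final Map<String, Float> hostMax = new HashMap<>();
    private final float defaultMin;
    private final float defaultMax;

    HostIntervalClampSketch(float defaultMin, float defaultMax) {
        this.defaultMin = defaultMin;
        this.defaultMax = defaultMax;
    }

    /** Registers bounds for a host; a null bound leaves the default in force. */
    void setHostRange(String host, Float min, Float max) {
        if (min != null) {
            hostMin.put(host, min);
        }
        if (max != null) {
            hostMax.put(host, max);
        }
    }

    /** Clamps the interval to the host's range, or to the default range. */
    float clamp(String host, float interval) {
        Float min = hostMin.get(host);
        Float max = hostMax.get(host);
        float lo = (min != null) ? min : defaultMin;
        float hi = (max != null) ? max : defaultMax;
        return Math.max(lo, Math.min(hi, interval));
    }
}
```

For example, with defaults of 60 s and 365 days, a host registered with a range of one hour to one day has a calculated interval of 120 s raised to 3600 s, while an unregistered host keeps the 60 s lower bound.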
Author:

Andrzej Bialecki

Field Summary

Fields
Modifier and Type	Field	Description
`protected float`	`DEC_RATE`
`protected float`	`INC_RATE`
- Fields inherited from class org.apache.nutch.crawl.AbstractFetchSchedule
  defaultInterval, maxInterval
- Fields inherited from interface org.apache.nutch.crawl.FetchSchedule
  SECONDS_PER_DAY, STATUS_MODIFIED, STATUS_NOTMODIFIED, STATUS_UNKNOWN

Constructor Summary

Constructors
Constructor Description

AdaptiveFetchSchedule()

Method Summary

Modifier and Type	Method	Description
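`Float`	`getCustomMaxInterval(Text url)`	Returns the custom max. refetch interval for this URL, if specified for the corresponding hostname.
`Float`	`getCustomMinInterval(Text url)`	Returns the custom min. refetch interval for this URL, if specified for the corresponding hostname.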
`static String`	`getHostName(String url)`	Strip a URL, leaving only the hostname.
`static void`	`main(String[] args)`
`void`	`setConf(Configuration conf)`
`CrawlDatum`	`setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)`	Sets the `fetchInterval` and `fetchTime` on a successfully fetched page.

Methods inherited from class org.apache.nutch.crawl.AbstractFetchSchedule
calculateLastFetchTime, forceRefetch, initializeSchedule, setPageGoneSchedule, setPageRetrySchedule, shouldFetch

Methods inherited from class org.apache.hadoop.conf.Configured
getConf

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf

- Field Detail
  - INC_RATE
```
protected float INC_RATE
```
  - DEC_RATE
```
protected float DEC_RATE
```
- Constructor Detail
  - AdaptiveFetchSchedule
```
public AdaptiveFetchSchedule()
```
- Method Detail
  - setConf
```
public void setConf(Configuration conf)
```
    Specified by:
    
    setConf in interface Configurable
    
    Overrides:
    
    setConf in class AbstractFetchSchedule
  - getHostName
```
public static String getHostName(String url)
                          throws URISyntaxException
```
    Strip a URL, leaving only the hostname.
    
    Parameters:
    
    url - the URL for which to get the hostname
    
    Returns:
    
    hostname
    
    Throws:
    
    URISyntaxException - if the given string violates RFC 2396
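As an illustration of the contract described above, the same result can be obtained with `java.net.URI`. This is a sketch under the assumption that parsing the string as a URI and taking its host component matches the intended behavior; the class and method names are hypothetical, and this is not the Nutch source.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper illustrating hostname extraction with java.net.URI. */
final class HostNameSketch {
    /**
     * Returns the host component of the given URL string, or null if the URI
     * has no host (for example an opaque URI such as a mailto: address).
     */
    static String hostOf(String url) throws URISyntaxException {
        return new URI(url).getHost();
    }
}
```

A malformed string, such as one with a space in the authority, makes the `URI` constructor throw `URISyntaxException`, so callers need to handle that case.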
  - getCustomMaxInterval
```
public Float getCustomMaxInterval(Text url)
```
    Returns the custom max. refetch interval for this URL, if specified for the corresponding hostname.
    
    Parameters:
    
    url - the URL to be scheduled
    
    Returns:
    
    the configured max. interval or null
  - getCustomMinInterval
```
public Float getCustomMinInterval(Text url)
```
    Returns the custom min. refetch interval for this URL, if specified for the corresponding hostname.
    
    Parameters:
    
    url - the URL to be scheduled
    
    Returns:
    
    the configured min. interval or null
  - setFetchSchedule
```
public CrawlDatum setFetchSchedule(Text url,
                                   CrawlDatum datum,
                                   long prevFetchTime,
                                   long prevModifiedTime,
                                   long fetchTime,
                                   long modifiedTime,
                                   int state)
```
    Description copied from class: AbstractFetchSchedule
    
    Sets the fetchInterval and fetchTime on a successfully fetched page. NOTE: this implementation resets the retry counter - extending classes should call super.setFetchSchedule() to preserve this behavior.
    
    Specified by:
    
    setFetchSchedule in interface FetchSchedule
    
    Overrides:
    
    setFetchSchedule in class AbstractFetchSchedule
    
    Parameters:
    
    url - url of the page
    
    datum - page description to be adjusted. NOTE: this instance, passed by reference, may be modified inside the method.
    
    prevFetchTime - previous value of fetch time, or 0 if not available.
    
    prevModifiedTime - previous value of modifiedTime, or 0 if not available.
    
    fetchTime - the time at which the page was most recently re-fetched. Most FetchSchedule implementations should update the value in the CrawlDatum to something greater than this value.
    
    modifiedTime - last time the content was modified. This information comes from the protocol implementations, or is set to a value < 0 if not available. Most FetchSchedule implementations should update the value in the CrawlDatum to this value.
    
    state - if FetchSchedule.STATUS_MODIFIED, then the content is considered to have "changed" before the fetchTime; if FetchSchedule.STATUS_NOTMODIFIED, then the content is known to be unchanged. This information may be obtained by comparing page signatures before and after fetching. If this is set to FetchSchedule.STATUS_UNKNOWN, then it is unknown whether the page was changed; implementations are free to follow a sensible default behavior.
    
    Returns:
    
    adjusted page information, including all original information. NOTE: this may be a different instance than the CrawlDatum passed in, but implementations should make sure that it contains at least all information from that CrawlDatum.
  - main
```
public static void main(String[] args)
                 throws Exception
```
    Throws:
    
    Exception
