Class Fetcher
- java.lang.Object
  - org.apache.hadoop.conf.Configured
    - org.apache.nutch.util.NutchTool
      - org.apache.nutch.fetcher.Fetcher

All Implemented Interfaces: Configurable, Tool

public class Fetcher extends NutchTool implements Tool
A queue-based fetcher.
This fetcher uses a well-known model of one producer (a QueueFeeder) and many consumers (FetcherThread-s).
QueueFeeder reads input fetchlists and populates a set of FetchItemQueue-s, which hold FetchItem-s that describe the items to be fetched. There are as many queues as there are unique hosts, but at any given time the total number of fetch items in all queues is less than a fixed number (currently set to a multiple of the number of threads).
As items are consumed from the queues, the QueueFeeder continues to add new input items so that their total count stays fixed (FetcherThread-s may also add new items to the queues, e.g. as a result of redirection) until all input items are exhausted, at which point the number of items in the queues begins to decrease. When this number reaches 0, the fetcher finishes.
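The one-producer, many-consumers model described above can be sketched roughly as follows. This is a simplified illustration, not the Nutch implementation: the class FeederModelSketch, the method runModel, and the DONE sentinel are hypothetical names, and a single bounded queue stands in for the set of per-host queues.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: NOT the Nutch QueueFeeder or FetcherThread classes.
class FeederModelSketch {
    // Sentinel telling a consumer that the input is exhausted (an assumption of this sketch).
    private static final String DONE = new String("<done>");

    /** Feeds inputs through a bounded queue to worker threads; returns the number of items consumed. */
    static int runModel(List<String> inputs, int capacity, int consumers) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        AtomicInteger consumed = new AtomicInteger();

        // Producer: put() blocks while the queue is full, so the number of
        // queued items never exceeds the fixed capacity.
        Thread feeder = new Thread(() -> {
            try {
                for (String item : inputs) {
                    queue.put(item);
                }
                for (int i = 0; i < consumers; i++) {
                    queue.put(DONE); // one sentinel per consumer
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Consumers: take items until the sentinel arrives.
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < consumers; i++) {
            Thread worker = new Thread(() -> {
                try {
                    while (queue.take() != DONE) {   // identity check against the sentinel
                        consumed.incrementAndGet();  // a real consumer would fetch here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers.add(worker);
            worker.start();
        }

        feeder.start();
        try {
            feeder.join();
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for threads", e);
        }
        return consumed.get();
    }
}
```

In this sketch the bounded capacity plays the role of the fixed item limit: the feeder refills the queue only as fast as the consumers drain it, and the run ends once the input is exhausted and the queue empties.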
This fetcher implementation handles per-host blocking itself, instead of delegating this work to protocol-specific plugins. Each per-host queue handles its own "politeness" settings, such as the maximum number of concurrent requests and crawl delay between consecutive requests - and also a list of requests in progress, and the time the last request was finished. As FetcherThread-s ask for new items to be fetched, queues may return eligible items or null if for "politeness" reasons this host's queue is not yet ready.
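As a rough illustration of the per-host "politeness" bookkeeping described above, the following sketch tracks requests in progress and the time the last request finished. HostQueueSketch and its methods are hypothetical and do not mirror the actual FetchItemQueue API; the caller passes the clock in explicitly to keep the behaviour easy to follow.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of one per-host queue; NOT the Nutch FetchItemQueue class.
class HostQueueSketch {
    private final Deque<String> pending = new ArrayDeque<>();
    private final int maxConcurrent;   // maximum number of concurrent requests to this host
    private final long crawlDelayMs;   // delay between consecutive requests
    private int inProgress = 0;        // requests currently being fetched
    private long nextFetchTimeMs = 0L; // earliest time the next request may start

    HostQueueSketch(int maxConcurrent, long crawlDelayMs) {
        this.maxConcurrent = maxConcurrent;
        this.crawlDelayMs = crawlDelayMs;
    }

    void add(String url) {
        pending.addLast(url);
    }

    /** Returns the next eligible URL, or null if the queue is empty or not yet ready. */
    String poll(long nowMs) {
        if (inProgress >= maxConcurrent) {
            return null; // too many requests already in flight for this host
        }
        if (nowMs < nextFetchTimeMs) {
            return null; // the crawl delay has not elapsed yet
        }
        String url = pending.pollFirst();
        if (url != null) {
            inProgress++;
        }
        return url;
    }

    /** Records that a request finished at nowMs and restarts the crawl delay. */
    void finish(long nowMs) {
        inProgress--;
        nextFetchTimeMs = nowMs + crawlDelayMs;
    }
}
```

Returning null from poll lets the calling thread move on to another host's queue instead of blocking on one that is not ready.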
If there are still unfetched items in the queues, but none of the items are ready, FetcherThread-s will spin-wait until either some items become available, or a timeout is reached (at which point the Fetcher will abort, assuming the task is hung).
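The wait-until-ready-or-timeout behaviour can be sketched as a polling loop with a deadline. SpinWaitSketch, awaitItem, and the poll interval are assumptions made for this illustration; the real Fetcher's timeout handling and abort logic may differ.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch of waiting for an eligible item; NOT taken from the Nutch source.
class SpinWaitSketch {
    /**
     * Polls the source until it yields a non-null item or the timeout elapses.
     * An empty result stands for the "assume the task is hung" case.
     */
    static Optional<String> awaitItem(Supplier<String> source, long timeoutMs, long pollIntervalMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            String item = source.get();
            if (item != null) {
                return Optional.of(item);
            }
            if (System.nanoTime() - deadline >= 0) {
                return Optional.empty(); // timed out with nothing ready
            }
            try {
                Thread.sleep(pollIntervalMs); // back off briefly before polling again
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Optional.empty();
            }
        }
    }
}
```

A caller would typically treat an empty result as the signal to abort rather than keep waiting.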
Author: Andrzej Bialecki

Nested Class Summary
Modifier and Type   Class
static class        Fetcher.FetcherRun
static class        Fetcher.InputFormat

Field Summary
Modifier and Type   Field
static String       CONTENT_REDIR
static int          PERM_REFRESH_TIME
static String       PROTOCOL_REDIR

Fields inherited from class org.apache.nutch.util.NutchTool
currentJob, currentJobNum, numJobs, results, status

Constructor Summary
Constructor
Fetcher()
Fetcher(Configuration conf)

Method Summary
Modifier and Type    Method                                         Description
void                 fetch(Path segment, int threads)
static boolean       isParsing(Configuration conf)
static boolean       isStoringContent(Configuration conf)
static void          main(String[] args)                            Run the fetcher.
int                  run(String[] args)
Map<String,Object>   run(Map<String,Object> args, String crawlId)   Runs the tool, using a map of arguments.

Methods inherited from class org.apache.nutch.util.NutchTool
getProgress, getStatus, killJob, setConf, stopJob

Methods inherited from class org.apache.hadoop.conf.Configured
getConf

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf

Field Detail

PERM_REFRESH_TIME
public static final int PERM_REFRESH_TIME
See Also: Constant Field Values

CONTENT_REDIR
public static final String CONTENT_REDIR
See Also: Constant Field Values

PROTOCOL_REDIR
public static final String PROTOCOL_REDIR
See Also: Constant Field Values

Constructor Detail

Fetcher
public Fetcher()

Fetcher
public Fetcher(Configuration conf)

Method Detail

isParsing
public static boolean isParsing(Configuration conf)

isStoringContent
public static boolean isStoringContent(Configuration conf)

fetch
public void fetch(Path segment, int threads) throws IOException, InterruptedException, ClassNotFoundException

main
public static void main(String[] args) throws Exception
Run the fetcher.
Parameters:
args - input parameters for the job
Throws:
Exception - if a fatal error arises whilst running the job

run
public Map<String,Object> run(Map<String,Object> args, String crawlId) throws Exception
Description copied from class: NutchTool
Runs the tool, using a map of arguments. May return results, or null.
Specified by:
run in class NutchTool
Parameters:
args - a Map of arguments to be run with the tool
crawlId - a crawl identifier to associate with the tool invocation
Returns:
Map results object if the tool executes successfully, otherwise null
Throws:
Exception - if there is an error during the tool execution