Package org.apache.nutch.crawl
Crawl control code and tools to run the crawler.
Interface Summary

FetchSchedule - This interface defines the contract for implementations that manipulate fetch times and re-fetch intervals.
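A FetchSchedule is normally obtained through FetchScheduleFactory, which reads the db.fetch.schedule.class property to pick the implementation. The following is a minimal sketch of consulting the schedule; the method signatures (getFetchSchedule, initializeSchedule, shouldFetch) are taken from the 1.x API and may differ between Nutch versions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.FetchSchedule;
    import org.apache.nutch.crawl.FetchScheduleFactory;
    import org.apache.nutch.util.NutchConfiguration;

    public class FetchScheduleSketch {
      public static void main(String[] args) {
        Configuration conf = NutchConfiguration.create();
        // Picks DefaultFetchSchedule unless db.fetch.schedule.class says otherwise.
        FetchSchedule schedule = FetchScheduleFactory.getFetchSchedule(conf);

        Text url = new Text("http://example.com/");
        CrawlDatum datum = new CrawlDatum();
        schedule.initializeSchedule(url, datum);

        // Ask whether the page is due for (re-)fetching right now.
        boolean due = schedule.shouldFetch(url, datum, System.currentTimeMillis());
        System.out.println(url + " due for fetch: " + due);
      }
    }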
Class Summary

AbstractFetchSchedule - This class provides common methods for implementations of FetchSchedule.
AdaptiveFetchSchedule - This class implements an adaptive re-fetch algorithm.
CrawlDatum
CrawlDatum.Comparator - A Comparator optimized for CrawlDatum.
CrawlDb - This class takes the output of the fetcher and updates the crawldb accordingly (see the crawl-cycle sketch below).
CrawlDbFilter - This class provides a way to separate the URL normalization and filtering steps from the rest of the CrawlDb manipulation code.
CrawlDbMerger - This tool merges several CrawlDbs into one, optionally filtering URLs through the current URLFilters to skip prohibited pages.
CrawlDbMerger.Merger
CrawlDbReader - Read utility for the CrawlDb.
CrawlDbReader.CrawlDatumCsvOutputFormat
CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter
CrawlDbReader.CrawlDatumJsonOutputFormat
CrawlDbReader.CrawlDatumJsonOutputFormat.LineRecordWriter
CrawlDbReader.CrawlDatumJsonOutputFormat.WritableSerializer
CrawlDbReader.CrawlDbDumpMapper
CrawlDbReader.CrawlDbStatMapper
CrawlDbReader.CrawlDbStatReducer
CrawlDbReader.CrawlDbTopNMapper
CrawlDbReader.CrawlDbTopNReducer
CrawlDbReader.JsonIndenter
CrawlDbReducer - Merge new page entries with existing entries.
DeduplicationJob - Generic deduplicator which groups fetched URLs with the same digest and marks all of them as duplicates except the one with the highest score, based on the score in the crawldb (which is not necessarily the same as the score indexed); see the sketch below.
DeduplicationJob.DBFilter
DeduplicationJob.DedupReducer<K extends Writable>
DeduplicationJob.StatusUpdateReducer - Combine multiple new entries for a URL.
DefaultFetchSchedule - This class implements the default re-fetch schedule.
FetchScheduleFactory - Creates and caches a FetchSchedule implementation.
Generator - Generates a subset of a crawl db to fetch (see the crawl-cycle sketch below).
Generator.CrawlDbUpdater - Update the CrawlDb so that the next generate won't include the same URLs.
Generator.CrawlDbUpdater.CrawlDbUpdateMapper
Generator.CrawlDbUpdater.CrawlDbUpdateReducer
Generator.DecreasingFloatComparator
Generator.HashComparator - Sort fetch lists by the hash of the URL.
Generator.PartitionReducer
Generator.Selector - Selects entries due for fetch.
Generator.SelectorEntry
Generator.SelectorInverseMapper
Generator.SelectorMapper - Select and invert the subset due for fetch.
Generator.SelectorReducer - Collect entries until the limit is reached.
Injector - Injector takes a flat text file of URLs (or a folder containing text files) and merges ("injects") these URLs into the CrawlDb (see the crawl-cycle sketch below).
Injector.InjectMapper - InjectMapper reads the CrawlDb that seeds are injected into, as well as the plain-text seed files, and parses each line into a URL and its metadata.
Injector.InjectReducer - Combine multiple new entries for a URL.
Inlink - An incoming link to a page.
Inlinks - A list of Inlink objects.
LinkDb - Maintains an inverted link map, listing incoming links for each URL.
LinkDb.LinkDbMapper
LinkDbFilter - This class provides a way to separate the URL normalization and filtering steps from the rest of the LinkDb manipulation code.
LinkDbMerger - This tool merges several LinkDbs into one, optionally filtering URLs through the current URLFilters to skip prohibited URLs and links.
LinkDbMerger.LinkDbMergeReducer
LinkDbReader - Read utility for the LinkDb.
LinkDbReader.LinkDBDumpMapper
MD5Signature - Default implementation of a page signature.
MimeAdaptiveFetchSchedule - Extension of AdaptiveFetchSchedule that allows more flexible configuration of the DEC and INC factors for various MIME types.
NutchWritable
Signature
SignatureComparator
SignatureFactory - Factory class which instantiates a Signature implementation according to the current Configuration (see the sketch below).
TextMD5Signature - Implementation of a page signature.
TextProfileSignature - An implementation of a page signature.
URLPartitioner - Partitions URLs by host, domain name, or IP, depending on the value of the parameter 'partition.url.mode', which can be 'byHost', 'byDomain', or 'byIP' (see the sketch below).
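Injector, Generator, CrawlDb, and LinkDb are Hadoop tools that are normally driven from the bin/nutch script (inject, generate, updatedb, invertlinks), but they can also be invoked programmatically through ToolRunner. Below is a minimal sketch of one crawl cycle; the paths and the -topN value are illustrative, and the argument order is assumed from the 1.x usage strings.

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.nutch.crawl.CrawlDb;
    import org.apache.nutch.crawl.Generator;
    import org.apache.nutch.crawl.Injector;
    import org.apache.nutch.crawl.LinkDb;
    import org.apache.nutch.util.NutchConfiguration;

    public class CrawlCycleSketch {
      public static void main(String[] args) throws Exception {
        // Inject seed URLs from the "urls/" directory into the CrawlDb.
        ToolRunner.run(NutchConfiguration.create(), new Injector(),
            new String[] { "crawl/crawldb", "urls" });

        // Generate a fetch list of at most 1000 top-scoring URLs due for fetching.
        ToolRunner.run(NutchConfiguration.create(), new Generator(),
            new String[] { "crawl/crawldb", "crawl/segments", "-topN", "1000" });

        // ... fetch and parse the generated segment here
        // (see org.apache.nutch.fetcher and org.apache.nutch.parse) ...

        // Update the CrawlDb with the fetch output of all segments.
        ToolRunner.run(NutchConfiguration.create(), new CrawlDb(),
            new String[] { "crawl/crawldb", "-dir", "crawl/segments" });

        // Invert the link graph: record incoming links for each URL in the LinkDb.
        ToolRunner.run(NutchConfiguration.create(), new LinkDb(),
            new String[] { "crawl/linkdb", "-dir", "crawl/segments" });
      }
    }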
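DeduplicationJob is driven the same way (bin/nutch dedup). A minimal sketch, assuming the CrawlDb path used above:

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.nutch.crawl.DeduplicationJob;
    import org.apache.nutch.util.NutchConfiguration;

    public class DedupSketch {
      public static void main(String[] args) throws Exception {
        // Group entries by digest and mark all but the highest-scoring
        // entry in each group as duplicates in the CrawlDb.
        ToolRunner.run(NutchConfiguration.create(), new DeduplicationJob(),
            new String[] { "crawl/crawldb" });
      }
    }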
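SignatureFactory selects the Signature implementation named by the db.signature.class property (MD5Signature is the default implementation, per the summary above). The following sketch computes a page digest for fetched and parsed content; the calculate(Content, Parse) signature is assumed from the 1.x API.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.crawl.Signature;
    import org.apache.nutch.crawl.SignatureFactory;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.protocol.Content;
    import org.apache.nutch.util.NutchConfiguration;
    import org.apache.nutch.util.StringUtil;

    public class SignatureSketch {
      // Returns the configured page signature as a hex string.
      static String digestOf(Content content, Parse parse) {
        Configuration conf = NutchConfiguration.create();
        Signature signature = SignatureFactory.getSignature(conf);
        byte[] digest = signature.calculate(content, parse);
        return StringUtil.toHexString(digest);
      }
    }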
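URLPartitioner is configured rather than called directly: all fetch-list entries that share a host (or domain, or IP) are routed to the same partition, and therefore to the same fetcher task. A minimal sketch of setting the mode programmatically; the property name and its values come from the class description above, while treating 'byHost' as the default is an assumption.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class PartitionModeSketch {
      public static void main(String[] args) {
        Configuration conf = NutchConfiguration.create();
        // One of 'byHost' (assumed default), 'byDomain', or 'byIP',
        // per the URLPartitioner contract.
        conf.set("partition.url.mode", "byDomain");
        System.out.println("partition.url.mode = " + conf.get("partition.url.mode"));
      }
    }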