Package org.apache.nutch.crawl
Crawl control code and tools to run the crawler.
Interface Summary

FetchSchedule - This interface defines the contract for implementations that manipulate fetch times and re-fetch intervals.
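A FetchSchedule is normally obtained through FetchScheduleFactory, which reads the db.fetch.schedule.class property to pick the implementation. The following is a minimal sketch of consulting the schedule; the method signatures (getFetchSchedule, initializeSchedule, shouldFetch) are taken from the 1.x API and may differ between Nutch versions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.FetchSchedule;
    import org.apache.nutch.crawl.FetchScheduleFactory;
    import org.apache.nutch.util.NutchConfiguration;

    public class FetchScheduleSketch {
      public static void main(String[] args) {
        Configuration conf = NutchConfiguration.create();
        // Picks DefaultFetchSchedule unless db.fetch.schedule.class says otherwise.
        FetchSchedule schedule = FetchScheduleFactory.getFetchSchedule(conf);

        Text url = new Text("http://example.com/");
        CrawlDatum datum = new CrawlDatum();
        schedule.initializeSchedule(url, datum);

        // Ask whether the page is due for (re-)fetching right now.
        boolean due = schedule.shouldFetch(url, datum, System.currentTimeMillis());
        System.out.println(url + " due for fetch: " + due);
      }
    }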
Class Summary

AbstractFetchSchedule - This class provides common methods for implementations of FetchSchedule.
AdaptiveFetchSchedule - This class implements an adaptive re-fetch algorithm.
CrawlDatum
CrawlDatum.Comparator - A Comparator optimized for CrawlDatum.
CrawlDb - This class takes the output of the fetcher and updates the crawldb accordingly (see the crawl-cycle sketch below).
CrawlDbFilter - This class provides a way to separate the URL normalization and filtering steps from the rest of the CrawlDb manipulation code.
CrawlDbMerger - This tool merges several CrawlDbs into one, optionally filtering URLs through the current URLFilters to skip prohibited pages.
CrawlDbMerger.Merger
CrawlDbReader - Read utility for the CrawlDb.
CrawlDbReader.CrawlDatumCsvOutputFormat
CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter
CrawlDbReader.CrawlDatumJsonOutputFormat
CrawlDbReader.CrawlDatumJsonOutputFormat.LineRecordWriter
CrawlDbReader.CrawlDatumJsonOutputFormat.WritableSerializer
CrawlDbReader.CrawlDbDumpMapper
CrawlDbReader.CrawlDbStatMapper
CrawlDbReader.CrawlDbStatReducer
CrawlDbReader.CrawlDbTopNMapper
CrawlDbReader.CrawlDbTopNReducer
CrawlDbReader.JsonIndenter
CrawlDbReducer - Merge new page entries with existing entries.
DeduplicationJob - Generic deduplicator which groups fetched URLs with the same digest and marks all of them as duplicates except the one with the highest score, based on the score in the crawldb (which is not necessarily the same as the score indexed); see the sketch below.
DeduplicationJob.DBFilter
DeduplicationJob.DedupReducer<K extends Writable>
DeduplicationJob.StatusUpdateReducer - Combine multiple new entries for a URL.
DefaultFetchSchedule - This class implements the default re-fetch schedule.
FetchScheduleFactory - Creates and caches a FetchSchedule implementation.
Generator - Generates a subset of a crawl db to fetch (see the crawl-cycle sketch below).
Generator.CrawlDbUpdater - Update the CrawlDb so that the next generate won't include the same URLs.
Generator.CrawlDbUpdater.CrawlDbUpdateMapper
Generator.CrawlDbUpdater.CrawlDbUpdateReducer
Generator.DecreasingFloatComparator
Generator.HashComparator - Sort fetch lists by the hash of the URL.
Generator.PartitionReducer
Generator.Selector - Selects entries due for fetch.
Generator.SelectorEntry
Generator.SelectorInverseMapper
Generator.SelectorMapper - Select and invert the subset due for fetch.
Generator.SelectorReducer - Collect entries until the limit is reached.
Injector - Injector takes a flat text file of URLs (or a folder containing text files) and merges ("injects") these URLs into the CrawlDb (see the crawl-cycle sketch below).
Injector.InjectMapper - InjectMapper reads the CrawlDb that seeds are injected into, as well as the plain-text seed files, and parses each line into a URL and its metadata.
Injector.InjectReducer - Combine multiple new entries for a URL.
Inlink - An incoming link to a page.
Inlinks - A list of Inlink objects.
LinkDb - Maintains an inverted link map, listing incoming links for each URL.
LinkDb.LinkDbMapper
LinkDbFilter - This class provides a way to separate the URL normalization and filtering steps from the rest of the LinkDb manipulation code.
LinkDbMerger - This tool merges several LinkDbs into one, optionally filtering URLs through the current URLFilters to skip prohibited URLs and links.
LinkDbMerger.LinkDbMergeReducer
LinkDbReader - Read utility for the LinkDb.
LinkDbReader.LinkDBDumpMapper
MD5Signature - Default implementation of a page signature.
MimeAdaptiveFetchSchedule - Extension of AdaptiveFetchSchedule that allows more flexible configuration of the DEC and INC factors for various MIME types.
NutchWritable
Signature
SignatureComparator
SignatureFactory - Factory class which instantiates a Signature implementation according to the current Configuration (see the sketch below).
TextMD5Signature - Implementation of a page signature.
TextProfileSignature - An implementation of a page signature.
URLPartitioner - Partitions URLs by host, domain name, or IP, depending on the value of the parameter 'partition.url.mode', which can be 'byHost', 'byDomain', or 'byIP' (see the sketch below).
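Injector, Generator, CrawlDb, and LinkDb are Hadoop tools that are normally driven from the bin/nutch script (inject, generate, updatedb, invertlinks), but they can also be invoked programmatically through ToolRunner. Below is a minimal sketch of one crawl cycle; the paths and the -topN value are illustrative, and the argument order is assumed from the 1.x usage strings.

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.nutch.crawl.CrawlDb;
    import org.apache.nutch.crawl.Generator;
    import org.apache.nutch.crawl.Injector;
    import org.apache.nutch.crawl.LinkDb;
    import org.apache.nutch.util.NutchConfiguration;

    public class CrawlCycleSketch {
      public static void main(String[] args) throws Exception {
        // Inject seed URLs from the "urls/" directory into the CrawlDb.
        ToolRunner.run(NutchConfiguration.create(), new Injector(),
            new String[] { "crawl/crawldb", "urls" });

        // Generate a fetch list of at most 1000 top-scoring URLs due for fetching.
        ToolRunner.run(NutchConfiguration.create(), new Generator(),
            new String[] { "crawl/crawldb", "crawl/segments", "-topN", "1000" });

        // ... fetch and parse the generated segment here
        // (see org.apache.nutch.fetcher and org.apache.nutch.parse) ...

        // Update the CrawlDb with the fetch output of all segments.
        ToolRunner.run(NutchConfiguration.create(), new CrawlDb(),
            new String[] { "crawl/crawldb", "-dir", "crawl/segments" });

        // Invert the link graph: record incoming links for each URL in the LinkDb.
        ToolRunner.run(NutchConfiguration.create(), new LinkDb(),
            new String[] { "crawl/linkdb", "-dir", "crawl/segments" });
      }
    }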
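DeduplicationJob is driven the same way (bin/nutch dedup). A minimal sketch, assuming the CrawlDb path used above:

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.nutch.crawl.DeduplicationJob;
    import org.apache.nutch.util.NutchConfiguration;

    public class DedupSketch {
      public static void main(String[] args) throws Exception {
        // Group entries by digest and mark all but the highest-scoring
        // entry in each group as duplicates in the CrawlDb.
        ToolRunner.run(NutchConfiguration.create(), new DeduplicationJob(),
            new String[] { "crawl/crawldb" });
      }
    }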
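SignatureFactory selects the Signature implementation named by the db.signature.class property (MD5Signature is the default implementation, per the summary above). The following sketch computes a page digest for fetched and parsed content; the calculate(Content, Parse) signature is assumed from the 1.x API.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.crawl.Signature;
    import org.apache.nutch.crawl.SignatureFactory;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.protocol.Content;
    import org.apache.nutch.util.NutchConfiguration;
    import org.apache.nutch.util.StringUtil;

    public class SignatureSketch {
      // Returns the configured page signature as a hex string.
      static String digestOf(Content content, Parse parse) {
        Configuration conf = NutchConfiguration.create();
        Signature signature = SignatureFactory.getSignature(conf);
        byte[] digest = signature.calculate(content, parse);
        return StringUtil.toHexString(digest);
      }
    }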
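URLPartitioner is configured rather than called directly: all fetch-list entries that share a host (or domain, or IP) are routed to the same partition, and therefore to the same fetcher task. A minimal sketch of setting the mode programmatically; the property name and its values come from the class description above, while treating 'byHost' as the default is an assumption.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class PartitionModeSketch {
      public static void main(String[] args) {
        Configuration conf = NutchConfiguration.create();
        // One of 'byHost' (assumed default), 'byDomain', or 'byIP',
        // per the URLPartitioner contract.
        conf.set("partition.url.mode", "byDomain");
        System.out.println("partition.url.mode = " + conf.get("partition.url.mode"));
      }
    }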