Package org.apache.nutch.indexer
Index content, configure and run indexing and cleaning jobs to
add, update, and delete documents from an index. Two tasks are
delegated to plugins:
- indexing filters, which fill index fields of each document
- index writer plugins, which send documents to index back-ends (Solr, etc.).
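The division of labour described above can be illustrated with a minimal, self-contained sketch. It does not use the Nutch API: `Doc`, `Filter`, `Writer`, and `index` are hypothetical stand-ins invented for this example, and the rule that a filter may return null to drop a document is an assumption of the sketch. It only shows the shape of the flow, in which filters fill fields first and writers then send the finished document to a back-end.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IndexingPipelineSketch {

    /** Hypothetical stand-in for a document: field name -> list of values. */
    static class Doc {
        final Map<String, List<String>> fields = new LinkedHashMap<>();

        void add(String name, String value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
    }

    /** Hypothetical filter role: fills index fields of a document. */
    interface Filter {
        Doc filter(Doc doc, String url, String parsedText);
    }

    /** Hypothetical writer role: sends a finished document to a back-end. */
    interface Writer {
        void write(Doc doc);
    }

    /** Runs every filter in order, then hands the document to every writer. */
    static Doc index(String url, String parsedText,
                     List<Filter> filters, List<Writer> writers) {
        Doc doc = new Doc();
        for (Filter f : filters) {
            doc = f.filter(doc, url, parsedText);
            if (doc == null) {
                return null; // assumption of this sketch: null means "skip this document"
            }
        }
        for (Writer w : writers) {
            w.write(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        List<Filter> filters = List.of(
            (doc, url, text) -> { doc.add("url", url); return doc; },
            (doc, url, text) -> { doc.add("content", text); return doc; });

        List<Doc> backend = new ArrayList<>(); // in-memory stand-in for Solr etc.
        List<Writer> writers = List.of(backend::add);

        index("http://example.org/", "hello world", filters, writers);
        System.out.println(backend.get(0).fields);
    }
}
```

Keeping the two roles separate means a field-producing filter never needs to know which back-end is in use, and a writer never needs to know how fields were derived.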
Interface Summary
- IndexingFilter: Extension point for indexing.
- IndexWriter
Class Summary
- CleaningJob: Scans CrawlDB looking for entries with status DB_GONE (404) or DB_DUPLICATE and sends delete requests to indexers for those documents.
- CleaningJob.DBFilter
- CleaningJob.DeleterReducer
- IndexerMapReduce: Typically invoked from within IndexingJob; handles all MapReduce functionality required when undertaking indexing.
- IndexerMapReduce.IndexerMapper
- IndexerMapReduce.IndexerReducer
- IndexerOutputFormat
- IndexingFilters: Creates and caches IndexingFilter implementing plugins.
- IndexingFiltersChecker: Reads and parses a URL and runs the indexers on it.
- IndexingJob: Generic indexer which relies on the plugins implementing IndexWriter.
- IndexWriterConfig
- IndexWriterParams
- IndexWriters: Creates and caches IndexWriter implementing plugins.
- NutchDocument: A NutchDocument is the unit of indexing.
- NutchField: Represents a multi-valued field with a weight.
- NutchIndexAction: A NutchIndexAction is the new unit of indexing, holding the document and action information.
Exception Summary
- IndexingException
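
The selection rule in the CleaningJob description (entries with status DB_GONE or DB_DUPLICATE lead to delete requests) can be sketched as follows. This is an illustration only, not the actual job: the `Status` enum, the in-memory map standing in for CrawlDB, and the `urlsToDelete` helper are all hypothetical, and the real class runs as a MapReduce job rather than a simple loop.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CleaningSketch {

    /** Hypothetical subset of statuses, named after those in the description. */
    enum Status { DB_FETCHED, DB_GONE, DB_DUPLICATE }

    /** Returns the URLs whose documents should be deleted from the index. */
    static List<String> urlsToDelete(Map<String, Status> crawlDb) {
        List<String> deletes = new ArrayList<>();
        for (Map.Entry<String, Status> e : crawlDb.entrySet()) {
            Status s = e.getValue();
            if (s == Status.DB_GONE || s == Status.DB_DUPLICATE) {
                deletes.add(e.getKey());
            }
        }
        return deletes;
    }

    public static void main(String[] args) {
        // In-memory stand-in for CrawlDB (assumption of this sketch).
        Map<String, Status> crawlDb = new LinkedHashMap<>();
        crawlDb.put("http://example.org/a", Status.DB_FETCHED);
        crawlDb.put("http://example.org/b", Status.DB_GONE);
        crawlDb.put("http://example.org/c", Status.DB_DUPLICATE);

        for (String url : urlsToDelete(crawlDb)) {
            System.out.println("delete " + url);
        }
    }
}
```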