Package org.apache.nutch.segment
A segment stores all data from one generate/fetch/update cycle:
fetch list, protocol status, raw content, parsed content, and extracted outgoing links.
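
As a rough illustration of the parts listed above, the sketch below reports which of them are missing from a segment directory on the local filesystem. The subdirectory names (crawl_generate, crawl_fetch, content, parse_text, parse_data, crawl_parse) and their mapping to the parts are assumptions based on the conventional Nutch segment layout; this page does not list them. The SegmentLayout class and its missingParts method are hypothetical helpers, not part of this package.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of org.apache.nutch.segment.
final class SegmentLayout {
    // Assumed subdirectory names; they are not stated on this page.
    static final List<String> PARTS = List.of(
        "crawl_generate", // fetch list
        "crawl_fetch",    // fetch and protocol status
        "content",        // raw content
        "parse_text",     // parsed text
        "parse_data",     // parse metadata, including outgoing links
        "crawl_parse");   // link data prepared for the update step

    // Returns the names of the assumed parts that are not present
    // as directories under segmentDir.
    static List<String> missingParts(Path segmentDir) {
        List<String> missing = new ArrayList<>();
        for (String part : PARTS) {
            if (!Files.isDirectory(segmentDir.resolve(part))) {
                missing.add(part);
            }
        }
        return missing;
    }
}
```

This only inspects a local path with java.nio; segments stored on a distributed filesystem would need that filesystem's API instead, and SegmentChecker remains the tool this package provides for validating a segment.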

Interface Summary

SegmentMergeFilter
    Interface used to filter segments during segment merge.

Class Summary

ContentAsTextInputFormat
    An input format that takes Nutch Content objects and converts them to text while converting newline endings to spaces.
SegmentChecker
    Checks whether a segment is valid, or has a certain status (generated, fetched, parsed), or can be used safely for a certain processing step (e.g., indexing).
SegmentMergeFilters
    This class wraps all SegmentMergeFilter extensions in a single object so it is easier to operate on them.
SegmentMerger
    This tool takes several segments and merges their data together.
SegmentMerger.ObjectInputFormat
    Wraps inputs in a MetaWrapper, to permit merging different types in reduce and use additional metadata.
SegmentMerger.SegmentMergerMapper
SegmentMerger.SegmentMergerReducer
    NOTE: in selecting the latest version we rely exclusively on the segment name (not all segment data contain time information).
SegmentMerger.SegmentOutputFormat
SegmentPart
    Utility class for handling information about segment parts.
SegmentReader
    Dump the content of a segment.
SegmentReader.InputCompatMapper
SegmentReader.InputCompatReducer
SegmentReader.SegmentReaderStats
SegmentReader.TextOutputFormat
    Implements a text output format.
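
The note on SegmentMerger.SegmentMergerReducer says the latest version is chosen by segment name alone. A minimal sketch of that idea follows, assuming segment names are fixed-width timestamps (for example yyyyMMddHHmmss), so that plain string order matches chronological order. The class LatestSegmentByName and its latest method are hypothetical and are not the reducer's actual code.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical illustration; not the SegmentMergerReducer implementation.
final class LatestSegmentByName {
    // Picks the lexicographically greatest name. This equals the newest
    // segment only under the assumption that names are fixed-width
    // timestamps; names of differing lengths would break the ordering.
    static String latest(List<String> segmentNames) {
        return Collections.max(segmentNames);
    }

    public static void main(String[] args) {
        List<String> names =
            List.of("20240101000000", "20240315120000", "20231231235959");
        System.out.println(latest(names));
    }
}
```

Relying on the name avoids needing a timestamp inside each record, which matters because, as the note says, not all segment data contain time information.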