Package org.apache.nutch.tools
Miscellaneous tools.

Interface Summary

CommonCrawlFormat
    Interface for all CommonCrawl formatters.

Class Summary

AbstractCommonCrawlFormat
    Abstract class that implements the org.apache.nutch.tools.CommonCrawlFormat interface.
CommonCrawlConfig
CommonCrawlDataDumper
    The Common Crawl Data Dumper tool reverse-generates the raw content from Nutch segment data directories into a common crawling data format that many applications consume.
CommonCrawlFormatFactory
    Factory class that creates new CommonCrawlFormat objects.
CommonCrawlFormatJackson
    Provides methods to map crawled data to JSON using the Jackson Streaming APIs.
CommonCrawlFormatJettinson
    Provides methods to map crawled data to JSON using the Jettison APIs.
CommonCrawlFormatSimple
    Provides methods to map crawled data to JSON using a StringBuilder object.
CommonCrawlFormatWARC
DmozParser
    Utility that converts DMOZ RDF into a flat file of URLs to be injected.
FileDumper
    The file dumper tool reverse-generates the raw content from Nutch segment data directories.
FreeGenerator
    Generates fetchlists (segments to be fetched) from plain text files containing one URL per line.
FreeGenerator.FG
FreeGenerator.FG.FGMapper
FreeGenerator.FG.FGReducer
ResolveUrls
    A simple tool that spins up multiple threads to resolve URLs to IP addresses.
ShowProperties
    Tool to list the properties, and their values, set by the current Nutch configuration.
WARCUtils
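
The CommonCrawlFormatSimple entry describes mapping crawled data to JSON with a StringBuilder. The following is a minimal sketch of that general technique only, not the Nutch implementation: the class name SimpleJsonSketch, its methods, and the field names "url" and "content" are hypothetical and chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration; NOT the actual CommonCrawlFormatSimple code. */
class SimpleJsonSketch {

    /** Escapes the characters that must be escaped inside a JSON string. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters become four-digit hex escapes.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Serializes a flat map of string fields as a JSON object, keeping the map's iteration order. */
    static String toJson(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
            first = false;
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        // Assumed example record; real crawl records carry more fields.
        Map<String, String> record = new LinkedHashMap<>();
        record.put("url", "http://example.com/");
        record.put("content", "He said \"hi\"");
        System.out.println(toJson(record));
    }
}
```

Hand-building JSON this way avoids a library dependency, but every value has to pass through the escaping step. A streaming library, such as the one used by CommonCrawlFormatJackson, does that escaping for the caller.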