Overview (apache-nutch 1.19 API)

Apache Nutch is a highly extensible and scalable open source web crawler software project.

Nutch is a project of the Apache Software Foundation and is part of the larger Apache community of developers and users.

Package	Description
org.apache.nutch.analysis.lang	Text document language identifier.
org.apache.nutch.any23	This package uses the Apache Any23 library for parsing and extracting structured data in RDF format from a variety of Web documents.
org.apache.nutch.collection	Subcollection is a subset of an index.
org.apache.nutch.crawl	Crawl control code and tools to run the crawler.
org.apache.nutch.exchange	Control code for the exchange component, which acts in the indexing job and decides to which index writer a document should be routed, based on plugin behavior.
org.apache.nutch.exchange.jexl	Plugin of Exchange component based on JEXL expressions.
org.apache.nutch.fetcher	The Nutch multi-threaded fetching module
org.apache.nutch.hostdb
org.apache.nutch.indexer	Index content, configure and run indexing and cleaning jobs to add, update, and delete documents from an index.
org.apache.nutch.indexer.anchor	An indexing plugin for inbound anchor text.
org.apache.nutch.indexer.basic	A basic indexing plugin, adds basic fields: url, host, title, content, etc.
org.apache.nutch.indexer.feed	Indexing filter to index meta data from RSS feeds.
org.apache.nutch.indexer.filter
org.apache.nutch.indexer.geoip	This plugin implements an indexing filter which takes advantage of the GeoIP2-java API.
org.apache.nutch.indexer.jexl	This plugin implements a dynamic indexing filter which uses JEXL expressions to allow filtering based on the page's metadata
org.apache.nutch.indexer.links
org.apache.nutch.indexer.metadata	Indexing filter to add document metadata to the index.
org.apache.nutch.indexer.more	A more indexing plugin, adds "more" index fields: last modified date, MIME type, content length.
org.apache.nutch.indexer.replace	Indexing filter to allow pattern replacements on metadata.
org.apache.nutch.indexer.staticfield	A simple plugin called at indexing that adds fields with static data.
org.apache.nutch.indexer.subcollection	Indexing filter to assign documents to subcollections.
org.apache.nutch.indexer.tld	Top Level Domain Indexing plugin.
org.apache.nutch.indexer.urlmeta	URL Meta Tag Indexing Plugin
org.apache.nutch.indexwriter.cloudsearch
org.apache.nutch.indexwriter.csv	Index writer plugin to write a plain CSV file.
org.apache.nutch.indexwriter.dummy	Index writer plugin for debugging, writes pairs of <action, url> to a text file, action is one of "add", "update", or "delete".
org.apache.nutch.indexwriter.elastic	Index writer plugin for Elasticsearch.
org.apache.nutch.indexwriter.kafka	Index writer plugin to produce JSON messages to Kafka.
org.apache.nutch.indexwriter.rabbit
org.apache.nutch.indexwriter.solr	Index writer plugin for Apache Solr.
org.apache.nutch.metadata	A Multi-valued Metadata container, and set of constant fields for Nutch Metadata.
org.apache.nutch.microformats.reltag	A microformats Rel-Tag Parser/Indexer/Querier plugin.
org.apache.nutch.net	Web-related interfaces: URL `filters` and `normalizers`.
org.apache.nutch.net.protocols	Helper classes related to the `Protocol` interface, see also `org.apache.nutch.protocol`.
org.apache.nutch.net.urlnormalizer.ajax
org.apache.nutch.net.urlnormalizer.basic	URL normalizer performing basic normalizations: remove default ports, e.g., port 80 for `http://` URLs; remove needless slashes and dot segments in the path component; remove anchors; use percent-encoding (only) where needed. E.g., `https://www.example.org/a/../b//./select%2Dlang.php?lang=español#anchor` is normalized to `https://www.example.org/b/select-lang.php?lang=espa%C3%B1ol`. Optional and configurable normalizations are: convert Internationalized Domain Names (IDNs) uniquely either to the ASCII (Punycode) or Unicode representation, see property `urlnormalizer.basic.host.idn`; remove a trailing dot from host names, see property `urlnormalizer.basic.host.trim-trailing-dot`.
org.apache.nutch.net.urlnormalizer.host	URL normalizer renaming hosts to a canonical form listed in the configuration file.
org.apache.nutch.net.urlnormalizer.pass	URL normalizer dummy which does not change URLs.
org.apache.nutch.net.urlnormalizer.protocol	URL normalizer to normalize the protocol for all URLs of a given host or domain.
org.apache.nutch.net.urlnormalizer.querystring	URL normalizer which sorts the elements in the query part to avoid duplicates by permutations.
org.apache.nutch.net.urlnormalizer.regex	URL normalizer with configurable rules based on regular expressions (`Pattern`).
org.apache.nutch.net.urlnormalizer.slash
org.apache.nutch.parse	The `Parse` interface and related classes.
org.apache.nutch.parse.ext	Parse wrapper to run external command to do the parsing.
org.apache.nutch.parse.feed	Parse RSS feeds.
org.apache.nutch.parse.headings	Parse filter to extract headings (h1, h2, etc.) from DOM parse tree.
org.apache.nutch.parse.html	An HTML document parsing plugin.
org.apache.nutch.parse.js	Parser and parse filter plugin to extract all (possible) links from JavaScript files and embedded JavaScript code snippets.
org.apache.nutch.parse.metatags	Parse filter to extract meta tags: keywords, description, etc.
org.apache.nutch.parse.tika	Parse various document formats with help of Apache Tika.
org.apache.nutch.parse.zip	Parse ZIP files: embedded files are recursively passed to appropriate parsers.
org.apache.nutch.parsefilter.debug	Adds serialized DOM to parse data, useful for debugging, to understand how the parser implementation interprets a document (not only HTML).
org.apache.nutch.parsefilter.naivebayes	HTML parse filter that classifies the outlinks from the parse result as relevant or irrelevant based on the parseText's relevancy (using a training file with positive and negative example texts, see the description of parsefilter.naivebayes.trainfile). If the text is found irrelevant, a link gets a second chance if it contains any of the words from the list given in parsefilter.naivebayes.wordlist.
org.apache.nutch.parsefilter.regex	RegexParseFilter.
org.apache.nutch.plugin	The Nutch `Plugin` System.
org.apache.nutch.protocol	Classes related to the `Protocol` interface, see also `org.apache.nutch.net.protocols`.
org.apache.nutch.protocol.file	Protocol plugin which supports retrieving local file resources.
org.apache.nutch.protocol.ftp	Protocol plugin which supports retrieving documents via the ftp protocol.
org.apache.nutch.protocol.htmlunit	Protocol plugin which supports retrieving documents via HTTP/HTTPS using Selenium and the HtmlUnitDriver web driver for the HtmlUnit headless browser.
org.apache.nutch.protocol.http	Protocol plugin which supports retrieving documents via the http protocol.
org.apache.nutch.protocol.http.api	Common API used by HTTP plugins (`http`, `httpclient`, etc.)
org.apache.nutch.protocol.httpclient	Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web servers as well as proxy servers.
org.apache.nutch.protocol.interactiveselenium	Protocol plugin which supports retrieving documents using and interacting with Selenium.
org.apache.nutch.protocol.interactiveselenium.handlers	Handler implementations to interact with Selenium for `org.apache.nutch.protocol.interactiveselenium`.
org.apache.nutch.protocol.okhttp	Protocol plugin for HTTP/HTTPS based on okhttp, supports HTTP/1.1 and/or HTTP/2.
org.apache.nutch.protocol.selenium	Protocol plugin which supports retrieving documents via Selenium.
org.apache.nutch.publisher
org.apache.nutch.publisher.rabbitmq	Publisher package to implement queues
org.apache.nutch.rabbitmq
org.apache.nutch.scoring	The `ScoringFilter` interface.
org.apache.nutch.scoring.depth	Scoring filter to stop crawling at a configurable depth (number of "hops" from seed URLs).
org.apache.nutch.scoring.link	Scoring filter used in conjunction with `WebGraph`.
org.apache.nutch.scoring.metadata	Metadata Scoring Plugin
org.apache.nutch.scoring.opic	Scoring filter implementing a variant of the Online Page Importance Computation (OPIC) algorithm.
org.apache.nutch.scoring.orphan	Scoring filter to modify score or status of orphaned pages (no inlinks found for a configurable amount of time).
org.apache.nutch.scoring.similarity
org.apache.nutch.scoring.similarity.cosine	Implements the cosine similarity metric for scoring relevant documents
org.apache.nutch.scoring.similarity.util	Utility package for Lucene functions.
org.apache.nutch.scoring.tld	Top Level Domain Scoring plugin.
org.apache.nutch.scoring.urlmeta	URL Meta Tag Scoring Plugin
org.apache.nutch.scoring.webgraph	Scoring implementation based on link analysis (`LinkRank`), see `WebGraph`.
org.apache.nutch.segment	A segment stores all data from one generate/fetch/update cycle: fetch list, protocol status, raw content, parsed content, and extracted outgoing links.
org.apache.nutch.service
org.apache.nutch.service.impl
org.apache.nutch.service.model.request
org.apache.nutch.service.model.response
org.apache.nutch.service.resources
org.apache.nutch.tools	Miscellaneous tools.
org.apache.nutch.tools.arc	Tools to read the Arc file format.
org.apache.nutch.tools.warc	Tools to import / export between Nutch segments and WARC archives.
org.apache.nutch.urlfilter.api	Generic `URL filter` library, abstracting away from regular expression implementations.
org.apache.nutch.urlfilter.automaton	URL filter plugin based on dk.brics.automaton Finite-State Automata for Java™.
org.apache.nutch.urlfilter.domain	URL filter plugin to include only URLs which match an element in a given list of domain suffixes, domain names, and/or host names.
org.apache.nutch.urlfilter.domaindenylist	URL filter plugin to exclude URLs by domain suffixes, domain names, and/or host names.
org.apache.nutch.urlfilter.fast	URL filter plugin that first does fast exact suffix matches on host/domain names before applying regular expressions to the path component of a URL.
org.apache.nutch.urlfilter.ignoreexempt	URL filter plugin which identifies exemptions to external URLs when external URLs are set to ignore.
org.apache.nutch.urlfilter.prefix	URL filter plugin to include only URLs which match one of a given list of URL prefixes.
org.apache.nutch.urlfilter.regex	URL filter plugin to include and/or exclude URLs matching Java regular expressions.
org.apache.nutch.urlfilter.suffix	URL filter plugin to either exclude or include only URLs which match one of the given (path) suffixes.
org.apache.nutch.urlfilter.validator	URL filter plugin that validates given URLs.
org.apache.nutch.util	Miscellaneous utility classes.
org.apache.nutch.util.domain	Classes for domain name analysis.
org.creativecommons.nutch	Sample plugins that parse and index Creative Commons metadata.
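
The description of org.apache.nutch.net.urlnormalizer.querystring in the table above mentions sorting the elements of the query part so that permutations of the same parameters do not produce duplicate URLs. The following is a minimal sketch of that idea only, not the plugin's source code: the class name QuerySortSketch and the method sortQuery are hypothetical, and the handling of the fragment and of empty query elements is an assumption made for this illustration.

```java
import java.util.Arrays;

// Hypothetical illustration; NOT the Nutch urlnormalizer-querystring plugin.
final class QuerySortSketch {

    /**
     * Returns the URL with the '&'-separated elements of its query part
     * sorted lexicographically. A fragment, if present, is kept unchanged.
     */
    static String sortQuery(String url) {
        // Split off the fragment first so a '?' inside it is not mistaken
        // for the start of the query part.
        int hash = url.indexOf('#');
        String fragment = hash < 0 ? "" : url.substring(hash);
        String base = hash < 0 ? url : url.substring(0, hash);

        int q = base.indexOf('?');
        if (q < 0) {
            return url; // no query part, nothing to sort
        }

        String[] elements = base.substring(q + 1).split("&");
        Arrays.sort(elements); // plain String ordering (assumption)

        return base.substring(0, q + 1) + String.join("&", elements) + fragment;
    }

    public static void main(String[] args) {
        // Both permutations map to the same string.
        System.out.println(sortQuery("https://www.example.org/search?b=2&a=1"));
        System.out.println(sortQuery("https://www.example.org/search?a=1&b=2"));
    }
}
```

With this sketch, both example URLs in main are printed as `https://www.example.org/search?a=1&b=2`, so a crawler comparing the normalized strings would treat them as one URL.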