Apache Nutch is a highly extensible and scalable open source web crawler software project.

Nutch is a project of the Apache Software Foundation and is part of the larger Apache community of developers and users.

Package Description
org.apache.nutch.analysis.lang
Text document language identifier.
org.apache.nutch.any23
This package uses the Apache Any23 library for parsing and extracting structured data in RDF format from a variety of Web documents.
org.apache.nutch.collection
Subcollection is a subset of an index.
org.apache.nutch.crawl
Crawl control code and tools to run the crawler.
org.apache.nutch.exchange
Control code for the exchange component, which runs in the indexing job and decides, based on plugin behavior, to which index writer a document should be routed.
org.apache.nutch.exchange.jexl
Plugin of Exchange component based on JEXL expressions.
org.apache.nutch.fetcher
The Nutch multi-threaded fetching module.
org.apache.nutch.hostdb  
org.apache.nutch.indexer
Index content, configure and run indexing and cleaning jobs to add, update, and delete documents from an index.
org.apache.nutch.indexer.anchor
An indexing plugin for inbound anchor text.
org.apache.nutch.indexer.basic
A basic indexing plugin, adds basic fields: url, host, title, content, etc.
org.apache.nutch.indexer.feed
Indexing filter to index meta data from RSS feeds.
org.apache.nutch.indexer.filter  
org.apache.nutch.indexer.geoip
This plugin implements an indexing filter which takes advantage of the GeoIP2-java API.
org.apache.nutch.indexer.jexl
This plugin implements a dynamic indexing filter which uses JEXL expressions to allow filtering based on the page's metadata.
org.apache.nutch.indexer.links  
org.apache.nutch.indexer.metadata
Indexing filter to add document metadata to the index.
org.apache.nutch.indexer.more
A "more" indexing plugin, adds additional index fields: last modified date, MIME type, content length.
org.apache.nutch.indexer.replace
Indexing filter to allow pattern replacements on metadata.
org.apache.nutch.indexer.staticfield
A simple plugin called at indexing that adds fields with static data.
org.apache.nutch.indexer.subcollection
Indexing filter to assign documents to subcollections.
org.apache.nutch.indexer.tld
Top Level Domain Indexing plugin.
org.apache.nutch.indexer.urlmeta
URL Meta Tag Indexing Plugin
org.apache.nutch.indexwriter.cloudsearch  
org.apache.nutch.indexwriter.csv
Index writer plugin to write a plain CSV file.
org.apache.nutch.indexwriter.dummy
Index writer plugin for debugging, writes pairs of <action, url> to a text file, action is one of "add", "update", or "delete".
org.apache.nutch.indexwriter.elastic
Index writer plugin for Elasticsearch.
org.apache.nutch.indexwriter.kafka
Index writer plugin to produce JSON messages to Kafka.
org.apache.nutch.indexwriter.rabbit  
org.apache.nutch.indexwriter.solr
Index writer plugin for Apache Solr.
org.apache.nutch.metadata
A Multi-valued Metadata container, and set of constant fields for Nutch Metadata.
org.apache.nutch.microformats.reltag
A microformats Rel-Tag Parser/Indexer/Querier plugin.
org.apache.nutch.net
Web-related interfaces: URL filters and normalizers.
org.apache.nutch.net.protocols
Helper classes related to the Protocol interface, see also org.apache.nutch.protocol.
org.apache.nutch.net.urlnormalizer.ajax  
org.apache.nutch.net.urlnormalizer.basic
URL normalizer performing basic normalizations: remove default ports (e.g., port 80 for http:// URLs); remove needless slashes and dot segments in the path component; remove anchors; use percent-encoding (only) where needed. E.g., https://www.example.org/a/../b//./select%2Dlang.php?lang=español#anchor is normalized to https://www.example.org/b/select-lang.php?lang=espa%C3%B1ol. Optional and configurable normalizations are: convert Internationalized Domain Names (IDNs) uniquely either to the ASCII (Punycode) or Unicode representation (see property urlnormalizer.basic.host.idn); remove a trailing dot from host names (see property urlnormalizer.basic.host.trim-trailing-dot).
org.apache.nutch.net.urlnormalizer.host
URL normalizer renaming hosts to a canonical form listed in the configuration file.
org.apache.nutch.net.urlnormalizer.pass
URL normalizer dummy which does not change URLs.
org.apache.nutch.net.urlnormalizer.protocol
URL normalizer to normalize the protocol for all URLs of a given host or domain.
org.apache.nutch.net.urlnormalizer.querystring
URL normalizer which sorts the elements in the query part to avoid duplicates by permutations.
org.apache.nutch.net.urlnormalizer.regex
URL normalizer with configurable rules based on regular expressions (Pattern).
org.apache.nutch.net.urlnormalizer.slash  
org.apache.nutch.parse
The Parse interface and related classes.
org.apache.nutch.parse.ext
Parse wrapper to run external command to do the parsing.
org.apache.nutch.parse.feed
Parse RSS feeds.
org.apache.nutch.parse.headings
Parse filter to extract headings (h1, h2, etc.) from DOM parse tree.
org.apache.nutch.parse.html
An HTML document parsing plugin.
org.apache.nutch.parse.js
Parser and parse filter plugin to extract all (possible) links from JavaScript files and embedded JavaScript code snippets.
org.apache.nutch.parse.metatags
Parse filter to extract meta tags: keywords, description, etc.
org.apache.nutch.parse.tika
Parse various document formats with help of Apache Tika.
org.apache.nutch.parse.zip
Parse ZIP files: embedded files are recursively passed to appropriate parsers.
org.apache.nutch.parsefilter.debug
Adds serialized DOM to parse data, useful for debugging, to understand how the parser implementation interprets a document (not only HTML).
org.apache.nutch.parsefilter.naivebayes
HTML parse filter that classifies the outlinks from the parse result as relevant or irrelevant based on the relevancy of the parse text, using a training file with positive and negative example texts (see the description of parsefilter.naivebayes.trainfile). If the text is found irrelevant, a link still gets a second chance if it contains any of the words from the list given in parsefilter.naivebayes.wordlist.
org.apache.nutch.parsefilter.regex
RegexParseFilter.
org.apache.nutch.plugin
The Nutch Plugin System.
org.apache.nutch.protocol
Classes related to the Protocol interface, see also org.apache.nutch.net.protocols.
org.apache.nutch.protocol.file
Protocol plugin which supports retrieving local file resources.
org.apache.nutch.protocol.ftp
Protocol plugin which supports retrieving documents via the ftp protocol.
org.apache.nutch.protocol.htmlunit
Protocol plugin which supports retrieving documents via HTTP/HTTPS using Selenium and the HtmlUnitDriver web driver for the HtmlUnit headless browser.
org.apache.nutch.protocol.http
Protocol plugin which supports retrieving documents via the http protocol.
org.apache.nutch.protocol.http.api
Common API used by HTTP plugins (http, httpclient, etc.)
org.apache.nutch.protocol.httpclient
Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server.
org.apache.nutch.protocol.interactiveselenium
Protocol plugin which supports retrieving documents using and interacting with Selenium.
org.apache.nutch.protocol.interactiveselenium.handlers
Handler implementations to interact with Selenium for org.apache.nutch.protocol.interactiveselenium.
org.apache.nutch.protocol.okhttp
Protocol plugin for HTTP/HTTPS based on okhttp, supports HTTP/1.1 and/or HTTP/2.
org.apache.nutch.protocol.selenium
Protocol plugin which supports retrieving documents via Selenium.
org.apache.nutch.publisher  
org.apache.nutch.publisher.rabbitmq
Publisher package to implement queues
org.apache.nutch.rabbitmq  
org.apache.nutch.scoring
The ScoringFilter interface.
org.apache.nutch.scoring.depth
Scoring filter to stop crawling at a configurable depth (number of "hops" from seed URLs).
org.apache.nutch.scoring.link
Scoring filter used in conjunction with WebGraph.
org.apache.nutch.scoring.metadata
Metadata Scoring Plugin
org.apache.nutch.scoring.opic
Scoring filter implementing a variant of the Online Page Importance Computation (OPIC) algorithm.
org.apache.nutch.scoring.orphan
Scoring filter to modify score or status of orphaned pages (no inlinks found for a configurable amount of time).
org.apache.nutch.scoring.similarity  
org.apache.nutch.scoring.similarity.cosine
Implements the cosine similarity metric for scoring relevant documents.
org.apache.nutch.scoring.similarity.util
Utility package for Lucene functions.
org.apache.nutch.scoring.tld
Top Level Domain Scoring plugin.
org.apache.nutch.scoring.urlmeta
URL Meta Tag Scoring Plugin
org.apache.nutch.scoring.webgraph
Scoring implementation based on link analysis (LinkRank), see WebGraph.
org.apache.nutch.segment
A segment stores all data from one generate/fetch/update cycle: fetch list, protocol status, raw content, parsed content, and extracted outgoing links.
org.apache.nutch.service  
org.apache.nutch.service.impl  
org.apache.nutch.service.model.request  
org.apache.nutch.service.model.response  
org.apache.nutch.service.resources  
org.apache.nutch.tools
Miscellaneous tools.
org.apache.nutch.tools.arc
Tools to read the Arc file format.
org.apache.nutch.tools.warc
Tools to import / export between Nutch segments and WARC archives.
org.apache.nutch.urlfilter.api
Generic URL filter library, abstracting away from regular expression implementations.
org.apache.nutch.urlfilter.automaton
URL filter plugin based on dk.brics.automaton Finite-State Automata for Java.
org.apache.nutch.urlfilter.domain
URL filter plugin to include only URLs which match an element in a given list of domain suffixes, domain names, and/or host names.
org.apache.nutch.urlfilter.domaindenylist
URL filter plugin to exclude URLs by domain suffixes, domain names, and/or host names.
org.apache.nutch.urlfilter.fast
URL filter plugin that first does fast exact suffix matches on host/domain names before applying regular expressions to the path component of a URL.
org.apache.nutch.urlfilter.ignoreexempt
URL filter plugin which identifies exemptions to external URLs when external URLs are set to ignore.
org.apache.nutch.urlfilter.prefix
URL filter plugin to include only URLs which match one of a given list of URL prefixes.
org.apache.nutch.urlfilter.regex
URL filter plugin to include and/or exclude URLs matching Java regular expressions.
org.apache.nutch.urlfilter.suffix
URL filter plugin to either exclude or include only URLs which match one of the given (path) suffixes.
org.apache.nutch.urlfilter.validator
URL filter plugin that validates given URLs.
org.apache.nutch.util
Miscellaneous utility classes.
org.apache.nutch.util.domain
Classes for domain name analysis.
org.creativecommons.nutch
Sample plugins that parse and index Creative Commons metadata.
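
The descriptions of org.apache.nutch.net.urlnormalizer.basic and org.apache.nutch.net.urlnormalizer.querystring above list several normalization steps. The following is a minimal sketch of some of them (dropping default ports, collapsing needless slashes, removing dot segments and anchors, and sorting query elements) using only java.net.URI. It is not the code of the Nutch plugins: the class UrlNormalizationSketch and its methods normalizeBasic and sortQuery are hypothetical names chosen for illustration, and percent-encoding and IDN handling are left out.

```java
import java.net.URI;
import java.util.Arrays;

// Hypothetical illustration only; not part of Apache Nutch.
class UrlNormalizationSketch {

    // Applies a subset of the "basic" normalizations described above.
    static String normalizeBasic(String url) {
        URI u = URI.create(url);
        String scheme = u.getScheme().toLowerCase();
        String host = u.getHost().toLowerCase();

        // Remove default ports (80 for http, 443 for https).
        int port = u.getPort();
        if (("http".equals(scheme) && port == 80)
                || ("https".equals(scheme) && port == 443)) {
            port = -1;
        }

        // Collapse needless slashes, then remove "." and ".." segments.
        String path = u.getRawPath().replaceAll("/{2,}", "/");
        if (path.isEmpty()) {
            path = "/";
        }
        path = URI.create(path).normalize().getRawPath();

        // Rebuild the URL without the anchor (fragment).
        StringBuilder sb = new StringBuilder();
        sb.append(scheme).append("://").append(host);
        if (port != -1) {
            sb.append(':').append(port);
        }
        sb.append(path);
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        return sb.toString();
    }

    // Sorts the elements of a query string so that permutations
    // of the same elements map to one form.
    static String sortQuery(String rawQuery) {
        String[] parts = rawQuery.split("&");
        Arrays.sort(parts);
        return String.join("&", parts);
    }

    public static void main(String[] args) {
        System.out.println(normalizeBasic(
            "https://www.example.org:443/a/../b//./page.php?lang=en#anchor"));
        System.out.println(sortQuery("b=2&a=1"));
    }
}
```

The sketch works on the raw (still percent-encoded) path and query so that existing escapes are passed through unchanged. A real normalizer would also need to handle URLs that java.net.URI rejects or for which getHost() returns null.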