Uses of Package
org.apache.nutch.protocol
Packages that use org.apache.nutch.protocol

Package: Description
org.apache.nutch.analysis.lang: Text document language identifier.
org.apache.nutch.crawl: Crawl control code and tools to run the crawler.
org.apache.nutch.microformats.reltag: A microformats Rel-Tag Parser/Indexer/Querier plugin.
org.apache.nutch.parse: The Parse interface and related classes.
org.apache.nutch.parse.ext: Parse wrapper to run external command to do the parsing.
org.apache.nutch.parse.feed: Parse RSS feeds.
org.apache.nutch.parse.headings: Parse filter to extract headings (h1, h2, etc.) from DOM parse tree.
org.apache.nutch.parse.html: An HTML document parsing plugin.
org.apache.nutch.parse.js: Parser and parse filter plugin to extract all (possible) links from JavaScript files and embedded JavaScript code snippets.
org.apache.nutch.parse.metatags: Parse filter to extract meta tags: keywords, description, etc.
org.apache.nutch.parse.tika: Parse various document formats with help of Apache Tika.
org.apache.nutch.parse.zip: Parse ZIP files: embedded files are recursively passed to appropriate parsers.
org.apache.nutch.parsefilter.debug: Adds serialized DOM to parse data, useful for debugging, to understand how the parser implementation interprets a document (not only HTML).
org.apache.nutch.parsefilter.naivebayes: HTML parse filter that classifies the outlinks from the parse result as relevant or irrelevant based on the relevancy of the parse text (using a training file of positive and negative example texts; see the description of parsefilter.naivebayes.trainfile). A link found irrelevant gets a second chance if it contains any of the words from the list given in parsefilter.naivebayes.wordlist.
org.apache.nutch.parsefilter.regex: RegexParseFilter.
org.apache.nutch.protocol: Classes related to the Protocol interface, see also org.apache.nutch.net.protocols.
org.apache.nutch.protocol.file: Protocol plugin which supports retrieving local file resources.
org.apache.nutch.protocol.ftp: Protocol plugin which supports retrieving documents via the ftp protocol.
org.apache.nutch.protocol.htmlunit: Protocol plugin which supports retrieving documents via HTTP/HTTPS using Selenium and the HtmlUnitDriver web driver for the HtmlUnit headless browser.
org.apache.nutch.protocol.http: Protocol plugin which supports retrieving documents via the http protocol.
org.apache.nutch.protocol.http.api: Common API used by HTTP plugins (http, httpclient, etc.)
org.apache.nutch.protocol.httpclient: Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server.
org.apache.nutch.protocol.interactiveselenium: Protocol plugin which supports retrieving documents using and interacting with Selenium.
org.apache.nutch.protocol.okhttp: Protocol plugin for HTTP/HTTPS based on okhttp, supports HTTP/1.1 and/or HTTP/2.
org.apache.nutch.protocol.selenium: Protocol plugin which supports retrieving documents via Selenium.
org.apache.nutch.scoring: The ScoringFilter interface.
org.apache.nutch.scoring.depth: Scoring filter to stop crawling at a configurable depth (number of "hops" from seed URLs).
org.apache.nutch.scoring.link: Scoring filter used in conjunction with WebGraph.
org.apache.nutch.scoring.metadata: Metadata Scoring Plugin
org.apache.nutch.scoring.opic: Scoring filter implementing a variant of the Online Page Importance Computation (OPIC) algorithm.
org.apache.nutch.scoring.similarity
org.apache.nutch.scoring.similarity.cosine: Implements the cosine similarity metric for scoring relevant documents
org.apache.nutch.scoring.urlmeta: URL Meta Tag Scoring Plugin
org.apache.nutch.segment: A segment stores all data from one generate/fetch/update cycle: fetch list, protocol status, raw content, parsed content, and extracted outgoing links.
org.apache.nutch.tools: Miscellaneous tools.
org.apache.nutch.util: Miscellaneous utility classes.
org.creativecommons.nutch: Sample plugins that parse and index Creative Commons metadata.

Classes in org.apache.nutch.protocol used by org.apache.nutch.analysis.lang
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.crawl
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.microformats.reltag
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.parse
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.parse.ext
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.parse.feed
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.parse.headings
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.parse.html
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.parse.js
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.parse.metatags
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.parse.tika
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.parse.zip
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.parsefilter.debug
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.parsefilter.naivebayes
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.parsefilter.regex
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.protocol
Class: Description
Content
Protocol: A retriever of url content.
ProtocolException
ProtocolNotFound
ProtocolOutput: Simple aggregate to pass from protocol plugins both content and protocol status.
ProtocolStatus

Classes in org.apache.nutch.protocol used by org.apache.nutch.protocol.file
Class: Description
Content
Protocol: A retriever of url content.
ProtocolException
ProtocolOutput: Simple aggregate to pass from protocol plugins both content and protocol status.

Classes in org.apache.nutch.protocol used by org.apache.nutch.protocol.ftp
Class: Description
Content
Protocol: A retriever of url content.
ProtocolException
ProtocolOutput: Simple aggregate to pass from protocol plugins both content and protocol status.
RobotRulesParser: This class uses crawler-commons for handling the parsing of robots.txt files.

Classes in org.apache.nutch.protocol used by org.apache.nutch.protocol.htmlunit
Class: Description
Protocol: A retriever of url content.
ProtocolException

Classes in org.apache.nutch.protocol used by org.apache.nutch.protocol.http
Class: Description
Protocol: A retriever of url content.
ProtocolException

Classes in org.apache.nutch.protocol used by org.apache.nutch.protocol.http.api
Class: Description
Content
Protocol: A retriever of url content.
ProtocolException
ProtocolOutput: Simple aggregate to pass from protocol plugins both content and protocol status.
RobotRulesParser: This class uses crawler-commons for handling the parsing of robots.txt files.

Classes in org.apache.nutch.protocol used by org.apache.nutch.protocol.httpclient
Class: Description
Protocol: A retriever of url content.
ProtocolException

Classes in org.apache.nutch.protocol used by org.apache.nutch.protocol.interactiveselenium
Class: Description
Protocol: A retriever of url content.
ProtocolException

Classes in org.apache.nutch.protocol used by org.apache.nutch.protocol.okhttp
Class: Description
Protocol: A retriever of url content.
ProtocolException

Classes in org.apache.nutch.protocol used by org.apache.nutch.protocol.selenium
Class: Description
Protocol: A retriever of url content.
ProtocolException

Classes in org.apache.nutch.protocol used by org.apache.nutch.scoring
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.scoring.depth
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.scoring.link
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.scoring.metadata
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.scoring.opic
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.scoring.similarity
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.scoring.similarity.cosine
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.scoring.urlmeta
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.segment
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.tools
Class: Description
Content

Classes in org.apache.nutch.protocol used by org.apache.nutch.util
Class: Description
Content
ProtocolOutput: Simple aggregate to pass from protocol plugins both content and protocol status.

Classes in org.apache.nutch.protocol used by org.creativecommons.nutch
Class: Description
Content
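
The listing describes ProtocolOutput as a "simple aggregate to pass from protocol plugins both content and protocol status". The following minimal sketch illustrates that aggregate pattern only. All types in it (SketchContent, SketchStatus, SketchOutput) and the fetch method are hypothetical stand-ins invented for illustration; they are not the real org.apache.nutch.protocol classes, and their fields and signatures are assumptions.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-ins; these are NOT the real org.apache.nutch.protocol classes.
public class ProtocolOutputSketch {

    /** Stand-in for fetched content: a URL plus the raw bytes (assumed shape). */
    static final class SketchContent {
        final String url;
        final byte[] bytes;

        SketchContent(String url, byte[] bytes) {
            this.url = url;
            this.bytes = bytes;
        }
    }

    /** Stand-in for a protocol status: a success flag plus a message (assumed shape). */
    static final class SketchStatus {
        final boolean success;
        final String message;

        SketchStatus(boolean success, String message) {
            this.success = success;
            this.message = message;
        }
    }

    /** The aggregate: pairs content and status so a single return value carries both. */
    static final class SketchOutput {
        final SketchContent content; // null when nothing was retrieved
        final SketchStatus status;   // always present

        SketchOutput(SketchContent content, SketchStatus status) {
            this.content = content;
            this.status = status;
        }
    }

    /** Stand-in retriever: fakes a fetch for "file:" URLs and rejects everything else. */
    static SketchOutput fetch(String url) {
        if (url.startsWith("file:")) {
            byte[] body = "hello".getBytes(StandardCharsets.UTF_8);
            return new SketchOutput(new SketchContent(url, body), new SketchStatus(true, "ok"));
        }
        return new SketchOutput(null, new SketchStatus(false, "unsupported scheme"));
    }

    public static void main(String[] args) {
        SketchOutput out = fetch("file:/tmp/example.txt");
        // Callers check the status first, then read the content.
        if (out.status.success) {
            System.out.println(out.content.url + " -> " + out.content.bytes.length + " bytes");
        } else {
            System.out.println("fetch failed: " + out.status.message);
        }
    }
}
```

Returning one aggregate lets a caller inspect the status before touching the content, which may be absent on failure. For the actual class members, consult the Javadoc of the real ProtocolOutput, Content, and ProtocolStatus classes.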