Uses of Package org.apache.nutch.net

Packages that use org.apache.nutch.net

org.apache.nutch.collection
    Subcollection is a subset of an index.
org.apache.nutch.hostdb
org.apache.nutch.indexer
    Index content, configure and run indexing and cleaning jobs to add, update, and delete documents from an index.
org.apache.nutch.net
    Web-related interfaces: URL filters and normalizers.
org.apache.nutch.net.urlnormalizer.ajax
org.apache.nutch.net.urlnormalizer.basic
    URL normalizer performing basic normalizations:
    - remove default ports, e.g., port 80 for http:// URLs
    - remove needless slashes and dot segments in the path component
    - remove anchors
    - use percent-encoding (only) where needed
    E.g., https://www.example.org/a/../b//./select%2Dlang.php?lang=español#anchor is normalized to https://www.example.org/b/select-lang.php?lang=espa%C3%B1ol
    Optional and configurable normalizations are:
    - convert Internationalized Domain Names (IDNs) uniquely either to the ASCII (Punycode) or Unicode representation, see property urlnormalizer.basic.host.idn
    - remove a trailing dot from host names, see property urlnormalizer.basic.host.trim-trailing-dot
org.apache.nutch.net.urlnormalizer.host
    URL normalizer renaming hosts to a canonical form listed in the configuration file.
org.apache.nutch.net.urlnormalizer.pass
    Dummy URL normalizer which does not change URLs.
org.apache.nutch.net.urlnormalizer.protocol
    URL normalizer to normalize the protocol for all URLs of a given host or domain.
org.apache.nutch.net.urlnormalizer.querystring
    URL normalizer which sorts the elements in the query part to avoid duplicates by permutations.
org.apache.nutch.net.urlnormalizer.regex
    URL normalizer with configurable rules based on regular expressions (Pattern).
org.apache.nutch.net.urlnormalizer.slash
org.apache.nutch.parse
    The Parse interface and related classes.
org.apache.nutch.urlfilter.api
    Generic URL filter library, abstracting away from regular expression implementations.
org.apache.nutch.urlfilter.automaton
    URL filter plugin based on dk.brics.automaton Finite-State Automata for Java™.
org.apache.nutch.urlfilter.domain
    URL filter plugin to include only URLs which match an element in a given list of domain suffixes, domain names, and/or host names.
org.apache.nutch.urlfilter.domaindenylist
    URL filter plugin to exclude URLs by domain suffixes, domain names, and/or host names.
org.apache.nutch.urlfilter.fast
    URL filter plugin that first does fast exact suffix matches on host/domain names before applying regular expressions to the path component of a URL.
org.apache.nutch.urlfilter.ignoreexempt
    URL filter plugin which identifies exemptions to external URLs when external URLs are set to be ignored.
org.apache.nutch.urlfilter.prefix
    URL filter plugin to include only URLs which match one of a given list of URL prefixes.
org.apache.nutch.urlfilter.regex
    URL filter plugin to include and/or exclude URLs matching Java regular expressions.
org.apache.nutch.urlfilter.suffix
    URL filter plugin to either exclude or include only URLs which match one of the given (path) suffixes.
org.apache.nutch.urlfilter.validator
    URL filter plugin that validates given URLs.

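Some of the basic normalizations listed for org.apache.nutch.net.urlnormalizer.basic can be illustrated with plain java.net.URI. This is only a rough sketch and not the Nutch implementation: the class BasicNormalizeSketch and its normalize method are hypothetical names, and percent-encoding fixes and IDN handling are deliberately left out.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class BasicNormalizeSketch {

    /** Hypothetical helper (assumption): applies a few basic normalizations. */
    static String normalize(String url) {
        try {
            URI uri = new URI(url);
            if (uri.getScheme() == null || uri.getHost() == null) {
                return url; // not an absolute URL with a host: leave unchanged
            }
            String scheme = uri.getScheme().toLowerCase();
            String host = uri.getHost().toLowerCase();

            // Remove default ports (80 for http, 443 for https).
            int port = uri.getPort();
            if (("http".equals(scheme) && port == 80)
                    || ("https".equals(scheme) && port == 443)) {
                port = -1;
            }

            // Collapse runs of slashes in the path component.
            String path = uri.getRawPath();
            if (path == null || path.isEmpty()) {
                path = "/";
            }
            path = path.replaceAll("/{2,}", "/");

            // Rebuild without the fragment (anchor).
            StringBuilder sb = new StringBuilder(scheme).append("://").append(host);
            if (port != -1) {
                sb.append(':').append(port);
            }
            sb.append(path);
            if (uri.getRawQuery() != null) {
                sb.append('?').append(uri.getRawQuery());
            }

            // URI.normalize() resolves "." and ".." path segments.
            return new URI(sb.toString()).normalize().toString();
        } catch (URISyntaxException e) {
            return url; // unparsable input is returned as-is in this sketch
        }
    }

    public static void main(String[] args) {
        // prints http://www.example.org/b/page.html?x=1
        System.out.println(
            normalize("http://www.example.org:80/a/../b//./page.html?x=1#anchor"));
    }
}
```

The slashes are collapsed before the dot segments are resolved so that an empty segment cannot shield a ".." from being removed.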
Classes in org.apache.nutch.net used by org.apache.nutch.collection
URLFilter
    Interface used to limit which URLs enter Nutch.

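To make the role of URLFilter more concrete, the sketch below shows a prefix-based filter. It is an illustration under assumptions: SimpleUrlFilter is a local stand-in rather than the real org.apache.nutch.net.URLFilter (which is a plugin extension point with additional configuration duties), and PrefixFilter is a hypothetical class. The convention assumed here is that a filter returns the URL to accept it and null to reject it.

```java
import java.util.Arrays;
import java.util.List;

public class UrlFilterSketch {

    /** Simplified stand-in (assumption) for a URL filter contract. */
    interface SimpleUrlFilter {
        /** Returns the URL if accepted, or null if rejected. */
        String filter(String urlString);
    }

    /** Hypothetical filter accepting only URLs that start with a known prefix. */
    static class PrefixFilter implements SimpleUrlFilter {
        private final List<String> prefixes;

        PrefixFilter(List<String> prefixes) {
            this.prefixes = prefixes;
        }

        @Override
        public String filter(String urlString) {
            for (String prefix : prefixes) {
                if (urlString.startsWith(prefix)) {
                    return urlString; // accepted unchanged
                }
            }
            return null; // rejected: the URL does not enter the crawl
        }
    }

    public static void main(String[] args) {
        SimpleUrlFilter filter =
            new PrefixFilter(Arrays.asList("https://www.example.org/"));
        System.out.println(filter.filter("https://www.example.org/docs/")); // accepted
        System.out.println(filter.filter("https://other.example.com/"));   // prints null
    }
}
```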
Classes in org.apache.nutch.net used by org.apache.nutch.hostdb
URLFilters
    Creates and caches plugins implementing URLFilter.
URLNormalizers
    This class uses a "chained filter" pattern to run defined normalizers.

Classes in org.apache.nutch.net used by org.apache.nutch.indexer
URLNormalizers
    This class uses a "chained filter" pattern to run defined normalizers.

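The "chained filter" pattern mentioned for URLNormalizers can be sketched as a list of normalizers applied in order, each receiving the output of the previous one. The code below is a minimal illustration, not the Nutch class: NormalizerChainSketch, runChain, and the two example normalizers are hypothetical, and stopping the chain on a null result is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

public class NormalizerChainSketch {

    /** Hypothetical normalizer: drops the fragment (anchor) part. */
    static final UnaryOperator<String> STRIP_FRAGMENT = url -> {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    };

    /** Hypothetical normalizer: drops an explicit :80 on http:// URLs. */
    static final UnaryOperator<String> DROP_DEFAULT_HTTP_PORT =
        url -> url.replaceFirst("^(http://[^/:]+):80(?=/|$)", "$1");

    /** Runs each normalizer on the result of the previous one. */
    static String runChain(List<UnaryOperator<String>> chain, String url) {
        String current = url;
        for (UnaryOperator<String> normalizer : chain) {
            if (current == null) {
                break; // a normalizer discarded the URL; stop the chain
            }
            current = normalizer.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> chain =
            Arrays.asList(STRIP_FRAGMENT, DROP_DEFAULT_HTTP_PORT);
        // prints http://example.org/index.html
        System.out.println(runChain(chain, "http://example.org:80/index.html#top"));
    }
}
```

Because each step sees the previous step's output, the order of the chain can matter when normalizers interact.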
Classes in org.apache.nutch.net used by org.apache.nutch.net
URLFilter
    Interface used to limit which URLs enter Nutch.
URLFilterException

Classes in org.apache.nutch.net used by org.apache.nutch.net.urlnormalizer.ajax
URLNormalizer
    Interface used to convert URLs to normal form and optionally perform substitutions.

Classes in org.apache.nutch.net used by org.apache.nutch.net.urlnormalizer.basic
URLNormalizer
    Interface used to convert URLs to normal form and optionally perform substitutions.

Classes in org.apache.nutch.net used by org.apache.nutch.net.urlnormalizer.host
URLNormalizer
    Interface used to convert URLs to normal form and optionally perform substitutions.

Classes in org.apache.nutch.net used by org.apache.nutch.net.urlnormalizer.pass
URLNormalizer
    Interface used to convert URLs to normal form and optionally perform substitutions.

Classes in org.apache.nutch.net used by org.apache.nutch.net.urlnormalizer.protocol
URLNormalizer
    Interface used to convert URLs to normal form and optionally perform substitutions.

Classes in org.apache.nutch.net used by org.apache.nutch.net.urlnormalizer.querystring
URLNormalizer
    Interface used to convert URLs to normal form and optionally perform substitutions.

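The idea behind the querystring normalizer, sorting query elements so that permutations of the same parameters map to one URL, can be sketched as follows. This is an illustration only and makes no claim about how the Nutch plugin is written: QuerySortSketch and sortQuery are hypothetical names, and splitting on "&" with a plain lexicographic sort is an assumption.

```java
import java.util.Arrays;

public class QuerySortSketch {

    /** Hypothetical helper (assumption): sorts the "&"-separated query elements. */
    static String sortQuery(String url) {
        // Set the fragment aside so a "?" inside it is not mistaken for a query.
        int hash = url.indexOf('#');
        String fragment = hash >= 0 ? url.substring(hash) : "";
        String withoutFragment = hash >= 0 ? url.substring(0, hash) : url;

        int question = withoutFragment.indexOf('?');
        if (question < 0) {
            return url; // no query part: nothing to sort
        }

        String[] elements = withoutFragment.substring(question + 1).split("&");
        Arrays.sort(elements); // lexicographic order of "key=value" strings

        return withoutFragment.substring(0, question + 1)
            + String.join("&", elements)
            + fragment;
    }

    public static void main(String[] args) {
        // Both calls print http://example.org/s?a=1&b=2&c=3
        System.out.println(sortQuery("http://example.org/s?b=2&a=1&c=3"));
        System.out.println(sortQuery("http://example.org/s?c=3&a=1&b=2"));
    }
}
```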
Classes in org.apache.nutch.net used by org.apache.nutch.net.urlnormalizer.regex
URLNormalizer
    Interface used to convert URLs to normal form and optionally perform substitutions.

Classes in org.apache.nutch.net used by org.apache.nutch.net.urlnormalizer.slash
URLNormalizer
    Interface used to convert URLs to normal form and optionally perform substitutions.

Classes in org.apache.nutch.net used by org.apache.nutch.parse
URLExemptionFilters
    Creates and caches URLExemptionFilter implementing plugins.
URLFilters
    Creates and caches plugins implementing URLFilter.
URLNormalizers
    This class uses a "chained filter" pattern to run defined normalizers.

Classes in org.apache.nutch.net used by org.apache.nutch.urlfilter.api
URLFilter
    Interface used to limit which URLs enter Nutch.

Classes in org.apache.nutch.net used by org.apache.nutch.urlfilter.automaton
URLFilter
    Interface used to limit which URLs enter Nutch.

Classes in org.apache.nutch.net used by org.apache.nutch.urlfilter.domain
URLFilter
    Interface used to limit which URLs enter Nutch.

Classes in org.apache.nutch.net used by org.apache.nutch.urlfilter.domaindenylist
URLFilter
    Interface used to limit which URLs enter Nutch.

Classes in org.apache.nutch.net used by org.apache.nutch.urlfilter.fast
URLFilter
    Interface used to limit which URLs enter Nutch.

Classes in org.apache.nutch.net used by org.apache.nutch.urlfilter.ignoreexempt
URLExemptionFilter
    Interface used to allow exemptions to external domain resources by overriding db.ignore.external.links.
URLFilter
    Interface used to limit which URLs enter Nutch.

Classes in org.apache.nutch.net used by org.apache.nutch.urlfilter.prefix
URLFilter
    Interface used to limit which URLs enter Nutch.

Classes in org.apache.nutch.net used by org.apache.nutch.urlfilter.regex
URLFilter
    Interface used to limit which URLs enter Nutch.

Classes in org.apache.nutch.net used by org.apache.nutch.urlfilter.suffix
URLFilter
    Interface used to limit which URLs enter Nutch.

Classes in org.apache.nutch.net used by org.apache.nutch.urlfilter.validator
URLFilter
    Interface used to limit which URLs enter Nutch.