Uses of Interface
org.apache.nutch.net.URLNormalizer
Packages that use URLNormalizer

Package: org.apache.nutch.net.urlnormalizer.ajax

Package: org.apache.nutch.net.urlnormalizer.basic
Description: URL normalizer performing basic normalizations:
  - remove default ports, e.g., port 80 for http:// URLs
  - remove needless slashes and dot segments in the path component
  - remove anchors
  - use percent-encoding (only) where needed
E.g., https://www.example.org/a/../b//./select%2Dlang.php?lang=español#anchor is normalized to https://www.example.org/b/select-lang.php?lang=espa%C3%B1ol
Optional and configurable normalizations are:
  - convert Internationalized Domain Names (IDNs) uniquely either to the ASCII (Punycode) or Unicode representation, see property urlnormalizer.basic.host.idn
  - remove a trailing dot from host names, see property urlnormalizer.basic.host.trim-trailing-dot

Package: org.apache.nutch.net.urlnormalizer.host
Description: URL normalizer renaming hosts to a canonical form listed in the configuration file.

Package: org.apache.nutch.net.urlnormalizer.pass
Description: URL normalizer dummy which does not change URLs.

Package: org.apache.nutch.net.urlnormalizer.protocol
Description: URL normalizer to normalize the protocol for all URLs of a given host or domain.

Package: org.apache.nutch.net.urlnormalizer.querystring
Description: URL normalizer which sorts the elements in the query part to avoid duplicates by permutations.

Package: org.apache.nutch.net.urlnormalizer.regex
Description: URL normalizer with configurable rules based on regular expressions (Pattern).

Package: org.apache.nutch.net.urlnormalizer.slash
Uses of URLNormalizer in org.apache.nutch.net.urlnormalizer.ajax
Classes in org.apache.nutch.net.urlnormalizer.ajax that implement URLNormalizer
Modifier and Type: class
Class: AjaxURLNormalizer
Description: URLNormalizer capable of dealing with AJAX URLs.
Uses of URLNormalizer in org.apache.nutch.net.urlnormalizer.basic
Classes in org.apache.nutch.net.urlnormalizer.basic that implement URLNormalizer
Modifier and Type: class
Class: BasicURLNormalizer
Description: Converts URLs to a normal form:
  - remove dot segments in path: /./ or /../
  - remove default ports, e.g.
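The basic normalizations listed for this package can be illustrated with a small standalone sketch. It is not the BasicURLNormalizer source: it uses only `java.net.URI`, the class name `BasicNormalizationSketch` is hypothetical, and it assumes an absolute http or https URL. Percent-encoding fixes and the optional IDN and trailing-dot handling are left out.

```java
import java.net.URI;

final class BasicNormalizationSketch {

    /** Drops the default port, dot segments, duplicate slashes and the anchor. */
    static String normalize(String url) {
        // URI.normalize() resolves "." and ".." segments in the path.
        URI uri = URI.create(url).normalize();
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) {
            return url; // not a hierarchical URL with a host; leave it untouched
        }

        // Drop the port when it is the default for the scheme.
        int port = uri.getPort();
        boolean defaultPort = ("http".equalsIgnoreCase(scheme) && port == 80)
                || ("https".equalsIgnoreCase(scheme) && port == 443);

        // Collapse any remaining runs of slashes in the path.
        String path = uri.getRawPath().replaceAll("/{2,}", "/");

        StringBuilder sb = new StringBuilder();
        sb.append(scheme).append("://").append(host);
        if (port != -1 && !defaultPort) {
            sb.append(':').append(port);
        }
        sb.append(path);
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        // The fragment (anchor) is deliberately not appended.
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://www.example.org:80/a/../b//./c.html?x=1#anchor"));
    }
}
```

Non-default ports are kept, so a URL such as `https://www.example.org:8443/` passes through unchanged.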
Uses of URLNormalizer in org.apache.nutch.net.urlnormalizer.host
Classes in org.apache.nutch.net.urlnormalizer.host that implement URLNormalizer
Modifier and Type: class
Class: HostURLNormalizer
Description: URL normalizer for mapping hosts to their desired form.
Uses of URLNormalizer in org.apache.nutch.net.urlnormalizer.pass
Classes in org.apache.nutch.net.urlnormalizer.pass that implement URLNormalizer
Modifier and Type: class
Class: PassURLNormalizer
Description: This URLNormalizer doesn't change URLs.
Uses of URLNormalizer in org.apache.nutch.net.urlnormalizer.protocol
Classes in org.apache.nutch.net.urlnormalizer.protocol that implement URLNormalizer
Modifier and Type: class
Class: ProtocolURLNormalizer
Description: URL normalizer to normalize the protocol for all URLs of a given host or domain, e.g.
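As a rough illustration of per-host protocol normalization, the sketch below rewrites the scheme of a URL when its host appears in a lookup table. The class name `ProtocolMappingSketch` and the in-memory table are assumptions made for the example. It does not show how ProtocolURLNormalizer loads or matches its rules.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;

final class ProtocolMappingSketch {

    // Hypothetical host-to-protocol table, hard-coded for illustration only.
    static final Map<String, String> PROTOCOL_BY_HOST =
            Map.of("www.example.org", "https");

    /** Rewrites the scheme if the host has a preferred protocol in the table. */
    static String normalize(String url) {
        URI uri = URI.create(url);
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) {
            return url; // nothing to map
        }
        String wanted = PROTOCOL_BY_HOST.get(host.toLowerCase(Locale.ROOT));
        if (wanted == null || wanted.equalsIgnoreCase(scheme)) {
            return url; // unknown host, or already on the preferred protocol
        }
        // Swap only the scheme prefix; the rest of the URL is copied verbatim.
        return wanted + url.substring(scheme.length());
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://www.example.org/index.html"));
    }
}
```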
Uses of URLNormalizer in org.apache.nutch.net.urlnormalizer.querystring
Classes in org.apache.nutch.net.urlnormalizer.querystring that implement URLNormalizer
Modifier and Type: class
Class: QuerystringURLNormalizer
Description: URL normalizer plugin for normalizing query strings by sorting query string parameters.
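Sorting the query parameters means that URLs differing only in parameter order collapse to one form. The following is a minimal sketch of that idea, assuming parameters are separated by `&` and compared as plain strings. The class name `QuerySortSketch` is hypothetical and the code is not taken from QuerystringURLNormalizer.

```java
import java.util.Arrays;

final class QuerySortSketch {

    /** Returns the URL with its query parameters in lexicographic order. */
    static String normalize(String url) {
        // Set the fragment aside so a '?' inside it is not mistaken for a query.
        int hash = url.indexOf('#');
        String fragment = hash < 0 ? "" : url.substring(hash);
        String base = hash < 0 ? url : url.substring(0, hash);

        int q = base.indexOf('?');
        if (q < 0) {
            return url; // no query part
        }
        String[] params = base.substring(q + 1).split("&");
        Arrays.sort(params);
        return base.substring(0, q + 1) + String.join("&", params) + fragment;
    }

    public static void main(String[] args) {
        // Both permutations print the same normalized URL.
        System.out.println(normalize("http://example.org/p?b=2&a=1&c=3"));
        System.out.println(normalize("http://example.org/p?c=3&a=1&b=2"));
    }
}
```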
Uses of URLNormalizer in org.apache.nutch.net.urlnormalizer.regex
Classes in org.apache.nutch.net.urlnormalizer.regex that implement URLNormalizer
Modifier and Type: class
Class: RegexURLNormalizer
Description: Allows users to do regex substitutions on all/any URLs that are encountered, which is useful for stripping session IDs from URLs.
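A regex substitution that strips a session ID might look like the sketch below. The pattern, which targets a `;jsessionid=...` path parameter, and the class name `RegexRuleSketch` are assumptions chosen for illustration. The rules that RegexURLNormalizer actually applies come from its own configuration and are not reproduced here.

```java
import java.util.regex.Pattern;

final class RegexRuleSketch {

    // Hypothetical rule: match ";jsessionid=<value>" up to the query or fragment.
    private static final Pattern SESSION_ID =
            Pattern.compile("(?i);jsessionid=[^?#]*");

    /** Removes every match of the session ID pattern from the URL. */
    static String normalize(String url) {
        return SESSION_ID.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://example.org/page;jsessionid=ABC123?x=1"));
    }
}
```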
Uses of URLNormalizer in org.apache.nutch.net.urlnormalizer.slash
Classes in org.apache.nutch.net.urlnormalizer.slash that implement URLNormalizer
Modifier and Type: class
Class: SlashURLNormalizer