Package org.apache.nutch.net.urlnormalizer.protocol
URL normalizer to normalize the protocol for all URLs of a given host or
domain.
E.g., normalize
http://nutch.apache.org/path/
to
https://nutch.apache.org/path/
if it's known that the host
nutch.apache.org
supports https and its http URLs either cause
duplicate content or are redirected to https.
The configuration of rules follows the schema:
<host> \t <protocol>
for example
nutch.apache.org \t https
*.example.com \t http
These rules will normalize all URLs of the host
nutch.apache.org
to use https while every URL from example.com
and its subdomains
is normalized to be based on http.
A "host" pattern which starts with *.
will match all hosts
(subdomains) of the given domain, or more generally matches domain suffixes
separated by a dot.
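The matching described above can be illustrated with a minimal Java sketch. It is not the ProtocolURLNormalizer implementation: the class name ProtocolRuleSketch and its methods are hypothetical, fields are assumed to be separated by whitespace, and the sketch only swaps the scheme, so URLs with an explicit port are not handled.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative sketch only (hypothetical class, not part of Nutch). */
final class ProtocolRuleSketch {
  // Exact host rules, e.g. "nutch.apache.org" -> "https"
  private final Map<String, String> hostRules = new HashMap<>();
  // Domain-suffix rules taken from "*." patterns, e.g. "example.com" -> "http"
  private final Map<String, String> suffixRules = new HashMap<>();

  /** Parses rule lines of the form "<host> \t <protocol>". */
  ProtocolRuleSketch(String rules) {
    for (String line : rules.split("\n")) {
      String[] fields = line.trim().split("\\s+");
      if (fields.length != 2) {
        continue; // skip blank or malformed lines
      }
      String host = fields[0].toLowerCase(Locale.ROOT);
      String protocol = fields[1].toLowerCase(Locale.ROOT);
      if (host.startsWith("*.")) {
        suffixRules.put(host.substring(2), protocol);
      } else {
        hostRules.put(host, protocol);
      }
    }
  }

  /** Returns the protocol configured for the host, or null if no rule matches. */
  String protocolFor(String host) {
    String suffix = host.toLowerCase(Locale.ROOT);
    String protocol = hostRules.get(suffix);
    if (protocol != null) {
      return protocol;
    }
    // Try the host itself, then each shorter dot-separated suffix.
    while (true) {
      protocol = suffixRules.get(suffix);
      if (protocol != null) {
        return protocol;
      }
      int dot = suffix.indexOf('.');
      if (dot < 0) {
        return null;
      }
      suffix = suffix.substring(dot + 1);
    }
  }

  /** Replaces the scheme of the URL if a rule matches its host. */
  String normalize(String url) {
    URI uri;
    try {
      uri = new URI(url);
    } catch (URISyntaxException e) {
      return url; // leave unparsable URLs untouched
    }
    String scheme = uri.getScheme();
    String host = uri.getHost();
    if (scheme == null || host == null) {
      return url;
    }
    String protocol = protocolFor(host);
    if (protocol == null || protocol.equalsIgnoreCase(scheme)) {
      return url;
    }
    return protocol + url.substring(scheme.length());
  }

  public static void main(String[] args) {
    ProtocolRuleSketch sketch =
        new ProtocolRuleSketch("nutch.apache.org\thttps\n*.example.com\thttp");
    System.out.println(sketch.normalize("http://nutch.apache.org/path/"));
    System.out.println(sketch.normalize("https://www.example.com/path/"));
    System.out.println(sketch.normalize("http://notexample.com/path/"));
  }
}
```

With the two example rules, the first URL is rewritten to https, the second to http, and the third is left unchanged because notexample.com is not a dot-separated suffix match for example.com.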
Rules are usually configured via the configuration file "protocols.txt". The
filename is specified by the property
urlnormalizer.protocols.file
. Alternatively, if the property
urlnormalizer.protocols.rules
defines a non-empty string, these
rules take precedence over those specified in the rule file.
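The precedence between the two properties can be sketched as follows. This is a simplified illustration, not the plugin's code: RuleSourceSketch, resolveRules and the readFile callback are hypothetical, a plain Map stands in for the Nutch configuration, and treating a whitespace-only value as empty is an assumption.

```java
import java.util.Map;
import java.util.function.Function;

/** Illustrative sketch only (hypothetical class, not part of Nutch). */
final class RuleSourceSketch {
  /**
   * Returns the rule text to use: the inline rules if the property holds a
   * non-empty string, otherwise the content of the configured rule file.
   */
  static String resolveRules(Map<String, String> props,
      Function<String, String> readFile) {
    String inline = props.get("urlnormalizer.protocols.rules");
    // Assumption: a whitespace-only value counts as empty.
    if (inline != null && !inline.trim().isEmpty()) {
      return inline;
    }
    String file = props.get("urlnormalizer.protocols.file");
    return file == null ? "" : readFile.apply(file);
  }
}
```

The file is only read when no inline rules are defined, which mirrors the precedence described above.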
Class Summary
Class: ProtocolURLNormalizer
Description: URL normalizer to normalize the protocol for all URLs of a given host or domain, e.g.