Package org.apache.nutch.net.urlnormalizer.protocol

URL normalizer that normalizes the protocol for all URLs of a given host or domain. For example, it normalizes http://nutch.apache.org/path/ to https://nutch.apache.org/path/ if it is known that the host nutch.apache.org supports https and that http URLs either cause duplicate content or are redirected to https. Only the protocol is changed; host and path are kept as they are. Rules are configured using the following schema:
 <host> \t <protocol>
 
for example
 nutch.apache.org \t https
 *.example.com \t http
 
These rules normalize all URLs of the host nutch.apache.org to use https, while every URL from example.com and its subdomains is normalized to use http. A host pattern that starts with *. matches all hosts (subdomains) of the given domain; more generally, it matches domain suffixes separated by a dot.

Rules are usually configured via the configuration file "protocols.txt". The filename is specified by the property urlnormalizer.protocols.file. Alternatively, if the property urlnormalizer.protocols.rules defines a non-empty string, these rules take precedence over those specified in the rule file.
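
The rule semantics described here can be illustrated with a minimal, self-contained sketch. This is not the plugin's actual implementation: the class ProtocolRuleSketch and its methods are hypothetical names, and the choice to check exact host rules before wildcard rules is an assumption made for the example.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of host-to-protocol rules; not the Nutch plugin code. */
final class ProtocolRuleSketch {
  // Rules for exact host names, e.g. "nutch.apache.org" -> "https"
  private final Map<String, String> hostRules = new HashMap<>();
  // Rules for "*." patterns, keyed by the domain suffix, e.g. "example.com" -> "http"
  private final Map<String, String> domainRules = new HashMap<>();

  /** Parses rules given as lines of the form "<host> TAB <protocol>". */
  ProtocolRuleSketch(String rules) {
    for (String line : rules.split("\n")) {
      String[] parts = line.split("\t");
      if (parts.length != 2) {
        continue; // skip blank or malformed lines
      }
      String host = parts[0].trim().toLowerCase(Locale.ROOT);
      String protocol = parts[1].trim().toLowerCase(Locale.ROOT);
      if (host.isEmpty() || protocol.isEmpty()) {
        continue;
      }
      if (host.startsWith("*.")) {
        domainRules.put(host.substring(2), protocol);
      } else {
        hostRules.put(host, protocol);
      }
    }
  }

  /** Returns the configured protocol for a host, or null if no rule matches. */
  String protocolFor(String host) {
    String protocol = hostRules.get(host);
    if (protocol != null) {
      return protocol; // assumption: exact host rules are checked first
    }
    // Try the host itself, then each shorter dot-separated suffix.
    String suffix = host;
    while (true) {
      protocol = domainRules.get(suffix);
      if (protocol != null) {
        return protocol;
      }
      int dot = suffix.indexOf('.');
      if (dot < 0) {
        return null;
      }
      suffix = suffix.substring(dot + 1);
    }
  }

  /** Replaces only the scheme of the URL when a rule matches its host. */
  String normalize(String url) {
    URI uri = URI.create(url);
    if (uri.getScheme() == null || uri.getHost() == null) {
      return url; // relative or host-less URLs are left unchanged
    }
    String protocol = protocolFor(uri.getHost().toLowerCase(Locale.ROOT));
    if (protocol == null || protocol.equalsIgnoreCase(uri.getScheme())) {
      return url;
    }
    // Keep everything after the scheme (host, port, path, query, fragment).
    return protocol + url.substring(url.indexOf(':'));
  }
}
```

With the two example rules shown earlier, this sketch rewrites http://nutch.apache.org/path/ to use https, rewrites https URLs on example.com and its subdomains to use http, and leaves URLs of hosts without a matching rule unchanged.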