Package org.apache.nutch.net.urlnormalizer.basic

URL normalizer performing basic normalizations:
  • remove default ports, e.g., port 80 for http:// URLs
  • remove needless slashes and dot segments in the path component
  • remove anchors
  • use percent-encoding (only) where needed
E.g., https://www.example.org/a/../b//./select%2Dlang.php?lang=español#anchor is normalized to https://www.example.org/b/select-lang.php?lang=espa%C3%B1ol

Optional and configurable normalizations are:
  • convert Internationalized Domain Names (IDNs) uniquely either to the ASCII (Punycode) or Unicode representation, see property urlnormalizer.basic.host.idn
  • remove a trailing dot from host names, see property urlnormalizer.basic.host.trim-trailing-dot
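
The normalizations listed above can be illustrated with a minimal, standalone sketch. It is not the code of this package: the class BasicNormalizationSketch and all of its method names are hypothetical, it relies only on java.net.URI and java.net.IDN from the JDK, and it assumes that only http and https have default ports and that host names are lowercased (neither is stated above).

```java
import java.net.IDN;
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;

// Hypothetical illustration only; not the normalizer shipped in this package.
final class BasicNormalizationSketch {

    /** Applies the basic normalizations to an absolute http(s) URL. */
    static String normalize(String url) {
        URI u = URI.create(url); // throws IllegalArgumentException if malformed
        if (u.getScheme() == null || u.getHost() == null) {
            throw new IllegalArgumentException("absolute URL with host expected");
        }
        String scheme = u.getScheme().toLowerCase(Locale.ROOT);
        // Assumption: host names are lowercased.
        String host = u.getHost().toLowerCase(Locale.ROOT);
        StringBuilder sb = new StringBuilder(scheme).append("://").append(host);
        int port = u.getPort();
        // Assumption: 80 and 443 are the only default ports handled here.
        boolean isDefault = (scheme.equals("http") && port == 80)
                || (scheme.equals("https") && port == 443);
        if (port != -1 && !isDefault) {
            sb.append(':').append(port);
        }
        sb.append(fixEscapes(normalizePath(u.getRawPath())));
        if (u.getRawQuery() != null) {
            sb.append('?').append(fixEscapes(u.getRawQuery()));
        }
        return sb.toString(); // the fragment (anchor) is intentionally dropped
    }

    /** Removes empty segments ("//") and dot segments ("." and ".."). */
    static String normalizePath(String rawPath) {
        Deque<String> kept = new ArrayDeque<>();
        for (String seg : rawPath.split("/")) {
            if (seg.isEmpty() || seg.equals(".")) continue;
            if (seg.equals("..")) { kept.pollLast(); continue; }
            kept.addLast(seg);
        }
        StringBuilder sb = new StringBuilder();
        for (String seg : kept) sb.append('/').append(seg);
        boolean trailing = rawPath.endsWith("/") || rawPath.endsWith("/.")
                || rawPath.endsWith("/..");
        if (sb.length() == 0 || trailing) sb.append('/');
        return sb.toString();
    }

    /** Decodes escapes of unreserved characters; encodes non-ASCII bytes as UTF-8. */
    static String fixEscapes(String s) {
        byte[] bytes = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            int b = bytes[i] & 0xFF;
            if (b == '%' && i + 2 < bytes.length) {
                int hi = Character.digit(bytes[i + 1], 16);
                int lo = Character.digit(bytes[i + 2], 16);
                if (hi >= 0 && lo >= 0) {
                    int v = hi * 16 + lo;
                    if (isUnreserved(v)) sb.append((char) v);      // e.g. %2D becomes '-'
                    else sb.append(String.format("%%%02X", v));    // keep, uppercase hex
                    i += 2;
                    continue;
                }
            }
            if (b > 0x7E || b <= 0x20) sb.append(String.format("%%%02X", b));
            else sb.append((char) b);
        }
        return sb.toString();
    }

    static boolean isUnreserved(int v) {
        return (v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z')
                || (v >= '0' && v <= '9') || v == '-' || v == '.' || v == '_' || v == '~';
    }

    /** Optional step: IDN to its ASCII (Punycode) form; IDN.toUnicode is the inverse. */
    static String hostToAscii(String host) {
        return IDN.toASCII(host);
    }

    /** Optional step: strips a single trailing dot from a host name. */
    static String trimTrailingDot(String host) {
        return host.endsWith(".") ? host.substring(0, host.length() - 1) : host;
    }
}
```

Under these assumptions, passing the example URL above to normalize yields the normalized form given there. The two optional host steps are kept as separate helpers because the text describes them as configurable; how the properties urlnormalizer.basic.host.idn and urlnormalizer.basic.host.trim-trailing-dot select them is not modelled in this sketch.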