Package org.apache.nutch.net.urlnormalizer.basic
URL normalizer performing basic normalizations:
- remove default ports, e.g., port 80 for http:// URLs
- remove needless slashes and dot segments in the path component
- remove anchors
- use percent-encoding (only) where needed
https://www.example.org/a/../b//./select%2Dlang.php?lang=español#anchor
is normalized to https://www.example.org/b/select-lang.php?lang=espa%C3%B1ol
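
The basic steps listed above can be illustrated with a minimal, standalone sketch. It is not the BasicURLNormalizer implementation; the class name BasicNormalizationSketch and its helper methods are hypothetical, and only java.net.URI from the JDK is used. It assumes an http:// or https:// URL with a host and ignores user info and other edge cases.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;

// Hypothetical illustration only; not part of Apache Nutch.
class BasicNormalizationSketch {

    static String normalize(String url) {
        URI uri = URI.create(url);
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);

        // Remove default ports (80 for http, 443 for https).
        int port = uri.getPort();
        if (("http".equals(scheme) && port == 80)
                || ("https".equals(scheme) && port == 443)) {
            port = -1;
        }

        StringBuilder sb = new StringBuilder(scheme).append("://").append(host);
        if (port != -1) sb.append(':').append(port);
        sb.append(cleanPath(fixEscapes(uri.getRawPath())));
        if (uri.getRawQuery() != null) {
            sb.append('?').append(fixEscapes(uri.getRawQuery()));
        }
        // The fragment (anchor) is never appended, so it is removed.
        return sb.toString();
    }

    // Drop empty segments (needless slashes) and resolve "." and "..".
    static String cleanPath(String path) {
        Deque<String> out = new ArrayDeque<>();
        for (String seg : path.split("/")) {
            if (seg.isEmpty() || seg.equals(".")) continue;
            if (seg.equals("..")) { out.pollLast(); continue; }
            out.addLast(seg);
        }
        String result = "/" + String.join("/", out);
        // Keep a trailing slash if the original path had one.
        if (path.endsWith("/") && !out.isEmpty()) result += "/";
        return result;
    }

    // Decode escapes of unreserved characters; encode bytes that need it.
    static String fixEscapes(String s) {
        byte[] in = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < in.length; i++) {
            int b = in[i] & 0xff;
            if (b == '%' && i + 2 < in.length) {
                int hi = Character.digit((char) (in[i + 1] & 0xff), 16);
                int lo = Character.digit((char) (in[i + 2] & 0xff), 16);
                if (hi >= 0 && lo >= 0) {
                    b = hi * 16 + lo;
                    i += 2;
                    if (isUnreserved(b)) sb.append((char) b);
                    else sb.append(String.format("%%%02X", b));
                    continue;
                }
            }
            // Spaces, control characters and non-ASCII bytes get escaped.
            if (b <= 0x20 || b >= 0x7f) sb.append(String.format("%%%02X", b));
            else sb.append((char) b);
        }
        return sb.toString();
    }

    static boolean isUnreserved(int b) {
        return (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z')
                || (b >= '0' && b <= '9')
                || b == '-' || b == '.' || b == '_' || b == '~';
    }
}
```

Calling BasicNormalizationSketch.normalize on the example URL above reproduces the normalized form shown. Escapes are fixed before dot segments are resolved, so an escaped dot segment such as %2E%2E is treated like a literal one.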
Optional and configurable normalizations are:
- convert Internationalized Domain Names (IDNs) uniquely either to the
ASCII (Punycode) or Unicode representation, see property
urlnormalizer.basic.host.idn
- remove a trailing dot from host names, see property
urlnormalizer.basic.host.trim-trailing-dot
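
The effect of the two optional host normalizations can be sketched with the JDK class java.net.IDN. This is an illustration under assumptions, not the plugin's code: the class HostNormalizationSketch and its parameters are hypothetical, and the mode strings "toAscii" and "toUnicode" are assumed values for urlnormalizer.basic.host.idn that should be checked against the property description in nutch-default.xml.

```java
import java.net.IDN;
import java.util.Locale;

// Hypothetical illustration only; not part of Apache Nutch.
class HostNormalizationSketch {

    /**
     * @param host            host name as found in the URL
     * @param idnMode         assumed values: "toAscii", "toUnicode", or
     *                        anything else to leave IDNs untouched
     * @param trimTrailingDot whether to remove a trailing dot
     */
    static String normalizeHost(String host, String idnMode,
                                boolean trimTrailingDot) {
        String h = host.toLowerCase(Locale.ROOT);
        if (trimTrailingDot && h.endsWith(".")) {
            h = h.substring(0, h.length() - 1);
        }
        if ("toAscii".equals(idnMode)) {
            h = IDN.toASCII(h);      // Unicode form -> Punycode
        } else if ("toUnicode".equals(idnMode)) {
            h = IDN.toUnicode(h);    // Punycode -> Unicode form
        }
        return h;
    }
}
```

Converting every host to a single representation means that the Unicode and Punycode spellings of the same domain end up as one URL rather than two.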
Class Summary
Class: BasicURLNormalizer
Description: Converts URLs to a normal form:
- remove dot segments in path: /./ or /../
- remove default ports, e.g.