org.apache.nutch.analysis.lang
Class HTMLLanguageParser

java.lang.Object
  extended by org.apache.nutch.analysis.lang.HTMLLanguageParser
All Implemented Interfaces:
Configurable, HtmlParseFilter, Pluggable

public class HTMLLanguageParser
extends Object
implements HtmlParseFilter

Adds metadata identifying the language of the document, if found. We could also run statistical analysis here, but we would miss all other formats.


Field Summary
static org.apache.commons.logging.Log LOG
           
 
Fields inherited from interface org.apache.nutch.parse.HtmlParseFilter
X_POINT_ID
 
Constructor Summary
HTMLLanguageParser()
           
 
Method Summary
 ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
          Scan the HTML document looking at possible indications of content language.
 Configuration getConf()
           
 void setConf(Configuration conf)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final org.apache.commons.logging.Log LOG
Constructor Detail

HTMLLanguageParser

public HTMLLanguageParser()
Method Detail

filter

public ParseResult filter(Content content,
                          ParseResult parseResult,
                          HTMLMetaTags metaTags,
                          DocumentFragment doc)
Scan the HTML document looking at possible indications of content language:
  1. html lang attribute (http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.1)
  2. meta dc.language (http://dublincore.org/documents/2000/07/16/usageguide/qualified-html.shtml#language)
  3. meta http-equiv (content-language) (http://www.w3.org/TR/REC-html40/struct/global.html#h-7.4.4.2)
Only the first occurrence of language is stored.

Specified by:
filter in interface HtmlParseFilter
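
The lookup order described above can be illustrated with a minimal sketch. This is not the Nutch implementation: the class LanguageHintSketch and its methods are hypothetical, it uses only the JDK's DOM classes, it requires well-formed XHTML input, and it assumes the numbered list is a priority order (html lang first, then dc.language, then content-language).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of Nutch.
public class LanguageHintSketch {

    /** Returns the first declared language in the assumed priority order, or null. */
    public static String findDeclaredLanguage(Node root) {
        // found[0] = html lang, found[1] = meta dc.language, found[2] = meta http-equiv
        String[] found = new String[3];
        collect(root, found);
        for (String candidate : found) {
            if (candidate != null) {
                return candidate;
            }
        }
        return null;
    }

    /** Walks the tree depth-first, keeping only the first value seen per source. */
    private static void collect(Node node, String[] found) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            String name = el.getTagName();
            if (name.equalsIgnoreCase("html")) {
                if (found[0] == null) {
                    found[0] = emptyToNull(el.getAttribute("lang"));
                }
            } else if (name.equalsIgnoreCase("meta")) {
                String content = emptyToNull(el.getAttribute("content"));
                if (found[1] == null
                        && "dc.language".equalsIgnoreCase(el.getAttribute("name"))) {
                    found[1] = content;
                }
                if (found[2] == null
                        && "content-language".equalsIgnoreCase(el.getAttribute("http-equiv"))) {
                    found[2] = content;
                }
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), found);
        }
    }

    // Element.getAttribute returns "" for a missing attribute; treat that as absent.
    private static String emptyToNull(String value) {
        return (value == null || value.trim().isEmpty()) ? null : value.trim();
    }

    /** Convenience wrapper: parses well-formed XHTML and runs the lookup. */
    public static String detectFromXhtml(String xhtml) {
        try {
            Node doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return findDeclaredLanguage(doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html lang=\"en\"><head>"
                + "<meta name=\"dc.language\" content=\"fr\"/>"
                + "</head><body/></html>";
        System.out.println(detectFromXhtml(page));
    }
}
```

In this sketch a page that declares both an html lang attribute and a dc.language meta tag yields the html lang value, and a page with no declaration yields null.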

setConf

public void setConf(Configuration conf)
Specified by:
setConf in interface Configurable

getConf

public Configuration getConf()
Specified by:
getConf in interface Configurable


Copyright © 2011 The Apache Software Foundation