Class ParserFactory

java.lang.Object
  extended by org.apache.nutch.parse.ParserFactory

public final class ParserFactory
extends Object

Creates and caches Parser plugins.

Field Summary
static String DEFAULT_PLUGIN
          Wildcard for default plugins.
static org.slf4j.Logger LOG
Constructor Summary
ParserFactory(Configuration conf)
Method Summary
protected  List<Extension> getExtensions(String contentType)
          Finds the best-suited parse plugin for a given contentType.
 Parser getParserById(String id)
          Function returns a Parser instance with the specified extId, representing its extension ID.
 Parser[] getParsers(String contentType, String url)
          Function returns an array of Parsers for a given content type.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail


public static final org.slf4j.Logger LOG


public static final String DEFAULT_PLUGIN
Wildcard for default plugins.

See Also:
Constant Field Values
Constructor Detail


public ParserFactory(Configuration conf)
Method Detail


public Parser[] getParsers(String contentType,
                           String url)
                    throws ParserNotFound
Function returns an array of Parsers for a given content type. The function consults the ParserFactory's internal list of parse plugins to determine the list of pluginIds for the content type, then looks up the corresponding extensions and instantiates them as Parsers.

Parameters:
contentType - The contentType to return the array of Parsers for.
url - The URL of the content, which may allow the type to be determined from the file suffix.
Returns:
An array of Parsers for the given contentType. Plugins that are mapped to a contentType in the parse-plugins.xml file but not enabled via plugin.includes in the Nutch configuration are skipped and do not appear in this array. For example, if the ordered list of parsing plugins for text/plain is [parse-text, parse-html, parse-rtf] and only parse-html and parse-rtf are enabled via plugin.includes, then the ordered array consists of two Parser instances, [parse-html, parse-rtf].
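
The skipping rule described above can be illustrated with a minimal sketch. It is not the Nutch implementation: the class EnabledParserFilterSketch and the method enabledInOrder are hypothetical names, and plain strings stand in for Parser instances. It only shows how an ordered plugin list is reduced to the enabled plugins while preserving order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Nutch: models only the "skip plugins
// that are mapped but not enabled" rule, using plugin ids as strings.
class EnabledParserFilterSketch {

    // Returns the ids from orderedPluginIds that are also in enabledPluginIds,
    // keeping the original preference order.
    static List<String> enabledInOrder(List<String> orderedPluginIds,
                                       Set<String> enabledPluginIds) {
        List<String> result = new ArrayList<>();
        for (String id : orderedPluginIds) {
            if (enabledPluginIds.contains(id)) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Ordered mapping for text/plain, as in the example in the text.
        List<String> ordered = Arrays.asList("parse-text", "parse-html", "parse-rtf");
        // Plugins assumed to be enabled via plugin.includes.
        Set<String> enabled = new HashSet<>(Arrays.asList("parse-html", "parse-rtf"));
        System.out.println(enabledInOrder(ordered, enabled)); // prints [parse-html, parse-rtf]
    }
}
```

Because the ordered list drives the iteration, the preference order is kept even though the enabled set itself is unordered.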


public Parser getParserById(String id)
                     throws ParserNotFound
Function returns a Parser instance with the specified extId, representing its extension ID. If the Parser instance isn't found, then the function throws a ParserNotFound exception. If the function is able to find the Parser in the internal PARSER_CACHE, then it will return the already instantiated Parser. Otherwise, if it has to instantiate the Parser itself, then this function will cache that Parser in the internal PARSER_CACHE.

Parameters:
id - The string extension ID (e.g., "org.apache.nutch.parse.rss.RSSParser", "org.apache.nutch.parse.rtf.RTFParseFactory") of the Parser implementation to return.
Returns:
A Parser implementation specified by the parameter id.
Throws:
ParserNotFound - If the Parser is not found (i.e., not registered with the extension point), or if there is a PluginRuntimeException while instantiating the Parser.
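
The lookup-then-cache behavior described for getParserById can be sketched as follows. This is a simplified model under stated assumptions, not the Nutch source: ParserCacheSketch, its registry map, and the nested NotFound exception are hypothetical stand-ins for the extension point, PARSER_CACHE, and ParserNotFound, and Object stands in for Parser.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical model of "return the cached instance, else instantiate and cache".
class ParserCacheSketch {

    // Stand-in for ParserNotFound; unchecked here only to keep the sketch short.
    static class NotFound extends RuntimeException {
        NotFound(String message) {
            super(message);
        }
    }

    // Stand-in for the extension point: maps an extension id to a factory.
    private final Map<String, Supplier<Object>> registry = new HashMap<>();
    // Stand-in for the internal PARSER_CACHE.
    private final Map<String, Object> cache = new HashMap<>();
    // Counts how many times a factory was actually invoked.
    int instantiations = 0;

    void register(String id, Supplier<Object> factory) {
        registry.put(id, factory);
    }

    Object getById(String id) {
        Object cached = cache.get(id);
        if (cached != null) {
            return cached; // already instantiated: reuse it
        }
        Supplier<Object> factory = registry.get(id);
        if (factory == null) {
            throw new NotFound("No parser registered for id: " + id);
        }
        Object created = factory.get();
        instantiations++;
        cache.put(id, created); // cache for subsequent lookups
        return created;
    }
}
```

Calling getById twice with the same registered id returns the same object and invokes the factory once; an unregistered id raises the stand-in exception.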


protected List<Extension> getExtensions(String contentType)
Finds the best-suited parse plugin for a given contentType.

Parameters:
contentType - Content-Type for which we seek a parse plugin.
Returns:
A list of extensions to be used for this contentType. If none, returns null.

Copyright © 2011 The Apache Software Foundation