Class PdfParser

java.lang.Object
  extended by org.apache.nutch.parse.pdf.PdfParser
All Implemented Interfaces:
Configurable, Parser, Pluggable

public class PdfParser
extends Object
implements Parser

Parser for the MIME type application/pdf. It is based on org.pdfbox.*. How well it does the job remains to be seen.

Author:
John Xing

Note on 20040614 by Xing: Some code is stacked here for convenience (see inline comments). It may be moved to more appropriate places when the new codebase stabilizes, especially after the indexing code is written.

Field Summary
static org.apache.commons.logging.Log LOG
Fields inherited from interface org.apache.nutch.parse.Parser
Constructor Summary
PdfParser()
Method Summary
 Configuration getConf()
 ParseResult getParse(Content content)
           This method parses the given content and returns a map of <key, parse> pairs.
 void setConf(Configuration conf)
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail


public static final org.apache.commons.logging.Log LOG
Constructor Detail


public PdfParser()
Method Detail


public ParseResult getParse(Content content)
Description copied from interface: Parser

This method parses the given content and returns a map of <key, parse> pairs. Parse instances will be persisted under the given key.

Note: Meta-redirects should be followed only when they come from the original URL. That is, assume the fetcher is in parsing mode and is currently processing a URL. If that URL contains a meta-redirect to another URL, the fetcher should follow the redirect only if the map contains an entry keyed by that same URL whose Parse has a ParseStatus indicating the redirect.

Specified by:
getParse in interface Parser
Parameters:
content - Content to be parsed
Returns:
a map containing <key, parse> pairs
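
The redirect rule in the note for getParse can be illustrated with a minimal sketch. It does not use the Nutch classes: RedirectRuleSketch, ParseEntry, Status, and redirectToFollow are hypothetical stand-ins invented for this example, and the returned <key, parse> pairs are modelled as a plain java.util.Map. The sketch only shows the lookup logic, assuming a caller that knows the original URL.

```java
import java.util.Map;

/**
 * Hypothetical stand-ins for illustration only; these are NOT the Nutch
 * ParseResult, Parse, or ParseStatus classes.
 */
final class RedirectRuleSketch {

    /** Simplified parse outcome: either plain success or a meta-redirect. */
    enum Status { SUCCESS, REDIRECT }

    /** Simplified parse entry holding a status and, for redirects, a target URL. */
    static final class ParseEntry {
        final Status status;
        final String redirectTarget; // null unless status == REDIRECT

        ParseEntry(Status status, String redirectTarget) {
            this.status = status;
            this.redirectTarget = redirectTarget;
        }
    }

    /**
     * Returns the redirect target if, and only if, the entry stored under the
     * original URL reports a redirect; otherwise returns null.
     */
    static String redirectToFollow(String originalUrl, Map<String, ParseEntry> parseResult) {
        ParseEntry entry = parseResult.get(originalUrl);
        if (entry != null && entry.status == Status.REDIRECT) {
            return entry.redirectTarget;
        }
        return null; // redirects reported under any other key are ignored
    }
}
```

In this sketch a redirect is honoured only when it is stored under the key of the URL being processed; a redirect recorded under any other key in the same map is skipped.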


public void setConf(Configuration conf)
Specified by:
setConf in interface Configurable


public Configuration getConf()
Specified by:
getConf in interface Configurable

Copyright © 2006 The Apache Software Foundation