org.apache.nutch.scoring.webgraph
Class WebGraph

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.nutch.scoring.webgraph.WebGraph
All Implemented Interfaces:
Configurable, Tool

public class WebGraph
extends Configured
implements Tool

Creates three databases: one for inlinks, one for outlinks, and a node database that holds the number of inlinks and outlinks for a url and the current score for that url. The score is set by an analysis program such as LinkRank.

The WebGraph is an updatable database. Outlinks are stored by their fetch time, or by the current system time if no fetch time is available. Only the most recent version of the outlinks for a given url is stored. As more crawls are executed and the WebGraph is updated, newer Outlinks replace older Outlinks. This allows the WebGraph to adapt to changes in the link structure of the web.

The Inlink database is created from the Outlink database and is regenerated whenever the WebGraph is updated. The Node database is created from both the Inlink and Outlink databases. Because the Node database is overwritten when the WebGraph is updated, and because it holds the current scores for urls, it is recommended that a crawl cycle (one or more full crawls) fully complete before the WebGraph is updated. An analysis such as LinkRank should then be run to update the scores in the Node database in a stable fashion.
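The relationship between the three databases can be modeled in a few lines of plain Java. This is a minimal in-memory sketch, not the MapReduce jobs that Nutch actually runs: the names LinkInversionSketch, NodeCounts, invert, and countNodes are hypothetical, and scores are left out because they are set by a separate analysis program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative model only; names are hypothetical and not part of Nutch.
class LinkInversionSketch {

    // Stand-in for a node record: link counts for one url (score omitted).
    static final class NodeCounts {
        int numInlinks;
        int numOutlinks;
    }

    // Inverts "source -> targets" into "target -> sources".
    static Map<String, List<String>> invert(Map<String, List<String>> outlinks) {
        Map<String, List<String>> inlinks = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
            for (String target : e.getValue()) {
                inlinks.computeIfAbsent(target, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        return inlinks;
    }

    // Builds per-url counts from both link maps, as the node database is
    // built from both the inlink and outlink databases.
    static Map<String, NodeCounts> countNodes(Map<String, List<String>> outlinks,
                                              Map<String, List<String>> inlinks) {
        Map<String, NodeCounts> nodes = new TreeMap<>();
        outlinks.forEach((url, targets) ->
            nodes.computeIfAbsent(url, k -> new NodeCounts()).numOutlinks = targets.size());
        inlinks.forEach((url, sources) ->
            nodes.computeIfAbsent(url, k -> new NodeCounts()).numInlinks = sources.size());
        return nodes;
    }

    public static void main(String[] args) {
        Map<String, List<String>> outlinks = new TreeMap<>();
        outlinks.put("http://a.example/", Arrays.asList("http://b.example/", "http://c.example/"));
        outlinks.put("http://b.example/", Arrays.asList("http://c.example/"));

        Map<String, List<String>> inlinks = invert(outlinks);
        Map<String, NodeCounts> nodes = countNodes(outlinks, inlinks);
        nodes.forEach((url, n) ->
            System.out.println(url + " in=" + n.numInlinks + " out=" + n.numOutlinks));
    }
}
```

Because the inlinks and node counts are derived entirely from the outlinks, regenerating them on every update is straightforward in this model; only the scores would be lost, which is why the description above recommends rerunning the analysis after each update.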


Nested Class Summary
static class WebGraph.OutlinkDb
          The OutlinkDb creates a database of all outlinks.
 
Field Summary
static String INLINK_DIR
           
static String LOCK_NAME
           
static org.slf4j.Logger LOG
           
static String NODE_DIR
           
static String OUTLINK_DIR
           
 
Constructor Summary
WebGraph()
           
 
Method Summary
 void createWebGraph(Path webGraphDb, Path[] segments)
          Creates the three different WebGraph databases, Outlinks, Inlinks, and Node.
static void main(String[] args)
           
 int run(String[] args)
          Parses command line arguments and runs the WebGraph jobs.
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
 

Field Detail

LOG

public static final org.slf4j.Logger LOG

LOCK_NAME

public static final String LOCK_NAME
See Also:
Constant Field Values

INLINK_DIR

public static final String INLINK_DIR
See Also:
Constant Field Values

OUTLINK_DIR

public static final String OUTLINK_DIR
See Also:
Constant Field Values

NODE_DIR

public static final String NODE_DIR
See Also:
Constant Field Values
Constructor Detail

WebGraph

public WebGraph()
Method Detail

createWebGraph

public void createWebGraph(Path webGraphDb,
                           Path[] segments)
                    throws IOException
Creates the three different WebGraph databases, Outlinks, Inlinks, and Node. If a WebGraph already exists at the given path, it is updated; otherwise a new WebGraph database is created.

Parameters:
webGraphDb - The WebGraph to create or update.
segments - The array of segments used to update the WebGraph. Outlinks from segments with newer fetch times overwrite those from older segments.
Throws:
IOException - If an error occurs while processing the WebGraph.
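
The overwrite rule for segments can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the Nutch implementation: OutlinkUpdateSketch, FetchedLinks, and update are hypothetical names, the database is modeled as a plain map keyed by url, and the choice to let the incoming entry win on equal fetch times is an assumption made here for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only; names are hypothetical and not part of Nutch.
class OutlinkUpdateSketch {

    // The outlinks seen for one url at one fetch.
    static final class FetchedLinks {
        final long fetchTime;
        final List<String> targets;

        FetchedLinks(long fetchTime, List<String> targets) {
            this.fetchTime = fetchTime;
            this.targets = targets;
        }
    }

    // Merges one segment into the existing database, keeping only the
    // most recently fetched outlinks per url.
    // Assumption: on equal fetch times the incoming entry wins.
    static void update(Map<String, FetchedLinks> db, Map<String, FetchedLinks> segment) {
        for (Map.Entry<String, FetchedLinks> e : segment.entrySet()) {
            db.merge(e.getKey(), e.getValue(),
                (existing, incoming) ->
                    incoming.fetchTime >= existing.fetchTime ? incoming : existing);
        }
    }

    public static void main(String[] args) {
        Map<String, FetchedLinks> db = new HashMap<>();

        Map<String, FetchedLinks> first = new HashMap<>();
        first.put("http://a.example/", new FetchedLinks(100L, Arrays.asList("http://x.example/")));
        update(db, first);

        Map<String, FetchedLinks> second = new HashMap<>();
        second.put("http://a.example/", new FetchedLinks(200L, Arrays.asList("http://z.example/")));
        update(db, second);

        // The newer fetch replaces the older one for the same url.
        System.out.println(db.get("http://a.example/").targets);
    }
}
```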

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception

run

public int run(String[] args)
        throws Exception
Parses command line arguments and runs the WebGraph jobs.

Specified by:
run in interface Tool
Throws:
Exception


Copyright © 2011 The Apache Software Foundation