Class Generator

  • All Implemented Interfaces:
    Configurable, Tool

    public class Generator
    extends NutchTool
    implements Tool
    Generates a subset of a CrawlDb to fetch. This version can generate fetchlists for several segments in one run. Unlike the initial version (OldGenerator), it performs IP resolution only on the entries that have been selected for fetching. Within a segment, URLs are partitioned by IP, domain, or host. The unit used to count URLs when limiting entries (domain or host) can be chosen separately.
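    The idea of counting URLs per host to limit the selected entries can be illustrated with a minimal sketch. This is not the Generator's implementation: the class name PerHostLimitSketch, the method select, and the maxPerHost parameter are hypothetical, and the input is assumed to be already sorted by descending score.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not the Nutch Generator implementation.
class PerHostLimitSketch {

    /**
     * Walks the URLs in the given order and keeps at most maxPerHost URLs
     * per host, stopping once topN URLs have been selected.
     * Both limits are hypothetical parameters of this sketch.
     */
    static List<String> select(List<String> urlsByScoreDesc, int topN, int maxPerHost) {
        Map<String, Integer> countPerHost = new HashMap<>();
        List<String> selected = new ArrayList<>();
        for (String url : urlsByScoreDesc) {
            if (selected.size() >= topN) {
                break; // overall limit reached
            }
            String host = URI.create(url).getHost();
            int seen = countPerHost.getOrDefault(host, 0);
            if (seen >= maxPerHost) {
                continue; // this host has already used its quota
            }
            countPerHost.put(host, seen + 1);
            selected.add(url);
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://a.example/1", "http://a.example/2",
            "http://a.example/3", "http://b.example/1");
        // With topN = 3 and maxPerHost = 2, the third a.example URL is skipped.
        System.out.println(select(urls, 3, 2));
    }
}
```

    Counting by domain instead of host would only change how the map key is derived from the URL.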
    • Constructor Detail

      • Generator

        public Generator()
    • Method Detail

      • generate

        public Path[] generate​(Path dbDir,
                               Path segments,
                               int numLists,
                               long topN,
                               long curTime)
                        throws IOException,
                               InterruptedException,
                               ClassNotFoundException
        Parameters:
        dbDir - Crawl database directory
        segments - Segments directory
        numLists - Number of fetch lists (partitions) per segment or number of fetcher map tasks. (One fetch list partition is fetched in one fetcher map task.)
        topN - Number of top URLs to be selected
        curTime - Current time in milliseconds
        Returns:
        Paths to the generated segments, or null if no entries were selected
        Throws:
        IOException - if an I/O exception occurs.
        InterruptedException - if a thread is waiting, sleeping, or otherwise occupied, and the thread is interrupted, either before or during the activity.
        ClassNotFoundException - if runtime class(es) are not available
        See Also:
        LockUtil.createLockFile(Configuration, Path, boolean)
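        To illustrate what numLists means, the following sketch splits a list of URLs into a fixed number of fetch lists so that all URLs of one host land in the same list. It is a simplified stand-in, not the partitioner Nutch uses: the class FetchListPartitionSketch and the hash-based assignment are assumptions made for this example.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the partitioner used by Nutch.
class FetchListPartitionSketch {

    /** Assigns each URL to one of numLists fetch lists based on its host. */
    static List<List<String>> partitionByHost(List<String> urls, int numLists) {
        List<List<String>> lists = new ArrayList<>();
        for (int i = 0; i < numLists; i++) {
            lists.add(new ArrayList<>());
        }
        for (String url : urls) {
            String host = URI.create(url).getHost();
            // floorMod keeps the index non-negative even for negative hash codes.
            int index = Math.floorMod(host == null ? 0 : host.hashCode(), numLists);
            lists.get(index).add(url);
        }
        return lists;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://a.example/1", "http://b.example/1", "http://a.example/2");
        List<List<String>> fetchLists = partitionByHost(urls, 2);
        for (int i = 0; i < fetchLists.size(); i++) {
            System.out.println("fetch list " + i + ": " + fetchLists.get(i));
        }
    }
}
```

        Because one fetch list is fetched by one fetcher map task, keeping a host within a single list means only one task contacts that host.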
      • generate

        public Path[] generate​(Path dbDir,
                               Path segments,
                               int numLists,
                               long topN,
                               long curTime,
                               boolean filter,
                               boolean norm,
                               boolean force,
                               int maxNumSegments,
                               String expr)
                        throws IOException,
                               InterruptedException,
                               ClassNotFoundException
        Use this signature when no hostdb is available. Generates fetchlists in one or more segments. Whether to filter URLs is read from the "generate.filter" property set for the job from the command line. If the property is not found, the URLs are filtered. Normalization is handled the same way.
        Parameters:
        dbDir - Crawl database directory
        segments - Segments directory
        numLists - Number of fetch lists (partitions) per segment or number of fetcher map tasks. (One fetch list partition is fetched in one fetcher map task.)
        topN - Number of top URLs to be selected
        curTime - Current time in milliseconds
        filter - whether to apply filtering operation
        norm - whether to apply normalization operation
        force - if true and the target lock file exists, consider it valid. If false and the target lock file exists, throw an IOException.
        maxNumSegments - maximum number of segments to generate
        expr - a Jexl expression to use in the Generator job.
        Returns:
        Paths to the generated segments, or null if no entries were selected
        Throws:
        IOException - if an I/O exception occurs.
        InterruptedException - if a thread is waiting, sleeping, or otherwise occupied, and the thread is interrupted, either before or during the activity.
        ClassNotFoundException - if runtime class(es) are not available
        See Also:
        JexlUtil.parseExpression(String), LockUtil.createLockFile(Configuration, Path, boolean)
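        The "filter unless told otherwise" behaviour described above amounts to reading a boolean property with a default of true. A minimal sketch, assuming java.util.Properties as a stand-in for the job configuration; the class FilterFlagSketch and the method isEnabled are hypothetical. Only the "generate.filter" key comes from this page; the key used for normalization is not named here, so the sketch takes the key as a parameter.

```java
import java.util.Properties;

// Illustrative sketch only; java.util.Properties stands in for the job configuration.
class FilterFlagSketch {

    /** Returns the boolean value of the property, or true when it is not set. */
    static boolean isEnabled(Properties conf, String key) {
        return Boolean.parseBoolean(conf.getProperty(key, "true"));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Property absent: filtering is applied.
        System.out.println(isEnabled(conf, "generate.filter"));

        // Property explicitly disabled: filtering is skipped.
        conf.setProperty("generate.filter", "false");
        System.out.println(isEnabled(conf, "generate.filter"));
    }
}
```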
      • generate

        public Path[] generate​(Path dbDir,
                               Path segments,
                               int numLists,
                               long topN,
                               long curTime,
                               boolean filter,
                               boolean norm,
                               boolean force,
                               int maxNumSegments,
                               String expr,
                               String hostdb)
                        throws IOException,
                               InterruptedException,
                               ClassNotFoundException
        Generates fetchlists in one or more segments. Whether to filter URLs is read from the "generate.filter" property set for the job from the command line. If the property is not found, the URLs are filtered. Normalization is handled the same way.
        Parameters:
        dbDir - Crawl database directory
        segments - Segments directory
        numLists - Number of fetch lists (partitions) per segment or number of fetcher map tasks. (One fetch list partition is fetched in one fetcher map task.)
        topN - Number of top URLs to be selected
        curTime - Current time in milliseconds
        filter - whether to apply filtering operation
        norm - whether to apply normalization operation
        force - if true and the target lock file exists, consider it valid. If false and the target lock file exists, throw an IOException.
        maxNumSegments - maximum number of segments to generate
        expr - a Jexl expression to use in the Generator job.
        hostdb - name of a hostdb against which Jexl expressions are evaluated to determine the maximum URL count and/or fetch delay per host.
        Returns:
        Paths to the generated segments, or null if no entries were selected
        Throws:
        IOException - if an I/O exception occurs.
        InterruptedException - if a thread is waiting, sleeping, or otherwise occupied, and the thread is interrupted, either before or during the activity.
        ClassNotFoundException - if runtime class(es) are not available
        See Also:
        JexlUtil.parseExpression(String), LockUtil.createLockFile(Configuration, Path, boolean)
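        The semantics of the force parameter can be sketched on a local file system. This is an illustration of the documented behaviour, not the code of LockUtil.createLockFile: the class LockForceSketch, the method acquire, and the lock file name are hypothetical, and Path here is java.nio.file.Path rather than the Hadoop Path used in the signatures above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch only; uses java.nio.file instead of the Hadoop FileSystem API.
class LockForceSketch {

    /**
     * Creates the lock file if it is missing. If it already exists, the
     * existing lock is accepted when force is true; otherwise an
     * IOException is thrown.
     */
    static void acquire(Path lockFile, boolean force) throws IOException {
        if (Files.exists(lockFile)) {
            if (!force) {
                throw new IOException("Lock file " + lockFile + " already exists.");
            }
            return; // force: treat the existing lock as valid
        }
        Files.createFile(lockFile);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("lock-sketch");
        Path lock = dir.resolve("sketch.lock"); // hypothetical lock file name

        acquire(lock, false); // creates the lock
        acquire(lock, true);  // existing lock accepted because force is true
        try {
            acquire(lock, false);
        } catch (IOException e) {
            System.out.println("Refused: " + e.getMessage());
        }
    }
}
```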
      • generateSegmentName

        public static String generateSegmentName()
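        This page does not document the format of the returned name. As a rough illustration of how a segment name could be derived from the current time, the sketch below formats a timestamp with the pattern yyyyMMddHHmmss. The pattern, the class SegmentNameSketch, and the method segmentName are assumptions of this example, not a statement about what generateSegmentName() returns.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Illustrative sketch only; the timestamp pattern is an assumption.
class SegmentNameSketch {

    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    /** Builds a name from the given time so that names sort chronologically. */
    static String segmentName(LocalDateTime time) {
        return FORMAT.format(time);
    }

    public static void main(String[] args) {
        System.out.println(segmentName(LocalDateTime.now()));
    }
}
```

        A fixed-width, most-significant-first timestamp has the convenient property that lexicographic order of the names matches creation order.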
      • main

        public static void main​(String[] args)
                         throws Exception
        Generate a fetchlist from the crawldb.
        Parameters:
        args - array of arguments for this job
        Throws:
        Exception - if there is an error running the job
      • run

        public Map<String,​Object> run​(Map<String,​Object> args,
                                            String crawlId)
                                     throws Exception
        Description copied from class: NutchTool
        Runs the tool, using a map of arguments. May return results, or null.
        Specified by:
        run in class NutchTool
        Parameters:
        args - a Map of arguments to be run with the tool
        crawlId - a crawl identifier to associate with the tool invocation
        Returns:
        Map of results if the tool executes successfully, otherwise null
        Throws:
        Exception - if there is an error during the tool execution