FetchSchedule.
String can be decoded in reverse and the
first character is represented by a terminal node.
String can be decoded and the last character is
represented by a terminal node.
ArcRecordReader class provides a record reader which
reads records from arc files.
ArcSegmentCreator is a replacement for fetcher that will
take arc files as input and produce a Nutch segment as output.
CircularDependencyException will be thrown if a circular
dependency is detected.
MimeType name by extracting the actual MimeType
from a string of the form:
Configuration for Nutch.
Configuration from supplied properties.
Text object for the key.
ParseResult from a single
Parse output.
RegexRule.
BytesWritable object for the key
DomainSuffix objects
Note: this class is a singleton.
Extension is a kind of listener descriptor that will be
installed on a concrete ExtensionPoint that acts as a kind of
Publisher.
ExtensionPoint provides meta information about an extension
point.
FetchSchedule implementation.
Indexer for indexing within the Nutch
index.
SegmentMergeFilter extensions and if any of them
returns false, it will return false as well.
MimeTypes.forName(String)
method.
CrawlDatum.getScore().
Configurables
DomainSuffix object for the extension; if the
extension is a top-level domain, the returned object will be an
instance of TopLevelDomain
DomainSuffix corresponding to the
last public part of the hostname
DomainSuffix corresponding to the
last public part of the hostname
robotsMeta to appropriate
values, based on any META tags found under the given
node.
MimeTypes.getMimeType(String)
method.
MimeTypes.getMimeType(File)
method.
node, and creates appropriate Outlink
records for each (relative to the supplied base
URL), and adds them to the outlinks ArrayList.
Outlink from given plain text.
Outlink from given plain text and adds anchor
to the extracted Outlinks
Parser instance with the specified
extId, representing its extension ID.
Parsers for a given content type.
Plugin class.
null.
Protocol implementation for a url.
Content for a fetchlist entry.
RecordReader for reading the arc file.
url with a configured HTTP client and
gets the response.
StringBuffer and a DOM Node,
and will append all the content text found beneath the DOM node to
the StringBuffer.
getText(sb, node, false).
StringBuffer and a DOM Node,
and will append the content text found beneath the first
title node to the StringBuffer.
HtmlParseFilter implementing plugins.
IndexingFilter implementing plugins.
sizeLimit bytes, if necessary.
Inlinks.
false if the robots.txt file
prohibits us from accessing the given url, or
true otherwise.
false if the robots.txt file
prohibits us from accessing the given path, or
true otherwise.
false if the robots.txt file
prohibits us from accessing the given url, or
true otherwise.
IndexingFilter that
adds a lang (language) field to the document.
s padded with leading spaces so
that its length is length.
input that is matched,
or null if no match exists.
- longestMatch(String) -
Method in class org.apache.nutch.util.SuffixStringMatcher
- Returns the longest suffix of
input that is matched,
or null if no match exists.
- longestMatch(String) -
Method in class org.apache.nutch.util.TrieStringMatcher
- Returns the longest substring of
input that is
matched by a pattern in the trie, or null if no match
exists.
- LoopReader - Class in org.apache.nutch.scoring.webgraph
- The LoopReader tool prints the loopset information for a single url.
- LoopReader() -
Constructor for class org.apache.nutch.scoring.webgraph.LoopReader
-
- LoopReader(Configuration) -
Constructor for class org.apache.nutch.scoring.webgraph.LoopReader
-
- Loops - Class in org.apache.nutch.scoring.webgraph
- The Loops job identifies cycles (loops) inside the web graph.
- Loops() -
Constructor for class org.apache.nutch.scoring.webgraph.Loops
-
- Loops.Finalizer - Class in org.apache.nutch.scoring.webgraph
- Finishes the Loops job by aggregating and collecting any found routes.
- Loops.Finalizer() -
Constructor for class org.apache.nutch.scoring.webgraph.Loops.Finalizer
- Default constructor.
- Loops.Finalizer(Configuration) -
Constructor for class org.apache.nutch.scoring.webgraph.Loops.Finalizer
- Configurable constructor.
- Loops.Initializer - Class in org.apache.nutch.scoring.webgraph
- Initializes the Loop routes.
- Loops.Initializer() -
Constructor for class org.apache.nutch.scoring.webgraph.Loops.Initializer
- Default constructor.
- Loops.Initializer(Configuration) -
Constructor for class org.apache.nutch.scoring.webgraph.Loops.Initializer
- Configurable constructor.
- Loops.Looper - Class in org.apache.nutch.scoring.webgraph
- Follows a route path looking for the start url of the route.
- Loops.Looper() -
Constructor for class org.apache.nutch.scoring.webgraph.Loops.Looper
- Default constructor.
- Loops.Looper(Configuration) -
Constructor for class org.apache.nutch.scoring.webgraph.Loops.Looper
- Configurable constructor.
- Loops.LoopSet - Class in org.apache.nutch.scoring.webgraph
- A set of loops.
- Loops.LoopSet() -
Constructor for class org.apache.nutch.scoring.webgraph.Loops.LoopSet
-
- Loops.Route - Class in org.apache.nutch.scoring.webgraph
- A link path or route looking to identify a link cycle.
- Loops.Route() -
Constructor for class org.apache.nutch.scoring.webgraph.Loops.Route
-
- LOOPS_DIR -
Static variable in class org.apache.nutch.scoring.webgraph.Loops
-
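The longestMatch entries for SuffixStringMatcher in this index describe returning the longest suffix of the input that is matched, or null if no match exists; shortestMatch is the counterpart for the shortest suffix. As a rough illustration of that contract only, here is a minimal sketch. The class name SimpleSuffixMatcher is hypothetical, and the sketch uses a plain HashSet rather than the trie that the Nutch classes are built on, so it should not be read as the library's implementation.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, set-based stand-in used only to illustrate the match contract. */
class SimpleSuffixMatcher {
    private final Set<String> suffixes;

    SimpleSuffixMatcher(Collection<String> suffixes) {
        this.suffixes = new HashSet<>(suffixes);
    }

    /** Returns the longest suffix of input found in the set, or null if none is. */
    String longestMatch(String input) {
        // Candidates shrink from the whole string down to its last character.
        for (int start = 0; start < input.length(); start++) {
            String candidate = input.substring(start);
            if (suffixes.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }

    /** Returns the shortest suffix of input found in the set, or null if none is. */
    String shortestMatch(String input) {
        // Candidates grow from the last character up to the whole string.
        for (int start = input.length() - 1; start >= 0; start--) {
            String candidate = input.substring(start);
            if (suffixes.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        SimpleSuffixMatcher matcher =
                new SimpleSuffixMatcher(Arrays.asList(".org", ".apache.org", ".com"));
        System.out.println(matcher.longestMatch("nutch.apache.org"));
        System.out.println(matcher.shortestMatch("nutch.apache.org"));
        System.out.println(matcher.longestMatch("example.net"));
    }
}
```

Each lookup here costs one substring per candidate; a trie avoids that by walking the input once, which is presumably why the indexed classes use one.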
Parser.
TrieStringMatcher.TrieNode visited, given that you are at
node, and the next character in the input is
the idx'th character of s.
String is matched by a
prefix in the trie
String is matched by a
suffix in the trie
String is matched by a
pattern in the trie
MissingDependencyException will be thrown if a plugin
dependency cannot be found.
Node on the stack and pushes all of its
children onto the stack, allowing us to walk the node tree without the
use of recursion.
Node tree from the root node.
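The fragment above describes taking a Node off a stack and pushing all of its children onto the stack, so the node tree can be walked without recursion. A minimal sketch of that idea, using only the JDK's org.w3c.dom API, might look like the following. The class name IterativeDomWalk and its methods are hypothetical and are not the Nutch NodeWalker API.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical illustration of a stack-based (non-recursive) DOM traversal. */
class IterativeDomWalk {

    /** Returns node names in document order using an explicit stack. */
    static List<String> walk(Node root) {
        List<String> names = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node current = stack.pop();
            names.add(current.getNodeName());
            NodeList children = current.getChildNodes();
            // Push children in reverse so the first child is popped first.
            for (int i = children.getLength() - 1; i >= 0; i--) {
                stack.push(children.item(i));
            }
        }
        return names;
    }

    /** Parses a small XML string and returns its root element. */
    static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(walk(parse("<a><b><c/></b><d/></a>")));
    }
}
```

For the sample input the walk visits a, b, c, d in that order. Skipping a subtree in this scheme amounts to popping the just-pushed children back off the stack.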
Configurations that include Nutch-specific
resources.
NutchDocument is the unit of indexing.
JobConf for Nutch jobs.
Plugin System.
http, httpclient)
Outlinks
/ URLs from plain text using Regular Expressions.
Parsers
until a successful parse is performed and a Parse object is
returned.
Content object using the Parser specified
by the parameter extId, i.e., the Parser's extension ID.
Protocol
implementation.
Parser plugins.
Parsers to obtain
Parse objects.
Content metadata.
PluginClassLoader contains only classes of the runtime
libraries set up in the plugin manifest file and exported libraries of
plugins that are required by the plugin.
PluginDescriptor provides access to all meta information of
a nutch-plugin, as well as to the internationalizable resources and the plugin's
own classloader.
PluginManifestParser parser just parses the manifest file
in all plugin directories.
PluginRuntimeException will be thrown when an exception in the
plugin management occurs.
Strings against a set
of prefixes.
PrefixStringMatcher which will match
Strings with any prefix in the supplied array.
PrefixStringMatcher which will match
Strings with any prefix in the supplied
Collection.
ProtocolException instead.
Protocol plugins.
Java Regex implementation.
URL filter based on
regular expressions.
IndexingFilter that
adds tag field(s) to the document.
false.
s padded with trailing spaces so
that its length is length.
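The two padding fragments in this index describe returning s padded with leading or trailing spaces so that its length is length. A minimal sketch of that behavior follows. The class name PadSketch and the method names leftPad and rightPad are assumptions, since the entry headers are missing here, and returning an over-long s unchanged is also an assumption.

```java
/** Hypothetical illustration of space padding to a target length. */
class PadSketch {

    /** Returns s with leading spaces added until it is length characters long. */
    static String leftPad(String s, int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < length; i++) {
            sb.append(' ');
        }
        // Assumption: a string already at or beyond length is returned unchanged.
        return sb.append(s).toString();
    }

    /** Returns s with trailing spaces added until it is length characters long. */
    static String rightPad(String s, int length) {
        StringBuilder sb = new StringBuilder(s);
        while (sb.length() < length) {
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + leftPad("ab", 5) + "]");
        System.out.println("[" + rightPad("ab", 5) + "]");
    }
}
```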
robots.txt files.
Fetcher when processing
redirect URLs.
Generator.
Injector.
Outlink instances.
URLPartitioner.
ScoringFilter implementing plugins.
SegmentMergeFilter extensions in a single object
so it is easier to operate on them.
MetaWrapper, to permit merging different
types in reduce and use additional metadata.
baseHref.
Configuration object used to configure this
IndexingFilter.
Configuration object for this Parser.
fetchInterval and fetchTime on a
successfully fetched page.
fetchInterval and fetchTime on a
successfully fetched page.
noCache to true.
noFollow to true.
noIndex to true.
refresh to the supplied value.
refreshHref.
refreshTime.
input that is matched,
or null if no match exists.
- shortestMatch(String) -
Method in class org.apache.nutch.util.SuffixStringMatcher
- Returns the shortest suffix of
input that is matched,
or null if no match exists.
- shortestMatch(String) -
Method in class org.apache.nutch.util.TrieStringMatcher
- Returns the shortest substring of
input that is
matched by a pattern in the trie, or null if no match
exists.
- shouldFetch(Text, CrawlDatum, long) -
Method in class org.apache.nutch.crawl.AbstractFetchSchedule
- This method provides information whether the page is suitable for
selection in the current fetchlist.
- shouldFetch(Text, CrawlDatum, long) -
Method in interface org.apache.nutch.crawl.FetchSchedule
- This method provides information whether the page is suitable for
selection in the current fetchlist.
- shutDown() -
Method in class org.apache.nutch.plugin.Plugin
- Shutdown the plugin.
- Signature - Class in org.apache.nutch.crawl
-
- Signature() -
Constructor for class org.apache.nutch.crawl.Signature
-
- SIGNATURE_KEY -
Static variable in interface org.apache.nutch.metadata.Nutch
-
- SignatureComparator - Class in org.apache.nutch.crawl
-
- SignatureComparator() -
Constructor for class org.apache.nutch.crawl.SignatureComparator
-
- SignatureFactory - Class in org.apache.nutch.crawl
- Factory class, which instantiates a Signature implementation according to the
current Configuration.
- size() -
Method in class org.apache.nutch.crawl.Inlinks
-
- size() -
Method in class org.apache.nutch.crawl.MapWritable
- Deprecated.
- size() -
Method in class org.apache.nutch.metadata.Metadata
- Returns the number of metadata names in this metadata.
- size() -
Method in class org.apache.nutch.parse.ParseResult
- Returns the number of parse outputs (both successful and failed).
- skip(DataInput) -
Static method in class org.apache.nutch.crawl.Inlink
- Skips over one Inlink in the input.
- skip(DataInput) -
Static method in class org.apache.nutch.parse.Outlink
- Skips over one Outlink in the input.
- SKIP_TRUNCATED -
Static variable in class org.apache.nutch.parse.ParseSegment
-
- skipChildren() -
Method in class org.apache.nutch.util.NodeWalker
- Skips over and removes from the node stack the children of the last
node.
- skippedEntity(String) -
Method in class org.apache.nutch.parse.html.DOMBuilder
- Receive notification of a skipped entity.
- SOLR_PREFIX -
Static variable in interface org.apache.nutch.indexer.solr.SolrConstants
-
- SolrClean - Class in org.apache.nutch.indexer.solr
- The class scans CrawlDB looking for entries with status DB_GONE (404) and sends delete requests to Solr
for those documents.
- SolrClean() -
Constructor for class org.apache.nutch.indexer.solr.SolrClean
-
- SolrClean.DBFilter - Class in org.apache.nutch.indexer.solr
-
- SolrClean.DBFilter() -
Constructor for class org.apache.nutch.indexer.solr.SolrClean.DBFilter
-
- SolrClean.SolrDeleter - Class in org.apache.nutch.indexer.solr
-
- SolrClean.SolrDeleter() -
Constructor for class org.apache.nutch.indexer.solr.SolrClean.SolrDeleter
-
- SolrConstants - Interface in org.apache.nutch.indexer.solr
-
- SolrDeleteDuplicates - Class in org.apache.nutch.indexer.solr
- Utility class for deleting duplicate documents from a solr index.
- SolrDeleteDuplicates() -
Constructor for class org.apache.nutch.indexer.solr.SolrDeleteDuplicates
-
- SolrDeleteDuplicates.SolrInputFormat - Class in org.apache.nutch.indexer.solr
-
- SolrDeleteDuplicates.SolrInputFormat() -
Constructor for class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrInputFormat
-
- SolrDeleteDuplicates.SolrInputSplit - Class in org.apache.nutch.indexer.solr
-
- SolrDeleteDuplicates.SolrInputSplit() -
Constructor for class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrInputSplit
-
- SolrDeleteDuplicates.SolrInputSplit(int, int) -
Constructor for class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrInputSplit
-
- SolrDeleteDuplicates.SolrRecord - Class in org.apache.nutch.indexer.solr
-
- SolrDeleteDuplicates.SolrRecord() -
Constructor for class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrRecord
-
- SolrDeleteDuplicates.SolrRecord(SolrDeleteDuplicates.SolrRecord) -
Constructor for class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrRecord
-
- SolrDeleteDuplicates.SolrRecord(String, float, long) -
Constructor for class org.apache.nutch.indexer.solr.SolrDeleteDuplicates.SolrRecord
-
- SolrIndexer - Class in org.apache.nutch.indexer.solr
-
- SolrIndexer() -
Constructor for class org.apache.nutch.indexer.solr.SolrIndexer
-
- SolrIndexer(Configuration) -
Constructor for class org.apache.nutch.indexer.solr.SolrIndexer
-
- SolrMappingReader - Class in org.apache.nutch.indexer.solr
-
- SolrMappingReader(Configuration) -
Constructor for class org.apache.nutch.indexer.solr.SolrMappingReader
-
- SolrUtils - Class in org.apache.nutch.indexer.solr
-
- SolrUtils() -
Constructor for class org.apache.nutch.indexer.solr.SolrUtils
-
- SolrWriter - Class in org.apache.nutch.indexer.solr
-
- SolrWriter() -
Constructor for class org.apache.nutch.indexer.solr.SolrWriter
-
- SOURCE -
Static variable in interface org.apache.nutch.metadata.DublinCore
- A reference to a resource from which the present resource is derived.
- SpellCheckedMetadata - Class in org.apache.nutch.metadata
- A decorator to Metadata that adds spellchecking capabilities to property
names.
- SpellCheckedMetadata() -
Constructor for class org.apache.nutch.metadata.SpellCheckedMetadata
-
- splitEnd -
Variable in class org.apache.nutch.tools.arc.ArcRecordReader
-
- splitLen -
Variable in class org.apache.nutch.tools.arc.ArcRecordReader
-
- splitStart -
Variable in class org.apache.nutch.tools.arc.ArcRecordReader
-
- start -
Variable in class org.apache.nutch.segment.SegmentReader.SegmentReaderStats
-
- startCDATA() -
Method in class org.apache.nutch.parse.html.DOMBuilder
- Report the start of a CDATA section.
- startDocument() -
Method in class org.apache.nutch.parse.html.DOMBuilder
- Receive notification of the beginning of a document.
- startDTD(String, String, String) -
Method in class org.apache.nutch.parse.html.DOMBuilder
- Report the start of DTD declarations, if any.
- startElement(String, String, String, Attributes) -
Method in class org.apache.nutch.parse.html.DOMBuilder
- Receive notification of the beginning of an element.
- startEntity(String) -
Method in class org.apache.nutch.parse.html.DOMBuilder
- Report the beginning of an entity.
- startPrefixMapping(String, String) -
Method in class org.apache.nutch.parse.html.DOMBuilder
- Begin the scope of a prefix-URI Namespace mapping.
- startUp() -
Method in class org.apache.nutch.plugin.Plugin
- Will be invoked when the plugin starts up.
- StaticFieldIndexer - Class in org.apache.nutch.indexer.staticfield
- A simple plugin called at indexing that adds fields with static data.
- StaticFieldIndexer() -
Constructor for class org.apache.nutch.indexer.staticfield.StaticFieldIndexer
-
- statNames -
Static variable in class org.apache.nutch.crawl.CrawlDatum
-
- STATUS_BLOCKED -
Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_DB_FETCHED -
Static variable in class org.apache.nutch.crawl.CrawlDatum
- Page was successfully fetched.
- STATUS_DB_GONE -
Static variable in class org.apache.nutch.crawl.CrawlDatum
- Page no longer exists.
- STATUS_DB_MAX -
Static variable in class org.apache.nutch.crawl.CrawlDatum
- Maximum value of DB-related status.
- STATUS_DB_NOTMODIFIED -
Static variable in class org.apache.nutch.crawl.CrawlDatum
- Page was successfully fetched and found not modified.
- STATUS_DB_REDIR_PERM -
Static variable in class org.apache.nutch.crawl.CrawlDatum
- Page permanently redirects to other page.
- STATUS_DB_REDIR_TEMP -
Static variable in class org.apache.nutch.crawl.CrawlDatum
- Page temporarily redirects to other page.
- STATUS_DB_UNFETCHED -
Static variable in class org.apache.nutch.crawl.CrawlDatum
- Page was not fetched yet.
- STATUS_FAILED -
Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_FAILURE -
Static variable in class org.apache.nutch.parse.ParseStatus
-
- STATUS_FETCH_GONE -
Static variable in class org.apache.nutch.crawl.CrawlDatum
- Fetching unsuccessful - page is gone.
- STATUS_FETCH_MAX -
Static variable in class org.apache.nutch.crawl.CrawlDatum
- Maximum value of fetch-related status.
- STATUS_FETCH_NOTMODIFIED -
Static variable in class org.apache.nutch.crawl.CrawlDatum
- Fetching successful - page is not modified.
- STATUS_FETCH_REDIR_PERM -
Static variable in class org.apache.nutch.crawl.CrawlDatum
- Fetching permanently redirected to other page.
- STATUS_FETCH_REDIR_TEMP -
Static variable in class org.apache.nutch.crawl.CrawlDatum
- Fetching temporarily redirected to other page.
- STATUS_FETCH_RETRY -
Static variable in class org.apache.nutch.crawl.CrawlDatum
- Fetching unsuccessful, needs to be retried (transient errors).
- STATUS_FETCH_SUCCESS -
Static variable in class org.apache.nutch.crawl.CrawlDatum
- Fetching was successful.
- STATUS_GONE -
Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_INJECTED -
Static variable in class org.apache.nutch.crawl.CrawlDatum
- Page was newly injected.
- STATUS_LINKED -
Static variable in class org.apache.nutch.crawl.CrawlDatum
- Page discovered through a link.
- STATUS_MODIFIED -
Static variable in interface org.apache.nutch.crawl.FetchSchedule
- Page is known to have been modified since our last visit.
- STATUS_NOTFETCHING -
Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_NOTFOUND -
Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_NOTMODIFIED -
Static variable in interface org.apache.nutch.crawl.FetchSchedule
- Page is known to remain unmodified since our last visit.
- STATUS_NOTMODIFIED -
Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_NOTPARSED -
Static variable in class org.apache.nutch.parse.ParseStatus
-
- STATUS_PARSE_META -
Static variable in class org.apache.nutch.crawl.CrawlDatum
- Page got metadata from a parser.
- STATUS_REDIR_EXCEEDED -
Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_RETRY -
Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_ROBOTS_DENIED -
Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_SIGNATURE -
Static variable in class org.apache.nutch.crawl.CrawlDatum
- Page signature.
- STATUS_SUCCESS -
Static variable in class org.apache.nutch.parse.ParseStatus
-
- STATUS_SUCCESS -
Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- STATUS_UNKNOWN -
Static variable in interface org.apache.nutch.crawl.FetchSchedule
- It is unknown whether page was changed since our last visit.
- STATUS_WOULDBLOCK -
Static variable in class org.apache.nutch.protocol.ProtocolStatus
-
- StringUtil - Class in org.apache.nutch.util
- A collection of String processing utility methods.
- StringUtil() -
Constructor for class org.apache.nutch.util.StringUtil
-
- stripNonCharCodepoints(String) -
Static method in class org.apache.nutch.indexer.solr.SolrUtils
-
- Subcollection - Class in org.apache.nutch.collection
- Subcollection represents a subset of the index; you can define URL patterns that
indicate that a particular page (URL) is part of the Subcollection.
- Subcollection(String, String, Configuration) -
Constructor for class org.apache.nutch.collection.Subcollection
- public Constructor
- Subcollection(String, String, String, Configuration) -
Constructor for class org.apache.nutch.collection.Subcollection
- public Constructor
- Subcollection(Configuration) -
Constructor for class org.apache.nutch.collection.Subcollection
-
- SubcollectionIndexingFilter - Class in org.apache.nutch.indexer.subcollection
-
- SubcollectionIndexingFilter() -
Constructor for class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
-
- SubcollectionIndexingFilter(Configuration) -
Constructor for class org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
-
- SUBJECT -
Static variable in interface org.apache.nutch.metadata.DublinCore
- The topic of the content of the resource.
- SUCCESS -
Static variable in class org.apache.nutch.parse.ParseStatus
- Parsing succeeded.
- SUCCESS -
Static variable in class org.apache.nutch.protocol.ProtocolStatus
- Content was retrieved without errors.
- SUCCESS_REDIRECT -
Static variable in class org.apache.nutch.parse.ParseStatus
- Parsed content contains a directive to redirect to another URL.
- SuffixStringMatcher - Class in org.apache.nutch.util
- A class for efficiently matching
Strings against a set
of suffixes.
- SuffixStringMatcher(String[]) -
Constructor for class org.apache.nutch.util.SuffixStringMatcher
- Creates a new
SuffixStringMatcher which will match
Strings with any suffix in the supplied array.
- SuffixStringMatcher(Collection) -
Constructor for class org.apache.nutch.util.SuffixStringMatcher
- Creates a new
SuffixStringMatcher which will match
Strings with any suffix in the supplied
Collection.
- SuffixURLFilter - Class in org.apache.nutch.urlfilter.suffix
- Filters URLs based on a file of URL suffixes.
- SuffixURLFilter() -
Constructor for class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- SuffixURLFilter(Reader) -
Constructor for class org.apache.nutch.urlfilter.suffix.SuffixURLFilter
-
- SWFParser - Class in org.apache.nutch.parse.swf
- Parser for Flash SWF files.
- SWFParser() -
Constructor for class org.apache.nutch.parse.swf.SWFParser
-
StringUtil.toHexString(byte[], String, int), where
sep = null; lineLen = Integer.MAX_VALUE.
sizeLimit bytes, if necessary.
URLFilter implementing plugins.