Uses of Class
org.apache.nutch.storage.WebPage

Packages that use WebPage
org.apache.nutch.analysis.lang Text document language identifier. 
org.apache.nutch.crawl Crawl control code. 
org.apache.nutch.fetcher The Nutch robot. 
org.apache.nutch.host   
org.apache.nutch.indexer Maintain Lucene full-text indexes. 
org.apache.nutch.indexer.anchor An indexing plugin for inbound anchor text. 
org.apache.nutch.indexer.basic A basic indexing plugin. 
org.apache.nutch.indexer.more A more indexing plugin. 
org.apache.nutch.indexer.subcollection   
org.apache.nutch.indexer.tld Top Level Domain Indexing plugin. 
org.apache.nutch.microformats.reltag A microformats Rel-Tag Parser/Indexer/Querier plugin. 
org.apache.nutch.parse   
org.apache.nutch.parse.html An HTML document parsing plugin. 
org.apache.nutch.parse.js   
org.apache.nutch.parse.tika   
org.apache.nutch.protocol   
org.apache.nutch.protocol.file Protocol plugin which supports retrieving local file resources. 
org.apache.nutch.protocol.ftp Protocol plugin which supports retrieving documents via the ftp protocol. 
org.apache.nutch.protocol.http Protocol plugin which supports retrieving documents via the http protocol. 
org.apache.nutch.protocol.http.api Common API used by HTTP plugins (http, httpclient). 
org.apache.nutch.protocol.httpclient Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server. 
org.apache.nutch.protocol.sftp Protocol plugin which supports retrieving documents via the sftp protocol. 
org.apache.nutch.scoring   
org.apache.nutch.scoring.link   
org.apache.nutch.scoring.opic   
org.apache.nutch.scoring.tld Top Level Domain Scoring plugin. 
org.apache.nutch.storage   
org.apache.nutch.util   
org.apache.nutch.util.domain org.apache.nutch.util.domain 
org.creativecommons.nutch Sample plugins that parse and index Creative Commons metadata. 
 

Uses of WebPage in org.apache.nutch.analysis.lang
 

Methods in org.apache.nutch.analysis.lang with parameters of type WebPage
 NutchDocument LanguageIndexingFilter.filter(NutchDocument doc, String url, WebPage page)
           
 Parse HTMLLanguageParser.filter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc)
          Scan the HTML document looking at possible indications of content language.
 

Uses of WebPage in org.apache.nutch.crawl
 

Methods in org.apache.nutch.crawl that return WebPage
 WebPage URLWebPage.getDatum()
           
 

Methods in org.apache.nutch.crawl with parameters of type WebPage
abstract  byte[] Signature.calculate(WebPage page)
           
 byte[] TextProfileSignature.calculate(WebPage page)
           
 byte[] MD5Signature.calculate(WebPage page)
           
 long AbstractFetchSchedule.calculateLastFetchTime(WebPage page)
          This method returns the last fetch time of the CrawlDatum.
 long FetchSchedule.calculateLastFetchTime(WebPage page)
          Calculates last fetch time of the given CrawlDatum.
 void AbstractFetchSchedule.forceRefetch(String url, WebPage page, boolean asap)
          This method resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and page signature, so that it forces refetching.
 void FetchSchedule.forceRefetch(String url, WebPage row, boolean asap)
          This method resets fetchTime, fetchInterval, modifiedTime and page signature, so that it forces refetching.
 int URLPartitioner.SelectorEntryPartitioner.getPartition(GeneratorJob.SelectorEntry selectorEntry, WebPage page, int numReduces)
           
 void AbstractFetchSchedule.initializeSchedule(String url, WebPage page)
          Initialize fetch schedule related data.
 void FetchSchedule.initializeSchedule(String url, WebPage page)
          Initialize fetch schedule related data.
 void GeneratorMapper.map(String reversedUrl, WebPage page, Mapper.Context context)
           
protected  void InjectorJob.InjectorMapper.map(String key, WebPage row, Mapper.Context context)
           
protected  void WebTableReader.WebTableStatMapper.map(String key, WebPage value, Mapper.Context context)
           
protected  void WebTableReader.WebTableRegexMapper.map(String key, WebPage value, Mapper.Context context)
           
 void DbUpdateMapper.map(String key, WebPage page, Mapper.Context context)
           
 void URLWebPage.setDatum(WebPage datum)
           
 void AdaptiveFetchSchedule.setFetchSchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
           
 void AbstractFetchSchedule.setFetchSchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
          Sets the fetchInterval and fetchTime on a successfully fetched page.
 void DefaultFetchSchedule.setFetchSchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
           
 void FetchSchedule.setFetchSchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
          Sets the fetchInterval and fetchTime on a successfully fetched page.
 void AbstractFetchSchedule.setPageGoneSchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime)
          This method specifies how to schedule refetching of pages marked as GONE.
 void FetchSchedule.setPageGoneSchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime)
          This method specifies how to schedule refetching of pages marked as GONE.
 void AbstractFetchSchedule.setPageRetrySchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime)
          This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors.
 void FetchSchedule.setPageRetrySchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime)
          This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors.
 boolean AbstractFetchSchedule.shouldFetch(String url, WebPage page, long curTime)
          This method indicates whether the page is suitable for selection in the current fetchlist.
 boolean FetchSchedule.shouldFetch(String url, WebPage page, long curTime)
          This method indicates whether the page is suitable for selection in the current fetchlist.
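
The FetchSchedule methods listed above describe a small state machine over a page's scheduling fields. The following is a minimal sketch of that idea, not the Nutch implementation: `PageSketch`, `FetchScheduleSketch` and the `retryDelaySeconds` parameter are hypothetical names, only `fetchTime`, `fetchInterval` and `retriesSinceFetch` are modelled, and the exact rules (for example, that a page is due once `fetchTime` is not in the future) are assumptions.

```java
/** Hypothetical stand-in for the scheduling fields of a WebPage; not the Nutch class. */
class PageSketch {
    long fetchTime;        // next time the page is due, in epoch milliseconds
    int fetchInterval;     // seconds between successful fetches
    int retriesSinceFetch; // consecutive failed attempts
}

/** Simplified model of the scheduling calls; all rules here are assumptions. */
class FetchScheduleSketch {
    static final long SECOND_MS = 1000L;

    /** A page is considered due once its fetchTime is not in the future. */
    static boolean shouldFetch(PageSketch page, long curTime) {
        return page.fetchTime <= curTime;
    }

    /** After a successful fetch, schedule the next one a full interval later. */
    static void setFetchSchedule(PageSketch page, long fetchTime) {
        page.retriesSinceFetch = 0;
        page.fetchTime = fetchTime + page.fetchInterval * SECOND_MS;
    }

    /** After a transient error, count the retry and try again after a short delay. */
    static void setPageRetrySchedule(PageSketch page, long fetchTime, int retryDelaySeconds) {
        page.retriesSinceFetch++;
        page.fetchTime = fetchTime + retryDelaySeconds * SECOND_MS;
    }

    /** Make the page due immediately (the other fields reset by forceRefetch are not modelled). */
    static void forceRefetch(PageSketch page, long curTime) {
        page.retriesSinceFetch = 0;
        page.fetchTime = curTime;
    }
}
```

In this sketch a generator would call `shouldFetch` for each candidate page, and the update step would call exactly one of the `set...Schedule` methods depending on the fetch outcome.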
 

Method parameters in org.apache.nutch.crawl with type arguments of type WebPage
protected  void GeneratorReducer.reduce(GeneratorJob.SelectorEntry key, Iterable<WebPage> values, Reducer.Context context)
           
 

Constructors in org.apache.nutch.crawl with parameters of type WebPage
URLWebPage(String url, WebPage datum)
           
 

Uses of WebPage in org.apache.nutch.fetcher
 

Methods in org.apache.nutch.fetcher that return WebPage
 WebPage FetchEntry.getWebPage()
           
 

Methods in org.apache.nutch.fetcher with parameters of type WebPage
protected  void FetcherJob.FetcherMapper.map(String key, WebPage page, Mapper.Context context)
           
 

Constructors in org.apache.nutch.fetcher with parameters of type WebPage
FetchEntry(Configuration conf, String key, WebPage page)
           
 

Uses of WebPage in org.apache.nutch.host
 

Methods in org.apache.nutch.host with parameters of type WebPage
protected  void HostDbUpdateJob.Mapper.map(String key, WebPage value, Mapper.Context context)
           
 

Method parameters in org.apache.nutch.host with type arguments of type WebPage
protected  void HostDbUpdateReducer.reduce(Text key, Iterable<WebPage> values, Reducer.Context context)
           
 

Uses of WebPage in org.apache.nutch.indexer
 

Fields in org.apache.nutch.indexer with type parameters of type WebPage
 org.apache.gora.store.DataStore<String,WebPage> IndexerJob.IndexerMapper.store
           
 

Methods in org.apache.nutch.indexer with parameters of type WebPage
 NutchDocument IndexingFilters.filter(NutchDocument doc, String url, WebPage page)
          Run all defined filters.
 NutchDocument IndexingFilter.filter(NutchDocument doc, String url, WebPage page)
          Adds fields or otherwise modifies the document that will be indexed for a parse.
 NutchDocument IndexUtil.index(String key, WebPage page)
          Index a webpage.
 void IndexerJob.IndexerMapper.map(String key, WebPage page, Mapper.Context context)
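
IndexingFilters.filter is described above as running all defined filters, each of which adds fields to or otherwise modifies the document. A minimal sketch of such a chain follows. `FilterSketch` and `FilterChainSketch` are hypothetical names, the document is modelled as a plain map rather than a NutchDocument, and the convention that a filter returning null discards the document is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an indexing filter: may add fields, or return null to drop the document. */
interface FilterSketch {
    Map<String, String> filter(Map<String, String> doc, String url);
}

/** Hypothetical stand-in for the composite that runs every configured filter. */
class FilterChainSketch {
    private final List<FilterSketch> filters = new ArrayList<>();

    FilterChainSketch add(FilterSketch f) {
        filters.add(f);
        return this;
    }

    /** Runs the filters in order and stops early if one of them rejects the document. */
    Map<String, String> filter(Map<String, String> doc, String url) {
        for (FilterSketch f : filters) {
            doc = f.filter(doc, url);
            if (doc == null) {
                return null; // document discarded, skip the remaining filters
            }
        }
        return doc;
    }
}
```

Each plugin listed on this page (anchor, basic, more, tld and so on) would correspond to one `FilterSketch` in the chain.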
           
 

Uses of WebPage in org.apache.nutch.indexer.anchor
 

Methods in org.apache.nutch.indexer.anchor with parameters of type WebPage
 NutchDocument AnchorIndexingFilter.filter(NutchDocument doc, String url, WebPage page)
           
 

Uses of WebPage in org.apache.nutch.indexer.basic
 

Methods in org.apache.nutch.indexer.basic with parameters of type WebPage
 NutchDocument BasicIndexingFilter.filter(NutchDocument doc, String url, WebPage page)
           
 

Uses of WebPage in org.apache.nutch.indexer.more
 

Methods in org.apache.nutch.indexer.more with parameters of type WebPage
 NutchDocument MoreIndexingFilter.filter(NutchDocument doc, String url, WebPage page)
           
 

Uses of WebPage in org.apache.nutch.indexer.subcollection
 

Methods in org.apache.nutch.indexer.subcollection with parameters of type WebPage
 NutchDocument SubcollectionIndexingFilter.filter(NutchDocument doc, String url, WebPage page)
           
 

Uses of WebPage in org.apache.nutch.indexer.tld
 

Methods in org.apache.nutch.indexer.tld with parameters of type WebPage
 NutchDocument TLDIndexingFilter.filter(NutchDocument doc, String url, WebPage page)
           
 

Uses of WebPage in org.apache.nutch.microformats.reltag
 

Methods in org.apache.nutch.microformats.reltag with parameters of type WebPage
 NutchDocument RelTagIndexingFilter.filter(NutchDocument doc, String url, WebPage page)
           
 Parse RelTagParser.filter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc)
           
 

Uses of WebPage in org.apache.nutch.parse
 

Methods in org.apache.nutch.parse with parameters of type WebPage
 Parse ParseFilters.filter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc)
          Run all defined filters.
 Parse ParseFilter.filter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc)
          Adds metadata or otherwise modifies a parse, given the DOM tree of a page.
 Parse Parser.getParse(String url, WebPage page)
          This method parses the content of a WebPage instance.
static boolean ParserJob.isTruncated(String url, WebPage page)
          Checks if the page's content is truncated.
 void ParserJob.ParserMapper.map(String key, WebPage page, Mapper.Context context)
           
 Parse ParseUtil.parse(String url, WebPage page)
          Performs a parse by iterating through a List of preferred Parsers until a successful parse is performed and a Parse object is returned.
 URLWebPage ParseUtil.process(String key, WebPage page)
          Parses given web page and stores parsed content within page.
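
ParseUtil.parse is described above as iterating through a list of preferred parsers until one succeeds. A minimal sketch of that first-success loop follows. `ParserSketch` and `ParseUtilSketch` are hypothetical names, the parse result is reduced to an `Optional<String>`, and skipping a parser that throws is an assumption of this sketch rather than documented behaviour.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical parser: returns the parsed text, or empty when it cannot handle the content. */
interface ParserSketch {
    Optional<String> getParse(String url, String content);
}

class ParseUtilSketch {
    /** Tries each parser in preference order and returns the first successful result. */
    static Optional<String> parse(List<ParserSketch> preferred, String url, String content) {
        for (ParserSketch parser : preferred) {
            Optional<String> result;
            try {
                result = parser.getParse(url, content);
            } catch (RuntimeException e) {
                continue; // assumed: a failing parser does not stop the remaining ones
            }
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty(); // no parser produced a result
    }
}
```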
 

Uses of WebPage in org.apache.nutch.parse.html
 

Methods in org.apache.nutch.parse.html with parameters of type WebPage
 Parse HtmlParser.getParse(String url, WebPage page)
           
 

Uses of WebPage in org.apache.nutch.parse.js
 

Methods in org.apache.nutch.parse.js with parameters of type WebPage
 Parse JSParseFilter.filter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc)
           
 Parse JSParseFilter.getParse(String url, WebPage page)
           
 

Uses of WebPage in org.apache.nutch.parse.tika
 

Methods in org.apache.nutch.parse.tika with parameters of type WebPage
 Parse TikaParser.getParse(String url, WebPage page)
           
 

Uses of WebPage in org.apache.nutch.protocol
 

Methods in org.apache.nutch.protocol with parameters of type WebPage
 ProtocolOutput Protocol.getProtocolOutput(String url, WebPage page)
          Returns the Content for a fetchlist entry.
 RobotRules Protocol.getRobotRules(String url, WebPage page)
          Retrieve robot rules applicable for this url.
 

Uses of WebPage in org.apache.nutch.protocol.file
 

Methods in org.apache.nutch.protocol.file with parameters of type WebPage
 ProtocolOutput File.getProtocolOutput(String url, WebPage page)
           
 RobotRules File.getRobotRules(String url, WebPage page)
           
 

Constructors in org.apache.nutch.protocol.file with parameters of type WebPage
FileResponse(URL url, WebPage page, File file, Configuration conf)
           
 

Uses of WebPage in org.apache.nutch.protocol.ftp
 

Methods in org.apache.nutch.protocol.ftp with parameters of type WebPage
 ProtocolOutput Ftp.getProtocolOutput(String url, WebPage page)
           
 RobotRules Ftp.getRobotRules(String url, WebPage page)
           
 

Constructors in org.apache.nutch.protocol.ftp with parameters of type WebPage
FtpResponse(URL url, WebPage page, Ftp ftp, Configuration conf)
           
 

Uses of WebPage in org.apache.nutch.protocol.http
 

Methods in org.apache.nutch.protocol.http with parameters of type WebPage
protected  Response Http.getResponse(URL url, WebPage page, boolean redirect)
           
 

Constructors in org.apache.nutch.protocol.http with parameters of type WebPage
HttpResponse(HttpBase http, URL url, WebPage page)
           
 

Uses of WebPage in org.apache.nutch.protocol.http.api
 

Methods in org.apache.nutch.protocol.http.api with parameters of type WebPage
 ProtocolOutput HttpBase.getProtocolOutput(String url, WebPage page)
           
protected abstract  Response HttpBase.getResponse(URL url, WebPage page, boolean followRedirects)
           
 RobotRules HttpBase.getRobotRules(String url, WebPage page)
           
 

Uses of WebPage in org.apache.nutch.protocol.httpclient
 

Methods in org.apache.nutch.protocol.httpclient with parameters of type WebPage
protected  Response Http.getResponse(URL url, WebPage page, boolean redirect)
          Fetches the url with a configured HTTP client and gets the response.
 

Uses of WebPage in org.apache.nutch.protocol.sftp
 

Methods in org.apache.nutch.protocol.sftp with parameters of type WebPage
 ProtocolOutput Sftp.getProtocolOutput(String url, WebPage page)
           
 RobotRules Sftp.getRobotRules(String url, WebPage page)
           
 

Uses of WebPage in org.apache.nutch.scoring
 

Methods in org.apache.nutch.scoring with parameters of type WebPage
 void ScoringFilter.distributeScoreToOutlinks(String fromUrl, WebPage page, Collection<ScoreDatum> scoreData, int allCount)
          Distribute score value from the current page to all its outlinked pages.
 void ScoringFilters.distributeScoreToOutlinks(String fromUrl, WebPage row, Collection<ScoreDatum> scoreData, int allCount)
           
 float ScoringFilter.generatorSortValue(String url, WebPage page, float initSort)
          This method prepares a sort value for the purpose of sorting and selecting top N scoring pages during fetchlist generation.
 float ScoringFilters.generatorSortValue(String url, WebPage row, float initSort)
          Calculate a sort value for Generate.
 float ScoringFilter.indexerScore(String url, NutchDocument doc, WebPage page, float initScore)
          This method calculates a Lucene document boost.
 float ScoringFilters.indexerScore(String url, NutchDocument doc, WebPage row, float initScore)
           
 void ScoringFilter.initialScore(String url, WebPage page)
          Set an initial score for newly discovered pages.
 void ScoringFilters.initialScore(String url, WebPage row)
          Calculate a new initial score, used when adding newly discovered pages.
 void ScoringFilter.injectedScore(String url, WebPage page)
          Set an initial score for newly injected pages.
 void ScoringFilters.injectedScore(String url, WebPage row)
          Calculate a new initial score, used when injecting new pages.
 void ScoringFilter.updateScore(String url, WebPage page, List<ScoreDatum> inlinkedScoreData)
          This method calculates a new score during table update, based on the values contributed by inlinked pages.
 void ScoringFilters.updateScore(String url, WebPage row, List<ScoreDatum> inlinkedScoreData)
           
 

Uses of WebPage in org.apache.nutch.scoring.link
 

Methods in org.apache.nutch.scoring.link with parameters of type WebPage
 void LinkAnalysisScoringFilter.distributeScoreToOutlinks(String fromUrl, WebPage page, Collection<ScoreDatum> scoreData, int allCount)
           
 float LinkAnalysisScoringFilter.generatorSortValue(String url, WebPage page, float initSort)
           
 float LinkAnalysisScoringFilter.indexerScore(String url, NutchDocument doc, WebPage page, float initScore)
           
 void LinkAnalysisScoringFilter.initialScore(String url, WebPage page)
           
 void LinkAnalysisScoringFilter.injectedScore(String url, WebPage page)
           
 void LinkAnalysisScoringFilter.updateScore(String url, WebPage page, List<ScoreDatum> inlinkedScoreData)
           
 

Uses of WebPage in org.apache.nutch.scoring.opic
 

Methods in org.apache.nutch.scoring.opic with parameters of type WebPage
 void OPICScoringFilter.distributeScoreToOutlinks(String fromUrl, WebPage row, Collection<ScoreDatum> scoreData, int allCount)
          Get cash on hand, divide it by the number of outlinks and apply.
 float OPICScoringFilter.generatorSortValue(String url, WebPage row, float initSort)
          Use getScore().
 float OPICScoringFilter.indexerScore(String url, NutchDocument doc, WebPage row, float initScore)
          Dampen the boost value by scorePower.
 void OPICScoringFilter.initialScore(String url, WebPage row)
          Set to 0.0f (unknown value) - inlink contributions will bring it to a correct level.
 void OPICScoringFilter.injectedScore(String url, WebPage row)
           
 void OPICScoringFilter.updateScore(String url, WebPage row, List<ScoreDatum> inlinkedScoreData)
          Increase the score by a sum of inlinked scores.
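
The OPICScoringFilter descriptions above amount to three small pieces of arithmetic. The sketch below works them through on plain floats. `OpicSketch` and its method names are hypothetical, and the exact formulas (an even split of cash across outlinks, and the score raised to `scorePower` then multiplied by the initial boost) are assumptions based on the one-line descriptions, not the plugin's source.

```java
/** Hypothetical arithmetic-only model of the OPIC scoring steps. */
class OpicSketch {
    /** Share handed to each outlink: the page's cash split evenly. */
    static float scorePerOutlink(float cash, int outlinkCount) {
        if (outlinkCount <= 0) {
            return 0.0f; // nothing to distribute to
        }
        return cash / outlinkCount;
    }

    /** New score: the old score plus the sum of contributions from inlinking pages. */
    static float updateScore(float oldScore, float[] inlinkedScores) {
        float sum = 0.0f;
        for (float s : inlinkedScores) {
            sum += s;
        }
        return oldScore + sum;
    }

    /** Index-time boost: the score dampened by scorePower, applied to the initial boost. */
    static float indexerScore(float score, float scorePower, float initScore) {
        return (float) Math.pow(score, scorePower) * initScore;
    }
}
```

For example, a page holding 1.0 of cash with four outlinks would pass 0.25 to each, and with `scorePower` set to 0.5 a score of 4.0 would be dampened to a boost of 2.0.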
 

Uses of WebPage in org.apache.nutch.scoring.tld
 

Methods in org.apache.nutch.scoring.tld with parameters of type WebPage
 void TLDScoringFilter.distributeScoreToOutlinks(String fromUrl, WebPage page, Collection<ScoreDatum> scoreData, int allCount)
           
 float TLDScoringFilter.generatorSortValue(String url, WebPage page, float initSort)
           
 float TLDScoringFilter.indexerScore(String url, NutchDocument doc, WebPage page, float initScore)
           
 void TLDScoringFilter.initialScore(String url, WebPage page)
           
 void TLDScoringFilter.injectedScore(String url, WebPage page)
           
 void TLDScoringFilter.updateScore(String url, WebPage page, List<ScoreDatum> inlinkedScoreData)
           
 

Uses of WebPage in org.apache.nutch.storage
 

Methods in org.apache.nutch.storage that return WebPage
 WebPage WebPage.newInstance(org.apache.gora.persistency.StateManager stateManager)
           
 

Methods in org.apache.nutch.storage with parameters of type WebPage
 org.apache.avro.util.Utf8 Mark.checkMark(WebPage page)
           
 void Mark.putMark(WebPage page, String markValue)
           
 void Mark.putMark(WebPage page, org.apache.avro.util.Utf8 markValue)
           
 org.apache.avro.util.Utf8 Mark.removeMark(WebPage page)
           
 org.apache.avro.util.Utf8 Mark.removeMarkIfExist(WebPage page)
          Remove the mark only if the mark is present on the page.
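
The Mark methods above read, write and clear a named marker on a page. A minimal sketch of the "remove only if present" behaviour follows. `MarkSketch` is a hypothetical name, and the page's marks are modelled as a plain `Map<String, String>` instead of the Utf8-valued storage used by WebPage.

```java
import java.util.Map;

/** Hypothetical model of one named mark; the page's marks are a simple name-to-value map. */
class MarkSketch {
    private final String name;

    MarkSketch(String name) {
        this.name = name;
    }

    /** Sets or overwrites the mark on the page. */
    void putMark(Map<String, String> marks, String value) {
        marks.put(name, value);
    }

    /** Returns the mark's value, or null when the page does not carry it. */
    String checkMark(Map<String, String> marks) {
        return marks.get(name);
    }

    /** Removes the mark only if it is present; returns the removed value, or null otherwise. */
    String removeMarkIfExist(Map<String, String> marks) {
        if (checkMark(marks) != null) {
            return marks.remove(name);
        }
        return null;
    }
}
```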
 

Method parameters in org.apache.nutch.storage with type arguments of type WebPage
static <K,V> void StorageUtils.initMapperJob(Job job, Collection<WebPage.Field> fields, Class<K> outKeyClass, Class<V> outValueClass, Class<? extends org.apache.gora.mapreduce.GoraMapper<String,WebPage,K,V>> mapperClass)
           
static <K,V> void StorageUtils.initMapperJob(Job job, Collection<WebPage.Field> fields, Class<K> outKeyClass, Class<V> outValueClass, Class<? extends org.apache.gora.mapreduce.GoraMapper<String,WebPage,K,V>> mapperClass, boolean reuseObjects)
           
static <K,V> void StorageUtils.initMapperJob(Job job, Collection<WebPage.Field> fields, Class<K> outKeyClass, Class<V> outValueClass, Class<? extends org.apache.gora.mapreduce.GoraMapper<String,WebPage,K,V>> mapperClass, Class<? extends Partitioner<K,V>> partitionerClass)
           
static <K,V> void StorageUtils.initMapperJob(Job job, Collection<WebPage.Field> fields, Class<K> outKeyClass, Class<V> outValueClass, Class<? extends org.apache.gora.mapreduce.GoraMapper<String,WebPage,K,V>> mapperClass, Class<? extends Partitioner<K,V>> partitionerClass, boolean reuseObjects)
           
static <K,V> void StorageUtils.initReducerJob(Job job, Class<? extends org.apache.gora.mapreduce.GoraReducer<K,V,String,WebPage>> reducerClass)
           
 

Uses of WebPage in org.apache.nutch.util
 

Methods in org.apache.nutch.util that return WebPage
 WebPage WebPageWritable.getWebPage()
           
 

Methods in org.apache.nutch.util with parameters of type WebPage
 void EncodingDetector.autoDetectClues(WebPage page, boolean filter)
           
 String EncodingDetector.guessEncoding(WebPage page, String defaultValue)
          Guess the encoding with the previously specified list of clues.
 void WebPageWritable.setWebPage(WebPage webPage)
           
 

Method parameters in org.apache.nutch.util with type arguments of type WebPage
protected  void IdentityPageReducer.reduce(String key, Iterable<WebPage> values, Reducer.Context context)
           
 

Constructors in org.apache.nutch.util with parameters of type WebPage
WebPageWritable(Configuration conf, WebPage webPage)
           
 

Uses of WebPage in org.apache.nutch.util.domain
 

Methods in org.apache.nutch.util.domain with parameters of type WebPage
protected  void DomainStatistics.DomainStatisticsMapper.map(String key, WebPage value, Mapper.Context context)
           
 

Uses of WebPage in org.creativecommons.nutch
 

Methods in org.creativecommons.nutch with parameters of type WebPage
 NutchDocument CCIndexingFilter.filter(NutchDocument doc, String url, WebPage page)
           
 Parse CCParseFilter.filter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc)
          Adds metadata or otherwise modifies a parse of an HTML document, given the DOM tree of a page.
static void CCParseFilter.Walker.walk(Node doc, URL base, WebPage page, Configuration conf)
          Scan the document adding attributes to metadata.
 



Copyright © 2012 The Apache Software Foundation