org.apache.nutch.collection (apache-nutch 1.21 API)

A subcollection is a subset of an index. Subcollections are defined by URL patterns in the form of a whitelist and a blacklist: for a page to be included in a subcollection, its URL must match the whitelist and must not match the blacklist.
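The membership rule can be illustrated with a minimal, self-contained sketch. It is not the Nutch implementation: the class SubcollectionSketch and its method contains are hypothetical names, the blacklist pattern "/private/" is invented for the example, and the sketch assumes that a pattern matches when it occurs as a plain substring of the URL, which this page does not specify.

```java
import java.util.List;

// Hypothetical illustration class; not part of org.apache.nutch.collection.
public class SubcollectionSketch {
    private final List<String> whitelist;
    private final List<String> blacklist;

    public SubcollectionSketch(List<String> whitelist, List<String> blacklist) {
        this.whitelist = whitelist;
        this.blacklist = blacklist;
    }

    /** Returns true if the URL matches a whitelist pattern and no blacklist pattern. */
    public boolean contains(String url) {
        // Assumption: a pattern "matches" when it is a substring of the URL.
        // Check the blacklist first: any hit excludes the page.
        for (String pattern : blacklist) {
            if (url.contains(pattern)) {
                return false;
            }
        }
        // The page is included only if at least one whitelist pattern matches.
        for (String pattern : whitelist) {
            if (url.contains(pattern)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SubcollectionSketch nutch = new SubcollectionSketch(
                List.of("https://nutch.apache.org"),
                List.of("/private/")); // hypothetical blacklist pattern

        System.out.println(nutch.contains("https://nutch.apache.org/docs/"));     // true
        System.out.println(nutch.contains("https://nutch.apache.org/private/x")); // false
        System.out.println(nutch.contains("https://tika.apache.org/"));           // false
    }
}
```

Checking the blacklist first makes the precedence explicit: a URL that matches both lists is excluded.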

Subcollection definitions are read from the file subcollections.xml. The example below assumes that you are crawling all the virtual hosts of apache.org and want to tag pages matching the URL patterns "https://nutch.apache.org" and "https://cwiki.apache.org/confluence/display/nutch" as part of the subcollection "nutch", so that you can later restrict searches to this subcollection:

 
 <?xml version="1.0" encoding="UTF-8"?>
 <subcollections>
  <subcollection>
   <name>nutch</name>
   <id>nutch</id>
   <whitelist>https://nutch.apache.org</whitelist>
   <whitelist>https://cwiki.apache.org/confluence/display/nutch</whitelist>
   <blacklist />
  </subcollection>
 </subcollections>
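
To show how this structure maps to data, the following sketch reads a definition of the same shape with the standard Java DOM parser and returns the whitelist patterns of a named subcollection. It is an illustration only and makes no claim about how Nutch itself loads the file: the class SubcollectionsXmlSketch and its methods are hypothetical, and the element names are taken from the sample above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration class; not part of org.apache.nutch.collection.
public class SubcollectionsXmlSketch {

    /** Collects the trimmed, non-empty text of every descendant element with the given tag. */
    static List<String> texts(Element parent, String tag) {
        List<String> out = new ArrayList<>();
        NodeList nodes = parent.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            String text = nodes.item(i).getTextContent().trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
        return out;
    }

    /** Returns the whitelist patterns of the subcollection with the given name, or an empty list. */
    static List<String> whitelistOf(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList subs = doc.getElementsByTagName("subcollection");
            for (int i = 0; i < subs.getLength(); i++) {
                Element sub = (Element) subs.item(i);
                if (texts(sub, "name").contains(name)) {
                    return texts(sub, "whitelist");
                }
            }
            return List.of();
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked exceptions.
            throw new IllegalStateException("Could not parse subcollection definitions", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<subcollections><subcollection>"
                + "<name>nutch</name><id>nutch</id>"
                + "<whitelist>https://nutch.apache.org</whitelist>"
                + "<whitelist>https://cwiki.apache.org/confluence/display/nutch</whitelist>"
                + "<blacklist />"
                + "</subcollection></subcollections>";

        // Print each whitelist pattern of the "nutch" subcollection on its own line.
        for (String pattern : whitelistOf(xml, "nutch")) {
            System.out.println(pattern);
        }
    }
}
```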

Regardless of this configuration, you can still crawl any URLs as long as they pass your global URL filters. Note that you must also seed your URLs in the normal Nutch way.

Class Summary
Class	Description
CollectionManager
Subcollection	A Subcollection represents a subset of an index; you can define URL patterns that indicate whether a particular page (URL) is part of the Subcollection.
