Package org.apache.nutch.collection

A subcollection is a subset of an index. Subcollections are defined by URL patterns given as a whitelist and a blacklist: to be included in a subcollection, a page's URL must match the whitelist and must not match the blacklist.
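
The membership rule can be sketched in a few lines of Java. This is a minimal illustration, not the plugin's actual code: the class name SubcollectionSketch is hypothetical, patterns are assumed to be matched as plain substrings of the URL, and the "/apidocs/" blacklist entry is invented for the example.

```java
import java.util.List;

// Hypothetical sketch, not the Nutch implementation. Assumption: each
// pattern is matched as a plain substring of the URL.
final class SubcollectionSketch {
    private final List<String> whitelist;
    private final List<String> blacklist;

    SubcollectionSketch(List<String> whitelist, List<String> blacklist) {
        this.whitelist = whitelist;
        this.blacklist = blacklist;
    }

    // A URL belongs to the subcollection when it matches at least one
    // whitelist pattern and no blacklist pattern.
    boolean contains(String url) {
        boolean whitelisted = whitelist.stream().anyMatch(url::contains);
        boolean blacklisted = blacklist.stream().anyMatch(url::contains);
        return whitelisted && !blacklisted;
    }

    public static void main(String[] args) {
        SubcollectionSketch nutch = new SubcollectionSketch(
            List.of("https://nutch.apache.org",
                    "https://cwiki.apache.org/confluence/display/nutch"),
            List.of("/apidocs/")); // invented blacklist entry, for illustration only

        System.out.println(nutch.contains("https://nutch.apache.org/index.html"));
        System.out.println(nutch.contains("https://nutch.apache.org/apidocs/overview.html"));
        System.out.println(nutch.contains("https://tika.apache.org/"));
    }
}
```

With an empty blacklist, as in the configuration example in this document, the rule reduces to the whitelist check alone.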

Subcollection definitions are read from the file subcollections.xml. The format is shown below. In this example, imagine that you are crawling all the virtual hosts of apache.org and want to tag pages whose URLs match "https://nutch.apache.org" or "https://cwiki.apache.org/confluence/display/nutch" as part of the subcollection "nutch", so that you can later restrict a search to this subcollection:

 
 <?xml version="1.0" encoding="UTF-8"?>
 <subcollections>
  <subcollection>
   <name>nutch</name>
   <id>nutch</id>
   <whitelist>https://nutch.apache.org</whitelist>
   <whitelist>https://cwiki.apache.org/confluence/display/nutch</whitelist>
   <blacklist />
  </subcollection>
 </subcollections>
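
To show how a definition in this format can be read, here is a minimal sketch that uses only the JDK's DOM parser. It is not how the plugin itself loads the file: the class SubcollectionsXmlSketch and its helper methods are hypothetical, and empty elements such as <blacklist /> are assumed to contribute no patterns.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch, not the Nutch loader.
final class SubcollectionsXmlSketch {

    // Collects the trimmed, non-empty text of every descendant element
    // with the given tag name (assumption: empty elements add nothing).
    static List<String> texts(Element parent, String tag) {
        List<String> out = new ArrayList<>();
        NodeList nodes = parent.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            String text = nodes.item(i).getTextContent().trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
        return out;
    }

    // Parses the XML and returns the first <subcollection> element.
    static Element parseFirst(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return (Element) doc.getElementsByTagName("subcollection").item(0);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse subcollections XML", e);
        }
    }

    public static void main(String[] args) {
        String xml =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<subcollections><subcollection>"
            + "<name>nutch</name><id>nutch</id>"
            + "<whitelist>https://nutch.apache.org</whitelist>"
            + "<whitelist>https://cwiki.apache.org/confluence/display/nutch</whitelist>"
            + "<blacklist />"
            + "</subcollection></subcollections>";

        Element sub = parseFirst(xml);
        System.out.println("name:      " + texts(sub, "name"));
        System.out.println("whitelist: " + texts(sub, "whitelist"));
        System.out.println("blacklist: " + texts(sub, "blacklist"));
    }
}
```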
 
 

Regardless of this configuration, you can still crawl any URLs as long as they pass your global URL filters. (Note that you must also seed your URLs in the normal Nutch way.)