Package org.apache.nutch.collection
A subcollection is a subset of an index. Subcollections are defined by URL patterns in the form of a whitelist and a blacklist: to become part of a subcollection, a page's URL must match the whitelist and must not match the blacklist.
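The whitelist/blacklist rule can be illustrated with a minimal sketch. The class SubcollectionSketch and its contains method are hypothetical names chosen for this example, and the sketch assumes that a pattern matches when it occurs as a substring of the URL. It is not the Subcollection class of this package.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of the membership rule; not the Nutch implementation.
final class SubcollectionSketch {
    private final String name;
    private final List<String> whitelist;
    private final List<String> blacklist;

    SubcollectionSketch(String name, List<String> whitelist, List<String> blacklist) {
        this.name = name;
        this.whitelist = new ArrayList<>(whitelist);
        this.blacklist = new ArrayList<>(blacklist);
    }

    String name() {
        return name;
    }

    /** Returns true when the URL matches a whitelist pattern and no blacklist pattern. */
    boolean contains(String url) {
        // Assumption: a pattern matches when it is a substring of the URL.
        for (String pattern : blacklist) {
            if (url.contains(pattern)) {
                return false; // a blacklist match always excludes the page
            }
        }
        for (String pattern : whitelist) {
            if (url.contains(pattern)) {
                return true;
            }
        }
        return false; // not whitelisted, so not part of the subcollection
    }

    public static void main(String[] args) {
        SubcollectionSketch nutch = new SubcollectionSketch(
                "nutch",
                Arrays.asList(
                        "https://nutch.apache.org",
                        "https://cwiki.apache.org/confluence/display/nutch"),
                Collections.<String>emptyList());
        System.out.println(nutch.contains("https://nutch.apache.org/download/")); // true
        System.out.println(nutch.contains("https://tika.apache.org/"));           // false
    }
}
```

The blacklist is checked first in this sketch so that an excluded URL is rejected even when it also matches a whitelist pattern.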
Subcollection definitions are read from the file subcollections.xml. The format is shown below. The example assumes that you are crawling all the virtual hosts of apache.org and want to tag pages whose URLs match the patterns "https://nutch.apache.org" and "https://cwiki.apache.org/confluence/display/nutch" as part of the subcollection "nutch", which lets you later search within this subcollection specifically:
<?xml version="1.0" encoding="UTF-8"?>
<subcollections>
  <subcollection>
    <name>nutch</name>
    <id>nutch</id>
    <whitelist>https://nutch.apache.org</whitelist>
    <whitelist>https://cwiki.apache.org/confluence/display/nutch</whitelist>
    <blacklist />
  </subcollection>
</subcollections>
Regardless of this configuration, you can still crawl any URLs as long as they pass your global URL filters. (Note that you must also seed your URLs in the normal Nutch way.)
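To make the file format concrete, the following sketch reads a subcollections.xml document with the JDK's built-in DOM parser and summarizes each definition. The class SubcollectionsXmlSketch and its methods texts and describe are hypothetical names used only for illustration. The sketch does not show how CollectionManager loads the file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration of reading the format; not the Nutch loader.
final class SubcollectionsXmlSketch {

    /** Collects the non-empty text of every descendant element with the given tag name. */
    static List<String> texts(Element parent, String tag) {
        List<String> result = new ArrayList<>();
        NodeList nodes = parent.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            String text = nodes.item(i).getTextContent().trim();
            if (!text.isEmpty()) {
                result.add(text); // an empty <blacklist /> contributes nothing
            }
        }
        return result;
    }

    /** Parses the XML text and returns one summary line per subcollection element. */
    static List<String> describe(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        List<String> lines = new ArrayList<>();
        NodeList subs = doc.getElementsByTagName("subcollection");
        for (int i = 0; i < subs.getLength(); i++) {
            Element sub = (Element) subs.item(i);
            lines.add("id=" + texts(sub, "id")
                    + " whitelist=" + texts(sub, "whitelist")
                    + " blacklist=" + texts(sub, "blacklist"));
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<subcollections><subcollection>"
                + "<name>nutch</name><id>nutch</id>"
                + "<whitelist>https://nutch.apache.org</whitelist>"
                + "<whitelist>https://cwiki.apache.org/confluence/display/nutch</whitelist>"
                + "<blacklist />"
                + "</subcollection></subcollections>";
        for (String line : describe(xml)) {
            System.out.println(line);
        }
    }
}
```

Note that the XML declaration must be the very first characters of the document; a parser rejects a file that has whitespace or other content before it.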
Class Summary

Class               Description
CollectionManager
Subcollection       SubCollection represents a subset of an index; you can define URL patterns that indicate that a particular page (URL) is part of the SubCollection.