Utility class for deleting duplicate documents from a Solr index.
The algorithm works as follows:
Query the Solr server for the number of documents (say, N).
Partition N among M map tasks. For example, with two map tasks, the
first map task handles Solr documents 0 to (N / 2 - 1) and the
second handles documents (N / 2) to (N - 1).
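The partitioning step can be illustrated with a minimal sketch. It is not the class's actual split logic: the class name SolrPartitionSketch and the method partition are hypothetical, and the choice to give any remainder of N / M to the last map task is an assumption.

```java
public final class SolrPartitionSketch {

    /**
     * Splits document offsets 0..numDocs-1 into numSplits contiguous,
     * inclusive ranges. ranges[i][0] is the first offset and ranges[i][1]
     * the last offset of map task i. A range whose end is smaller than its
     * start is empty (this happens when numDocs < numSplits).
     */
    static long[][] partition(long numDocs, int numSplits) {
        if (numDocs < 0 || numSplits <= 0) {
            throw new IllegalArgumentException("numDocs must be >= 0 and numSplits > 0");
        }
        long[][] ranges = new long[numSplits][2];
        long perSplit = numDocs / numSplits;
        long start = 0;
        for (int i = 0; i < numSplits; i++) {
            // Assumption: the last task also takes the remainder of N / M.
            long end = (i == numSplits - 1) ? numDocs - 1 : start + perSplit - 1;
            ranges[i][0] = start;
            ranges[i][1] = end;
            start = end + 1;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // N = 10, M = 2 reproduces the example: 0 - (N / 2 - 1) and (N / 2) - (N - 1).
        for (long[] r : partition(10, 2)) {
            System.out.println(r[0] + " - " + r[1]);
        }
    }
}
```

In a real job, each range would presumably become the offset and row count of one paged query against the Solr server.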
Reduce: After the map phase, SolrDeleteDuplicates.SolrRecord instances with the same digest are
grouped together. Of the documents that share a digest, all are deleted
except the one with the highest score (boost field). If two or more
documents have the same score, the one with the latest timestamp is
kept. Every other document is deleted from the Solr index.
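The selection rule in the reduce step can be sketched as follows. This is an illustration under assumptions, not the class's reducer: Rec is a hypothetical stand-in for SolrDeleteDuplicates.SolrRecord, the field names are invented, and the timestamp is assumed to be a long where a larger value means later.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public final class DedupReduceSketch {

    /** Hypothetical stand-in for a record carrying id, boost and timestamp. */
    static final class Rec {
        final String id;
        final float boost;
        final long tstamp;

        Rec(String id, float boost, long tstamp) {
            this.id = id;
            this.boost = boost;
            this.tstamp = tstamp;
        }
    }

    // Higher boost wins; on equal boost, the later timestamp wins.
    private static final Comparator<Rec> PREFERENCE =
        Comparator.comparingDouble((Rec r) -> r.boost)
                  .thenComparingLong(r -> r.tstamp);

    /**
     * Given all records that share one digest, returns the ids that should
     * be deleted, i.e. every record except the preferred one.
     */
    static List<String> idsToDelete(List<Rec> sameDigest) {
        List<String> toDelete = new ArrayList<>();
        if (sameDigest.size() < 2) {
            return toDelete; // nothing to deduplicate
        }
        Rec keep = Collections.max(sameDigest, PREFERENCE);
        for (Rec r : sameDigest) {
            if (r != keep) {
                toDelete.add(r.id);
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        List<Rec> group = new ArrayList<>();
        group.add(new Rec("a", 1.0f, 100L));
        group.add(new Rec("b", 2.0f, 50L));
        group.add(new Rec("c", 2.0f, 70L));
        // "c" has the highest boost and, among the tied records, the latest timestamp.
        System.out.println(idsToDelete(group));
    }
}
```

The returned ids would then be sent to the Solr server as delete requests; that part is omitted here because it depends on the client library in use.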
Note that we assume that two documents in a Solr index will never have the
same URL. This class therefore only deals with documents that have different URLs
but the same digest.