Utility class for deleting duplicate documents from a Solr index.
The algorithm works as follows:
Query the Solr server for the number of documents (say, N).
Map: Partition N among M map tasks. For example, with two map tasks,
the first map task handles Solr documents 0 to (N / 2 - 1) and
the second handles documents (N / 2) to (N - 1).
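The partitioning step can be sketched as follows. This is a minimal illustration, not the class's actual split logic: the names SplitSketch and partition are hypothetical, and giving the remainder to the last task when N is not divisible by M is an assumption the description above does not spell out.

```java
final class SplitSketch {
    /**
     * Splits document indices 0..numDocs-1 into numTasks inclusive ranges.
     * Each row is {first, last}. Assumption: the last task absorbs the
     * remainder when numDocs is not evenly divisible by numTasks.
     */
    static long[][] partition(long numDocs, int numTasks) {
        if (numTasks <= 0) {
            throw new IllegalArgumentException("numTasks must be positive");
        }
        long perTask = numDocs / numTasks;
        long[][] ranges = new long[numTasks][2];
        for (int i = 0; i < numTasks; i++) {
            long first = i * perTask;
            // A range with last < first is empty (happens when numDocs < numTasks).
            long last = (i == numTasks - 1) ? numDocs - 1 : first + perTask - 1;
            ranges[i][0] = first;
            ranges[i][1] = last;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] r : partition(10, 2)) {
            System.out.println(r[0] + " to " + r[1]);
        }
    }
}
```

With N = 10 and M = 2 this yields the ranges 0 to 4 and 5 to 9, matching the two-task example above.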
Reduce: After the map phase, SolrDeleteDuplicates.SolrRecord instances with the
same digest are grouped together. Of the documents that share a digest, all are
deleted except the one with the highest score (boost field). If two
(or more) documents have the same score, the document with the latest
timestamp is kept, and every other document is deleted from the Solr index.
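The selection rule in the reduce step can be sketched as follows. This is an illustration under stated assumptions: Rec is a hypothetical stand-in for SolrDeleteDuplicates.SolrRecord, its field names are invented for the example, and the sketch returns the ids to delete instead of issuing delete requests to Solr.

```java
import java.util.ArrayList;
import java.util.List;

final class DedupSketch {
    /** Hypothetical stand-in for a record carrying id, boost, and timestamp. */
    static final class Rec {
        final String id;
        final float boost;
        final long tstamp;

        Rec(String id, float boost, long tstamp) {
            this.id = id;
            this.boost = boost;
            this.tstamp = tstamp;
        }
    }

    /** Returns the ids to delete from one group of records sharing a digest. */
    static List<String> idsToDelete(List<Rec> sameDigest) {
        Rec keep = null;
        for (Rec r : sameDigest) {
            // Prefer the higher boost; on equal boost, prefer the later timestamp.
            if (keep == null
                    || r.boost > keep.boost
                    || (r.boost == keep.boost && r.tstamp > keep.tstamp)) {
                keep = r;
            }
        }
        List<String> toDelete = new ArrayList<>();
        for (Rec r : sameDigest) {
            if (r != keep) {
                toDelete.add(r.id);
            }
        }
        return toDelete;
    }
}
```

For a group holding a (boost 1.0), b (boost 2.0, timestamp 50), and c (boost 2.0, timestamp 80), the sketch keeps c and returns a and b for deletion.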
Note that, unlike DeleteDuplicates, we assume that two documents in
a Solr index never have the same URL. This class therefore only deals with
documents that have different URLs but the same digest.