in our case a collection of web pages, are split into n splits, where the size of
each split is chosen to ensure that the work can be distributed evenly (chunks
should not be too large) and efficiently (the total number of chunks we need
to manage should not be too large); 16 MB or 64 MB are good sizes in distributed
indexing.
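
The splitting step can be illustrated with a minimal sketch. It assumes each document is described by an identifier and a size in bytes, and it greedily packs documents into splits of roughly a target size. The function name make_splits, the (doc_id, size) representation, and the greedy packing policy are hypothetical choices made for illustration; the text does not prescribe how splits are formed.

```python
from typing import Iterable, Iterator, List, Tuple

MB = 1024 * 1024


def make_splits(
    docs: Iterable[Tuple[str, int]],
    split_size: int = 64 * MB,
) -> Iterator[List[str]]:
    """Group (doc_id, size_in_bytes) pairs into splits of roughly split_size bytes.

    Hypothetical helper: documents are kept whole, so a document larger
    than split_size ends up alone in an oversized split.
    """
    if split_size <= 0:
        raise ValueError("split_size must be positive")

    current: List[str] = []
    current_bytes = 0
    for doc_id, size in docs:
        # Close the current split if adding this document would exceed the target.
        if current and current_bytes + size > split_size:
            yield current
            current, current_bytes = [], 0
        current.append(doc_id)
        current_bytes += size

    # Emit the final, possibly partial, split.
    if current:
        yield current


# Illustrative input: five made-up documents with made-up sizes.
example_docs = [
    ("a", 30 * MB),
    ("b", 30 * MB),
    ("c", 30 * MB),
    ("d", 70 * MB),
    ("e", 1 * MB),
]
example_splits = list(make_splits(example_docs, split_size=64 * MB))
```

A smaller split_size yields more splits, which makes it easier to balance work across machines but increases the number of chunks to manage; a larger value does the opposite.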