• We have N data points in our initial training set. This is very large (106-109)
and is distributed over an HDFS cluster.
• We are going to train a set of M different models for an ensemble classifier.
• Each of the M models will be fitted with K data points, where typically K N: In this case, we must resample some of our data with replacements