Abstract
Analyzing data sets of billions of records has
now become a regular task in many companies
and institutions. In the statistical analysis
of those massive data sets, sampling generally
plays a very important role. In this
work, we describe a scalable simple random
sampling algorithm, named ScaSRS, which
uses probabilistic thresholds to decide on the
fly whether to accept, reject, or wait-list an
item independently of others. We prove that, with
high probability, it succeeds and needs only
O(√k) storage, where k is the sample size.
ScaSRS extends naturally to a scalable stratified
sampling algorithm, which is well suited
to heterogeneous data sets. The proposed algorithms,
when implemented in MapReduce,
can effectively reduce the size of intermediate
output and greatly improve load balancing.
Empirical evaluation on large-scale data sets
clearly demonstrates their superiority.
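The accept/reject/wait-list decision described above can be illustrated with a minimal single-machine sketch. The threshold constants below (a slack of order √(p/n) around the sampling rate p = k/n) are illustrative assumptions, not the paper's exact bounds, and the function name is hypothetical:

```python
import math
import random

def scasrs_sample(items, k, delta=1e-4):
    """One-pass simple random sampling sketch in the spirit of ScaSRS.

    Each item draws an independent uniform key and is accepted,
    rejected, or wait-listed by comparing the key against two
    probabilistic thresholds around p = k/n. The slack term gamma
    used here is an illustrative choice, not the paper's exact value.
    """
    n = len(items)
    p = k / n
    # With high probability, the accepted set stays below k and the
    # wait-list stays small, so only O(sqrt(k))-scale state is kept.
    gamma = math.sqrt(2 * p * math.log(1 / delta) / n)
    q1 = max(p - gamma, 0.0)   # accept immediately below q1
    q2 = min(p + gamma, 1.0)   # reject immediately above q2

    accepted, waitlist = [], []
    for x in items:
        u = random.random()          # decision is independent per item
        if u < q1:
            accepted.append(x)
        elif u < q2:
            waitlist.append((u, x))  # keep the key for the final fill

    # Fill the remaining slots from the wait-list, smallest keys first.
    waitlist.sort()
    accepted.extend(x for _, x in waitlist[: k - len(accepted)])
    return accepted
```

Because each item's decision depends only on its own random key and the global thresholds, the loop parallelizes naturally: in a MapReduce setting, mappers emit accepted items and wait-listed (key, item) pairs, and a single reducer resolves the small wait-list.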