Abstract
Analyzing data sets of billions of records has now become a regular task in many companies and institutions. In the statistical analysis of those massive data sets, sampling generally plays a very important role. In this work, we describe a scalable simple random sampling algorithm, named ScaSRS, which uses probabilistic thresholds to decide on the fly whether to accept, reject, or wait-list an item independently of others. We prove that, with high probability, it succeeds and needs only O(√k) storage, where k is the sample size. ScaSRS extends naturally to a scalable stratified sampling algorithm, which is favorable for heterogeneous data sets. The proposed algorithms, when implemented in MapReduce, can effectively reduce the size of intermediate output and greatly improve load balancing. Empirical evaluation on large-scale data sets clearly demonstrates their superiority.
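The accept/reject/wait-list idea described above can be illustrated with a minimal single-machine sketch. The threshold formulas below (the margin `gamma` around the sampling probability `p = k/n`, and the tolerance `delta`) are illustrative assumptions, not the paper's exact derivation; the point is only the mechanism: each item draws an independent uniform key, keys clearly below `p` are accepted on the spot, keys clearly above are rejected and never stored, and only the small borderline wait-list is kept for a final tie-break.

```python
import math
import random

def scasrs_like_sample(items, k, delta=1e-6, seed=0):
    """One-pass simple random sampling via accept/reject/wait-list decisions.

    Hedged sketch of the thresholding idea from the abstract; `gamma` and
    `delta` are illustrative assumptions, not the paper's exact thresholds.
    """
    n = len(items)
    p = k / n
    # Margin chosen so that, with probability at least 1 - O(delta), the
    # accepted items do not exceed k and accepted + wait-listed cover k.
    # Its width keeps the expected wait-list size on the order of sqrt(k).
    gamma = math.sqrt(2.0 * p * math.log(1.0 / delta) / n)
    rng = random.Random(seed)

    accepted, waitlist = [], []
    for item in items:
        u = rng.random()
        if u < p - gamma:
            accepted.append(item)       # certainly in the sample
        elif u < p + gamma:
            waitlist.append((u, item))  # borderline: decide after the pass
        # else: reject immediately, storing nothing

    # Fill the remaining slots with the wait-listed items of smallest key.
    waitlist.sort()
    need = k - len(accepted)
    accepted.extend(item for _, item in waitlist[:need])
    return accepted

sample = scasrs_like_sample(list(range(1_000_000)), k=10_000)
print(len(sample))  # 10000, with high probability
```

Because the decision for each item depends only on its own random key, the loop body maps directly onto independent MapReduce mappers: accepted items are emitted as output, rejected items produce nothing, and only the small wait-list travels to a single reducer for the final tie-break, which is what shrinks the intermediate output.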
