ABSTRACT
Deploying a classifier to large-scale systems such as the web
requires careful feature design and performance evaluation.
Evaluation is particularly challenging because these large
collections frequently change. In this paper we adapt stratified
sampling techniques to evaluate the precision of classifiers
deployed in large-scale systems. We investigate different
stratification strategies, and then we derive a
new online sampling algorithm that incrementally approximates
the theoretically optimal disproportionate sampling
strategy. In experiments, the proposed algorithm significantly
outperforms both simple random sampling and other types of
stratified sampling, reducing labeling effort by about 20% on
average to reach the same confidence level and interval bounds
on precision.
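The sketch below illustrates the quantities the abstract refers to. Precision is estimated by stratified sampling as a size-weighted average of per-stratum precisions. The classical optimal disproportionate allocation is Neyman allocation, which draws labels from each stratum in proportion to N_h * S_h. Because S_h is unknown in advance, an online procedure can re-estimate it after each label. This is a minimal sketch under stated assumptions and not the authors' algorithm. The function names are hypothetical. The greedy "largest marginal variance reduction" rule and the add-one smoothing of per-stratum proportions are illustrative choices. The sketch also assumes every stratum already has at least one labeled item from a pilot sample.

```python
from typing import List, Sequence


def stratified_precision(sizes: Sequence[int], positives: Sequence[int],
                         labeled: Sequence[int]) -> float:
    """Stratified precision estimate: sum_h W_h * (k_h / n_h), with W_h = N_h / N."""
    total = sum(sizes)
    return sum((N / total) * (k / n) for N, k, n in zip(sizes, positives, labeled))


def neyman_allocation(sizes: Sequence[int], stds: Sequence[float],
                      budget: int) -> List[float]:
    """Neyman (optimal disproportionate) allocation: n_h proportional to N_h * S_h."""
    weights = [N * s for N, s in zip(sizes, stds)]
    total = sum(weights)
    return [budget * w / total for w in weights]


def next_stratum(sizes: Sequence[int], positives: Sequence[int],
                 labeled: Sequence[int]) -> int:
    """Hypothetical greedy online step: pick the stratum whose next label most
    reduces the approximate variance sum_h W_h^2 S_h^2 / n_h (no finite-population
    correction). Assumes n_h >= 1 for every stratum."""
    total = sum(sizes)
    best, best_gain = 0, -1.0
    for h, (N, k, n) in enumerate(zip(sizes, positives, labeled)):
        # Add-one smoothing (an assumption) keeps a stratum whose labels are all
        # correct so far from looking like it has zero variance.
        p = (k + 1) / (n + 2)
        s2 = p * (1 - p)
        w = N / total
        # Variance reduction from n -> n + 1 labels: W^2 S^2 (1/n - 1/(n+1)).
        gain = w * w * s2 / (n * (n + 1))
        if gain > best_gain:
            best, best_gain = h, gain
    return best


if __name__ == "__main__":
    sizes, positives, labeled = [100, 100], [5, 10], [10, 10]
    print(stratified_precision(sizes, positives, labeled))  # 0.75
    print(neyman_allocation([100, 300], [0.5, 0.5], 40))    # [10.0, 30.0]
    print(next_stratum([100, 100], [10, 5], [10, 10]))      # 1
```

Repeatedly calling `next_stratum`, labeling one item from the chosen stratum, and updating `positives` and `labeled` spends more of the budget on large, uncertain strata. The allocation therefore drifts toward the Neyman proportions without knowing S_h beforehand.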