On statistics, computation and scalability
MICHAEL I. JORDAN
Department of Statistics and Department of EECS, University of California, Berkeley, CA,
USA. E-mail: jordan@stat.berkeley.edu; url: www.cs.berkeley.edu/˜jordan
How should statistical procedures be designed so as to be scalable computationally to the massive
datasets that are increasingly the norm? When coupled with the requirement that an answer to
an inferential question be delivered within a certain time budget, this question has significant
repercussions for the field of statistics. With the goal of identifying “time-data tradeoffs,” we
investigate some of the statistical consequences of computational perspectives on scalability, in
particular divide-and-conquer methodology and hierarchies of convex relaxations.
The fields of computer science and statistics have undergone mostly separate evolutions
during their respective histories. This is changing, due in part to the phenomenon of
“Big Data.” Indeed, science and technology are currently generating very large datasets
and the gatherers of these data have increasingly ambitious inferential goals, trends
which point towards a future in which statistics will be forced to deal with problems of
scale in order to remain relevant. Currently the field seems little prepared to meet this
challenge. To the key question “Can you guarantee a certain level of inferential accuracy
within a certain time budget even as the data grow in size?” the field is generally silent.
Many statistical procedures either have unknown runtimes or runtimes that render the
procedure unusable on large-scale data. Although the field of sequential analysis provides
tools to assess risk after a certain number of data points have arrived, this is different from
an algorithmic analysis that predicts a relationship between time and risk. Faced with
this situation, gatherers of large-scale data are often forced to turn to ad hoc procedures
that perhaps do provide algorithmic guarantees but which may provide no statistical
guarantees and which in fact may have poor or even disastrous statistical properties.
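One concrete way to see a time-data tradeoff of the kind alluded to above is through subsampling. The sketch below is an illustration, not a method proposed in this article: when an estimator's cost grows linearly in the number of points it touches, a time budget caps that number, and the statistical risk of the subsample mean decays as the subsample grows. The data, sample sizes, and variance here are invented for the demonstration.

```python
import random
import statistics

random.seed(0)
n = 100_000
# Synthetic data with true mean 0 and unit variance; the inferential
# target is the mean, and the estimator is the sample mean.
data = [random.gauss(0.0, 1.0) for _ in range(n)]

# A time budget caps how many points a linear-time estimator can touch.
# Subsampling trades statistical risk for computation: the subsample
# mean has variance sigma^2 / m, while its cost grows linearly in m.
for m in (100, 1_000, 10_000, n):
    estimate = statistics.fmean(data[:m])
    theoretical_se = 1.0 / m ** 0.5  # sigma = 1 for this synthetic data
    print(f"m = {m:>6}: estimate = {estimate:+.4f}, "
          f"theoretical SE = {theoretical_se:.4f}")
```

The point of the sketch is only that time and risk are jointly parameterized by m: an algorithmic analysis of the kind the text calls for would predict this curve in advance, rather than assessing risk after the fact.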
On the other hand, the field of computer science is also currently poorly equipped
to provide solutions to the inferential problems associated with Big Data. Database researchers
rarely view the data in a database as noisy measurements on an underlying
population about which inferential statements are desired. Theoretical computer scientists
are able to provide analyses of the resource requirements of algorithms (e.g., time
and space), and are often able to provide comparative analyses of different algorithms
for solving a given problem, but these problems rarely refer to inferential goals. In particular,
the notion that it may be possible to save on computation because of the growth