How should statistical procedures be designed so as to be scalable computationally to the massive datasets that are increasingly the norm? When coupled with the requirement that an answer to an inferential question be delivered within a certain time budget, this question has significant repercussions for the field of statistics. With the goal of identifying “time-data tradeoffs,” we investigate some of the statistical consequences of computational perspectives on scability, in particular divide-and-conquer methodology and hierarchies of convex relaxations.
The fields of computer science and statistics have undergone mostly separate evolutions during their respective histories. This is changing, due in part to the phenomenon of “Big Data.” Indeed, science and technology are currently generating very large datasets and the gatherers of these data have increasingly ambitious inferential goals, trends which point towards a future in which statistics will be forced to deal with problems of scale in order to remain relevant. Currently the field seems little prepared to meet this challenge. To the key question “Can you guarantee a certain level of inferential accuracy within a certain time budget even as the data grow in size?” the field is generally silent. Many statistical procedures either have unknown runtimes or runtimes that render the procedure unusable on large-scale data. Although the field of sequential analysis provides tools to assess risk after a certain number of data points have arrived, this is different from an algorithmic analysis that predicts a relationship between time and risk. Faced with this situation, gatherers of large-scale data are often forced to turn to ad hoc procedures that perhaps do provide algorithmic guarantees but which may provide no statistical guarantees and which in fact may have poor or even disastrous statistical properties. On the other hand, the field of computer science is also currently poorly equipped to provide solutions to the inferential problems associated with Big Data. Database researchers rarely view the data in a database as noisy measurements on an underlying population about which inferential statements are desired. Theoretical computer scientists are able to provide analyses of the resource requirements of algorithms (e.g., time and space), and are often able to provide comparative analyses of different algorithms for solving a given problem, but these problems rarely refer to inferential goals. In particular, the notion that it may be possible to save on computation because of the growth of statistical power as problem instances grow in size is not (yet) a common perspective in computer science. In this paper we discuss some recent research initiatives that aim to draw computer science and statistics closer together, with particular reference to “Big Data” problems. There are two main underlying perspectives driving these initiatives, both of which present interesting conceptual challenges for statistics. The first is that large computational problems are often usefully addressed via some notion of “divide-and-conquer.” That is, the large problem is divided into subproblems that are hopefully simpler than the original problem, these subproblems are solved (sometimes again with a divide-andconquer strategy) and the solutions are pieced together to solve the original problem. In the statistical setting, one natural subdivision strategy involves breaking the data into subsets. The estimator of interest is applied to the subsets and the results are combined. The challenge in the statistical setting is that the analysis of subsets of data may present different statistical properties than the overall dataset. For example, confidence intervals based on subsets of data will generally be wider than confidence intervals based on the original data; thus, care must be taken that the overall divide-and-conquer procedure yields a correctly calibrated interval. The second perspective involves a notion of “algorithmic weakening,” whereby we do not consider a single algorithm for solving an inference problem, but instead consider a hierarchy of algorithms that are ordered by computational complexity. As data accrue, we want to back off to cheaper algorithms that run more quickly and deliver a result that would be viewed as being of poorer quality from a classical algorithmic point of view. We hope to do this in a way such that the increasing statistical strength of the data compensate for the poor algorithmic quality, so that in fact the overall quality of inference increases as data accrue, even if we impose a computational budget. The challenge is to do this in a theoretically sound way. The remainder of the paper is organized into three subsections, the first two concerned with divide-and-conquer algorithms, and the third concerned with algorithmic weakening.
วิธีควรกระบวนงานทางสถิติถูกออกแบบมาเพื่อปรับสเกล computationally จะ datasets ขนาดใหญ่ที่เป็นปกติมากขึ้นได้หรือไม่ เมื่อควบคู่กับความต้องการที่ส่งคำตอบเป็นคำถามเพียงน้อยนิดภายในบางครั้งงบประมาณ คำถามนี้มีสถิติในด้านร้ายอย่างมีนัยสำคัญ มีเป้าหมายจะระบุ "เวลาข้อมูลยืนยัน" เราสอบสวนบางผลสถิติของการคำนวณมุมใน scability วิธีการเฉพาะแบ่ง และพิชิต และลำดับชั้นของ relaxations นูนThe fields of computer science and statistics have undergone mostly separate evolutions during their respective histories. This is changing, due in part to the phenomenon of “Big Data.” Indeed, science and technology are currently generating very large datasets and the gatherers of these data have increasingly ambitious inferential goals, trends which point towards a future in which statistics will be forced to deal with problems of scale in order to remain relevant. Currently the field seems little prepared to meet this challenge. To the key question “Can you guarantee a certain level of inferential accuracy within a certain time budget even as the data grow in size?” the field is generally silent. Many statistical procedures either have unknown runtimes or runtimes that render the procedure unusable on large-scale data. Although the field of sequential analysis provides tools to assess risk after a certain number of data points have arrived, this is different from an algorithmic analysis that predicts a relationship between time and risk. Faced with this situation, gatherers of large-scale data are often forced to turn to ad hoc procedures that perhaps do provide algorithmic guarantees but which may provide no statistical guarantees and which in fact may have poor or even disastrous statistical properties. On the other hand, the field of computer science is also currently poorly equipped to provide solutions to the inferential problems associated with Big Data. Database researchers rarely view the data in a database as noisy measurements on an underlying population about which inferential statements are desired. Theoretical computer scientists are able to provide analyses of the resource requirements of algorithms (e.g., time and space), and are often able to provide comparative analyses of different algorithms for solving a given problem, but these problems rarely refer to inferential goals. In particular, the notion that it may be possible to save on computation because of the growth of statistical power as problem instances grow in size is not (yet) a common perspective in computer science. In this paper we discuss some recent research initiatives that aim to draw computer science and statistics closer together, with particular reference to “Big Data” problems. There are two main underlying perspectives driving these initiatives, both of which present interesting conceptual challenges for statistics. The first is that large computational problems are often usefully addressed via some notion of “divide-and-conquer.” That is, the large problem is divided into subproblems that are hopefully simpler than the original problem, these subproblems are solved (sometimes again with a divide-andconquer strategy) and the solutions are pieced together to solve the original problem. In the statistical setting, one natural subdivision strategy involves breaking the data into subsets. The estimator of interest is applied to the subsets and the results are combined. The challenge in the statistical setting is that the analysis of subsets of data may present different statistical properties than the overall dataset. For example, confidence intervals based on subsets of data will generally be wider than confidence intervals based on the original data; thus, care must be taken that the overall divide-and-conquer procedure yields a correctly calibrated interval. The second perspective involves a notion of “algorithmic weakening,” whereby we do not consider a single algorithm for solving an inference problem, but instead consider a hierarchy of algorithms that are ordered by computational complexity. As data accrue, we want to back off to cheaper algorithms that run more quickly and deliver a result that would be viewed as being of poorer quality from a classical algorithmic point of view. We hope to do this in a way such that the increasing statistical strength of the data compensate for the poor algorithmic quality, so that in fact the overall quality of inference increases as data accrue, even if we impose a computational budget. The challenge is to do this in a theoretically sound way. The remainder of the paper is organized into three subsections, the first two concerned with divide-and-conquer algorithms, and the third concerned with algorithmic weakening.
การแปล กรุณารอสักครู่..
