The volume of data operated upon by modern applications is
growing at a tremendous rate, posing intriguing challenges for
parallel and distributed computing platforms. These challenges
range from building storage systems that can accommodate
these large datasets, to collecting data from geographically
dispersed sources into storage systems, to running a diverse set
of computations on data. Resource and semantic constraints, like
Brewer’s CAP theorem, require handling these problems on a per-application basis, exploiting application-specific characteristics
and heuristics. Recent efforts towards addressing these challenges
have resulted in scalable distributed storage systems (file systems,
key-value stores, etc.) and execution engines that can handle a
variety of computing paradigms. In the future, as data sizes
continue to grow and the domains of these applications diverge,
these systems will need to adapt to leverage application-specific
optimizations. To tackle the highly distributed nature of data
sources, future systems might offload some of the computation to
the sources themselves to avoid expensive data movement.