Shark is a new data analysis system that marries query processing
with complex analytics on large clusters. It leverages a novel distributed
memory abstraction to provide a unified engine that can run
SQL queries and sophisticated analytics functions (e.g., iterative
machine learning) at scale, and efficiently recovers from failures
mid-query. This allows Shark to run SQL queries up to 100× faster
than Apache Hive, and machine learning programs more than 100× faster than Hadoop. Unlike previous systems, Shark shows that it is
possible to achieve these speedups while retaining a MapReducelike
execution engine, and the fine-grained fault tolerance properties
that such engine provides. It extends such an engine in several
ways, including column-oriented in-memory storage and dynamic
mid-query replanning, to effectively execute SQL. The result
is a system that matches the speedups reported for MPP analytic
databases over MapReduce, while offering fault tolerance properties
and complex analytics capabilities that they lack.