Big data is difficult to analyze and work with using relational databases and desktop statistics or visualization packages; it instead requires massively parallel software running on tens, hundreds, or even thousands of servers.
The analysis of data, or data analytics, is the process of extracting and highlighting useful information from big data sets, usually with the goal of supporting decision making. Big data analytics demands real-time or near-real-time information delivery, so latency is avoided whenever and wherever possible. To address these difficulties, a new generation of big data tools has arisen, such as the Apache Hadoop platform, derived from Google's papers on MapReduce and the Google File System.
Many systems developed today for the parallel processing of big data sets provide query languages for expressing analysis tasks. However, these languages are, to varying degrees, aware of the physical aspects of the underlying system.
In this talk we present a high-level query language for expressing analysis tasks as queries over big data sets, independently of how the analysis is carried out, what computing resources the system uses, or how the data is physically laid out: a query in our language is defined at the conceptual level and then mapped to a lower-level evaluation mechanism that computes its answer. We illustrate this process using MapReduce as such a lower-level evaluation mechanism.
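To make the mapping idea concrete, the sketch below is our own illustrative example, not the language presented in the talk: it assumes a hypothetical conceptual query of the form "total salary per department" and shows how such a query could be compiled into a map function and a reduce function, with the shuffle phase simulated in plain Python so the example is self-contained and runnable.

```python
from collections import defaultdict

# Toy input data set: (department, salary) records.
RECORDS = [
    ("sales", 1000), ("engineering", 2000),
    ("sales", 1500), ("engineering", 2500),
]

# Hypothetical conceptual query: SUM(salary) GROUP BY department.
# A compiler for the query language would emit the two functions below.

def map_fn(record):
    """Map phase: emit (group-by key, value to aggregate) pairs."""
    department, salary = record
    yield department, salary

def reduce_fn(key, values):
    """Reduce phase: apply the aggregate (here SUM) to all values of a key."""
    return key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    """Minimal in-process simulation of a MapReduce run:
    map every record, group (shuffle) by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # shuffle: collect values per key
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]

if __name__ == "__main__":
    # Expected output: [('engineering', 4500), ('sales', 2500)]
    print(run_mapreduce(RECORDS, map_fn, reduce_fn))
```

The point of the sketch is that the query author only states the conceptual query; the map and reduce functions, and the choice of MapReduce as the evaluation mechanism, are details produced by the translation step rather than written by hand.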