2.2 Row/column stores
Traditional row-oriented RDBMSs provide a high level of expressiveness and are widely used in a variety of commercial applications, which has resulted in remarkable, well-tested optimizations. These systems are optimal when querying many columns of a single row and are well suited to OLTP workloads. However, they cannot cope with highly heterogeneous data and are inefficient for the read-intensive queries of OLAP workloads [15]. The ROLAP (Relational OLAP) data model has been proposed to enable data analysis through a multidimensional data model [13]. A number of schema integration systems have also been proposed [22][7]; they take a few different schemas and attempt to unify them into a single global one. In our context, there could be thousands of different schemas, and integrating them into a unified schema is infeasible.
Column-oriented databases (derived from the decomposition storage model, DSM [10]) have gained interest as an appropriate solution for OLAP workloads [24]. These systems have shown better performance for ad-hoc/statistical queries [1], owing to their I/O efficiency: they read only the attributes required by a query. The DSM model can support heterogeneous data, but some recent column stores use virtual keys [24] to link all the attributes belonging to the same tuple, a technique that forfeits this advantage. The drawbacks of such systems are the high tuple reconstruction time and the high cost of inserts (and updates).
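To make the trade-off concrete, the following minimal Python sketch illustrates the DSM idea described above: each attribute is stored as a separate (virtual key, value) column, so a statistical query touches only the columns it needs, while rebuilding a full tuple requires joining every column on the key. The table and attribute names are purely illustrative, not the layout of any cited system.

```python
# Illustrative sketch of DSM-style decomposition (hypothetical data).
# A row store keeps each tuple together:
rows = [
    {"id": 1, "modality": "CT", "rows_px": 512},
    {"id": 2, "modality": "MR", "rows_px": 256},
]

# DSM decomposes the table into one (virtual key, value) column per attribute:
columns = {}
for r in rows:
    for attr, val in r.items():
        if attr == "id":
            continue
        columns.setdefault(attr, []).append((r["id"], val))

# A statistical query reads only the column it needs (the I/O advantage):
ct_count = sum(1 for _, m in columns["modality"] if m == "CT")

# Reconstructing a full tuple joins every column on the virtual key --
# the reconstruction overhead noted above:
def reconstruct(key):
    return {attr: dict(col)[key] for attr, col in columns.items()}

print(ct_count)        # count of CT tuples, computed from one column only
print(reconstruct(2))  # {'modality': 'MR', 'rows_px': 256}
```

The sketch also hints at why virtual keys hurt heterogeneity: once every column is expected to hold a value for every key, sparse or schema-varying tuples reintroduce the nulls that DSM was meant to avoid.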
The need to get the best of both worlds (row and column databases) has given rise to hybrid paradigms such as PAX [3] and fractured mirrors [23]. These systems were not conceived for the heterogeneity problem: they either focus on optimizing cache usage, duplicate the data across two different storage layouts, or cannot provide the needed scalability.
2.3 Motivation
The urgent need for a scalable system capable of holding extremely large, ever-growing data volumes while reducing infrastructure cost has led us to investigate the promising characteristics of cloud systems.
Our objective is to propose a data management system that, while maintaining the elasticity, pay-per-use, and availability features of the cloud, supports huge and ever-growing volumes of data, ensures optimal response time, (1) enables management of the high heterogeneity of DICOM files, and (2) provides the expressiveness necessary for complex ad-hoc/statistical queries. In this paper we focus on the first point.