3.1 Data Storage
We propose a hybrid (row-column), two-layered, cloud-enabled data storage structure. Each layer is designed to store a specific set of DICOM attributes. To this end, we decompose DICOM attributes into three categories: (1) mandatory/frequently used attributes; (2) attributes that are frequently accessed together; and (3) optional/private attributes. We then assign each category to the most appropriate layer. The two layers are linked by an internal unique identifier (row-id) that allows us to reconstruct the original DICOM files. Both layers are cloud-based, which provides elasticity and fault tolerance (e.g. GFS [16] automatically stores several copies of the data in geographically separated areas, so a server crash is not a problem). Consequently, the required availability of medical data is ensured at any time. Another important aspect is the pay-per-use feature: a good level of normalization of our data and the choice of an appropriate cloud-enabled storage system for each layer could considerably reduce the storage cost.
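The decomposition and the row-id link can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the attribute classification in ROW_LAYER_ATTRIBUTES, the function names, and the use of a random hexadecimal string as row-id are all assumptions made for illustration, and categories (1) and (2) are merged because both go to the row layer.

```python
import uuid

# Hypothetical classification: attributes assumed to be mandatory, frequently
# used, or frequently accessed together. Everything else is treated as
# optional/private and goes to the column layer.
ROW_LAYER_ATTRIBUTES = {"PatientName", "PatientBirthDate", "StudyDate", "Modality"}


def split_attributes(attributes, row_id=None):
    """Split one DICOM attribute dictionary into a row-layer record and a
    column-layer record that share the same internal row-id."""
    row_id = row_id or uuid.uuid4().hex
    row_record = {"row_id": row_id}
    column_record = {"row_id": row_id}
    for name, value in attributes.items():
        target = row_record if name in ROW_LAYER_ATTRIBUTES else column_record
        target[name] = value
    return row_record, column_record


def reconstruct(row_record, column_record):
    """Rebuild the full attribute set from the two layers via the row-id."""
    if row_record["row_id"] != column_record["row_id"]:
        raise ValueError("records belong to different DICOM files")
    merged = {**row_record, **column_record}
    merged.pop("row_id")
    return merged
```

For example, calling split_attributes on a file with a patient name, a study date, and one private attribute yields a row-layer record holding the first two and a column-layer record holding the third; reconstruct then returns the original attribute set.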
3.1.1 Row-Oriented Layer
We propose to store the mandatory/frequently used attributes and the attributes that are frequently accessed together (e.g. patient name, birthdate) in a row-oriented database. This improves query execution time by minimizing the tuple reconstruction time for attributes that are frequently accessed together. A further advantage of this layer is that it is write-optimized: each tuple insertion in a row-oriented database needs only one disk block I/O. Thus, a large number of inserts into this layer will not be challenging. For example, one thousand DICOM files produce one thousand inserts in this layer (for the mandatory attributes such as study date). Since the frequently used attributes are stored in this layer, daily queries mostly access this layer. Sharded databases, such as Azure or RDS, are candidate solutions for such a layer. However, in order to reduce cost and obtain a more scalable solution, our current study focuses on shared-nothing, MapReduce-based approaches such as Pig, Hive, or Jaql.
3.1.2 Column-Oriented Layer
Optional/private attributes vary enormously from one medical file or center to another. We propose storing these highly heterogeneous attributes in a column-oriented database. Only non-null attribute values are inserted into their corresponding columns, so no space or I/O is spent on absent attributes, which significantly improves performance. This model therefore copes well with our heterogeneous data. This layer also supports efficient ad-hoc queries, since column-oriented databases are OLAP-optimized. Additionally, it provides a good solution to the evolving-schema issue: each column is stored in separate disk blocks, so adding new columns is not challenging. Attributes stored in this layer are less frequently accessed together, so the result reconstruction time is kept to a minimum. Possible implementations include BigTable, Cassandra, Vertica, HBase, and HyperTable. The high cost and proprietary nature of Vertica and BigTable, and the OLTP workload orientation of Cassandra, lead us to focus on the remaining systems (HBase and HyperTable).
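The sparse storage idea described above can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not the API of HBase or HyperTable: the class SparseColumnStore and its methods are hypothetical names, and each column is modelled as a dictionary from row-id to value.

```python
from collections import defaultdict


class SparseColumnStore:
    """Toy column-oriented store: one dictionary per column, keyed by row-id.
    Null values are never materialized."""

    def __init__(self):
        # column name -> {row_id: value}
        self.columns = defaultdict(dict)

    def insert(self, row_id, attributes):
        """Insert only the non-null attributes of one DICOM file. A column
        that has not been seen before is created on first use, so schema
        evolution needs no separate step."""
        for name, value in attributes.items():
            if value is not None:
                self.columns[name][row_id] = value

    def read_column(self, name):
        """Scan a single column without touching any other column."""
        return dict(self.columns.get(name, {}))

    def read_row(self, row_id):
        """Reconstruct the optional/private attributes of one file."""
        return {
            name: cells[row_id]
            for name, cells in self.columns.items()
            if row_id in cells
        }
```

In this sketch, an ad-hoc query over one attribute reads a single dictionary, while reconstructing a whole file requires visiting every column, which mirrors the trade-off between the two layers.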
3.1.3 Column Mover
Our proposal includes a column mover: a process that moves (when necessary) some attributes from the row layer to the column layer and vice versa, according to DICOM's evolving schema, the previous queries, and the data. For example, optional attributes that are initially stored in the column layer but are in practice present in most files can be moved to the row layer. This process can be performed periodically (e.g. each month) to maintain the best distribution of attributes over this structure. Implementing this process raises some important issues, such as determining when to execute it (ideally at off-peak time), under which conditions, and how to treat currently running and incoming queries while it executes.
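One possible decision rule for the column mover is sketched below in Python. It covers only the data-driven criterion (the fraction of files in which an attribute is present) and ignores query history and scheduling. The function name plan_moves and the thresholds promote_at and demote_at are assumptions chosen for illustration; the text does not fix any particular values.

```python
def plan_moves(presence_ratio, current_layer, promote_at=0.8, demote_at=0.2):
    """Return {attribute: target_layer} for attributes that should move.

    presence_ratio: attribute -> fraction of files (0.0 to 1.0) in which the
                    attribute has a non-null value.
    current_layer:  attribute -> "row" or "column".
    promote_at / demote_at: hypothetical thresholds. The gap between them
                    keeps an attribute near a threshold from being moved
                    back and forth on consecutive runs.
    """
    moves = {}
    for name, ratio in presence_ratio.items():
        layer = current_layer[name]
        if layer == "column" and ratio >= promote_at:
            moves[name] = "row"       # present in most files
        elif layer == "row" and ratio <= demote_at:
            moves[name] = "column"    # rarely present
    return moves
```

A periodic job could compute the presence ratios from the stored data, call plan_moves, and apply the resulting plan during an off-peak window.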