3.2 Query Execution
The query execution engine contains two main components: the query compiler and the combiner/disperser.
The Query Compiler (QC) is responsible for compiling user requests and translating them into a cloud query language. Users who are not cloud experts can thus perform ad-hoc queries over DICOM files without dealing with the complexity of query languages.
We propose to extend existing systems (Pig, Hive, Jaql) and adapt them to our hybrid architecture. These systems, which are still under heavy development, have some limitations. The absence of metadata or a schema in some of them may preclude certain optimizations (e.g., indexes, reduction of the search space) and some functionalities (e.g., joins). In particular, these languages were not designed for such a hybrid structure, in which the cardinality of different attributes varies enormously.
The Combiner/Disperser is responsible for partitioning incoming queries according to the storage layers (row-oriented, column-oriented). After query execution, the Combiner/Disperser is in charge of combining (joining) the results coming from both storage layers and sending the final results back to the user.
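The dispersing and combining steps can be illustrated with a minimal sketch. It assumes that a query is a set of equality predicates on DICOM attributes, that each attribute belongs to exactly one layer, and that partial results share a record identifier. The attribute placement in `ROW_ATTRS` and `COLUMN_ATTRS`, the `uid` join key, and the function names are hypothetical and are not prescribed by the architecture.

```python
from typing import Dict, List, Tuple

# Hypothetical attribute placement, chosen only for illustration.
ROW_ATTRS = {"PatientID", "PatientName", "StudyDate"}
COLUMN_ATTRS = {"Modality", "SliceThickness"}


def disperse(predicates: Dict[str, object]) -> Tuple[Dict[str, object], Dict[str, object]]:
    """Split a query's predicates into a row-layer part and a column-layer part."""
    unknown = set(predicates) - ROW_ATTRS - COLUMN_ATTRS
    if unknown:
        raise ValueError(f"attributes not mapped to any layer: {sorted(unknown)}")
    row_part = {a: v for a, v in predicates.items() if a in ROW_ATTRS}
    col_part = {a: v for a, v in predicates.items() if a in COLUMN_ATTRS}
    return row_part, col_part


def combine(row_results: List[dict], col_results: List[dict], key: str = "uid") -> List[dict]:
    """Join the partial results of both layers on a shared record identifier."""
    col_by_key = {r[key]: r for r in col_results}
    return [
        {**r, **col_by_key[r[key]]}
        for r in row_results
        if r[key] in col_by_key
    ]
```

For example, `disperse({"PatientID": "P1", "Modality": "CT"})` sends the `PatientID` predicate to the row layer and the `Modality` predicate to the column layer; `combine` then keeps only the records returned by both layers.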
In order to provide a good compromise between storage cost and query response time, we propose a query optimizer. It is responsible for choosing the best query plan for executing the query over our hybrid storage model. The optimizer should evaluate a number of possible execution strategies: (1) execute the query over the row layer, then over the column layer, and finally combine the results; (2) execute the query over the column layer first, then over the row layer, and finally combine the results of both; or (3) execute the query in parallel over both layers and then combine the results.
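The choice among these three strategies can be sketched as a comparison of estimated costs. The cost model below is an assumption made purely for illustration: each layer is described by an estimated scan cost and a selectivity, a sequential strategy pays the full cost of the first layer and a fraction of the second (proportional to the first layer's selectivity), and the parallel strategy pays the slower of the two scans plus a join cost. The names `LayerEstimate`, `estimate_costs`, and `choose_strategy` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Strategy(Enum):
    ROW_THEN_COLUMN = 1   # strategy (1)
    COLUMN_THEN_ROW = 2   # strategy (2)
    PARALLEL = 3          # strategy (3)


@dataclass
class LayerEstimate:
    scan_cost: float    # estimated cost of evaluating the query on the whole layer
    selectivity: float  # estimated fraction of records that satisfy this layer's predicates


def estimate_costs(row: LayerEstimate, col: LayerEstimate, join_cost: float) -> Dict[Strategy, float]:
    """Assumed cost model: the second layer only examines records kept by the first."""
    return {
        Strategy.ROW_THEN_COLUMN: row.scan_cost + row.selectivity * col.scan_cost,
        Strategy.COLUMN_THEN_ROW: col.scan_cost + col.selectivity * row.scan_cost,
        Strategy.PARALLEL: max(row.scan_cost, col.scan_cost) + join_cost,
    }


def choose_strategy(row: LayerEstimate, col: LayerEstimate, join_cost: float = 0.0) -> Strategy:
    """Return the strategy with the lowest estimated cost."""
    costs = estimate_costs(row, col, join_cost)
    return min(costs, key=costs.get)
```

Under this model, a highly selective predicate on the row layer favors strategy (1), because the column layer then has little left to examine, whereas weakly selective predicates on both layers favor the parallel strategy.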
The query optimizer should apply cost-based and/or rule-based optimization. However, existing cost-based (CBO) and rule-based (RBO) solutions should be rethought for the cloud by taking into account its pay-per-use and elasticity features. In this context, we distinguish between two query types. The first is real-time search, where a doctor may need certain images rapidly. In this case, the response time is crucial, so we may increase the number of cloud resources used, according to the Service Level Agreement (SLA). The second is data analysis, which could be performed at night. Here the response time is not crucial, so we can reduce the resources used. In this way, we maintain a good trade-off between response time and cost.
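The distinction between the two query types can be sketched as a simple resource allocation rule. The sketch assumes that query work is measured in abstract units, that throughput scales linearly with the number of nodes, and that the SLA is expressed as a maximum response time. The function `nodes_for_query`, its parameters, and the default bounds are hypothetical.

```python
import math
from enum import Enum


class QueryType(Enum):
    REAL_TIME = "real_time"  # e.g. a doctor retrieving images
    BATCH = "batch"          # e.g. data analysis run at night


def nodes_for_query(
    query_type: QueryType,
    work_units: float,
    sla_seconds: float,
    units_per_node_second: float,
    min_nodes: int = 1,
    max_nodes: int = 32,
) -> int:
    """Pick a node count: scale out to meet the SLA, or stay minimal for batch work."""
    if query_type is QueryType.BATCH:
        # Response time is not crucial, so use as few resources as possible.
        return min_nodes
    # Assumes linear scaling: n nodes process n * units_per_node_second units per second.
    needed = math.ceil(work_units / (units_per_node_second * sla_seconds))
    return max(min_nodes, min(needed, max_nodes))
```

For instance, a real-time query of 1000 work units with a 2-second SLA and 50 units per node per second is assigned 10 nodes under these assumptions, while the same workload submitted as a batch query is assigned a single node.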