2. ANALYTICAL QUERY PROCESSING
A number of domain-specific analytical queries are seen in
the scientific databases. Many of these queries take superlinear
processing time if handled in the traditional DBMS
way. Some examples of basic queries in molecular simulation
databases are listed in table 1. Many of these queries return
the results in the form of histograms as approximations of
statistical distributions. The query processing time for large
volume data sets is significantly high. The complexity of the
query adds to this time, making the running time unimaginable.
The disk I/O and network transfer times worsen the
execution time even more. As we have seen in the introduction
section, the queries in scientific data sets are analytical
and therefore make at least one pass over the whole data
set. These queries can not be answered quickly using any of
the traditional query execution methods or plans. Thus, the
processing of queries on scientific data sets is very expensive.
In this thesis we propose to compute queries efficiently and
compress the data for saving disk space, I/O and transmission
time (details in section 3).
2.1 General Approach
The traditional databases answer queries by generating
plans that use the indexes whenever possible. A multidimensional
tree index is built to store a digest of the data.
Each tree node caches such a digest of all the data points
it holds. At query runtime, aggregate queries are answered
by accessing the digest of the specific region. The distributive
queries can be answered by storing the regional query
results as digest. However, storing algebraic aggregates in
the from of digest needs further exploration of the field. We
propose an approach that requires linear time to build thetree. Small queries can be answered in constant time and the
regional queries in time proportional to the number of tree
nodes accessed at runtime. We propose an algorithm that
builds a Quad/Oct-tree structure to partition the data into
different disjoint regions to answer the queries efficiently.
Correlation Functions: Holistic aggregates are one of
the challenging queries in scientific data sets. Efficient processing
of such queries is not possible by caching a very small
digest of the data. A group of holistic aggregates we focus
on are called multi-body correlation functions (m-BCF) [20].
A m-BCF computes statistical measure for all m-particle tuples
in the data. Processing m-BCF queries, on data of Nbodies,
in naive way requires O(Nm) computational time.
As an example of 2-BCF, the distribution of all pairwise
distances is known as a radial distribution function (RDF)
and often computed in the form of a 1D histogram named
Spatial Distribution Histogram (SDH). Similarly, the density
function can be viewed as a 1-BCF. Being the building
blocks of many critical analytics [6, 19], the m-BCFs are of
importance to scientific data analysis and thus the focus of
this part of the thesis work.
The obvious challenges for such query processing are: 1)
identify the correct data digest to store in the tree nodes;
and 2) design of query processing algorithms that utilize
such digest.