maintains the relationships between a foreign key and its
matching primary keys. In the context of a star schema, a join
index can relate the values of one or more attributes of a
dimension table to matching rows in the fact table. For
example, consider the schema of Figure 3. There can be a
join index on City that maintains, for each city, a list of RIDs
of the tuples in the fact table that correspond to sales in that
city. Thus a join index essentially precomputes a binary join.
Multikey join indices can represent precomputed n-way joins.
For example, over the Sales database it is possible to
construct a multidimensional join index from (Cityname,
Productname) to the fact table. Thus, the index entry for
(Seattle, jacket) points to RIDs of those tuples in the Sales
table that have the above combination. Using such a
multidimensional join index can sometimes provide savings
over taking the intersection of separate indices on Cityname
and Productname. Join indices can be used with bitmap
representations for the RID lists for efficient join
processing12.
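To make this concrete, the following Python sketch (illustrative only; the fact table, RIDs, and values are invented) builds single-key and multikey join indices over a tiny in-memory Sales table, keeping each RID list as an integer bitmap, and compares intersecting two single-key bitmaps with a single probe of the multikey index.

sales = [  # RID is the list position: (city, product, amount)
    ("Seattle", "jacket", 120.0),
    ("Seattle", "shoes", 80.0),
    ("Portland", "jacket", 95.0),
    ("Seattle", "jacket", 60.0),
]

def build_index(key):
    """Map each key value to a bitmap of RIDs in the fact table."""
    index = {}
    for rid, row in enumerate(sales):
        k = key(row)
        index[k] = index.get(k, 0) | (1 << rid)
    return index

city_index = build_index(lambda r: r[0])                  # City -> RID bitmap
product_index = build_index(lambda r: r[1])               # Product -> RID bitmap
city_product_index = build_index(lambda r: (r[0], r[1]))  # multikey join index

# Intersecting two single-key indices ...
via_intersection = city_index["Seattle"] & product_index["jacket"]
# ... versus a single probe of the multikey join index.
via_multikey = city_product_index[("Seattle", "jacket")]
assert via_intersection == via_multikey  # both bitmaps name RIDs 0 and 3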
Finally, decision support databases contain a significant
amount of descriptive text and so indices to support text
search are useful as well.
Materialized Views and their Usage
Many queries over data warehouses require summary data,
and, therefore, use aggregates. Hence, in addition to indices,
materializing summary data can help to accelerate many
common queries. For example, in an investment environment,
a large majority of the queries may be based on the
performance of the most recent quarter and the current fiscal
year. Having summary data on these parameters can
significantly speed up query processing.
The challenges in exploiting materialized views are not unlike
those in using indices: (a) identify the views to materialize,
(b) exploit the materialized views to answer queries, and (c)
efficiently update the materialized views during load and
refresh. Current industrial solutions to these problems
materialize views that have a relatively simple structure.
Such views consist of joins of the fact table
with a subset of dimension tables (possibly after some
selections on those dimensions), with the aggregation of one
or more measures grouped by a set of attributes from the
dimension tables. The structure of these views is a little more
complex when the underlying schema is a snowflake.
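As a concrete illustration of this restricted form, the Python sketch below (the tables, columns, and values are hypothetical, and a plain star schema is assumed) joins a small fact table with a Product dimension, applies a selection on the dimension, and aggregates a measure grouped by dimension attributes.

from collections import defaultdict

# Hypothetical star-schema fragments: a Sales fact table and a Product
# dimension keyed by product_id.
product_dim = {1: ("jacket", "clothing"), 2: ("shoes", "footwear")}
sales_fact = [  # (product_id, city, quarter, amount)
    (1, "Seattle", "1996Q1", 120.0),
    (2, "Seattle", "1996Q1", 80.0),
    (1, "Portland", "1996Q2", 95.0),
]

# A view of the restricted form: join the fact table with the Product
# dimension, select one category, and sum the measure grouped by a
# dimension attribute (product name) and the quarter.
view = defaultdict(float)
for product_id, city, quarter, amount in sales_fact:
    name, category = product_dim[product_id]
    if category == "clothing":           # selection on the dimension
        view[(name, quarter)] += amount  # SUM(amount) grouped by (name, quarter)

print(dict(view))  # {('jacket', '1996Q1'): 120.0, ('jacket', '1996Q2'): 95.0}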
Despite the restricted form, there is still a wide choice of
views to materialize. The selection of views to materialize
must take into account workload characteristics, the costs for
incremental update, and upper bounds on storage
requirements. Under simplifying assumptions, a greedy
algorithm was shown to have good performance13. A related
problem that underlies optimization as well as choice of
materialized views is that of estimating the effect of
aggregation on the cardinality of the relations.
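The sketch below conveys the flavor of such a greedy selection under a storage budget; the candidate views, storage costs, and benefit estimates are invented, and the published algorithm13 uses a benefit model that is recomputed as views are chosen rather than the fixed benefit-per-unit-storage ratio used here.

candidates = {
    # view name: (storage cost, estimated benefit over the workload)
    "sales_by_state_product": (400, 900),
    "sales_by_city_category": (300, 700),
    "sales_by_quarter_product": (250, 650),
    "sales_by_city_product": (900, 950),
}

def greedy_select(candidates, budget):
    """Repeatedly pick the view with the best benefit per unit of storage
    that still fits within the remaining storage budget."""
    chosen, used = [], 0
    remaining = dict(candidates)
    while remaining:
        best = max(remaining, key=lambda v: remaining[v][1] / remaining[v][0])
        cost, _ = remaining.pop(best)
        if used + cost <= budget:
            chosen.append(best)
            used += cost
    return chosen

print(greedy_select(candidates, budget=800))
# ['sales_by_quarter_product', 'sales_by_city_category']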
A simple, but extremely useful, strategy for using a
materialized view is to apply a selection to it, or to roll it up
by grouping and aggregating on additional columns. For
example, assume that
a materialized view contains the total sales by quarter for
each product. This materialized view can be used to answer a
query that requests the total sales of Levi’s jeans for the year
by first applying the selection and then rolling up from
quarters to years. It should be emphasized that the ability to
roll up a partially aggregated result relies on algebraic
properties of the aggregating functions (e.g., Sum can be
rolled up, but some other statistical functions may not be).
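The following sketch, over a hypothetical quarterly view, shows this selection-plus-rollup strategy and why it depends on the aggregate being algebraic.

from collections import defaultdict

# A hypothetical materialized view: total sales by (product, quarter).
quarterly_sales = {
    ("Levi's jeans", "1996Q1"): 1000.0,
    ("Levi's jeans", "1996Q2"): 1200.0,
    ("Levi's jeans", "1996Q3"): 900.0,
    ("Levi's jeans", "1996Q4"): 1100.0,
    ("jacket", "1996Q1"): 300.0,
}

# Answer "total sales of Levi's jeans for the year" by selecting on the view
# and rolling quarters up to the year. Sum can be rolled up because partial
# sums combine into the total; a median, for example, could not.
yearly = defaultdict(float)
for (product, quarter), total in quarterly_sales.items():
    if product == "Levi's jeans":     # selection on the materialized view
        yearly[quarter[:4]] += total  # roll up quarter -> year
print(dict(yearly))  # {'1996': 4200.0}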
In general, there may be several candidate materialized views
that can be used to answer a query. If a view V has the same
set of dimensions as Q, if the selection clause in Q implies the
selection clause in V, and if the group-by columns in Q are a
subset of the group-by columns in V (so that V can be rolled
up to Q’s grouping), then view V can act as a
generator of Q. Given a set of materialized views M and a query
Q, we can define a set of minimal generators M’ for Q (i.e.,
the smallest set of generators such that every other generator
generates some member of M’). There can be multiple
minimal generators for a query. For example, given a query
that asks for total sales of clothing in Washington State, the
following two views are both generators: (a) total sales by
each state for each product (b) total sales by each city for
each category. The notion of minimal generators can be used
by the optimizer to narrow the search for the appropriate
materialized view to use. On the commercial side, HP
Intelligent Warehouse pioneered the use of minimal
generators to answer queries. While we have defined the
notion of a generator in a restricted way, the general problem
of optimizing queries in the presence of multiple
materialized views is more complex. In the special case of
Select-Project-Join queries, there has been some work in this
area.14 15 16
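The sketch below captures this generator test in a deliberately simplified form: selections are modelled as sets of equality predicates, so "Q implies V" reduces to a subset check, and dimension hierarchies (such as city rolling up to state) are ignored; real view matching must handle both, and the example view and query are invented.

def can_generate(view, query):
    """True if `view` can be selected and rolled up to answer `query`."""
    return (
        view["dimensions"] == query["dimensions"]
        and view["selection"] <= query["selection"]  # Q's selection implies V's
        and query["group_by"] <= view["group_by"]    # V can be rolled up to Q
    )

q = {  # total sales by state for clothing products
    "dimensions": {"Store", "Product"},
    "selection": {("category", "clothing")},
    "group_by": {"state"},
}
v = {  # materialized: total sales by state and category, no selection
    "dimensions": {"Store", "Product"},
    "selection": set(),
    "group_by": {"state", "category"},
}
print(can_generate(v, q))  # True: apply the selection, roll up over category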
Transformation of Complex SQL Queries
The problem of finding efficient techniques for processing
complex queries has been of keen interest in query
optimization. In a way, decision support systems provide a
testing ground for some of the ideas that have been studied
before. We will only summarize some of the key
contributions.
There has been substantial work on “unnesting” complex
SQL queries containing nested subqueries by translating them
into single block SQL queries when certain syntactic
restrictions are satisfied17 18 19 20. Another direction that has
been pursued in optimizing nested subqueries is reducing the
number of invocations and batching invocation of inner
subqueries by semi-join like techniques21 22. Likewise, the
problem of flattening queries containing views has been a
topic of interest. The case where participating views are SPJ
queries is well understood. The problem is more complex
when one or more of the views contain aggregation23.
Naturally, this problem is closely related to the problem of
commuting group-by and join operators. However,
commuting group-by and join is applicable in the context of
single block SQL queries as well.24 25 26 An overview of the
field appears in a recent paper27.
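As a small illustration of why unnesting matters, the sketch below evaluates a correlated aggregate subquery ("orders whose amount exceeds the average amount for their city") both naively, re-invoking the inner query once per outer row, and in an unnested form that computes the group-by once; the data is invented, and the rewrites cited above handle subtleties (duplicates, empty groups) that this toy version ignores.

from collections import defaultdict

orders = [("Seattle", 120.0), ("Seattle", 80.0), ("Portland", 95.0), ("Seattle", 130.0)]

# Nested evaluation: the inner aggregate is recomputed for every outer row.
def city_avg(city):
    amounts = [a for c, a in orders if c == city]
    return sum(amounts) / len(amounts)

nested = [(c, a) for c, a in orders if a > city_avg(c)]

# Unnested evaluation: aggregate once per city, then a single filtering pass.
totals = defaultdict(lambda: [0.0, 0])
for c, a in orders:
    totals[c][0] += a
    totals[c][1] += 1
avg = {c: s / n for c, (s, n) in totals.items()}
unnested = [(c, a) for c, a in orders if a > avg[c]]

assert nested == unnested  # same answer, far fewer passes over the inner data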
Parallel Processing
Parallelism plays a significant role in processing massive
databases. Teradata pioneered some of the key technology.
All major vendors of database management systems now
offer data partitioning and parallel query processing
technology. The article by DeWitt and Gray provides an
overview of this area28. One interesting technique relevant to
the read-only environment of decision support systems is that
of piggybacking scans requested by multiple queries (used in
Redbrick). Piggybacking scans reduces the total work as well
as the response time by overlapping the scans of multiple
concurrent requests.
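A minimal sketch of the idea, with invented queries and data: each registered query contributes a filter predicate and a running aggregate, and a single shared scan of the fact table serves all of them.

sales = [("Seattle", "jacket", 120.0), ("Portland", "shoes", 80.0),
         ("Seattle", "shoes", 60.0), ("Seattle", "jacket", 40.0)]

# Each "query" is a predicate plus the running total it maintains.
queries = {
    "seattle_total": (lambda row: row[0] == "Seattle", [0.0]),
    "jacket_total": (lambda row: row[1] == "jacket", [0.0]),
}

# One shared scan serves every registered query.
for row in sales:
    for predicate, total in queries.values():
        if predicate(row):
            total[0] += row[2]

print({name: total[0] for name, (_, total) in queries.items()})
# {'seattle_total': 220.0, 'jacket_total': 160.0}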
Server Architectures for Query Processing
Traditional relational servers were not geared towards the
intelligent use of indices and other requirements for
supporting multidimensional views of data. However, all
relational DBMS vendors have now moved rapidly to support
these additional requirements. In addition to the traditional
relational servers, there are three other categories of servers
that were developed specifically for decision support.
• Specialized SQL Servers: Redbrick is an example of this
class of servers. The objective here is to provide
advanced query language and query processing support
for SQL queries over star and snowflake schemas in
read-only environments.
• ROLAP Servers: These are intermediate servers that sit
between a relational back end server (where the data in
the warehouse is stored) and client front end tools.
MicroStrategy is an example of such a server. They
extend traditional relational servers with specialized
middleware to efficiently support multidimensional
OLAP queries, and they typically optimize for specific
back end relational servers. They identify the views that
are to be materialized, rephrase given user queries in
terms of the appropriate materialized views, and generate
multi-statement SQL for the back end server. They also
provide additional services such as scheduling of queries
and resource assignment (e.g., to prevent runaway
queries). There has also been a trend to tune the ROLAP
servers for domain-specific ROLAP tools. The main
strength of ROLAP servers is that they exploit the
scalability and the transactional features of relational
systems. However, intri