maintains the relationships between a foreign key with its
matching primary keys. In the context of a star schema, a join
index can relate the values of one or more attributes of a
dimension table to matching rows in the fact table. For
example, consider the schema of Figure 3. There can be a
join index on City that maintains, for each city, a list of RIDs
of the tuples in the fact table that correspond to sales in that
city. Thus a join index essentially precomputes a binary join.
Multikey join indices can represent precomputed n-way joins.
For example, over the Sales database it is possible to
construct a multidimensional join index from (Cityname,
Productname) to the fact table. Thus, the index entry for
(Seattle, jacket) points to RIDs of those tuples in the Sales
table that have the above combination. Using such a
multidimensional join index can sometimes provide savings
over taking the intersection of separate indices on Cityname
and Productname. Join indices can be used with bitmap
representations for the RID lists for efficient join
processing12.
Finally, decision support databases contain a significant
amount of descriptive text and so indices to support text
search are useful as well.
Materialized Views and their Usage
Many queries over data warehouses require summary data,
and, therefore, use aggregates. Hence, in addition to indices,
materializing summary data can help to accelerate many
common queries. For example, in an investment environment,
a large majority of the queries may be based on the
performance of the most recent quarter and the current fiscal
year. Having summary data on these parameters can
significantly speed up query processing.
The challenges in exploiting materialized views are not unlike
those in using indices: (a) identify the views to materialize,
(b) exploit the materialized views to answer queries, and (c)
efficiently update the materialized views during load and
refresh. The currently adopted industrial solutions to these
problems consider materializing views that have a relatively
simple structure. Such views consist of joins of the fact table
with a subset of dimension tables (possibly after some
selections on those dimensions), with the aggregation of one
or more measures grouped by a set of attributes from the
dimension tables. The structure of these views is a little more
complex when the underlying schema is a snowflake.
Despite the restricted form, there is still a wide choice of
views to materialize. The selection of views to materialize
must take into account workload characteristics, the costs for
incremental update, and upper bounds on storage
requirements. Under simplifying assumptions, a greedy
algorithm was shown to have good performance13. A related
problem that underlies optimization as well as choice of
materialized views is that of estimating the effect of
aggregation on the cardinality of the relations.
A simple, but extremely useful, strategy for using a
materialized view is to use selection on the materialized view,
or rollup on the materialized view by grouping and
aggregating on additional columns. For example, assume that
a materialized view contains the total sales by quarter for
each product. This materialized view can be used to answer a
query that requests the total sales of Levi’s jeans for the year
by first applying the selection and then rolling up from
quarters to years. It should be emphasized that the ability to
do roll-up from a partially aggregated result, relies on
algebraic properties of the aggregating functions (e.g., Sum
can be rolled up, but some other statistical function may not
be).
In general, there may be several candidate materialized views
that can be used to answer a query. If a view V has the same
set of dimensions as Q, if the selection clause in Q implies the
selection clause in V, and if the group-by columns in V are a
subset of the group-by columns in Q, then view V can act as a
generator of Q. Given a set of materialized views M, a query
Q, we can define a set of minimal generators M’ for Q (i.e.,
smallest set of generators such that all other generators
generate some member of M’). There can be multiple
minimal generators for a query. For example, given a query
that asks for total sales of clothing in Washington State, the
following two views are both generators: (a) total sales by
each state for each product (b) total sales by each city for
each category. The notion of minimal generators can be used
by the optimizer to narrow the search for the appropriate
materialized view to use. On the commercial side, HP
Intelligent Warehouse pioneered the use of the minimal
generators to answer queries. While we have defined the
notion of a generator in a restricted way, the general problem
of optimizing queries in the presence of multiple
materialized views is more complex. In the special case of
Select-Project-Join queries, there has been some work in this
area.14 15 16
Transformation of Complex SQL Queries
The problem of finding efficient techniques for processing
complex queries has been of keen interest in query
optimization. In a way, decision support systems provide a
testing ground for some of the ideas that have been studied
before. We will only summarize some of the key
contributions.
There has been substantial work on “unnesting” complex
SQL queries containing nested subqueries by translating them
into single block SQL queries when certain syntactic
restrictions are satisfied17 18 19 20. Another direction that has
been pursued in optimizing nested subqueries is reducing the
number of invocations and batching invocation of inner
524
subqueries by semi-join like techniques21 22. Likewise, the
problem of flattening queries containing views has been a
topic of interest. The case where participating views are SPJ
queries is well understood. The problem is more complex
when one or more of the views contain aggregation23.
Naturally, this problem is closely related to the problem of
commuting group-by and join operators. However,
commuting group-by and join is applicable in the context of
single block SQL queries as well.24 25 26 An overview of the
field appears in a recent paper27.
Parallel Processing
Parallelism plays a significant role in processing massive
databases. Teradata pioneered some of the key technology.
All major vendors of database management systems now
offer data partitioning and parallel query processing
technology. The article by Dewitt and Gray provides an
overview of this area28 . One interesting technique relevant to
the read-only environment of decision support systems is that
of piggybacking scans requested by multiple queries (used in
Redbrick). Piggybacking scan reduces the total work as well
as response time by overlapping scans of multiple concurrent
requests.
Server Architectures for Query Processing
Traditional relational servers were not geared towards the
intelligent use of indices and other requirements for
supporting multidimensional views of data. However, all
relational DBMS vendors have now moved rapidly to support
these additional requirements. In addition to the traditional
relational servers, there are three other categories of servers
that were developed specifically for decision support.
• Specialized SQL Servers: Redbrick is an example of this
class of servers. The objective here is to provide
advanced query language and query processing support
for SQL queries over star and snowflake schemas in
read-only environments.
• ROLAP Servers: These are intermediate servers that sit
between a relational back end server (where the data in
the warehouse is stored) and client front end tools.
Microstrategy is an example of such servers. They
extend traditional relational servers with specialized
middleware to efficiently support multidimensional
OLAP queries, and they typically optimize for specific
back end relational servers. They identify the views that
are to be materialized, rephrase given user queries in
terms of the appropriate materialized views, and generate
multi-statement SQL for the back end server. They also
provide additional services such as scheduling of queries
and resource assignment (e.g., to prevent runaway
queries). There has also been a trend to tune the ROLAP
servers for domain specific ROLAP tools. The main
strength of ROLAP servers is that they exploit the
scalability and the transactional features of relational
systems. However, intri
รักษาความสัมพันธ์ระหว่างคีย์มีความตรงกับคีย์หลัก ในบริบทของ schema ดาว รวมดัชนีสามารถเกี่ยวข้องกับค่าของแอตทริบิวต์อย่าง น้อยหนึ่งของการตารางมิติตรงกับแถวในตารางข้อเท็จจริง สำหรับตัวอย่าง พิจารณาเค้าร่างของรูปที่ 3 สามารถมีการดัชนีรวมในเมืองที่รักษา ในแต่ละเมือง รายการ RIDsของ tuples ในตารางข้อเท็จจริงที่สอดคล้องกับการขายในที่เมือง ดังนั้น ดัชนีรวมเป็น precomputes รวมไบนารีดัชนีรวม multikey สามารถแสดง precomputed n-วิธีรวมตัวอย่าง ผ่านฐานข้อมูลการขาย เป็นการสร้างดัชนีแบบหลายมิติรวมจาก (ชื่อเมืองProductname) ตารางความจริง ดังนั้น รายการดัชนีสำหรับ(ซีแอตเทิล เสื้อ) ไป RIDs tuples เหล่านั้นในการขายตารางที่มีชุดข้างต้น ใช้เช่นการดัชนีรวมหลายมิติบางครั้งสามารถให้ประหยัดไปนำของดัชนีแยกในชื่อเมืองและ Productname รวมดัชนีที่สามารถใช้กับบิตแมปนำเสนอรายการ RID สำหรับรวมมีประสิทธิภาพprocessing12สุดท้าย ฐานข้อมูลสนับสนุนการตัดสินใจประกอบด้วยความสำคัญจำนวนข้อความและดัชนีเพื่อให้การสนับสนุนข้อความค้นหามีประโยชน์เช่นMaterialized มุมมองและการใช้งานของพวกเขาแบบสอบถามจำนวนมากผ่านคลังข้อมูลต้องการข้อมูลสรุปและ ดังนั้น ใช้ผล นอกจากดัชนี ดังนั้นmaterializing สรุปข้อมูลสามารถช่วยเร่งมากสอบถามทั่วไป ตัวอย่าง ในสภาพแวดล้อมการลงทุนa large majority of the queries may be based on theperformance of the most recent quarter and the current fiscalyear. Having summary data on these parameters cansignificantly speed up query processing.The challenges in exploiting materialized views are not unlikethose in using indices: (a) identify the views to materialize,(b) exploit the materialized views to answer queries, and (c)efficiently update the materialized views during load andrefresh. The currently adopted industrial solutions to theseproblems consider materializing views that have a relativelysimple structure. Such views consist of joins of the fact tablewith a subset of dimension tables (possibly after someselections on those dimensions), with the aggregation of oneor more measures grouped by a set of attributes from thedimension tables. The structure of these views is a little morecomplex when the underlying schema is a snowflake.Despite the restricted form, there is still a wide choice ofviews to materialize. The selection of views to materializemust take into account workload characteristics, the costs forincremental update, and upper bounds on storagerequirements. Under simplifying assumptions, a greedyalgorithm was shown to have good performance13. A relatedproblem that underlies optimization as well as choice ofmaterialized views is that of estimating the effect ofaggregation on the cardinality of the relations.A simple, but extremely useful, strategy for using amaterialized view is to use selection on the materialized view,or rollup on the materialized view by grouping andaggregating on additional columns. For example, assume thata materialized view contains the total sales by quarter foreach product. This materialized view can be used to answer aquery that requests the total sales of Levi’s jeans for the yearby first applying the selection and then rolling up fromquarters to years. It should be emphasized that the ability todo roll-up from a partially aggregated result, relies onalgebraic properties of the aggregating functions (e.g., Sumcan be rolled up, but some other statistical function may notbe).In general, there may be several candidate materialized viewsthat can be used to answer a query. If a view V has the sameset of dimensions as Q, if the selection clause in Q implies theselection clause in V, and if the group-by columns in V are asubset of the group-by columns in Q, then view V can act as agenerator of Q. Given a set of materialized views M, a queryQ, we can define a set of minimal generators M’ for Q (i.e.,smallest set of generators such that all other generatorsgenerate some member of M’). There can be multipleminimal generators for a query. For example, given a querythat asks for total sales of clothing in Washington State, thefollowing two views are both generators: (a) total sales byeach state for each product (b) total sales by each city foreach category. The notion of minimal generators can be usedby the optimizer to narrow the search for the appropriatematerialized view to use. On the commercial side, HPIntelligent Warehouse pioneered the use of the minimalgenerators to answer queries. While we have defined thenotion of a generator in a restricted way, the general problemof optimizing queries in the presence of multiplematerialized views is more complex. In the special case ofSelect-Project-Join queries, there has been some work in thisarea.14 15 16Transformation of Complex SQL QueriesThe problem of finding efficient techniques for processingcomplex queries has been of keen interest in queryoptimization. In a way, decision support systems provide atesting ground for some of the ideas that have been studiedbefore. We will only summarize some of the keycontributions.There has been substantial work on “unnesting” complexSQL queries containing nested subqueries by translating theminto single block SQL queries when certain syntacticrestrictions are satisfied17 18 19 20. Another direction that hasbeen pursued in optimizing nested subqueries is reducing thenumber of invocations and batching invocation of inner524subqueries by semi-join like techniques21 22. Likewise, theproblem of flattening queries containing views has been atopic of interest. The case where participating views are SPJqueries is well understood. The problem is more complexwhen one or more of the views contain aggregation23.Naturally, this problem is closely related to the problem ofcommuting group-by and join operators. However,commuting group-by and join is applicable in the context ofsingle block SQL queries as well.24 25 26 An overview of thefield appears in a recent paper27.Parallel ProcessingParallelism plays a significant role in processing massivedatabases. Teradata pioneered some of the key technology.All major vendors of database management systems nowoffer data partitioning and parallel query processingtechnology. The article by Dewitt and Gray provides anoverview of this area28 . One interesting technique relevant tothe read-only environment of decision support systems is thatof piggybacking scans requested by multiple queries (used inRedbrick). Piggybacking scan reduces the total work as wellas response time by overlapping scans of multiple concurrentrequests.Server Architectures for Query ProcessingTraditional relational servers were not geared towards theintelligent use of indices and other requirements forsupporting multidimensional views of data. However, allrelational DBMS vendors have now moved rapidly to supportthese additional requirements. In addition to the traditionalrelational servers, there are three other categories of serversthat were developed specifically for decision support.• Specialized SQL Servers: Redbrick is an example of this
class of servers. The objective here is to provide
advanced query language and query processing support
for SQL queries over star and snowflake schemas in
read-only environments.
• ROLAP Servers: These are intermediate servers that sit
between a relational back end server (where the data in
the warehouse is stored) and client front end tools.
Microstrategy is an example of such servers. They
extend traditional relational servers with specialized
middleware to efficiently support multidimensional
OLAP queries, and they typically optimize for specific
back end relational servers. They identify the views that
are to be materialized, rephrase given user queries in
terms of the appropriate materialized views, and generate
multi-statement SQL for the back end server. They also
provide additional services such as scheduling of queries
and resource assignment (e.g., to prevent runaway
queries). There has also been a trend to tune the ROLAP
servers for domain specific ROLAP tools. The main
strength of ROLAP servers is that they exploit the
scalability and the transactional features of relational
systems. However, intri
การแปล กรุณารอสักครู่..
