Consider the Purchases relation shown in Figure 24.1. Assume that we want to find pairs of customers and items such that the customer has purchased the item more than ve times. We can express this query in SQL as follows
Think about how this query would be evaluated by a relational DBMS. Conceptually, for each (custid, item) pair, we need to check whether the sum of the qty field is greater than 5. One approach is to make a scan over the Purchases relation and maintain running sums for each (custid,item) pair. This is a feasible execution strategy as long as the number of pairs is small enough to t into main memory. If the number of pairs is larger than main memory, more expensive query evaluation plans that involve either sorting or hashing have to be used.
The query has an important property that is not exploited by the above execution strategy: Even though the Purchases relation is potentially very large and the number of (custid,item) groups can be huge, the output of the query is likely to be relatively small because of the condition in the HAVING clause. Only groups where the customer has purchased the item more than five times appear in the output. For example, there are nine groups in the query over the Purchases relation shown in Figure 24.1, although the output only contains three records. The number of groups is very large, but the answer to the query|the tip of the iceberg|is usually very small. Therefore, we call such a query an iceberg query. In general, given a relational schema R with attributes A1, A2, ..., Ak, and B and an aggregation function aggr, an iceberg query has the following structure:
Traditional query plans for this query that use sorting or hashing first compute the value of the aggregation function for all groups and then eliminate groups that do not satisfy the condition in the HAVING clause.
Comparing the query with the problem of finding frequent itemsets that we discussed in the previous section, there is a striking similarity. Consider again the Purchases relation shown in Figure 24.1 and the iceberg query from the beginning of this section. We are interested in (custid, item) pairs that have SUM (P.qty) > 5. Using a variation of the a priori property, we can argue that we only have to consider values of the custid field where the customer has purchased at least ve items overall. We can generate such items through the following query:
Similarly, we can restrict the candidate values for the item field through the following query:
If we restrict the computation of the original iceberg query to (custid, item) groups where the eld values are in the output of the previous two queries, we eliminate a large number of (custid, item) pairs a priori! Thus, a possible evaluation strategy is to first compute candidate values for the custid and item fields, and only to use combinations of these values in the evaluation of the original iceberg query. We first generate candidate field values for individual fields and only use those values that survive the a priori pruning step as expressed in the two previous queries. Thus, the iceberg query is amenable to the same bottom-up evaluation strategy that is used to find frequent itemsets. In particular, we can use the a priori property as follows: We only keep a counter for a group if each individual component of the group satisfies the condition expressed in the HAVING clause. The performance improvements of this alternative evaluation strategy over traditional query plans can be very significant in practice.
Even though the bottom-up query processing strategy eliminates many groups a priori, the number of (custid, item) pairs can still be very large in practice|even larger than main memory.