Oracle8i Concepts Release 8.1.5 A67781-01 |
|
Will you, won't you, will you, won't you, will you join the dance?
Lewis Carroll
This chapter discusses how the Oracle optimizer executes SQL statements that contain joins, anti-joins, and semi-joins. It also describes how the optimizer can use bitmap indexes to execute star queries, which join a fact table to multiple dimension tables. This chapter includes:
To choose an execution plan for a join statement, the optimizer must make these interrelated decisions:
access paths |
As for simple statements, the optimizer must choose an access path to retrieve data from each table in the join statement. (See "Choosing Access Paths".) |
join operations |
To join each pair of row sources, Oracle must perform one of these operations: |
join order |
To execute a statement that joins more than two tables, Oracle joins two of the tables, and then joins the resulting row source to the next table. This process is continued until all tables are joined into the result. |
The optimizer can use the following operations to join two row sources:
To perform a nested loops join, Oracle follows these steps:
Figure 24-1 shows the execution plan for this statement using a nested loops join:
SELECT * FROM emp, dept WHERE emp.deptno = dept.deptno;
To execute this statement, Oracle performs these steps:
Oracle can only perform a sort-merge join for an equijoin. To perform a sort-merge join, Oracle follows these steps:
Figure 24-2 shows the execution plan for this statement using a sort-merge join:
SELECT * FROM emp, dept WHERE emp.deptno = dept.deptno;
To execute this statement, Oracle performs these steps:
Oracle can perform a cluster join only for an equijoin that equates the cluster key columns of two tables in the same cluster. In a cluster, rows from both tables with the same cluster key values are stored in the same blocks, so Oracle only accesses those blocks.
Additional Information:
Oracle8i Tuning provides guidelines for deciding which tables to cluster for best performance. |
Figure 24-3 shows the execution plan for this statement in which the EMP and DEPT tables are stored together in the same cluster:
SELECT * FROM emp, dept WHERE emp.deptno = dept.deptno;
To execute this statement, Oracle performs these steps:
A cluster join is nothing more than a nested loops join involving two tables that are stored together in a cluster. Since each row from the DEPT table is stored in the same data blocks as the matching rows in the EMP table, Oracle can access matching rows most efficiently.
Oracle can only perform a hash join for an equijoin. Hash join is not available with rule-based optimization. You must enable hash join optimization, using the initialization parameter HASH_JOIN_ENABLED (which can be set with the ALTER SESSION command) or the USE_HASH hint.
To perform a hash join, Oracle follows these steps:
Figure 24-4 shows the execution plan for this statement using a hash join:
SELECT * FROM emp, dept WHERE emp.deptno = dept.deptno;
To execute this statement, Oracle performs these steps:
The initialization parameter HASH_AREA_SIZE controls the amount of memory used for hash join operations and the initialization parameter HASH_MULTIBLOCK_IO_COUNT controls the number of blocks a hash join operation should read and write concurrently.
Additional Information:
See Oracle8i Tuning for more information about these initialization parameters and the USE_HASH hint. |
This section describes how the optimizer chooses an execution plan for a join statement:
Note these considerations that apply to the cost-based and rule-based approaches:
With the cost-based approach, the optimizer generates a set of execution plans based on the possible join orders, join operations, and available access paths. The optimizer then estimates the cost of each plan and chooses the one with the lowest cost. The optimizer estimates costs in these ways:
With the cost-based approach, the optimizer's choice of join orders can be overridden with the ORDERED hint. If the ORDERED hint specifies a join order that violates the rule for outer join, the optimizer ignores the hint and chooses the order. You can also override the optimizer's choice of join operations with hints.
With the rule-based approach, the optimizer follows these steps to choose an execution plan for a statement that joins R tables:
Usually, the optimizer does not consider the order in which tables appear in the FROM clause when choosing an execution plan. The optimizer makes this choice by applying the following rules in order:
For a view that is on the right side of an outer join, the optimzer can use one of two methods, depending on how many base tables the view accesses:
A view that has one base table and is on the right side of an outer join can be merged into the query block of an accessing statement. (See "Merging the View's Query into the Statement".) View merging is possible even if an expression in the view can return a non-null value for a NULL.
Consider the view NAME_VIEW, which concatenates first and last names from the EMP table:
CREATE VIEW name_view AS SELECT emp.firstname || emp.lastname AS emp_fullname, emp.deptno FROM emp;
and consider this outer join statement, which finds the names of all employees in London and their departments, as well as any departments that have no employees:
SELECT dept.deptno, name_view.emp_fullname FROM emp_fullname, dept WHERE dept.deptno = name_view.deptno(+) AND dept.deptloc = 'London';
The optimizer merges the view's query into the outer join statement. The resulting statement looks like this:
SELECT dept.deptno, DECODE(emp.rowid, NULL, NULL, emp.firstname || emp.lastname) FROM emp, dept WHERE dept.deptno = emp.deptno(+) AND dept.deptloc = 'London';
The transformed statement selects only the employees who work in London.
For a view with multiple base tables on the right side of an outer join, the optimizer can push the join predicate into the view (see "Pushing the Predicate into the View") if the initialization parameter OPTIMIZER_FEATURES_ENABLE is set to TRUE or the accessing query contains the PUSH_JOIN_PRED hint.
Pushing a join predicate is a cost-based transformation that can enable more efficient access path and join methods, such as transforming hash joins into nested loop joins, and full table scans to index scans.
Consider the view LONDON_EMP, which selects the employees who work in London:
CREATE VIEW london_emp AS SELECT emp.ename FROM emp, dept WHERE emp.deptno = dept.deptno AND dept.deptloc = 'London';
and consider this outer join statement, which finds the engineers and accountants working in London who received bonuses:
SELECT bonus.job, london_emp.ename FROM bonus, london_emp WHERE bonus.job IN ('engineer', 'accountant') AND bonus.ename = london_emp.ename(+);
The optimizer pushes the outer join predicate into the view. The resulting statement (which does not conform to standard SQL syntax) looks like this:
SELECT bonus.job, london_emp.ename FROM bonus, (SELECT emp.ename FROM emp, dept WHERE bonus.ename = london_emp.ename(+) AND emp.deptno = dept.deptno AND dept.deptloc = 'London') WHERE bonus.job IN ('engineer', 'accountant');
An anti-join returns rows from the left side of the predicate for which there is no corresponding row on the right side of the predicate. That is, it returns rows that fail to match (NOT IN) the subquery on the right side. For example, an anti-join can select a list of employees who are not in a particular set of departments:
SELECT * FROM emp WHERE deptno NOT IN (SELECT deptno FROM dept WHERE loc = 'HEADQUARTERS');
The optimizer uses a nested loops algorithm for NOT IN subqueries by default, unless the initialization parameter ALWAYS_ANTI_JOIN is set to MERGE or HASH and various required conditions are met that allow the transformation of the NOT IN subquery into a sort-merge or hash anti-join. You can place a MERGE_AJ or HASH_AJ hint in the NOT IN subquery to specify which algorithm the optimizer should use.
A semi-join returns rows that match an EXISTS subquery, without duplicating rows from the left side of the predicate when multiple rows on the right side satisfy the criteria of the subquery. For example:
SELECT * FROM dept WHERE EXISTS (SELECT * FROM emp WHERE dept.ename = emp.ename AND emp.bonus > 5000);
In this query, only one row needs to be returned from DEPT even though many rows in EMP might match the subquery. If there is no index on the BONUS column in EMP, a semi-join can be used to improve query performance.
The optimizer uses a nested loops algorithm for EXISTS subqueries by default, unless the initialization parameter ALWAYS_SEMI_JOIN is set to MERGE or HASH and various required conditions are met. You can place a MERGE_SJ or HASH_SJ hint in the EXISTS subquery to specify which algorithm the optimizer should use.
One type of data warehouse design centers around what is known as a "star" schema, which is characterized by one or more very large fact tables that contain the primary information in the data warehouse and a number of much smaller dimension tables (or "lookup" tables), each of which contains information about the entries for a particular attribute in the fact table.
A star query is a join between a fact table and a number of lookup tables. Each lookup table is joined to the fact table using a primary-key to foreign-key join, but the lookup tables are not joined to each other.
Cost-based optimization recognizes star queries and generates efficient execution plans for them. (Star queries are not recognized by rule-based optimization.)
A typical fact table contains keys and measures. For example, a simple fact table might contain the measure Sales, and keys Time, Product, and Market. In this case there would be corresponding dimension tables for Time, Product, and Market. The Product dimension table, for example, would typically contain information about each product number that appears in the fact table.
A star join is a primary-key to foreign-key join of the dimension tables to a fact table. The fact table normally has a concatenated index on the key columns to facilitate this type of join.
Additional Information:
See Oracle8i Tuning for more information about dimensions and data warehouses. |
This section discusses star queries with reference to the following example:
SELECT SUM(dollars) FROM facts, time, product, market WHERE market.stat = 'New York' AND product.brand = 'MyBrand' AND time.year = 1995 AND time.month = 'March' /* Joins*/ AND time.key = facts.tkey AND product.pkey = facts.pkey AND market.mkey = facts.mkey;
To execute star queries efficiently, you must use cost-based optimization. Begin by gathering statistics (using the DBMS_STATS package or the ANALYZE command) for each of the tables accessed by the query.
In the example above, you would construct a concatenated index on the columns tkey, pkey, and mkey. The order of the columns in the index is critical to performance. the columns in the index should take advantage of any ordering of the data. If rows are added to the large table in time order, then tkey should be the first key in the index. When the data is a static extract from another database, it is worthwhile to sort the data on the key columns before loading it.
If all queries specify predicates on each of the small tables, a single concatenated index suffices. If queries that omit leading columns of the concatenated index are frequent, additional indexes may be useful. In this example, if there are frequent queries that omit the time table, an index on pkey and mkey can be added.
Usually, if you analyze the tables the optimizer will choose an efficient star plan. You can also use hints to improve the plan. The most precise method is to order the tables in the FROM clause in the order of the keys in the index, with the large table last. Then use the following hints:
/*+ ORDERED USE_NL(facts) INDEX(facts fact_concat) */
A more general method is to use the STAR hint /*+ STAR */.
Each of the small tables can be replaced by a join of several smaller tables. For example, the product table could be normalized into brand and manufacturer tables. Normalization of all of the small tables can cause performance problems. One problem is caused by the increased number of permutations that the optimizer must consider. The other problem is the result of multiple executions of the small table joins.
You can solve both of these problems by using denormalized views. For example:
CREATE VIEW prodview AS SELECT /*+ NO_MERGE */ * FROM brands, mfgrs WHERE brands.mfkey = mfgrs.mfkey;
This hint reduces the optimizer's search space and causes caching of the result of the view.
The star transformation is a cost-based query transformation aimed at executing star queries efficiently. Whereas the star optimization works well for schemas with a small number of dimensions and dense fact tables, the star transformation may be considered as an alternative if any of the following holds true:
The star transformation does not rely on computing a Cartesian product of the dimension tables, which makes it better suited for cases where fact table sparsity and/or a large number of dimensions would lead to a large Cartesian product with few rows having actual matches in the fact table. In addition, rather than relying on concatenated indexes, the star transformation is based on combining bitmap indexes on individual fact table columns.
The transformation can thus choose to combine indexes corresponding precisely to the constrained dimensions. There is no need to create many concatenated indexes where the different column orders match different patterns of constrained dimensions in different queries.
Attention: Bitmap indexes are available only if you have purchased the Oracle8i Enterprise Edition. In Oracle8i, bitmap indexes are not available and star query processing uses B*-tree indexes. In the Oracle8i Enterprise Edition, the parallel bitmap index join algorithm is also available for star query processing. See Getting to Know Oracle8i for more information about the features available in Oracle8i and Oracle8i Enterprise Edition. |
The star transformation works by generating new subqueries that can be used to drive a bitmap index access path for the fact table.
Consider a simple case with three dimension tables, "d1", "d2", and "d3", and a fact table, "fact". The following query:
EXPLAIN PLAN FOR SELECT * FROM fact, d1, d2, d3 WHERE fact.c1 = d1.c1 AND fact.c2 = d2.c1 AND fact.c3 = d3.c1 AND d1.c2 IN (1, 2, 3, 4) AND d2.c2 < 100 AND d3.c2 = 35
gets transformed by adding three subqueries:
SELECT * FROM fact, d1, d2, d3 WHERE fact.c1 = d1.c1 AND fact.c2 = d2.c1 AND fact.c3 = d3.c3 AND d1.c2 IN (1, 2, 3, 4) AND d2.c2 < 100 AND d3.c2 = 35 AND fact.c1 IN (SELECT d1.c1 FROM d1 WHERE d1.c2 IN (1, 2, 3, 4)) AND fact.c2 IN (select d2.c1 FROM d2 WHERE d2.c2 < 100) AND fact.c3 IN (SELECT d3.c1 FROM d3 WHERE d3.c2 = 35)
In addition, if it is cost effective, one or more of the subqueries may be further optimized by storing its results in a temporary table. Then the subquery is replaced with a subquery on the temporary table. For example, if the first subquery above was selected for this temporary table transformation, a temporary table named ORA_TEMP_1_123 is created and filled with the results of the subquery:
SELECT d1.c1 from d1 where d1.c2 in (1, 2, 3, 4)
The fully transformed query is now:
SELECT * FROM fact, ORA_TEMP_1_123, d2, d3 WHERE fact.c1 = ORA_TEMP_1_123.c1 AND fact.c2 = d2.c1 and fact.c3 = d3.c1 AND ORA_TEMP_1_123.c1 IN (1, 2, 3, 4) AND d2.c2 < 100 AND d3.c2 = 35 AND fact.c1 IN (SELECT ORA_TEMP_1_123.c1 FROM ORA_TEMP_1_123) AND fact.c2 IN (SELECT d2.c1 FROM d2 WHERE d2.c2 < 100) AND fact.c3 IN (SELECT d3.c1 FROM d3 WHERE d3.c2 = 35)
Given that there are bitmap indexes on fact.c1, fact.c2, and fact.c3, the newly generated subqueries can be used to drive a bitmap index access path in the following way.
For each value of c1 that is retrieved from the first subquery, the bitmap for that value is retrieved from the index on fact.c1 and these bitmaps are merged. The result is a bitmap for precisely those rows in fact that match the condition on d1 in the subquery WHERE clause.
Similarly, the values from the second subquery are used together with the bitmap index on fact.c2 to produce a merged bitmap corresponding to the rows in fact that match the condition on d2 in the second subquery. The same operations apply to the third subquery. The three merged bitmaps can then be ANDed, resulting in a bitmap corresponding to those rows in fact that meet the conditions in all three subqueries simultaneously.
This bitmap can be used to access fact and retrieve the relevant rows. These are then joined to d1, d2, and d3 to produce the answer to the query. No Cartesian product is needed.
The following execution plan might result from the query above:
SELECT STATEMENT TEMP TABLE GENERATION TEMP TABLE GENERATION HASH JOIN HASH JOIN HASH JOIN TABLE ACCESS FACT BY INDEX ROWID BITMAP CONVERSION TO ROWIDS BITMAP AND BITMAP MERGE BITMAP KEY ITERATION TABLE ACCESS D3 FULL BITMAP INDEX FACT_C3 RANGE SCAN BITMAP MERGE BITMAP KEY ITERATION TABLE ACCESS ORA_TEMP_1_123 FULL BITMAP INDEX FACT_C1 RANGE SCAN BITMAP MERGE BITMAP KEY ITERATION TABLE ACCESS D2 FULL BITMAP INDEX FACT_C2 RANGE SCAN TABLE ACCESS ORA_TEMP_1_123 FULL TABLE ACCESS D2 FULL TABLE ACCESS D3 FULL
In this plan the fact table is accessed through a bitmap access path based on a bitmap AND of three merged bitmaps. The three bitmaps are generated by the BITMAP MERGE row source being fed bitmaps from row source trees underneath it. Each such row source tree consists of a BITMAP KEY ITERATION row source which fetches values from the subquery row source tree, which in this example is just a full table access, and for each such value retrieves the bitmap from the bitmap index. After the relevant fact table rows have been retrieved using this access path, they are joined with the dimension tables and temporary tables to produce the answer to the query. The two rows in the execution plan labelled "TEMP TABLE GENERATION" contain the SQL commands used to create and populate the temporary table. These commands are in the OTHER column of the execution plan, which was not displayed in the example above.
The star transformation is a cost-based transformation in the following sense. The optimizer generates and saves the best plan it can produce without the transformation. If the transformation is enabled, the optimizer then tries to apply it to the query and if applicable, generates the best plan using the transformed query. Based on a comparison of the cost estimates between the best plans for the two versions of the query, the optimizer will then decide whether to use the best plan for the transformed or untransformed version.
If the query requires accessing a large percentage of the rows in the fact table, it may well be better to use a full table scan and not use the tranformations. However, if the constraining predicates on the dimension tables are sufficiently selective that only a small portion of the fact table needs to be retrieved, the plan based on the transformation will probably be superior.
Note that the optimizer will generate a subquery for a dimension table only if it decides that it is reasonable to do so based on a number of criteria. There is no guarantee that subqueries will be generated for all dimension tables. The optimizer may also decide, based on the properties of the tables and the query, that the transformation does not merit being applied to a particular query. In this case the best regular plan will be used.
You can enable star transformation by setting the value of the initialization parameter STAR_TRANSFORMATION_ENABLED to TRUE. To use star transformation without temporary tables, set the value of the parameter to TEMP_DISABLE. Use the STAR_TRANSFORMATION hint to make the optimizer use the best plan in which the transformation has been used.
Star transformation is not supported for tables with any of the following characteristics:
In addition, temporary tables will not be used by star transformation under the following conditions: