Oracle8i Tuning Release 8.1.5 A67775-01 |
|
This chapter contains:
An enterprise data warehouse contains historical detailed data about the organization. Typically, data flows from one or more online transaction processing (OLTP) databases into the data warehouse on a monthly, weekly, or daily basis. The data is usually processed in a staging file before being added to the data warehouse. Data warehouses typically range in size from tens of gigabytes to a few terabytes, usually with the vast majority of the data stored in a few very large fact tables.
A data mart contains a subset of corporate data that is of value to a specific business unit, department, or set of users. Typically, a data mart is derived from an enterprise data warehouse.
One of the techniques employed in data warehouses to improve performance is the creation of summaries, or aggregates. They are a special kind of aggregate view which improves query execution times by precalculating expensive joins and aggregation operations prior to execution, and storing the results in a table in the database. For example, a table may be created which would contain the sum of sales by region and by product.
Today, organizations using summaries spend a significant amount of time manually creating summaries, identifying which ones to create, indexing the summaries, updating them, and advising their users on which ones to use. The introduction of summary management in the Oracle server changes the workload of the DBA dramatically and means the end-user no longer has to be aware of which summaries have been defined. The DBA creates one or more materialized views, which are the equivalent of a summary. The end-user queries the tables and views in the database and the query rewrite mechanism in the Oracle server automatically rewrites the SQL query to use the summary tables. This results in a significant improvement in response time for returning results from the query and eliminates the need for the end-user or database application to be aware of the summaries that exist within the data warehouse.
Although summaries are usually accessed indirectly via the query rewrite mechanism, an end-user or database application can construct queries which directly access the summaries. However, serious consideration should be given to whether users should be allowed to do this because, once the summaries are directly referenced in queries, the DBA will not be free to drop and create summaries without affecting applications.
The summaries or aggregates that are referred to in this book and in literature on data warehousing are created in Oracle using a schema object called a materialized view. Materialized views can be used to perform a number of roles, such as improving query performance or providing replicated data, as described below.
In data warehouses, materialized views can be used to precompute and store aggregated data such as sum of sales. Materialized views in these environments are typically referred to as summaries since they store summarized data. They can also be used to precompute joins with or without aggregations. So a materialized view is used to eliminate overhead associated with expensive joins or aggregations for a large or important class of queries.
In distributed environments, materialized views are used to replicate data at distributed sites and synchronize updates done at several sites with conflict resolution methods. The materialized views as replicas provide local access to data which otherwise would have to be accessed from remote sites.
Materialized views are used to download a subset of data from central servers to mobile clients, with periodic refreshes from the central servers and propagation of updates by clients back to the central servers.
This chapter is focused on the use of materialized views in data warehouses. Refer to Oracle8i Replication and Oracle8i Distributed Database Systems for details on distributed and mobile computing.
Summary management consists of:
Many large decision support system (DSS) databases have schemas that do not closely resemble a conventional data warehouse schema, but still require joins and aggregates. The use of summary management features imposes no schema restrictions, and may enable some existing DSS database applications to achieve large gains in performance without requiring a redesign of the database or application. This functionality is thus available to all database users.
Figure 28-1 illustrates where summary management is used in the warehousing cycle. It is available once the data has been transformed and loaded into the data warehouse. Therefore, referring to Figure 28-1, after the data has been transformed, staged, and loaded into the detail data in the warehouse, then the summary management process can be invoked. This means that summaries can be created, queries can be rewritten, and the advisor can be used to plan summary usage and creation.
Understanding the summary management process during the earliest stages of data warehouse design can yield large dividends later on in the form of higher performance, lower summary administration costs, and reduced storage requirements.
The summary management process begins with the creation of dimensions and hierarchies that describe the business relationships and common access patterns in the database. An analysis of the dimensions, combined with an understanding of the typical work load, can then be used to create materialized views. Materialized views improve query execution performance by pre-calculating expensive join or aggregation operations prior to execution time. Query rewrite then automatically recognizes when an existing materialized view can and should be used to satisfy a request, and can transparently rewrite a request to use a materialized view, thus improving performance.
Other considerations when building a warehouse include:
This improves scalabililty, simplifies system administration, and makes it possible to define local indexes that can be efficiently rebuilt.
In this case, only the corresponding local index partitions are rebuilt.
For this reason, it is strongly recommended that all fact table indexes should be defined as local indexes. For example, this can be accomplished by having a bitmap index on each key column (bitmap indexes are always local), and a single multi-key index that includes all the key columns, with the partitioning attribute as the leading column of the index.
The following clarifies some basic data warehousing terms:
Dimension tables usually change slowly over time and are not modified on a periodic schedule. They are typically not large, but they affect the performance of long-running decision support queries that consist of joins of fact tables with dimension tables, followed by aggregation to specific levels of the dimension hierarchies.
The vast majority of data in a data warehouse is stored in a few very large fact tables. They are updated periodically with data from one or more operational online transaction processing (OLTP) databases.
Fact tables include measures such as sales, units, and inventory.
Fact tables also contain one or more keys that organize the business transactions by the relevant business entities such as time, product, and market. In most cases, the fact keys are non-null, form a unique compound key of the fact table, and join with one and only one row of a dimension table.
The most common situations where you would find materialized views useful are in data warehousing applications and distributed systems. In warehousing applications, large amounts of data are processed and similar queries are frequently repeated. If these queries are pre-computed and the results stored in the data warehouse as a materialized view, using materialized views significantly improves performance by providing fast lookups into the set of results.
A materialized view definition can include any number of aggregates, as well as any number of joins. In several ways, a materialized view behaves like an index:
This chapter shows how materialized views are used in a data warehousing environment. However, the materialized view that is a key component of summary management can also be used in a distributed environment to manage replicated data. For further information, see Oracle8i Replication.
Before starting to define and use the various components of summary management, it is recommended that you review your schema design to, wherever possible, abide by these guidelines:
Guideline 1: |
Your dimensions should either be denormalized (each dimension contained in one table) or the joins between tables in a normalized or partially normalized dimension should guarantee that each child-side row joins with one and only one parent-side row. The benefits of maintaining this condition are described in "Creating a Dimension". |
|
If desired, this condition can be enforced by adding |
Guideline 2: |
If dimensions are denormalized or partially denormalized, hierarchical integrity must be maintained between the key columns of the dimension table. Each child key value must uniquely identify its parent key value, even if the dimension table is denormalized. Hierarchical integrity in a denormalized dimension can be verified by calling the |
Guideline 3: |
Fact tables and dimension tables should similarly guarantee that each fact table row joins with one and only one dimension table row. This condition must be declared, and optionally enforced, by adding |
Guideline 4: |
Incremental loads of your detail data should be done using the SQL*Loader direct-path option, or any bulk loader utility that uses Oracle's direct path interface (including |
Guideline 5: |
Horizontally partition your tables by a monotonically increasing time column if possible (preferably of type |
Guideline 6: |
After each load and before refreshing your materialized view, use the |
Guideline 7: |
Horizontally partition and index the materialized view like the fact tables. Include a local concatenated index on all the materialized view keys. |
While guidelines 1, 2, and 3 are each important during schema design, guidelines 1 and 2 are even more important than guideline 3. If your schema design does not follow guidelines 1 and 2, it does not then matter whether it follows guideline 3. Guidelines 1, 2, and 3 affect both query rewrite performance and materialized view refresh performance. Guideline 4 affects materialized view refresh performance only. If your schema design does not follow guideline 4, then incremental refresh of your materialized views will be either impossible or much less efficient.
If you are concerned with the time required to enable constraints and whether any constraints may be violated, use the ENABLE NOVALIDATE clause to turn on constraint checking without validating any of the existing constraints. The risk with this approach is that incorrect query results could occur if any constraints are broken. Therefore, this is a decision for the designer to determine how clean the data is and whether the risk of potential wrong results is too great.
Materialized view management can perform many useful functions, including query rewrite and materialized view refresh, even if your data warehouse design does not follow these guidelines; however, you will realize significantly greater query execution performance and materialized view refresh performance benefits, and you will require fewer materialized views if your schema design complies with these guidelines.
The availability of powerful tools to help automate the analysis and administration of the materialized views is an important factor in controlling data warehouse costs. The following Oracle tools are available to help you create and manage a data warehouse:
Data Mart Designer or Oracle Designer can be used to design the warehouse schema. Data is then extracted, transformed, and transferred (ETT) from the operational systems into the data warehouse or data mart. Data Mart Builder can be used to specify the ETT process, populate the target data mart, and automatically schedule loads and index rebuilds.
Discoverer can be used to query the database and queries executed via Discoverer will be rewritten when appropriate. The Discoverer summary wizard can be used to recommend which materialized views to create because Discoverer retains its own workload statistics with respect to query usage.
The data mart may be analyzed natively with Discoverer or it can be optionally exported to the Express multidimensional database server through the Relational Access Manager (RAM). Analysis of the data in Express supports reach-through to detail data stored in the Oracle8i server through RAM, and provides relational access to tools like Oracle Sales Analyzer (OSA) and Oracle Express Objects (OEO).
The following chapters describe how to create materialized views and dimensions. Although materialized views can be created at any time, so that they can used by the other features in summary management such as warehouse refresh and query rewrite, some parameters must be set. These can be defined either within the initialization parameter file or using the ALTER SYSTEM or ALTER SESSION commands. The required parameters are identified by subject area.
JOB_QUEUE_PROCESSES
The number of background processes. This parameter determines how many materialized views can be refreshed concurrently.
JOB_QUEUE_INTERVAL
In seconds, the interval between which the job queue scheduler checks to see if a new job has been submitted to the job queue.
UTL_FILE_DIR
OPTIMIZER_MODE="ALL_ROWS", "FIRST_ROWS", or "CHOOSE"
With tables analyzed, ensures that the cost-based optimizer is used, which is a requirement to get Query Rewrite.
QUERY_REWRITE_ENABLED = True
QUERY_REWRITE_INTEGRITY = enforced or trusted or stale_tolerated
Optional. Advises how fresh a materialized view must be to be eligible for query rewrite. See Oracle8i Reference for further information about the values for QUERY_REWRITE_INTEGRITY.
COMPATIBLE
Recommended Parameters:
ORACLE_TRACE_COLLECTION_NAME = oraclsm
ORACLE_TRACE_COLLECTION_PATH = ?/otrace/admin/cdf
ORACLE_TRACE_COLLECTION_SIZE = 0
Required Parameters and their Settings:
ORACLE_TRACE_ENABLE=true
ORACLE_TRACE_FACILITY_NAME = oraclesm
ORACLE_TRACE_FACILITY_PATH = ?/otrace/admin/cdf
PARALLEL_MAX_SERVERS
SORT_AREA_SIZE
OPTIMIZER_MODE
OPTIMIZER_PERCENT_PARALLEL
Once these parameters have been set to the appropriate values, you will be ready to move on to using the summary management features.