28
Data Warehousing with Materialized Views

This chapter contains:

Overview of Data Warehousing with Materialized Views

An enterprise data warehouse contains historical detailed data about the organization. Typically, data flows from one or more online transaction processing (OLTP) databases into the data warehouse on a monthly, weekly, or daily basis. The data is usually processed in a staging file before being added to the data warehouse. Data warehouses typically range in size from tens of gigabytes to a few terabytes, usually with the vast majority of the data stored in a few very large fact tables.

A data mart contains a subset of corporate data that is of value to a specific business unit, department, or set of users. Typically, a data mart is derived from an enterprise data warehouse.

One of the techniques employed in data warehouses to improve performance is the creation of summaries, or aggregates. They are a special kind of aggregate view which improves query execution times by precalculating expensive joins and aggregation operations prior to execution, and storing the results in a table in the database. For example, a table may be created which would contain the sum of sales by region and by product.

Today, organizations using summaries spend a significant amount of time manually creating summaries, identifying which ones to create, indexing the summaries, updating them, and advising their users on which ones to use. The introduction of summary management in the Oracle server changes the workload of the DBA dramatically and means the end-user no longer has to be aware of which summaries have been defined. The DBA creates one or more materialized views, which are the equivalent of a summary. The end-user queries the tables and views in the database and the query rewrite mechanism in the Oracle server automatically rewrites the SQL query to use the summary tables. This results in a significant improvement in response time for returning results from the query and eliminates the need for the end-user or database application to be aware of the summaries that exist within the data warehouse.

Although summaries are usually accessed indirectly via the query rewrite mechanism, an end-user or database application can construct queries which directly access the summaries. However, serious consideration should be given to whether users should be allowed to do this because, once the summaries are directly referenced in queries, the DBA will not be free to drop and create summaries without affecting applications.

The summaries or aggregates that are referred to in this book and in literature on data warehousing are created in Oracle using a schema object called a materialized view. Materialized views can be used to perform a number of roles, such as improving query performance or providing replicated data, as described below.

Materialized Views for Data Warehouses

In data warehouses, materialized views can be used to precompute and store aggregated data such as sum of sales. Materialized views in these environments are typically referred to as summaries since they store summarized data. They can also be used to precompute joins with or without aggregations. So a materialized view is used to eliminate overhead associated with expensive joins or aggregations for a large or important class of queries.

Materialized Views for Distributed Computing

In distributed environments, materialized views are used to replicate data at distributed sites and synchronize updates done at several sites with conflict resolution methods. The materialized views as replicas provide local access to data which otherwise would have to be accessed from remote sites.

Materialized Views for Mobile Computing

Materialized views are used to download a subset of data from central servers to mobile clients, with periodic refreshes from the central servers and propagation of updates by clients back to the central servers.

This chapter is focused on the use of materialized views in data warehouses. Refer to Oracle8i Replication and Oracle8i Distributed Database Systems for details on distributed and mobile computing.

Components of Summary Management

Summary management consists of:

Mechanisms to define summaries and dimensions
A refresh mechanism to ensure that all summaries contain the latest data
A query rewrite capability to transparently rewrite a query to use a summary
An advisor utility to recommend which summaries to create, retain, and drop

Many large decision support system (DSS) databases have schemas that do not closely resemble a conventional data warehouse schema, but still require joins and aggregates. The use of summary management features imposes no schema restrictions, and may enable some existing DSS database applications to achieve large gains in performance without requiring a redesign of the database or application. This functionality is thus available to all database users.

Figure 28-1 illustrates where summary management is used in the warehousing cycle. It is available once the data has been transformed and loaded into the data warehouse. Therefore, referring to Figure 28-1, after the data has been transformed, staged, and loaded into the detail data in the warehouse, then the summary management process can be invoked. This means that summaries can be created, queries can be rewritten, and the advisor can be used to plan summary usage and creation.

Figure 28-1 Overview of Summary Management

Understanding the summary management process during the earliest stages of data warehouse design can yield large dividends later on in the form of higher performance, lower summary administration costs, and reduced storage requirements.

The summary management process begins with the creation of dimensions and hierarchies that describe the business relationships and common access patterns in the database. An analysis of the dimensions, combined with an understanding of the typical work load, can then be used to create materialized views. Materialized views improve query execution performance by pre-calculating expensive join or aggregation operations prior to execution time. Query rewrite then automatically recognizes when an existing materialized view can and should be used to satisfy a request, and can transparently rewrite a request to use a materialized view, thus improving performance.

Other considerations when building a warehouse include:

Horizontally partitioning the fact tables by a time attribute.

This improves scalabililty, simplifies system administration, and makes it possible to define local indexes that can be efficiently rebuilt.
SQL*Loader can be directed to load a single partition of a table.

In this case, only the corresponding local index partitions are rebuilt.
Global indexes must be fully rebuilt after a direct load, which can be very costly when loading a relatively small number of rows into a large table.

For this reason, it is strongly recommended that all fact table indexes should be defined as local indexes. For example, this can be accomplished by having a bitmap index on each key column (bitmap indexes are always local), and a single multi-key index that includes all the key columns, with the partitioning attribute as the leading column of the index.

Terminology

The following clarifies some basic data warehousing terms:

Dimension tables describe the business entities of an enterprise, which usually represent hierarchical, categorical information such as time, departments, locations, and products. Dimension tables are sometimes called lookup or reference tables.

Dimension tables usually change slowly over time and are not modified on a periodic schedule. They are typically not large, but they affect the performance of long-running decision support queries that consist of joins of fact tables with dimension tables, followed by aggregation to specific levels of the dimension hierarchies.
Fact tables describe the business transactions of an enterprise. Fact tables are sometimes called detail tables.

The vast majority of data in a data warehouse is stored in a few very large fact tables. They are updated periodically with data from one or more operational online transaction processing (OLTP) databases.

Fact tables include measures such as sales, units, and inventory.
- A simple measure is a numeric or character column of one table such as FACT.SALES.
- A computed measure is an expression involving only simple measures of one table, for example, FACT.REVENUES - FACT.EXPENSES.
- A multi-table measure is a computed measure defined on multiple tables, for example, FACT_A.REVENUES - FACT_B.EXPENSES.
Fact tables also contain one or more keys that organize the business transactions by the relevant business entities such as time, product, and market. In most cases, the fact keys are non-null, form a unique compound key of the fact table, and join with one and only one row of a dimension table.
A materialized view is a pre-computed table comprising aggregated and/or joined data from fact and possibly dimension tables. Builders of data warehouses will know a materialized view as a summary or aggregation.

Materialized Views

The most common situations where you would find materialized views useful are in data warehousing applications and distributed systems. In warehousing applications, large amounts of data are processed and similar queries are frequently repeated. If these queries are pre-computed and the results stored in the data warehouse as a materialized view, using materialized views significantly improves performance by providing fast lookups into the set of results.

A materialized view definition can include any number of aggregates, as well as any number of joins. In several ways, a materialized view behaves like an index:

The purpose of a materialized view is to increase request execution performance.
The existence of a materialized view is transparent to SQL applications, so a DBA can create or drop materialized views at any time without affecting the validity of SQL applications.
A materialized view consumes storage space.
The contents of the materialized view must be maintained when the underlying detail tables are modified.

This chapter shows how materialized views are used in a data warehousing environment. However, the materialized view that is a key component of summary management can also be used in a distributed environment to manage replicated data. For further information, see Oracle8i Replication.

Schema Design Guidelines for Materialized Views

Before starting to define and use the various components of summary management, it is recommended that you review your schema design to, wherever possible, abide by these guidelines:

Guideline 1:

Your dimensions should either be denormalized (each dimension contained in one table) or the joins between tables in a normalized or partially normalized dimension should guarantee that each child-side row joins with one and only one parent-side row. The benefits of maintaining this condition are described in "Creating a Dimension".

If desired, this condition can be enforced by adding FOREIGN KEY and NOT NULL constraints on the child-side join key(s) and PRIMARY KEY constraints on the parent-side join key(s). If your materialized view contains only a single detail table, or if it performs no aggregation, a preferred alternative would be to use outer joins in place of inner joins. In this case, the Oracle optimizer can guarantee the integrity of the result without enforced referential integrity constraints.

Guideline 2:

If dimensions are denormalized or partially denormalized, hierarchical integrity must be maintained between the key columns of the dimension table. Each child key value must uniquely identify its parent key value, even if the dimension table is denormalized. Hierarchical integrity in a denormalized dimension can be verified by calling the VALIDATE_DIMENSION procedure of the DBMS_OLAP package.

Guideline 3:

Fact tables and dimension tables should similarly guarantee that each fact table row joins with one and only one dimension table row. This condition must be declared, and optionally enforced, by adding FOREIGN KEY and NOT NULL constraints on the fact key column(s) and PRIMARY KEY constraints on the dimension key column(s), or by using outer joins as described in Guideline 1. In a data warehouse, constraints are typically enabled with the NOVALIDATE and RELY options to avoid constraint enforcement performance overhead.

Guideline 4:

Incremental loads of your detail data should be done using the SQL*Loader direct-path option, or any bulk loader utility that uses Oracle's direct path interface (including INSERT AS SELECT with the APPEND or PARALLEL hints). If the materialized view contains more than one table and performs aggregation, or if materialized view logs are not defined, then performing any other type of DML to your data will necessitate a complete refresh.

Guideline 5:

Horizontally partition your tables by a monotonically increasing time column if possible (preferably of type DATE). For each table, create a bitmap index for each key column, and create one local index that includes all the key columns. Stripe each horizontal partition across several storage devices for maximum performance.

Guideline 6:

After each load and before refreshing your materialized view, use the VALIDATE_DIMENSION procedure of the DBMS_OLAP package to incrementally verify dimensional integrity.

Guideline 7:

Horizontally partition and index the materialized view like the fact tables. Include a local concatenated index on all the materialized view keys.

While guidelines 1, 2, and 3 are each important during schema design, guidelines 1 and 2 are even more important than guideline 3. If your schema design does not follow guidelines 1 and 2, it does not then matter whether it follows guideline 3. Guidelines 1, 2, and 3 affect both query rewrite performance and materialized view refresh performance. Guideline 4 affects materialized view refresh performance only. If your schema design does not follow guideline 4, then incremental refresh of your materialized views will be either impossible or much less efficient.

If you are concerned with the time required to enable constraints and whether any constraints may be violated, use the ENABLE NOVALIDATE clause to turn on constraint checking without validating any of the existing constraints. The risk with this approach is that incorrect query results could occur if any constraints are broken. Therefore, this is a decision for the designer to determine how clean the data is and whether the risk of potential wrong results is too great.

Materialized view management can perform many useful functions, including query rewrite and materialized view refresh, even if your data warehouse design does not follow these guidelines; however, you will realize significantly greater query execution performance and materialized view refresh performance benefits, and you will require fewer materialized views if your schema design complies with these guidelines.

Oracle Tools for Data Warehousing

The availability of powerful tools to help automate the analysis and administration of the materialized views is an important factor in controlling data warehouse costs. The following Oracle tools are available to help you create and manage a data warehouse:

Data Mart Designer or Oracle Designer can be used to design the warehouse schema. Data is then extracted, transformed, and transferred (ETT) from the operational systems into the data warehouse or data mart. Data Mart Builder can be used to specify the ETT process, populate the target data mart, and automatically schedule loads and index rebuilds.

Discoverer can be used to query the database and queries executed via Discoverer will be rewritten when appropriate. The Discoverer summary wizard can be used to recommend which materialized views to create because Discoverer retains its own workload statistics with respect to query usage.

The data mart may be analyzed natively with Discoverer or it can be optionally exported to the Express multidimensional database server through the Relational Access Manager (RAM). Analysis of the data in Express supports reach-through to detail data stored in the Oracle8i server through RAM, and provides relational access to tools like Oracle Sales Analyzer (OSA) and Oracle Express Objects (OEO).

Getting Started

The following chapters describe how to create materialized views and dimensions. Although materialized views can be created at any time, so that they can used by the other features in summary management such as warehouse refresh and query rewrite, some parameters must be set. These can be defined either within the initialization parameter file or using the ALTER SYSTEM or ALTER SESSION commands. The required parameters are identified by subject area.

Warehouse Refresh

JOB_QUEUE_PROCESSES

The number of background processes. This parameter determines how many materialized views can be refreshed concurrently.

JOB_QUEUE_INTERVAL

In seconds, the interval between which the job queue scheduler checks to see if a new job has been submitted to the job queue.

UTL_FILE_DIR

The directory where the refresh log is written. If unspecified, no refresh log will be created.
Query Rewrite

OPTIMIZER_MODE="ALL_ROWS", "FIRST_ROWS", or "CHOOSE"

With tables analyzed, ensures that the cost-based optimizer is used, which is a requirement to get Query Rewrite.

QUERY_REWRITE_ENABLED = True

Turns on query rewrite.

QUERY_REWRITE_INTEGRITY = enforced or trusted or stale_tolerated

Optional. Advises how fresh a materialized view must be to be eligible for query rewrite. See Oracle8i Reference for further information about the values for QUERY_REWRITE_INTEGRITY.

COMPATIBLE

Must be 8.1 or higher.
Advisor Workload

Recommended Parameters:

ORACLE_TRACE_COLLECTION_NAME = oraclsm

Trace collection file name.

ORACLE_TRACE_COLLECTION_PATH = ?/otrace/admin/cdf

Location where the collection file is stored.

ORACLE_TRACE_COLLECTION_SIZE = 0

Initial size of the collection file.

Required Parameters and their Settings:

ORACLE_TRACE_ENABLE=true

Turns on Oracle Trace collection.

ORACLE_TRACE_FACILITY_NAME = oraclesm

Trace facility to collect data.

ORACLE_TRACE_FACILITY_PATH = ?/otrace/admin/cdf

Location of the Trace facility definition files.

Recommended Parameters for Parallelism

PARALLEL_MAX_SERVERS

Should be set high enough to take care of parallelism.

SORT_AREA_SIZE

Should be less than HASH_AREA_SIZE.

OPTIMIZER_MODE

Should equal CHOOSE (cost based optimizer).

OPTIMIZER_PERCENT_PARALLEL

Should equal 100.

Once these parameters have been set to the appropriate values, you will be ready to move on to using the summary management features.

Guideline 1:	Your dimensions should either be denormalized (each dimension contained in one table) or the joins between tables in a normalized or partially normalized dimension should guarantee that each child-side row joins with one and only one parent-side row. The benefits of maintaining this condition are described in "Creating a Dimension".
	If desired, this condition can be enforced by adding `FOREIGN` `KEY` and `NOT` `NULL` constraints on the child-side join key(s) and `PRIMARY` `KEY` constraints on the parent-side join key(s). If your materialized view contains only a single detail table, or if it performs no aggregation, a preferred alternative would be to use outer joins in place of inner joins. In this case, the Oracle optimizer can guarantee the integrity of the result without enforced referential integrity constraints.
Guideline 2:	If dimensions are denormalized or partially denormalized, hierarchical integrity must be maintained between the key columns of the dimension table. Each child key value must uniquely identify its parent key value, even if the dimension table is denormalized. Hierarchical integrity in a denormalized dimension can be verified by calling the `VALIDATE_DIMENSION` procedure of the `DBMS_OLAP` package.
Guideline 3:	Fact tables and dimension tables should similarly guarantee that each fact table row joins with one and only one dimension table row. This condition must be declared, and optionally enforced, by adding `FOREIGN` `KEY` and `NOT` `NULL` constraints on the fact key column(s) and `PRIMARY` `KEY` constraints on the dimension key column(s), or by using outer joins as described in Guideline 1. In a data warehouse, constraints are typically enabled with the `NOVALIDATE` and `RELY` options to avoid constraint enforcement performance overhead.
Guideline 4:	Incremental loads of your detail data should be done using the SQL*Loader direct-path option, or any bulk loader utility that uses Oracle's direct path interface (including `INSERT` `AS` `SELECT` with the `APPEND` or `PARALLEL` hints). If the materialized view contains more than one table and performs aggregation, or if materialized view logs are not defined, then performing any other type of DML to your data will necessitate a complete refresh.
Guideline 5:	Horizontally partition your tables by a monotonically increasing time column if possible (preferably of type `DATE`). For each table, create a bitmap index for each key column, and create one local index that includes all the key columns. Stripe each horizontal partition across several storage devices for maximum performance.
Guideline 6:	After each load and before refreshing your materialized view, use the `VALIDATE_DIMENSION` procedure of the `DBMS_OLAP` package to incrementally verify dimensional integrity.
Guideline 7:	Horizontally partition and index the materialized view like the fact tables. Include a local concatenated index on all the materialized view keys.

28 Data Warehousing with Materialized Views