Oracle8i Parallel Server Concepts and Administration Release 8.1.5 A67778-01
This chapter explains how Cache Fusion resolves reader/writer conflicts in Oracle Parallel Server. It describes Cache Fusion and its benefits in general terms that apply to most types of systems and applications. The chapter also describes OPS- and Cache Fusion-related statistics and provides many procedures that explain how to use these statistics to monitor and tune performance.
The topics in this chapter are:
Inter-instance contention for data blocks and the resulting cache coherency issues are the main performance problems of OPS. In most cases, proper partitioning resolves most contention problems.
In reality, however, most packaged applications are not effectively partitioned, or are partitioned only to a limited extent. Such applications experience three types of inter-instance contention:
Reader/writer contention occurs when one instance needs to read a data block in consistent mode and the correct version of the block does not exist in the instance's cache. OPS easily resolves this type of contention because multiple instances can share the same blocks for read access without cache coherency conflicts. The other types of contention, however, are more complex from a cache coherency point-of-view.
In the case of inserts into tables, for example, writer/writer conflicts are partially addressed by free list groups. In other cases, however, the only alternative is to address writer/writer cache coherency issues by isolating hot blocks using locking, by implementing deferred pinging, or by application partitioning. Reader/writer conflicts, on the other hand, are more prevalent and easier to resolve.
Reader/writer contention is the most common type of contention in OLTP and hybrid applications. The ability to combine DSS and OLTP processing in a typical application depends on OPS' efficiency in resolving such conflicts.
For the "reader" part of reader/writer conflicts there are two subcategories: the contention caused by current readers and contention caused by consistent read readers. Of these two, consistent read readers are typically more prevalent and Cache Fusion directly addresses these.
If a data block requested by one instance is in the memory cache of a remote instance, Cache Fusion resolves the conflict using remote memory access, not disk access. The requesting instance sends a request for a consistent-read copy of the block to the holding instance. The BSP (Block Server Process) on the holding instance transmits the consistent-read image of the requested block directly from the holding instance's buffer cache to the requesting instance's buffer cache across a high speed interconnect.
As Figure 21-1 illustrates, Cache Fusion enables the buffer cache of one node to ship data blocks directly to the buffer cache of another node by way of low latency, high bandwidth interconnects. This reduces the need for expensive disk I/O in parallel cache management.
Cache Fusion also leverages new interconnect technologies for low latency, user-space based, interprocessor communication. This drastically lowers CPU usage by reducing operating system context switches for inter-node messages.
Oracle manages writer/writer contention using conventional disk-based PCM (Parallel Cache Management). A later version of Oracle will use Cache Fusion to provide faster writer/writer contention resolution.
Cache Fusion only solves part of the block conflict resolution issue by providing improved scalability for applications that experience high levels of reader/writer contention. For applications with high writer/writer contention levels, you also need to accurately partition your application's tables to reduce the potential for writer/writer conflicts.
See Also:
For more information on partitioning, please refer to Oracle8i Tuning and Oracle8i Concepts.
Cache Fusion improves application transaction throughput and scalability by providing:
Applications demonstrating high reader/writer conflict rates under disk-based PCM benefit the most from Cache Fusion.
Packaged applications also scale more effectively as a result of Cache Fusion. Applications in which OLTP and reporting functions execute on separate nodes may also benefit from Cache Fusion. Reporting functions that access data from tables modified by OLTP functions receive their versions of data blocks by way of high-speed interconnects. This reduces the 'pinging' of data blocks to disk.
Performance gains are derived primarily from reduced X-to-S lock conversions and the corresponding reduction in disk I/O.
The direct mapping of data buffers across instances during inter-processor communication also avoids having to copy memory from one address space to another. This shortened execution path reduces CPU requirements and increases the communications bandwidth as explained under the following headings.
Cache Fusion dramatically reduces operating system context switches. This results in reduced CPU utilization and frees CPU cycles for applications processing.
Cache Fusion reduces CPU utilization by taking advantage of user-mode IPCs, also known as "memory-mapped IPCs", for both Unix- and NT-based platforms. If the appropriate hardware support is available, operating system context switches are minimized beyond the basic reductions achieved with Cache Fusion alone. This also eliminates costly data copying and system calls.
User-mode IPCs reduce CPU utilization because user processes can communicate without using the operating system kernel. In other words, there is no need to switch from user execution mode to kernel execution mode.
Cache Fusion reduces expensive lock operations and disk I/O for data and undo segment blocks by transmitting consistent-read blocks directly from one instance's buffer cache to another. This can reduce the latency required to resolve reader/writer conflicts by as much as 90%.
Disk-based PCM may require as much as 80ms (milliseconds) to resolve reader/writer conflicts. This involves disk I/O for the requested block as well as I/O to write rollback segment blocks to disk.
Cache Fusion resolves reader/writer conflicts with approximately 1/10th the processing effort required by disk-based PCM, using little or no disk I/O. Cache Fusion incurs overhead only for processing the consistent-read request, constructing a consistent-read copy of the requested block in memory, and transferring it to the requesting instance. On some platforms this can take less than 1ms.
Because Cache Fusion exploits high speed IPCs, OPS benefits from the performance gains of the latest technologies for low latency communication across cluster interconnects. Further performance gains can be expected with even more efficient protocols, such as VIA and user-mode IPCs.
The primary components affecting Cache Fusion performance are the interconnect and the protocols that process inter-node communication. The interconnect bandwidth, its latency, and the efficiency of the IPC protocol determine the speed with which Cache Fusion processes consistent-read block requests.
Once your interconnect is operative, you cannot significantly influence its performance. However, you can influence a protocol's efficiency by adjusting the IPC buffer sizes.
Interconnects supporting OPS and Cache Fusion use one of these protocols:
OPS can use any interconnect product that supports these protocols. The interconnect product must also be certified for OPS hardware cluster platforms.
Cache Fusion performance levels may vary in terms of latency and throughput from application to application. Performance is further influenced by the type and mixture of transactions your system processes.
The performance gains from Cache Fusion also vary with each workload. The hardware, the interconnect protocol specifications, and the operating system resource usage also affect performance.
As mentioned earlier, if your application did not demonstrate a significant amount of consistent-read contention prior to Cache Fusion, your performance with Cache Fusion will likely remain unchanged. However, if your application experienced numerous lock conversions and heavy disk I/O as a result of consistent-read conflicts, your performance with Cache Fusion should improve dramatically.
As an example, query the statistics in V$SYSSTAT; you should observe that with Cache Fusion your system performs fewer X-to-S lock converts. The fewer X-to-S lock conversions your application generates, the less disk I/O your system requires. The following section, "Monitoring Cache Fusion and Inter-instance Performance", describes how to evaluate Cache Fusion performance in more detail.
Block request latencies in Cache Fusion can vary according to the protocol in use. With TCP/UDP, the latency can range from 5 to 50ms. The performance statistics achieved with TCP/UDP protocols, however, vary from system to system.
With VIA, which is only available on NT, the latency can be less than 1ms. Performance statistics for the VIA protocol do not vary significantly because VIA uses fewer context switches. Fewer context switches mean reduced overhead.
This section describes how to obtain and analyze OPS and Cache Fusion statistics to monitor inter-instance performance. Topics in this section include:
The main goal of monitoring Cache Fusion and OPS performance is to examine the latency and fine-tune your system's processing by observing trends over time. Do this by analyzing the performance statistics from several views as described in the following sections. Use these monitoring procedures on an on-going basis to observe processing trends and to maintain processing at optimal levels.
The procedures in the following sections describe performance issues in terms of latency. The procedures also describe how to analyze other performance-related issues such as Integrated Distributed Lock Manager (IDLM) resource use and the status of general system events.
Latency, the amount of time required to complete a request for a consistent-read block, is the most important aspect of OPS and Cache Fusion performance. It is influenced by the types of requests and responses involved in consistent-read operations. Each type of request may have a different outcome, as described in the following:
Oracle collects Cache Fusion-related performance statistics from the buffer cache and IDLM layers. Oracle also collects general OPS statistics for lock requests and lock waits. You can use several views to examine these statistics.
Maintaining an adequate history of system performance helps you more easily identify trends as these statistics change. This is especially important for identifying increasing latencies and adverse workload changes.
Procedures in this section use statistics that are grouped according to the following topics:
In many cases, resolving performance issues requires that you first identify a problem using the specific procedures in each statistics group. You then use the V$SYSTEM_EVENT view to pinpoint the cause as described under the heading, "Events in V$SYSTEM_EVENTS Specifically Related to OPS".
You must set the parameter TIMED_STATISTICS to TRUE for Oracle to collect statistics for most views discussed in the procedures in this section. The timed statistics from views discussed in this chapter are displayed in units of 1/100ths of a second.
The SQL script CATPARR.SQL creates parallel server data dictionary views. To run this script, you must have SYSDBA privileges.
CATALOG.SQL creates the standard V$ dynamic views, as described in the Oracle8i Reference, as well as:
You can rerun CATPARR.SQL if you want the EXT_TO_OBJ table to contain the latest information after you add extents. If you drop objects without rerunning CATPARR.SQL, EXT_TO_OBJ may display misleading information.
Tuning and performance information for the Oracle database is stored in a set of dynamic performance tables known as the "V$ fixed views". Each active instance has its own set of fixed views. In OPS, you can query a global dynamic performance (GV$) view to retrieve the V$ view information from all qualified instances. A global fixed view is available for all of the existing dynamic performance views except for V$ROLLNAME, V$CACHE_LOCK, V$LOCK_ACTIVITY, and V$LOCKS_WITH_COLLISIONS.
The global view contains all the columns from the local view, with an additional column, INST_ID (datatype INTEGER). This column displays the instance number from which the associated V$ information was obtained. You can use the INST_ID column as a filter to retrieve V$ information from a subset of available instances. For example, the query:
SELECT * FROM GV$LOCK WHERE INST_ID = 2 OR INST_ID = 5;

retrieves information from the V$ views on instances 2 and 5.
Each global view contains a GLOBAL hint that creates a parallel query to retrieve the contents of the local view on each instance.
If you have reached the limit of PARALLEL_MAX_SERVERS on an instance and you attempt to query a GV$ view, one additional parallel server process will be spawned for this purpose. The extra process is not available for parallel operations other than GV$ queries.
If you have reached the limit of PARALLEL_MAX_SERVERS on an instance and issue multiple GV$ queries, all but the first query will fail. In most parallel queries, if a server process could not be allocated this would result in either an error or a sequential execution of the query by the query coordinator.
For global views, it may be acceptable to continue running the query in parallel and return the data from the instances that could allocate servers for the query. If it is acceptable to retrieve results only from instances where server allocation succeeded, set the value to TRUE.
See Also:
"Specifying Instance Groups" and Oracle8i Reference for restrictions on GV$ views and complete descriptions of all related parameters and V$ dynamic performance views. |
Oracle collects global cache statistics at the buffer cache layer within an instance. These statistics include counts and timings of requests for global resources.
Requests for global locks on data blocks originate in the buffer cache of the requesting instance. Before a request enters the IDLM, Oracle allocates data structures in the SGA to track the state of the request. These structures are called "lock elements".
To monitor global cache statistics, query the V$SYSSTAT view and analyze its output as described in the following procedures.
Complete the following steps to analyze global cache statistics.
Oracle responds with output similar to:
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
global cache gets                                                     12480
global cache get time                                                   996
global cache converts                                                    21
global cache convert time                                                48
global cache cr blocks received                                           1
global cache cr block receive time                                        1
global cache cr read from disk                                            0
global cache freelist waits                                               0
global cache defers                                                       0
global cache convert timeouts                                             0
global cache cr timeouts                                                  0
global cache fairness down converts                                       0
Use your V$SYSSTAT output to perform the calculations and analyses described in the remaining procedures for this statistics group.
Procedures 2 and 3 use the following Cache Fusion statistics from V$SYSSTAT:
The result, which typically varies from 5 to 40ms depending on your system configuration and volume, is the average latency of a consistent-read request round trip from requesting instance, to holding instance, and back to the requesting instance. If your CPU has limited idle time and your system typically processes long-running queries, the latency may be higher. However, it is possible to have an average latency of less than 1ms if your interconnect protocol is user-mode IPC.
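The calculation behind this result can be sketched as follows. It is an illustrative sketch, not an Oracle tool: it assumes the hundredths-of-a-second units for timed statistics noted earlier in this chapter, and the statistic names and sample values follow the V$SYSSTAT output shown above.

```python
# Sketch: average consistent-read request latency from V$SYSSTAT counters.
# Timed statistics are in hundredths of a second, so the ratio is
# multiplied by 10 to express the result in milliseconds.

def avg_cr_latency_ms(receive_time_cs, blocks_received):
    """'global cache cr block receive time' divided by
    'global cache cr blocks received', converted to milliseconds."""
    if blocks_received == 0:
        return 0.0
    return receive_time_cs / blocks_received * 10

# Sample values from the V$SYSSTAT output shown earlier in this section.
print(avg_cr_latency_ms(1, 1))  # prints 10.0
```

With larger counters, for example a receive time of 250 (in hundredths of a second) over 100 blocks, the same formula yields a 25ms average, within the 5 to 40ms range described above.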
Consistent-read server request latency can also be influenced by a high value for the DB_FILE_MULTIBLOCK_READ_COUNT parameter. This is because a requesting process may issue more than one request for a block depending on the setting of this parameter. Correspondingly, the requesting process may wait longer.
A high proportion of read permissions from disk indicates that blocks are rapidly aging out of the buffer cache of the holding instance. To resolve this, increase the size of your buffer cache.
Procedures 4 and 5 require that you take snapshots of your statistics, for example, by using UTLBSTAT and UTLESTAT.
Procedure 4 uses the following global cache statistics from V$SYSSTAT:
If the average convert or get time is high, there is excessive contention. Another cause may be that latencies for lock operations are high due to overall system workload or system problems. A reasonable time for a cache get is 20 to 30ms, while converts should take 10 to 20ms on average.
Oracle increments global cache gets when a new lock on a resource is opened. A convert is counted when there is already an open lock and Oracle converts it to another mode.
The elapsed time for a get thus includes the allocation and initialization of new locks. If the average cache get or average convert times are excessive, your system may be experiencing timeouts.
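As an illustration, the averages can be derived from the counters in the sample V$SYSSTAT output shown earlier in this section. This is a hedged sketch that again assumes the hundredths-of-a-second units noted above.

```python
# Sketch: average global cache get and convert times in milliseconds,
# computed from the sample V$SYSSTAT counters in this section.

def avg_ms(total_time_cs, operations):
    """Convert a cumulative time in hundredths of a second to an
    average per-operation time in milliseconds."""
    return total_time_cs / operations * 10 if operations else 0.0

avg_get = avg_ms(996, 12480)  # global cache get time / global cache gets
avg_convert = avg_ms(48, 21)  # global cache convert time / converts

print(f"avg get {avg_get:.2f} ms, avg convert {avg_convert:.2f} ms")
```

Values well above the reasonable ranges quoted above would point to contention or system-level latency problems.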
If the global cache convert times or global cache get times are high, refer to statistics in the V$SYSTEM_EVENT view to identify events with a high value for TIME_WAITED statistics.
Global lock statistics provide counts and timings for both PCM and non-PCM lock activity. Oracle collects global lock statistics from the IDLM API layer. All Oracle clients to the IDLM, of which the buffer cache is only one, make their requests to the IDLM through this layer. Thus, global lock statistics include lock requests originating from all layers of the kernel, while global cache statistics relate only to buffer cache OPS activity.
Use procedures in this section to monitor data from V$SYSSTAT to derive averages, latencies, and counts. This establishes a rough indicator of the OPS workload generated by an instance.
Use the following procedures to view and analyze statistics from V$SYSSTAT for global lock processing.
Oracle responds with output similar to:
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
global lock sync gets                                                   703
global lock async gets                                                12748
global lock get time                                                   1071
global lock sync converts                                               303
global lock async converts                                               41
global lock convert time                                                 93
global lock releases                                                    573
Use your V$SYSSTAT output to perform the calculations and analyses described in the remaining procedures in this statistics group.
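For instance, a rough average lock get time can be computed from the sample output above by combining synchronous and asynchronous gets. This is an illustrative sketch assuming the hundredths-of-a-second units noted earlier.

```python
# Sketch: average global lock get time in milliseconds from the sample
# V$SYSSTAT output above (sync and async gets combined).

def avg_lock_get_ms(get_time_cs, sync_gets, async_gets):
    """'global lock get time' / ('global lock sync gets' +
    'global lock async gets'), converted to milliseconds."""
    total = sync_gets + async_gets
    return get_time_cs / total * 10 if total else 0.0

print(avg_lock_get_ms(1071, 703, 12748))
```

An analogous calculation with 'global lock convert time' over the sync and async convert counts gives the average convert latency.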
If the result is more than 20 or 30ms, query the TIME_WAITED column in V$SYSTEM_EVENT in descending order to identify which lock events are waited for most frequently using this query:

SELECT EVENT, TIME_WAITED, AVERAGE_WAIT FROM V$SYSTEM_EVENT ORDER BY TIME_WAITED DESC;
Oracle increments global lock gets when a new lock on a resource is opened. A convert is counted when there is already an open lock and Oracle converts it to another mode.
The elapsed time for a get thus includes the allocation and initialization of new locks. If the average lock get or average lock convert times are excessive, your system may be experiencing timeouts.
If the global lock convert times or global lock get times are high, refer to statistics in the V$SYSTEM_EVENT view to identify events with a high value for TIME_WAITED statistics.
If the result is more than 20ms, query the TIME_WAITED column in V$SYSTEM_EVENT in descending order to identify the event causing the delay.
Oracle collects IDLM resource, lock, and message statistics at the IDLM level. Use these statistics to monitor IDLM latency and workloads. These statistics appear in the V$DLM_CONVERT_LOCAL and V$DLM_CONVERT_REMOTE views.
These views record average convert times, count information, and timed statistics for each type of lock request. V$DLM_CONVERT_LOCAL shows statistics for local lock operations. V$DLM_CONVERT_REMOTE shows values for remote conversions. The average convert times in these views are in 100ths of a second.
The IDLM workload is an important aspect of OPS and Cache Fusion performance because each consistent-read request results in a lock request. High IDLM workloads as a result of heavy request rates can adversely affect performance.
The IDLM performs local lock operations entirely within the local node, or in other words, without sending messages. Remote lock operations require sending messages to and waiting for responses from other nodes. Most down-converts, however, are local operations for the IDLM.
The following procedures for analyzing IDLM resource, locks, and message statistics appear in two groups. The first group of procedures explains how to monitor IDLM resources and locks. The second group explains how to monitor message statistics.
Use the following procedures to obtain and analyze statistics from V$DLM_CONVERT_LOCAL and V$DLM_CONVERT_REMOTE for DLM resource processing.
You must enable event 29700 to populate the V$DLM_CONVERT_LOCAL and V$DLM_CONVERT_REMOTE views. Do this by entering this syntax:
EVENT="29700 TRACE NAME CONTEXT FOREVER"
SELECT CONVERT_TYPE, AVERAGE_CONVERT_TIME, CONVERT_COUNT FROM V$DLM_CONVERT_LOCAL;
Oracle responds with output similar to:
CONVERT_TYPE                           AVERAGE_CONVERT_TIME CONVERT_COUNT
-------------------------------------- -------------------- -------------
NULL -> SS                                                0             0
NULL -> SX                                                0             0
NULL -> S                                                 1           146
NULL -> SSX                                               0             0
NULL -> X                                                 1            92
SS -> SX                                                  0             0
SS -> S                                                   0             0
SS -> SSX                                                 0             0
SS -> X                                                   0             0
SX -> S                                                   0             0
SX -> SSX                                                 0             0
SX -> X                                                   0             0
S -> SX                                                   0             0
S -> SSX                                                  0             0
S -> X                                                    3            46
SSX -> X                                                  0             0

16 rows selected.
SELECT * FROM V$DLM_CONVERT_REMOTE;
Oracle responds with output identical in format to the output for V$DLM_CONVERT_LOCAL.
Use your output from V$DLM_CONVERT_LOCAL and V$DLM_CONVERT_REMOTE to perform the calculation described in the following procedure.
SELECT r.CONVERT_TYPE, r.AVERAGE_CONVERT_TIME, l.AVERAGE_CONVERT_TIME, r.CONVERT_COUNT, l.CONVERT_COUNT FROM V$DLM_CONVERT_LOCAL l, V$DLM_CONVERT_REMOTE r WHERE l.CONVERT_TYPE = r.CONVERT_TYPE;
The IDLM sends messages either directly or by using flow control. For both methods, the IDLM attaches markers known as "tickets" to each message. The allotment of tickets for each IDLM is limited. However, the IDLM can re-use tickets indefinitely.
IDLMs send messages directly until no more tickets are available. When an IDLM runs out of tickets, messages must wait in a flow control queue until outstanding messages finish processing and more tickets are available. Flow-controlled messaging is managed by the LMD process.
The rationing of tickets prevents one node from sending an excessive amount of messages to another node during periods of heavy inter-instance communication. This also prevents one node with heavy remote consistent-read block requirements from assuming control of messaging resources throughout a cluster at the expense of other, less-busy nodes.
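The ticketing scheme described above can be modeled with a small sketch. This is purely illustrative of the idea, messages are sent directly while tickets remain and queue for flow control otherwise, and is not Oracle's actual LMD implementation.

```python
from collections import deque

# Illustrative model of ticket-based flow control: a limited, reusable
# allotment of tickets gates direct sends; overflow messages wait in a
# flow-control queue until an outstanding message returns its ticket.

class TicketedSender:
    def __init__(self, tickets):
        self.tickets = tickets            # limited, reusable allotment
        self.flow_control_queue = deque()
        self.sent_directly = 0
        self.flow_controlled = 0

    def send(self, msg):
        if self.tickets > 0:
            self.tickets -= 1             # attach a ticket and send now
            self.sent_directly += 1
        else:
            self.flow_control_queue.append(msg)
            self.flow_controlled += 1

    def complete_one(self):
        """An outstanding message finished; its ticket becomes reusable,
        so the next queued message (if any) can be sent."""
        self.tickets += 1
        if self.flow_control_queue:
            self.send(self.flow_control_queue.popleft())
```

With an allotment of two tickets, a third send is flow controlled and is dispatched only after one of the first two messages completes.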
The V$DLM_MISC view contains the following statistics about message activity:
Use the following procedures to obtain and analyze message statistics in V$DLM_MISC.
Oracle responds with output similar to:
STATISTIC# NAME                                         VALUE
---------- ----------------------------------------- --------
         0 dlm messages sent directly                   29520
         1 dlm messages flow controlled                  1851
         2 dlm messages received                        29668
         3 dlm total incoming msg queue length            297

4 rows selected.
Use your output from V$DLM_MISC to perform the following procedure.
Oracle increments the value for 'total incoming message queue length' whenever a new request enters the LMD process' message queue. When messages leave the LMD queue to begin processing, Oracle increments the value for 'messages received'.
The size of the queue may increase if a large number of requests arrives at the LMD simultaneously. This can occur when the volume of locking activity is high or when the LMD processes a large quantity of consistent-read requests. Typically, the average receive queue length is less than 10.
In addition to the global cache and global lock statistics that were previously discussed, you can also use statistics in V$SYSSTAT to measure the I/O workload related to global conflict resolution. There are three important sets of statistics in V$SYSSTAT for this purpose:
DBWR forced writes occur when Oracle resolves inter-instance data block contention by writing the requested block to disk before the requesting node can use it.
If a consistent-read request requires information from another instance's cache to roll back a block and make it read consistent, Oracle must also write rollback segment headers and rollback segment blocks to disk. One instance must write the undo blocks and undo headers to disk while another instance reads them.
Cache Fusion minimizes the disk I/O for consistent reads. This can lead to a substantial reduction in physical writes performed by each instance. Before Cache Fusion, a consistent read involving data from a remote instance could require as many as three writes and three reads (involving the requested block, the rollback segment header, and an undo segment block), as well as multiple lock converts, for one requested block.
Use the following procedures to obtain and analyze message statistics in V$SYSSTAT.
SELECT NAME, VALUE FROM V$SYSSTAT WHERE NAME IN ('DBWR forced writes', 'remote instance undo block writes', 'remote instance undo header writes', 'physical writes');
Oracle responds with output similar to:
NAME                                                          VALUE
--------------------------------------------------------- ----------
physical writes                                                 41802
DBWR forced writes                                               5403
remote instance undo block writes                                   0
remote instance undo header writes                                  2

4 rows selected.
Use your V$SYSSTAT output to perform the following calculations.
You should see a noticeable decrease in this ratio between this calculation and pre-Cache Fusion statistics.
The ratio shows how much disk I/O is related to writes to rollback segments. With Cache Fusion, this ratio should be very low.
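Both ratios can be sketched from the sample V$SYSSTAT values shown above. This is an illustrative calculation; the figures come from the example output, not a real system.

```python
# Sketch: I/O ratios for global conflict resolution from the sample
# V$SYSSTAT output above. Both ratios should drop with Cache Fusion.

def ratio(numerator, denominator):
    return numerator / denominator if denominator else 0.0

physical_writes = 41802
forced_write_ratio = ratio(5403, physical_writes)  # DBWR forced writes
undo_write_ratio = ratio(0 + 2, physical_writes)   # undo block + header writes

print(f"forced: {forced_write_ratio:.3f}, undo: {undo_write_ratio:.5f}")
```

Here roughly 13% of physical writes are forced writes, while writes to rollback segments are negligible, consistent with the expectation that the undo ratio should be very low under Cache Fusion.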
This section describes how to analyze output from three views to quantify lock conversions by type. The tasks and the views discussed in this section are:
V$LOCK_ACTIVITY summarizes how many lock up- and down-converts have occurred during an instance's lifetime. X-to-N down-converts denote the number of times a lock was down-converted because another instance wanted to modify a resource.
The other major type of down-convert is X-to-S. This type of down-convert occurs when another instance reads a resource that the local instance last modified. Both types of lock conversions involve I/O. However, Cache Fusion should reduce X-to-S down-converts because they are not needed for buffer locks.
V$CLASS_PING summarizes lock conversion activity by showing whether disk I/O is occurring on the following classes of blocks:
All X_2_NULL_FORCED_WRITE and X_2_S_FORCED_WRITE conversions involve write I/O. In other words, values in the columns for each block class provide an indicator of the cause of the disk I/O.
V$PING helps identify "hot" blocks and "hot" objects. The sum of the columns FORCED_READS and FORCED_WRITES indicates the actual pinging activity on a particular block or object.
All three views provide different levels of detail. If you suspect that pinging or OPS itself is the cause of a performance problem, monitor V$LOCK_ACTIVITY to generate an overall OPS workload profile. Use information from V$LOCK_ACTIVITY to record the rate at which lock conversions occur.
For more details, use V$CLASS_PING to identify the type of block on which lock conversions and pinging are occurring. Once you have identified the class, use V$PING to obtain details about a particular table or index and the file and block numbers on which there is significant lock conversion activity.
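The final drill-down step above, ranking blocks by FORCED_READS plus FORCED_WRITES, can be sketched as follows. The rows below are invented sample data in the shape of V$PING output, not real query results.

```python
# Illustrative sketch: ranking "hot" blocks from V$PING-style rows by
# FORCED_READS + FORCED_WRITES, as described above. Sample data only.

rows = [
    {"name": "EMP",    "file#": 4, "block#": 120, "forced_reads": 12, "forced_writes": 30},
    {"name": "DEPT",   "file#": 4, "block#": 77,  "forced_reads": 2,  "forced_writes": 1},
    {"name": "EMP_PK", "file#": 5, "block#": 210, "forced_reads": 25, "forced_writes": 40},
]

def ping_activity(row):
    """Actual pinging activity on a block: forced reads plus writes."""
    return row["forced_reads"] + row["forced_writes"]

hot = sorted(rows, key=ping_activity, reverse=True)
for r in hot:
    print(r["name"], r["file#"], r["block#"], ping_activity(r))
```

The block with the highest combined count is the strongest candidate for partitioning or locking adjustments.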
Latches are low level locking mechanisms that protect SGA data structures. Excessive contention for latches degrades performance.
Use V$DLM_LATCH and V$LATCH_MISSES to monitor latch contention within the IDLM. These views show information about a particular latch, its statistics, and the location in the code from where the latch is acquired.
Use the following procedures to analyze latch, OPS, and IDLM-related statistics.
Oracle responds with output similar to:
SELECT PARENT_NAME, "WHERE", SLEEP_COUNT FROM V$LATCH_MISSES ORDER BY SLEEP_COUNT DESC;
Oracle responds with output similar to:
Use your V$LATCH and V$LATCH_MISSES output to perform the following procedures.
High numbers for misses usually indicate contention for the same resources and locks. Acceptable hit ratios range from 90 to 95%.
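One way to express this as a ratio is sketched below. The column names follow the GETS and MISSES columns of V$LATCH, and the sample figures are invented for illustration.

```python
# Sketch: latch hit ratio from V$LATCH-style GETS and MISSES counters.
# Ratios much below 90-95% suggest latch contention.

def latch_hit_ratio(gets, misses):
    """Fraction of latch gets that did not miss."""
    return (gets - misses) / gets if gets else 1.0

print(f"{latch_hit_ratio(10000, 250):.1%}")  # prints 97.5%
```

A ratio in this range would be acceptable; a latch showing, say, 80% warrants a look at V$LATCH_MISSES to find where in the code the sleeps occur.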
The following section describes how to use V$SYSTEM_EVENTS in more detail.
Data about Cache Fusion and OPS events appears in the V$SYSTEM_EVENT view. To identify events for which processes have waited the longest, query V$SYSTEM_EVENT ordered by the TIME_WAITED column in descending order. The TIME_WAITED column shows the total wait time for each system event listed. For an example of how to query V$SYSTEM_EVENT, refer to Step 2.
By generating an ordered list of event waits, you can easily locate performance bottlenecks. Each COUNT represents a voluntary context switch. The TIME_WAITED value is the cumulative time that processes waited for particular system events. The values in the TOTAL_TIMEOUTS and AVERAGE_WAIT columns provide additional information about system efficiency.
The following events appearing in V$SYSTEM_EVENT output represent waits for OPS events:
You can monitor other events in addition to those listed under the previous heading because performance problems may be related to OPS. These events are:
If the time waited for global cache events is high relative to other waits, look for increased latencies, contention, or excessive system workloads using V$SYSSTAT statistics and operating system performance monitors. A high number of global cache busy or buffer busy waits indicates increased contention in the buffer cache.
In OLTP systems with data block address locking and a high degree of contention, it is not unusual for global cache wait events to represent a high proportion of the total time waited.
If significant wait time is consumed by waits for non-buffer cache resources, as indicated by the statistics in the 'row cache lock', 'enqueue', and 'library cache pin' rows, monitor the V$ROWCACHE and V$LIBRARYCACHE views for OPS-related issues. Specifically, observe the values in the IDLM columns of each of these views.
Common OPS problems arise from poorly managed space parameters or sequences that are not cached. In such cases, processes wait for row cache locks and enqueues and V$ROWCACHE will show a high number of conflicts for certain dictionary caches.