Oakies Blog Aggregator

randolf.geist

Enabling Edition Based Redefinition On A Schema With Object-Relational Tables

This is just a heads-up for those thinking about using Edition Based Redefinition (EBR) and enabling it on an existing schema with objects. Although EBR isn't exactly a new feature, its current adoption level is probably not that high (which will probably change in the future, as Oracle E-Business Suite now uses EBR as part of its default upgrade procedure, as far as I understand).

I was recently contacted by someone who enabled EBR on an existing schema using ALTER USER ... ENABLE EDITIONS and had to use the FORCE option, since there were (according to the official ALTER USER documentation) "objects that are not editionable and that depend on editionable type objects in the schema. ... In this case, all the objects that are not editionable and that depend on the editionable type objects in the schema being editions-enabled become invalid".

Although one could say it is clearly mentioned in the documentation, the consequences are probably not that obvious to everyone if those non-editionable objects are tables having columns based on user-defined types. So I state it here to make it hopefully clear enough:

If you use the FORCE option of ALTER USER ... ENABLE EDITIONS to enable editions on an existing schema already containing objects, and among those objects are tables having columns based on user-defined types, then those tables will effectively become invalid and stay invalid. There is no officially documented way to reverse this step or to compile the tables to make them valid again (there is no ALTER TABLE ... COMPILE or similar). The tables can no longer be accessed, and all data they contain is hence officially gone, too.

This means the affected tables need to be restored from a (hopefully existing and usable) backup. I don't know whether there is a dictionary hack, officially sanctioned by Oracle Support, to make such a table valid again.

In my opinion the FORCE option should check whether the schema contains such dependent tables and, if so, error out with a message saying that the table needs to be dropped first (or moved to a different schema) before ENABLE EDITIONS can succeed. This would make the situation much clearer than leaving behind tables in INVALID state that cannot be fixed or recompiled afterwards.

Below is a simple test case that demonstrates the issue:


-- Create a test user for enabling editions
drop user ebr_test cascade;

create user ebr_test identified by ebr_test;

grant dba to ebr_test;

-- Connect as test user
connect ebr_test/ebr_test

-- and create an object-relational type
create type test_coll_type as table of number;

-- and a table having a column of that type
create table test_ebr_table (col1 number, col2 number, test_coll test_coll_type) nested table test_coll store as test_coll_table;

-- Some test data
insert into test_ebr_table (col1, col2, test_coll) values (1, 1, test_coll_type(1, 2, 3));

commit;

-- Everything is fine so far
select * from test_ebr_table, table(test_coll);

-- Enable editions, FORCE is required
alter user ebr_test enable editions force;

-- This no longer works (ORA-04063: table "EBR_TEST.TEST_EBR_TABLE" has errors)
select * from test_ebr_table, table(test_coll);

-- Not even simple scalar values can be accessed from the table
select col1 from test_ebr_table;

-- The table has gone INVALID
select status from user_objects where object_name = 'TEST_EBR_TABLE';
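
As a defensive measure you could look for the tables at risk before even considering the FORCE option. The following is just a suggested check of my own, not an official pre-check: any table column based on a user-defined type shows up in USER_TAB_COLUMNS with a non-NULL DATA_TYPE_OWNER.

-- List tables in the schema having columns based on object types -
-- these are the tables that ENABLE EDITIONS ... FORCE would leave INVALID
-- (note this also picks up columns based on SYS-owned types such as XMLTYPE)
select  table_name, column_name, data_type_owner, data_type
from    user_tab_columns
where   data_type_owner is not null;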
Jonathan Lewis

Not Exists

Another question on a seemingly simple “not exists” query has appeared on OTN just a few days after my last post about the construct. There are two little differences between the actual form of the two queries that make it worth repeating the analysis.

The first query was of the form:


select from big_table
where  not exists (select exact_matching_row from small_table);

while the new query is of the form:


select from big_table alias1
where not exists (select inexact_matching_row from big_table alias2)

In the absence of a complete explanation, we might guess that the intention of the first query is to: “check correctness of undeclared foreign key constraint” i.e. small_table is the parent table with unique values, and big_table is the child end with some data that may be invalid due to a recent bulk data load. (In my example for the previous posting I relaxed the uniqueness assumption in small_table to make the problem a little more expensive.)

Our guess for the second query ought to be different; we are using the same table twice, and we are checking for the non-existence of “imperfect matches”. This introduces two potential threats – first, that the (possibly pseudo-)join is between two large tables and therefore inherently likely to be expensive; second, that we may be allowed to have very large numbers of “perfect matches” that will escalate the scale of the join quite dramatically. Here’s the actual query (second version) from the posting; this will make it easier to explain why the structure of the query introduces the second threat and requires us (as so often) to understand the data in order to optimise the query execution:


select  count(*)
from    ubl_stg.wk_sap_fat w1
where   not exists (
                select  1
                from    ubl_stg.wk_sap_fat w2
                where   w2.mes_mese_id =  w1.mes_mese_id
                and     w2.sistema     <> w1.sistema
        )
;

  • Note the schema name ubl_stg – doesn’t that hint at “staging tables” for a data load?
  • Note the column name mes_mese_id – an “id” column, but clearly not one that’s supposed to be unique, so possibly a foreign key column that allows repetitions.

The code is looking for, then counting, cases where for a given value of mes_mese_id there is only one corresponding value used for sistema. Since we don’t know the application we don’t know whether this count should be low or high relative to the number of rows in the table, nor do we know how many distinct values there might be of mes_mese_id – and these pieces of information are critical to identifying the best execution plan for the query.
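For what it’s worth, a single query against the table (a sketch using the original table name) would supply both of those pieces of information – the number of distinct mes_mese_id values and the worst-case repetition count:

select
        count(*)        distinct_ids,
        max(cnt)        max_repetitions,
        sum(cnt)        total_rows
from    (
        select  mes_mese_id, count(*) cnt
        from    ubl_stg.wk_sap_fat
        group by
                mes_mese_id
        )
;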

The owner of this query told us that the table held 2 million rows which are “deleted and reloaded” every day, showed us that the (default) execution plan was a hash anti-join which took two hours to complete, told us that he (or she) didn’t think that an index would help because of the “not equal” predicate, and finally reported back that the problem was solved (no timing information supplied) by the addition of the /*+ no_unnest */ hint.

The question is – what has happened and why is there such a difference in performance?

The first observation is that the 2 hours does seem an unreasonably long time for the query to run – and since it’s only a select statement it would be good to run it again and take a snapshot of the session statistics and session events (v$sesstat, v$session_event) for the session to see what the workload was and where the time was spent. Given the “deleted and reloaded” comment it’s possible that there may be some unexpected overhead due to some strange effects of read-consistency or delayed block cleanout, so we might also take a snapshot of tablespace or file I/O to check for lots of I/O on the undo tablespace (which might also show an I/O problem on the table’s tablespace or the user’s TEMP tablespace). The snapshots give us a very cheap, non-invasive option for getting some summary stats – but the tablespace/file stats are system-wide, of course, so may not tell us anything about our specific task, so we might even enable extended tracing (event 10046 level 8 / dbms_monitor with waits) to see in detail where we are losing time.

Since we don’t currently have the information we need to explain the two hours (it may have appeared by the time I post this note) it might be instructive to make some guesses about where the time could go in the hash anti-join, and why the /*+ no_unnest */ hint could make a difference.

Ignoring the possibility of strange undo/read-consistency/cleanout effects the first possibility is simply that the hash join is large and turns into a multi-pass I/O thrash. The mes_mese_id column looks like it might be a number (id columns so often are) but the sistema column has the flavour of a reasonably large character column – so maybe our hash table has to be a couple of hundred megabytes – that could certainly be enough to spill to disc, though you’d have to have a really small PGA availability for it to turn into a multi-pass hash join.

Another possibility is that the pattern in the data makes the hash join burn up a huge amount of CPU – that should be easy to see on a re-run.  If there are relatively few distinct sets of values for (mes_mese_id, sistema) and there are very few cases where a mes_mese_id is associated with more than one sistema, then a large fraction of the hash table probes would have to follow a very long chain of matches to the very end, and that would take a large amount of CPU.

Pursue that “long chain” hypothesis to a slight extreme – what if there’s one mes_mese_id that appears in 250,000 of the 2M rows, and the sistema value is the same for every one of those quarter million rows? That would require Oracle to walk a chain of 250,000 elements 250,000 times, for a total of 62.5 billion pointers to follow and comparisons to make – how much CPU might that take? Worse still, since the hashing is only on the equality predicate, having the same mes_mese_id means the quarter of a million rows would all have to go into the same hash bucket, so we might end up doing a multi-pass operation because that one bucket was very much larger than anything Oracle had anticipated when it did its internal partitioning calculations for the hash table. (There are some related notes in this article and in chapter 12 of Cost Based Oracle – Fundamentals.)

Why would the /*+ no_unnest */ hint address these performance problems? My first guess would be that the OP may have created an index on (mes_mese_id, sistema), in which case a full scan of that index (or even a fast full scan if the index were newly created, or even a tablescan if the data had been loaded in the right order) followed by the filter subquery being driven by an index range scan would result in a relatively efficient subquery being executed once per distinct value of (mes_mese_id, sistema) rather than once per row. This could be much more efficient than a really badly skewed data set doing the hash anti-join. In fact, even in the absence of the index, if the number of distinct combinations was really quite small, that many tablescans – if the table were cached, whether in Oracle or the filesystem or the SAN cache – might still be a lot faster than the hash anti-join. (We were told that the problem was solved – but not told how much time constituted a viable solution.)

Models

Hand-waving and talk is fine – but a few modelled results might make it easier to comprehend, so here’s a way of generating a few versions of data sets and testing:


drop table t1 purge;

create table t1
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                                          id,
/*
        --
        --      Random generation nearly guaranteeing no random
        --      duplicates of (x,y) but quite a lot of mismatches
        --      Count 735,715 in 3.36 seconds.
        --
        trunc(dbms_random.value(0,2e6))                 x,
        lpad(trunc(dbms_random.value(0,1e6)),64,'0')    y
*/
/*
        --
        --      One specific pair repeated many times with no mismatches; the rest
        --      as previously generated above. Times with different repeat counts:
        --       10,000            14 seconds
        --       20,000            52 seconds
        --       40,000           202 seconds
        --      CPU time quadruples as the count doubles (twice the number of
        --      probes walking twice the number of steps in the hash chain)
        --       80,000 =>        800 seconds
        --      160,000 =>      3,200 seconds
        --      240,000 =>      ca. 2 hours.
*/
        case
                when rownum <= 40000
                        then 2e6
                        else trunc(dbms_random.value(0,2e6))
        end                                             x,
        case
                when rownum <= 40000
                        then
                                lpad(0,64,'0')
                        else
                                lpad(
                                        trunc((rownum - 1)/1000) +
                                                case mod(rownum-1,1000) when 0 then 0 else 0 end,
                                        64,'0'
                                )
        end                                             y
/*
        --
        --      2,000 distinct values repeated 1,000 times each
        --      Query result: 2,000,000 in 235 seconds.
        --
        trunc((rownum - 1)/1000)                x,
        lpad(
                trunc((rownum - 1)/1000) +
                        case mod(rownum-1,1000) when 0 then 0 else 0 end,
                64,'0'
        )                                       y
*/
from
        generator       v1,
        generator       v2
where
        rownum <= 2e6
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

select
        count(*)
from    t1 w1
where   not exists (
                select  1
                from    t1 w2
                where   w2.x = w1.x
                and     w2.y <> w1.y
);  
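
For reference, the hinted variant discussed in the next section looks like this – as the Predicate Information sections of the plans below confirm, the /*+ no_unnest */ hint goes inside the subquery:

select
        count(*)
from    t1 w1
where   not exists (
                select  /*+ no_unnest */
                        1
                from    t1 w2
                where   w2.x = w1.x
                and     w2.y <> w1.y
        )
;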

As you can see I’ve changed the table and column names – this keeps them in line with the original SQL statement presented on OTN before we got the graphic display of a similar statement and plan. The SQL to create the data includes three variants, with 5 sets of results (and 3 conjectures) recorded in the comments from runs on 11.2.0.4. These results appeared when the optimizer took the hash anti-join execution plan, which also spilled to disc with a one-pass workarea operation that first extended to about 200MB of PGA and then dropped back to about 8MB.

To summarise the results recorded in the SQL – if we use the term “bad data” to describe rows where more than one Y value appears for a given X value then:

  1. With a large number of distinct pairs and a lot of bad data: the anti-join is pretty fast at 3.36 seconds.
  2. With no bad data, a small number of distinct pairs (2,000) and lots of rows per pair (1,000): the anti-join takes 235 CPU seconds
  3. As for #1 above, but with one extreme “good” pair that appears a large number of times: CPU time is proportional to the square of the number of duplicates of this value

I didn’t actually test beyond 40,000 duplicates for the last case, but you can see the double/quadruple pattern very clearly and the CPU time would have hit 2 hours at around 240,000 identical copies of one (x,y) pair.

/*+ no_unnest */

So what happens if you try using the /*+ no_unnest */ hint ? The target here is that the more repetitive the data the smaller the number of times you may have to run the subquery; and if you can get the driving data ordered you can guarantee the smallest possible number of runs of the subquery. I haven’t worked through all the possibilities, but to give you the flavour of what can happen, when I added the /*+ no_unnest */ hint to the query (and ensured that the table would be cached rather than read using direct path reads into the PGA) the execution time for the test with 1,000 copies of 2,000 pairs took 181 seconds to do 2,001 tablescans (compared with 235 seconds to do 2 tablescans and a hash anti-join) with the following execution path:


----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |    69 |  8256K  (4)| 11:28:04 |
|   1 |  SORT AGGREGATE     |      |     1 |    69 |            |          |
|*  2 |   FILTER            |      |       |       |            |          |
|   3 |    TABLE ACCESS FULL| T1   |  2000K|   131M|  2877   (4)| 00:00:15 |
|*  4 |    TABLE ACCESS FULL| T1   |     2 |   138 |     4   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( NOT EXISTS (SELECT /*+ NO_UNNEST NO_INDEX ("W2") */ 0
              FROM "T1" "W2" WHERE "W2"."X"=:B1 AND "W2"."Y"<>:B2))
   4 - filter("W2"."X"=:B1 AND "W2"."Y"<>:B2)

More significantly, when I created the optimum index the execution time dropped to 0.9 seconds – here’s the create index statement and subsequent plan. The extreme benefit appears because the data was effectively loaded in sorted order; if this had not been the case I would have forced an index full scan for the driving data set (with a “not null” predicate or constraint to make it possible for the optimizer to use the index to drive the query).


create index t1_i1 on t1(x,y) compress pctfree 0;

-----------------------------------------------------------------------------
| Id  | Operation           | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |       |     1 |    69 |  6008K  (1)| 08:20:45 |
|   1 |  SORT AGGREGATE     |       |     1 |    69 |            |          |
|*  2 |   FILTER            |       |       |       |            |          |
|   3 |    TABLE ACCESS FULL| T1    |  2000K|   131M|  2877   (4)| 00:00:15 |
|*  4 |    INDEX RANGE SCAN | T1_I1 |     2 |   138 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( NOT EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T1" "W2"
              WHERE "W2"."X"=:B1 AND "W2"."Y"<>:B2))
   4 - access("W2"."X"=:B1)
       filter("W2"."Y"<>:B1)

Similarly, when I had the index in place and the 40,000 repetitions of a “good pair”, Oracle took a total of 9 seconds even though it had to run the subquery 1,960,000 times for the non-repetitive data (and once for the repetitive pair). I have to say I was a little surprised at how rapidly it managed to get through those 2M subquery executions – but then I keep forgetting how ridiculously overpowered my new laptop is.

With these figures in mind you can appreciate that if the OP had lots of pairs with tens of thousands of repetitions, then even without creating the index on the table, the query time might drop from 2 hours for the hash anti-join to “a few minutes” for the filter subquery with a time that was good enough to look like “problem solved”.

Summary

If you have a few “long hash chains” in the build table – i.e. rows from the “first” table in the join that hash to the same value – then the amount of work Oracle has to do to check a single row from the probe (“second”) table that matches on the hash value can become significant. If a large number of rows from the probe table hit a long hash chain then the CPU time for the whole join can climb dramatically and you may want to force Oracle away from the hash join.

If the long chains are the result of a skewed distribution – where a small number of values appear very frequently in a table whose other distinct values each appear infrequently – then the optimizer may not notice the threat and may choose the hash plan when there is a much less resource-intensive alternative available.

Footnote:

There was another interesting observation I made while doing the experiments, relating to whether the chains are due to hash collisions or require exact matches in the data – but I’ve spent an hour on designing and running tests, and nearly 4 hours writing up the results so far. I need to do a few more tests to work out whether I’m seeing a very clever optimisation or a lucky coincidence in a certain scenario – so I’m going to save that for another day.

tanelpoder

Advanced Oracle Troubleshooting Guide – Part 12: control file reads causing enq: SQ – contention waits?

Vishal Desai systematically troubleshot an interesting case where the initial symptoms pointed to a spike of enq: SQ – contention waits, but he dug deeper – and found the root cause to be quite different. He followed the blockers of waiting sessions manually to reach the root cause – and also used my @ash/ash_wait_chains.sql and @ash/event_hist.sql scripts to extract the same information more conveniently (note that he had modified the scripts to take AWR snap_ids as time range parameters instead of the usual date/timestamp).

Definitely worth a read if you’re into troubleshooting non-trivial performance problems :)

marco

Database landscape 2014 visualization

I saw this database landscape 2014 overview from “451 Research” with a very nice visualization…

martin.bach

Little things worth knowing: Is there a penalty in establishing a connection to Oracle using the MAA connection string? Part 2

In the very lengthy previous post I wanted to explain the use of the MAA connection string as promoted by Oracle. I deliberately kept the first part simple: both primary and standby cluster were up, and although the database was operating in the primary role on what I called the standby cluster (again, it’s probably not a good idea to include the intended role in infrastructure names) there was no penalty establishing a connection.

As pointed out by readers of the blog entry that is of course only one part of the story (you may have guessed by the TNS alias MAA_TEST1 … there are more to come). When you define the connection string this way, you cater for the situation where something goes wrong. Let’s try a few more scenarios. Here is the current Data Guard situation:

DGMGRL> show configuration

Configuration - martin

  Protection Mode: MaxPerformance
  Members:
  CDB   - Primary database
    Error: ORA-16810: multiple errors or warnings detected for the database

    STDBY - Physical standby database 
      Error: ORA-12543: TNS:destination host unreachable

Fast-Start Failover: DISABLED

Configuration Status:
ERROR   (status updated 71 seconds ago)
DGMGRL> show database 'CDB'

Database - CDB

  Role:               PRIMARY
  Intended State:     TRANSPORT-ON
  Instance(s):
    CDB1
      Error: ORA-16737: the redo transport service for standby database "STDBY" has an error

    CDB2
      Error: ORA-16737: the redo transport service for standby database "STDBY" has an error

Database Status:
ERROR

As you can see CDB is operating in primary role, and redo transport is broken. The simple reason for the broken redo transport is that the standby cluster in its entirety is down. For the sake of completeness I have repeated the MAA connection string here:

MAA_TEST1 =
  (DESCRIPTION_LIST=
     (LOAD_BALANCE=off)(FAILOVER=on)
       (DESCRIPTION=
         (CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)
           (ADDRESS_LIST= (LOAD_BALANCE=on) (ADDRESS=(PROTOCOL=TCP)(HOST=rac12pri-scan)(PORT=1521)))
           (CONNECT_DATA=(SERVICE_NAME=rootsrv)))
       (DESCRIPTION= (CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)
           (ADDRESS_LIST= (LOAD_BALANCE=on) (ADDRESS=(PROTOCOL=TCP)(HOST= rac12sby-scan)(PORT=1521)))
           (CONNECT_DATA=(SERVICE_NAME=rootsrv)))
  )

Scenario 1: shutdown of the standby cluster, database in primary role on primary cluster

This does not seem to be much of a problem. Using the same test as in part 1 I get no noteworthy penalty at all:

[oracle@lab tns]$ time sqlplus system/secret@maa_test1 <<EOF
> select sys_context('userenv','instance_name') from dual;
> exit
> EOF
SYS_CONTEXT('USERENV','INSTANCE_NAME')
--------------------------------------------------------------------------------
CDB2

SQL> Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Advanced Analytics and Real Application Testing options

real	0m0.246s
user	0m0.020s
sys	0m0.025s
[oracle@lab tns]$ time sqlplus system/secret@maa_test1 <<EOF
> select sys_context('userenv','instance_name') from dual;
> exit
> EOF
SYS_CONTEXT('USERENV','INSTANCE_NAME')
--------------------------------------------------------------------------------
CDB1

SQL> Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Advanced Analytics and Real Application Testing options

real	0m0.171s
user	0m0.018s
sys	0m0.018s
[oracle@lab tns]$ 

Trying from a 12.1.0.1.0 client on a different machine:

[oracle@oracledev ~]$ time sqlplus system/secret@maa_test1 <<EOF
> select sys_context('userenv','instance_name') from dual;
> exit
> EOF

SQL*Plus: Release 12.1.0.1.0 Production on Thu Apr 23 10:13:50 2015

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Last Successful login time: Thu Apr 23 2015 10:12:11 +01:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Advanced Analytics and Real Application Testing options

SQL> 
SYS_CONTEXT('USERENV','INSTANCE_NAME')
--------------------------------------------------------------------------------
CDB1

SQL> Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Advanced Analytics and Real Application Testing options

real	0m0.427s
user	0m0.012s
sys	0m0.027s
[oracle@oracledev ~]$ time sqlplus system/secret@maa_test1 <<EOF
> select sys_context('userenv','instance_name') from dual;
> exit
> EOF
SYS_CONTEXT('USERENV','INSTANCE_NAME')
--------------------------------------------------------------------------------
CDB2

SQL> Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Advanced Analytics and Real Application Testing options

real	0m0.158s
user	0m0.012s
sys	0m0.016s

That seems ok, but can be easily explained by the (LOAD_BALANCE=off)(FAILOVER=on) in the DESCRIPTION_LIST. But what if we do it the other way around? Let’s start by reversing the roles:

DGMGRL for Linux: Version 12.1.0.2.0 - 64bit Production

Copyright (c) 2000, 2013, Oracle. All rights reserved.

Welcome to DGMGRL, type "help" for information.
Connected as SYSDBA.
DGMGRL> switchover to 'STDBY'
Performing switchover NOW, please wait...
New primary database "STDBY" is opening...
Oracle Clusterware is restarting database "CDB" ...
Switchover succeeded, new primary is "STDBY"

DGMGRL> show configuration

Configuration - martin

  Protection Mode: MaxPerformance
  Members:
  STDBY - Primary database
    CDB   - Physical standby database 

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS   (status updated 48 seconds ago)


Scenario 2: black-out of the primary cluster, database in primary role on standby cluster

The next one is the worst case: a data centre has gone down, and none of the hosts in the other cluster (rac12pri) is reachable. To simulate this I simply shut down rac12pri1 and rac12pri2 – that’s as dead as it gets. The SCAN will still resolve in DNS, but there is neither a SCAN VIP nor a SCAN listener anywhere on the primary cluster to answer.

As you saw in the output leading up to this paragraph the primary database is started on rac12sby, nodes rac12sby1 and rac12sby2. With the current connection string there is – for the first time – a connection penalty:

[oracle@rac12sby1 ~]$ time sqlplus system/secret@maa_test1 <<EOF
> select sys_context('userenv','instance_name') from dual;
> exit;
> EOF

SQL*Plus: Release 12.1.0.2.0 Production on Tue Apr 21 07:17:36 2015

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Last Successful login time: Mon Apr 20 2015 10:39:29 +01:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Advanced Analytics and Real Application Testing options

SQL>
SYS_CONTEXT('USERENV','INSTANCE_NAME')
--------------------------------------------------------------------------------
STDBY2

SQL> Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Advanced Analytics and Real Application Testing options

real    0m35.770s
user    0m0.024s
sys     0m0.043s

[oracle@rac12sby1 ~]$ time sqlplus system/secret@maa_test1 <<EOF
> select sys_context('userenv','instance_name') from dual;
> exit
> EOF
SYS_CONTEXT('USERENV','INSTANCE_NAME')
--------------------------------------------------------------------------------
STDBY2

SQL> Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Advanced Analytics and Real Application Testing options

real    0m36.219s
user    0m0.021s
sys     0m0.022s

During testing with 12c these connect times ranged between 30 and 38 seconds. Here is proof that I truly cannot connect to the primary cluster:

[oracle@lab tns]$ ping rac12pri-scan
PING rac12pri-scan.example.com (192.168.100.111) 56(84) bytes of data.
From lab.example.com (192.168.100.1) icmp_seq=1 Destination Host Unreachable
From lab.example.com (192.168.100.1) icmp_seq=2 Destination Host Unreachable
From lab.example.com (192.168.100.1) icmp_seq=3 Destination Host Unreachable
From lab.example.com (192.168.100.1) icmp_seq=4 Destination Host Unreachable
^C
--- rac12pri-scan.example.com ping statistics ---
5 packets transmitted, 0 received, +4 errors, 100% packet loss, time 4001ms

[oracle@rac12sby1 ~]$ time telnet rac12pri-scan 1521
Trying 192.168.100.112...
telnet: connect to address 192.168.100.112: No route to host
Trying 192.168.100.113...
telnet: connect to address 192.168.100.113: No route to host
Trying 192.168.100.111...
telnet: connect to address 192.168.100.111: No route to host

real	0m9.022s
user	0m0.001s
sys	0m0.003s

I found it interesting that the telnet command took almost no time at all to complete, yet trying to connect using SQL*Net seems to take forever. Why could that be? Trace!

Getting to the bottom of this

Tracing can help sometimes, so I tried that. I try and keep my working TNS configuration separate from the production entries in $ORACLE_HOME. Using ~/tns I defined sqlnet.ora to read:

diag_adr_enabled = off
trace_level_client = support
trace_directory_client = /home/oracle/tns
log_directory_client = /home/oracle/tns
trace_unique_client = true

And connected again. I noticed that the elapsed time increases each time nsc2addr is invoked:

[oracle@lab tns]$ grep "nsc2addr.*DESC" cli_6630.trc | nl
     1  (1947452928) [23-APR-2015 11:07:45:376] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.112)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle))))
     2  (1947452928) [23-APR-2015 11:07:48:383] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.111)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle))))
     3  (1947452928) [23-APR-2015 11:07:51:390] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.113)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle))))
     4  (1947452928) [23-APR-2015 11:07:54:396] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.113)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle))))
     5  (1947452928) [23-APR-2015 11:07:57:402] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.111)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle))))
     6  (1947452928) [23-APR-2015 11:08:00:408] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.112)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle))))
     7  (1947452928) [23-APR-2015 11:08:03:414] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.113)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle))))
     8  (1947452928) [23-APR-2015 11:08:06:421] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.111)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle))))
     9  (1947452928) [23-APR-2015 11:08:09:427] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.112)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle))))
    10  (1947452928) [23-APR-2015 11:08:12:433] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.111)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle))))
    11  (1947452928) [23-APR-2015 11:08:15:439] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.112)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle))))
    12  (1947452928) [23-APR-2015 11:08:18:445] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.113)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle))))
    13  (1947452928) [23-APR-2015 11:08:21:452] nsc2addr: (DESCRIPTION=(CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.120)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle))))

Sorry for the lengthy listing, but that way it shows best! The primary SCAN is set to 111, 112 and 113, with the standby SCAN on 119, 120 and 121.

What you can see clearly by looking at the timestamps is that the transport_connect_timeout works perfectly well – a new address is tried every 3 seconds. What surprises me is that each SCAN VIP is tried 4 times before the second SCAN is tried (lines 1-12); I thought I had that limited to 3 retries … Twelve attempts at roughly 3 seconds each also account nicely for the 36-second connect times measured above. Line 13 is the first reference to the second SCAN, and as soon as that is referenced a connection is established.

That can’t be it, can it?

Hmm maybe Oracle went a bit too far by adding 3 retries to every single SCAN VIP. The whole process can be sped up by trying every SCAN VIP only once before failing over to the second SCAN. This can be done with the following connection string:

MAA_TEST2 =
  (DESCRIPTION_LIST=
     (LOAD_BALANCE=off)(FAILOVER=on)
     (CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)(RETRY_COUNT=3)
       (DESCRIPTION=
           (ADDRESS_LIST= (LOAD_BALANCE=on) (ADDRESS=(PROTOCOL=TCP)(HOST=rac12pri-scan)(PORT=1521)))
           (CONNECT_DATA=(SERVICE_NAME=rootsrv)))
       (DESCRIPTION= 
           (ADDRESS_LIST= (LOAD_BALANCE=on) (ADDRESS=(PROTOCOL=TCP)(HOST= rac12sby-scan)(PORT=1521)))
           (CONNECT_DATA=(SERVICE_NAME=rootsrv)))
  )

Trying this works a lot better from a timing perspective:

[oracle@lab tns]$ time sqlplus system/secret@maa_test2 <<EOF
> select sys_context('userenv','instance_name') from dual;
> exit
> EOF

SQL> Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Advanced Analytics and Real Application Testing options

real	0m9.829s
user	0m0.041s
sys	0m0.022s
[oracle@lab tns]$ 

Visible in the SQL*Net trace:

[oracle@lab tns]$ grep "nsc2addr.*DESC" cli_7305.trc | nl 
     1	(3474875904) [23-APR-2015 11:35:17:935] nsc2addr: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.113)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle))))
     2	(3474875904) [23-APR-2015 11:35:20:940] nsc2addr: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.112)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle))))
     3	(3474875904) [23-APR-2015 11:35:23:948] nsc2addr: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.111)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle))))
     4	(3474875904) [23-APR-2015 11:35:26:956] nsc2addr: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.100.119)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=rootsrv)(CID=(PROGRAM=sqlplus)(HOST=lab.example.com)(USER=oracle))))

That was the solution to the “slow” connectivity shared by the person who asked me the question in Paris, but I hadn’t been able to find proof as to why it works better until now: this trace is it. The question remains why the MAA string in the Application Continuity white paper is more complex. I think what’s failing here is CONNECT_TIMEOUT. According to the tnsnames.ora reference in 12c the purpose of the parameter is “To specify the timeout duration in seconds for a client to establish an Oracle Net connection to an Oracle database”. You could read this as meaning that the failover to the next SCAN occurs once CONNECT_TIMEOUT is reached. However it appears – by looking at the trace – that the connect timeout behaves much like the transport timeout, and the lower of the two is used.

There is a high probability that the syntax is in the document for a reason, although it escapes me right now. I guess the penalty when creating new connections with connection pools doesn’t really matter that much, because you create the pool (where you “pay” the penalty) only once, and subsequently pull connections from it, which should be very fast.

Richard Foote

Singapore Maths Question Solution and Very Interesting Observation (The Trickster)

OK, time to reveal the solution to the somewhat tricky Singapore maths exam question I blogged previously. Remember, there were 10 dates: May 13, May 15, May 19; June 13, June 14; July 16, July 18; August 14, August 15, August 16. Bowie only knew the month of my birthday, Ziggy only knew the day. Bowie […]

Jonathan Lewis

Golden Oldies

I’ve just been motivated to resurrect a few articles I wrote for DBAZine about 12 years ago on the topic of bitmap indexes. All three links point to Word 97 documents which I posted on my old website in September 2003. Despite their age they’re still surprisingly good.

Update: 26th April 2015

Prompted by my reply to comment #2 below, I looked at what I said about bitmap indexes in Practical Oracle 8i (published 15 years ago) and found this gem:

An interesting feature of bitmap indexes is that it is rather hard to predict how large the index segment will be. The size of a B-tree index is based very closely on the number of rows and the typical size of the entries in the index column. The size of a bitmap index is dictated by a fairly small number of bit-strings which may have been compressed to some degree depending upon the number of consecutive 1’s and 0’s.

To pick an extreme example, imagine a table of one million rows that has one column that may contain one of eight values ‘A’ to ‘H’ say, which has been generated in one of the two following extreme patterns:

  • All the rows for a given value appear together, so scanning down the table we get 125,000 rows with ‘A’ followed by 125,000 rows of ‘B’ and so on.
  • The rows cycle through the values in turn, so scanning down the table we get ‘A’,’B’. . . ‘H’ repeated 125,000 times.

What will the bitmap indexes look like in the two cases?

For the first example, the basic map for the ‘A’ value will be 125,000 one-bits, followed by 875,000 zero bits – which will be trimmed off. Splitting the 125,000 bits into bytes and adding the necessary overhead of about 12% we get an entry for the ‘A’ rows of about 18K. A similar argument applies for each of the values ‘B’ to ‘H’, so we get a total index size of around 8 x 18K – giving 144K.

For the second example, the basic map for the ‘A’ value will be a one followed by 7 zeros, repeated 125,000 times. There is no chance of compression here, so the ‘A’ entry will start at 125,000 bytes. Adding the overhead this goes up to 140K, and repeating the argument for the values ‘B’ to ‘H’ we get a total index of 1.12 MB.

This wild variation in size looks like a threat, but to put this into perspective, a standard B-tree index on this column would run to about 12MB irrespective of the pattern of the data. It would probably take about ten times as long to build as well.

As we can see, the size of a bitmap index can be affected dramatically by the packing of the column it depends upon, as well as the number of different possible values the column can hold and the number of rows in the table. The compression that is applied before the index is stored, and the amazing variation in the resulting index size, do mean that the number of different values allowed in the column can be much larger than you might first expect. In fact it is often better to think of bitmap indexes in terms of how many occurrences of each value there are, rather than in terms of how many different values exist. Viewing the issue from this direction, a bitmap is often better than a B-tree when each value occurs more than a few hundred times in the table (but see the note below following the description of bitmap index entries).
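If you want to check the effect for yourself on a current version of Oracle, a small sketch along the following lines (the table and index names are my own) reproduces the two extreme patterns from the extract above:

-- 125,000 consecutive rows per value ('A' to 'H')
create table t_clustered as
select chr(65 + trunc((rownum - 1)/125000)) v
from   dual connect by level <= 1e6;

-- the eight values cycling row by row
create table t_scattered as
select chr(65 + mod(rownum - 1, 8)) v
from   dual connect by level <= 1e6;

create bitmap index t_clustered_bi on t_clustered(v);
create bitmap index t_scattered_bi on t_scattered(v);

-- compare the resulting index segment sizes
select segment_name, bytes/1024 kb
from   user_segments
where  segment_name in ('T_CLUSTERED_BI', 'T_SCATTERED_BI');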

dbakevlar

Kickstarting After a Failed Addition to the AWR Warehouse

A common issue I’ve noted is dump files generated from the AWR Warehouse that, upon a failure to transfer, simply sit there: they never upload, and the data is stuck in a “limbo” state between the source database (target) and the AWR Warehouse.  This can be a very difficult issue to troubleshoot, as no errors are shown in the actual AWR Warehouse “View Errors” and no data from the source is present in the AWR Warehouse.


Empowered by EM Jobs

If you go to Enterprise –> Jobs –> Activity and search for the %CAW% jobs that perform the extraction, transfer and load, you can inspect the Log Report of the run that experienced a problem, view the error and examine the details of the issue.


If you double click on the job in question, you’ll notice in the example above that the jobs have the naming convention CAW_RUN_ETL_NOW.  This is due to a forced run via the AWR Warehouse console (Actions –> Upload Snapshots Now).  If the job had been a standard ETL run, the naming convention would be CAW_TRANSFER.

The job is a multi-step process, so you will see where the extraction step failed.  This is a bit misleading, as the failure that set all this in motion was a preferred credential issue that stopped the transfer step to the AWR Warehouse.  If you look far enough back in the jobs, you’ll find the original error, but now we are stuck in a loop – the source can’t go forward even once the credentials are fixed, and it’s difficult for someone unfamiliar with the process to know where it all went wrong.  The first recommendation is often to remove the database and re-add it, but in this scenario, we are going to kickstart the process now that the credentials have been fixed.

Digging into a Job Error


As we stated earlier, you’ll see in the steps that the extract failed, but the transfer would have been successful if a file had been created.  Double click on “Failed” to see the error that occurred.


In the SQL output, the error states: ORA-20137: NO NEW SNAPSHOTS TO EXTRACT

Now we know that no files have been uploaded to the AWR Warehouse yet, but the logic in the AWR Warehouse package in the DBSNMP schema on the source database thinks it has already pulled all of these snapshots into an extract and created a dumpfile.  This error is very clear in telling the user what is going on: there is data telling the ETL process that the snapshots have ALREADY been extracted.

Disclaimer:  This solution is for a BRAND NEW ADDITION TO THE AWR WAREHOUSE ONLY.  You wouldn’t want to perform this on a source database that had been loading properly and then stopped after successful uploads to the warehouse (if you have one of those, I would proceed differently, so please keep this in mind before you attempt this in your own environment…).  This fix also depends on all preferred credential issues having been resolved BEFOREHAND.

The Kickstart

To “kickstart” the process after a failure, first verify that there are no errors that aren’t displayed in the console:

select * from dbsnmp.caw_extract_metadata;

Next, gather the location for the dumpfiles:

select * from dbsnmp.caw_extract_properties;

There will be one row in this table.  It will include the oldest snap ID, the newest snap ID and the location of the dumpfile (often the agent home unless otherwise configured).  This is the table the logic in the package uses to verify what has already been extracted.  We now need to remove this tracking information and the pre-existing dumpfiles created by the previous failed processes:

  1. Make a copy of this table (create table dbsnmp.caw_extract_prop2 as select * from dbsnmp.caw_extract_properties;) – see the SQL sketch after this list.
  2. Truncate the dbsnmp.caw_extract_properties table.
  3. Delete the extract (dump) files from the directory shown in the caw_extract_properties table.  Don’t remove anything else from that directory!
  4. Log into the AWR Warehouse console.
  5. Click on the source database you just scoured of the ETL extract files.
  6. Highlight the database, click on Actions, then click on “Upload Snapshots Now.”
  7. View the job via the link displayed at the top of the console and monitor it to completion.
  8. Once the job has succeeded completely, remove the dbsnmp.caw_extract_prop2 table and any remaining dump files from the earlier failed extracts.
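
For reference, here is a minimal SQL sketch of the table-related steps (1, 2 and 8 above), assuming the standard DBSNMP schema used throughout this post:

-- Step 1: keep a copy of the tracking row before removing it
create table dbsnmp.caw_extract_prop2 as
select * from dbsnmp.caw_extract_properties;

-- Step 2: remove the tracking information so the next ETL run starts fresh
truncate table dbsnmp.caw_extract_properties;

-- Step 8: once the re-upload has succeeded, clean up the copy
drop table dbsnmp.caw_extract_prop2;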

You should now see successful upload jobs from this point on, along with data in your AWR Warehouse.




mwidlake

My First Published Article

I’ve been blogging now for almost 6 years and presenting at conferences for… 12 years (really? Good grief!). I’ve even written and delivered several courses, ranging from 1 day to 3 days in length. However, up until now I’ve never been what I would term published – i.e. managed to persuade another organisation or person to publish something I have written.

That changed a few days ago when the latest UKOUG “Oracle Scene” magazine came out, which included the first of a small series of articles I am doing on how the Oracle RDBMS works – the processes and activities that underlie the core RDBMS engine. It’s based on my “how Oracle works in under an hour” presentation where I give the audience an overview of things like the redo mechanism, what a commit *is*, how data is moved into and out of memory and which parts of memory it resides in, how a point-in-time view is maintained… things like that. Many people don’t really know any of this stuff, even skilled and experienced developers and DBAs, as you can get by without knowing it. But understanding the core architecture makes a lot of how Oracle works make more sense.

The below is a screen shot of the title and first paragraph, but you can use the link above to see the whole article.

Title and first paragraph of the article

I’m not sure why it has taken me so long to publish something other than via my blog and presentations. I know part of it is the fear of putting something out there that is wrong or misleading. If it is on my blog, heck it’s only a blog and I stick to things I give test cases for or my thoughts and opinions (which are intrinsically open to interpretation). My presentations are certainly put “out there” but again I of course try to ensure what I say I can back up. I think the key thing is that in both cases it is very obvious who you can blame if it turns out I have made a mistake. Me.

But when something is going to be published I feel that (a) it might be taken more seriously so I need to make extra sure it is correct and (b) if I get something wrong or, more concerning, mislead anyone then the people publishing the article could also be put in a poor light. I think that is what has made me wary.

The irony is that the first thing I get published, I know that there are some inaccuracies in there! The article (and also the presentation it is derived from) is an introduction to a lot of technology and I have to simplify things and ignore many exceptions to keep it small and easy to digest. It’s how it works 90% of the time, and you need to know that so you will better understand the exceptions and finer detail I don’t have time to tell you about. For the physical presentation I spend a minute at the top of the talk saying I have simplified, occasionally lied, but that the overall principles and feel are correct. I had to drop that bit from the article as, well, it took a lot of words to explain and the article was long enough already!

Another reason NOT to publish is that it takes a lot of time and effort to prepare the material in a way that is polished enough to be printed, and I know from friends that the actual financial payback for e.g. writing a book is very, very, very poor. No one I know makes enough from royalties on technical books to make the effort worthwhile {though there are other less tangible benefits}. But I have time at present so I can afford to do these things. If you want to make money out of publishing, write about a load of elves, an often-wimpy trainee wizard or something with sex in. Or all three together.

I did nearly put a technical book together about 10 or 12 years ago, called “The Little Book of Very Large Databases”, as it was something I knew a lot about but the issues were rarely discussed publicly – most VLDBs were (and are) run by financial organisations or “defence” {why can’t they be honest and refer to themselves as “Killing & Spying”} and they don’t talk. O’Reilly was doing several small, A6 booklet-type books at the time that it would have suited. I can’t do it now, I know nothing about Cloud and some of the 12c features that would help with VLDBs, so I missed the boat. I regret not giving it a go. However, there is a possibility I might be involved in a book sometime in the future.

I have to thank Brendan Tierney for hassling me into doing this series of articles. I’m not being derogatory when I say he hassled me, he did, but Brendan did so in a very nice way and also gave me the odd toe in the backside when I needed it.

I also have to thank Jonathan Lewis. If this article had been a book he would have got a huge mention for being my technical reviewer. He was good enough to look over the article and let me know a couple of things he felt I had over simplified, some things with the flow and also something I had simply got wrong. You know that bit in books about “thanks to Dave for assisting but all mistakes are mine”. Well, I always thought it was a bit overly… defensive? Well now I don’t.

All mistakes are mine. I want no blame falling on the people who helped me!

I still can’t take my Bio too seriously

Jonathan Lewis

Manuals

From time to time I read a question (or, worse, an answer) on OTN and wonder how someone could have managed to misunderstand some fundamental feature of Oracle – and then, as I keep telling people everyone should do, I re-read the manuals and realise that sometimes the manuals make it really easy to come to the wrong conclusion.

Having nothing exciting to do on the plane to Bucharest today, I decided it was time to read the Concepts manual again – the 12c version – to remind myself of how much I’ve forgotten. Since I was reading the mobi version on an iPad mini I can’t quote page numbers, but at “location 9913 of 16157” I found the following text in a sidebar:

“LGWR can write redo log entries to disk before a transaction commits. The redo entries become permanent only if the transaction later commits.”

Now I know what that’s trying to say because I already know how Oracle works – but it explains the various questions that I’ve seen on OTN (and elsewhere) struggling with the idea of how Oracle manages to “not have” redo for transactions that didn’t commit.

The redo entries become permanent the moment they are written to disc – nothing makes any of the content of the redo log files disappear [1]; nothing goes back and flags some bits of the redo log as “not really there”. It’s the changes to the data blocks that have been described by the redo that become permanent only if the transaction later commits. If the transaction rolls back [2] the session doesn’t “seek and destroy” the previous redo; it generates MORE redo (based on the descriptions that it originally put into the undo segment) and applies the changes described by that redo to reverse out the effects of the previous changes.
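
A quick way to see this for yourself – a minimal sketch, assuming some table t1 with a padding column you can safely update – is to watch your own session’s “redo size” statistic around an update and its rollback:

-- check the session's current redo figure (repeat after each step below)
select  sn.name, ms.value
from    v$mystat ms, v$statname sn
where   ms.statistic# = sn.statistic#
and     sn.name = 'redo size';

update t1 set padding = upper(padding);

-- 'redo size' has increased, but nothing is yet permanent in the table

rollback;

-- 'redo size' goes UP again: the rollback itself is described by new redo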

So next time you see a really bizarre question about how Oracle works remember that it could have arisen from someone reading the manual carefully; because sometimes the manual writers know exactly what they mean to say but don’t actually say it clearly and unambiguously.

[1] I am aware that strange and rare events such as disc crashes could make all sorts of things disappear, but I think it’s reasonable to assume here that we’re talking about standard processing mechanisms.

[2] I am also aware that there are variations dependent on events like sessions being killed, or instance failure, that could need some further explanation, but there’s a time, place, and pace, for everything.