Index splits

Jonathan Lewis's picture

After writing this note I came to the conclusion that it will be of no practical benefit to anyone …  but I’m publishing it anyway because it’s just an interesting little observation about the thought processes of some Oracle designer/developer. (Or maybe it’s an indication of how it’s sensible to re-use existing code rather than coding for a particular boundary case, or maybe it’s an example of how to take advantage of “dead time” to add a little icing to the cake when the extra time required might not get noticed). Anyway, the topic came up in a recent thread on the OTN/ODC database forum and since the description given there wasn’t quite right I thought I’d write up a correction and a few extra notes.

When an index leaf block is full and a new row has to be inserted in the block Oracle will usually allocate a new leaf block, split the contents of the full block fairly evenly between two leaf blocks, then update various pointers to bring the index structure up to date. At various moments in this process the branch block above the leaf block and the leaf blocks either side of the splitting block have to be pinned. The number of times this happens is reported under the statistic “leaf node splits” but there is a special subset of leaf node splits that handles the case when the key in the new row is greater than the current high value in the block and the block is the “rightmost” (i.e. high values) block in the index. In this case Oracle adds a new leaf block to the end of the index and inserts the new value in the new block; it doesn’t share the data at all between the old and new leaf blocks. This special case is reported under the statistic “leaf node 90-10 splits”, even though “100-0” would be a more accurate description than “90-10”.

This note is a description of the work done by  a leaf node split and compares the work for a “50-50” split (as the general case is often called) and a 90-10 split. You might think that the latter would be less resource-intensive than the former but, in fact, that’s not the case. Here’s a little script to get things going – I’m using an 8KB block size and ASSM (automatic segment space management); if your default tablespace definition is different the number of rows you have to use will need to be changed.


rem
rem     Script:         index_splits3.sql
rem     Author:         Jonathan Lewis
rem     Dated:          September 2007
rem

start setenv
set timing off
set feedback off

define m_limit = 292

drop table t1 purge;

create table t1 (id number, idx_pad varchar2(40), padding varchar2(50));
alter table t1 add constraint t1_pk primary key(id, idx_pad);

column object_id new_value m_index_id

select
        object_id
from
        user_objects
where
        object_name = 'T1_PK'
and     object_type = 'INDEX'
;

begin
        for i in 1..&m_limit loop
                insert into t1 values(
                        i, rpad('x',40,'x'), rpad(' ',50)
                );
                commit;
        end loop;
end;
/


I’ve created a table with a two-column primary key and inserted “m_limit” rows into that table in an order that matches the primary key. The limit of 292 rows (which I’ve declared at the top of the program) means that the index entries for the data set will exactly fill two leaf blocks. I’ve captured the object_id of the index because I want to do a “treedump” of the index before and after inserting the next row.

I now need to run one of two tests inserting a single row. Either insert a row that is above the current highest value to force a 90-10 index leaf node split, or insert a row a little below the current high value to force a 50-50 index node split in the 2nd of the two index leaf blocks.


alter session set events 'immediate trace name treedump level &m_index_id';

alter system switch logfile;
execute snap_my_stats.start_snap;

begin
        for i in &m_limit + 1 .. &m_limit + 1  loop
                insert into t1 values(
                        i, rpad('x',40,'x'), rpad(' ',50)
--                      i - 2 , rpad('y',40,'y'), rpad(' ',50)
                );
                commit;
        end loop;
end;
/

execute snap_my_stats.end_snap

alter session set events 'immediate trace name treedump level &m_index_id';

execute dump_log

The calls to snap_my_stats are using a package I wrote a long time ago to report a delta in my session’s stats. The call to dump_log uses another little package to identify the current log file and issue an “alter system dump logfile …” command. Of the two possible sets of values for the row being inserted the first one will cause a 90-10 split the second (commented out) will cause a 50-50 split.

Here are the results from the calls to treedump – first the dump taken before the insert then the dump after a 90-10 split, finally the dump after re-running the test and forcing a 50-50 split. These results came from a database running 11.2.0.4, but the results are the same for 12.1.0.2 and 18.3.0.0:


----- begin tree dump
branch: 0x140008b 20971659 (0: nrow: 2, level: 1)
   leaf: 0x140008e 20971662 (-1: nrow: 147 rrow: 147)
   leaf: 0x140008f 20971663 (0: nrow: 145 rrow: 145)
----- end tree dump

----- begin tree dump
branch: 0x140008b 20971659 (0: nrow: 3, level: 1)
   leaf: 0x140008e 20971662 (-1: nrow: 147 rrow: 147)
   leaf: 0x140008f 20971663 (0: nrow: 145 rrow: 145)
   leaf: 0x140008c 20971660 (1: nrow: 1 rrow: 1)
----- end tree dump


----- begin tree dump
branch: 0x140008b 20971659 (0: nrow: 3, level: 1)
   leaf: 0x140008e 20971662 (-1: nrow: 147 rrow: 147)
   leaf: 0x140008f 20971663 (0: nrow: 78 rrow: 78)
   leaf: 0x140008c 20971660 (1: nrow: 68 rrow: 68)
----- end tree dump


As you can see the extra row in the first case has been inserted into a new leaf block leaving the 2nd leaf block (apparently) unchanged; in the second case the 145 initial rows plus the one extra row have been shared fairly evenly between two leaf block. I can’t explain the imbalance in this case, it doesn’t affect the length of the branch entry. If you’re wondering why the first leaf block held 147 entries while the original 2nd leaf block held 145 it’s because the first 100 entries in the first leaf block had a value for the id column that was 2 bytes long, after which the id needed 3 bytes storage for Oracle’s internal representation.)

Having examined the treedumps to see that the splits are 90-10 and 50-50 respectively we can now look at the undo and redo generated by the different cases. Here are the relevant values extracted from the snapshots of the session stats. Again the first set comes from the 90-10 split, the second from the 50-50 split.


Redo/Undo stats 90/10 split
--------------------------------------------
redo entries                               9
redo size                             18,500
undo change vector size                8,736

Redo/Undo stats 50/50 split
--------------------------------------------
redo entries                               9
redo size                             18,520
undo change vector size                8,736

In both cases the volume of undo and redo is the same (plus or minus a tiny bit – with tiny variations across versions). The undo is equivalent to roughly a whole block plus a few percent (and that will be copied into the redo, of course) and the “pure” redo is also equivalent to a whole block plus a few percent for a total of two data blocks worth plus a couple of thousand bytes. (The extra percentage is mostly about the old and new pointers as we break and make links in the leaf blocks and update and insert links from the branch block above.)

So why does a 90/10 split, which appears simply to add a leaf block and insert one row, do so much work? The answer lies (to start with) in the dump of the redo log file. The session statistics show 9 redo entries (redo change records) generated in both cases, so I’m going to start by picking out a summary of those records from the log file dumps using egrep to identify the lines showing the redo change record length (LEN:) and the redo change vector op codes (OP:). Here’s the output, with a little cosmetic modification, for the 90-10 split.


egrep -e "LEN:" -e"OP:" test_ora_20690.trc

REDO RECORD - Thread:1 RBA: 0x00314b.00000002.0010 LEN: 0x0074 VLD: 0x05
(LWN RBA: 0x00314b.00000002.0010 LEN: 0038 NST: 0001 SCN: 0x0b86.0cfca74e)
CHANGE #1 TYP:2 CLS:1 AFN:5 DBA:0x0140008f OBJ:349950 SCN:0x0b86.0cfca74a SEQ:2 OP:4.1 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314b.00000002.0084 LEN: 0x0144 VLD: 0x01
CHANGE #1 TYP:0 CLS:73 AFN:3 DBA:0x00c00160 OBJ:4294967295 SCN:0x0b86.0cfca746 SEQ:2 OP:5.2 ENC:0 RBL:0
CHANGE #2 TYP:1 CLS:74 AFN:3 DBA:0x00c0465f OBJ:4294967295 SCN:0x0b86.0cfca74e SEQ:1 OP:5.1 ENC:0 RBL:0
CHANGE #3 TYP:0 CLS:1 AFN:5 DBA:0x0140008f OBJ:349950 SCN:0x0b86.0cfca74e SEQ:1 OP:10.6 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314b.00000002.01c8 LEN: 0x20a4 VLD: 0x01                                        ***
CHANGE #1 TYP:0 CLS:73 AFN:3 DBA:0x00c00160 OBJ:4294967295 SCN:0x0b86.0cfca74e SEQ:1 OP:5.2 ENC:0 RBL:0
CHANGE #2 TYP:1 CLS:74 AFN:3 DBA:0x00c04660 OBJ:4294967295 SCN:0x0b86.0cfca74e SEQ:1 OP:5.1 ENC:0 RBL:0
CHANGE #3 TYP:0 CLS:1 AFN:5 DBA:0x0140008f OBJ:349950 SCN:0x0b86.0cfca74e SEQ:2 OP:10.9 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314b.00000013.017c LEN: 0x0044 VLD: 0x01
CHANGE #1 TYP:0 CLS:8 AFN:5 DBA:0x01400088 OBJ:349950 SCN:0x0b86.0cfca638 SEQ:3 OP:13.22 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314b.00000013.01c0 LEN: 0x01ac VLD: 0x01
CHANGE #1 TYP:0 CLS:73 AFN:3 DBA:0x00c00160 OBJ:4294967295 SCN:0x0b86.0cfca74e SEQ:2 OP:5.2 ENC:0 RBL:0
CHANGE #2 TYP:1 CLS:74 AFN:3 DBA:0x00c04661 OBJ:4294967295 SCN:0x0b86.0cfca74e SEQ:1 OP:5.1 ENC:0 RBL:0
CHANGE #3 TYP:0 CLS:1 AFN:5 DBA:0x0140008c OBJ:349950 SCN:0x0b86.0cfca638 SEQ:2 OP:10.8 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314b.00000014.017c LEN: 0x0048 VLD: 0x01
CHANGE #1 TYP:2 CLS:1 AFN:5 DBA:0x0140008b OBJ:349950 SCN:0x0b86.0cfca639 SEQ:2 OP:4.1 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314b.00000014.01c4 LEN: 0x00e0 VLD: 0x01
CHANGE #1 TYP:0 CLS:74 AFN:3 DBA:0x00c04661 OBJ:4294967295 SCN:0x0b86.0cfca74e SEQ:2 OP:5.1 ENC:0 RBL:0
CHANGE #2 TYP:0 CLS:1 AFN:5 DBA:0x0140008b OBJ:349950 SCN:0x0b86.0cfca74e SEQ:1 OP:10.15 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314b.00000015.00b4 LEN: 0x1fb0 VLD: 0x01                                        ***
CHANGE #1 TYP:0 CLS:73 AFN:3 DBA:0x00c00160 OBJ:4294967295 SCN:0x0b86.0cfca74e SEQ:3 OP:5.4 ENC:0 RBL:0
CHANGE #2 TYP:0 CLS:1 AFN:5 DBA:0x0140008f OBJ:349950 SCN:0x0b86.0cfca74e SEQ:3 OP:10.8 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314b.00000025.0164 LEN: 0x0320 VLD: 0x09
CHANGE #1 TYP:2 CLS:1 AFN:5 DBA:0x01400084 OBJ:349949 SCN:0x0b86.0cfca74a SEQ:2 OP:11.2 ENC:0 RBL:0
CHANGE #2 TYP:0 CLS:83 AFN:3 DBA:0x00c004a8 OBJ:4294967295 SCN:0x0b86.0cfca738 SEQ:2 OP:5.2 ENC:0 RBL:0
CHANGE #3 TYP:0 CLS:1 AFN:5 DBA:0x0140008c OBJ:349950 SCN:0x0b86.0cfca74e SEQ:1 OP:10.5 ENC:0 RBL:0
CHANGE #4 TYP:0 CLS:83 AFN:3 DBA:0x00c004a8 OBJ:4294967295 SCN:0x0b86.0cfca750 SEQ:1 OP:5.4 ENC:0 RBL:0
CHANGE #5 TYP:0 CLS:84 AFN:3 DBA:0x00c04b0f OBJ:4294967295 SCN:0x0b86.0cfca738 SEQ:2 OP:5.1 ENC:0 RBL:0
CHANGE #6 TYP:0 CLS:84 AFN:3 DBA:0x00c04b0f OBJ:4294967295 SCN:0x0b86.0cfca750 SEQ:1 OP:5.1 ENC:0 RBL:0

I’ve highlighted two redo records with ‘***’ at the end of line. One of these records has length 0x20a4, the other has length 0x1fb0 i.e. roughly a whole data block each. We’ll look at those in more detail in a moment. Here, for comparison, is the result from the 50-50 split – again with a few highlighted lines:

REDO RECORD - Thread:1 RBA: 0x00314f.00000002.0010 LEN: 0x0074 VLD: 0x05
(LWN RBA: 0x00314f.00000002.0010 LEN: 0038 NST: 0001 SCN: 0x0b86.0cfcbc25)
CHANGE #1 TYP:2 CLS:1 AFN:5 DBA:0x0140008f OBJ:349962 SCN:0x0b86.0cfcbc21 SEQ:2 OP:4.1 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314f.00000002.0084 LEN: 0x0144 VLD: 0x01
CHANGE #1 TYP:0 CLS:69 AFN:3 DBA:0x00c000e8 OBJ:4294967295 SCN:0x0b86.0cfcbc15 SEQ:2 OP:5.2 ENC:0 RBL:0
CHANGE #2 TYP:0 CLS:70 AFN:3 DBA:0x00c10c43 OBJ:4294967295 SCN:0x0b86.0cfcbc15 SEQ:2 OP:5.1 ENC:0 RBL:0
CHANGE #3 TYP:0 CLS:1 AFN:5 DBA:0x0140008f OBJ:349962 SCN:0x0b86.0cfcbc25 SEQ:1 OP:10.6 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314f.00000002.01c8 LEN: 0x20a4 VLD: 0x01                                        ***
CHANGE #1 TYP:0 CLS:69 AFN:3 DBA:0x00c000e8 OBJ:4294967295 SCN:0x0b86.0cfcbc25 SEQ:1 OP:5.2 ENC:0 RBL:0
CHANGE #2 TYP:1 CLS:70 AFN:3 DBA:0x00c10c44 OBJ:4294967295 SCN:0x0b86.0cfcbc25 SEQ:1 OP:5.1 ENC:0 RBL:0
CHANGE #3 TYP:0 CLS:1 AFN:5 DBA:0x0140008f OBJ:349962 SCN:0x0b86.0cfcbc25 SEQ:2 OP:10.9 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314f.00000013.017c LEN: 0x0044 VLD: 0x01
CHANGE #1 TYP:0 CLS:8 AFN:5 DBA:0x01400088 OBJ:349962 SCN:0x0b86.0cfcbb24 SEQ:3 OP:13.22 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314f.00000013.01c0 LEN: 0x1010 VLD: 0x01$                                       ***
CHANGE #1 TYP:0 CLS:69 AFN:3 DBA:0x00c000e8 OBJ:4294967295 SCN:0x0b86.0cfcbc25 SEQ:2 OP:5.2 ENC:0 RBL:0
CHANGE #2 TYP:1 CLS:70 AFN:3 DBA:0x00c10c45 OBJ:4294967295 SCN:0x0b86.0cfcbc25 SEQ:1 OP:5.1 ENC:0 RBL:0
CHANGE #3 TYP:0 CLS:1 AFN:5 DBA:0x0140008c OBJ:349962 SCN:0x0b86.0cfcbb24 SEQ:2 OP:10.8 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314f.0000001c.0060 LEN: 0x0048 VLD: 0x01
CHANGE #1 TYP:2 CLS:1 AFN:5 DBA:0x0140008b OBJ:349962 SCN:0x0b86.0cfcbb25 SEQ:2 OP:4.1 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314f.0000001c.00a8 LEN: 0x00e0 VLD: 0x01
CHANGE #1 TYP:0 CLS:70 AFN:3 DBA:0x00c10c45 OBJ:4294967295 SCN:0x0b86.0cfcbc25 SEQ:2 OP:5.1 ENC:0 RBL:0
CHANGE #2 TYP:0 CLS:1 AFN:5 DBA:0x0140008b OBJ:349962 SCN:0x0b86.0cfcbc25 SEQ:1 OP:10.15 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314f.0000001c.0188 LEN: 0x1150 VLD: 0x01                                        ***
CHANGE #1 TYP:0 CLS:69 AFN:3 DBA:0x00c000e8 OBJ:4294967295 SCN:0x0b86.0cfcbc25 SEQ:3 OP:5.4 ENC:0 RBL:0
CHANGE #2 TYP:0 CLS:1 AFN:5 DBA:0x0140008f OBJ:349962 SCN:0x0b86.0cfcbc25 SEQ:3 OP:10.8 ENC:0 RBL:0

REDO RECORD - Thread:1 RBA: 0x00314f.00000025.0168 LEN: 0x0330 VLD: 0x09
CHANGE #1 TYP:2 CLS:1 AFN:5 DBA:0x01400084 OBJ:349961 SCN:0x0b86.0cfcbc21 SEQ:2 OP:11.2 ENC:0 RBL:0
CHANGE #2 TYP:0 CLS:73 AFN:3 DBA:0x00c00160 OBJ:4294967295 SCN:0x0b86.0cfcbc1a SEQ:2 OP:5.2 ENC:0 RBL:0
CHANGE #3 TYP:0 CLS:1 AFN:5 DBA:0x0140008c OBJ:349962 SCN:0x0b86.0cfcbc25 SEQ:1 OP:10.5 ENC:0 RBL:0
CHANGE #4 TYP:0 CLS:73 AFN:3 DBA:0x00c00160 OBJ:4294967295 SCN:0x0b86.0cfcbc27 SEQ:1 OP:5.4 ENC:0 RBL:0
CHANGE #5 TYP:0 CLS:74 AFN:3 DBA:0x00c04c64 OBJ:4294967295 SCN:0x0b86.0cfcbc1a SEQ:2 OP:5.1 ENC:0 RBL:0
CHANGE #6 TYP:0 CLS:74 AFN:3 DBA:0x00c04c64 OBJ:4294967295 SCN:0x0b86.0cfcbc27 SEQ:1 OP:5.1 ENC:0 RBL:0

There are three interesting records in the 50-50 split with lengths 0x20a4 (the same as the 90-10 split), 0x1010, 0x1150. So we seem to start the same way with a “full block” record, and follow up with two “half block” records. The numbers allow you to make a reasonable guess – Oracle copies the original leaf block into the undo, then writes the two new leaf blocks as “pure” redo; in one case the two new leaf block redo records constitute a whole block and a tiny fraction of a block; in the other case the two new leaf block redo records constitute two half blocks.

I won’t show you the full detail that I checked in the log file dump, but the big 0x20a4 record in the 90-10 split is mostly made up of an “OP:5.1” change vector labelled “restore block before image (8032)”, while the 5th and 8th records in both dumps hold “OP:10.8” change vectors labelled “(kdxlne) … new block has NNN rows”. In the case of the 90-10 split the values for NNN are 1 and 145, in the case of the 50-50 split the values for NNN are 68 and 78 – in that (higher values leaf block first) order.

The 90-10 split and the 50-50 split act in exactly the same way – save the old block, allocate a new block, populate two blocks. It really looks as if code re-use has missed an easy opportunity for some optimisation – why save and rewrite a block when the data content is not going to change ?

Before assuming there’s a bug (or defect, or sub-optimal implementation) though it’s wise to consider whether there might be something else going on – Oracle developers (especially at the deeper levels) tend to have good reasons for what they do so maybe the rewrite is deliberate and serves a useful purpose.

If you do anything with my current test you won’t spot the extra little feature because my test is running a very special edge case – but I had a thought that would justify the cost (and timing) of the rewrite, and I’ll be publishing the idea, the test, and the results tomorrow.

Footnote

It is possible that a leaf node split means Oracle has to insert a pointer into a level 1 branch node that is already full – in which case Oracle will have to allocate a new branch node, share the branch data (including the new leaf pointer) between the two nodes, and insert a new branch pointer into the relevant level 2 branch block … and that may split etc. all the way up to the root. When the root node splits Oracle allocates two new blocks, increasing the branch level by one and keeping the original root block in place (immediately after all the space management blocks) but now pointing to just 2 “branch N-1” level blocks. Oracle will update the statistics “branch node splits” and “root node splits”.

In certain situations (typically relating to very large deletes followed by highly concurrent small inserts) Oracle may run into problems identifying a suitable “free” block while trying to do a split, and this can result in very slow splits that generate a lot of undo and redo which pinning index leaf blocks exclusively (leading to a couple of the more rare “enq – TX:” waits. In this case you may see statistics “failed probes on index block reclamation” and “recursive aborts on index block reclamation” starting to climb.  In theory I think you shouldn’t see more than a handful of “failed probes” per “recursive abort” – but I’ve never been able to model the problem to check that.

 

 

To prevent automated spam submissions leave this field empty.