Index Splits – 2

Jonathan Lewis's picture

In yesterday’s article I described the mechanism that Oracle for an index leaf block split when you try to insert a new entry into a leaf block that is already full, and I demonstrated that the “50-50” split and the “90-10” split work in the same way, namely:

  • save the old block into the undo
  • prepare a new leaf block
  • “share” the data between the old and new leaf blocks
  • sort out pointers

The obvious question to ask about this process is: “Why does Oracle save and rewrite the whole of the old leaf block during a 90-10 split when the data in the block doesn’t appear to change ?” The “sharing” in the 90-10 split is most uneven, and it appears that Oracle simply attaches a new leaf block to the index structure and writes the new index entry into it, leaving the existing index entries unchanged in the current leaf block.

The answer to that question can be found by doing block dumps – except you won’t see the answer if you use my original test data. So here’s a follow-on script to the previous test (written 11 years after the previous script):


rem
rem     Script:         index_splits3a.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2018
rem     Purpose:
rem

drop table t2 purge;

create table t2 as select * from t1 where id <= 148;
alter table t2 add constraint t2_pk primary key(id, idx_pad) using index pctfree 0;

column object_id new_value m_index_id
select object_id from user_objects where object_name = 'T2_PK' and object_type = 'INDEX';

begin
        for r in (select * from t1 where id between 149 and 292 order by dbms_random.value) loop
                insert into t2 values(r.id, r.idx_pad, r.padding);
        end loop;
        commit;
end;
/

alter session set events 'immediate trace name treedump level &m_index_id';
alter system flush buffer_cache;

prompt  check the trace file for the last block of the index
prompt  then do a block dump for it.
prompt  then press return

pause

insert into t2 values(293, rpad('x',40,'x'), rpad('y',50,'y'));
commit;

alter session set events 'immediate trace name treedump level &m_index_id';
alter system flush buffer_cache;


This test depends on the number of rows used for the previous test – and I have four hard-coded values (148, 149, 292, 293) in it that matter. If you’ve had to use a different number of rows in your version of the first test you will need to adjust these values to match.

I’ve created a clone of the t1 table copying only the first 148 rows – this is just enough rows that when I create a unique (PK) index on the table the index will have two leaf blocks, the first holding 147 entries and the second holding one entry. I’ve then inserted the next 144 rows from t1 into t2 in random order, so that I end up with two full leaf blocks.

Once the data is ready the code issues a treedump command (so that we can check the index is as I’ve described it) and flushes the buffer_cache, then prompts you with some instructions and waits for you to press return. At this point you need some manual intervention from another session – you can examine the treedump to work out the file and block addresses of the two leaf blocks and dump the second leaf block (‘alter database dump datafile N block M;’).

After you’ve done the block dump press return and my code resumes and inserts a new row that will cause a 90-10 split to happen, then it does another treedump (to let you check the block addresses and see that the split was 90-10), and flushes the buffer cache again. This is where you can check the block address of the second leaf block (in case it has managed to change – which it shouldn’t) and dump the block again.

Here, with a huge chunk removed from the middle, are the results I got from searching for the expression “row#” in the two block dumps that I generated in my test.


Before 90/10 block split:
-------------------------
row#0[7979] flag: -------, lock: 0, len=53, data:(6):  01 40 01 84 00 03
row#1[1885] flag: -------, lock: 2, len=53, data:(6):  01 40 01 fd 00 2b
row#2[1938] flag: -------, lock: 2, len=53, data:(6):  01 40 01 fd 00 2a
row#3[5595] flag: -------, lock: 2, len=53, data:(6):  01 40 01 f9 00 2c
row#4[3581] flag: -------, lock: 2, len=53, data:(6):  01 40 01 fd 00 0b
...
row#142[1408] flag: -------, lock: 2, len=53, data:(6):  01 40 01 fd 00 34
row#143[2150] flag: -------, lock: 2, len=53, data:(6):  01 40 01 fd 00 26
row#144[878] flag: -------, lock: 2, len=53, data:(6):  01 40 01 fd 00 3e


After 90/10 block split
-----------------------
row#0[348] flag: -------, lock: 0, len=53, data:(6):  01 40 01 84 00 03
row#1[401] flag: -------, lock: 0, len=53, data:(6):  01 40 01 fd 00 2b
row#2[454] flag: -------, lock: 0, len=53, data:(6):  01 40 01 fd 00 2a
row#3[507] flag: -------, lock: 0, len=53, data:(6):  01 40 01 f9 00 2c
row#4[560] flag: -------, lock: 0, len=53, data:(6):  01 40 01 fd 00 0b
...
row#142[7873] flag: -------, lock: 0, len=53, data:(6):  01 40 01 fd 00 34
row#143[7926] flag: -------, lock: 0, len=53, data:(6):  01 40 01 fd 00 26
row#144[7979] flag: -------, lock: 0, len=53, data:(6):  01 40 01 fd 00 3e

The “row#” is in ascending order – these lines in an index leaf block dump show Oracle walking through the block’s “row directory”; the number in square brackets following the row number is the offset into the block where the corresponding index entry will be found. When Oracle inserts an index entry into a leaf block it adjusts the row directory to make a gap in the right place so that walking the directory in row# order allows Oracle to jump around the block and find the index entries in key order.

When Oracle rewrites the block it first sorts the index entries into key order so that the actual index entries are written into the block in key order and a range scan that moves a pointer smoothly through the row directory will be moving another pointer smoothly down the block rather than making the pointer jump all over the place. Presumably this has (or maybe had) a benefit as far as the CPU cache and cache lines are concerned.

So there is a method in the madness of “copy the whole block even when the content doesn’t change”. The content doesn’t change but the order does, and paying the cost of sorting once may return a benefit in efficiency many times in the future.

 

To prevent automated spam submissions leave this field empty.