Partitioning

Jonathan Lewis

Indexing

A question about partitioning came up on OTN a few days ago – but the really interesting part of the thread was about the choice of indexes, and how the choice of partitioning, combined with the data patterns, could make a big difference to the choice of indexes. I think the comments I made about indexes are worth seeing, so I’ve linked to the thread.


Jonathan Lewis

Partitioned Bitmaps

The following question appeared in a comment to an earlier posting on multi-column bitmap indexes and the inability of Oracle to create a bitmap index join when (to the human eye) the strategy was an obvious choice.

    I have a query which uses two bitmap indexes (37MB and 24MB) on a table of about 17GB. When I run the following query, which should be able to get everything it needs from the indexes, it takes around 6-8 minutes and uses around 3GB of PGA.

    Could you please explain why?

SQL_ID  5z0a50a2yzdky, child number 0
-------------------------------------
select count(1) from (select distinct SRC_SYS,PROD_KEY from dds.REV_F)

Plan hash value: 867747470

--------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                         | Name                 | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
--------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                  |                      |       |       |   221K(100)|          |       |       |
|   1 |  SORT AGGREGATE                   |                      |     1 |       |            |          |       |       |
|   2 |   VIEW                            |                      | 24533 |       |   221K  (6)| 00:44:22 |       |       |
|   3 |    HASH UNIQUE                    |                      | 24533 |   479K|   221K  (6)| 00:44:22 |       |       |
|   4 |     VIEW                          | index$_join$_002     |    63M|  1209M|   212K  (2)| 00:42:28 |       |       |
|*  5 |      HASH JOIN                    |                      |       |       |            |          |       |       |
|   6 |       PARTITION LIST ALL          |                      |    63M|  1209M|  3591   (1)| 00:00:44 |     1 |   145 |
|   7 |        BITMAP CONVERSION TO ROWIDS|                      |    63M|  1209M|  3591   (1)| 00:00:44 |       |       |
|   8 |         BITMAP INDEX FULL SCAN    | REV_F_IDX1           |       |       |            |          |     1 |   145 |
|   9 |       PARTITION LIST ALL          |                      |    63M|  1209M| 13724   (1)| 00:02:45 |     1 |   145 |
|  10 |        BITMAP CONVERSION TO ROWIDS|                      |    63M|  1209M| 13724   (1)| 00:02:45 |       |       |
|  11 |         BITMAP INDEX FULL SCAN    | REV_F_IDX5           |       |       |            |          |     1 |   145 |
--------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - access(ROWID=ROWID)

28 rows selected.

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.01       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        2    610.89    1464.86     707459      17090          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4    610.90    1464.87     707459      17090          0           1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: SYS

Rows     Row Source Operation
-------  ---------------------------------------------------
      1  SORT AGGREGATE (cr=17090 pr=707459 pw=446115 time=1464867976 us)
  26066   VIEW  (cr=17090 pr=707459 pw=446115 time=1464795748 us)
  26066    HASH UNIQUE (cr=17090 pr=707459 pw=446115 time=1464769678 us)
63422824     VIEW  index$_join$_002 (cr=17090 pr=707459 pw=446115 time=1084846545 us)
63422824      HASH JOIN  (cr=17090 pr=707459 pw=446115 time=958000889 us)
63422824       PARTITION LIST ALL PARTITION: 1 145 (cr=3561 pr=0 pw=0 time=63423134 us)
63422824        BITMAP CONVERSION TO ROWIDS (cr=3561 pr=0 pw=0 time=9554 us)
   7112         BITMAP INDEX FULL SCAN REV_F_IDX1 PARTITION: 1 145 (cr=3561 pr=0 pw=0 time=155525 us)(object id 366074)
63422824       PARTITION LIST ALL PARTITION: 1 145 (cr=13529 pr=8864 pw=0 time=190268771 us)
63422824        BITMAP CONVERSION TO ROWIDS (cr=13529 pr=8864 pw=0 time=63553723 us)
 432700         BITMAP INDEX FULL SCAN REV_F_IDX5 PARTITION: 1 145 (cr=13529 pr=8864 pw=0 time=3157351 us)(object id 366658)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                       2        0.00          0.00
  direct path write temp                      29741        1.62        107.05
  db file sequential read                      8864        0.20          2.35
  direct path read temp                       46573        0.79        211.02
  SQL*Net message from client                     2       29.22         29.22

In this case Oracle is clearly doing the bitmap join, but it’s managing to do something very inefficient. The problem lies in the partitioning or, to be more precise, Oracle’s failure to take advantage of partitioning. The OP complains of using 3GB of memory and several minutes of elapsed time. The plan shows us that we have 145 partitions (PARTITION LIST ALL PARTITION: 1 145), and we have been told that the table is about 17GB in size, so the “average” partition is about 120MB – so why isn’t Oracle using a partition-wise approach and processing the data one partition at a time ? The answer is simple – it can’t be done (at present).

An index join works by doing a hash join between rowids – and since we are using bitmap indexes we have to convert bitmaps to rowids as part of the plan. In this query we then want to count the number of distinct combinations of (SRC_SYS,PROD_KEY) – and the same combination may appear in different partitions, so Oracle has had to generate a plan that handles the entire data set in a single join rather than trying to handle each partition separately (notice how the “partition list all” operator appears twice, once for each index).

The “Row Source Operation” section tells us we only had to scan a few thousand blocks, but we have to build a hash table of 64 million entries:

63422824        BITMAP CONVERSION TO ROWIDS (cr=3561 pr=0 pw=0 time=9554 us)

At 10 bytes per rowid (for a partitioned table), plus the length of the input column, plus linked list overheads (8 bytes per pointer) you can see the 3 GB beginning to appear out of the volume of work being done. And then Oracle dumped the whole thing to disc (perhaps in anticipation of the impending “hash unique”, perhaps because it still needed to do a one-pass operation – it would have been interesting to see the full row source execution statistics from a call to dbms_xplan.display_cursor()).
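For anyone who wants to capture those statistics on their own system, a call along the following lines (using the sql_id reported above) would do it – provided statistics_level was set to all, or the gather_plan_statistics hint was used, when the query ran:

select * from table(dbms_xplan.display_cursor('5z0a50a2yzdky', null, 'ALLSTATS LAST'));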

What we need to see is a plan more like the following:

---------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name                 | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |                      |       |       |   221K(100)|          |       |       |
|   1 |  SORT AGGREGATE                    |                      |     1 |       |            |          |       |       |
|   2 |   VIEW                             |                      | 24533 |       |   221K  (6)| 00:44:22 |       |       |
|   3 |    HASH UNIQUE                     |                      | 24533 |   479K|   221K  (6)| 00:44:22 |       |       |
|   4 |     PARTITION LIST ALL             |                      |    63M|  1209M|  3591   (1)| 00:00:44 |     1 |   145 |
|*  5 |      HASH UNIQUE                   |                      |       |       |            |          |       |       |
|   6 |       VIEW                         | index$_join$_002     |    63M|  1209M|   212K  (2)| 00:42:28 |       |       |
|*  7 |        HASH JOIN                   |                      |       |       |            |          |       |       |
|   8 |         BITMAP CONVERSION TO ROWIDS|                      |    63M|  1209M|  3591   (1)| 00:00:44 |       |       |
|   9 |          BITMAP INDEX FULL SCAN    | REV_F_IDX1           |       |       |            |          |     1 |   145 |
|  10 |         BITMAP CONVERSION TO ROWIDS|                      |    63M|  1209M| 13724   (1)| 00:02:45 |       |       |
|  11 |          BITMAP INDEX FULL SCAN    | REV_F_IDX5           |       |       |            |          |     1 |   145 |
---------------------------------------------------------------------------------------------------------------------------

Line 4 shows us that we do something for each partition in turn. The sub-plan to line 4 tells us that we are collecting the unique combinations of (SRC_SYS,PROD_KEY) for a given partition. Line 3 tells us that we are collating the results from the different partitions and generating the set of values that is unique across all partitions.

The problem is this: can we engineer a strange piece of SQL that makes this plan appear – because the optimizer isn’t going to do it automatically (yet).

Obviously it would be pretty easy to write some sort of solution using pl/sql and pipelined functions – perhaps a function that takes a table_name and loops through each partition of the table in turn, returning the distinct combinations for that partition, as this would allow you to “select count(distinct(…)) from table_function(…);” to get your result.
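As a sketch of that idea (untested against the OP’s system – the type definitions, and the assumption that SRC_SYS is a varchar2 and PROD_KEY a number, are mine) it might look like this:

create or replace type src_prod_t as object (src_sys varchar2(30), prod_key number);
/
create or replace type src_prod_tab as table of src_prod_t;
/

create or replace function distinct_by_partition(
    p_owner in varchar2,
    p_table in varchar2
) return src_prod_tab pipelined
as
    c      sys_refcursor;
    v_src  varchar2(30);
    v_prod number;
begin
    -- visit each partition in turn and pipe out its distinct combinations
    for p in (
        select  partition_name
        from    all_tab_partitions
        where   table_owner = upper(p_owner)
        and     table_name  = upper(p_table)
        order by partition_position
    ) loop
        open c for
            'select distinct src_sys, prod_key from ' ||
            p_owner || '.' || p_table ||
            ' partition (' || p.partition_name || ')';
        loop
            fetch c into v_src, v_prod;
            exit when c%notfound;
            pipe row (src_prod_t(v_src, v_prod));
        end loop;
        close c;
    end loop;
    return;
end;
/

The final count would then come from something like: select count(*) from (select distinct src_sys, prod_key from table(distinct_by_partition('DDS','REV_F')));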

You might be able to avoid pl/sql by creating a piece of SQL joining the table’s metadata to the table by partition identifier – except you would probably need to use a lateral view, which Oracle doesn’t support, and make the partition extended syntax part of the lateral dependency .. which Oracle definitely doesn’t support.

So is there an alternative, purely SQL, strategy ?

I’m thinking about it – I have a cunning plan, but I haven’t had time to test it yet.

to be continued …

Update (7th July 2011):

I’ve finally managed to make a bit of time to write up my notes about this – and in the interim Randolf Geist has said everything that needs to be said, and the OP has pointed out that a full tablescan works faster anyway. However, here goes …

My cunning plans involved finding ways of forcing Oracle into doing a single partition hash join for each partition in turn by playing clever tricks with dbms_mview.pmarker or the dba_tab_partitions view, or even the tbl$or$idx$part$num() function. But I realised I was in trouble as soon as I wrote the first little test that was supposed to do an index hash join on a single explicitly referenced partition. Here’s my table description, index definitions, query and execution plan:

SQL> desc t1
 Name                    Null?    Type
 ----------------------- -------- ----------------
 NVCUSTATUS                       VARCHAR2(20)        -- list partitioned on this column, multiple values in some partitions
 FREQ_COLUMN                      NUMBER(4)
 HBAL_COLUMN                      NUMBER(8)
 RAND_COLUMN                      NUMBER(8)

create bitmap index t1_freq on t1(freq_column) local;
create bitmap index t1_hbal on t1(hbal_column) local;

select
	count(1)
from	(
	select
		/*+index_join(t1) */
		distinct freq_column, hbal_column
	from t1 partition (p6)
	)
;

-------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name             | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                  |   894K|  6112K|  2966   (2)| 00:00:36 |       |       |
|   1 |  VIEW                          | index$_join$_001 |   894K|  6112K|  2966   (2)| 00:00:36 |       |       |
|*  2 |   HASH JOIN                    |                  |       |       |            |          |       |       |
|   3 |    PARTITION LIST SINGLE       |                  |   894K|  6112K|   208   (0)| 00:00:03 |   KEY |   KEY |
|   4 |     BITMAP CONVERSION TO ROWIDS|                  |   894K|  6112K|   208   (0)| 00:00:03 |       |       |
|   5 |      BITMAP INDEX FULL SCAN    | T1_FREQ          |       |       |            |          |     6 |     6 |
|   6 |    PARTITION LIST SINGLE       |                  |   894K|  6112K|   299   (0)| 00:00:04 |   KEY |   KEY |
|   7 |     BITMAP CONVERSION TO ROWIDS|                  |   894K|  6112K|   299   (0)| 00:00:04 |       |       |
|   8 |      BITMAP INDEX FULL SCAN    | T1_HBAL          |       |       |            |          |     6 |     6 |
-------------------------------------------------------------------------------------------------------------------

Even though we address a single partition explicitly, and even though the optimizer recognises it as partition 6 in lines 5 and 8, the plan isn’t doing a “partition-wise” join between the two indexes – the hash join calls two independent partition operations. In effect, Oracle is joining two different tables, and there isn’t a join condition on the partitioning column.

If we change the index definitions, though, to append the partitioning column (given the small number of distinct values in each partition this didn’t affect the index size by much) and then rewrite the query to include a join on that column, we get a better result – but only if we do a manual rewrite of the query (and only if the partitioning column is declared not null).


alter table t1 modify nvcustatus not null;

drop index t1_freq;
drop index t1_hbal;

create bitmap index t1_freq on t1(freq_column, nvcustatus) local;
create bitmap index t1_hbal on t1(hbal_column, nvcustatus) local;

select
	count(*)
from	(
	select
		/*+
			qb_name(main)
			no_eliminate_join(@SEL$96BB32CC t1@hbal)
			no_eliminate_join(@SEL$96BB32CC t1@freq)
		*/
		distinct ftb.freq_column, htb.hbal_column
	from
		(
		select
			/*+ qb_name(freq) index(t1 t1_freq) */
			nvcustatus, freq_column, rowid frid
		from
			t1
		)	ftb,
		(
		select
			/*+ qb_name(hbal) index(t1 t1_hbal) */
			nvcustatus, hbal_column, rowid hrid
		from
			t1
		)	htb
	where
		ftb.frid = htb.hrid
	and	ftb.nvcustatus= htb.nvcustatus
	)
;

-------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp|
-------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |         |      1 |        |      1 |00:00:05.44 |     719 |   4380 |   3660 |       |       |          |         |
|   1 |  SORT AGGREGATE                  |         |      1 |      1 |      1 |00:00:05.44 |     719 |   4380 |   3660 |       |       |          |         |
|   2 |   VIEW                           |         |      1 |  35118 |  23223 |00:00:05.50 |     719 |   4380 |   3660 |       |       |          |         |
|   3 |    HASH UNIQUE                   |         |      1 |  35118 |  23223 |00:00:05.46 |     719 |   4380 |   3660 |  1268K|  1268K| 2647K (0)|         |
|   4 |     PARTITION LIST ALL           |         |      1 |    330K|   1000K|00:00:18.44 |     719 |   4380 |   3660 |       |       |          |         |
|*  5 |      HASH JOIN                   |         |      8 |    330K|   1000K|00:00:15.34 |     719 |   4380 |   3660 |   283M|    16M|   27M (1)|   31744 |
|   6 |       BITMAP CONVERSION TO ROWIDS|         |      8 |   1000K|   1000K|00:00:00.99 |     296 |    289 |      0 |       |       |          |         |
|   7 |        BITMAP INDEX FULL SCAN    | T1_FREQ |      8 |        |   1831 |00:00:00.30 |     296 |    289 |      0 |       |       |          |         |
|   8 |       BITMAP CONVERSION TO ROWIDS|         |      7 |   1000K|   1000K|00:00:06.48 |     423 |    416 |      0 |       |       |          |         |
|   9 |        BITMAP INDEX FULL SCAN    | T1_HBAL |      7 |        |   7353 |00:00:00.41 |     423 |    416 |      0 |       |       |          |         |
-------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - access("NVCUSTATUS"="NVCUSTATUS" AND ROWID=ROWID)

As you can see from the plan, I ran one of my tests with rowsource execution statistics enabled, so that I could count starts.

My query basically breaks the implicit join of two indexes into an explicit join so that I can include the join on the partitioning column. You’ll notice that I now iterate through each partition – line 4 – and do a hash join for each pair of partitions. This is the target I was hoping for. The theory was that by doing (up to) eight small hash joins, each join would operate in memory, rather than flooding one large build table to disc. Of course, the thing I’m building the hash table with is now three values (rowid, freq_value, nvcustatus) rather than the two values from the implicit join (rowid, freq_value), so if I’m unlucky this version of the query could take even longer than the original because it may have to do more I/O.

If you’re wondering why lines 8 and 9 started only seven times (one fewer than lines 6 and 7), this was because one of my partitions was empty, so Oracle had to scan it once as the build table to find that it was empty, then didn’t need to scan it as the probe table.

Footnote: Although I got the plan I wanted – it really didn’t make much difference, and even with the million rows I was using it was quick to solve the problem with a tablescan. The relative benefit of the different plans would be dictated by the length of the rows, the lengths of the relevant columns, the number of partitions, and the available memory for hashing.

Jonathan Lewis

Prefixed

From time to time the question about whether local indexes on partitioned tables should be prefixed or non-prefixed appears on the Oracle forums and mailing lists.

It’s possible that I’m wrong – although no-one has come up with a counter-example to the statement I keep repeating – but the whole prefixed/non-prefixed thing for local indexes dates back to a limitation in the optimizer somewhere in the 8.0 timeline, where Oracle couldn’t do partition elimination on indexes properly but the overhead of the error it made could be dramatically reduced (in most cases) by sticking the partition key at the start of the index.

The guidelines for local indexes are the same as they would be for a non-partitioned index on a non-partitioned heap table – the partitioning column(s) are just columns in the table that might be worth including in the index, and the order of the index columns is dictated by the important queries that have to access the table.

For further comments, there’s a note I wrote (which I’ve just been reminded of) on the OTN database forum that adds a little detail to this argument.

Jonathan Lewis

Upgrade issues

Here’s an example of how a bug-fix can create problems. It’s a code change in 11.2.x.x and (I believe) 10.2.0.5 relating to the costing of queries involving (but perhaps not restricted to) composite partitioned tables. I first saw this change in an email from Doug Burns, who sent me the 10053 traces from a very simple query that had started using the wrong index after an upgrade from 10.2.0.4 to 11.2.0.2.

As part of his testing he had set the optimizer_features_enable parameter back to 10.2.0.4 and found that not only did the choice of index change back to the expected index, but the costs of the two indexes changed dramatically. (The cost of using the “right” index changed from 15 to something in excess of 9,000 on the upgrade!)

The example I’m going to show you demonstrates the principles of the cost change – but with the data set and statistics I’ve generated you won’t see a change of execution plan. This is typical of some of the models I create – it’s enough to establish a principle, after which it’s simple enough to recognise the problems that the principle can cause.

So here’s a data set. I created it in a default 11.2.0.2 install on Windows 32-bit, so it’s running with 8KB blocks, autoallocate extents, and ASSM – but I’ve disabled CPU costing:

create table t1(
	list_col	number,
	range_col	number,
	v100		varchar2(10),
	v10		varchar2(10),
	vid		varchar2(10),
	padding		varchar2(100)
)
partition by range(range_col)
subpartition by list (list_col)
subpartition template (
	subpartition s0		values(0),
	subpartition s1		values(1),
	subpartition s2		values(2),
	subpartition s3		values(3),
	subpartition s4		values(4),
	subpartition s5		values(5),
	subpartition s6		values(6),
	subpartition s7		values(7),
	subpartition s8		values(8),
	subpartition s9		values(9)
)
(
	partition p_10 values less than (10),
	partition p_20 values less than (20),
	partition p_30 values less than (30),
	partition p_40 values less than (40),
	partition p_50 values less than (50),
	partition p_60 values less than (60),
	partition p_70 values less than (70),
	partition p_80 values less than (80),
	partition p_90 values less than (90),
	partition p_100 values less than (100)
)
;

insert into t1
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		rownum <= 10000
)
select
	mod(rownum,10),
	mod(rownum,100),
	lpad(mod(rownum,10),10),
	lpad(mod(rownum,100),10),
	lpad(rownum,10),
	rpad('x',100)
from
	generator	v1,
	generator	v2
where
	rownum <= 1000000
;

alter table t1 add constraint t1_pk primary key(v10, vid, v100, range_col, list_col) using index local;

create index t1_one_col on t1(v100) local;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		method_opt 	 => 'for all columns size 1'

	);
end;
/

You’ll notice that one of the indexes I’ve created has only one column, while the other has five columns (including the single column of the smaller index). Since I’ve created only 1,000,000 rows spread across 100 partitions every partition is small and the corresponding index partitions are also small, so I’m going to use dbms_stats.get_index_stats() and dbms_stats.set_index_stats() to make the indexes appear much larger. Specifically I will set the blevel on the single-column index to 2, and the blevel on the multi-column index to 3. (This difference in settings isn’t just whimsy, it helps to emulate Doug’s problem.) I’ve previously published the type of code to make this possible; in this case I only set the stats at the global (index) level because the queries I’m going to use will make Oracle use just those global stats.
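I won’t repeat the published script here, but a minimal sketch of the approach (read the current figures, then write them back with just the blevel changed – the index names are the ones created above) might look like this:

declare
	m_numrows	number;
	m_numlblks	number;
	m_numdist	number;
	m_avglblk	number;
	m_avgdblk	number;
	m_clstfct	number;
	m_indlevel	number;

	procedure set_blevel(i_index in varchar2, i_blevel in number) is
	begin
		-- read the existing (global) index statistics ...
		dbms_stats.get_index_stats(
			ownname => user,      indname  => i_index,
			numrows => m_numrows, numlblks => m_numlblks,
			numdist => m_numdist, avglblk  => m_avglblk,
			avgdblk => m_avgdblk, clstfct  => m_clstfct,
			indlevel => m_indlevel
		);
		-- ... and write them back unchanged apart from the blevel
		dbms_stats.set_index_stats(
			ownname => user,      indname  => i_index,
			numrows => m_numrows, numlblks => m_numlblks,
			numdist => m_numdist, avglblk  => m_avglblk,
			avgdblk => m_avgdblk, clstfct  => m_clstfct,
			indlevel => i_blevel
		);
	end;
begin
	set_blevel('T1_ONE_COL', 2);
	set_blevel('T1_PK',      3);
end;
/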

select
	/*+ index(t1(v100)) */
	count(*)
from
	t1
where
	v10 = '0000000005'
and	v100 = '0000000005'
;

select
	/*+ index(t1(v10, vid, v100)) */
	count(*)
from
	t1
where
	v10 = '0000000005'
and	v100 = '0000000005'
;

I’m going to show you two sets of execution plans. The first is where I’ve set optimizer_features_enable to ’10.2.0.4′, the second is where I’ve left it at its default.

10.2.0.4 execution plans:
===================
---------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name       | Rows  | Bytes | Cost  | Pstart| Pstop |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |            |     1 |    22 |  2350 |       |       |
|   1 |  SORT AGGREGATE                      |            |     1 |    22 |       |       |       |
|   2 |   PARTITION RANGE ALL                |            |  1000 | 22000 |  2350 |     1 |    10 |
|   3 |    PARTITION LIST ALL                |            |  1000 | 22000 |  2350 |     1 |    10 |
|*  4 |     TABLE ACCESS BY LOCAL INDEX ROWID| T1         |  1000 | 22000 |  2350 |     1 |   100 |
|*  5 |      INDEX RANGE SCAN                | T1_ONE_COL |   100K|       |   330 |     1 |   100 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter("V10"='0000000005')
   5 - access("V100"='0000000005')

------------------------------------------------------------------------------
| Id  | Operation            | Name  | Rows  | Bytes | Cost  | Pstart| Pstop |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |       |     1 |    22 |   100 |       |       |
|   1 |  SORT AGGREGATE      |       |     1 |    22 |       |       |       |
|   2 |   PARTITION RANGE ALL|       |  1000 | 22000 |   100 |     1 |    10 |
|   3 |    PARTITION LIST ALL|       |  1000 | 22000 |   100 |     1 |    10 |
|*  4 |     INDEX RANGE SCAN | T1_PK |  1000 | 22000 |   100 |     1 |   100 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("V10"='0000000005' AND "V100"='0000000005')
       filter("V100"='0000000005')

11.2.0.2 execution plans:
===================
Execution Plan
----------------------------------------------------------
Plan hash value: 3019183742

---------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name       | Rows  | Bytes | Cost  | Pstart| Pstop |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |            |     1 |    22 |  2530 |       |       |
|   1 |  SORT AGGREGATE                      |            |     1 |    22 |       |       |       |
|   2 |   PARTITION RANGE ALL                |            |  1000 | 22000 |  2530 |     1 |    10 |
|   3 |    PARTITION LIST ALL                |            |  1000 | 22000 |  2530 |     1 |    10 |
|*  4 |     TABLE ACCESS BY LOCAL INDEX ROWID| T1         |  1000 | 22000 |  2530 |     1 |   100 |
|*  5 |      INDEX RANGE SCAN                | T1_ONE_COL |   100K|       |   510 |     1 |   100 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter("V10"='0000000005')
   5 - access("V100"='0000000005')

------------------------------------------------------------------------------
| Id  | Operation            | Name  | Rows  | Bytes | Cost  | Pstart| Pstop |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |       |     1 |    22 |   370 |       |       |
|   1 |  SORT AGGREGATE      |       |     1 |    22 |       |       |       |
|   2 |   PARTITION RANGE ALL|       |  1000 | 22000 |   370 |     1 |    10 |
|   3 |    PARTITION LIST ALL|       |  1000 | 22000 |   370 |     1 |    10 |
|*  4 |     INDEX RANGE SCAN | T1_PK |  1000 | 22000 |   370 |     1 |   100 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("V10"='0000000005' AND "V100"='0000000005')
       filter("V100"='0000000005')

Regardless of the setting of optimizer_features_enable, if I hint the single-column index I have a plan that visits each partition and subpartition in turn for the index then table; and if I hint the primary key I do that same walk through each partition and subpartition and visit just the index.

The big difference is in the cost. By default, both execution plans cost more in 11.2.0.2 – but the odd thing is that the change in costs is different. The cost of using the single-column index has gone up by 180, the cost of using the primary key index has gone up by 270; and the change in costs can be attributed completely to the indexes.

Is there a pattern in the change ? Yes, it’s not immediately obvious, of course, and needs a few extra tests to confirm it, but the change in cost can be calculated as: (number of range partitions * (number of list partitions – 1) * blevel of index).

    For the single column index this is: 10 * 9 * 2 = 180
    For the primary key index this is: 10 * 9 * 3 = 270

Once you’ve seen this change in arithmetic you don’t need to see an example of a plan actually changing. In certain circumstances the cost of using composite partitions goes up when you upgrade to 11.2 – and the change in cost is proportional to the number of data segments (subpartitions) and the blevel of the indexes.

In Doug’s case he had a table with roughly 4,000 subpartitions in total – and he had a single column index with a blevel of 2, and a multi-column index with a blevel of 3: so the cost of using the (smaller, but less efficient) index went up by about 8,000 and the cost of using the (larger but efficient) index went up by about 12,000. The difference of 4,000 between these two increases was far higher than the original cost of using the bad index – so Oracle switched from the good index to the bad index.

The nasty thing about this problem is that it’s correcting an error – 11.2 is allowing for the cost of probing every single index subpartition, that’s what the extra multiples of the blevel represent; unfortunately the optimizer’s underlying paradigm of “every block visit is a disk I/O” makes this correction a threat.

Solutions and Workarounds

There is an /*+ optimizer_features_enable(‘x.x.x.x’) */ hint that could be used if the problem applies to just a couple of queries.
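For example, applied to the test query above (a sketch – the version you name would be whichever release last gave you the plan you wanted):

select
	/*+ optimizer_features_enable('10.2.0.4') */
	count(*)
from
	t1
where
	v10 = '0000000005'
and	v100 = '0000000005'
;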

If the problem applies to classes of queries involving a couple of big tables and indexes you could use dbms_stats.set_index_stats() to adjust the blevel of the critical indexes.

If the problem appears all over the place you could set the optimizer_features_enable parameter to ’10.2.0.4′ in the parameter file, or in a session logon trigger.
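A logon trigger version might look like the following sketch (the schema check is a placeholder – you would restrict it to whichever application accounts are affected):

create or replace trigger ofe_downgrade_on_logon
after logon on database
begin
	if user = 'APP_OWNER' then	-- hypothetical application account
		execute immediate
			'alter session set optimizer_features_enable = ''10.2.0.4''';
	end if;
end;
/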

If the problem appears all over the place, but there are other features of the 11.2.0.2 optimizer that are very helpful you could take advantage of the “_fix_control” parameter – after talking to Oracle support. This particular problem comes under bug fix 7132684, which appears in v$system_fix_control under 10.2.0.5, with the description “Fix costing for non prefixed local index”. If you’ve got the right symptoms, then the following statement will bypass the problem:

alter session set "_fix_control"='7132684:OFF';

Final Warning:

This note comes from 30 minutes looking at Doug’s trace files, one hour building a test case, and an hour spent writing this blog. This is not a complete determination of the problem – it’s just an initial outline of what can happen and the symptoms to watch out for, posted as an early warning to help a few people save some time. Do not treat this note as a definitive reference.

Ichiro Obata

Technologies Supporting Exadata 6: Advanced Compression

Table compression with Oracle Advanced Compression also has a significant impact on Parallel Query performance. From the Oracle Advanced Compression White Paper:

For a large-scale DWH system, reducing the volume of data read from physical disk is the key to improving performance. Looking at the scale of that effect in the TPC-H benchmark (from http://ichiroobata.blogspot.com/2010/08/partitioning.html):

Ichiro Obata

Technologies Supporting Exadata 5: Oracle Partitioning

From the Oracle Data Sheet:

Partitioning brings a range of benefits in availability, performance and manageability to all kinds of applications.

Jonathan Lewis

Local Indexes – 2

In the previous note on local indexes I raised a couple of questions about the problems of different partitions holding different volumes of data, and supplied a script to build some sample data that produced the following values for blevel across the partitions of a list-partitioned table.

INDEX_NAME           PARTITION_NAME           BLEVEL LEAF_BLOCKS   NUM_ROWS
-------------------- -------------------- ---------- ----------- ----------
T1_ID                P0                            0           1          9
                     P1                            0           1         90
                     P2                            1           4        900
                     P3                            1          33       9000
                     P4                            1         339      90000
                     P5                            2        3384     900000

T1_N1                P0                            0           1          9
                     P1                            0           1         90
                     P2                            1           4        900
                     P3                            1          32       9000
                     P4                            1         314      90000
                     P5                            2        3136     900000

Q1: What do you think Oracle will record as the blevel at the global level for the two indexes ?
A1: As one of the commentators suggested, it seems to be the highest blevel recorded for any partition – in this case 2. (It’s possible that this assumption is wrong, of course, there may be some subtle weighting calculation involved – but I haven’t yet tested that hypothesis.)

Q2: If you have a query with a where clause like “id between 100 and 400 and n1 != 5” – which is designed very precisely to exclude the last, very big, partition – what value of blevel is Oracle going to use when considering the cost of using the index t1_id to access the data ?
A2: As I pointed out in an earlier note on list partitioned tables, Oracle doesn’t recognise the (obvious to the human eye) option for partition pruning in this predicate, so it uses the global blevel in the calculations.

The second answer is the one that is causing me a problem – because I have a client system where almost all the data is in a “dead” partition: it has a status, stored as the partition key of a list-partitioned table, of “COMPLETE”, and lots of their code includes the predicate status != ‘COMPLETE’ – but this can make the optimizer take the wrong execution path because it uses a global blevel that has been dictated by the huge volume of data that we know we don’t want to see.

The client queries are fairly complex, of course, but here’s a very trivial example demonstrating the basic problem (using the data generated by the code in the previous note – running under 11.1.0.6):

set autotrace traceonly explain

select	*
from	t1
where
	id = 99
and	n1 != 5
;

--------------------------------------------------------------------------------------------
| Id  | Operation                          | Name  | Rows  | Bytes | Cost  | Pstart| Pstop |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |       |     1 |   119 |    14 |       |       |
|   1 |  PARTITION LIST ALL                |       |     1 |   119 |    14 |     1 |     6 |
|*  2 |   TABLE ACCESS BY LOCAL INDEX ROWID| T1    |     1 |   119 |    14 |     1 |     6 |
|*  3 |    INDEX RANGE SCAN                | T1_ID |     1 |       |    13 |     1 |     6 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"<>5)
   3 - access("ID"=99)

From Oracle’s perspective it has to visit all six partitions because it can’t use the most appropriate index and do partition pruning – and the final cost of this simple query is 14 because the value used (six times, in effect) for the blevel in the calculations is two; but we have inside information that tells us that this is essentially an unreasonable cost.

If Oracle were to believe that a more appropriate blevel for this query was just one then the cost would drop significantly (although in this case the plan wouldn’t change):

--------------------------------------------------------------------------------------------
| Id  | Operation                          | Name  | Rows  | Bytes | Cost  | Pstart| Pstop |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |       |     1 |   119 |     8 |       |       |
|   1 |  PARTITION LIST ALL                |       |     1 |   119 |     8 |     1 |     6 |
|*  2 |   TABLE ACCESS BY LOCAL INDEX ROWID| T1    |     1 |   119 |     8 |     1 |     6 |
|*  3 |    INDEX RANGE SCAN                | T1_ID |     1 |       |     7 |     1 |     6 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"<>5)
   3 - access("ID"=99)

Of course for a really big system, where the “dead” partition was 200 Million rows, we might have a blevel of three:

--------------------------------------------------------------------------------------------
| Id  | Operation                          | Name  | Rows  | Bytes | Cost  | Pstart| Pstop |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |       |     1 |   119 |    20 |       |       |
|   1 |  PARTITION LIST ALL                |       |     1 |   119 |    20 |     1 |     6 |
|*  2 |   TABLE ACCESS BY LOCAL INDEX ROWID| T1    |     1 |   119 |    20 |     1 |     6 |
|*  3 |    INDEX RANGE SCAN                | T1_ID |     1 |       |    19 |     1 |     6 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"<>5)
   3 - access("ID"=99)

Note how changing the global blevel by one makes the cost change by six – a consequence of the fact that we have six partitions with no pruning. If you’re having trouble with queries against partitioned tables that don’t use the right index, take a close look at the data volumes and the values recorded for blevel at the global, partition and subpartition levels – it’s possible that you’re suffering from a bias introduced by one partition being much larger than all the rest.

If you’re wondering how I got these plans (without simply editing them) it was by using dbms_stats.set_index_stats() to change the stored statistics – see “Copy Stats” for an example of the type of code needed. In cases like this, where I have better information about the data and the intent of the code than the optimizer has, I am perfectly happy to give a “more truthful” picture of the data to the optimizer by writing scripts to adjust statistics.

There are three drawbacks to such an approach, of course. First, on the next upgrade the optimizer might get smarter and make my clever little hack a liability rather than a benefit; secondly, there may be examples of application code that I haven’t noticed that might go wrong because of my hack; finally, and more importantly in the short term, I have to make sure that my code runs every time the statistics on the index are modified by any other program (such as the automatic stats collection job).

But the principle is sound – if we understand the system better than the optimizer then it’s positively important to help the optimizer in the most truthful way possible. List partitions (in a way similar to frequency histograms) are an obvious target for this type of treatment.

Jonathan Lewis

Local Indexes

In a recent article about list partitioned tables I raised some questions about the cases where the optimizer can’t (yet) do partition pruning even when the opportunity is clearly visible to the human eye. The most important example was the case where each partition was defined to hold rows for just one partition key value – but the optimizer could not prune out the redundant partition for a query like: “partition_key != {constant}”.

I recently came across a situation where this really made a big difference. The system had a huge table that had been list partitioned as follows (with some camouflage):


partition by list (status) (
	partition p_state01 values  ('STATE01'),
	partition p_state02 values  ('STATE02'),
	    ....
	partition p_state25 values  ('STATE25'),
	partition p_handled values  ('Completed')
)

The table was defined to allow row movement, and every day there would be tens of thousands of rows moving through various states until they reached the “Completed” state.

There are various pros and cons to this setup. The most significant con is that when you update the status of a row Oracle actually has to update the row “in situ”, then delete it from what is now the wrong partition and insert it into the right partition. The most dramatic pro is that if the rows you’re interested in are (almost always) the ones that haven’t yet reached the “Completed” state, you’ve put all the boring old garbage out of the way where it doesn’t cause any problems. (In fact, if you’re running 11.2 you might choose to declare some of the “Completed” partitions of any local indexes as unusable and save yourself a lot of space – and by the time I’ve finished this article you might think this is a truly wonderful idea.) In the case of the client, there were about 200 million rows in the completed partition, and barely 2 million spread over the other partitions.
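As a sketch (the index name is made up, since the client’s objects are camouflaged): in 11.2 marking a local index partition unusable drops its segment, so a statement like the following should release the space covering the dead data:

-- hypothetical local index on the client's table; p_handled is the "Completed" partition
alter index big_t_status_ix modify partition p_handled unusable;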

There was a bit of a problem, though. Some of the indexes on this table had been created as local indexes (arguably they should all have been local) and this resulted in some odd optimisation side effects. Here’s a little bit of code to build a table that demonstrates an interesting issue:

create table t1 (
	id,
	n1,
	small_vc,
	padding
)
partition by list (n1) (
	partition p0 values(0),
	partition p1 values(1),
	partition p2 values(2),
	partition p3 values(3),
	partition p4 values(4),
	partition p5 values(5)
)
as
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		rownum <= 10000
)
select
	rownum			id,
	trunc(log(10,rownum))	n1,
	lpad(rownum,10,'0')	small_vc,
	rpad('x',100)		padding
from
	generator	v1,
	generator	v2
where
	rownum <= 999999
;

create index t1_n1 on t1(n1, small_vc) local nologging;
create index t1_id on t1(id, small_vc) local nologging;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		estimate_percent => 100,
		method_opt 	 => 'for all columns size 1'
	);
end;
/

break on index_name skip 1

select
	index_name, partition_name, blevel, num_rows, leaf_blocks
from
	user_ind_partitions    -- but see comment #1 below from Tony Sleight
order by
	index_name, partition_name
;

Thanks to the log definition of column n1, you will see a very skewed distribution of data across the partitions, and the output from the query against the index partitions shows this quite dramatically. Since the code sample uses a 100% sample on the stats, you should get the following figures for the indexes (with a little variation in leaf blocks, perhaps, depending on your version and tablespace definitions. I was using 11.1.0.6 with 8KB blocks, locally managed tablespaces, freelists, and 1MB uniform extents.)

INDEX_NAME           PARTITION_NAME           BLEVEL LEAF_BLOCKS   NUM_ROWS
-------------------- -------------------- ---------- ----------- ----------
T1_ID                P0                            0           1          9
                     P1                            0           1         90
                     P2                            1           4        900
                     P3                            1          33       9000
                     P4                            1         339      90000
                     P5                            2        3384     900000

T1_N1                P0                            0           1          9
                     P1                            0           1         90
                     P2                            1           4        900
                     P3                            1          32       9000
                     P4                            1         314      90000
                     P5                            2        3136     900000

So here’s important question number 1: What do you think the blevel will be at the global level for the two indexes ?

Important question number 2: If you have a query with a where clause like “id between 100 and 400 and n1 != 5” – which is designed very precisely to exclude the last partition – what value of blevel is Oracle going to use when considering the cost of using the index t1_id to access the data ?

My answers are in this follow-up post.

Chris Antognini

Deferred Segment Creation as of 11.2.0.2

One year ago I wrote a post entitled Deferred Segment Creation. If you read the comments related to it you can see that several people were concerned by the fact that it was not possible to easily get rid of the segments associated with empty tables. That was with version 11.2.0.1. Now we have version 11.2.0.2 [...]

Randolf Geist

Cleanup of partitioned objects residing in offlined tablespaces

Consider the scenario where a large database containing historical data - typically a data warehouse with partitioned tables whose partitions reside in different tablespaces depending on their age - is supposed to be duplicated, but the target environment (typically a lower environment, for example when duplicating Production to UAT) doesn't have sufficient space to hold the complete database, and in particular the whole set of historic partitions. In that case you somehow need to deal with a partial clone.

Of course, in an ideal world the lower environment is supposed to have sufficient space to hold the complete database, but we don't live in an ideal world. Sometimes, depending on the storage technology, you might be able to do some kind of temporary "split mirror" operation that allows you to start up a complete clone of the database and drop all the affected objects / tablespaces before putting the cleaned up / partial database on the final target system, but this is not always applicable, for instance when using ASM-based storage.

The issue in that particular case is that you can do for instance a partial RMAN clone that omits the historical tablespaces, but the resulting database then obviously has some tablespaces that are unavailable and can't be onlined since the datafiles are simply missing.

In the case of objects that reside completely in the offlined tablespaces the solution is simple and straightforward: the objects can be dropped, or the affected tablespaces can simply be dropped with the "including contents" clause.

Things get more complicated, however, in the case of partitioned objects where only some of the partitions reside in the offlined tablespaces.

In that case you can't simply drop the offlined tablespaces - you end up with an Oracle error message such as "ORA-14404: partitioned table contains partitions in a different tablespace", which tells you that objects exist that have some of their partitions outside the offlined tablespace, and Oracle obviously won't do the cleanup job for you in that case.

Oracle allows you to drop partitions of a table that reside in offlined tablespaces, but there is an important limitation: you can only do so if no index has been defined on the table. Otherwise you get the error message "ORA-14117/ORA-14630: [sub]partition resides in offlined tablespace" when attempting to drop the partition. It's interesting to note that the mere existence of indexes is sufficient - it doesn't matter whether the index resides in an offlined or onlined tablespace, or whether it is unusable or not - you just need at least one index defined on the table and the drop partition operation errors out. It looks like a hard-coded "if index exists then error" code path. I'm not sure why exactly this restriction is in place, but it has serious consequences if you need to do this sort of cleanup for a large database with hundreds of thousands of partitions.

By the way, it is interesting to know that Oracle allows you to drop segments from offlined tablespaces that are locally managed - think about the consequence: the local bitmaps that represent the allocated and free space in the tablespace cannot be updated at the time of the drop operation. This is one of the few disadvantages compared to the old-fashioned dictionary-managed tablespaces, where an operation like that could potentially have been limited to a pure dictionary operation.

The approach Oracle chooses is straightforward: the dropped segments are converted into "temporary" objects (a kind of special "recyclebin", if you like). If the offlined tablespace gets onlined again, the local bitmaps will be updated according to these temporary objects and finally the temporary objects will be removed from the dictionary; if instead the offlined tablespace gets dropped, the corresponding temporary objects can consequently also be removed.

Tanel Poder recently also blogged about this in the context of read-only tablespaces.

So you could now drop all indexes from the affected tables and recreate them afterwards, but this is not a good idea for several reasons:
* If the remaining partitions still represent multiple terabytes it will take quite a long time to rebuild all the indexes
* Things might go wrong and you end up with missing indexes

Since these are partitioned objects, another Oracle best practice might come to the rescue: exchange the affected partitions with an exchange table, excluding indexes - this allows you to drop the exchanged partition residing in an offlined tablespace without any error message. Note that this means that after you've exchanged the partition you need to drop the unpartitioned object (which doesn't have indexes) in order to get rid of the affected segment.

Now if you only have a few affected objects, or only a single partition to clean up per table, then you're basically done; but if you need to clean up multiple partitions and the number of partitions is high then this approach is not feasible, because it is quite an overhead to drop / create a table for each partition that needs to be cleaned up - in particular when talking about hundreds of thousands of partitions.

The approach of creating a table to exchange partitions with adds another complexity to the problem: many large databases make use of the option in Oracle to set columns unused instead of actually dropping them, since the former is a simple metadata operation in the dictionary (it renames the column and marks it as hidden) and is independent of the data volume, and therefore almost instant, whereas the latter requires Oracle to process the whole segment, which can take quite some time in the case of huge segments (and is not a very efficient operation by the way, but that is something for a different post).

Now, a table that can be exchanged with such an evolved table containing unused columns cannot be created by simply doing a handy CTAS operation - the unused columns will be missing from the copy and therefore the exchange operation will fail with "column mismatch" errors.

So you need to deal with these unused columns somehow whenever you create such a table dynamically - and you will have to create it repeatedly if you need to drop more than a single partition per table.

The only other viable option that I'm aware of is to maintain this exchange table permanently as a partitioned table itself - which means creating two exact copies (in terms of dictionary metadata) of every potentially affected table and making sure to apply the same DDL in the same order, which will ensure that the exchange operation with those tables will succeed.

It will take two copies, one unpartitioned and one partitioned, if you want to avoid the drop and dynamic create table operation, since Oracle allows you to exchange partitions only with unpartitioned tables. So it would take two exchange operations per affected partition - the first with the unpartitioned table, and the second with the partitioned table. Both exchange operations have to be done excluding indexes, and at least the final partitioned table has to be created excluding indexes - note that Oracle allows the exchange operation even if one table is missing a potentially created primary key constraint, provided you use the "EXCLUDING INDEXES" clause.

Then you can drop the partition that has been exchanged into the second table since the table doesn't have any indexes defined on it.

This allows you to accomplish two objectives:

1. Avoid a costly (and potentially complex in case of unused columns) drop / create table operation per partition
2. Manage the drop / create cycle by simply dropping and adding partitions in the second table which is far less overhead compared to a drop / create table operation which is important when performance matters

However, maintaining two copies of every table that potentially needs to be cleaned up sounds like a huge overhead, and even in Oracle 11.2 with deferred segment creation at least the partitioned table copy will allocate space, since the deferred segment creation option is not (yet) applicable to partitioned objects.

So here are a few ideas on how to accomplish such a cleanup operation:

1. For each table to process, create the required table copies dynamically, on the fly, once. If you don't have to deal with unused columns then a simple CTAS operation can be used to create those tables. If you need to handle unused columns, a more complex approach is required: you will have to use custom queries against the data dictionary to extract the required information, since the official DBMS_METADATA interface, as far as I know, doesn't expose information about unused columns.

The following query could be used as a starting point to extract the column definition of a table:

-- Get the column list including hidden columns, but ignoring virtual columns added by function-based indexes
-- TODO: What about virtual columns (11g or later)?
-- Note: :owner and :table_name below are placeholders for the schema and table being processed
select
    decode(t.hidden_column,
        'YES', 'SYS$' || internal_column_id || '$UNUSED',
        t.column_name
    ) ||
    ' ' ||
    lower(data_type) ||
    decode(substr(data_type, 1, 9),
        'TIMESTAMP', null,
        'INTERVAL ', null,
        decode(t.data_type,
            'NUMBER', decode(t.data_precision || t.data_scale,
                null, null,
                '(' ||
                decode(t.data_precision,
                    null, '*',
                    t.data_precision
                ) ||
                decode(t.data_scale,
                    null, null,
                    ',' || t.data_scale
                ) ||
                ')'
            ),
            'FLOAT', '(' || t.data_precision || ')',
            'LONG', null,
            'LONG RAW', null,
            'BLOB', null,
            'CLOB', null,
            'NCLOB', null,
            'BFILE', null,
            'CFILE', null,
            'BINARY_FLOAT', null,
            'BINARY_DOUBLE', null,
            'MLSLABEL', null,
            'ROWID', null,
            'UROWID', null,
            'DATE', null,
            '(' ||
            nvl(
                nvl(
                    decode(t.char_length,
                        0, to_number(null),
                        t.char_length
                    ),
                    t.char_col_decl_length
                ),
                t.data_length
            ) ||
            decode(t.character_set_name,
                'CHAR_CS', decode(t.char_used,
                    'C', ' char',
                    'B', ' byte'
                )
            ) ||
            ')'
        )
    ) ||
    ' ' ||
    decode(nullable,
        'N', 'not null',
        null
    ) ||
    ',' as column_def
from
    all_tab_cols t
where
    t.owner = :owner
and t.table_name = :table_name
and t.segment_column_id is not null
order by
    t.internal_column_id;

This will generate column names of the form SYS$n$UNUSED (where n is the internal column id) for all columns set to unused in the source table; these columns can then subsequently be set to unused in the newly created table to give it the same layout as the source table.

Note that it hasn't been tested with virtual columns yet, but it should deal with most of the available data types and also handle columns using char semantics correctly (important for NCHAR / NVARCHAR2 based columns and databases running with multi-byte character sets, in particular AL32UTF8). It also handles virtual columns added via function-based indexes - they are filtered out by the segment_column_id predicate.

Of course the extraction process potentially needs to handle much more than shown here, for instance table properties such as index-organized tables, partition keys, etc.

2. Process each affected table partition using the following approach:

* Add a new "clean" partition to the partitioned copy table in a tablespace that is online. The partition key is a "dummy" key that can be the same for each iteration of this process. This is the segment that is going to end up in the affected partitioned table and will finally be dropped, since it then resides in an available tablespace. (Remember, we want to get rid of the partitions residing in unavailable tablespaces, since we do not have sufficient space in the clone of the database to make them available.)

* Exchange the affected partition with the unpartitioned copy table EXCLUDING INDEXES (the unpartitioned table copy was also initially created in an available tablespace)

* Exchange the unpartitioned copy table (which now holds the defective segment) with the newly created "clean" partition of the partitioned copy table EXCLUDING INDEXES

* Now the defective partition resides in the partitioned copy table and can simply be dropped, since we don't have any indexes defined on this table

* Finally we can now drop the partition from the affected table (its segment is now the one that was added to the partitioned copy table in the previous iteration of the process), since it now resides in an available tablespace

The same process can be applied to a subpartitioned table. There are of course some more details to consider - for example, you can't drop the last remaining subpartition of a partition or the last remaining partition of a table - but since we assume that our affected table in general still has some partitions left that reside in current, available tablespaces, it should be possible to handle these scenarios.
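For a subpartitioned table the exchange in the second step simply uses the subpartition clause; a minimal sketch with hypothetical names:

-- Exchange a single defective subpartition instead of a whole partition
alter table t exchange subpartition p_bad_sp1
  with table t_exc excluding indexes without validation;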

Here is a code snippet / template that applies this process:

/**
 * Drop a (sub-)partition residing in an offline tablespace
 * @param p_object                   The original object to clean up
 * @param p_temp_exc_table_name      The name of the working table
 * @param p_clean_tablespace_name    The tablespace in which to create the cleanup partition
 * @param p_partition_name           The name of the (sub-)partition to drop
 * @param p_segment_type             The type of the segment (TABLE [SUB]PARTITION)
 * @param p_last_partition_indicator For subpartitions, indicates that this is
 *                                   the last subpartition - this will drop the whole partition
 **/
procedure drop_partition(
    p_object                   in     r_oracle_object
  , p_temp_exc_table_name      in out oracle_object
  , p_clean_tablespace_name    in     oracle_object
  , p_partition_name           in     oracle_object
  , p_segment_type             in     oracle_object
  , p_last_partition_indicator in     varchar2
)
as
  s_sql            large_sql;
  s_partition_type varchar2(255);
  s_exchange_table oracle_object;
  e_partition_in_offline_ts exception;
  pragma exception_init(e_partition_in_offline_ts, -14117);
  -- We do this in an autonomous transaction to keep it separate
  -- from any outside transactions like the "queue" transaction
  -- used by the parallel cleanup threads
  pragma autonomous_transaction;
begin
  -- Determine if this is a partition or subpartition
  s_partition_type := substr(p_segment_type, 7, 255);

  -- Step 1: Add a clean partition to the temporary cleanup table
  s_sql := 'alter table ' || p_temp_exc_table_name ||
           ' add partition p_cleanup values (42) tablespace ' || p_clean_tablespace_name;

  execute(p_object.owner, s_sql);

  -- Get the name of the unpartitioned table copy
  s_exchange_table := get_exchange_table(p_object);

  -- Step 2: Exchange the defective (sub-)partition with the unpartitioned copy table EXCLUDING indexes
  s_sql := '
    alter table ' || p_object.segment_name || '
    exchange ' || s_partition_type || ' ' || p_partition_name || '
    with table ' || s_exchange_table || '
    excluding indexes without validation';

  execute(p_object.owner, s_sql);

  -- Step 3: Exchange the unpartitioned copy table with the partitioned copy table partition again EXCLUDING indexes
  s_sql := '
    alter table ' || p_temp_exc_table_name || '
    exchange partition p_cleanup
    with table ' || s_exchange_table || '
    excluding indexes without validation';

  execute(p_object.owner, s_sql);

  -- Step 4: Drop the defective partition now residing in the partitioned "waste-bin" table
  -- Since we don't have any indexes this is possible now
  s_sql := 'alter table ' || p_temp_exc_table_name || ' drop partition p_cleanup';

  begin
    execute(p_object.owner, s_sql);
  exception
    -- Any tables with LOBs will cause an exception since they implicitly have a LOB INDEX added,
    -- therefore the DROP PARTITION will fail with ORA-14117.
    -- The simplest solution is to drop and recreate the entire working table - note the serious side effects
    -- of the underlying ORA-01110 error in 11.1.0.7. See the comments below about the potential impact
    -- of the health check monitoring.
    -- A smoother way of handling this and avoiding these potential side effects is to check whether the
    -- table has any indexes defined on it and in that case not attempt to drop the partition but
    -- immediately drop the entire table.
    when e_partition_in_offline_ts then
      drop_temporary_exc_table(p_object, p_temp_exc_table_name);
      p_temp_exc_table_name := create_temporary_exc_table(p_object, p_clean_tablespace_name, p_segment_type);
  end;

  -- Step 5: Drop the partition / subpartition, which completes the cleanup
  -- TODO: Are really all subpartitions of a partition affected?
  --       If not, this logic needs to be revised
  if (s_partition_type = 'SUBPARTITION' and p_last_partition_indicator = 'Y') then
    s_sql := 'alter table ' || p_object.segment_name ||
             ' drop partition ' || get_parent_partition(p_object, p_partition_name);

    execute(p_object.owner, s_sql);
  elsif s_partition_type = 'PARTITION' then
    s_sql := 'alter table ' || p_object.segment_name ||
             ' drop partition ' || p_partition_name;

    execute(p_object.owner, s_sql);
  end if;
end drop_partition;

Note that tables with LOBs need special treatment, since the logic used for tables without LOBs fails - with an error stating that the partition resides in an offline tablespace - when the defective partition is dropped from the temporary cleanup table.

This is the error that is typically raised when attempting to drop a partition residing in an offline tablespace from a table that has indexes defined on it. Very likely this error is caused by the fact that a LOB always has a corresponding LOB index (which is always exchanged by an EXCHANGE PARTITION operation, even if EXCLUDING INDEXES is specified).

Since this index cannot be dropped from the temporary cleanup table, we actually do not gain anything from the exchange exercise.
The simplest solution to the problem is to drop and recreate the cleanup table, which is possible without the ORA-14117 error being raised.

Note that attempting the drop that raises the ORA-14117 error has a serious side effect in 11.1.0.7 and 11.2.0.1: if the "Health Monitoring" detects a "flooding" ORA-01110 error (something is wrong with a datafile), every subsequent operation on those datafiles seems to have tracing turned on. This leads to the odd situation that onlining such a tablespace again (which we can't do in our actual case since the datafiles are not there, but which I have done for testing purposes) takes minutes and generates trace files between 20 and 500 MB (!) in size.

I haven't investigated this much further, but in my tests it couldn't be avoided by setting the undocumented parameter "_disable_health_checks=TRUE".

The MMON_SLAVE process will dump something similar to this:

========= Dump for error ORA 1110 (no incident) ========
----- DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (Async) -----

Also, for the sessions that cause the error, something like this will be written to the trace files:

DDE rules only execution for: ORA 1110
----- START Event Driven Actions Dump ----
---- END Event Driven Actions Dump ----
----- START DDE Actions Dump -----
Executing SYNC actions
----- START DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (Async) -----
DDE Action 'DB_STRUCTURE_INTEGRITY_CHECK' was flood controlled
----- END DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (FLOOD CONTROLLED, 17 csec) -----
Executing ASYNC actions
----- END DDE Actions Dump (total 0 csec) -----

A smoother way of handling this and avoiding these potential side effects is to check whether the table has any indexes defined on it and, if so, not attempt to drop the partition but immediately drop the entire table.
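A sketch of how Step 4 of the template above could be adapted along those lines, reusing the helper routines from the template and checking the data dictionary for any (implicit LOB) indexes on the working table before attempting the drop (this is an illustration, not the exact production code):

-- Alternative Step 4: skip the DROP PARTITION attempt altogether if the
-- working table has any indexes (e.g. an implicit LOB index) and instead
-- drop and recreate the whole working table straight away
declare
    n number;
begin
    select count(*)
    into   n
    from   dba_indexes
    where  owner      = p_object.owner
    and    table_name = p_temp_exc_table_name;

    if n > 0 then
        -- implicit (LOB) indexes present: dropping the partition would raise
        -- ORA-14117 and trigger the health-check tracing described above
        drop_temporary_exc_table(p_object, p_temp_exc_table_name);
        p_temp_exc_table_name := create_temporary_exc_table(p_object, p_clean_tablespace_name, p_segment_type);
    else
        execute(p_object.owner, 'alter table ' || p_temp_exc_table_name || ' drop partition p_cleanup');
    end if;
end;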

Since each iteration of this process takes some time (in a database with hundreds of thousands of partitions it can take a few seconds per partition due to the underlying recursive data dictionary operations) and, according to some traces, is mostly CPU bound, you can speed up the processing by using multiple processes if you have multiple CPUs available.

Using an approach based on Advanced Queueing with multiple consumers (for instance, as many consumers as CPUs) we were able to clean up more than 200,000 partitions (table plus index (sub-)partitions) in less than an hour, which was within the available timeframe for that operation.
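A rough sketch of how such a work queue could be set up - everything here (queue names, the worker package, the criterion for selecting affected partitions, the number of workers) is hypothetical and only meant to illustrate the competing-consumers pattern:

-- Create a simple work queue holding one message per (sub-)partition to clean up
begin
    dbms_aqadm.create_queue_table(
        queue_table        => 'cleanup_qt',
        queue_payload_type => 'RAW'
    );
    dbms_aqadm.create_queue(
        queue_name  => 'cleanup_q',
        queue_table => 'cleanup_qt'
    );
    dbms_aqadm.start_queue('cleanup_q');
end;
/

-- Fill the queue (the WHERE clause identifying affected partitions is just an example)
declare
    l_enq_opts dbms_aq.enqueue_options_t;
    l_msg_prop dbms_aq.message_properties_t;
    l_msgid    raw(16);
begin
    for rec in (
        select table_name, partition_name
        from   dba_tab_partitions
        where  tablespace_name = 'OFFLINE_TS'
    ) loop
        dbms_aq.enqueue(
            queue_name         => 'cleanup_q',
            enqueue_options    => l_enq_opts,
            message_properties => l_msg_prop,
            payload            => utl_raw.cast_to_raw(rec.table_name || '.' || rec.partition_name),
            msgid              => l_msgid
        );
    end loop;
    commit;
end;
/

-- Start one worker job per CPU; cleanup_pkg.worker is assumed to loop,
-- dequeue a message and call drop_partition until the queue is empty
begin
    for i in 1 .. 8 loop
        dbms_scheduler.create_job(
            job_name   => 'CLEANUP_WORKER_' || i,
            job_type   => 'PLSQL_BLOCK',
            job_action => 'begin cleanup_pkg.worker; end;',
            enabled    => true
        );
    end loop;
end;
/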
