In-table predicates


This note was prompted by a recent email asking about the optimizer’s method for estimating the selectivity of a predicate that compares two columns in the same table – for example: “where orders.amount_invoiced = orders.amount_paid”. It’s been about 14 years since I wrote “Cost Based Oracle – Fundamentals”, and my memory of what I wrote (and whether I even mentioned this case) was rather hazy, so I sent off a quick reply and decided to do a little checking.

It turned out that I’d already written a blog note with a throwaway comment about the estimates and a general workaround for optimizer problems caused by examples of this kind. The comment I made about the estimate was that the selectivity seems to be the smaller of the selectivities of (using the example above) “amount_paid = :unpeekable_bind” and “amount_invoiced = :unpeekable_bind”. I’m fairly sure I’ve made similar comments several times in the past, but after replying to the email I started to wonder whether this would still be true if there were histograms on the columns. So I ran up a little test and here, to start things off, is the code to generate the data I used for testing:


rem
rem     Script:         column_equality_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Apr 2019
rem     Purpose:
rem

create table t1(
        id      number(8,0),
        n1      number(6,0)
)
;

create table t2(
        id      number(8,0),
        n1      number(6,0)
)
;

create table t3(
        n1      number(6,0),
        n2      number(6,0),
        v1      varchar2(50)
)
;

execute dbms_random.seed(0)

insert into t1
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                                  id,
        trunc(10 * abs(dbms_random.normal))     n1
from
        generator       v1
;

insert into t2
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                                  id,
        trunc(10 * abs(dbms_random.normal))     n1
from
        generator       v1
;

insert into t3 (n1, n2, v1)
select
        t1.n1,
        t2.n1,
        rpad(rownum,50)
from
        t1, t2
where
        t1.id = t2.id
;

commit;

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1 for columns n1 size 254'         -- frequency histogram on n1 only
        );
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T2',
                method_opt  => 'for all columns size 1 for columns n1 size 254'         -- frequency histogram on n1 only
        );
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T3',
                method_opt  => 'for all columns size 254 for columns v1 size 1'         -- histograms on n1 and n2, but not v1
        );
end;
/

select
        table_name, column_name, num_distinct, density, histogram,
        low_value, high_value
from
        user_tab_cols
where
        table_name in ('T1','T2','T3')
and     column_name in ('N1','N2')
order by
        table_name, column_name
;


TABLE_NAME      COLUMN_NAME     NUM_DISTINCT    DENSITY HISTOGRAM       LOW_VALUE  HIGH_VALUE
--------------- --------------- ------------ ---------- --------------- ---------- ----------
T1              N1                        38     .00005 FREQUENCY       80         C128

T2              N1                        38     .00005 FREQUENCY       80         C126

T3              N1                        38     .00005 FREQUENCY       80         C128
                N2                        38     .00005 FREQUENCY       80         C126


I’ve created two sets of 10,000 rows each of normally distributed data – but taken the absolute values so I’ve only got half the bell curve, and I’ve scaled up by a factor of 10 and truncated. This has given me two similar but slightly different sets of values which happen to cover 38 distinct values each.
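
If you want to eyeball the skew at this point, a quick aggregate does the job (this check is my addition – it isn’t part of the original script):


select
        n1, count(*)    ct
from
        t1
group by
        n1
order by
        n1
;
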

I’ve then generated my test set by joining these two tables on the unique (though not declared as such) id column to give a table with the same number of rows and two skewed sets of data. The calls to dbms_stats create histograms on the skewed data sets, and I’ve reported a few significant numbers about the 4 relevant columns.

Looking at the column statistics we have num_distinct = 38 across the board – so my observation from paragraph 2 above would tend to suggest that the optimizer would report 10,000/38 = 263 as the cardinality estimate for the predicate “t3.n1 = t3.n2” (I’m fairly confident that in this case 1/num_distinct will be preferred over using the density from user_tab_cols).
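
As a sketch of that naive arithmetic (my addition – assuming the optimizer really does use 1/num_distinct for each column) you could derive the estimate straight from the data dictionary:


select
        round(t.num_rows * least(1/c1.num_distinct, 1/c2.num_distinct))        naive_estimate
from
        user_tables     t,
        user_tab_cols   c1,
        user_tab_cols   c2
where
        t.table_name    = 'T3'
and     c1.table_name   = t.table_name
and     c1.column_name  = 'N1'
and     c2.table_name   = t.table_name
and     c2.column_name  = 'N2'
;


With the statistics above this returns round(10,000/38) = 263. But here’s what we get from a call to explain plan: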


explain plan for
select
        v1
from
        t3
where
        n1 = n2
;

select * from table(dbms_xplan.display);

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   564 | 32148 |    18   (6)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T3   |   564 | 32148 |    18   (6)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N1"="N2")

The estimate is 564 – a pretty good estimate in this case (the actual result was 552) as the two columns were randomly generated and there’s no correlation between them. Unfortunately this is quite a long way off my assumption of 263, so where did the optimizer get that number from?

Here’s a query (with result set) that you may recognise from an earlier post.


break on report skip 1
compute count of value on report
compute sum of t1_frequency on report
compute sum of t2_frequency on report
compute sum of product on report

column product format 999,999,999

with f1 as (
select
        endpoint_value                                                            value,
        endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency
from
        user_tab_histograms
where
        table_name  = 'T3'
and     column_name = 'N1'
),
f2 as (
select
        endpoint_value                                                            value,
        endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency
from
        user_tab_histograms
where
        table_name  = 'T3'
and     column_name = 'N2'
)
select
        f1.value,
        f1.frequency    t1_frequency,
        f2.frequency    t2_frequency,
        f1.frequency * f2.frequency product
from
        f1, f2
where
        f2.value = f1.value
order by
        f1.value
;



     VALUE T1_FREQUENCY T2_FREQUENCY      PRODUCT
---------- ------------ ------------ ------------
         0          777          768      596,736
         1          806          753      606,918
         2          794          779      618,526
         3          808          763      616,504
         4          752          749      563,248
         5          627          729      457,083
         6          623          628      391,244
         7          584          616      359,744
         8          544          597      324,768
         9          512          546      279,552
        10          441          439      193,599
        11          409          342      139,878
        12          345          370      127,650
        13          318          300       95,400
        14          257          282       72,474
        15          244          242       59,048
        16          214          206       44,084
        17          172          193       33,196
        18          161          140       22,540
        19          113          114       12,882
        20          108           93       10,044
        21           95           81        7,695
        22           72           55        3,960
        23           54           56        3,024
        24           43           36        1,548
        25           38           31        1,178
        26           23           18          414
        27           18           23          414
        28            7           14           98
        29            9           13          117
        30           14           11          154
        31            4            2            8
        32            5            3           15
        33            1            3            3
        35            4            1            4
        37            2            2            4
---------- ------------ ------------ ------------
        36
                   9998         9998    5,643,754


I’m querying the histogram information for the two columns, and where t3.n1 and t3.n2 have a value in common I’ve reported the two frequencies for that value and the product of the frequencies. For convenience I’ve included a count and a couple of sums to show that there isn’t a perfect match in the set of values for the two columns. The most important number at the bottom of the page, though, is the sum of the products of the frequencies of common values. Take that value and divide by 10,000 and you get 564.3754 – compare that with the cardinality estimate for the predicate “t3.n1 = t3.n2” and it’s a perfect match (allowing for rounding).
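
To make the prediction explicit you can fold that division into the query itself – a small extension of my own, reusing the two factored subqueries above and picking up num_rows from user_tables rather than hard-coding 10,000:


with f1 as (
select
        endpoint_value                                                            value,
        endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency
from
        user_tab_histograms
where
        table_name  = 'T3'
and     column_name = 'N1'
),
f2 as (
select
        endpoint_value                                                            value,
        endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency
from
        user_tab_histograms
where
        table_name  = 'T3'
and     column_name = 'N2'
)
select
        round(
                sum(f1.frequency * f2.frequency) /
                (select num_rows from user_tables where table_name = 'T3')
        )       predicted_rows
from
        f1, f2
where
        f2.value = f1.value
;


On this data set that comes out as round(5,643,754 / 10,000) = 564 – the figure reported in the execution plan.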

The query against user_tab_histograms is the query I used to calculate the cardinality of a join where there were frequency histograms on the columns at both ends of the join. The optimizer’s estimate for “intra-table” predicates is consistent with its estimate for joins (in the special cases of “no histograms” and “two frequency histograms”, at least). Viewing it from a slightly different angle: the selectivity of the predicate “n1 = n2” can be derived as “the cardinality estimate for joining t3 to itself” divided by “the cardinality of the cartesian join” (the latter being num_rows * num_rows, of course).
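
In symbols (my paraphrase of that observation):

        selectivity(n1 = n2) = estimated self-join cardinality / (num_rows * num_rows)

and the single-table cardinality estimate is then num_rows * that selectivity – in other words the estimated self-join cardinality divided by num_rows.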

Just as a closing demo – let’s generate a plan for the appropriate self-join of t3 and check the cardinality estimate:


explain plan for
select
        t3a.v1, t3b.v1
from
        t3 t3a, t3 t3b
where
        t3a.n2 = t3b.n1
;

select * from table(dbms_xplan.display);


---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |  5643K|   581M|   138  (83)| 00:00:01 |
|*  1 |  HASH JOIN         |      |  5643K|   581M|   138  (83)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T3   | 10000 |   527K|    13   (8)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| T3   | 10000 |   527K|    13   (8)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T3A"."N2"="T3B"."N1")


As expected the (rounded) join cardinality is reported as 5,643K.

So the selectivity of the single table predicate “n1 = n2” will be 5,643,000 / (10,000 * 10,000) = 0.05643, and the cardinality estimate of the single table query will be 10,000 * 0.05643 = 564.3. QED.

I haven’t tested any other variations – types of histogram, degree of overlap of value ranges, etc. – but I suspect that the general principle is probably going to give the selectivity as (or with the appearance of): “estimated cardinality of self-join” / “square of num_rows (allowing for nulls)”.
