Join Cardinality – 2

Jonathan Lewis's picture

In the previous note I posted about Join Cardinality I described a method for calculating the figure that the optimizer would give for the special case where you had a query that:

  • joined two tables
  • used a single-column to join on equality
  • had no nulls in the join columns
  • had a perfect frequency histogram on the columns at the two ends of the join
  • had no filter predicates associated with either table

The method simply said: “Match up rows from the two frequency histograms, multiply the corresponding frequencies” and I supplied a simple SQL statement that would read and report the two sets of histogram data, doing the arithmetic and reporting the final cardinality for you. In an update I also added an adjustment needed in 11g (or, you might say, removed in 12c) where gaps in the histograms were replaced by “ghost rows” with a frequency that was half the lowest frequency in the histogram.

This is a nice place to start as the idea is very simple, and it’s likely that extensions of the basic idea will be used in all the other cases we have to consider. There are 25 possibilities that could need separate testing – though only 16 of them ought to be relevant from 12c onwards. Oracle allows for four kinds of histograms – in order of how precisely they describe the data they are:

  • Frequency – with a perfect description of the data
  • Top-N (a.k.a. Top-Frequency) – which describes all but a tiny fraction (ca. one bucket’s worth) of data perfectly
  • Hybrid – which can (but doesn’t usually, by default) describe up to 2,048 popular values perfectly and gives an approximate distribution for the rest
  • Height-balanced – which can (but doesn’t usually, by default) describe at most 1,024 popular values with some scope for misinformation.

Finally, of course, we have the general case of no histogram, using only 4 numbers (low value, high value, number of rows, number of distinct values) to give a rough picture of the data – and the need for histograms appears, of course, when the data doesn’t look anything like an even distribution of values between the low and high with close to “number of rows”/”number of distinct values” for each value.

So there are 5 possible statistical descriptions for the data in a column – which means there are 5 * 5 = 25 possible options to consider when we join two columns, or 4 * 4 = 16 if we label height-balanced histograms as obsolete and ignore them (which would be a pity because Chinar has done some very nice work explaining them).

Of course, once we’ve worked out a single-column equijoin between two tables there are plenty more options to consider:  multi-column joins, joins involving range-based predicates, joins involving more than 2 tables, and queries which (as so often happens) have predicates which aren’t involved in the joins.

For the moment I’m going to stick to the simplest case – two tables, one column, equality – and comment on the effects of filter predicates. It seems to be very straightforward as I’ll demonstrate with a new model

rem
rem     Script:         freq_hist_join_03.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2018
rem

execute dbms_random.seed(0)

create table t1(
        id      number(8,0),
        n0040   number(4,0),
        n0090   number(4,0),
        n0190   number(4,0),
        n0990   number(4,0),
        n1      number(4,0)
)
;

create table t2(
        id      number(8,0),
        n0050   number(4,0),
        n0110   number(4,0),
        n0230   number(4,0),
        n1150   number(4,0),
        n1      number(4,0)
)
;

insert into t1
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                                  id,
        mod(rownum,   40) + 1                   n0040,
        mod(rownum,   90) + 1                   n0090,
        mod(rownum,  190) + 1                   n0190,
        mod(rownum,  990) + 1                   n0990,
        trunc(30 * abs(dbms_random.normal))     n1
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5 -- > comment to avoid WordPress format issue
;

insert into t2
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                                  id,
        mod(rownum,   50) + 1                   n0050,
        mod(rownum,  110) + 1                   n0110,
        mod(rownum,  230) + 1                   n0230,
        mod(rownum, 1150) + 1                   n1150,
        trunc(30 * abs(dbms_random.normal))     n1
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5 -- > comment to avoid WordPress format issue
;

begin
        dbms_stats.gather_table_stats(
                ownname => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1 for columns n1 size 254'
        );
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T2',
                method_opt  => 'for all columns size 1 for columns n1 size 254'
        );
end;
/

You’ll notice that in this script I’ve created empty tables and then populated them. This is because of an anomaly that appeared in 18.3 when I used “create as select”, and should allow the results from 18.3 be an exact match for 12c. You don’t need to pay much attention to the Nxxx columns, they were there so I could experiment with a few variations in the selectivity of filter predicates.

Given the purpose of the demonstration I’ve gathered histograms on the column I’m going to use to join the tables (called n1 in this case), and here are the summary results:


TABLE_NAME           COLUMN_NAME          HISTOGRAM       NUM_DISTINCT NUM_BUCKETS
-------------------- -------------------- --------------- ------------ -----------
T1                   N1                   FREQUENCY                119         119
T2                   N1                   FREQUENCY                124         124

     VALUE  FREQUENCY  FREQUENCY      PRODUCT
---------- ---------- ---------- ------------
         0       2488       2619    6,516,072
         1       2693       2599    6,999,107
         2       2635       2685    7,074,975
         3       2636       2654    6,995,944
...
       113          1          3            3
       115          1          2            2
       116          4          3           12
       117          1          1            1
       120          1          2            2
                                 ------------
sum                               188,114,543

We’ve got frequencyy histograms, and we can see that they don’t have a perfect overlap. I haven’t printed every single line from the cardinality query, just enough to show you the extreme skew, a few gaps, and the total. So here are three queries with execution plans:


set serveroutput off

alter session set statistics_level = all;
alter session set events '10053 trace name context forever';

select
        count(*)
from
        t1, t2
where
        t1.n1 = t2.n1
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

select
        count(*)
from
        t1, t2
where
        t1.n1 = t2.n1
and     t1.n0990 = 20
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));


select
        count(*)
from
        t1, t2
where
        t1.n1 = t2.n1
and     t1.n0990 = 20
and     t2.n1150 = 25
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

I’ve queried the pure join – the count was exactly the 188,114,543 predicted by the cardinality query, of course – then I’ve applied a filter to one table, then to both tables. The first filter n0990 = 20 will (given the mod(,990)) definition identify one row in 990 from the original 100,000 in t1; the second filter n1150 = 25 will identify one row in 1150 from t2. That’s filtering down to 101 rows and 87 rows respectively from the two tables. So what do we see in the plans:


-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:23.47 |     748 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:23.47 |     748 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |    188M|    188M|00:00:23.36 |     748 |  6556K|  3619K| 8839K (0)|
|   3 |    TABLE ACCESS FULL| T1   |      1 |    100K|    100K|00:00:00.01 |     374 |       |       |          |
|   4 |    TABLE ACCESS FULL| T2   |      1 |    100K|    100K|00:00:00.01 |     374 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."N1"="T2"."N1")



-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:00.02 |     748 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:00.02 |     748 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |    190K|    200K|00:00:00.02 |     748 |  2715K|  2715K| 1647K (0)|
|*  3 |    TABLE ACCESS FULL| T1   |      1 |    101 |    101 |00:00:00.01 |     374 |       |       |          |
|   4 |    TABLE ACCESS FULL| T2   |      1 |    100K|    100K|00:00:00.01 |     374 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."N1"="T2"."N1")
   3 - filter("T1"."N0990"=20)



-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:00.01 |     748 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:00.01 |     748 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |    165 |    165 |00:00:00.01 |     748 |  2715K|  2715K| 1678K (0)|
|*  3 |    TABLE ACCESS FULL| T2   |      1 |     87 |     87 |00:00:00.01 |     374 |       |       |          |
|*  4 |    TABLE ACCESS FULL| T1   |      1 |    101 |    101 |00:00:00.01 |     374 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."N1"="T2"."N1")
   3 - filter("T2"."N1150"=25)
   4 - filter("T1"."N0990"=20)


The first execution plan shows an estimate of 188M rows – but we’ll have to check the trace file to confirm whether that’s only an approximate match to our calculation, or whether it’s an exact match. So here’s the relevant pair of lines:


Join Card:  188114543.000000 = outer (100000.000000) * inner (100000.000000) * sel (0.018811)
Join Card - Rounded: 188114543 Computed: 188114543.000000

Yes, the cardinality calculation and the execution plan estimates match perfectly. But there are a couple of interesting things to note. First, Oracle seems to be deriving the cardinality by multiplying the individual cardinalities of the two tables with a figure it calls “sel” – the thing that Chinar Aliyev has labelled Jsel the “Join Selectivity”. Secondly, Oracle can’t do arithmetic (or, removing tongue from cheek) the value it’s reported for the join selectivity is reported at only 6 decimal places, but stored to far more. What is the Join Selectivity, though ? It’s the figure we derive from the cardinality SQL divided by the cardinality of the cartesian join of the two tables – i.e. 188,114,543 / (100,000 * 100,000).

With the clue from the first trace file, can we work out why the second and third plans show 190K and 165 rows respectively. How about this – multiply the filtered cardinalities of the two separate tables, then multiply the result by the join selectivity:

  • 1a)   n0990 = 20: gives us 1 row in every 990.    100,000 / 990 = 101.010101…    (echoing the rounded execution plan estimate).
  • 1b)   100,000 * (100,000/990) * 0.0188114543 = 190,014.69898989…    (which is in the ballpark of the plan and needs confirmation from the trace file).

 

  • 2a)   n1150 = 25: gives us 1 row in every 1,150.    100,000 / 1,150 = 86.9565217…    (echoing the rounded execution plan estimate)
  • 2b)   (100,000/990) * (100,000/1,150) * 0.0188114543 = 165.2301651..    (echoing the rounded execution plan estimate).

Cross-checking against extracts from the 10053 trace files:


Join Card:  190014.689899 = outer (101.010101) * inner (100000.000000) * sel (0.018811)
Join Card - Rounded: 190015 Computed: 190014.689899

Join Card:  165.230165 = outer (86.956522) * inner (101.010101) * sel (0.018811)
Join Card - Rounded: 165 Computed: 165.230165

Conclusion.

Remembering that we’re still looking at very simple examples with perfect frequency histograms: it looks as if we can work out a “Join Selectivity” (Jsel) – the selectivity of a “pure” unfiltered join of the two tables – by querying the histogram data then use the resulting value to calculate cardinalities for simple two-table equi-joins by multiplying together the individual (filtered) table cardinality estimates and scaling by the Join Selectivity.

Acknowledgements

Most of this work is based on a document written by Chinar Aliyev in 2016 and presented at the Hotsos Symposium the same year. I am most grateful to him for responding to a recent post of mine and getting me interested in spending some time to get re-acquainted with the topic. His original document is a 35 page pdf file, so there’s plenty more material to work through, experiment with, and write about.

 

To prevent automated spam submissions leave this field empty.