Oakies Blog Aggregator

Jonathan Lewis

Min/Max upgrade

Here’s a nice little optimizer enhancement that appeared in 12.2 to make min/max range scans (and full scans) available in more circumstances. Rather than talk through it, here’s a little demonstration:

rem
rem     Script:         122_minmax.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1        Good path
rem             12.1.0.2        Bad path

create table pt1 (
        object_id,
        owner,
        object_type,
        object_name,
        status,
        namespace
)
nologging
partition by hash (object_id) partitions 4
as
select
        object_id,
        owner,
        object_type,
        object_name,
        status,
        namespace
from
        (select * from all_objects),
        (select rownum n1 from dual connect by level <= 10)  -- > comment to avoid format wordpress issue
;

alter table pt1 modify(status not null);

execute dbms_stats.gather_table_stats(null,'pt1',granularity=>'ALL',method_opt=>'for all columns size 1')

create index pt1_i1 on pt1(status, namespace) nologging local;

alter session set statistics_level = all;
set serveroutput off
set linesize 156
set pagesize 60
set trimspool on

select  min(status) from pt1;
select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition'));

select  min(namespace) from pt1 where status = 'INVALID';
select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition'));

select  min(namespace) from pt1 where status = (select min(status) from pt1);
select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition'));

The basic “min/max” optimisation allows Oracle to avoid a massive sort aggregate – Oracle doesn’t need to acquire a lot of data and sort it when it knows that the “left hand” end of an index holds the low values and the “right hand” end holds the high values. So, for example, in the first query above the optimizer could simply walk down the index branches to the left hand leaf and look at the single lowest entry in the leaf block to determine the lowest value for status … if the index had been a global index.

Things get a little messy, though, when the index is locally partitioned, your query isn’t about the partition key, and there’s no suitable global index. Once upon a time (IIRC) Oracle would simply have to do an index fast full scan across all index partitions to handle such a query, but some time ago the optimizer got a lot cleverer and was enhanced to do a min/max scan on each partition in turn, getting one value per partition very efficiently, then aggregating across those values to find the global minimum.
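
Conceptually the strategy is equivalent to the following sketch – for illustration only, since the optimizer does all this internally rather than rewriting the SQL, and the hash partition names here are hypothetical (check user_tab_partitions for the real system-generated names):

select  min(status)
from    (
        select  min(status) status from pt1 partition (sys_p101)
        union all
        select  min(status)        from pt1 partition (sys_p102)
        union all
        select  min(status)        from pt1 partition (sys_p103)
        union all
        select  min(status)        from pt1 partition (sys_p104)
        )
;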

Here are the three execution plans (with rowsource execution stats pulled from memory) taken from 12.1.0.2 for the queries above:


-----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |        |      1 |        |     9 (100)|       |       |      1 |00:00:00.01 |      12 |
|   1 |  SORT AGGREGATE             |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   2 |   PARTITION HASH ALL        |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   3 |    INDEX FULL SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
-----------------------------------------------------------------------------------------------------------------------------


-------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |        |      1 |        |     9 (100)|       |       |      1 |00:00:00.01 |      12 |
|   1 |  SORT AGGREGATE               |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   2 |   PARTITION HASH ALL          |        |      1 |      1 |     9   (0)|     1 |     4 |      1 |00:00:00.01 |      12 |
|   3 |    FIRST ROW                  |        |      4 |      1 |     9   (0)|       |       |      1 |00:00:00.01 |      12 |
|*  4 |     INDEX RANGE SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      1 |00:00:00.01 |      12 |
-------------------------------------------------------------------------------------------------------------------------------


-----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |
-----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |        |      1 |        |   337 (100)|       |       |      1 |00:00:00.07 |    2402 |   2242 |
|   1 |  SORT AGGREGATE                |        |      1 |      1 |            |       |       |      1 |00:00:00.07 |    2402 |   2242 |
|   2 |   PARTITION HASH ALL           |        |      1 |    422K|   328  (10)|     1 |     4 |     10 |00:00:00.07 |    2402 |   2242 |
|*  3 |    INDEX FAST FULL SCAN        | PT1_I1 |      4 |    422K|   328  (10)|     1 |     4 |     10 |00:00:00.07 |    2402 |   2242 |
|   4 |     SORT AGGREGATE             |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |      0 |
|   5 |      PARTITION HASH ALL        |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |      0 |
|   6 |       INDEX FULL SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |      0 |
-----------------------------------------------------------------------------------------------------------------------------------------

In the first plan Oracle has done an “index full scan (min/max)” across each of the four partitions in turn to return one row very cheaply from each, then aggregated to find the overall minimum.

In the second plan Oracle has done an “index range scan (min/max)” in exactly the same way, since it was able to find the start point in the index for the status ‘INVALID’ very efficiently.

In the third plan Oracle has been able to find the minimum value for the status (‘INVALID’) very efficiently in the subquery, and has passed that single value up to the main query, which has then used a brute force approach to search the whole of every partition of the index for every occurrence (all 10 of them) of the value ‘INVALID’ and then aggregated them to find the minimum namespace. Despite “knowing”, by the time the main query runs, that there will be a single value to probe for the status, the optimizer has not anticipated the fact that the final query will effectively become the same as the preceding one. As a result we’ve read 2,242 data blocks into the cache.

Turn, then, to the execution plan from 12.2.0.1 for this last query:


---------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |        |      1 |        |     9 (100)|       |       |      1 |00:00:00.01 |      24 |
|   1 |  SORT AGGREGATE                 |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      24 |
|   2 |   PARTITION HASH ALL            |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|   3 |    FIRST ROW                    |        |      4 |      1 |     9   (0)|       |       |      4 |00:00:00.01 |      24 |
|*  4 |     INDEX RANGE SCAN (MIN/MAX)  | PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|   5 |      SORT AGGREGATE             |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   6 |       PARTITION HASH ALL        |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   7 |        INDEX FULL SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
---------------------------------------------------------------------------------------------------------------------------------

In 12.2 you can see that the main query is now doing an “index range scan (min/max)” on each index partition in turn, based on the incoming (though unknown at parse time) single value from the subquery. As a result the total work done is a mere 24 buffer visits.

There have been a couple of occasions in the past where I’ve had to write some PL/SQL to work around little details like this. It’s nice to know that simple tables and partitioned tables with local indexes can now behave the same way. I also wonder whether there may be sites that could drop (or drop columns from, or make local) some indexes that they’ve previously created to handle queries of the “most recent occurrence” type.

If, for any reason, you need to disable this enhancement, it’s controlled by fix_control (v$system_fix_control) “18915345 Allow MIN/MAX optimization for pred having single row subquery” which can be set in the startup file, at the system level, or in the session.
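
As a sketch (with the usual caveat that fix controls should only be changed under the guidance of Oracle Support), the fix control can be flipped at the session level and checked like this:

alter session set "_fix_control" = '18915345:0';        -- disable the enhancement
alter session set "_fix_control" = '18915345:1';        -- re-enable it (the default)

select  bugno, value, description
from    v$system_fix_control
where   bugno = 18915345
;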

Update

Checking MoS for the bug number I found that the limitation had been reported for 11.2.0.3, with “Fixed in product version” reported as 12.2. There are patches for various releases of 11.2.0.4, though none yet for 12.1.0.2 – but if you think you need it you can always try raising an SR.

 

Franck Pachot

OpenShift on my Windows 10 laptop with MiniShift

If you want to play with OpenShift on your laptop, you can, in a virtual machine. I have VirtualBox installed on my laptop, so I’ll install Minishift here, which will create the VM to run OpenShift with just a few simple commands. On Linux you can refer to Daniel’s post. Here is the Windows version. Oh, and Daniel did that to run Postgres, but my goal is to run an Oracle container of course. Or maybe MySQL.

I’ve downloaded minishift-1.18.0-windows-amd64.zip and unzipped it in D:\Downloads\minishift-1.18.0-windows-amd64 where I have minishift.exe

Minishift

I configure Minishift to use VirtualBox:

minishift config set vm-driver virtualbox

It is installed in my Windows profile:

C:\Users\fpa\.minishift

Be careful: minishift does not know that we have multiple drives in Windows. When I ran minishift.exe from the D: drive it was not able to find the virtual machine’s files, which were on C:

D:\Downloads\minishift-1.18.0-windows-amd64\minishift-1.18.0-windows-amd64>minishift start
-- Starting profile 'minishift'
-- Checking if https://github.com is reachable ... OK
-- Checking if requested OpenShift version 'v3.9.0' is valid ... OK
-- Checking if requested OpenShift version 'v3.9.0' is supported ... OK
-- Checking if requested hypervisor 'virtualbox' is supported on this platform ... OK
-- Checking if VirtualBox is installed ... OK
-- Checking the ISO URL ... OK
-- Downloading OpenShift binary 'oc' version 'v3.9.0'
40.81 MiB / 40.81 MiB [===================================================================================] 100.00% 0s
-- Downloading OpenShift v3.9.0 checksums ... OK
-- Checking if provided oc flags are supported ... OK
-- Starting local OpenShift cluster using 'virtualbox' hypervisor ...
-- Minishift VM will be configured with ...
Memory: 2 GB
vCPUs : 2
Disk size: 20 GB
-- Starting Minishift VM ..... FAIL E0529 17:08:31.056327 1448 start.go:391] Error starting the VM: Error creating the VM. Error creating machine: Error in driver during machine creation: open /Users/fpa/.minishift/cache/iso/b2d/v1.3.0/minishift-b2d.iso: The system cannot find the path specified.. Retrying.
Error starting the VM: Error creating the VM. Error creating machine: Error in driver during machine creation: open /Users/fpa/.minishift/cache/iso/b2d/v1.3.0/minishift-b2d.iso: The system cannot find the path specified.

Then, I changed to the C: drive

D:\Downloads\minishift-1.18.0-windows-amd64\minishift-1.18.0-windows-amd64>C:
 
C:\Users\fpa>dir \Users\fpa\.minishift\cache\iso\b2d\v1.3.0\
Volume in drive C is OS
Volume Serial Number is 26AE-33F7
 
Directory of C:\Users\fpa\.minishift\cache\iso\b2d\v1.3.0
 
29-May-18  15:22    <DIR>          .
29-May-18  15:22    <DIR>          ..
29-May-18  15:22        41,943,040 minishift-b2d.iso
               1 File(s)     41,943,040 bytes
               2 Dir(s)  30,652,370,944 bytes free

And I run minishift from there:

C:\Users\fpa>D:minishift start
-- Starting profile 'minishift'
-- Checking if https://github.com is reachable ... OK
-- Checking if requested OpenShift version 'v3.9.0' is valid ... OK
-- Checking if requested OpenShift version 'v3.9.0' is supported ... OK
-- Checking if requested hypervisor 'virtualbox' is supported on this platform ... OK
-- Checking if VirtualBox is installed ... OK
-- Checking the ISO URL ... OK
-- Checking if provided oc flags are supported ... OK
-- Starting local OpenShift cluster using 'virtualbox' hypervisor ...
-- Minishift VM will be configured with ...
Memory: 2 GB
vCPUs : 2
Disk size: 20 GB
-- Starting Minishift VM .............................. OK
-- Checking for IP address ... OK
-- Checking for nameservers ... OK
-- Checking if external host is reachable from the Minishift VM ...
Pinging 8.8.8.8 ... OK
-- Checking HTTP connectivity from the VM ...
Retrieving http://minishift.io/index.html ... OK
-- Checking if persistent storage volume is mounted ... OK
-- Checking available disk space ... 0% used OK
Importing 'openshift/origin:v3.9.0' . CACHE MISS
Importing 'openshift/origin-docker-registry:v3.9.0' . CACHE MISS
Importing 'openshift/origin-haproxy-router:v3.9.0' . CACHE MISS
-- OpenShift cluster will be configured with ...
Version: v3.9.0
-- Copying oc binary from the OpenShift container image to VM ...................................... OK
-- Starting OpenShift cluster ........................
Using Docker shared volumes for OpenShift volumes
Using public hostname IP 192.168.99.102 as the host IP
Using 192.168.99.102 as the server IP
Starting OpenShift using openshift/origin:v3.9.0 ...
OpenShift server started.
 
The server is accessible via web console at:
  https://192.168.99.102:8443
 
You are logged in as:
User: developer
Password:
 
To login as administrator:
oc login -u system:admin

Docker

That’s all. I have a new VM in VirtualBox with its main files in C:/Users/fpa/.minishift

C:/Users/fpa/.minishift/machines/minishift/boot2docker.iso
C:/Users/fpa/.minishift/machines/minishift/disk.vmdk

The VM boots from the Boot2Docker ISO, which is the way to run Docker on Windows without enabling Hyper-V. The first network interface is NAT, for internet access. The second one gets a DHCP address from 192.168.99.1

You can control the VM with minishift (start, stop, configure, ssh,…):

D:\Downloads\minishift-1.18.0-windows-amd64\minishift-1.18.0-windows-amd64>d:minishift ssh
                        ##         .
                  ## ## ##        ==
               ## ## ## ## ##    ===
           /"""""""""""""""""\___/ ===
      ~~~ {~~ ~~~~ ~~~ ~~~~ ~~~ ~ /  ===- ~~~
           \______ o           __/
             \    \         __/
              \____\_______/
 _                 _   ____     _            _
| |__   ___   ___ | |_|___ \ __| | ___   ___| | _____ _ __
| '_ \ / _ \ / _ \| __| __) / _` |/ _ \ / __| |/ / _ \ '__|
| |_) | (_) | (_) | |_ / __/ (_| | (_) | (__|   <  __/ |
|_.__/ \___/ \___/ \__|_____\__,_|\___/ \___|_|\_\___|_|
Boot2Docker version 1.12.6, build HEAD : 5ab2289 - Wed Jan 11 03:20:40 UTC 2017
Docker version 1.12.6, build 78d1802

We have everything running in containers here:

docker@minishift:/mnt/sda1$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
77c0ef5a80d7 50c7bffa0653 "/usr/bin/openshift-r" 8 minutes ago Up 8 minutes k8s_router_router-1-tsmw7_default_7a1be0e2-635b-11e8-843d-f2c6a11ee2db_1
10fb6a2a6b70 9b472363b07a "/bin/sh -c '/usr/bin" 8 minutes ago Up 8 minutes k8s_registry_docker-registry-1-zfxm5_default_7865ac33-635b-11e8-843d-f2c6a11ee2db_1
2f6b90fb0bb4 openshift/origin-pod:v3.9.0 "/usr/bin/pod" 8 minutes ago Up 8 minutes k8s_POD_router-1-tsmw7_default_7a1be0e2-635b-11e8-843d-f2c6a11ee2db_1
3c720d166989 fae77002371b "/usr/bin/origin-web-" 8 minutes ago Up 8 minutes k8s_webconsole_webconsole-7dfbffd44d-48b9h_openshift-web-console_62b66c66-635b-11e8-843d-f2c6a11ee2db_1
bb5870fc0b7e openshift/origin-pod:v3.9.0 "/usr/bin/pod" 8 minutes ago Up 8 minutes k8s_POD_docker-registry-1-zfxm5_default_7865ac33-635b-11e8-843d-f2c6a11ee2db_1
95663bf29487 openshift/origin-pod:v3.9.0 "/usr/bin/pod" 8 minutes ago Up 8 minutes k8s_POD_webconsole-7dfbffd44d-48b9h_openshift-web-console_62b66c66-635b-11e8-843d-f2c6a11ee2db_1
5fc022dbe112 openshift/origin:v3.9.0 "/usr/bin/openshift s" 8 minutes ago Up 8 minutes origin

But we should not have to connect to this machine.

The minishift executable can be used to control everything. As I have the Docker client installed on my laptop (from the Docker Toolbox), I can get the environment variables:

D:\Downloads\minishift-1.18.0-windows-amd64\minishift-1.18.0-windows-amd64>d:minishift docker-env
SET DOCKER_TLS_VERIFY=1
SET DOCKER_HOST=tcp://192.168.99.102:2376
SET DOCKER_CERT_PATH=C:\Users\fpa\.minishift\certs
REM Run this command to configure your shell:
REM @FOR /f "tokens=*" %i IN ('minishift docker-env') DO @call %i

and see, from Windows, the docker images that are in the VM:

D:\Downloads\minishift-1.18.0-windows-amd64\minishift-1.18.0-windows-amd64>where docker
C:\Program Files\Docker Toolbox\docker.exe
 
SET DOCKER_TLS_VERIFY=1
SET DOCKER_HOST=tcp://192.168.99.102:2376
SET DOCKER_CERT_PATH=C:\Users\fpa\.minishift\certs
 
D:\Downloads\minishift-1.18.0-windows-amd64\minishift-1.18.0-windows-amd64>docker images
REPOSITORY                         TAG       IMAGE ID       CREATED       SIZE
openshift/origin-web-console       v3.9.0    fae77002371b   12 days ago   495.1 MB
openshift/origin-docker-registry   v3.9.0    9b472363b07a   12 days ago   464.9 MB
openshift/origin-haproxy-router    v3.9.0    50c7bffa0653   12 days ago   1.283 GB
openshift/origin-deployer          v3.9.0    e4de3cb64af9   12 days ago   1.261 GB
openshift/origin                   v3.9.0    83ec0170e887   12 days ago   1.261 GB
openshift/origin-pod               v3.9.0    b6d2be1df9c0   12 days ago   220.1 MB

While I’m there, I can run whatever I want as a docker container. Let’s try with Oracle.

I need to log in to the Docker store (where I have accepted the license conditions):

D:\Downloads\minishift-1.18.0-windows-amd64\minishift-1.18.0-windows-amd64>docker login
Login with your Docker ID to push and pull images from Docker Hub. If you don't have a Docker ID, head over to https://hub.docker.com to create one.
Username: pachot
Password:
Login Succeeded

Let’s pull the Oracle ‘slim’ image:

D:\Downloads\minishift-1.18.0-windows-amd64\minishift-1.18.0-windows-amd64>docker pull store/oracle/database-enterprise:12.2.0.1-slim
12.2.0.1: Pulling from store/oracle/database-enterprise
4ce27fe12c04: Downloading [> ] 1.076 MB/83.31 MB
9d3556e8e792: Downloading [> ] 535.3 kB/151 MB
fc60a1a28025: Download complete
0c32e4ed872e: Download complete
b465d9b6e399: Waiting

And run it:

D:\Downloads\minishift-1.18.0-windows-amd64\minishift-1.18.0-windows-amd64>docker run -it --rm --name orcl store/oracle/database-enterprise:12.2.0.1-slim
Setup Oracle Database
Oracle Database 12.2.0.1 Setup
Wed May 30 20:16:56 UTC 2018
 
Check parameters ......
log file is : /home/oracle/setup/log/paramChk.log
paramChk.sh is done at 0 sec
 
untar DB bits ......
log file is : /home/oracle/setup/log/untarDB.log
untarDB.sh is done at 38 sec
...

You may find it funny, but the Oracle images in the Docker store contain only a tarball of the Oracle Home and a pre-created database. It just takes the time to untar those and start the instance, and after 2 minutes I have my database ready. Everything is untarred to the volume, including the software.

Here are the host ports that the container ports are redirected to:

C:\Users\fpa>SET DOCKER_TLS_VERIFY=1
C:\Users\fpa>SET DOCKER_HOST=tcp://192.168.99.102:2376
C:\Users\fpa>SET DOCKER_CERT_PATH=C:\Users\fpa\.minishift\certs
C:\Users\fpa>docker port orcl
1521/tcp -> 0.0.0.0:32771
5500/tcp -> 0.0.0.0:32770

Then it’s easy to connect through SQL*Net with the credentials provided (see the setup instructions):

C:\Users\fpa>sqlcl sys/Oradoc_db1@//192.168.99.102:32771/orclpdb1.localdomain as sysdba
SQLcl: Release 18.1.1 Production on Wed May 30 22:41:10 2018
Copyright (c) 1982, 2018, Oracle. All rights reserved.
 
SQL> select instance_name,host_name from v$instance;
 
INSTANCE_NAME HOST_NAME
------------- ------------
ORCLCDB 254f96463883

OpenShift

So, that’s an easy way to run Oracle on Docker when you have VirtualBox: one download and 5 commands and I’m ready to connect. But that’s not the goal. We have OpenShift here to manage multiple Docker containers.

According to the ‘minishift start’ output I have a Web server on https://192.168.99.102:8443 (user: system, password: admin)

It already contains a lot of databases:
[screenshot: CaptureMiniShift001]

They are really easy to use. In two clicks I’ve run a MySQL container:
[screenshot: CaptureMiniShift002]

If you don’t like the GUI, there’s the command line interface of OpenShift within the minishift cache:

C:\Users\fpa>C:\Users\fpa\.minishift\cache\oc\v3.9.0\windows\oc.exe login https://192.168.99.102:8443
Authentication required for https://192.168.99.102:8443 (openshift)
Username: system
Password:
Login successful.
 
You have one project on this server: "test"
 
Using project "test".
 
C:\Users\fpa>C:\Users\fpa\.minishift\cache\oc\v3.9.0\windows\oc.exe status
In project test on server https://192.168.99.102:8443
 
svc/mysql - 172.30.224.246:3306
dc/mysql deploys openshift/mysql:5.7
deployment #1 deployed 7 minutes ago - 1 pod

Now that I have OpenShift installed and running, I’ll be able to run Oracle and manage containers from there. That’s for the next post on this subject.

 

This article, OpenShift on my Windows 10 laptop with MiniShift, appeared first on Blog dbi services.

Frits Hoogland

Oracle database wait event ‘db file async I/O submit’ timing bug

This blogpost is a look into a bug in the wait interface that I have reported to Oracle a few times. I verified all versions from Oracle 11.2 up to 18.2.0.0.180417 on Linux x86_64; in all these versions this bug is present. The bug is that the wait event ‘db file async I/O submit’ does not time anything when ASM is used; only when using a filesystem does this wait event essentially time the io_submit system call. All tests were done on Linux x86_64, Oracle Linux 7.4, with database and grid version 18.2.0.0.180417.

So what?
You might not have seen this wait event before; that’s perfectly possible, because this wait event is unique to the database writer. So does this wait event matter?
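
If you want to see whether the event shows up on your own system, here is a quick sketch of a check against the database writer sessions (the program filter is an assumption; adjust it to match your platform’s process naming):

select  s.program, e.event, e.total_waits, e.time_waited_micro
from    v$session s, v$session_event e
where   e.sid = s.sid
and     s.program like '%DBW%'
and     e.event in ('db file async I/O submit', 'db file parallel write')
order by
        s.program, e.event
;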

When the Oracle database engine is set to use asynchronous I/O, and when it makes sense to use asynchronous I/O (!), the engine will use the combination of io_submit() to issue I/O requests to the operating system and, when it needs to, will fetch the I/O requests using io_getevents(). In general (so not consistently), the engine does not time io_submit, which is a non-blocking call; it only times when it needs to wait for I/O requests using io_getevents(), which is reported as a wait event in an IO wait event class. A lot of ‘%parallel%’ IO related wait events can time asynchronous IO calls.

So why would the engine then time io_submit() for the database writer?
Well, io_submit() is not a blocking call, UNLESS the device queue to which the requests are submitted is full. That, presumably, is why the developers of the database writer code decided to implement a wait event for io_submit, which is not the case for any other process.

To understand why this makes sense, a little knowledge about database writer internals is necessary. When blocks are dirtied in the cache and are checkpointed later on, they must be written to disk. The number of blocks to be written, and therefore the number of writes, can get high very quickly. The way this is processed is quite interesting (simplified, obviously, and assuming a filesystem):

a) the database writer picks up a batch of blocks needing writing, for up to 128 IO requests.
b) that batch is submitted, timed by ‘db file async I/O submit’
c) a blocking io_getevents call is issued, timed by ‘db file parallel write’, to wait for the IOs to finish. The interesting thing specifically for the database writer is that the minimal number of IOs it waits for is very low (a few IOs, rising to 25-75% of the IOs as the batch gets bigger). Any finished IO will be picked up here, but it’s perfectly possible that IOs are still active after this step. In fact, I think it’s deliberately made that way.
d) if any IO requests are still pending, a nonblocking, non-wait event timed io_getevents call is issued to pick up any finished IOs.
e) if any blocks still need writing for which no IO requests have been submitted, go to a).
f) if at this point IO requests are still pending, go to c).

This means that the database writer can submit huge amounts of IO requests, and keep on doing that, much more than any other process, because it doesn’t need to wait for all IOs to finish. So if there is a process that is likely to run into a blocking io_submit call, it’s the database writer.

When using a database without ASM, the above wait timing is exactly what happens. A function call graph of io_submit for the database writer when the database uses a filesystem looks like this:

 | | | | | > kslwtbctx(0x7ffc55eb3e60, 0x8b4, ...)
 | | | | | | > sltrgftime64(0x6c2f4288, 0x6bbe3ca0, ...)
 | | | | | | | > clock_gettime@plt(0x1, 0x7ffc55eb3400, ...)
 | | | | | | | | > clock_gettime(0x1, 0x7ffc55eb3400, ...)
 | | | | | | | | < clock_gettime+0x00000000005d returns: 0
 | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | |  kslwait_timeout_centi_to_micro(0x7fffffff, 0x19183e92, ...)
 | | | | | |  kskthbwt(0x19c37b0d8, 0xb3, ...)
 | | | | | |  kslwt_start_snapshot(0x6c2f5538, 0x6c2f5538, ...)
 | | | | | | < kslwt_start_snapshot+0x0000000000d0 returns: 0x6c2f4ae8
 | | | | |  ksfdgo(0x800, 0, ...)
 | | | | | | > ksfd_skgfqio(0x7fc304483f78, 0x9, ...)
 | | | | | | | > skgfqio(0x7fc3091fddc0, 0x7fc304483f78, ...)
 | | | | | | | | > skgfrvldtrq(0x7fc304483f78, 0x9, ...)
 | | | | | | | |  sltrgftime64(0x2000, 0x7fc3043772b0, ...)
 | | | | | | | | | > clock_gettime@plt(0x1, 0x7ffc55eae3b0, ...)
 | | | | | | | | | | > clock_gettime(0x1, 0x7ffc55eae3b0, ...)
 | | | | | | | | | | < clock_gettime+0x00000000005d returns: 0
 | | | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | | | |  skgfr_lio_listio64(0x7fc3091fddc0, 0x1, ...)
 | | | | | | | | | > io_submit@plt(0x7fc302992000, 0x115, ...)
 | | | | | | | | | < io_submit+0x000000000007 returns: 0x115
 | | | | | | | | < skgfr_lio_listio64+0x000000000131 returns: 0
 | | | | | | | < skgfqio+0x00000000035e returns: 0
 | | | | | | < ksfd_skgfqio+0x0000000001f5 returns: 0
 | | | | |  kslwtectx(0x7ffc55eb3e60, 0x7fc304483f78, ...)
 | | | | | | > sltrgftime64(0x7ffc55eb3e60, 0x7fc304483f78, ...)
 | | | | | | | > clock_gettime@plt(0x1, 0x7ffc55eb33e0, ...)
 | | | | | | | | > clock_gettime(0x1, 0x7ffc55eb33e0, ...)
 | | | | | | | | < clock_gettime+0x00000000005d returns: 0
 | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | |  kslwt_end_snapshot(0x6c2f5538, 0x6c2f5538, ...)
 | | | | | | | > kslwh_enter_waithist_int(0x6c2f5538, 0x6c2f5538, ...)
 | | | | | | |  kslwtrk_enter_wait_int(0x6c2f5538, 0x6c2f5538, ...)
 | | | | | | | < kslwtrk_enter_wait_int+0x000000000019 returns: 0x6bcaa180
 | | | | | |  kslwt_update_stats_int(0x6c2f5538, 0x6c2f5538, ...)
 | | | | | | | > kews_update_wait_time(0x9, 0x8f54, ...)
 | | | | | | |  ksucpu_wait_update(0x9, 0x8f54, ...)
 | | | | | | | < ksucpu_wait_update+0x000000000036 returns: 0x6bd658b0
 | | | | | |  kskthewt(0x19c38402c, 0xb3, ...)
 | | | | | | < kskthewt+0x0000000005b1 returns: 0x30
The second argument to kskthbwt and kskthewt (0xb3) is the wait event number; translating it shows which event was timed:

select event#, name from v$event_name where event# = to_number('b3','xx');

    EVENT# NAME
---------- ----------------------------------------------------------------
       179 db file async I/O submit

Now on to the actual purpose of this blog post: the same situation, but now when ASM is used. When ASM is used, there is a significant increase in the call stack, which means more code is executed. This may sound strange at first, but it’s very logical if you give it some thought: when using ASM, the Oracle database is talking to raw devices. This means that all the functionality a filesystem normally performs, and which ASM implements, must be performed somewhere. This is done in several additional layers in the database code.

Let’s look at a backtrace of io_submit of the database writer when using a filesystem:

#0  0x00007f22bdb36690 in io_submit () from /lib64/libaio.so.1
#1  0x0000000004832ef0 in skgfr_lio_listio64 ()
#2  0x000000001238b7ce in skgfqio ()
#3  0x0000000011d5c3ad in ksfd_skgfqio ()
#4  0x0000000011d57fce in ksfdgo ()
#5  0x0000000000d9f21c in ksfdaio ()
#6  0x00000000039c4a5e in kcfisio ()
#7  0x0000000001d836ec in kcbbdrv ()
#8  0x000000001222fac5 in ksb_act_run_int ()
#9  0x000000001222e792 in ksb_act_run ()
#10 0x0000000003b8b9ce in ksbabs ()
#11 0x0000000003baa161 in ksbrdp ()
#12 0x0000000003fbaed7 in opirip ()
#13 0x00000000026ecaa0 in opidrv ()
#14 0x00000000032904cf in sou2o ()
#15 0x0000000000d681cd in opimai_real ()
#16 0x000000000329d2a1 in ssthrdmain ()
#17 0x0000000000d680d3 in main ()

If you want to follow the call sequence, a backtrace/stacktrace must be read from the bottom up.
ksb = kernel service background processes
kcf = kernel cache file management
ksfd = kernel service functions disk IO
skgf = o/s dependent kernel generic file
I hope you recognise the logical layers that are necessary for doing the I/O.

Now look at a backtrace of io_submit of the database writer when using ASM:

#0  0x00007f22bdb36690 in io_submit () from /lib64/libaio.so.1
#1  0x0000000004832ef0 in skgfr_lio_listio64 ()
#2  0x000000001238b7ce in skgfqio ()
#3  0x0000000011d5c3ad in ksfd_skgfqio ()
#4  0x0000000011d57fce in ksfdgo ()
#5  0x0000000000d9f21c in ksfdaio ()
#6  0x000000000755c1a8 in kfk_ufs_async_io ()
#7  0x0000000001455fb2 in kfk_submit_io ()
#8  0x00000000014551a8 in kfk_io1 ()
#9  0x0000000001450b3e in kfk_transitIO ()
#10 0x000000000143c450 in kfioSubmitIO ()
#11 0x000000000143bbaa in kfioRequestPriv ()
#12 0x000000000143b160 in kfioRequest ()
#13 0x000000000136f6bd in ksfdafRequest ()
#14 0x000000000137311a in ksfdafGo ()
#15 0x0000000011d58179 in ksfdgo ()
#16 0x0000000000d9f269 in ksfdaio ()
#17 0x00000000039c4a5e in kcfisio ()
#18 0x0000000001d836ec in kcbbdrv ()
#19 0x000000001222fac5 in ksb_act_run_int ()
#20 0x000000001222e792 in ksb_act_run ()
#21 0x0000000003b8b9ce in ksbabs ()
#22 0x0000000003baa161 in ksbrdp ()
#23 0x0000000003fbaed7 in opirip ()
#24 0x00000000026ecaa0 in opidrv ()
#25 0x00000000032904cf in sou2o ()
#26 0x0000000000d681cd in opimai_real ()
#27 0x000000000329d2a1 in ssthrdmain ()
#28 0x0000000000d680d3 in main ()

Essentially, a couple of layers are added to facilitate ASM; ksfdaf, kfio, kfk.
So the logical sequence becomes:
ksb = kernel service background processes
kcf = kernel cache file management
ksfd = kernel service functions disk IO
ksfdaf = kernel service functions disk IO ASM files
kfio = kernel automatic storage management translation I/O layer
kfk = kernel automatic storage management KFK

ksfd = kernel service functions disk IO
skgf = o/s dependent kernel generic file

Now to give an overview of the function call sequence, I simply need to cut out a lot of functions because otherwise it would be unreadable.

 | | | | | > ksfdgo(0x806, 0x35b4, ...)
 | | | | | | > ksfdafGo(0x806, 0x35b4, ...)
 | | | | | | | > ksfdafRequest(0x7ffcc7d845a0, 0x10f, ...)
 | | | | | | | | > kfioRequest(0x7ffcc7d845a0, 0x10f, ...)
 | | | | | | | | | > _setjmp@plt(0x7ffcc7d821d8, 0x10f, ...)
 | | | | | | | | |  __sigsetjmp(0x7ffcc7d821d8, 0, ...)
 | | | | | | | | |  __sigjmp_save(0x7ffcc7d821d8, 0, ...)
 | | | | | | | | |  kfioRequestPriv(0x7ffcc7d845a0, 0x10f, ...)
...
 | | | | | | | | | | | | | | | | > ksfdgo(0x188, 0x35c5, ...)
 | | | | | | | | | | | | | | | | | > ksfd_skgfqio(0x7f4232709f78, 0x9, ...)
 | | | | | | | | | | | | | | | | | | > skgfqio(0x7f4237483dc0, 0x7f4232709f78, ...)
 | | | | | | | | | | | | | | | | | | | > skgfrvldtrq(0x7f4232709f78, 0x9, ...)
 | | | | | | | | | | | | | | | | | | |  sltrgftime64(0x2000, 0x7f4230b61c98, ...)
 | | | | | | | | | | | | | | | | | | | | > clock_gettime@plt(0x1, 0x7ffcc7d7bb10, ...)
 | | | | | | | | | | | | | | | | | | | | | > clock_gettime(0x1, 0x7ffcc7d7bb10, ...)
 | | | | | | | | | | | | | | | | | | | | | < clock_gettime+0x000000000059 returns: 0
 | | | | | | | | | | | | | | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | | | | | | | | | | | | | | |  skgfr_lio_listio64(0x7f4237483dc0, 0x1, ...)
 | | | | | | | | | | | | | | | | | | | | > io_submit@plt(0x7f4230ad8000, 0x112, ...)
 | | | | | | | | | | | | | | | | | | | | < io_submit+0x000000000007 returns: 0x112
 | | | | | | | | | | | | | | | | | | | < skgfr_lio_listio64+0x000000000131 returns: 0
 | | | | | | | | | | | | | | | | | | < skgfqio+0x00000000035e returns: 0
 | | | | | | | | | | | | | | | | | < ksfd_skgfqio+0x0000000001f5 returns: 0
 | | | | | | | | | | | | | | | | < ksfdgo+0x000000000135 returns: 0
...
 | | | | | | | | | < kfioRequestPriv+0x000000000224 returns: 0
 | | | | | | | | < kfioRequest+0x000000000251 returns: 0
 | | | | | | | < ksfdafRequest+0x0000000003c8 returns: 0
 | | | | | | < ksfdafGo+0x000000000081 returns: 0x1
 | | | | |  kslwtbctx(0x7ffcc7d86f60, 0x7f4232709f38, ...)
 | | | | | | > sltrgftime64(0x6da39e68, 0x6d2f5bc0, ...)
 | | | | | | | > clock_gettime@plt(0x1, 0x7ffcc7d86500, ...)
 | | | | | | | | > clock_gettime(0x1, 0x7ffcc7d86500, ...)
 | | | | | | | | < clock_gettime+0x000000000059 returns: 0
 | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | |  kslwait_timeout_centi_to_micro(0x7fffffff, 0x14cb3fcb, ...)
 | | | | | |  kskthbwt(0x2b0f3fe06, 0xb3, ...)
 | | | | | |  kslwt_start_snapshot(0x6da3b118, 0x6da3b118, ...)
 | | | | | | < kslwt_start_snapshot+0x0000000000d0 returns: 0x6da3a6c8
 | | | | |  ksfdgo(0x808, 0, ...)
 | | | | |  kslwtectx(0x7ffcc7d86f60, 0x9, ...)
 | | | | | | > sltrgftime64(0x7ffcc7d86f60, 0x9, ...)
 | | | | | | | > clock_gettime@plt(0x1, 0x7ffcc7d864e0, ...)
 | | | | | | | | > clock_gettime(0x1, 0x7ffcc7d864e0, ...)
 | | | | | | | | < clock_gettime+0x000000000059 returns: 0
 | | | | | | | < clock_gettime+0x00000000003a returns: 0
 | | | | | |  kslwt_end_snapshot(0x6da3b118, 0x6da3b118, ...)
 | | | | | | | > kslwh_enter_waithist_int(0x6da3b118, 0x6da3b118, ...)
 | | | | | | |  kslwtrk_enter_wait_int(0x6da3b118, 0x6da3b118, ...)
 | | | | | | | < kslwtrk_enter_wait_int+0x000000000019 returns: 0x6dacf1e8
 | | | | | |  kslwt_update_stats_int(0x6da3b118, 0x6da3b118, ...)
 | | | | | | | > kews_update_wait_time(0x9, 0xd02, ...)
 | | | | | | |  ksucpu_wait_update(0x9, 0xd02, ...)
 | | | | | | | < ksucpu_wait_update+0x000000000036 returns: 0x6db40f70
 | | | | | |  kskthewt(0x2b0f40b08, 0xb3, ...)
 | | | | | | < kskthewt+0x0000000005b1 returns: 0x30
 | | | | |  ksfdafCopyWaitCtx(0x7ffcc7d86f60, 0xb3, ...)
 | | | | | | > _intel_fast_memcpy(0x7ffcc7d86f60, 0x7f423270a848, ...)
 | | | | | |  _intel_fast_memcpy.P(0x7ffcc7d86f60, 0x7f423270a848, ...)
 | | | | | |  __intel_ssse3_rep_memcpy(0x7ffcc7d86f60, 0x7f423270a848, ...)
 | | | | | | < __intel_ssse3_rep_memcpy+0x00000000242e returns: 0x7ffcc7d86f60
 | | | | | < ksfdafCopyWaitCtx+0x000000000038 returns: 0x7ffcc7d86f60
 | | | | < ksfdaio+0x00000000055f returns: 0x7ffcc7d86f60
To show the problem, first attach to the database writer with oradebug and enable SQL trace including wait events:

SQL> oradebug setorapname dbw0
Oracle pid: 18, Unix process pid: 3617, image: oracle@o182-fs.local (DBW0)
SQL> oradebug event sql_trace wait=true
Statement processed.

Then go to the trace directory, and tail the database writer trace file.
Next, attach to the database writer with gdb, and break on the io_submit call and perform a sleep 1 (sleep for 1 second). This should add 1000000 microseconds to the waiting time, if the wait event includes the function we put the break on.

(gdb) break io_submit
Breakpoint 1 at 0x7f336b986690
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>shell sleep 1
>c
>end

Now continue the database writer, and execute a checkpoint (alter system checkpoint), and look at the wait events:

WAIT #0: nam='db file async I/O submit' ela= 2 requests=11 interrupt=0 timeout=0 obj#=-1 tim=15801301770
WAIT #0: nam='db file parallel write' ela= 5077 requests=1 interrupt=0 timeout=2147483647 obj#=-1 tim=15801306930

Well, it’s clear nothing has timed the one second we added, right? (the time in the wait event is at ‘ela’, which is in microseconds)

For the sake of completeness, and to validate this test method, let’s add the sleep to io_getevents (io_getevents_0_4) to see if ‘db file parallel write’ does show the extra time we added in the system call, because ‘db file parallel write’ is supposed to time io_getevents():

(gdb) dis 1
(gdb) break io_getevents_0_4
Breakpoint 2 at 0x7f336b986650
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>shell sleep 1
>c
>end

Continue the database writer again, and execute a checkpoint:

WAIT #0: nam='db file async I/O submit' ela= 1 requests=22 interrupt=0 timeout=0 obj#=-1 tim=15983030322
WAIT #0: nam='db file parallel write' ela= 1003978 requests=2 interrupt=0 timeout=2147483647 obj#=-1 tim=15984034336

Yay! There we got the artificial waiting time!

Based on this, I can only come to the conclusion that the wait event ‘db file async I/O submit’ does not perform any actual timing of the io_submit system call when ASM is used with the Oracle database.

Jonathan Lewis

Index Bouncy Scan 3

This is a follow-up to a problem I had with yesterday’s example of using recursive CTEs to “bounce” along a multi-column index to pick out the unique set of combinations of the first two columns. Part of the resulting query used a pair of aggregate scalar subqueries in a select list – and Andrew Sayer improved on my query by introducing a “cross apply” (which I simply hadn’t thought of) which the optimizer transformed into a lateral view (which I had thought of, but couldn’t get to work).

After seeing what Andrew and the optimizer had done I looked a little more closely at my lateral view experiment and modified it so that it worked. Here are the three critical versions of the relevant code fragment: first my original code, then Andrew’s cross apply, then my working lateral view version:

select
        (select min(t1.val1) val1 from t1 where t1.val1 > bounce1.val1) val1,
        (select min(t1.val2) val2 from t1 where t1.val1 > bounce1.val1 and rownum = 1) val2
from    bounce1
where   bounce1.val1 is not null
 
 
select
        ca.val1 ,ca.val2
from    bounce1
cross  apply (select val1, val2
              from  (select /*+ index(t1) no_index_ffs(t1) */
                             val1, val2
                     from    t1
                     where   t1.val1 > bounce1.val1
                     and     rownum = 1
                    )
             ) ca
where  bounce1.val1 is not null
 
----

select
        ca.val1 ,ca.val2
from    bounce1, 
        lateral(select val1, val2
              from  (select /*+ index(t1) no_index_ffs(t1) */
                             val1, val2
                     from    t1
                     where   t1.val1 > bounce1.val1
                     and     rownum = 1
                    )
             ) ca
where  bounce1.val1 is not null

All I’ve done to modify Andrew’s code is put a comma after the table (actually CTE) bounce1, then change “cross apply” to “lateral”. Compare the resulting text with the following lateral version that doesn’t work:


select
        ca.val1 ,ca.val2
from    bounce1, 
        lateral (
                   select /*+ index(t1) no_index_ffs(t1) */
                             val1, val2
                     from    t1
                     where   t1.val1 > bounce1.val1
                     and     rownum = 1
             ) ca
where  bounce1.val1 is not null

To get from not working to working all I’ve done is wrap the text in my lateral() subquery inside one more (apparently redundant) layer of “select * from ()”!

In fact my assumption that my code wasn’t working was incorrect – what was really going on was that the code I had written was producing wrong results, but I assumed that I had made a mistake in the way I was writing it and couldn’t figure out what I had done wrong.

Problem Solving:

To get a better idea of what’s going on, I took a closer look at the execution plans. Here are the plans (main body only) for the two variants of using the lateral() view – the first from the SQL with the “redundant” select, the second as I originally wrote it. Notice that the number of rows (A-Rows) returned in the first case is the 30 expected while in the second case it’s only 10.


---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name            | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |                 |      1 |        |   125 (100)|     30 |00:00:00.01 |      40 |     28 |       |       |          |
|   1 |  SORT ORDER BY                               |                 |      1 |      4 |   125   (3)|     30 |00:00:00.01 |      40 |     28 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                       |                 |      1 |      4 |   124   (2)|     30 |00:00:00.01 |      40 |     28 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST  |                 |      1 |        |            |     33 |00:00:00.01 |      40 |     28 |  1024 |  1024 |          |
|*  4 |     VIEW                                     |                 |      1 |      2 |    61   (2)|      3 |00:00:00.01 |       8 |      4 |       |       |          |
|   5 |      UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |      3 |00:00:00.01 |       8 |      4 |  1024 |  1024 |          |
|*  6 |       VIEW                                   |                 |      1 |      1 |    29   (0)|      1 |00:00:00.01 |       2 |      1 |       |       |          |
|*  7 |        WINDOW NOSORT STOPKEY                 |                 |      1 |  10000 |    29   (0)|      1 |00:00:00.01 |       2 |      1 | 73728 | 73728 |          |
|   8 |         INDEX FULL SCAN                      | T1_PK           |      1 |  10000 |    29   (0)|      2 |00:00:00.01 |       2 |      1 |       |       |          |
|   9 |       NESTED LOOPS                           |                 |      3 |      1 |    31   (0)|      2 |00:00:00.01 |       6 |      3 |       |       |          |
|  10 |        RECURSIVE WITH PUMP                   |                 |      3 |        |            |      3 |00:00:00.01 |       0 |      0 |       |       |          |
|  11 |        VIEW                                  | VW_LAT_1BBF5C63 |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       6 |      3 |       |       |          |
|  12 |         VIEW                                 |                 |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       6 |      3 |       |       |          |
|* 13 |          COUNT STOPKEY                       |                 |      3 |        |            |      2 |00:00:00.01 |       6 |      3 |       |       |          |
|* 14 |           INDEX RANGE SCAN                   | T1_PK           |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       6 |      3 |       |       |          |
|  15 |     SORT AGGREGATE                           |                 |     30 |      1 |            |     30 |00:00:00.01 |      32 |     24 |       |       |          |
|  16 |      FIRST ROW                               |                 |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |     24 |       |       |          |
|* 17 |       INDEX RANGE SCAN (MIN/MAX)             | T1_PK           |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |     24 |       |       |          |
|  18 |     RECURSIVE WITH PUMP                      |                 |     11 |        |            |     30 |00:00:00.01 |       0 |      0 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------


------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name            | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |                 |      1 |        |   125 (100)|     10 |00:00:00.01 |      16 |       |       |          |
|   1 |  SORT ORDER BY                               |                 |      1 |      4 |   125   (3)|     10 |00:00:00.01 |      16 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                       |                 |      1 |      4 |   124   (2)|     10 |00:00:00.01 |      16 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST  |                 |      1 |        |            |     11 |00:00:00.01 |      16 |  1024 |  1024 |          |
|*  4 |     VIEW                                     |                 |      1 |      2 |    61   (2)|      1 |00:00:00.01 |       4 |       |       |          |
|   5 |      UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |      1 |00:00:00.01 |       4 |  1024 |  1024 |          |
|*  6 |       VIEW                                   |                 |      1 |      1 |    29   (0)|      1 |00:00:00.01 |       2 |       |       |          |
|*  7 |        WINDOW NOSORT STOPKEY                 |                 |      1 |  10000 |    29   (0)|      1 |00:00:00.01 |       2 | 73728 | 73728 |          |
|   8 |         INDEX FULL SCAN                      | T1_PK           |      1 |  10000 |    29   (0)|      2 |00:00:00.01 |       2 |       |       |          |
|   9 |       NESTED LOOPS                           |                 |      1 |      1 |    31   (0)|      0 |00:00:00.01 |       2 |       |       |          |
|  10 |        RECURSIVE WITH PUMP                   |                 |      1 |        |            |      1 |00:00:00.01 |       0 |       |       |          |
|* 11 |        VIEW                                  | VW_DCL_1BBF5C63 |      1 |      1 |     2   (0)|      0 |00:00:00.01 |       2 |       |       |          |
|* 12 |         COUNT STOPKEY                        |                 |      1 |        |            |      1 |00:00:00.01 |       2 |       |       |          |
|  13 |          INDEX FULL SCAN                     | T1_PK           |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       2 |       |       |          |
|  14 |     SORT AGGREGATE                           |                 |     10 |      1 |            |     10 |00:00:00.01 |      12 |       |       |          |
|  15 |      FIRST ROW                               |                 |     10 |      1 |     2   (0)|      9 |00:00:00.01 |      12 |       |       |          |
|* 16 |       INDEX RANGE SCAN (MIN/MAX)             | T1_PK           |     10 |      1 |     2   (0)|      9 |00:00:00.01 |      12 |       |       |          |
|  17 |     RECURSIVE WITH PUMP                      |                 |     11 |        |            |     10 |00:00:00.01 |       0 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------------------------

Most importantly we can see that the optimizer has used two different transformations. For the working query we see the view name VW_LAT_xxxxxxxx at operation 11, this is Oracle implementing a lateral view; for the problem query we see the view name VW_DCL_xxxxxxxx at operation 11, which is Oracle implementing a transformation to a “decorrelated lateral view”.

My first test after noting this difference was to see what would happen if I added the hint /*+ no_query_transformation */ to the query: it resulted in the VW_DCL_xxxxxxxx view name changing to VW_LAT_xxxxxxxx and the query producing the right result. Andrew Sayer, on the ODC thread, then pointed out that he’d done a couple more experiments and used the /*+ no_decorrelate() */ hint, so I tried that with my query, adding it (with no parameters) to the subquery inside the lateral() clause – again the plan changed from using VW_DCL to VW_LAT and the results were correct.
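
For reference, this is roughly how the hinted version of my problem query looked – a sketch based on the description above, with the original index hints kept in place:

select
        ca.val1 ,ca.val2
from    bounce1,
        lateral (
                   select /*+ no_decorrelate index(t1) no_index_ffs(t1) */
                             val1, val2
                     from    t1
                     where   t1.val1 > bounce1.val1
                     and     rownum = 1
             ) ca
where  bounce1.val1 is not null
;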

Test Case

Bottom line on this – it looks like the optimizer is decorrelating a subquery when it shouldn’t, leading to wrong results. To make it easier to see this anomaly I stripped the original sample down to a basic test case starting with the table that I used in the previous posting:

rem
rem     Script:         decorralate.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2018
rem
rem     Last tested 
rem             18.1.0.0  -- via liveSQL
rem             12.2.0.1
rem             12.1.0.2
rem

create table t1
segment creation immediate
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        mod(rownum-1,3)                 val1,
        mod(rownum-1,10)                val2,
        lpad('x',100,'x')               padding
from
        generator       v1
order by
        dbms_random.value
;

begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

alter table t1 add constraint t1_pk primary key(val1, val2, id);

Now two versions of a simplified piece of code that should select the distinct values of val1 greater than the lowest value (each row in the UNION ALL of dual is emulating the way in which yesterday’s recursive CTE was effectively saying “this is a current known value, find the next higher”):


prompt  =============
prompt  Right results
prompt  =============

select
        v1.val1, v1.val2
from    (
        select  0 val1, 0 val2 from dual
        union all
        select 1,0 from dual
        union all
        select 2,0 from dual
        ) bounce1,
        lateral (
            select val1, val2 from (
              select  /*+ index(t1) no_index_ffs(t1) */
                      t1.val1, t1.val2
              from    t1
              where   t1.val1 > bounce1.val1
              and     rownum = 1
            )
        ) v1
;

prompt  ===========================================
prompt  Wrong results -- "redundant" select removed
prompt  ===========================================

select
        v1.val1, v1.val2
from    (
        select  0 val1, 0 val2 from dual
        union all
        select 1,0 from dual
        union all
        select 2,0 from dual
        ) bounce1,
        lateral (
            -- select val1, val2 from (
              select  /*+ index(t1) no_index_ffs(t1) */
                      t1.val1, t1.val2
              from    t1
              where   t1.val1 > bounce1.val1
              and     rownum = 1
            -- )
        ) v1
;

Here’s a cut-n-paste from running the two queries:


=============
Right results
=============

      VAL1       VAL2
---------- ----------
         1          0
         2          0

2 rows selected.

============================================
Wrong results  -- "redundant" select removed
============================================

no rows selected

Finally, to get an idea of what’s gone wrong – and to show that the optimizer has done something wrong when attempting to decorrelate – we can take a look at the optimizer trace file to see the final transformed SQL that the optimizer has produced a plan for. (I enabled the trace with the command “alter session set events ‘trace [rdbms.SQL_Transform.*]’;” to limit the trace to just the information about optimizer transformations.) This – cosmetically altered – is the final “unparsed” query:

select 
        vw_dcl_a18161ff.val1 val1,
        vw_dcl_a18161ff.val2 val2 
from    ( 
                (select 0 val1 from sys.dual dual) 
                union all  
                (select 1 1 from sys.dual dual) 
                union all  
                (select 2 2 from sys.dual dual)
        ) bounce1, 
        (
        select
                 /*+ no_index_ffs (t1) index (t1) */ 
                t1.val1 val1_0,
                t1.val2 val2_1 
        from
                test_user.t1 t1
        where 
                rownum = 1
        ) vw_dcl_a18161ff 
where 
        vw_dcl_a18161ff.val1 > bounce1.val1

As you can see, the lateral view has turned into a non-mergeable inline view which selects the first row available from t1 by following the supplied hints, and joins that single row result set to bounce1. Critically, the correlation predicate “t1.val1 > bounce1.val1” has been pulled out of the view and is applied after the “rownum = 1” predicate, which is why the results change. I have a suspicion that lateral views which include rownum predicates should not be decorrelated. I have looked on MoS to see if I can find any bugs related to decorrelating lateral views, but either there are none or my search terms weren’t good enough.

 

Jonathan Lewis

Upgrades

One of my maxims for Oracle performance is: “Don’t try to be too clever”. Apart from the obvious reason that no-one else may be able to understand how to modify your code if the requirements change at a future date, there’s always the possibility that an Oracle upgrade will mean some clever trick you implemented will simply stop working.

While searching for information about a possible Oracle bug recently I noticed the following fix control (v$system_fix_control) in 12.2.0.1:


     BUGNO OPTIMIZE SQL_FEATURE                        DESCRIPTION                                                             VALUE
---------- -------- ---------------------------------- ---------------------------------------------------------------- ------------
  18385778          QKSFM_CARDINALITY_18385778         avoid virtual col usage if FI is unusable or invisible 

Maybe that’s just invalidated an idea I published 12 years ago.

I haven’t researched the bug or any underlying SR, but I can think of valid arguments both for and against the fix as described.
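
If you want to check the state of this fix control on your own system, here is a quick sketch of a query:

select  bugno, value, sql_feature, description, optimizer_feature_enable
from    v$system_fix_control
where   bugno = 18385778
;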

 

 

Jonathan Lewis

Index Bouncy Scan 2

I wrote a note some time last year about taking advantage of the “index range scan (min/max)” operation in a PL/SQL loop to find the small number of distinct values in a large single-column index efficiently (for example an index that was not very efficient but existed to avoid the “foreign key locking” problem). The resulting comments included pointers to other articles that showed pure SQL solutions to the same problem using recursive CTEs (“with” subqueries) from Markus Winand and Sayan Malakshinov: both writers also show examples of extending the technique to cover more cases than the simple list of distinct values.

The topic came up again on the ODC (OTN) database forum a couple of days ago; one of the replies linked back to my original posting, another gave the recursive solution for a single column index – so I ended up seeing the following question twice, once as a comment on my blog, once in the forum: “Can you extend this method to a two column index, what about an N column index ?”

Here’s a walk-through of working out one possible solution for the two-column requirement – how to find all the distinct combinations for the first two columns of a very large index without having to scan and aggregate the whole index. We start with a suitable table and index.


rem
rem     Script:         bouncy_index.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Apr 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1
rem             12.1.0.2
rem             11.2.0.4
rem

create table t1
segment creation immediate
nologging
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        mod(rownum-1,3)                 val1,
        mod(rownum-1,10)                val2,
        lpad('x',100,'x')               padding
from
        generator       v1
order by
        dbms_random.value
;

begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

alter table t1 add constraint t1_pk primary key(val1, val2, id);

I’ve created a table with 3 values for val1, 10 values for val2, with a total of 30 combinations. The addition of the primary key starting with (val1, val2) is just a lazy way to ensure that I have a suitable index AND val1 and val2 are both declared not null.
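As a quick sanity check of that claim (this query is my illustration, not part of the original script):

select  val1, count(distinct val2) distinct_val2
from    t1
group by
        val1
order by
        val1
;

This should return three rows, each reporting 10 distinct values of val2.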

With this data my first step will be to demonstrate the recursive CTE (“with” subquery) used by Andrew Sayer in the ODC posting to get the distinct values for val1, using three “index range scan (min/max)” probes. I’ve included the in-memory execution plan with rowsource execution stats to show that this does a minimal amount of work.

The results in this note come from 12.2.0.1:


set serveroutput off
alter session set statistics_level = all;

with bouncy (val1)
as (
        select  min(val1) val1
        from    t1
        union all
        select  (select min(t1.val1) val1 from t1 where t1.val1 > bouncy.val1) val1
        from    bouncy
        where   bouncy.val1 is not null
    )
select  *
from    bouncy
where   bouncy.val1 is not null
order by
        val1
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

---------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |       |      1 |        |    19 (100)|      3 |00:00:00.01 |       7 |      4 |       |       |          |
|   1 |  SORT ORDER BY                             |       |      1 |      2 |    19   (6)|      3 |00:00:00.01 |       7 |      4 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                     |       |      1 |      2 |    18   (0)|      3 |00:00:00.01 |       7 |      4 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST|       |      1 |        |            |      4 |00:00:00.01 |       7 |      4 |  1024 |  1024 |          |
|   4 |     SORT AGGREGATE                         |       |      1 |      1 |            |      1 |00:00:00.01 |       2 |      1 |       |       |          |
|   5 |      INDEX FULL SCAN (MIN/MAX)             | T1_PK |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       2 |      1 |       |       |          |
|   6 |     SORT AGGREGATE                         |       |      3 |      1 |            |      3 |00:00:00.01 |       5 |      3 |       |       |          |
|   7 |      FIRST ROW                             |       |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       5 |      3 |       |       |          |
|*  8 |       INDEX RANGE SCAN (MIN/MAX)           | T1_PK |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       5 |      3 |       |       |          |
|   9 |     RECURSIVE WITH PUMP                    |       |      4 |        |            |      3 |00:00:00.01 |       0 |      0 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("BOUNCY"."VAL1" IS NOT NULL)
   8 - access("T1"."VAL1">:B1)

As you can see I’ve done an “index full scan (min/max)” as the first step of the recursive query, visiting just two buffered blocks (the index leaf-block count is 27 – roughly 9 per value of val1 – so Oracle is clearly doing an efficient access for that value; it’s not really a “full” scan). We then see 3 “index range scan (min/max)” probes at roughly 2 buffer visits each to collect the remaining values. (There’s probably a small saving in buffer gets due to the pinning that takes place.)
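The leaf-block count quoted above comes from the index statistics; if you want to check the figure yourself, something like this (not part of the original script) will show it:

select  index_name, blevel, leaf_blocks, distinct_keys
from    user_indexes
where   index_name = 'T1_PK'
;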

So we can get the val1 values very easily and efficiently with this recursive CTE technology. Let’s write some code that uses the same technology to find the val2 values for each possible val1 value in turn:

with bounce2 (val1, val2)
as (
        select val1, val2 from (
                select  0 val1, 0 val2 from dual
                union all
                select 1,0 from dual
                union all
                select 2,0 from dual
        )
        union all
        select  bounce2.val1, (select min(t1.val2) val2 from t1 where t1.val1 = bounce2.val1 and t1.val2 > bounce2.val2) val2
        from    bounce2
        where   bounce2.val2 is not null
--      and     bounce2.val1 is not null
)
select * from bounce2
where
        bounce2.val2 is not null
and     bounce2.val1 is not null        -- > redundant predicate
order by
        val1, val2
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

---------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |       |      1 |        |    27 (100)|     30 |00:00:00.01 |      32 |     24 |       |       |          |
|   1 |  SORT ORDER BY                             |       |      1 |      6 |    27   (4)|     30 |00:00:00.01 |      32 |     24 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                     |       |      1 |      6 |    26   (0)|     30 |00:00:00.01 |      32 |     24 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST|       |      1 |        |            |     33 |00:00:00.01 |      32 |     24 |  1024 |  1024 |          |
|   4 |     VIEW                                   |       |      1 |      3 |     6   (0)|      3 |00:00:00.01 |       0 |      0 |       |       |          |
|   5 |      UNION-ALL                             |       |      1 |        |            |      3 |00:00:00.01 |       0 |      0 |       |       |          |
|   6 |       FAST DUAL                            |       |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       0 |      0 |       |       |          |
|   7 |       FAST DUAL                            |       |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       0 |      0 |       |       |          |
|   8 |       FAST DUAL                            |       |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       0 |      0 |       |       |          |
|   9 |     SORT AGGREGATE                         |       |     30 |      1 |            |     30 |00:00:00.01 |      32 |     24 |       |       |          |
|  10 |      FIRST ROW                             |       |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |     24 |       |       |          |
|* 11 |       INDEX RANGE SCAN (MIN/MAX)           | T1_PK |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |     24 |       |       |          |
|  12 |     RECURSIVE WITH PUMP                    |       |     11 |        |            |     30 |00:00:00.01 |       0 |      0 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("BOUNCE2"."VAL2" IS NOT NULL AND "BOUNCE2"."VAL1" IS NOT NULL))
  11 - access("T1"."VAL1"=:B1 AND "T1"."VAL2">:B2)


In this version of the code the second half of the CTE looks remarkably similar to the previous statement – except the CTE now carries two columns and the subquery includes an equality predicate against val1, the first of the two columns. In the first half of the code I’ve cheated (as a temporary measure) and supplied three rows of data listing the three distinct values of val1 with their associated minimum values for val2.

The execution plan shows that I’ve done 30 “index range scan (min/max)” probes of the index with 32 buffer visits – and that’s exactly the right number of probes to return my result set. So if I can manage to generate the starting values efficiently I can execute the whole query efficiently. Let’s turn that “union all on dual” fudge into a generic statement by replacing it with a recursive CTE:


with bounce1(val1, val2) as (
        select val1, val2 
        from    (
                select
                        /*+ index(t1) */
                        val1, val2,
                        row_number() over(order by val1, val2) rn
                from    t1
        )
        where
                rn = 1
        union all
        select
                (select min(t1.val1) val1 from t1 where t1.val1 > bounce1.val1) val1,
                (select min(t1.val2) val2 from t1 where t1.val1 > bounce1.val1 and rownum = 1) val2
        from    bounce1
        where   bounce1.val1 is not null
),
bounce2 (val1, val2)
as (
        select  val1, val2 
        from    bounce1
--      where   bounce1.val1 is not null
        union all
        select  bounce2.val1, (select min(t1.val2) val2 from t1 where t1.val1 = bounce2.val1 and t1.val2 > bounce2.val2) val2
        from    bounce2
        where   bounce2.val2 is not null
--      and     bounce2.val1 is not null
)
select * from bounce2
where
        bounce2.val2 is not null
and     bounce2.val1 is not null        -- > redundant predicate
order by
        val1, val2
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

--------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |       |      1 |        |   189 (100)|     30 |00:00:00.01 |      45 |       |       |          |
|   1 |  SORT ORDER BY                               |       |      1 |      4 |   189   (2)|     30 |00:00:00.01 |      45 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                       |       |      1 |      4 |   188   (2)|     30 |00:00:00.01 |      45 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST  |       |      1 |        |            |     34 |00:00:00.01 |      45 |  1024 |  1024 |          |
|   4 |     VIEW                                     |       |      1 |      2 |    87   (2)|      4 |00:00:00.01 |      13 |       |       |          |
|   5 |      UNION ALL (RECURSIVE WITH) BREADTH FIRST|       |      1 |        |            |      4 |00:00:00.01 |      13 |  1024 |  1024 |          |
|*  6 |       VIEW                                   |       |      1 |      1 |    29   (0)|      1 |00:00:00.01 |       2 |       |       |          |
|*  7 |        WINDOW NOSORT STOPKEY                 |       |      1 |  10000 |    29   (0)|      1 |00:00:00.01 |       2 | 73728 | 73728 |          |
|   8 |         INDEX FULL SCAN                      | T1_PK |      1 |  10000 |    29   (0)|      2 |00:00:00.01 |       2 |       |       |          |
|   9 |       SORT AGGREGATE                         |       |      3 |      1 |            |      3 |00:00:00.01 |       5 |       |       |          |
|  10 |        FIRST ROW                             |       |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       5 |       |       |          |
|* 11 |         INDEX RANGE SCAN (MIN/MAX)           | T1_PK |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       5 |       |       |          |
|  12 |       SORT AGGREGATE                         |       |      3 |      1 |            |      3 |00:00:00.01 |       6 |       |       |          |
|* 13 |        COUNT STOPKEY                         |       |      3 |        |            |      2 |00:00:00.01 |       6 |       |       |          |
|* 14 |         INDEX RANGE SCAN                     | T1_PK |      3 |    500 |     2   (0)|      2 |00:00:00.01 |       6 |       |       |          |
|  15 |       RECURSIVE WITH PUMP                    |       |      4 |        |            |      3 |00:00:00.01 |       0 |       |       |          |
|  16 |     SORT AGGREGATE                           |       |     30 |      1 |            |     30 |00:00:00.01 |      32 |       |       |          |
|  17 |      FIRST ROW                               |       |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |       |       |          |
|* 18 |       INDEX RANGE SCAN (MIN/MAX)             | T1_PK |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |       |       |          |
|  19 |     RECURSIVE WITH PUMP                      |       |     11 |        |            |     30 |00:00:00.01 |       0 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("BOUNCE2"."VAL2" IS NOT NULL AND "BOUNCE2"."VAL1" IS NOT NULL))
   6 - filter("RN"=1)
   7 - filter(ROW_NUMBER() OVER ( ORDER BY "VAL1","VAL2")<=1)
  11 - access("T1"."VAL1">:B1)
  13 - filter(ROWNUM=1)
  14 - access("T1"."VAL1">:B1)
  18 - access("T1"."VAL1"=:B1 AND "T1"."VAL2">:B2)


Again we see 30 probes using “index range scan (min/max)” with 32 buffer gets to get 30 rows; plus a further 13 buffer gets to generate the three driving rows. The 13 buffer gets break down to: 2 to get the minimum (val1, val2) combination using an “index full scan (min/max)”, then 5 for the probes to get the three minimum values for val1, and 6 for the probes to get the three corresponding minimum values of val2.

You’ll notice that I’ve got various “is not null” predicates scattered throughout the code. In some cases this is to stop Oracle from running into an infinite loop and reporting the error “ORA-32044: cycle detected while executing recursive WITH query”. This happens because a “(select min()…)” scalar subquery in the select list returns a null if it finds no data, and that null would be fed into the next cycle of the recursive descent – hence the infinite recursion. In other cases the “is not null” predicates are just my default pattern for recursive CTEs, and some of them could probably be removed with no change in meaning (or workload).
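As a minimal illustration of the trap (a sketch of my own, not from the original article), removing the guard from the first recursive query above is enough to trigger the error:

with bouncy (val1)
as (
        select  min(val1) val1
        from    t1
        union all
        select  (select min(t1.val1) val1 from t1 where t1.val1 > bouncy.val1) val1
        from    bouncy
        -- the "where bouncy.val1 is not null" guard has been removed:
        -- the null produced when no higher value exists is pumped back
        -- into the next cycle and the query fails with ORA-32044
)
select * from bouncy
;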

The /*+ index() */ hint in the starting point for bounce1 was necessary to avoid an “index fast full scan” in 12.2; but that was purely a case of the statistics – number of distinct values, leaf_block count, etc – making the optimizer pick an option that was appropriate for this tiny data set, but not appropriate for the demonstration.  In fact this looks like the side effect of two defects in the 12.1 optimizer code, of which only one has been fixed in 12.2.

Optimizer Limitations

Here’s an extract from the execution plan for the final query with an /*+ index(t1) */ hint in place. The extract is identical for 12.1.0.2 and 12.2.0.1:

--------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------
...
|*  6 |       VIEW                                   |       |      1 |      1 |    29   (0)|      1 |00:00:00.01 |       2 |       |       |          |
|*  7 |        WINDOW NOSORT STOPKEY                 |       |      1 |  10000 |    29   (0)|      1 |00:00:00.01 |       2 | 73728 | 73728 |          |
|   8 |         INDEX FULL SCAN                      | T1_PK |      1 |  10000 |    29   (0)|      2 |00:00:00.01 |       2 |       |       |          |

You’ll notice the Cost at operation 8 is appropriate for a real (i.e. all leaf blocks) full scan of the index. (The leaf_block value was 27 as I mentioned earlier on). You’ll also see that the OMem (PGA requirement for optimum workarea operation) figure is consistent with Oracle processing 10,000 rows in the index. Since the optimizer managed to work out that it could do a full scan with nosort and stopkey it looks a little surprising that the algorithms didn’t manage to make some allowance for the limited access that would occur. (I’d view this as a current limitation, rather than a bug, though).

Now compare the equivalent extracts when we hint an index fast full scan 12.1.0.2 first, then 12.2.0.1:

12.1.0.2
--------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------
...
|*  6 |       VIEW                                   |       |      1 |      1 |    39   (8)|      1 |00:00:00.03 |      32 |       |       |          |
|*  7 |        WINDOW SORT PUSHED RANK               |       |      1 |  10000 |    39   (8)|      1 |00:00:00.03 |      32 |  2048 |  2048 | 2048  (0)|
|   8 |         INDEX FAST FULL SCAN                 | T1_PK |      1 |  10000 |     5   (0)|  10000 |00:00:00.01 |      32 |       |       |          |

12.2.0.1
--------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------
...
|*  6 |       VIEW                                   |       |      1 |      1 |     7  (29)|      1 |00:00:00.01 |      34 |       |       |          |
|*  7 |        WINDOW SORT PUSHED RANK               |       |      1 |  10000 |     7  (29)|      1 |00:00:00.01 |      34 |  2048 |  2048 | 2048  (0)|
|   8 |         INDEX FAST FULL SCAN                 | T1_PK |      1 |  10000 |     5   (0)|  10000 |00:00:00.01 |      34 |       |       |          |

In both cases the cost of the index fast full scan is the same – and much cheaper; but in 12.1.0.2 the cost of the query looks as if it is allowing for sorting (and spilling) the entire 10,000 rows returned from the index fast full scan (even though the OMem figure indicates otherwise), while the cost in 12.2.0.1 looks as if it recognises that it just has to do a running comparison through the data set as it arrives, keeping only the current minimum in memory at any moment. The latter clearly matches our expectations of how Oracle ought to behave, which is why I’d call this a bug in 12.1, fixed in 12.2.

The dramatic change in the cost of operation 7 on the upgrade explains the change in plan and the necessity for the /*+ index(t1) */ hint – but if the “first row” predicate were also reflected in the costing then the cost of the “stopkey” index full scan would (probably) drop to 2 and the original 12.1 path would re-appear.

Footnote

I don’t think there’s a lot of scope for improving the efficiency of this query for getting the (relatively) small number of distinct combinations from the first two columns of a very large index – but there are some very clever SQL bunnies on the ODC forum, so I won’t be surprised if someone comes up with a better solution.

Update

Well it didn’t take very long for someone to improve my SQL. Andrew Sayer took advantage of the “cross apply” feature of Oracle 12c to get rid of that nasty little bit of SQL where I’d used two scalar subqueries in the select list of the driving CTE. Here are the before and after versions of that fragment:


        select
                (select min(t1.val1) val1 from t1 where t1.val1 > bounce1.val1) val1,
                (select min(t1.val2) val2 from t1 where t1.val1 > bounce1.val1 and rownum = 1) val2
        from    bounce1
        where   bounce1.val1 is not null


        select
                ca.val1 ,ca.val2
        from    bounce1
        cross  apply (select val1, val2
                      from  (select /*+ index(t1) no_index_ffs(t1) */
                                     val1, val2
                             from    t1
                             where   t1.val1 > bounce1.val1
                             and     rownum = 1
                            )
                     ) ca
        where  bounce1.val1 is not null

This “cross apply” has the effect of running a correlated subquery for every row selected from (this level of) bounce1 and then joining the results back to (this level of) bounce1. With this change in place (and with my original data set) the following plan appears:


------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name            | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |                 |      1 |        |   161 (100)|     30 |00:00:00.01 |      40 |       |       |          |
|   1 |  SORT ORDER BY                               |                 |      1 |      4 |   161   (2)|     30 |00:00:00.01 |      40 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                       |                 |      1 |      4 |   160   (2)|     30 |00:00:00.01 |      40 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST  |                 |      1 |        |            |     33 |00:00:00.01 |      40 |  1024 |  1024 |          |
|*  4 |     VIEW                                     |                 |      1 |      2 |    73   (2)|      3 |00:00:00.01 |       8 |       |       |          |
|   5 |      UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |      3 |00:00:00.01 |       8 |  1024 |  1024 |          |
|*  6 |       VIEW                                   |                 |      1 |      1 |    29   (0)|      1 |00:00:00.01 |       2 |       |       |          |
|*  7 |        WINDOW NOSORT STOPKEY                 |                 |      1 |  10000 |    29   (0)|      1 |00:00:00.01 |       2 | 73728 | 73728 |          |
|   8 |         INDEX FULL SCAN                      | T1_PK           |      1 |  10000 |    29   (0)|      2 |00:00:00.01 |       2 |       |       |          |
|   9 |       NESTED LOOPS                           |                 |      3 |      1 |    43   (0)|      2 |00:00:00.01 |       6 |       |       |          |
|  10 |        RECURSIVE WITH PUMP                   |                 |      3 |        |            |      3 |00:00:00.01 |       0 |       |       |          |
|  11 |        VIEW                                  | VW_LAT_A83890C2 |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       6 |       |       |          |
|  12 |         VIEW                                 |                 |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       6 |       |       |          |
|* 13 |          COUNT STOPKEY                       |                 |      3 |        |            |      2 |00:00:00.01 |       6 |       |       |          |
|* 14 |           INDEX RANGE SCAN                   | T1_PK           |      3 |      1 |     2   (0)|      2 |00:00:00.01 |       6 |       |       |          |
|  15 |     SORT AGGREGATE                           |                 |     30 |      1 |            |     30 |00:00:00.01 |      32 |       |       |          |
|  16 |      FIRST ROW                               |                 |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |       |       |          |
|* 17 |       INDEX RANGE SCAN (MIN/MAX)             | T1_PK           |     30 |      1 |     2   (0)|     27 |00:00:00.01 |      32 |       |       |          |
|  18 |     RECURSIVE WITH PUMP                      |                 |     11 |        |            |     30 |00:00:00.01 |       0 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("BOUNCE2"."VAL2" IS NOT NULL AND "BOUNCE2"."VAL1" IS NOT NULL))
   4 - filter("BOUNCE1"."VAL1" IS NOT NULL)
   6 - filter("RN"=1)
   7 - filter(ROW_NUMBER() OVER ( ORDER BY "VAL1","VAL2")<=1)
  13 - filter(ROWNUM=1)
  14 - access("T1"."VAL1">"BOUNCE1"."VAL1")
  17 - access("T1"."VAL1"=:B1 AND "T1"."VAL2">:B2)

If you compare this with my final plan further up the page you can see that operations 9 – 14 look completely different: where my plan shows two “sort aggregate” probes against t1_pk, Andrew’s plan does an interesting “nested loops” join driven by the “recursive with pump” that effectively halves the work done in this section of the plan.

Another little detail about this plan that I found interesting was that the “cross apply” had been converted to a “lateral join” internally – note the VW_LAT_xxxx view name. This was a little irritating because I had actually tried to write the query with a lateral join in the first place and ended up getting the wrong results. I’ve got a follow-up posting about this – but (spoiler alert) I think it means I’ve found another bug.

fritshoogland's picture

All about ansible vault

This blogpost is about using ansible vault. Vault is a way to protect sensitive information in ansible scripts by encrypting it. The motivation for this blogpost is the lack of a description, in terms that make sense to me, of what the possibilities are for using vault and how to use the vault options in playbooks.

The basic way ansible vault works is this: when ansible-playbook reads a yaml file and encounters $ANSIBLE_VAULT;1.1;AES256, indicating that ansible vault was used to encrypt the directly following lines, it uses a password to decrypt them, and keeps the decrypted version in memory only. This way secrets can be hidden from view – although, obviously, anyone who has the password can decrypt them, and the password must be supplied for ansible-playbook to do the decryption.

The original use of vault is to encrypt an entire yaml file. As of Ansible version 2.3, ansible allows the encryption of single values in a yaml file.

1. File encryption
A very silly example of ansible-vault, which shouldn’t be used (!) is this:

$ echo "---
- hosts: localhost
  vars:
    a: aaa
    b: bbb
  tasks:

  - name: execute a shell command
    shell: >
      ls" > test1.yml

When this runs, it will execute on localhost, set two ansible variables (a and b) and then use the shell module to execute “ls”. Again, this is just an example, and is silly in itself. This is how it’s run:

$ ansible-playbook test1.yml
PLAY [localhost] ***********************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] *********************************************************************************************************
changed: [localhost]

PLAY RECAP *****************************************************************************************************************************
localhost                  : ok=2    changed=1    unreachable=0    failed=0

This can be encrypted using ansible vault in the following way:

$ ansible-vault encrypt test1.yml
New Vault password:
Confirm new Vault password:
Encryption successful

Now the yaml file is entirely encrypted:

$ cat test1.yml
$ANSIBLE_VAULT;1.1;AES256
32663536343532643864353135643431373830333236636230613264383531336562323530386632
3933353561316530333538363032633731336135396434630a626631303839646161363432396230
32336130633264343766393562656138363132393835646137633065393739336234336463336138
6564613730333436650a336537326233343566633761383436356136666533326561346530613566
64316336393036346538656338303734623161386435373138656532663737666662633765656438
37303666393939356161393134666130303435333463636630386434626335383164663761383035
35613437356361366665633931346664386535393936396637363561386164613337373562313264
63333530353263363437386536393765393930653234353733313433643637333932353638353339
31623366376566613766366236356163303837666664646465363439313265383562656637623266
6463663065383030626232643365623437666332313631376262

It still can be run, but you need to specify the password. In this case it asks to enter a password:

$ ansible-playbook test1.yml --ask-vault-pass
Vault password:

PLAY [localhost] ***********************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] *********************************************************************************************************
changed: [localhost]

PLAY RECAP *****************************************************************************************************************************
localhost                  : ok=2    changed=1    unreachable=0    failed=0

Of course this is not practical. The most common and probable way you want to use this is to have variables with secret information – passwords, tokens, private keys, personal data, etc. – encrypted, while the playbook itself remains readable. You should use a version control tool like git, and use functionality like diff!

The first thing is to move the variables out of the playbook into a dedicated section. In this case I have chosen to assign them per host, in a directory called ‘host_vars’ – for localhost that means ‘host_vars/localhost’:

$ cat host_vars/localhost/vars.yml
---
a: aaa
b: bbb

And then of course remove them from the playbook itself:

$ cat test2.yml
---
- hosts: localhost
  tasks:

  - name: execute a shell command
    shell: >
      ls
  - debug: msg="{{ a }} / {{ b }}"

If this is run, it will pick up the variables (for which I put in a debug task):

$ ansible-playbook test2.yml
PLAY [localhost] ***********************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] *********************************************************************************************************
changed: [localhost]

TASK [debug] ***************************************************************************************************************************
ok: [localhost] => {
    "msg": "aaa / bbb"
}

PLAY RECAP *****************************************************************************************************************************
localhost                  : ok=3    changed=1    unreachable=0    failed=0

Okay, now let’s say I need to encrypt the variable ‘a’. I could bluntly encrypt vars.yml, and that would work; however, you would then not know where the variable comes from. The solution to that is ‘indirect referencing’: add a second file to host_vars/localhost called ‘vault.yml’, which will be the encrypted file. To encrypt variable a, do the following:
1. Change the vars.yml file to assign {{ vault_a }} to a:

$ cat host_vars/localhost/vars.yml
---
a: "{{ vault_a }}"
b: bbb

2. Define the value of a to vault_a in vault.yml:

$ cat host_vars/localhost/vault.yml
---
vault_a: aaa

3. Encrypt vault.yml:

$ ansible-vault encrypt host_vars/localhost/vault.yml
New Vault password:
Confirm New Vault password:
Encryption successful

Now it can be used by specifying the password again:

$ ansible-playbook test2.yml --ask-vault-pass
Vault password:
PLAY [localhost] ***********************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] *********************************************************************************************************
changed: [localhost]

TASK [debug] ***************************************************************************************************************************
ok: [localhost] => {
    "msg": "aaa / bbb"
}

PLAY RECAP *****************************************************************************************************************************
localhost                  : ok=3    changed=1    unreachable=0    failed=0

This should not be a surprise: as I said, ansible will recognise an encrypted file and decrypt it automatically. However, if you looked carefully you saw something else that is important to realise: I encrypted the variable a containing ‘aaa’, and it was revealed to me by the debug statement. I think this is reasonable, because the playbook was explicitly programmed to do so – but you should write your playbooks not to reveal any information that you encrypted previously.
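One safeguard worth knowing here – my addition, not something the original example uses – is the no_log attribute, which suppresses a task’s output so a secret can be used without being printed:

  # hypothetical task: 'do_something_with' is a placeholder command
  - name: use the secret without echoing it
    shell: >
      do_something_with {{ a }}
    no_log: true

Be aware that no_log also hides that task’s normal and error output, which can make debugging harder.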

2. Variable encryption
If you have larger sets of variables to be encrypted, the above method makes sense. However, if you only want to encrypt a few variables it’s probably simpler to encrypt just those values. The ansible-vault executable provides a facility for that called ‘encrypt_string’. The way this works is that you give ansible-vault the variable name and its value, and it outputs the encrypted value, which you then put into a playbook yourself. For example, for your ‘a’ variable with the value ‘aaa’:

$ ansible-vault encrypt_string --name 'a' 'aaa'
New Vault password:
Confirm New Vault password:
a: !vault |
          $ANSIBLE_VAULT;1.1;AES256
          37363334323763633562313934656631396431636139653036333036356636326163623238626132
          6138343534396232643638323830326462383831366165630a393335303939663138613437633835
          65396461383565316135316339383035626262343664643563373631366338393361623230666634
          3637363232313935310a323564663863336565623366633838616337393366313831396637386432
          3035
Encryption successful

Now use this to input to a playbook:

---
- hosts: localhost
  vars:
    a: !vault |
          $ANSIBLE_VAULT;1.1;AES256
          37363334323763633562313934656631396431636139653036333036356636326163623238626132
          6138343534396232643638323830326462383831366165630a393335303939663138613437633835
          65396461383565316135316339383035626262343664643563373631366338393361623230666634
          3637363232313935310a323564663863336565623366633838616337393366313831396637386432
          3035
    b: bbb
  tasks:

  - name: execute a shell command
    shell: >
      ls
  - debug: msg="{{ a }} / {{ b }}"

3. Vault passwords
The next thing to look at is how to specify the decryption password. One way or another, you need to supply a password for decryption. The first option is to specify the password at the CLI, which is what the previous examples did by passing ‘--ask-vault-pass’ as an argument to ansible-playbook. This is a good option for running playbooks ad-hoc.

If you want playbooks to run from a facility like ansible tower, rundeck or semaphore, you need to provide the password non-interactively, in a file. This is often implemented as a hidden file, like ‘.vault_pass.txt’. Of course this means the password is readable by anyone with command line access. The main idea is that this gives you an option to encrypt sensitive data in a playbook and store it in a version control system, while NOT storing the password (file) in version control.

$ echo "mypassword" > .vault_pass.txt
$ ansible-playbook --vault-password-file .vault_pass.txt test2.yml
PLAY [localhost] **************************************************************************************************************************************************************

TASK [Gathering Facts] ********************************************************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] ************************************************************************************************************************************************
changed: [localhost]

TASK [debug] ******************************************************************************************************************************************************************
ok: [localhost] => {
    "msg": "aaa / bbb"
}

PLAY RECAP ********************************************************************************************************************************************************************
localhost                  : ok=3    changed=1    unreachable=0    failed=0 

There is a caveat here too, which I ran into because my removable HDD enforces 777 on my files, which means the execute bit is set:

$ ansible-playbook --vault-password-file .ttt test2.yml
 [WARNING]: Error in vault password file loading (default): Problem running vault password script /Volumes/Samsung_T3/ansible/.ttt ([Errno 8] Exec format error). If this
is not a script, remove the executable bit from the file.

ERROR! Problem running vault password script /Volumes/Samsung_T3/ansible/.ttt ([Errno 8] Exec format error). If this is not a script, remove the executable bit from the file.

The message is self-explanatory: if ansible finds the specified password file to be executable, it will execute it. In my case, I created a little shell script that simply echoes the password:

#!/bin/bash
echo "mypassword"
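A variation on this theme (my sketch, not from the original post) is to let the script pick the password up from the environment, so nothing sensitive is stored in the file itself:

#!/bin/bash
# hypothetical variant: read the vault password from an environment variable
echo "${ANSIBLE_VAULT_PASSWORD:?ANSIBLE_VAULT_PASSWORD is not set}"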

3.a. Vault-id for passwords
Since ansible version 2.4 the official way to use encryption is vault-id. Vault-id works more or less like the examples above, but it adds a feature that could be important: the option to use multiple vault passwords in the same playbook run. The way this works is quite simple: ansible tries all the passwords it has been given until it finds one that works. Please mind vault-id only works with encrypted files, not with encrypted variables! Below is an example of the use of vault-id with multiple passwords:

$ echo '---
- hosts: localhost
  tasks:

  - name: execute a shell command
    shell: >
      ls
  - debug: msg="{{ a }} / {{ b }}"' > test3.yml
$ echo '---
a: "{{ vault_a }}"
b: "{{ vault_b }}"' > host_vars/localhost/vars.yml
$ echo "password_a" > .a.vault_pass.txt
$ echo "password_b" > .b.vault_pass.txt
$ echo "---
vault_a: aaa" > host_vars/localhost/vault_a.yml
$ echo "---
vault_b: bbb" > host_vars/localhost/vault_b.yml
$ ansible-vault --vault-id a@.a.vault_pass.txt encrypt host_vars/localhost/vault_a.yml
Encryption successful
$ ansible-vault --vault-id b@.b.vault_pass.txt encrypt host_vars/localhost/vault_b.yml
Encryption successful
$ ansible-playbook --vault-id a@.a.vault_pass.txt --vault-id b@.b.vault_pass.txt test3.yml
PLAY [localhost] **************************************************************************************************************************************************************

TASK [Gathering Facts] ********************************************************************************************************************************************************
ok: [localhost]

TASK [execute a shell command] ************************************************************************************************************************************************
changed: [localhost]

TASK [debug] ******************************************************************************************************************************************************************
ok: [localhost] => {
    "msg": "aaa / bbb"
}

PLAY RECAP ********************************************************************************************************************************************************************
localhost                  : ok=3    changed=1    unreachable=0    failed=0
Jonathan Lewis's picture

Filtering LOBs

A two-part question about the FILTER operation appeared on the Oracle-L list server a couple of days ago. The first part was a fairly common question – one that’s often prompted by the way the optimizer used to behave in older versions of Oracle. Paraphrased, it was: “Why is the total cost of the query so high compared to the sum of its parts?”

Here’s the query, and the execution plan.

 INSERT INTO TEMP
  SELECT DISTINCT 'id',
    PHT.emplid
  FROM PHOTO PHT
  WHERE 1               =1
  AND PHT.PHOTO IS NOT NULL
  AND NOT EXISTS
    (SELECT 'x'
    FROM TEMP TMP
    WHERE PHT.EMPLID=TMP.EMPLID_SRCH
    AND TMP.OPRID  = 'id'
    )
  ;  

  
-------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |          | 21210 |  3334K|  5802K  (2)| 00:03:47 |
|   1 |  LOAD TABLE CONVENTIONAL | TEMP     |       |       |            |          |
|*  2 |   FILTER                 |          |       |       |            |          |
|*  3 |    TABLE ACCESS FULL     | PHOTO    | 21211 |  3334K|   313   (1)| 00:00:01 |
|*  4 |    TABLE ACCESS FULL     | TEMP     |     1 |    17 |   380   (2)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( NOT EXISTS (SELECT 0 FROM "TEMP" "TMP" WHERE
              "TMP"."EMPLID_SRCH"=:B1 AND "TMP"."OPRID"='id'))
   3 - filter("PHT"."PHOTO" IS NOT NULL)
   4 - filter("TMP"."EMPLID_SRCH"=:B1 AND "TMP"."OPRID"='id')

Note that the “not exists” subquery against temp runs as a filter subquery with a cost of 380 for the tablescan. Combine that with the cost of 313 for the driving tablescan of photo and you might wonder why the resulting cost isn’t something like 693 – and in some old versions of Oracle that’s probably how it would be reported.

Historically the optimizer has been very bad about producing a final cost when queries have included subqueries – whether as filter subqueries in the predicate section or as scalar subqueries in the select list. Sometimes the cost would simply vanish from the final cost, sometimes it would be added just once to the final cost regardless of how many times the subquery might actually execute.

In this example the subquery against temp is a correlated subquery and might have to run once for every row in photo where the column photo is not null – at best it would have to run once for every distinct value of the photo.emplid column (the correlation column) found in those rows. In recent versions of Oracle the optimizer has tried to introduce some estimate of how many times the subquery will run as part of its calculation of the total cost. So (to a crude approximation) 5802K = 313 + N * 380. Unfortunately if we try to work backwards to N we find it would be about 15,267, which is about 72% of the 21,211 rows estimated as the result of the tablescan of photo – I haven’t tried to investigate the algorithms yet, but presumably the optimizer makes some allowance somewhere for “self caching” as the subquery runs.
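For the record, the back-of-envelope arithmetic (mine, not the OP’s), reading “5802K” as 5,802,000:

select  (5802000 - 313) / 380                   subquery_executions,    -- approx. 15,267
        ((5802000 - 313) / 380) / 21211         fraction_of_rows        -- approx. 0.72
from    dual
;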

The more interesting part of the question came when the OP decided to test the effect of getting rid of the subquery. Check the costs in the resulting plan:


  INSERT INTO TEMP
  SELECT DISTINCT 'id',
    PHT.emplid
  FROM PHOTO PHT
  WHERE 1               =1
  AND PHT.UC_PBI_PHOTO IS NOT NULL;

  
-------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |          | 21211 |  3334K|  3659   (1)| 00:00:01 |
|   1 |  LOAD TABLE CONVENTIONAL | TEMP     |       |       |            |          |
|*  2 |   TABLE ACCESS FULL      | PHOTO    | 21211 |  3334K|  3659   (1)| 00:00:01 |
-------------------------------------------------------------------------------------
  
Predicate Information (identified by operation id):
---------------------------------------------------
     2 - filter("PHT"."PHOTO" IS NOT NULL)

Note how the cost of the tablescan of photo has gone up from 313 in the previous query to 3,659 in the simpler query! How can a tablescan that drives a subquery have a lower cost than the tablescan on its own? Bear in mind that in both cases the Cost attributed to the operation “Table Access Full” is purely about scanning the rows in the photo table and is (or should be) entirely disconnected from the cost and frequency of the subquery.

The clue is in the table definition. The column photo.photo is a BLOB.

Models

I think there are potentially two errors in the optimizer displayed by this example. The first is that it’s adding in a cost that it shouldn’t even be considering; the second is that it’s being inconsistent in the way that it’s deriving that cost.

To demonstrate what I think is happening, I built a variant of the OP’s example as follows:


rem
rem     Script:         optimizer_lob_costs.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1
rem             12.1.0.2
rem

create table photo (
        emplid          varchar2(11) not null,
        photo           clob,
        other_col       varchar2(1000)
)
lob (photo) 
store as
        photo_lob(
        disable storage in row 
        cache
        logging
)
;

create unique index ph_uk on photo(emplid);

insert /*+ append */ into photo
select
        lpad(2 * rownum,10,0),
        rpad('x',1000),
        rpad('x',1000)
from
        all_objects
where
        rownum <= 10000 -- > comment to avoid wordpress format issue
;

commit;

create table temp(
        oprid           varchar2(30),
        emplid_srch     varchar2(11)
)
;

insert /*+ append */ into temp
select
        'id',
        lpad(2 * rownum,10,0)
from
        all_objects
where
        rownum <= 1000 -- > comment to avoid wordpress format issue
;

commit;

execute dbms_stats.gather_table_stats(user,'photo',method_opt=>'for all columns size 1', cascade=>true)
execute dbms_stats.gather_table_stats(user,'temp', method_opt=>'for all columns size 1', cascade=>true)


I’ve changed the BLOB to a CLOB defined with storage in row disabled, and I’ve introduced a varchar2() column of the same size as the CLOB column. I’ve declared the correlating column not null and created a unique index on it. Here are the two queries I want to review – slightly simplified versions of the original:


explain plan for
insert into temp(emplid_srch)
select 
        distinct pht.emplid
from 
        photo pht
where 
        1 = 1
and  pht.photo is not null
-- and     pht.other_col is not null
and     not exists (
                select /*+ no_unnest */
                        null
                from 
                        temp tmp
                where 
                        pht.emplid=tmp.emplid_srch
        )
;  

select * from table(dbms_xplan.display);

explain plan for
insert into temp(emplid_srch)
select
        distinct pht.emplid
from    photo pht
where   1               =1
and  pht.photo is not null
-- and     pht.other_col is not nulL
;  

select * from table(dbms_xplan.display);

As you can see I’ve had to include a /*+ no_unnest */ hint in my SQL to get the FILTER operation to appear in the plan (the OP had the hidden parameter “_unnest_subquery” set to false); I’ve also allowed for two variants of each query, one referencing the CLOB column the other referencing the varchar2() column. The only results I’ll show are for the queries accessing the CLOB, and here are the plans first with, then without, the subquery. Check the cost of the tablescan of the photo table in the two cases:


----------------------------------------------------------------------------------
| Id  | Operation                | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |       |  9999 |   956K| 10458   (3)| 00:00:01 |
|   1 |  LOAD TABLE CONVENTIONAL | TEMP  |       |       |            |          |
|*  2 |   FILTER                 |       |       |       |            |          |
|*  3 |    TABLE ACCESS FULL     | PHOTO | 10000 |   957K|   216   (1)| 00:00:01 |
|*  4 |    TABLE ACCESS FULL     | TEMP  |     1 |    11 |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( NOT EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "TEMP" "TMP"
              WHERE "TMP"."EMPLID_SRCH"=:B1))
   3 - filter("PHT"."PHOTO" IS NOT NULL)
   4 - filter("TMP"."EMPLID_SRCH"=:B1)


----------------------------------------------------------------------------------
| Id  | Operation                | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |       | 10000 |   957K|   285   (2)| 00:00:01 |
|   1 |  LOAD TABLE CONVENTIONAL | TEMP  |       |       |            |          |
|*  2 |   TABLE ACCESS FULL      | PHOTO | 10000 |   957K|   285   (2)| 00:00:01 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("PHT"."PHOTO" IS NOT NULL)

With the subquery in place the tablescan of photo reports a cost of 216; in the absence of the subquery it reports a cost of 285, a difference of 69. Repeating the test but using the varchar2() column the cost of the tablescan was 213 in both cases – suggesting that the variation is due to the column being a LOB.

With no further clues in the plan it looked like one of those rare occasions when I have to look at the 10053 (optimizer) trace file – and this is what I got from the 12.1.0.2 trace, looking at the section headed “SINGLE TABLE ACCESS PATH” for the photo table. First the base query without the subquery:


SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for PHOTO[PHT]
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
  Column (#2): PHOTO(LOB)
    AvgLen: 87 NDV: 0 Nulls: 0 Density: 0.000000
  Table: PHOTO  Alias: PHT
    Card: Original: 10000.000000  Rounded: 10000  Computed: 10000.000000  Non Adjusted: 10000.000000
  Scan IO  Cost (Disk) =   210.000000
  Scan CPU Cost (Disk) =   13571440.480000
  Total Scan IO  Cost  =   210.000000 (scan (Disk))
                         + 70.000000 (io filter eval) (= 0.007000 (per row) * 10000.000000 (#rows))
                       =   280.000000
  Total Scan CPU  Cost =   13571440.480000 (scan (Disk))
                         + 9138463.200000 (cpu filter eval) (= 913.846320 (per row) * 10000.000000 (#rows))
                       =   22709903.680000

Note that the “Total Scan IO  Cost” includes a component labelled “(io filter eval)” – why, for the predicate “photo is not null”, would we do any special I/O when that predicate can be met in the basic table scan?

(Note: a predicate like “lob_column is null” means there is no lob locator in place, so no lob access need be done for that test. In fact the related, but very different, predicate “length(lob_column) = 0” – meaning the lob locator exists but the lob is “empty” – could also be satisfied during the tablescan without reference to the physical lob segment(s), because the length of the lob is included in the lob locator.)
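If you want to see the difference between those two states for yourself, a couple of hypothetical rows in the demo table make it visible (my illustration, not part of the original script):

insert into photo(emplid, photo) values ('row_null',  null);           -- no lob locator stored
insert into photo(emplid, photo) values ('row_empty', empty_clob());   -- locator present, length 0

select  emplid,
        case when photo is null then 'no locator' else 'locator present' end    locator_state,
        length(photo)   lob_length      -- null for the first row, 0 for the second
from    photo
where   emplid in ('row_null', 'row_empty')
;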

Let’s assume that the optimizer is incorrectly assuming the run-time engine will have to access the lob in some way to determine whether the lob is null. The worst case scenario is that Oracle will start by accessing the LOBindex – so why not check how big the LOBindex is? The first step I took was to find the object_id of the LOBindex and do a tree dump (which showed 66 leaf blocks); then I checked the segment header block and dumped that, with the following results:


  Extent Control Header
  -----------------------------------------------------------------
  Extent Header:: spare1: 0      spare2: 0      #extents: 1      #blocks: 127
                  last map  0x00000000  #maps: 0      offset: 4128
      Highwater::  0x01400447  ext#: 0      blk#: 70     ext size: 127
  #blocks in seg. hdr's freelists: 4
  #blocks below: 70
  mapblk  0x00000000  offset: 0
                   Unlocked
     Map Header:: next  0x00000000  #extents: 1    obj#: 194295 flag: 0x40000000
  Extent Map
  -----------------------------------------------------------------
   0x01400401  length: 127

See the “Highwater::” information in the extract – the allocated space in the segment is the first 70 blocks of the first extent. That’s (almost certainly) where the incremental cost of 70 (single block I/Os) comes from. (And I did a couple of big updates to the LOB, designed to expand the LOBindex without changing the segment size of the underlying table, to corroborate that hypothesis.)
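For reference, the steps just described can be reproduced with something like the following sketch (my illustration – the object_id in the event call is whatever the first query returns on your system):

select  l.index_name, o.object_id
from    user_lobs l, user_objects o
where   l.table_name  = 'PHOTO'
and     o.object_name = l.index_name
;

alter session set events 'immediate trace name treedump level 194295';  -- substitute your object_id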

This brings us to the question of why the cost of the tablescan drops when the subquery is included. Again we generate the 10053 trace and examine the details under the “SINGLE TABLE ACCESS PATH”:


SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for PHOTO[PHT]
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
  Table: PHOTO  Alias: PHT
    Card: Original: 10000.000000  Rounded: 10000  Computed: 10000.000000  Non Adjusted: 10000.000000
  Scan IO  Cost (Disk) =   210.000000
  Scan CPU Cost (Disk) =   13571440.480000
  Total Scan IO  Cost  =   210.000000 (scan (Disk))
                         + 3.500000 (io filter eval) (= 0.000350 (per row) * 10000.000000 (#rows))
                       =   213.500000
  Total Scan CPU  Cost =   13571440.480000 (scan (Disk))
                         + 656923.160000 (cpu filter eval) (= 65.692316 (per row) * 10000.000000 (#rows))
                       =   14228363.640000


In this case the “(io filter eval)” component is only 3.5 – and if you know your optimizer and how it handles subqueries you’re allowed to guess that this could be one of Oracle’s standard guesses of 5% coming into play. (Again, growing the index seemed to corroborate this hypothesis.)
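
The 5% guess is easy to verify – again this is nothing more than a Python re-statement of the arithmetic quoted in the two trace extracts:

base_rate = 0.007000    # io filter eval per row, base query
subq_rate = 0.000350    # io filter eval per row, with the subquery
print subq_rate / base_rate    # 0.05 - the standard 5% guess
print 10000 * subq_rate        # 3.5  - the reduced "(io filter eval)" component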

So here’s (possible) bug number 2 (the first bug being the cost added for accessing the LOBindex when there should be no need to access the index at all): the execution plan says we will get 10,000 rows from the table, and the filter predicate reports a cardinality reduced by just 1 on a column that has been declared with a uniqueness constraint, yet a fairly commonly used “guess factor” of 5% is used as an indicator of the number of times the lob predicate will be tested. The various bits of the arithmetic are not consistent with each other.

Summary notes:

If you have a tablescan with a predicate that references a lob column then the cost of the tablescan includes the cost of the lob access – and there are cases where lob access is not needed but still gets costed. (This is bug number 1 – the predicates in question are “column is/is not null” and “length(column) = / != 0”.)

If the lob data itself does not need to be accessed then the size of the lob index – which you can’t easily find – may have a significant impact on the cost of the tablescan.

If the query also includes predicates that result in the optimizer guessing about cardinality effects (1%, 5%, 0.25% are common guesses) then that guess may be used to scale the assumed (and potentially irrelevant) cost of the lob access. (There is scope for further experimentation in this area to examine the effects of “non-guess” predicates and the assumed order of application of predicates – e.g. are lob predicates costed as the last to be applied, and does the costing algorithm match the execution order?)

As often happens it’s easy to see that there are oddities in the arithmetic that affect the cost of a query in ways that might make the optimizer pick a silly execution plan. Unfortunately it’s not easy to predict when you’re likely to see the effects of these oddities; the best we can do is remember that there is an anomaly with costing lob-based predicates and hope that we think of it when we see the optimizer picking a bad plan for reasons that initially are not obvious.

Bertrand Drouvot's picture

Demystifying and wrapping the Oracle Cloud APIs with Python

As an Oracle DBA you might have to deal with REST APIs, especially when working with the cloud. The purpose of this post is to demystify REST API usage from a DBA point of view. Let’s take an example and write a Python wrapper to automate instance creation in the Oracle Database Cloud Service.

The instance creation can be done manually using the Web Interface, this way:

[Screenshot: creating a Database Cloud Service instance in the Web Interface]

It can also be done using the APIs. The APIs are described in this link. The one related to instance creation is the following:

[Screenshot: the create-instance API description – https://bdrouvot.files.wordpress.com/2018/05/create_instance_api.png]

So, identityDomainId is a path parameter, the Authorization and X-ID-TENANT-NAME are header parameters and the body describes the parameters in JSON format.

To interact with the APIs through HTTP, let’s use the Python Requests module.

To work with JSON, let’s use the json module.

First, we need to create an http session. Let’s import the requests module:

import requests

and then create the http session:

client = requests.Session()

Once done, we can add the authorization:

client.auth = (USERNAME, PASSWORD)

and update the header with the content-type and the X-ID-TENANT-NAME:

client.headers.update(
  {'content-type': 'application/json',
   'X-ID-TENANT-NAME': '{0}'.format(IDENTITY_DOMAIN_ID)})

Now, let’s create the body. In this example the JSON data has been extracted from a file (prov_database.json) that contains:

$ cat prov_database.json
{
  "description": "BDTDESC",
  "edition": "EE",
  "level": "PAAS",
  "serviceName": "WILLBEOVERWRITTEN",
  "shape": "oc3",
  "subscriptionType": "MONTHLY",
  "version": "12.1.0.2",
  "vmPublicKeyText": "ssh-rsa ",
  "parameters": [
    {
      "type": "db",
      "usableStorage": "15",
      "adminPassword": "",
      "sid": "BDTSID",
      "pdbName": "MYPDB",
      "failoverDatabase": "no",
      "backupDestination": "NONE"
    }
  ]
}

It is read by the Python wrapper this way:

data = json.load(open('{0}/prov_database.json'.format(dir_path), 'r'))

Now that we have defined the authorization, the header and the body, all we have to do is post the request.

The URL structure is described here: [Screenshot: the REST endpoint URL structure]

So that the post is launched that way with Python:

response = client.post("https://dbcs.emea.oraclecloud.com/paas/service/dbcs/api/v1.1/instances/{0}".format(IDENTITY_DOMAIN_ID), json=data)

As you can see, the IDENTITY_DOMAIN_ID is indeed passed as the path parameter and the JSON data as the body of the request. That’s it!

Remarks:

  • The username, password and identityDomainId have also been extracted from a file (.oracleapi_config.yml). The file contains:
$ cat .oracleapi_config.yml
identityDomainId: 
username: 
password: 
logfile: oracleapi.log

and has been extracted that way:

f = open('{0}/.oracleapi_config.yml'.format(dir_path), 'r')
config = safe_load(f)
IDENTITY_DOMAIN_ID = config['identityDomainId']
USERNAME = config['username']
PASSWORD = config['password']
  • More fun:
    You may have noticed that the http post’s response provides a URL you can use to check the progress of the creation:

[Screenshot: the create-instance response, including the job URL used to check progress]

So we can integrate the check into our wrapper (the source code is available at the end of this post and in this git repository).
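
To make the mechanism clear before reading the full source, here is the polling loop in isolation. This is a minimal sketch: it assumes (as the full source below confirms) that the 202 response carries the job URL in the Location header; client, data and config are the objects built above, url stands for the instances endpoint shown earlier, and check_job is the helper defined in the complete wrapper:

import time

response = client.post(url, json=data)    # the POST built above
job_url = response.headers['Location']    # URL to poll for the creation progress

status = "InProgress"
while status == "InProgress":
    time.sleep(120)                       # provisioning takes a while
    status, messages = check_job(config, job_url)
    print "{0} ({1})".format(status, messages[-1])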

Let’s use our wrapper:

$ python ./opc_api_wrapper.py
Usage:
    opc_api_wrapper.py <service_name> create_instance

$ python ./opc_api_wrapper.py BDTSERV create_instance
creating opc instance BDTSERV...
InProgress (Starting Compute resources...)
InProgress (Starting Compute resources...)
InProgress (Starting Compute resources...)
InProgress (Configuring Oracle Database Server...)
InProgress (Configuring Oracle Database Server...)
InProgress (Configuring Oracle Database Server...)
InProgress (Configuring Oracle Database Server...)
InProgress (Configuring Oracle Database Server...)
InProgress (Configuring Oracle Database Server...)
InProgress (Configuring Oracle Database Server...)
InProgress (Configuring Oracle Database Server...)
InProgress (Configuring Oracle Database Server...)
InProgress (Configuring Oracle Database Server...)
InProgress (Configuring Oracle Database Server...)
InProgress (Configuring Oracle Database Server...)
Succeeded ( Service Reachabilty Check (SRC) of Oracle Database Server [BDTSERV] completed...)

opc instance BDTSERV created:

SSH access to VM [DB_1/vm-1] succeeded...
Oracle Database Server Configuration completed...
Successfully provisioned Oracle Database Server...
Service Reachabilty Check (SRC) of Oracle Database Server [BDTSERV] completed...

During the instance creation, you could also check the progress through the web interface:

[Screenshot: instance creation progress in the Web Interface]

Source code:

# Author: Bertrand Drouvot
# Blog : https://bdrouvot.wordpress.com/
# opc_api_wrapper.py : V1.0 (2018/05)
# Oracle cloud API wrapper

from yaml import safe_load
import os
from os import path
import logging
import json
import requests
import time
from docopt import docopt

FORMAT = '%(asctime)s - %(name)s - %(levelname)-s %(message)s'

help = ''' Oracle Public Cloud Wrapper

Usage:
    opc_api_wrapper.py <service_name> create_instance

Options:
    -h  Help message

    Returns .....
'''

class APIError(Exception):

    def __init__(self, message, status_code=None, payload=None):
        Exception.__init__(self)
        self.message = message
        # Keep the caller-supplied status code (default 415, as before) and payload.
        self.status_code = status_code if status_code is not None else 415
        self.payload = payload

    def to_dict(self):
        rv = dict()
        rv['message'] = self.message
        return rv

def launch_actions(kwargs):

    dir_path = os.path.dirname(os.path.realpath(__file__))
    try:
        f = open('{0}/.oracleapi_config.yml'.format(dir_path), 'r')
        config = safe_load(f)
    except IOError:
        # Without the configuration file there is nothing useful we can do.
        raise ValueError("This script requires a .oracleapi_config.yml file")

    if kwargs['create_instance']:
        create_instance(kwargs['service_name'],dir_path,config)

def return_last_from_list(v_list):
    # The job message comes back as a list; we only need the most recent entry.
    return v_list[-1]

def print_all_from_list(v_list):
    for mmsg in v_list:
        print mmsg

def check_job(config,joburl):

    IDENTITY_DOMAIN_ID = config['identityDomainId']
    USERNAME = config['username']
    PASSWORD = config['password']

    client = requests.Session()
    client.auth = (USERNAME, PASSWORD)
    client.headers.update({'X-ID-TENANT-NAME': '{0}'.format(IDENTITY_DOMAIN_ID)})

    response = client.get("{0}".format(joburl))
    jsontext= json.loads(response.text)
    client.close()
    return (jsontext['job_status'],jsontext['message'])


def create_instance(service_name,dir_path,config):

    logfile = config['logfile']
    logging.basicConfig(filename=logfile, format=FORMAT, level=logging.INFO)

    IDENTITY_DOMAIN_ID = config['identityDomainId']
    USERNAME = config['username']
    PASSWORD = config['password']

    data = json.load(open('{0}/prov_database.json'.format(dir_path), 'r'))
    data['serviceName'] = service_name

    print "creating opc instance {0}...".format(service_name)

    client = requests.Session()
    client.auth = (USERNAME, PASSWORD)
    client.headers.update(
        {'content-type': 'application/json',
         'X-ID-TENANT-NAME':'{0}'.format(IDENTITY_DOMAIN_ID)})

    response = client.post("https://dbcs.emea.oraclecloud.com/paas/service/dbcs/api/v1.1/instances/{0}".format(IDENTITY_DOMAIN_ID), json=data)
    if response.status_code != 202:
        raise APIError(response.json()['message'])
    joburl = response.headers['Location']
    jobstatus = "InProgress"
    while (jobstatus == "InProgress"):
        time.sleep(120)
        jobstatus,jobmessage = check_job(config,joburl)
        print "{0} ({1})".format(jobstatus,return_last_from_list(jobmessage))
    client.close()
    print ""
    print "opc instance {0} created:".format(service_name)
    print ""
    print_all_from_list(jobmessage)

#
# Main
#

def main():

    arguments = docopt(help)
    for key in arguments.keys():
        arguments[key.replace('<','').replace('>','')] = arguments.pop(key)

    launch_actions(arguments)

if __name__ == '__main__':
    main()

Conclusion

We have been able to wrap the instance creation APIs with Python, and we have included a way to check/follow the creation progress. Wrapping the APIs with Python gives flexibility; for example, you could launch Ansible playbook(s) from the wrapper once the instance creation is done.

Franck Pachot's picture

Which Bitnami service to choose in the Oracle Cloud Infrastructure?

In the Oracle Cloud PaaS you have a marketplace where you can choose your service. Some are packaged by Bitnami and available on multiple operating systems. My first thought was that in PaaS you do not care about the OS. But the Oracle Cloud has this very nice feature where you still have full access to the OS, as root, even in PaaS. So you can choose the Linux distribution of your preference. Except if performance differs – they run on different Linux kernels. Is the Oracle Linux Unbreakable Kernel more efficient?

We need to compare with relevant and reliable measures, and for that I’ve created a Postgres service and used Kevin Closson’s SLOB method, as I’m a lucky beta tester for pgio. I’ll post later about the creation of the Postgres service in the Oracle Cloud.

Cached IO

First, I’ve run cached IO to measure CPU performance.

I’ve run with the following settings in pgio.conf:

UPDATE_PCT=0
RUN_TIME=300
NUM_SCHEMAS=4
NUM_THREADS=1
WORK_UNIT=255
SCALE=1G

This is 4 schemas with 1GB of data each, which fits in my 30GB host. Actually, here is a vmstat sample during the run showing 5GB cached and plenty of free memory:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 0 0 15508168 12444 5495652 0 0 0 0 1530 1631 28 22 50 0 0
4 0 0 15508292 12452 5495648 0 0 0 4 1506 1564 27 23 50 0 0

There’s no I/O to the block devices here, which is my goal.

I’ve run runit.sh 10 times and here is a summary of the main result, the IOPS:


$ for i in allopc@144.21.82.244.out allopc@144.21.82.255.out allopc@144.21.82.252.out ; do grep os-release $i | grep PRETTY | tail -1 ; grep x86 $i | tail -1 ; grep -E "runit.out.*DBNAME" $i | tail -10 | nl ; done
 
Fri May 25 11:56:11 UTC 2018 /etc/os-release : PRETTY_NAME="Ubuntu 16.04.4 LTS"
Linux ubuntu 4.4.0-127-generic #153-Ubuntu SMP Sat May 19 10:58:46 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
1 Fri May 25 11:11:05 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >393232< CACHE_HITS/s >15711<
2 Fri May 25 11:16:06 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >398179< CACHE_HITS/s >16000<
3 Fri May 25 11:21:07 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >402080< CACHE_HITS/s >16019<
4 Fri May 25 11:26:07 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >404682< CACHE_HITS/s >16086<
5 Fri May 25 11:31:07 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >408524< CACHE_HITS/s >16327<
6 Fri May 25 11:36:08 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >409209< CACHE_HITS/s >16390<
7 Fri May 25 11:41:08 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >403647< CACHE_HITS/s >16327<
8 Fri May 25 11:46:09 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >397440< CACHE_HITS/s >15894<
9 Fri May 25 11:51:09 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >397273< CACHE_HITS/s >15956<
10 Fri May 25 11:56:11 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >396906< CACHE_HITS/s >15904<
 
Fri May 25 11:54:56 UTC 2018 /etc/os-release : PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
Linux debian 4.9.0-6-amd64 #1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07) x86_64 GNU/Linux
1 Fri May 25 11:09:53 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >395672< CACHE_HITS/s >15882<
2 Fri May 25 11:14:54 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >401400< CACHE_HITS/s >16188<
3 Fri May 25 11:19:54 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >391551< CACHE_HITS/s >15764<
4 Fri May 25 11:24:54 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >393827< CACHE_HITS/s >15802<
5 Fri May 25 11:29:54 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >404462< CACHE_HITS/s >16198<
6 Fri May 25 11:34:55 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >392712< CACHE_HITS/s >15732<
7 Fri May 25 11:39:55 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >399389< CACHE_HITS/s >16063<
8 Fri May 25 11:44:55 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >390283< CACHE_HITS/s >15567<
9 Fri May 25 11:49:56 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >391426< CACHE_HITS/s >15771<
10 Fri May 25 11:54:56 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >392760< CACHE_HITS/s >15874<
 
Fri May 25 11:53:58 UTC 2018 /etc/os-release : PRETTY_NAME="Oracle Linux Server 7.5"
Linux b5e501 4.1.12-124.15.2.el7uek.x86_64 #2 SMP Tue May 22 11:52:31 PDT 2018 x86_64 x86_64 x86_64 GNU/Linux
1 Fri May 25 11:08:54 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >395815< CACHE_HITS/s >15759<
2 Fri May 25 11:13:55 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >395346< CACHE_HITS/s >16009<
3 Fri May 25 11:18:55 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >395340< CACHE_HITS/s >15898<
4 Fri May 25 11:23:56 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 301 seconds. RIOPS >402556< CACHE_HITS/s >16200<
5 Fri May 25 11:28:56 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >397847< CACHE_HITS/s >16039<
6 Fri May 25 11:33:56 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >398848< CACHE_HITS/s >16027<
7 Fri May 25 11:38:57 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >398817< CACHE_HITS/s >16089<
8 Fri May 25 11:43:57 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >398878< CACHE_HITS/s >15961<
9 Fri May 25 11:48:57 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >395885< CACHE_HITS/s >15606<
10 Fri May 25 11:53:58 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >395965< CACHE_HITS/s >15893<

I’ve put this data in Excel to show the picture where, basically, all 3 environments perform exactly the same:

[Chart: cached-IO RIOPS – all three distributions perform the same]
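
If you prefer to stay at the command line, the same averages can be computed with a few lines of Python. This is a quick sketch: the file names and the “RIOPS >n<” output format come straight from the listing above, and the regular expression is simply my assumption about a convenient way to grab the numbers:

import re

outfiles = ['allopc@144.21.82.244.out', 'allopc@144.21.82.255.out', 'allopc@144.21.82.252.out']
for name in outfiles:
    riops = [int(r) for r in re.findall(r'RIOPS >(\d+)<', open(name).read())]
    riops = riops[-10:]    # keep the last 10 runs, as in the grep above
    print "%s : average RIOPS %d" % (name, sum(riops) / len(riops))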

Physical IO

For the second test, I wanted to measure physical IOPS – though not to measure disk performance, which is the same for my 3 services. I wanted to do physical I/O only, to see if there’s a difference in context switches when doing non-blocking I/O – something we could not see in the previous test because all the reads were filesystem cache hits. So I kept the small scale of 4 sessions with 1GB each, so that there’s a good chance the data remains in the storage cache, but I reduced the available memory in order to have less than 4GB of filesystem cache.

pgio comes with a utility (pgio_reduce_free_memory.sh) to allocate enough huge pages to limit the filesystem cache:


$ sudo bash pgio/pgio_reduce_free_memory.sh 2
 
Taking action to reduce free memory down to 2GB available.
total used free shared buff/cache available
Mem: 30886100 124676 30698952 19088 62472 30469900
Swap: 0 0 0
 
Attempting to allocate 13966 huge pages
MemAvailable: 1869148 kB
HugePages_Total: 13966

This is perfect: 13966 huge pages (2MB each) is about 27GB of my 30GB VM that cannot be used by the filesystem cache, so my 4x1GB working set will need I/O calls to the disk.

Here is a vmstat extract to confirm that the filesystem cache is less than 2GB:

procs -----------memory---------- ---swap-- -----io---- ---system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 4 0 154416 2344 2015204 0 0 93025 0 32425 24295 2 3 51 44 1
0 4 0 153816 2428 2015848 0 0 94923 21 32400 24414 2 3 51 44 1

runit.sh calls vmstat with a 3-second interval, so here you can see the higher rate of blocks read per second and of context switches.

Here are the interesting lines from the runit.sh output:

$ for i in allopc@144.21.89.85.out allopc@144.21.89.53.out allopc@144.21.89.26.out ; do grep os-release $i | grep PRETTY | tail -1 ; grep x86 $i | tail -1 ; grep -E "runit.out.*DBNAME" $i | tail -10 | nl ; done
 
Fri May 25 14:58:47 UTC 2018 /etc/os-release : PRETTY_NAME="Ubuntu 16.04.4 LTS"
Linux ubuntu 4.4.0-127-generic #153-Ubuntu SMP Sat May 19 10:58:46 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
1 Fri May 25 14:13:40 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 301 seconds. RIOPS >1214< CACHE_HITS/s >55<
2 Fri May 25 14:18:42 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >4545< CACHE_HITS/s >195<
3 Fri May 25 14:23:43 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >17053< CACHE_HITS/s >682<
4 Fri May 25 14:28:43 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >18865< CACHE_HITS/s >801<
5 Fri May 25 14:33:44 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >18933< CACHE_HITS/s >794<
6 Fri May 25 14:38:44 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >18832< CACHE_HITS/s >777<
7 Fri May 25 14:43:45 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >18826< CACHE_HITS/s >757<
8 Fri May 25 14:48:46 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 301 seconds. RIOPS >19229< CACHE_HITS/s >819<
9 Fri May 25 14:53:46 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >19408< CACHE_HITS/s >835<
10 Fri May 25 14:58:47 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >19524< CACHE_HITS/s >799<
 
Fri May 25 14:58:41 UTC 2018 /etc/os-release : PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
Linux debian 4.9.0-6-amd64 #1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07) x86_64 GNU/Linux
1 Fri May 25 14:13:35 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >1727< CACHE_HITS/s >82<
2 Fri May 25 14:18:36 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >10743< CACHE_HITS/s >534<
3 Fri May 25 14:23:37 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >18660< CACHE_HITS/s >763<
4 Fri May 25 14:28:37 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >18496< CACHE_HITS/s >811<
5 Fri May 25 14:33:38 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >18537< CACHE_HITS/s >757<
6 Fri May 25 14:38:38 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >18647< CACHE_HITS/s >774<
7 Fri May 25 14:43:39 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >18838< CACHE_HITS/s >775<
8 Fri May 25 14:48:40 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >18686< CACHE_HITS/s >786<
9 Fri May 25 14:53:40 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >18742< CACHE_HITS/s >782<
10 Fri May 25 14:58:41 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 301 seconds. RIOPS >18634< CACHE_HITS/s >793<
 
Fri May 25 14:57:25 UTC 2018 /etc/os-release : PRETTY_NAME="Oracle Linux Server 7.5"
Linux b5e501 4.1.12-124.15.2.el7uek.x86_64 #2 SMP Tue May 22 11:52:31 PDT 2018 x86_64 x86_64 x86_64 GNU/Linux
1 Fri May 25 14:12:20 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >17025< CACHE_HITS/s >721<
2 Fri May 25 14:17:21 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >19331< CACHE_HITS/s >792<
3 Fri May 25 14:22:21 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >19271< CACHE_HITS/s >770<
4 Fri May 25 14:27:22 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >19324< CACHE_HITS/s >802<
5 Fri May 25 14:32:22 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >18943< CACHE_HITS/s >802<
6 Fri May 25 14:37:23 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >19202< CACHE_HITS/s >818<
7 Fri May 25 14:42:24 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >18999< CACHE_HITS/s >803<
8 Fri May 25 14:47:24 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >19459< CACHE_HITS/s >823<
9 Fri May 25 14:52:24 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >19138< CACHE_HITS/s >836<
10 Fri May 25 14:57:25 UTC 2018 runit.out : DBNAME: pgio. 4 schemas, 1 threads(each). Run time: 300 seconds. RIOPS >18958< CACHE_HITS/s >781<

And I’ve graphed them:
[Chart: physical-IO RIOPS – all three distributions converge to the same rate]

As I hit the storage here, I needed a few runs to warm up the storage cache and get the I/O profile I wanted: low latency from the storage cache, but involving context switches on the server. And basically, all 3 alternatives (Ubuntu with Linux 4.4, Debian with Linux 4.9 and OEL with the Unbreakable Enterprise Kernel 4.1) behave the same. From these tests, I can say that performance is not the major criterion for choosing among the PaaS alternatives. Just choose the distribution you like.

As for pgio, it is great for getting performance measures that are reliable and focused on exactly what I want to test. I had no problems running it on all 3 platforms; I just needed apt-get / yum to install sysstat and bc, which are not there by default.

 

This article, Which Bitnami service to choose in the Oracle Cloud Infrastructure?, appeared first on Blog dbi services.
