Oakies Blog Aggregator

connor_mc_d's picture

Connections with a wallet – redux

Wow…it is nearly 4 years now since I wrote an article on connecting to the database via a wallet, to avoid having to hard-code passwords into scripts. That article is here:

https://connor-mcdonald.com/2015/09/21/connection-shortcuts-with-a-wallet/

So I went to do a similar exercise on my new 18c Windows database today, and to my surprise things went pear-shaped at the very first step:


c:\oracle\product\18\bin>mkstore -create -wrl c:\temp\wallet
The syntax of the command is incorrect.

Not a lot of feedback there!

Fortunately it’s just a small error in the script (see MOS Note 2406404.1). All you need to do is copy the mkstore.bat from a 12.1 client and you’ll be fine. I called mine mkstore121.bat.


c:\oracle\product\18\bin>mkstore121.bat -create -wrl c:\temp\wallet
Oracle Secret Store Tool Release 18.0.0.0.0 - Production
Version 18.1.0.0.0
Copyright (c) 2004, 2017, Oracle and/or its affiliates. All rights reserved.

Enter password: *********
Enter password again: *********
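The remaining steps work just as they did in the original article. As a minimal sketch (mydb is a tnsnames.ora alias of my choosing, and scott/tiger just an example): store a credential in the wallet, point sqlnet.ora at it, and connect without a password.


c:\oracle\product\18\bin>mkstore121.bat -wrl c:\temp\wallet -createCredential mydb scott tiger

# sqlnet.ora
WALLET_LOCATION=(SOURCE=(METHOD=FILE)(METHOD_DATA=(DIRECTORY=c:\temp\wallet)))
SQLNET.WALLET_OVERRIDE=TRUE

c:\oracle\product\18\bin>sqlplus /@mydb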

Franck Pachot's picture

Oracle 19c Data Guard sandbox created by DBCA -createDuplicateDB

Here are the commands I use to create a sandbox on Linux with a CDB1 database in a Data Guard configuration. I use the latest (19c) DBCA features to create the Primary and duplicate to the Standby.

I’m doing it all in a VM, which is a Compute Instance provisioned in the Oracle Cloud. In this example, I have an Oracle Linux 7.6 VM.DenseIO2.24 shape with 320GB RAM and 24 cores, but remember that you will not be able to scale up/down, so choose according to your credits…

I have 40GB in the / filesystem

OS and filesystem installation

I’ve installed the prerequisites as root (preinstall package, sudo and HugePages — here 200GB out of the 314GB I have):

sudo su
yum -y update
yum -y install oracle-database-preinstall-18c
grep "oracle" /etc/sudoers || echo "oracle  ALL=(ALL)       NOPASSWD: ALL" >> /etc/sudoers
echo "vm.nr_hugepages=102400" >> /etc/sysctl.conf
sysctl -p
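A quick check that the pages were actually reserved:


grep HugePages_Total /proc/meminfo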

I’ve attached an 8TB block device, and I create a /u01 filesystem on it for my Oracle installation:

sudo iscsiadm -m node -o new -T iqn.2015-12.com.oracleiaas:a9a2...
sudo iscsiadm -m node -o update -T iqn.2015-12.com.oracleiaas:a91...
sudo iscsiadm -m node -T iqn.2015-12.com.oracleiaas:a9a21f4...
mkfs.xfs /dev/oracleoci/oraclevdb
mkdir /u01
mount /dev/oracleoci/oraclevdb /u01
echo "/dev/oracleoci/oraclevdb /u01 xfs defaults,_netdev,nofail 0 0" >> /etc/fstab
mkdir -p /u01/app/oracle/product/DB192
chown -R oracle:dba /u01

Oracle software

First, I install the Oracle Home (unzip V981623-01.zip into /u01/app/oracle/product/DB192 and run runInstaller).

su - oracle
mkdir -p /u01/oradata /u01/fast_recovery_area /u01/app/oracle
unzip -d /u01/app/oracle/product/DB192 V981623-01.zip
cat > /u01/oradata/DB192.rsp <<END
oracle.install.responseFileVersion=/oracle/install/rspfmt_dbinstall_response_schema_v19.0.0
oracle.install.option=INSTALL_DB_SWONLY
UNIX_GROUP_NAME=oinstall
INVENTORY_LOCATION=/u01/app/oraInventory
ORACLE_BASE=/u01/app/oracle
ORACLE_HOME=/u01/app/oracle/product/DB192
oracle.install.db.InstallEdition=EE
oracle.install.db.OSDBA_GROUP=dba
oracle.install.db.OSOPER_GROUP=oper
oracle.install.db.OSBACKUPDBA_GROUP=backupdba
oracle.install.db.OSDGDBA_GROUP=dgdba
oracle.install.db.OSKMDBA_GROUP=kmdba
oracle.install.db.OSRACDBA_GROUP=dba
oracle.install.db.rootconfig.executeRootScript=true
oracle.install.db.rootconfig.configMethod=SUDO
oracle.install.db.rootconfig.sudoPath=/bin/sudo
oracle.install.db.rootconfig.sudoUserName=oracle
END
/u01/app/oracle/product/DB192/runInstaller -silent \
-responseFile /u01/oradata/DB192.rsp

You may find that sudo is not accepted without entering the oracle password, so just in case, you can run the root script manually:

sudo /u01/app/oracle/product/DB192/root.sh

Create the databases

I add a ‘#’ in front of the text for easy copy/paste of the whole setup.

# Set the environment and create the directories

ORACLE_HOME=/u01/app/oracle/product/DB192
mkdir -p /u01/oradata /u01/fast_recovery_area

# Divide the huge pages to fit 4 instances equally

SGA_MB=$(awk '/Hugepagesize.*kB/{s=$2} /HugePages_Total/ {f=$2} END{print s*f/1024/4} ' /proc/meminfo)
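# Sanity check: with the 102400 2MB pages reserved above, this should print 51200 (MB per instance)

echo $SGA_MB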

# Here is a script to create the standby redo logs after the creation of the primary database, as this is not done by DBCA (one more group than the online redo logs, and no multiplexing). Mention the thread# or you will get unexpected warnings from VALIDATE DATABASE.

cat > /u01/oradata/add_standby_log.sql <<'SQL'
exec for i in (select t,g+rownum g,s from (select thread# t,max(bytes) s,count(*)+1 c from v$log group by thread#),(select rownum n from xmltable('1 to 100')),(select max(group#) g from v$log) where n<=c) loop execute immediate 'alter database add standby logfile thread '||i.t||' group '||i.g||' size '||i.s; end loop;
exit
SQL
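# The dense one-liner above, unrolled for readability (commented out so that the whole section stays copy/paste safe):
#
# begin
#   for i in (
#     select t, g+rownum g, s
#     from  (select thread# t, max(bytes) s, count(*)+1 c from v$log group by thread#),
#           (select rownum n from xmltable('1 to 100')),
#           (select max(group#) g from v$log)
#     where n<=c
#   ) loop
#     execute immediate 'alter database add standby logfile thread '||i.t||' group '||i.g||' size '||i.s;
#   end loop;
# end;
# /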

# Primary database creation

grep ^CDB1A: /etc/oratab || $ORACLE_HOME/bin/dbca -silent \
-createDatabase -gdbName CDB1 -sid CDB1A \
-createAsContainerDatabase true -numberOfPdbs 1 -pdbName PDB1 \
-sysPassword oracle -systemPassword oracle \
-pdbAdminPassword oracle \
-datafileDestination /u01/oradata -useOMF true -storageType FS \
-recoveryAreaDestination /u01/fast_recovery_area \
-recoveryAreaSize 10240 -enableArchive true \
-memoryMgmtType AUTO_SGA -totalMemory ${SGA_MB:=4096} \
-createListener LISTENER1A:1521 \
-emConfiguration EMEXPRESS -emExpressPort 5501 \
-templateName General_Purpose.dbc \
-databaseType OLTP -sampleSchema true -redoLogFileSize 100 \
-customScripts /u01/oradata/add_standby_log.sql \
-initParams \
db_unique_name=CDB1A,dg_broker_start=true,shared_pool_size=600M

# Database duplication for the standby — this is possible from DBCA in 19c:

grep ^CDB1B: /etc/oratab || $ORACLE_HOME/bin/dbca -silent \
-createDuplicateDB -gdbName CDB1 -sid CDB1B \
-createAsStandby -dbUniqueName CDB1B -sysPassword oracle \
-primaryDBConnectionString $(hostname):1521/CDB1A \
-datafileDestination /u01/oradata -useOMF true -storageType FS \
-recoveryAreaDestination /u01/fast_recovery_area \
-recoveryAreaSize 10240 -enableArchive true \
-createListener LISTENER1B:1522

# Fix the local listener (I see no reason to resolve it in tnsnames.ora, and anyway DBCA fails to update it correctly). Note that, being on the same host, I need two listeners because the same services (the PDB names) are registered by both instances.

. oraenv <<< CDB1A
ORACLE_SID=CDB1A sqlplus / as sysdba <<<"alter system set local_listener='//$(hostname):1521';"
ORACLE_SID=CDB1B sqlplus / as sysdba <<<"alter system set local_listener='//$(hostname):1522';"
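# Optional check: each listener should now show only the services of its own instance

for i in {1521..1522} ; do lsnrctl services //$(hostname):$i ; done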

# Enable flashback (I’ll use FSFO)

for i in CDB1{B,A} ; do ORACLE_SID=$i rman target / <<<'alter database flashback on;' ; done

# Create a CDB1 service for the primary

ORACLE_SID=CDB1A sqlplus / as sysdba <<'SQL'
exec dbms_service.create_service(service_name=>'CDB1',network_name=>'CDB1'); dbms_service.start_service(service_name=>'CDB1');
create or replace trigger start_role_services
after db_role_change on database
begin
for i in (select database_role, open_mode from v$database) loop
if i.database_role='PRIMARY' then dbms_service.start_service('CDB1'); end if;
end loop;
end;
/
show errors
SQL
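# Clients can then reach whichever instance currently holds the primary role through this service, for example by listing both listener addresses (19c Easy Connect accepts an address list):

sqlplus system/oracle@"//$(hostname):1521,$(hostname):1522/CDB1"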

# Listener static entries (first I remove the SID_LIST added by DBCA)

sed -i -e '/^SID_LIST/,$d' $ORACLE_HOME/network/admin/listener.ora
cat >>$ORACLE_HOME/network/admin/listener.ora <<CAT
SID_LIST_LISTENER1A=(SID_LIST=(SID_DESC=(GLOBAL_DBNAME=CDB1A_DGMGRL)(SID_NAME=CDB1A)(ORACLE_HOME=$ORACLE_HOME)))
SID_LIST_LISTENER1B=(SID_LIST=(SID_DESC=(GLOBAL_DBNAME=CDB1B_DGMGRL)(SID_NAME=CDB1B)(ORACLE_HOME=$ORACLE_HOME)))
CAT
for i in {1521..1522} ; do lsnrctl reload //$(hostname):$i ; done

# A minimal Data Guard Broker configuration that allows FSFO

cat > /u01/oradata/configure_dgb.sql <<DGB
create configuration CDB1 as primary database is CDB1A connect identifier is '//$(hostname):1521/CDB1A_DGMGRL';
add database CDB1B as connect identifier is '//$(hostname):1522/CDB1B_DGMGRL';
edit database CDB1A set property LogXptMode='SYNC';
edit database CDB1B set property LogXptMode='SYNC';
enable configuration;
edit database CDB1A set property StandbyFileManagement='AUTO';
edit database CDB1B set property StandbyFileManagement='AUTO';
edit database CDB1B set property FastStartFailoverTarget='CDB1A';
edit database CDB1A set property FastStartFailoverTarget='CDB1B';
edit configuration set protection mode as MaxAvailability;
enable fast_start failover observe only;
DGB
dgmgrl sys/oracle </u01/oradata/configure_dgb.sql

# Set ‘Y’ in oratab for dbstart (but you will have to start LISTENER1A and LISTENER1B with lsnrctl first anyway, because dbstart starts only the default LISTENER)

sudo sed -i -e '/^CDB1[AB]:/s/:N$/:Y/' /etc/oratab

# Archived logs can be deleted on both sides with no need to back them up (as long as there is no gap)

for i in CDB1{B,A} ; do ORACLE_SID=$i rman target / <<<'configure archivelog deletion policy to applied on all standby;' ; done

# Some verifications and testing two switchovers so that we know it works

for i in {1521..1522} ; do lsnrctl status //$(hostname):$i ; done
grep ^CDB1[AB] /etc/oratab
sleep 15
dgmgrl sys/oracle <<<'show configuration'
dgmgrl sys/oracle <<<'show database verbose CDB1B'
sleep 15
dgmgrl sys/oracle <<<'validate database CDB1B' | grep -EC99 "Ready for Switchover: *Yes" && dgmgrl sys/oracle <<<'switchover to CDB1B'
sleep 15
dgmgrl sys/oracle <<<'validate database CDB1A' | grep -EC99 "Ready for Switchover: *Yes" && dgmgrl sys/oracle <<<'switchover to CDB1A'

# dgmgrl can take a command as argument, but I like bash Here strings.

Drop the databases

The following will clean everything (drop databases CDB1A and CDB1B and listeners)

exit # this is to prevent an eager copy/paste
ORACLE_HOME=/u01/app/oracle/product/DB192
mkdir -p /u01/oradata /u01/fast_recovery_area
for i in $(awk -F: '/^CDB[1-2][A-B]?:/{print $1}' /etc/oratab | sort -r)
do
$ORACLE_HOME/bin/dbca -silent -deleteDatabase -sourceDB $i \
-forceArchiveLogDeletion -sysDBAPassword oracle -sysDBAUserName sys
done
for i in {1521..1524} ; do lsnrctl stop //$(hostname):$i ; done
rm -rf /u01/oradata/CDB1* \
/u01/fast_recovery_area/CDB1* \
$ORACLE_HOME/network/admin/{listener,tnsnames,sqlnet}.ora \
/u01/app/oracle/cfgtoollogs/dbca/CDB1{A,B} \
/u01/app/oracle/diag/rdbms/*/CDB1{A,B} \
$ORACLE_HOME/dbs/*{A,B}* \
$ORACLE_HOME/rdbms/log/*{A,B}* \
$ORACLE_HOME/rdbms/audit/*{A,B}*

I do the same with a CDB2 database, and that’s my sandbox for many demos. Something like copy/pasting the previous commands through:

sed -e s/CDB1/CDB2/g -e s/LISTENER1/LISTENER2/g -e s/1521/1523/g -e s/1522/1524/g -e s/PDB1/PDB2/g | sh

In summary, DBCA automates the duplicate for a standby. It would be nice if it also created the listeners properly with the static entries, set up the Data Guard Broker configuration, and created the standby redo logs.

Franck Pachot's picture

19c Observe-Only Data Guard FSFO: no split-brain risk in manual failover

Fast-Start Failover (FSFO) is an amazing feature of Oracle Data Guard Broker which brings High Availability (HA) features in addition to the Disaster Recovery (DR) ones.

Data Guard as an HA solution

By default, a physical standby database protects you from disasters (like when your data center is on fire or underwater, or suffers a power cut…), but it requires a manual action to do the failover. Then, even if the failover is quick (seconds to minutes) and there is no loss of data (if in SYNC), it cannot be considered HA because of the manual decision, which can take hours. The idea behind the manual decision is to understand the cause, as it may be better to just wait in case of a transient failure, especially if the standby site is less powerful and application performance would be degraded.

With FSFO, the failure of the primary database is automatically detected (with an observer process constantly testing the connection from another site) and then the failover to a designated standby is initiated. If Transparent Application Failover (TAF) is correctly configured, the application will directly reconnect to the new primary without your intervention.

How does the observer decide that the primary is down? By default, the failover is triggered when the primary is reachable by neither the observer nor the standby. In 12c, thanks to the ObserverOverride property, it is also possible to have the observer initiate the failover as soon as it cannot connect to the primary, even when the standby can still see the primary. This can be used when the observer is on the application side.

12cR2 brought new additional possibilities for complex Data Guard configurations, such as defining multiple observers and multiple targets.

No human decision?

However, while FSFO is faster than a manual decision, it may initiate a failover when we do not want one. If the observer is on the same site as the primary (a bad idea) then a network failure between the two sites will trigger a failover. When running FSFO you must be sure that all of your infrastructure is OK, so that no undesired failover is initiated, and that you have no manual tasks to do in order to get the applications running again.

Enable FSFO in observe-only mode

If that is not the case, you will not want to automate the failover. This is where the 19c ‘observe-only’ mode is interesting: the observer will report the failure but will not initiate a failover. I enable FSFO in this mode:

[oracle@cloud ~]$ dgmgrl / "enable fast_start failover observe only"
DGMGRL for Linux: Release 19.0.0.0.0 - Production on Wed Mar 6 10:02:01 2019
Version 19.2.0.0.0
Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.
Welcome to DGMGRL, type "help" for information.
Connected to "CDB1A"
Connected as SYSDG.
Enabled in Observe-Only Mode.

And start the observer (you should use the full syntax to run it in the background with a log file, but this is just for a small test which logs to my screen):

[oracle@cloud ~]$ dgmgrl sys/oracle "start observer" &
[1] 26589
[oracle@cloud ~]$ DGMGRL for Linux: Release 19.0.0.0.0 - Production on Wed Mar 6 10:04:50 2019
Version 19.2.0.0.0
Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.
Welcome to DGMGRL, type "help" for information.
Connected to "CDB1A"
Connected as SYSDBA.
Observer 'cloud' started
[W000 2019-03-06T10:04:50.342+00:00] Observer trace level is set to USER
[W000 2019-03-06T10:04:50.342+00:00] Try to connect to the primary.
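For reference, the backgrounded form with a state file and a log file would look something like this (the file locations are my choice):


dgmgrl sys/oracle "start observer in background file is '/u01/oradata/fsfo.dat' logfile is '/u01/oradata/observer.log'"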

Now if I simulate a crash of the primary database, killing the PMON process:

[oracle@cloud ~]$ kill -9 $(pgrep -f ora_pmon_CDB1A)

This immediately logs the following from the observer:

[oracle@cloud ~]$ [W000 2019-03-06T10:37:50.198+00:00] Primary database cannot be reached.
[W000 2019-03-06T10:37:50.198+00:00] Fast-Start Failover threshold has not exceeded. Retry for the next 30 seconds

The default 30 seconds threshold is there to avoid a failover decision in case of a transient network failure.

After this threshold, a fast-start failover would have been initiated in normal FSFO mode, but here it is only reported in the log:


[W000 2019-03-06T10:37:51.198+00:00] Try to connect to the primary.
ORA-12537: TNS:connection closed
Unable to connect to database using //cloud:1521/CDB1A_DGMGRL
[W000 2019-03-06T10:37:51.235+00:00] Primary database cannot be reached.
[W000 2019-03-06T10:37:52.235+00:00] Try to connect to the primary.
[W000 2019-03-06T10:37:53.287+00:00] Primary database cannot be reached.
[W000 2019-03-06T10:37:54.288+00:00] Try to connect to the primary.
[W000 2019-03-06T10:38:17.902+00:00] Primary database cannot be reached.
[W000 2019-03-06T10:38:17.902+00:00] Fast-Start Failover threshold has not exceeded. Retry for the next 3 seconds
[W000 2019-03-06T10:38:18.902+00:00] Try to connect to the primary.
[W000 2019-03-06T10:38:19.952+00:00] Primary database cannot be reached.
[W000 2019-03-06T10:38:19.952+00:00] Fast-Start Failover threshold has not exceeded. Retry for the next 1 second
[W000 2019-03-06T10:38:20.952+00:00] Try to connect to the primary.
[W000 2019-03-06T10:38:22.003+00:00] Primary database cannot be reached.
[W000 2019-03-06T10:38:22.003+00:00] Fast-Start Failover threshold has expired.
[W000 2019-03-06T10:38:22.003+00:00] Try to connect to the standby.
[W000 2019-03-06T10:38:22.003+00:00] Making a last connection attempt to primary database before proceeding with Fast-Start Failover.
[W000 2019-03-06T10:38:22.003+00:00] Check if the standby is ready for failover.
[W000 2019-03-06T10:38:22.006+00:00] A fast-start failover would have been initiated...
[W000 2019-03-06T10:38:22.006+00:00] Unable to failover since this observer is in observe-only mode
[W000 2019-03-06T10:38:22.006+00:00] Fast-Start Failover is not possible because observe-only mode.
[W000 2019-03-06T10:38:23.007+00:00] Try to connect to the primary.
[W000 2019-03-06T10:38:24.058+00:00] Primary database cannot be reached.
[W000 2019-03-06T10:38:25.058+00:00] Try to connect to the primary.
[W000 2019-03-06T10:38:26.110+00:00] Primary database cannot be reached.
[W000 2019-03-06T10:38:27.110+00:00] Try to connect to the primary.
[W000 2019-03-06T10:38:50.727+00:00] Primary database cannot be reached.
[W000 2019-03-06T10:38:50.727+00:00] Fast-Start Failover threshold has not exceeded. Retry for the next 3 seconds
[W000 2019-03-06T10:38:51.727+00:00] Try to connect to the primary.
[W000 2019-03-06T10:38:52.777+00:00] Primary database cannot be reached.
[W000 2019-03-06T10:38:52.777+00:00] Fast-Start Failover threshold has not exceeded. Retry for the next 1 second
...

Of course, this is something you should monitor and make a manual decision about. It is useful for running FSFO in a ‘dry-run’ mode, where you want to be sure that your infrastructure is OK, without false alerts, before making it fully automated. But even when you don’t want to go to fully automated FSFO, this mode is very helpful for the post-failover tasks.

Let’s say I decide to failover. I do not stop the observer. I do not disable FSFO. I just initiate the failover manually:

DGMGRL> failover to cdb1b;
Failover succeeded, new primary is "cdb1b"

The observer is still running and detects that the primary has changed:

[W000 2019-03-06T13:13:25.182+00:00] Primary database cannot be reached.
[W000 2019-03-06T13:13:25.182+00:00] Fast-Start Failover threshold has not exceeded. Retry for the next 2 seconds
[W000 2019-03-06T13:13:26.183+00:00] Try to connect to the primary.
[W000 2019-03-06T13:13:27.233+00:00] Primary database cannot be reached.
[W000 2019-03-06T13:13:27.233+00:00] Fast-Start Failover threshold has expired.
[W000 2019-03-06T13:13:27.233+00:00] Try to connect to the standby.
[W000 2019-03-06T13:13:27.233+00:00] Making a last connection attempt to primary database before proceeding with Fast-Start Failover.
[W000 2019-03-06T13:13:27.280+00:00] Check if the standby is ready for failover.
[W000 2019-03-06T13:13:27.285+00:00] A fast-start failover would have been initiated...
[W000 2019-03-06T13:13:27.285+00:00] Unable to failover since this observer is in observe-only mode
[W000 2019-03-06T13:13:27.285+00:00] Fast-Start Failover is not possible because observe-only mode.
[W000 2019-03-06T13:13:28.284+00:00] Try to connect to the primary.
[W000 2019-03-06T13:13:28.284+00:00] Primary database cannot be reached.
[W000 2019-03-06T13:13:28.284+00:00] Fast-Start Failover observe-only mode enabled.
[W000 2019-03-06T13:13:28.284+00:00] Will not attempt a Fast-Start Failover.
[W000 2019-03-06T13:13:28.284+00:00] Retry connecting to primary.
[W000 2019-03-06T13:13:29.284+00:00] Try to connect to the primary.
[W000 2019-03-06T13:13:30.285+00:00] Primary database has changed to cdb1b.
[W000 2019-03-06T13:13:30.337+00:00] Try to connect to the primary.
[W000 2019-03-06T13:13:30.337+00:00] Try to connect to the primary //instance-20190305-2110:1522/CDB1B_DGMGRL.
[W000 2019-03-06T13:13:30.380+00:00] The standby cdb1a needs to be reinstated
[W000 2019-03-06T13:13:30.380+00:00] Try to connect to the new standby cdb1a.
[W000 2019-03-06T13:13:30.380+00:00] Connection to the primary restored!
[W000 2019-03-06T13:13:32.380+00:00] Connection to the new standby restored!
[W000 2019-03-06T13:13:32.380+00:00] Disconnecting from database //instance-20190305-2110:1522/CDB1B_DGMGRL.
[W000 2019-03-06T13:13:33.384+00:00] Failed to ping the new standby.
[W000 2019-03-06T13:13:34.385+00:00] Try to connect to the new standby cdb1a.
[W000 2019-03-06T13:13:36.385+00:00] Connection to the new standby restored!
[W000 2019-03-06T13:13:36.388+00:00] Failed to ping the new standby.
[W000 2019-03-06T13:13:37.389+00:00] Try to connect to the new standby cdb1a.

Reinstate old primary as new standby

This is the other automation of FSFO: as soon as the old site comes up again, this situation is detected. It is mandatory in the fully automated failover because you don’t want the old, desynchronized primary to come up and have applications connecting to it again, which is a case of split-brain. FSFO detects this and can even automatically reinstate this old primary as a new standby (this is why FSFO requires the databases to have FLASHBACK ON).

Let’s see it. I start up CDB1A, the one I killed before:

ORACLE instance started.
Total System Global Area 4.2950E+10 bytes
Fixed Size 30386848 bytes
Variable Size 8187281408 bytes
Database Buffers 3.4628E+10 bytes
Redo Buffers 103829504 bytes
Database mounted.
ORA-16649: possible failover to another database prevents this database from being opened

It is impossible to open it, to protect against split-brain.

And then I do nothing, just let the observer do its magic:

SQL> [W000 2019-03-06T13:17:32.728+00:00] Try to connect to the primary //instance-20190305-2110:1522/CDB1B_DGMGRL.
[W000 2019-03-06T13:17:33.728+00:00] Connection to the primary restored!
[W000 2019-03-06T13:17:33.728+00:00] Wait for new primary to be ready to reinstate.
[W000 2019-03-06T13:17:33.731+00:00] New primary is now ready to reinstate.
[W000 2019-03-06T13:17:34.732+00:00] Issuing REINSTATE command.
2019-03-06T13:17:34.732+00:00
Initiating reinstatement for database "cdb1a"...
Reinstating database "cdb1a", please wait...
[W000 2019-03-06T13:17:48.752+00:00] The standby cdb1a is ready to be a FSFO target
Reinstatement of database "cdb1a" succeeded
2019-03-06T13:18:17.381+00:00
[W000 2019-03-06T13:18:17.789+00:00] Successfully reinstated database cdb1a.

Now the old primary is up in a standby role, fully synchronized. This is awesome. Just imagine the following situation: you have a few hours of power cut. You decide to fail over. This is one command only with Data Guard Broker, but you probably have a lot of work to do with all the other systems.

And then you will postpone the reinstate of the old primary database. And maybe you will forget it after that exhausting day. And then you run unprotected… Now Murphy’s law breaks the DR site… and you have lost everything. If you run FSFO, the old site has been synchronized again without your intervention and is ready for a new failover. And in 19c this is possible even if you want to keep full control, with a manual decision to fail over.

Jonathan Lewis's picture

12c Snapshots

I published a note a few years ago about using the 12c “with function” mechanism for writing simple SQL statements to take deltas of dynamic performance views. The example I supplied was for v$event_histogram, but I’ve just been prompted by a question on ODC to supply a couple more – v$session_event and v$sesstat (joined to v$statname) – so that you can use one session to get an idea of the work done and time spent by another session. The first script reports wait time:


rem
rem     Program:        12c_with_function_2.sql
rem     Dated:          July 2013
rem
rem     See also
rem     12c_with_function.sql
rem     https://jonathanlewis.wordpress.com/2013/06/30/12c-fun/
rem
rem     Notes:
rem             Reports session WAIT time
rem             Modify the list of SIDs of interest
rem             Set the time in seconds
rem

define m_snap_time = 60
define m_sid_list  = '3, 4, 121, 127'

set timing on
set sqlterminator off

set linesize 180

break on sid skip 1

with
        function wait_row (
                i_secs  number, 
                i_return        number
        ) return number
        is
        begin
                dbms_lock.sleep(i_secs);
                return i_return;
        end;
select
        sid, 
        sum(total_waits),
        sum(total_timeouts), 
        sum(time_waited), 
        event
from    (
        select
                sid, event_id, 
                -total_waits total_waits, 
                -total_timeouts total_timeouts, 
                -time_waited time_waited, 
                -time_waited_micro time_waited_micro, 
                event
        from    v$session_event
        where   sid in ( &m_sid_list )
        union all
        select
                null, null, null, null, null, wait_row(&m_snap_time, 0), null
        from    dual
        union all
        select
                sid, event_id, total_waits, total_timeouts, time_waited, time_waited_micro, event
        from    v$session_event
        where   sid in ( &m_sid_list )
        )
where
        time_waited_micro != 0
group by
        sid, event_id, event
having
        sum(time_waited) != 0
order by
        sid, sum(time_waited) desc
/


And this one reports session activity:

rem
rem     Program:        12c_with_function_3.sql
rem     Dated:          July 2013
rem
rem     See also
rem     12c_with_function.sql
rem     https://jonathanlewis.wordpress.com/2013/06/30/12c-fun/
rem
rem     Notes:
rem             Reports session stats
rem             Modify the list of SIDs of interest
rem             Set the time in seconds
rem

define m_snap_time = 60
define m_sid_list  = '3, 4, 13, 357'


set timing on
set sqlterminator off

set linesize 180

break on sid skip 1
column name format a64

with
        function wait_row (
                i_secs  number, 
                i_return        number
        ) return number
        is
        begin
                dbms_lock.sleep(i_secs);
                return i_return;
        end;
select
        sid, 
        name,
        sum(value)
from    (
        select
                ss.sid, 
                ss.statistic#,
                sn.name,
                -ss.value value
        from
                v$sesstat       ss,
                v$statname      sn
        where   ss.sid in ( &m_sid_list )
        and     sn.statistic# = ss.statistic#
        union all
        select
                null, null, null, wait_row(&m_snap_time, 0)
        from    dual
        union all
        select
                ss.sid, ss.statistic#, sn.name, ss.value value
        from
                v$sesstat       ss,
                v$statname      sn
        where   ss.sid in ( &m_sid_list )
        and     sn.statistic# = ss.statistic#
        )
where
        value != 0
group by
        sid, statistic#, name
having
        sum(value) != 0
order by
        sid, statistic#
/


You’ll notice that I’ve used dbms_lock.sleep() in my wait function – and the session running the SQL can be granted the execute privilege on the package through a role to make this work – but if you’re running Oracle 18 then you’ve probably noticed that the sleep() function and procedure have been copied to the dbms_session package.
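So on 18c and later the function can be written with no extra grant needed, simply by swapping the package name; a sketch of the change:


with
        function wait_row (
                i_secs  number,
                i_return        number
        ) return number
        is
        begin
                dbms_session.sleep(i_secs);
                return i_return;
        end;

The rest of both scripts stays the same.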

 

fritshoogland's picture

Getting locale warnings when logging on to Linux

This blogpost is about the reason for, and the solution to, getting the following message (or messages like it) when logging in to a Linux box using ssh:

-bash: warning: setlocale: LC_CTYPE: cannot change locale (UTF-8): No such file or directory

However, this is only a warning. Mind that such an issue can come up in another, more disruptive way; in the past I had issues running perl because of the same problem:

[root@dmcel01 ~]# /usr/local/bin/ipconf -verify
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = "en_US.UTF-8",
LC_ALL = "UTF-8",
LC_CTYPE = "UTF-8",
LANG = "en_US.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").

This is a utility that comes with Exadata, and I needed it during patching, but it didn’t execute because of this issue. That was more than 4 years ago, and I actually created an SR for it. But because I was patching during a downtime window, I had to solve it quickly. I managed to find the cause and reported it back to Oracle, which turned it into a MOS note (as Oracle seems to have a habit of doing) without proper attribution to me, and a bit wrong at that. You can see the note here (MOS account required): Running /Usr/Local/Bin/Ipconf -Verify Results In ‘Perl: Warning: Setting Locale Failed.’ (Doc ID 1393517.1)

Anyway, today I ran into this again. The issue is that the CLIENT (the machine from which the ssh session is created) explicitly requests its locale settings on the machine it connects to. This is done in the client’s global ssh settings, in a file called ssh_config.

I should add that I use MacOS/OSX, which is derived from BSD and is unix-like.

The file ssh_config used to be found in /etc/ssh (/etc/ssh/ssh_config), but at least on 10.14.3 (Mojave) this has changed to /private/etc/ssh/ssh_config. The MOS note describes /etc/ssh_config, but I don’t think ssh_config is present directly in /etc in any OS by default; it is mostly found in /etc/ssh.

In my case, after updating to OSX Mojave, the ssh_config file was overwritten, giving me back the locale issue. This is what the original default host settings look like:

Host *
SendEnv LANG LC_*

To solve the locale issues (again, for me), change it very simply to:

Host *
SendEnv LANG
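If you cannot fix every client, an alternative (assuming you have root on the Linux server) is to stop sshd from accepting these variables at all: comment out the AcceptEnv line(s) that forward LANG and the LC_* variables in /etc/ssh/sshd_config on the server (the exact lines vary per distribution), then restart sshd:

#AcceptEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES

systemctl restart sshd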

connor_mc_d's picture

Disabled EZCONNECT

Just a normal start to the day today…I had my coffee


and then started working on some AskTOM questions. Naturally, pretty much the first thing I needed to do was connect to my database, and then this happened:


C:\oracle\product\18\bin>sqlplus scott/tiger@//gtx:1518/pdb1

SQL*Plus: Release 18.0.0.0.0 - Production on Wed Mar 6 09:23:09 2019
Version 18.5.0.0.0

Copyright (c) 1982, 2018, Oracle.  All rights reserved.

ERROR:
ORA-12154: TNS:could not resolve the connect identifier specified

I made the standard assumptions

  • I had set ORACLE_HOME wrong….checked that, nope!
  • I had my TNS_ADMIN set wrong….nope!
  • Listener not configured correctly….nope!
  • Database not started or not registered with listener….nope!
  • Had my good friends at Microsoft restarted the PC with Windows Update without telling me?…nope!

so once I’d ticked off the normal culprits, I then tried other software installations on the same machine:


C:\oracle\instantclient>sqlplus scott/tiger@//gtx:1518/pdb1

SQL*Plus: Release 12.1.0.2.0 Production on Wed Mar 6 09:21:44 2019

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Last Successful login time: Wed Mar 06 2019 09:13:31 +08:00

Connected to:
Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production

C:\oracle\product\xe\18\dbhomeXE\bin>sqlplus scott/tiger@//gtx:1518/pdb1

SQL*Plus: Release 18.0.0.0.0 - Production on Wed Mar 6 09:22:38 2019
Version 18.4.0.0.0

Copyright (c) 1982, 2018, Oracle.  All rights reserved.

Last Successful login time: Wed Mar 06 2019 09:21:44 +08:00

Connected to:
Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production
Version 18.5.0.0.0

and voila! Obviously everything is OK from the listener and database perspective.

So back to the original installation, and then came the light bulb moment. I had been tinkering with sqlnet.ora to help out with a different question yesterday, and I had isolated the connection options down to only TNSNAMES:


NAMES.DIRECTORY_PATH= (TNSNAMES)

In such a scenario, you need to explicitly include the EZCONNECT method for the connection to work:


NAMES.DIRECTORY_PATH= (TNSNAMES, EZCONNECT)

and then everything was fine back in the original installation directory.


C:\oracle\product\18\bin>sqlplus scott/tiger@//gtx:1518/pdb1

SQL*Plus: Release 18.0.0.0.0 - Production on Wed Mar 6 09:50:52 2019
Version 18.5.0.0.0

Copyright (c) 1982, 2018, Oracle.  All rights reserved.

Last Successful login time: Wed Mar 06 2019 09:45:06 +08:00

Connected to:
Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production
Version 18.5.0.0.0

If you have never touched your sqlnet.ora, it probably looks like this:


NAMES.DIRECTORY_PATH= (TNSNAMES, ONAMES, HOSTNAME)

but there is no need to panic. Those default values allow EZCONNECT as well.

Franck Pachot's picture

My next Conferences in 2019

In my 2019 talks, there is a lot of performance stuff for DBAs and Developers:

This year started with the amazing OBUG Belgium Tech Days 2019, where I presented:

The co-location of data and code, in present and future (like the MLE engine running JavaScript or Python in the database)

The most relevant statistics gathering in the 12.2 database (from 12cR2, 18c, 19c)

A different view on Join Methods by tracing the internal functions.

The Riga Dev Days in Latvia:

Riga Dev Days | Tech conference and the annual meeting point for developers in Baltic States.

Where I talk about microservices, and data/code co-location

Riga, 31.05.2019, 11:00–12:00
(Microservices: Get Rid of Your DBA and Send the DB into Burnout)

And the major conference about Oracle Performance:

Performance Days: Where ORACLE Database Specialists meet

Where I talk about statistics gathering best practices and new features (19c), and explain join methods and optimization in a full demo session

Zurich, 27.09. 2019, 15:20–16:40
Join: Methods and Optimization
Zurich, 27.09. 2019, 11:00–12:20
Do You Gather the Statistics in the Way the Optimizer Expects Them?

I will add more, of course. Do not hesitate to follow me on Twitter (@FranckPachot) for news, comments, and questions.

Jonathan Lewis's picture

Cartesian Join

I wrote this note a little over 4 years ago (Jan 2015) but failed to publish it for some reason. I’ve just rediscovered it and it’s got a couple of details that are worth mentioning, so I’ve decided to go ahead and publish it now.

A recent [ed: 4 year old] question on the OTN SQL forum asked for help in “multiplying up” data – producing multiple rows from a single row source. This is something I’ve done fairly often when modelling a problem, for example by generating an orders table and then generating an order_lines table from the orders table, and there are a couple of traps to consider.

The problem the OP had was that their base data was the result set from a complex query – which ran “fine”, but took 10 minutes to complete when a Cartesian join to a table holding just three rows was included. Unfortunately the OP didn’t supply, or even comment on, the execution plans. The obvious guess, of course, is that the extra table resulted in a completely different execution plan rather than the expected “do the original query then multiply by 3” plan, in which case the solution to the problem is (probably) simple – stick the original query into a non-mergeable view before doing the join.

Assume we have the following tables, t1 has 100,000 rows (generated from the SQL in this article), t2 has 4 rows where column id2 has the values from 1 to 4, t3 is empty – we can model the basic requirement with the query shown below:


SQL> desc t1
 Name                    Null?    Type
 ----------------------- -------- ----------------
 ID                               NUMBER
 C1                               CHAR(2)
 C2                               CHAR(2)
 C3                               CHAR(2)
 C4                               CHAR(2)
 PADDING                          VARCHAR2(100)

SQL> desc t2
 Name                    Null?    Type
 ----------------------- -------- ----------------
 ID2                              NUMBER

SQL> desc t3
 Name                    Null?    Type
 ----------------------- -------- ----------------
 ID                               NUMBER
 ID2                              NUMBER
 C1                               CHAR(2)
 C2                               CHAR(2)
 C3                               CHAR(2)
 C4                               CHAR(2)
 PADDING                          VARCHAR2(100)


insert into t3
select
        t1.id, t2.id2, t1.c1, t1.c2, c3, t1.c4, t1.padding
from
       (select * from t1) t1,
        t2
;

If we “lose” the plan for the original “select * from t1” (assuming t1 was really a complicated view) when we extend to the Cartesian join all we need to do is the following:


insert into t3
select
        /*+ leading(t1 t2) */
        t1.id, t2.id2, t1.c1, t1.c2, c3, t1.c4, t1.padding
from
        (select /*+ no_merge */ * from t1) t1,
        t2
;

This is where the problem starts to get a little interesting. The /*+ no_merge */ hint is (usually) a winner in situations like this – but why have I included a /*+ leading() */ hint choosing to put t2 (the small table) second in the join order? It’s because of the way that Cartesian Merge Joins work, combined with an eye to where my most important resource bottleneck is likely to be. Here’s the execution plan, taken from memory after executing this statement with statistics_level set to all (11.2.0.4):


SQL_ID  azu8ntfjg9pwj, child number 0
-------------------------------------
insert into t3 select   /*+ leading(t1 t2) */  t1.id, t2.id2, t1.c1,
t1.c2, c3, t1.c4, t1.padding from  (select /*+ no_merge */ * from t1)
t1,   t2

Plan hash value: 1055157753

----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |      |      1 |        |      0 |00:00:10.28 |   48255 |       |       |          |
|   1 |  LOAD TABLE CONVENTIONAL |      |      1 |        |      0 |00:00:10.28 |   48255 |       |       |          |
|   2 |   MERGE JOIN CARTESIAN   |      |      1 |    400K|    400K|00:00:06.30 |    1727 |       |       |          |
|   3 |    VIEW                  |      |      1 |    100K|    100K|00:00:00.94 |    1725 |       |       |          |
|   4 |     TABLE ACCESS FULL    | T1   |      1 |    100K|    100K|00:00:00.38 |    1725 |       |       |          |
|   5 |    BUFFER SORT           |      |    100K|      4 |    400K|00:00:01.78 |       2 |  3072 |  3072 | 2048  (0)|
|   6 |     TABLE ACCESS FULL    | T2   |      1 |      4 |      4 |00:00:00.01 |       2 |       |       |          |
----------------------------------------------------------------------------------------------------------------------

Let’s try that again (building from scratch, of course) with the table order reversed in the leading() hint:


SQL_ID  52qaydutssvn5, child number 0
-------------------------------------
insert into t3 select   /*+ leading(t2 t1) */  t1.id, t2.id2, t1.c1,
t1.c2, c3, t1.c4, t1.padding from  (select /*+ no_merge */ * from t1)
t1,   t2

Plan hash value: 2126214450

----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |      |      1 |        |      0 |00:00:12.29 |   48311 |   6352 |   1588 |       |       |          |
|   1 |  LOAD TABLE CONVENTIONAL |      |      1 |        |      0 |00:00:12.29 |   48311 |   6352 |   1588 |       |       |          |
|   2 |   MERGE JOIN CARTESIAN   |      |      1 |    400K|    400K|00:00:06.64 |    1729 |   6352 |   1588 |       |       |          |
|   3 |    TABLE ACCESS FULL     | T2   |      1 |      4 |      4 |00:00:00.01 |       2 |      0 |      0 |       |       |          |
|   4 |    BUFFER SORT           |      |      4 |    100K|    400K|00:00:04.45 |    1727 |   6352 |   1588 |    13M|  1416K| 9244K (0)|
|   5 |     VIEW                 |      |      1 |    100K|    100K|00:00:00.80 |    1725 |      0 |      0 |       |       |          |
|   6 |      TABLE ACCESS FULL   | T1   |      1 |    100K|    100K|00:00:00.28 |    1725 |      0 |      0 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------

There’s one important detail that’s not explicit in the execution plans – I’ve set the workarea_size_policy to manual and the sort_area_size to 10MB to demonstrate the impact of having a dataset that is too large for the session’s workarea limit.

The results, in terms of timing, are borderline. With the “correct” choice of order the completion time is 10.28 seconds compared to 12.29 seconds, though if you look at the time for the Merge Join Cartesian operation the difference is much less significant. The critical point, though, appears at operation 4 – the Buffer Sort. I set my sort_area_size to something that I knew was smaller than the data set I needed to buffer, so the operation had to spill to disc. Ignoring overheads and rounding errors, the data from the 1,727 blocks I read from the table at pctfree = 10 was dumped to the temporary space in 1,588 packed blocks (sanity check: 1,727 * 0.9 = 1,554); and then those blocks were read back once for each row from the driving t2 table (sanity check: 1,588 * 4 = 6,352).

With my setup I had a choice of bottlenecks:  scan a very small data set in memory 100,000 times to burn CPU, or scan a large data set from disc 4 times. There wasn’t much difference in my case: but the difference could be significant on a full-scale production system.  By default the optimizer happened to pick the “wrong” path with my data sets.

But there’s something even more important than this difference in resource usage to generate the data: what the data looks like after it’s been generated. Here’s a simple query to show you the first few rows of the stored result sets in the two different tests:


SQL> select id, id2, c1, c2, c3, c4 from t3 where rownum <= 8;

Data from leading (t1 t2)
=========================
        ID        ID2 C1 C2 C3 C4
---------- ---------- -- -- -- --
         1          1 BV GF JB LY
         1          2 BV GF JB LY
         1          3 BV GF JB LY
         1          4 BV GF JB LY
         2          1 YV LH MT VM
         2          2 YV LH MT VM
         2          3 YV LH MT VM
         2          4 YV LH MT VM


Data from leading (t2 t1)
=========================
        ID        ID2 C1 C2 C3 C4
---------- ---------- -- -- -- --
         1          1 BV GF JB LY
         2          1 YV LH MT VM
         3          1 IE YE TS DP
         4          1 DA JY GS AW
         5          1 ZA DC KD CF
         6          1 VJ JI TC RI
         7          1 DN RY KC BE
         8          1 WP EQ UM VY

If we had been using code like this to generate an order_lines table from an orders table, with leading(orders t2) we would have “order lines” clustered around the “order number” – which is a realistic model; with leading(t2 orders) the clustering disappears (technically the order numbers are clustered around the order lines). It’s this aspect of the data that might have a much more important impact on the suitability (and timing) of any testing you may be doing than a little time lost or gained in generating the raw data.
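For example, the order_lines generation might look something like this (orders and order_lines are hypothetical tables here, with three lines generated per order):


insert into order_lines
select
        /*+ leading(ord gen) */
        ord.order_id, gen.line_no, ord.order_date
from
        (select /*+ no_merge */ * from orders) ord,
        (select rownum line_no from dual connect by level <= 3) gen
;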

Footnote

If you try to repeat this test on your own system don’t expect my timing to match yours. Bear in mind, also, that with statistics_level set to all there’s going to be a CPU overhead that varies between the two options for the leading() hint – the CPU usage on rowsource execution stats could be much higher for the case where one of the operations starts 100,000 times.

 

connor_mc_d's picture

Less slamming V$SQL

It’s a holiday here today in Perth, so just a very brief blog post, because I’ve been busy doing other things today!


When I’m contacted by customers or friends and asked to help out with tuning their applications, probably the two most common issues are one or two poorly performing SQL statements, or the server being overly stressed with CPU load with no single SQL being the obvious cause.

In either case, a common plan of attack is to probe V$SQL to gauge which statements are the potential trouble makers. Thanks to the myriad statistical metrics stored in V$SQL, it is a very flexible view for discovering more about potential problems just by adjusting the predicates. For example:

High I/O?

order by DISK_READS desc

High CPU?

order by BUFFER_GETS desc

Poor parsing applications?

order by PARSE_CALLS / EXECUTIONS

Memory hogs?

order by SHARABLE_MEM desc

each of which can be tweaked to hone in on more specific requirements. But there is a risk in doing this. In particular, if you are diagnosing a system that is under duress, hammering away at V$SQL may do more damage than good. Don’t forget that a system under duress is probably having its shared memory structures contended for just as heavily, and by aggressively querying V$SQL you may be adding to that workload and/or contention.

You might be thinking this leaves us in a Catch-22 situation – how can we diagnose a struggling system if diagnosing it will make it struggle more? Fortunately, a simple remedy is at hand.

Instead of querying V$SQL, try querying V$SQLSTATS instead. From the documentation:

V$SQLSTATS displays basic performance statistics for SQL cursors and contains one row per SQL statement (that is, one row per unique value of SQL_ID). The column definitions for columns in V$SQLSTATS are identical to those in the V$SQL and V$SQLAREA views. However, the V$SQLSTATS view differs from V$SQL and V$SQLAREA in that it is faster, more scalable, and has a greater data retention (the statistics may still appear in this view, even after the cursor has been aged out of the shared pool). Note that V$SQLSTATS contains a subset of columns that appear in V$SQL and V$SQLAREA.

Obviously this isn’t a leave pass to smash away at this performance view, but by avoiding V$SQL you are steering clear of a critical shared memory structure which any session wishing to parse/execute SQL will be competing for. V$SQLSTATS has been present since 10g and should be a drop-in replacement for your existing queries that use V$SQL.
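So, for example, the “high CPU” probe from the list above becomes something like this (limiting to the top 10 by buffer gets):


select *
from (
        select sql_id, plan_hash_value, executions, buffer_gets, disk_reads
        from   v$sqlstats
        order by buffer_gets desc
        )
where rownum <= 10;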

Jonathan Lewis's picture

Hash Optimisation

Franck Pachot did an interesting presentation at the OBUG (Belgium user group) Tech Days showing how to use one of the O/S debug/trace tools to step through the function calls that Oracle made during different types of joins. This prompted me to ask him a question about a possible optimisation of hash joins as follows:

The hash join operation creates an in-memory hash table from the rowsource produced by its first child operation then probes the hash table with rows from the row source produced by the second child operation; but if there are no rows in the first row source then there’s no need to acquire rows from the second row source, so Oracle doesn’t call the second child operation.

Does the hash join adopt a “symmetrical” strategy – starting the second child operation prematurely to find the first couple of rows before executing the first child operation and building the in-memory hash table ?

Franck got back to me before the end of the event with the answer – which was “no”. (This makes sense, of course, because if there’s no data in the second rowsource – which is supposed to be the “big” rowsource – you’ve probably done a lot of work which you may not have needed to do if you’d checked the first rowsource properly first.) (Thought – might Oracle look for the first row from the first row source, then check the second row source for a first row, then finish the first rowsource and build? How difficult do you want to make your life?)

So Franck got me an answer by looking at function calls – but I’d only asked him because I thought he might have done the experiment already and might be able to give me an answer immediately, rather than having to waste his valuable Belgian Beer Drinking time doing tests for me. (And I’ve never got around to playing with strace, ptrace, dtrace, truss, et al., so I didn’t fancy doing the tracing myself, of course.) But this particular question doesn’t need any examination of function calls to get an answer: an extended trace file would be sufficient. Here’s how to start:


rem
rem     Script:         hash_join_opt.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2019
rem
rem     Last tested
rem             11.2.0.4
rem

create table t1
segment creation immediate
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        mod(rownum,20)                  n20,
        lpad(rownum,10,'0')             v1,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4 -- > comment to avoid WordPress format issue
;

create table t2
segment creation immediate
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        mod(rownum,10)                  n10,
        lpad(rownum,10,'0')             v1,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4 -- > comment to avoid WordPress format issue
;


begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T2',
                method_opt  => 'for all columns size 1'
        );
end;
/

set serveroutput off

alter system flush buffer_cache;
execute dbms_lock.sleep(2)

alter session set statistics_level = all;
alter session set events '10046 trace name context forever, level 8';

select
        t1.v1, t2.v1
from
        t1, t2
where
        t1.n20 = 19
and     t2.id = t1.id
and     t2.n10 = 7.5
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));
alter session set events '10046 trace name context off';
alter session set statistics_level = typical;

I’ve created a pair of tables with similar volumes of data and then executed a query that should do a hash join between the tables. The t2 table will return no rows since the predicate on n10 is asking for a value that doesn’t exist, but the optimizer will think it will supply 10% of the table since I’ve avoided creating a histogram on the column and the num_distinct for the n10 column is 10. You’ll notice that I’ve flushed the buffer_cache (with a couple of seconds’ pause, just in case) before executing the query. Here’s the execution plan with rowsource execution stats:


SQL_ID  bscu1r7cu36ur, child number 0
-------------------------------------
select  t1.v1, t2.v1 from  t1, t2 where  t1.n20 = 19 and t2.id = t1.id
and t2.n10 = 7.5

Plan hash value: 1838229974

-------------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |      0 |00:00:00.04 |     352 |    348 |       |       |          |
|*  1 |  HASH JOIN         |      |      1 |    500 |      0 |00:00:00.04 |     352 |    348 |  1519K|  1519K| 1487K (0)|
|*  2 |   TABLE ACCESS FULL| T1   |      1 |    500 |    500 |00:00:00.03 |     176 |    174 |       |       |          |
|*  3 |   TABLE ACCESS FULL| T2   |      1 |   1000 |      0 |00:00:00.01 |     176 |    174 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T2"."ID"="T1"."ID")
   2 - filter("T1"."N20"=19)
   3 - filter("T2"."N10"=7.5)

As you can see, the tablescan of t2 returned no rows so the hash join returned no rows, but the stats show 174 blocks read from both t1 and t2. The fact that we read 174 blocks from t2 doesn’t mean we didn’t do a pre-emptive check, of course, as those reads could (by the hypothesis) have been the reads required by the pre-emptive check. The fact that we did a complete scan of both tables, though, tells us that the reads on t2 (probably) weren’t from a pre-emptive check (if they were, then Oracle shouldn’t have scanned the whole of t1).

We can get extra corroborative evidence, though, when we check the trace file, which shows us the following waits:

PARSING IN CURSOR #139927916187824 len=88 dep=0 uid=62 oct=3 lid=62 tim=1551445144496821 hv=3651246935 ad='97353e10' sqlid='bscu1r7cu36ur'
select
        t1.v1, t2.v1
from
        t1, t2
where
        t1.n20 = 19
and     t2.id = t1.id
and     t2.n10 = 7.5
END OF STMT

PARSE #139927916187824:c=3000,e=2760,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=1838229974,tim=1551445144496811
EXEC #139927916187824:c=0,e=78,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1838229974,tim=1551445144497155
WAIT #139927916187824: nam='SQL*Net message to client' ela= 8 driver id=1650815232 #bytes=1 p3=0 obj#=351199 tim=1551445144497224

WAIT #139927916187824: nam='db file sequential read' ela= 634 file#=5 block#=130 blocks=1 obj#=351198 tim=1551445144498386
WAIT #139927916187824: nam='db file scattered read' ela= 397 file#=5 block#=131 blocks=5 obj#=351198 tim=1551445144498971
WAIT #139927916187824: nam='db file scattered read' ela= 1752 file#=5 block#=136 blocks=8 obj#=351198 tim=1551445144502411
WAIT #139927916187824: nam='db file scattered read' ela= 426 file#=5 block#=145 blocks=7 obj#=351198 tim=1551445144504056
WAIT #139927916187824: nam='db file scattered read' ela= 590 file#=5 block#=152 blocks=8 obj#=351198 tim=1551445144505655
WAIT #139927916187824: nam='db file scattered read' ela= 8063 file#=5 block#=161 blocks=7 obj#=351198 tim=1551445144514574
WAIT #139927916187824: nam='db file scattered read' ela= 402 file#=5 block#=168 blocks=8 obj#=351198 tim=1551445144515741
WAIT #139927916187824: nam='db file scattered read' ela= 803 file#=5 block#=177 blocks=7 obj#=351198 tim=1551445144517495
WAIT #139927916187824: nam='db file scattered read' ela= 423 file#=5 block#=184 blocks=8 obj#=351198 tim=1551445144518743
WAIT #139927916187824: nam='db file scattered read' ela= 422 file#=5 block#=193 blocks=7 obj#=351198 tim=1551445144520696
WAIT #139927916187824: nam='db file scattered read' ela= 266 file#=5 block#=200 blocks=8 obj#=351198 tim=1551445144521460
WAIT #139927916187824: nam='db file scattered read' ela= 307 file#=5 block#=209 blocks=7 obj#=351198 tim=1551445144522249
WAIT #139927916187824: nam='db file scattered read' ela= 313 file#=5 block#=216 blocks=8 obj#=351198 tim=1551445144523166
WAIT #139927916187824: nam='db file scattered read' ela= 247 file#=5 block#=225 blocks=7 obj#=351198 tim=1551445144523927
WAIT #139927916187824: nam='db file scattered read' ela= 315 file#=5 block#=232 blocks=8 obj#=351198 tim=1551445144524646
WAIT #139927916187824: nam='db file scattered read' ela= 326 file#=5 block#=241 blocks=7 obj#=351198 tim=1551445144525565
WAIT #139927916187824: nam='db file scattered read' ela= 186 file#=5 block#=248 blocks=8 obj#=351198 tim=1551445144526255
WAIT #139927916187824: nam='db file scattered read' ela= 341 file#=5 block#=258 blocks=55 obj#=351198 tim=1551445144527363
----------------
WAIT #139927916187824: nam='db file sequential read' ela= 201 file#=5 block#=386 blocks=1 obj#=351199 tim=1551445144531165
WAIT #139927916187824: nam='db file scattered read' ela= 221 file#=5 block#=387 blocks=5 obj#=351199 tim=1551445144531492
WAIT #139927916187824: nam='db file scattered read' ela= 194 file#=5 block#=392 blocks=8 obj#=351199 tim=1551445144531750
WAIT #139927916187824: nam='db file scattered read' ela= 301 file#=5 block#=401 blocks=7 obj#=351199 tim=1551445144532271
WAIT #139927916187824: nam='db file scattered read' ela= 272 file#=5 block#=408 blocks=8 obj#=351199 tim=1551445144532756
WAIT #139927916187824: nam='db file scattered read' ela= 258 file#=5 block#=417 blocks=7 obj#=351199 tim=1551445144533218
WAIT #139927916187824: nam='db file scattered read' ela= 242 file#=5 block#=424 blocks=8 obj#=351199 tim=1551445144533704
WAIT #139927916187824: nam='db file scattered read' ela= 232 file#=5 block#=433 blocks=7 obj#=351199 tim=1551445144534125
WAIT #139927916187824: nam='db file scattered read' ela= 213 file#=5 block#=440 blocks=8 obj#=351199 tim=1551445144534506
WAIT #139927916187824: nam='db file scattered read' ela= 241 file#=5 block#=449 blocks=7 obj#=351199 tim=1551445144534914
WAIT #139927916187824: nam='db file scattered read' ela= 221 file#=5 block#=456 blocks=8 obj#=351199 tim=1551445144535258
WAIT #139927916187824: nam='db file scattered read' ela= 311 file#=5 block#=465 blocks=7 obj#=351199 tim=1551445144536518
WAIT #139927916187824: nam='db file scattered read' ela= 215 file#=5 block#=472 blocks=8 obj#=351199 tim=1551445144536906
WAIT #139927916187824: nam='db file scattered read' ela= 164 file#=5 block#=481 blocks=7 obj#=351199 tim=1551445144537343
WAIT #139927916187824: nam='db file scattered read' ela= 160 file#=5 block#=488 blocks=8 obj#=351199 tim=1551445144537624
WAIT #139927916187824: nam='db file scattered read' ela= 253 file#=5 block#=497 blocks=7 obj#=351199 tim=1551445144538041
WAIT #139927916187824: nam='db file scattered read' ela= 252 file#=5 block#=504 blocks=8 obj#=351199 tim=1551445144538456
WAIT #139927916187824: nam='db file scattered read' ela= 947 file#=5 block#=514 blocks=55 obj#=351199 tim=1551445144539690

FETCH #139927916187824:c=25996,e=43227,p=348,cr=352,cu=0,mis=0,r=0,dep=0,og=1,plh=1838229974,tim=1551445144540502

I’ve inserted a few gaps into this section of the trace file but haven’t deleted any lines. As you can see there’s a series of reads for obj# 351198 (table t1), followed by a series of reads of obj# 351199 (table t2). We don’t do anything subtle like peeking at t1, switching to peek at t2, then continuing with t1: it’s a simple end-to-end brute-force scan of each table in turn.
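
A side note for anyone reproducing the test: the wait lines above come from extended SQL trace. One way (not the only one) to capture them is the level-8 trace event, which includes wait information:

alter session set events '10046 trace name context forever, level 8';

-- ... run the test query ...

alter session set events '10046 trace name context off';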

But there’s more …

Before I created the test above I decided to check whether I had already done the test some time in the past and written about it. A search through my blog didn’t turn up any notes about this question, but it did uncover the following footnote to an earlier posting about hash joins:

taking a closer look at the set of trace files generated in the broadcast test I discovered that the first set of slaves start their parallel tablescan of t1 but stop after just one read from each slave, then the second set of slaves scans and builds the hash table from t2 before calling for further data from t1.

Naturally I included a link to the footnote in the draft notes for this blog entry, then promptly forgot about writing the blog note until a couple of days ago, when Chinar Aliyev produced a note on the hash join buffered operation that included an execution plan similar to the broadcast plan from my earlier blog note. So I thought I’d take another look at it, because it looks as if some versions of the parallel hash join can do exactly the pre-emptive test that the serial execution plan doesn’t.

So, here’s a query to run against the same data set, and the resulting execution plan – pulled from memory after executing the query:


select
        /*+
                parallel(2)
                leading(t1 t2)
                use_hash(t2)
                pq_distribute(t2 none broadcast)
        */
        t1.v1, t2.v1
--      t1.v1, count(*)
from
        t1, t2
where
        t1.n20 = 19
and     t2.id = t1.id
and     t2.n10 = 7
;

----------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Starts | E-Rows |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem |  O/1/M   |
----------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          |      1 |        |        |      |            |      0 |00:00:00.22 |      10 |      2 |       |       |          |
|   1 |  PX COORDINATOR          |          |      1 |        |        |      |            |      0 |00:00:00.22 |      10 |      2 |       |       |          |
|   2 |   PX SEND QC (RANDOM)    | :TQ10001 |      0 |    500 |  Q1,01 | P->S | QC (RAND)  |      0 |00:00:00.01 |       0 |      0 |       |       |          |
|*  3 |    HASH JOIN             |          |      1 |    500 |  Q1,01 | PCWP |            |      0 |00:00:00.12 |     128 |    173 |  1483K|  1483K|     2/0/0|
|   4 |     PX BLOCK ITERATOR    |          |      2 |    500 |  Q1,01 | PCWC |            |    500 |00:00:00.03 |     248 |    173 |       |       |          |
|*  5 |      TABLE ACCESS FULL   | T1       |     25 |    500 |  Q1,01 | PCWP |            |    500 |00:00:00.02 |     248 |    173 |       |       |          |
|   6 |     BUFFER SORT          |          |      2 |        |  Q1,01 | PCWC |            |   2000 |00:00:00.07 |       0 |      0 |   108K|   108K|     2/0/0|
|   7 |      PX RECEIVE          |          |      2 |   1000 |  Q1,01 | PCWP |            |   2000 |00:00:00.06 |       0 |      0 |       |       |          |
|   8 |       PX SEND BROADCAST  | :TQ10000 |      0 |   1000 |  Q1,00 | P->P | BROADCAST  |      0 |00:00:00.01 |       0 |      0 |       |       |          |
|   9 |        PX BLOCK ITERATOR |          |      2 |   1000 |  Q1,00 | PCWC |            |   1000 |00:00:00.05 |     248 |    173 |       |       |          |
|* 10 |         TABLE ACCESS FULL| T2       |     25 |   1000 |  Q1,00 | PCWP |            |   1000 |00:00:00.04 |     248 |    173 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------------------------

Now the rule for reading a DFO tree in a parallel plan is to follow the table queues (a.k.a. DFO numbers / TQ numbers). In this case, though, we have a bit of a mixture: the tablescan of t1 isn’t used as the input to a distribution, so it’s hard to know whether the first physical activity is supposed to be the tablescan of t2 (feeding the zeroth table queue) or the tablescan of t1, which doesn’t feed a table queue until after the hash join.

Given that t2 is being broadcast, we should probably expect to see the first set of parallel query slaves start with a scan of “random” sections of t1 to build an in-memory hash table, and then see the second set of slaves scan t2 and broadcast the results to the first set (every slave gets a copy of every row) to do the probe. We can try to confirm this by looking at the trace files: first we extract the “db file scattered read” lines from the four trace files, then we sort them by timestamp.
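
As a sketch of the mechanics (the file names here are the slave trace files from my test run, so yours will differ), something along these lines produces the two listings below: the raw per-slave extract first, then the tim=/obj#= pairs sorted into time order.

# first listing: the multiblock-read waits, prefixed with the trace file name
grep "db file scattered read" test_p00*.trc

# second listing: keep just the obj#= and tim= values, swap them, sort by time
grep -h "db file scattered read" test_p00*.trc |
  sed -n 's/.*\(obj#=[0-9]*\) \(tim=[0-9]*\)$/\2 \1/p' |
  sort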

test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 677 file#=5 block#=220 blocks=4 obj#=351218 tim=1551450113700076
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 214 file#=5 block#=225 blocks=3 obj#=351218 tim=1551450113701131
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 238 file#=5 block#=161 blocks=7 obj#=351218 tim=1551450113702026
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 318 file#=5 block#=177 blocks=6 obj#=351218 tim=1551450113703464
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 371 file#=5 block#=190 blocks=2 obj#=351218 tim=1551450113705168
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 227 file#=5 block#=193 blocks=5 obj#=351218 tim=1551450113705690
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 225 file#=5 block#=138 blocks=6 obj#=351218 tim=1551450113706850
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 487 file#=5 block#=131 blocks=5 obj#=351218 tim=1551450113708512
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 228 file#=5 block#=136 blocks=2 obj#=351218 tim=1551450113709161
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 219 file#=5 block#=294 blocks=7 obj#=351218 tim=1551450113710099
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 150 file#=5 block#=184 blocks=6 obj#=351218 tim=1551450113711398
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 229 file#=5 block#=273 blocks=7 obj#=351218 tim=1551450113712456
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 223 file#=5 block#=280 blocks=7 obj#=351218 tim=1551450113713829
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 253 file#=5 block#=168 blocks=7 obj#=351218 tim=1551450113715124
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 1014 file#=5 block#=308 blocks=5 obj#=351218 tim=1551450113717046
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 266 file#=5 block#=235 blocks=5 obj#=351218 tim=1551450113718684
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 208 file#=5 block#=241 blocks=2 obj#=351218 tim=1551450113719321
----
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 326 file#=5 block#=287 blocks=7 obj#=351218 tim=1551450113700003
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 147 file#=5 block#=198 blocks=2 obj#=351218 tim=1551450113701394
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 72 file#=5 block#=200 blocks=5 obj#=351218 tim=1551450113701840
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 459 file#=5 block#=153 blocks=7 obj#=351218 tim=1551450113703442
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 308 file#=5 block#=259 blocks=7 obj#=351218 tim=1551450113705132
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 313 file#=5 block#=266 blocks=7 obj#=351218 tim=1551450113706540
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 304 file#=5 block#=243 blocks=5 obj#=351218 tim=1551450113707925
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 219 file#=5 block#=248 blocks=2 obj#=351218 tim=1551450113708505
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 590 file#=5 block#=228 blocks=4 obj#=351218 tim=1551450113709705
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 178 file#=5 block#=232 blocks=3 obj#=351218 tim=1551450113710270
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 142 file#=5 block#=205 blocks=3 obj#=351218 tim=1551450113711046
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 98 file#=5 block#=209 blocks=4 obj#=351218 tim=1551450113711457
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 179 file#=5 block#=213 blocks=3 obj#=351218 tim=1551450113712308
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 296 file#=5 block#=216 blocks=4 obj#=351218 tim=1551450113712967
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 188 file#=5 block#=301 blocks=7 obj#=351218 tim=1551450113714238
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 207 file#=5 block#=250 blocks=6 obj#=351218 tim=1551450113715393
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 567 file#=5 block#=146 blocks=6 obj#=351218 tim=1551450113717686
----
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 260 file#=5 block#=529 blocks=7 obj#=351219 tim=1551450113698677
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 196 file#=5 block#=522 blocks=7 obj#=351219 tim=1551450113722013
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 145 file#=5 block#=564 blocks=5 obj#=351219 tim=1551450113723239
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 317 file#=5 block#=491 blocks=5 obj#=351219 tim=1551450113724578
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 128 file#=5 block#=497 blocks=2 obj#=351219 tim=1551450113725217
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 495 file#=5 block#=402 blocks=6 obj#=351219 tim=1551450113726250
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 168 file#=5 block#=394 blocks=6 obj#=351219 tim=1551450113727928
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 249 file#=5 block#=424 blocks=7 obj#=351219 tim=1551450113730384
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 120 file#=5 block#=476 blocks=4 obj#=351219 tim=1551450113731225
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 485 file#=5 block#=481 blocks=3 obj#=351219 tim=1551450113732064
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 594 file#=5 block#=387 blocks=5 obj#=351219 tim=1551450113734168
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 323 file#=5 block#=392 blocks=2 obj#=351219 tim=1551450113735040
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 121 file#=5 block#=506 blocks=6 obj#=351219 tim=1551450113736253
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 285 file#=5 block#=440 blocks=6 obj#=351219 tim=1551450113739084
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 345 file#=5 block#=433 blocks=6 obj#=351219 tim=1551450113740940
----
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 261 file#=5 block#=550 blocks=7 obj#=351219 tim=1551450113698687
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 155 file#=5 block#=484 blocks=4 obj#=351219 tim=1551450113721738
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 131 file#=5 block#=488 blocks=3 obj#=351219 tim=1551450113722260
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 143 file#=5 block#=557 blocks=7 obj#=351219 tim=1551450113723245
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 138 file#=5 block#=454 blocks=2 obj#=351219 tim=1551450113724545
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 124 file#=5 block#=456 blocks=5 obj#=351219 tim=1551450113724954
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 243 file#=5 block#=469 blocks=3 obj#=351219 tim=1551450113725947
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 523 file#=5 block#=472 blocks=4 obj#=351219 tim=1551450113726745
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 178 file#=5 block#=409 blocks=7 obj#=351219 tim=1551450113727781
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 230 file#=5 block#=543 blocks=7 obj#=351219 tim=1551450113729470
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 488 file#=5 block#=515 blocks=7 obj#=351219 tim=1551450113731618
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 356 file#=5 block#=461 blocks=3 obj#=351219 tim=1551450113733840
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 185 file#=5 block#=465 blocks=4 obj#=351219 tim=1551450113735195
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 284 file#=5 block#=536 blocks=7 obj#=351219 tim=1551450113736172
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 204 file#=5 block#=499 blocks=5 obj#=351219 tim=1551450113737957
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 306 file#=5 block#=504 blocks=2 obj#=351219 tim=1551450113738871
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 281 file#=5 block#=446 blocks=2 obj#=351219 tim=1551450113739710
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 158 file#=5 block#=449 blocks=5 obj#=351219 tim=1551450113740159
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 99 file#=5 block#=417 blocks=7 obj#=351219 tim=1551450113741709


tim=1551450113698677 obj#=351219
tim=1551450113698687 obj#=351219
----
tim=1551450113700003 obj#=351218
tim=1551450113700076 obj#=351218
tim=1551450113701131 obj#=351218
tim=1551450113701394 obj#=351218
tim=1551450113701840 obj#=351218
tim=1551450113702026 obj#=351218
tim=1551450113703442 obj#=351218
tim=1551450113703464 obj#=351218
tim=1551450113705132 obj#=351218
tim=1551450113705168 obj#=351218
tim=1551450113705690 obj#=351218
tim=1551450113706540 obj#=351218
tim=1551450113706850 obj#=351218
tim=1551450113707925 obj#=351218
tim=1551450113708505 obj#=351218
tim=1551450113708512 obj#=351218
tim=1551450113709161 obj#=351218
tim=1551450113709705 obj#=351218
tim=1551450113710099 obj#=351218
tim=1551450113710270 obj#=351218
tim=1551450113711046 obj#=351218
tim=1551450113711398 obj#=351218
tim=1551450113711457 obj#=351218
tim=1551450113712308 obj#=351218
tim=1551450113712456 obj#=351218
tim=1551450113712967 obj#=351218
tim=1551450113713829 obj#=351218
tim=1551450113714238 obj#=351218
tim=1551450113715124 obj#=351218
tim=1551450113715393 obj#=351218
tim=1551450113717046 obj#=351218
tim=1551450113717686 obj#=351218
tim=1551450113718684 obj#=351218
tim=1551450113719321 obj#=351218
--
tim=1551450113721738 obj#=351219
tim=1551450113722013 obj#=351219
tim=1551450113722260 obj#=351219
tim=1551450113723239 obj#=351219
tim=1551450113723245 obj#=351219
tim=1551450113724545 obj#=351219
tim=1551450113724578 obj#=351219
tim=1551450113724954 obj#=351219
tim=1551450113725217 obj#=351219
tim=1551450113725947 obj#=351219
tim=1551450113726250 obj#=351219
tim=1551450113726745 obj#=351219
tim=1551450113727781 obj#=351219
tim=1551450113727928 obj#=351219
tim=1551450113729470 obj#=351219
tim=1551450113730384 obj#=351219
tim=1551450113731225 obj#=351219
tim=1551450113731618 obj#=351219
tim=1551450113732064 obj#=351219
tim=1551450113733840 obj#=351219
tim=1551450113734168 obj#=351219
tim=1551450113735040 obj#=351219
tim=1551450113735195 obj#=351219
tim=1551450113736172 obj#=351219
tim=1551450113736253 obj#=351219
tim=1551450113737957 obj#=351219
tim=1551450113738871 obj#=351219
tim=1551450113739084 obj#=351219
tim=1551450113739710 obj#=351219
tim=1551450113740159 obj#=351219
tim=1551450113740940 obj#=351219
tim=1551450113741709 obj#=351219

I’ve got four slave processes, p000 through p003, and in the top part of the output you can see that p000/p001 scan object 351218 (after several experiments the object_id for t1 has climbed a bit) while p002/p003 scan object 351219 (t2). Sorting by the timestamps (tim=), though, we can see that there are two reads of t2 (351219), followed by many reads of t1 (351218), finishing with a load of reads of t2. Cross-checking the timestamps on the sorted output against the original extract, we can see that the two initial reads of t2 are the first reads by p002 and p003 respectively.
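
The cross-check itself can be as simple as grepping those two timestamps straight back out of the trace files (the same hypothetical file names as before):

grep "tim=1551450113698677" test_p00*.trc
grep "tim=1551450113698687" test_p00*.trc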

tim=1551450113698677 obj#=351219
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 260 file#=5 block#=529 blocks=7 obj#=351219 tim=1551450113698677

tim=1551450113698687 obj#=351219
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 261 file#=5 block#=550 blocks=7 obj#=351219 tim=1551450113698687

It really does look as if at run time Oracle does the scan and build for t1 first but starts with a quick peek at t2 to check that it’s going to return some data. We can be a little more fussy with this test – let’s change some of the data in t2 so that Oracle doesn’t find any rows in the first two scattered reads and see if anything changes.

This is going to take a little fiddling, but in my case I can see that the first reads by slaves p002/p003 are looking at file 5, blocks 529 – 535 and blocks 550 – 556, and that, broadly speaking, the scattered reads are working backwards down the table. So for the next run of the test I’m going to update a load of data that appears in the second half of the table, using a simple command to put the n10 column value out of range:


-- first pass: take the join value away from the second half of t2 only
update t2 set n10 = 0 where n10 = 7 and id > 750;
commit;

-- test the query

-- second pass: take the join value away from every matching row in t2
update t2 set n10 = 0 where n10 = 7;
commit;

-- test the query

After the first update the pattern of reads (in time order) changed to the following (a quick way to tally such runs is sketched just after this list):

  • 22 scattered reads of t2
  • 34 scattered reads of t1
  • 12 scattered reads of t2
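
A quick way to tally runs like these, building on the earlier extract (again a sketch, with my hypothetical file names): sort the reads into time order, keep only the obj# column, and let uniq -c count the consecutive repeats.

grep -h "db file scattered read" test_p00*.trc |
  sed -n 's/.*\(obj#=[0-9]*\) \(tim=[0-9]*\)$/\2 \1/p' |
  sort |
  awk '{ print $2 }' |
  uniq -c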

This really does look like Oracle checking whether t2 will have any data before doing the full scan and build from t1.

Unfortunately, when I did the final update that eliminated all the interesting data from t2, the pattern (in time order) turned into:

  • 34 scattered reads of t2
  • 34 scattered reads of t1

In other words, having discovered NO data in t2 that could join, Oracle still scanned and built the hash table from t1. So maybe there’s a reason other than a sanity check for peeking at the second table, or maybe there’s a bug in the extreme case where the second table produces no data at all. (For reference you might also want to see an older note on how a parallel hash join doesn’t skip the probe table when the build table is empty.)

Just as a quick check that I wasn’t chasing an old bug, at this point I repeated the tests on 18.3.0.0: the same patterns emerged.

 
