Oakies Blog Aggregator

Franck Pachot

12c Multitenant internals: PDB replay DDL for common users

In multitenant, you can create common users, roles, and profiles. You create them in CDB$ROOT with the CONTAINER=ALL clause (which is optional because it is the only possible value when connected to CDB$ROOT) and they are visible to all containers. As the goal of multitenant is to avoid duplicating common metadata across containers, you may think that they are made visible through those magic metadata links. But that’s actually wrong: they are simply replicated with a very simple mechanism: the DDL for common objects is replayed into each user PDB.

I’m connected to CDB2’s CDB$ROOT and I have two pluggable databases:

SQL> show pdbs
 
CON_ID CON_NAME OPEN MODE RESTRICTED
------ -------- ---- ---- ----------
2 PDB$SEED READ ONLY NO
3 PDB1 READ WRITE NO
4 PDB2 MOUNTED

PDB1 is opened and PDB2 is closed.

PDB_SYNC$

In this example, I’ll query PDB_SYNC$ which is the table where Oracle stores all DDL for common users, roles, or profiles in order to be able to replay it later:

SQL> select con_id,scnwrp,scnbas,ctime,name,auxname1,auxname2,opcode,replay#,sqlstmt from containers(pdb_sync$) where con_id=1 and bitand(flags,8)!=8 order by con_id,replay#;
 
CON_ID SCNWRP SCNBAS CTIME NAME AUXNAME1 AUXNAME2 OPCODE REPLAY# SQLSTMT
------ ------ ------ ----- ---- -------- -------- ------ ------- -------
1 0 852610 26-jan-17 02:57:26 CTXSYS SYS 5 1 alter user CTXSYS account unlock identified by *^@
1 0 853177 26-jan-17 02:57:34 CTXSYS SYS 5 2 alter user CTXSYS password expire account lock^@
1 0 1405359 26-jan-17 03:31:31 SYSTEM SYS 5 3 alter user system account lock password expire^@
1 0 1408693 23-dec-17 11:34:43 SYS SYS 5 4 alter user sys account unlock identified by *^@
1 0 1408703 23-dec-17 11:34:43 SYSTEM SYS 5 5 alter user system account unlock identified by *^@
1 0 0 26-jan-17 01:53:02 PDB$LASTREPLAY -1 5

I excluded bitand(flags,8)=8 because those rows concern application containers. I query through containers() to show the CON_ID, but here I filter on con_id=1, which is CDB$ROOT.

You can see some DDL for CTXSYS recorded on January 26th, which is the day when this release (12.2.0.1) was built. I used a template with datafiles to create the CDB with DBCA. And you see some DDL to unlock SYS and SYSTEM on December 23rd, when I created the database. You can also see that the statements are ordered in sequence by REPLAY#.

More interesting is the row with OPCODE=-1, named PDB$LASTREPLAY, which holds the last value of REPLAY#. This means that in this container, CDB$ROOT, all statements with REPLAY#<=5 have been run.
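As a rough mental model (my own sketch, not Oracle code), the mechanism is just a DDL log with a per-container watermark:

```python
# Hypothetical sketch of the PDB_SYNC$ mechanism described above: CDB$ROOT
# logs each common DDL with an increasing REPLAY#, and each container keeps
# a PDB$LASTREPLAY watermark of what it has already run.
ddl_log = [
    (1, "alter user CTXSYS account unlock identified by *"),
    (2, "alter user CTXSYS password expire account lock"),
    (3, "alter user system account lock password expire"),
    (4, "alter user sys account unlock identified by *"),
    (5, "alter user system account unlock identified by *"),
]

def sync_container(last_replay, log):
    """Replay every statement above the container's watermark.

    Returns the statements to run and the new watermark value."""
    to_replay = [stmt for n, stmt in log if n > last_replay]
    new_watermark = max((n for n, _ in log), default=last_replay)
    return to_replay, new_watermark

# A PDB already synced up to REPLAY#=3 replays only statements 4 and 5:
stmts, watermark = sync_container(3, ddl_log)
```

A container that is already at the latest watermark replays nothing, which is exactly what the PDB$LASTREPLAY row records.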

With a similar query, I query the opened PDBs:

SQL> select con_id,scnwrp,scnbas,ctime,name,auxname1,auxname2,opcode,replay#,sqlstmt from containers(pdb_sync$) where con_id>1 and bitand(flags,8)!=8 order by con_id,replay#;
 
CON_ID SCNWRP SCNBAS CTIME NAME AUXNAME1 AUXNAME2 OPCODE REPLAY# SQLSTMT
------ ------ ------ ----- ---- -------- -------- ------ ------- -------
3 0 0 26-jan-17 01:53:02 PDB$LASTREPLAY -1 5

There is only one row here for CON_ID=3, which is PDB1: the PDB$LASTREPLAY row mentioning that all statements up to REPLAY#=5 have also been run in this container.
I don’t see PDB2 (CON_ID=4) here because containers() cannot query closed containers.

CONTAINER=ALL DDL

I’ll run some common DDL to create a profile, a role, and a user:

SQL> create profile C##PROFILE1 limit inactive_account_time 15 container=all;
Profile C##PROFILE1 created.
 
SQL> create role C##ROLE1 container=all;
Role C##ROLE1 created.
 
SQL> create user C##USER1 identified by oracle container=all;
User C##USER1 created.
 
SQL> alter user C##USER1 profile C##PROFILE1;
User C##USER1 altered.
 
SQL> grant C##ROLE1 to C##USER1 container=all;
Grant succeeded.

The C## prefix is mandatory to isolate the common user namespace. You can change it with the common_user_prefix parameter. You can even set it to the empty string, but then you risk a namespace collision when you move a PDB between CDBs having different common profiles or roles.
CONTAINER=ALL is the default and the only possibility when connected to CDB$ROOT, so it is optional. I recommend mentioning it explicitly in order to avoid problems when running the same DDL in CDB$ROOT and in PDBs.

All those DDL have been recorded into PDB_SYNC$ and the REPLAY# has been increased:

SQL> select con_id,scnwrp,scnbas,ctime,name,auxname1,auxname2,opcode,replay#,sqlstmt from containers(pdb_sync$) where con_id=1 and bitand(flags,8)!=8 order by con_id,replay#;
 
CON_ID SCNWRP SCNBAS CTIME NAME AUXNAME1 AUXNAME2 OPCODE REPLAY# SQLSTMT
------ ------ ------ ----- ---- -------- -------- ------ ------- -------
1 0 852610 26-jan-17 02:57:26 CTXSYS SYS 5 1 alter user CTXSYS account unlock identified by *^@
1 0 853177 26-jan-17 02:57:34 CTXSYS SYS 5 2 alter user CTXSYS password expire account lock^@
1 0 1405359 26-jan-17 03:31:31 SYSTEM SYS 5 3 alter user system account lock password expire^@
1 0 1408693 23-dec-17 11:34:43 SYS SYS 5 4 alter user sys account unlock identified by *^@
1 0 1408703 23-dec-17 11:34:43 SYSTEM SYS 5 5 alter user system account unlock identified by *^@
1 0 1466615 29-dec-17 09:26:56 C##PROFILE1 SYS 7 6 create profile C##PROFILE1 limit inactive_account_time 15 container=all^@
1 0 1466641 29-dec-17 09:26:57 C##ROLE1 SYS 3 7 create role C##ROLE1 container=all^@
1 0 1466748 29-dec-17 09:26:58 C##USER1 SYS 1 8 create user C##USER1 identified by * container=all^@
1 0 1466812 29-dec-17 09:26:59 C##USER1 SYS 5 9 alter user C##USER1 profile C##PROFILE1^@
1 0 1466853 29-dec-17 09:26:59 C##USER1 C##ROLE1 SYS 10 10 grant C##ROLE1 to C##USER1 container=all^@
1 0 0 26-jan-17 01:53:02 PDB$LASTREPLAY -1 10

PDB1 (CON_ID=3) is opened read write, so it has been synchronized (the DDL has been run in that container to create the same profile, role, and user) and PDB$LASTREPLAY has been updated there to show that everything has been done:

SQL> select con_id,scnwrp,scnbas,ctime,name,auxname1,auxname2,opcode,replay#,sqlstmt from containers(pdb_sync$) where con_id>1 and bitand(flags,8)!=8 order by con_id,opcode,replay#;
CON_ID SCNWRP SCNBAS CTIME NAME AUXNAME1 AUXNAME2 OPCODE REPLAY# SQLSTMT
------ ------ ------ ----- ---- -------- -------- ------ ------- -------
3 0 0 26-jan-17 01:53:02 PDB$LASTREPLAY -1 10

MOUNTED or READ ONLY

I open PDB2 read only because I want to see what is in PDB_SYNC$ there. But READ ONLY means that the DDL cannot be replayed, because no write is allowed to the local dictionary.

SQL> alter pluggable database PDB2 open read only;
Pluggable database PDB2 altered.

Running the same query as above, I can see that PDB2 (CON_ID=4) is synchronized only up to the statements with REPLAY#=5 because my DDL was not replicated there.
SQL> select con_id,scnwrp,scnbas,ctime,name,auxname1,auxname2,opcode,replay#,sqlstmt from containers(pdb_sync$) where con_id>1 and bitand(flags,8)!=8 order by con_id,opcode,replay#;
 
CON_ID SCNWRP SCNBAS CTIME NAME AUXNAME1 AUXNAME2 OPCODE REPLAY# SQLSTMT
------ ------ ------ ----- ---- -------- -------- ------ ------- -------
3 0 0 26-jan-17 01:53:02 PDB$LASTREPLAY -1 10
4 0 0 26-jan-17 01:53:02 PDB$LASTREPLAY -1 5

Sync at OPEN

When I open PDB2 in read write mode, the DDL can be synchronized:

SQL> alter pluggable database PDB2 open read write force;
Pluggable database PDB2 altered.

At open, the DDL with REPLAY#>5 has been replayed, and once opened the PDB is in sync with CDB$ROOT:

SQL> select con_id,scnwrp,scnbas,ctime,name,auxname1,auxname2,opcode,replay#,sqlstmt from containers(pdb_sync$) where con_id>1 and bitand(flags,8)!=8 order by con_id,opcode,replay#;
 
CON_ID SCNWRP SCNBAS CTIME NAME AUXNAME1 AUXNAME2 OPCODE REPLAY# SQLSTMT
------ ------ ------ ----- ---- -------- -------- ------ ------- -------
3 0 0 26-jan-17 01:53:02 PDB$LASTREPLAY -1 10
4 0 0 26-jan-17 01:53:02 PDB$LASTREPLAY -1 10

So what?

The common users, roles and profiles are not stored only in CDB$ROOT to be shared, but rather replicated to all PDBs. The DDL is replayed synchronously in all pluggable databases opened read write, and stored in the CDB$ROOT PDB_SYNC$ table to be replayed later when non-synced PDBs are opened. I’ll show in the next post what happens when the DDL fails.

Note that even when all pluggable databases are opened read write, the statements are still stored, and purged only later (once replayed in all PDBs), because they are needed when you create a new PDB and open it. PDB$SEED has REPLAY#=0 for PDB$LASTREPLAY, which means that all statements will be replayed.

This is 12.2.0.1, where this mechanism covers only common users, roles and profiles having their DDL run in CDB$ROOT. With application containers, more is recorded: all DML and DDL run between the ‘begin install/upgrade/patch’ and ‘end install/upgrade/patch’ markers in the application root. The statements can then be replayed into the application PDBs with a simple SYNC command. In a future release (18c) we expect this application root behavior to be ported to CDB$ROOT so that we don’t have to run catupgrd.sql in all containers. Then a PDB will probably be patched or upgraded when it is opened.

 

The article 12c Multitenant internals: PDB replay DDL for common users appeared first on Blog dbi services.

Franck Pachot

Drop PDB including datafiles may keep files open

I like that Oracle drops the datafiles from the operating system when we remove them from the database (with drop tablespace or drop pluggable database) because I don’t like orphan files remaining in the filesystem. However, to ensure that space is reclaimed, we must be sure that Oracle did not leave a process with a file open. Linux allows deleting an open file, but then it removes only the directory entry. The consequence is that we no longer see the file, but the space is not reclaimable until the process closes its handle.
Here is a case where I had an issue in 12.2: plugging a PDB is done in parallel, and the parallel processes keep the files open even after we drop the pluggable database.

I have 1.2 GB free on my filesystem:
SQL> host df -h /u02/oradata
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/ol-root 3.1G 1.9G 1.2G 61% /

Plug

I plug a PDB from a PDB archive:
SQL> create pluggable database PDB0
2 using '/u01/app/temp/PDB0.pdb'
3 file_name_convert=('/u01/app/temp','/u02/oradata')
4 /
 
Pluggable database PDB0 created.

Open

In my example the PDB was from an older PSU level. I open it:
SQL> alter pluggable database PDB0 open;
ORA-24344: success with compilation error
 
Pluggable database PDB0 altered.

I got a warning and the PDB is in restricted session mode:
SQL> show pdbs
CON_ID CON_NAME OPEN MODE RESTRICTED
------ ---------- ------------ ----------
2 PDB$SEED READ ONLY NO
3 PDB0 READ WRITE YES
4 PDB1 READ WRITE NO

The reason is that a PSU installed in the CDB$ROOT was not there when the PDB was unplugged:

SQL> select status,message from pdb_plug_in_violations;
STATUS MESSAGE
------ ------------------------------------------------------------------------------------------------------------
PENDING DBRU bundle patch 171017 (DATABASE RELEASE UPDATE 12.2.0.1.171017): Installed in the CDB but not in the PDB.

I can run datapatch, but let’s say that I realize it’s the wrong PDB archive and I want to drop what I’ve imported. Then I expect to reclaim the space in order to be able to import the right one again.

Processes

Before closing the PDB, here are all the processes that have one of the datafiles open:
SQL> host for i in $(fuser /u02/oradata/*) ; do ps --no-headers -p $i ; done
/u02/oradata/sysaux01.dbf:
/u02/oradata/system01.dbf:
/u02/oradata/temp01.dbf:
/u02/oradata/undotbs01.dbf:
/u02/oradata/users01.dbf:
6053 ? 00:00:00 ora_dbw0_cdb1
6122 ? 00:00:01 ora_p003_cdb1
6594 ? 00:00:01 oracle_6594_cdb
6053 ? 00:00:00 ora_dbw0_cdb1
6120 ? 00:00:01 ora_p002_cdb1
6594 ? 00:00:01 oracle_6594_cdb
6053 ? 00:00:00 ora_dbw0_cdb1
6594 ? 00:00:01 oracle_6594_cdb
6053 ? 00:00:00 ora_dbw0_cdb1
6118 ? 00:00:00 ora_p001_cdb1
6594 ? 00:00:01 oracle_6594_cdb
6053 ? 00:00:00 ora_dbw0_cdb1
6116 ? 00:00:00 ora_p000_cdb1
6594 ? 00:00:01 oracle_6594_cdb

There is my session’s shadow process, the DBWR, and the Pnnn parallel processes that copied the datafiles during the plug.
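The fuser loop above can be approximated by scanning /proc directly. Here is a small Python sketch (Linux-specific, my own illustration, not part of any Oracle tooling) of how to find which processes hold a given file open:

```python
import os

def procs_with_open_file(path):
    """Return PIDs holding `path` open, by scanning /proc/<pid>/fd (like fuser)."""
    target = os.path.realpath(path)
    holders = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = f"/proc/{pid}/fd"
        try:
            for fd in os.listdir(fd_dir):
                if os.path.realpath(os.path.join(fd_dir, fd)) == target:
                    holders.append(int(pid))
                    break
        except OSError:
            continue  # process exited, or we lack permission to inspect it
    return holders
```

Run against a datafile, this would list the shadow process, DBWR, and the Pnnn slaves, just like the fuser output above.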

Close

So, I want to drop it and then I close it:
SQL> alter pluggable database PDB0 close;
 
Pluggable database PDB0 altered.

Close means that all files are closed. Is that true? Actually not:
SQL> host for i in $(fuser /u02/oradata/*) ; do ps --no-headers -p $i ; done
/u02/oradata/sysaux01.dbf:
/u02/oradata/system01.dbf:
/u02/oradata/undotbs01.dbf:
/u02/oradata/users01.dbf:
6122 ? 00:00:01 ora_p003_cdb1
6120 ? 00:00:01 ora_p002_cdb1
6118 ? 00:00:00 ora_p001_cdb1
6116 ? 00:00:00 ora_p000_cdb1

The parallel processes still have the datafiles open. This is probably a bug and I’ll open an SR referencing this blog post.

So I want to reclaim space:
SQL> host df -h /u02/oradata
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/ol-root 3.1G 2.6G 527M 84% /

I expect the 527 MB available to go back to 1.2 GB once I drop the PDB.

Drop including datafiles

I cannot drop the PDB and keep the datafiles, or I’ll get: ORA-65179: cannot keep datafiles for a pluggable database that is not unplugged.
I don’t want to unplug it but just drop it, so I must mention INCLUDING DATAFILES:

SQL> drop pluggable database PDB0 including datafiles;
 
Pluggable database PDB0 dropped.

Unfortunately, the space is not reclaimed:
SQL> host df -h /u02/oradata
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/ol-root 3.1G 2.6G 527M 84% /

As you have seen that the Pnnn processes were still there after the close, you know the reason. Linux has removed the directory entry, but the file is still there in the filesystem until the processes close their handles (or the processes are killed). You can see the deleted files with lsof or from the /proc filesystem:

SQL> host find /proc/*/fd -ls 2>/dev/null | grep deleted
79174 0 lrwx------ 1 oracle oinstall 64 Dec 25 21:20 /proc/6116/fd/257 -> /u02/oradata/users01.dbf\ (deleted)
79195 0 lrwx------ 1 oracle oinstall 64 Dec 25 21:20 /proc/6118/fd/257 -> /u02/oradata/undotbs01.dbf\ (deleted)
79216 0 lrwx------ 1 oracle oinstall 64 Dec 25 21:20 /proc/6120/fd/257 -> /u02/oradata/system01.dbf\ (deleted)
79237 0 lrwx------ 1 oracle oinstall 64 Dec 25 21:20 /proc/6122/fd/257 -> /u02/oradata/sysaux01.dbf\ (deleted)

On a running CDB I cannot kill the background processes because they may be doing something useful. Until I can restart the instance, the only way to reclaim the space is to truncate those files to zero bytes. I cannot use the file name, which has been deleted, but I can use the /proc link:

SQL> host find /proc/*/fd -ls 2>/dev/null | grep -E " [/]u02[/]oradata[/].* [(]deleted[)]" | awk '{print $11}' | while read f ; do : > $f ; done

And finally space is reclaimed:
SQL> host df -h /u02/oradata
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/ol-root 3.1G 1.9G 1.2G 61% /
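What happened here can be reproduced outside Oracle. The following Python sketch (Linux-specific, my own illustration) shows that the space held by a deleted-but-open file is only reclaimed when the file is truncated through its /proc link or the handle is closed:

```python
import os
import tempfile

# Create a 1 MB "datafile", keep it open, then unlink it -- the same state
# as the Pnnn processes holding a dropped PDB's datafiles.
path = os.path.join(tempfile.mkdtemp(), "users01.dbf")
fd = os.open(path, os.O_RDWR | os.O_CREAT)
os.write(fd, b"\0" * 1024 * 1024)
os.unlink(path)  # directory entry gone, inode still held open

link = f"/proc/self/fd/{fd}"        # the "(deleted)" link shown by find
size_before = os.fstat(fd).st_size  # still 1 MB: space not yet reclaimed

# Reclaim the space the way the shell loop above does (: > $f): opening
# the /proc link with mode "w" truncates the underlying inode.
with open(link, "w"):
    pass
size_after = os.fstat(fd).st_size   # 0: space reclaimed
```

The same trick works from the shell, which is why writing an empty file over the /proc link released the 700 MB here.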

So what?

I encountered this issue with a PDB plug, but I’m quite sure it can happen in other situations where parallel processes have opened the PDB datafiles. You can imagine the consequences on a multitenant CDB for DBaaS where you have hundreds of PDBs and constantly create and drop them, probably in an automated way. Space not being reclaimed means that at some point the provisioning will fail. I hope this bug will be filed and fixed. Closing a PDB should ensure that all processes close its files. The safe workaround is to include something like what I did, emptying the deleted files, with proper verification that the old files belong to a dropped PDB.

 

The article Drop PDB including datafiles may keep files open appeared first on Blog dbi services.

dbakevlar

The 12 Days of Database Christmas

My brain has a tendency to wake up way before everything else in the house, so I try to keep it occupied as best I’m able without disturbing anyone.  This may explain why so many of my plans are well fleshed out, as I have a tendency to hash them out in the early morning hours as a way of letting the rest of the household sleep.  This morning, on the eve of Christmas, I may have let my brain offer a Database Administrator twist on an old Christmas favorite…

Yes, to the tune of “The 12 Days of Christmas”…

Frits Hoogland

Introduction to pinatrace annotate version 2: a look into latches again

This post is an introduction to pinatrace annotate version 2, which is a tool to annotate the output of the Intel Pin tools ‘pinatrace’ tool.

The pinatrace tool generates a file with every single memory access of a process. Please realise what this means: this is every single read from or write to main memory by the CPU. This allows you to understand what happens within a C function: you can determine what information or data is accessed in which function. Needless to say, this is a tool for internals investigations and research, not for normal daily database maintenance and support. Also, the performance of the process that you attach to is severely impacted, and the tracing can only be turned off by stopping the process. Do not use this on a production database; use it at your own risk for research and investigational purposes only.

The general way to use the pin tool is to run something together with pin, for example ‘ls’:

$ ./pin -t source/tools/SimpleExamples/obj-intel64/pinatrace.so -- ls

This generates an output file that looks like this:

#
# Memory Access Trace Generated By Pin
#
0x00007f49de60f173: W 0x00007ffd160c46f8  8     0x7f49de60f178
0x00007f49de60f840: W 0x00007ffd160c46f0  8                  0
0x00007f49de60f844: W 0x00007ffd160c46e8  8                  0
0x00007f49de60f846: W 0x00007ffd160c46e0  8                  0
0x00007f49de60f848: W 0x00007ffd160c46d8  8                  0
0x00007f49de60f84a: W 0x00007ffd160c46d0  8                  0
0x00007f49de60f84f: W 0x00007ffd160c46c8  8                  0
0x00007f49de60f85f: R 0x00007f49de82fe00  8                0xe
0x00007f49de60f866: W 0x00007f49de82fbd8  8      0x5f25f53b3dc
0x00007f49de60f877: R 0x00007f49de82ff90  8           0x221e00
0x00007f49de60f881: W 0x00007f49de8309a8  8     0x7f49de82fe00
0x00007f49de60f888: W 0x00007f49de830998  8     0x7f49de60e000

This on its own is not very helpful. Let me explain what you are seeing: the first column is the instruction pointer address, in other words the place in the executable at which the memory transfer was initiated; ‘W’ and ‘R’ are self-explanatory; the third column is the memory address the read or write is addressed to; the fourth column is the size of the memory I/O; and the fifth is the value.
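To give an idea of how such a line can be processed programmatically, here is a small Python sketch (my own, not part of pin or the annotate tool) that splits one raw trace line into those five fields:

```python
import re

# One raw pinatrace line: instruction pointer, R/W, target address,
# size in bytes, and the value read or written.
LINE = "0x00007f49de60f85f: R 0x00007f49de82fe00  8                0xe"

PATTERN = re.compile(
    r"(?P<ip>0x[0-9a-f]+):\s+(?P<rw>[RW])\s+(?P<addr>0x[0-9a-f]+)"
    r"\s+(?P<size>\d+)\s+(?P<value>\S+)"
)

fields = PATTERN.match(LINE).groupdict()
# fields["rw"] is "R", fields["size"] is "8", fields["value"] is "0xe"
```

Annotating a trace is then a matter of looking up the `ip`, `addr` and `value` fields against known symbol and memory ranges, which is what the tool below automates.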

Luckily, there is another way to run pin: attaching to an already running process. This is what is helpful for using the memory access trace with an Oracle process. The way this is done is quite simple: first get the Linux process number, then run pin specifying that process number:

$ ps -ef | grep [L]OC
oracle    7250  7249  0 09:42 ?        00:00:00 oracletest (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
$ ./pin -pid 7250 -t source/tools/SimpleExamples/obj-intel64/pinatrace.so

There is an important caveat here: because pin inserts itself into the running process, the pin library (source/tools/SimpleExamples/obj-intel64/pinatrace.so in this case) must be accessible to that process. So if you downloaded, extracted and linked the pin tools as root, and then as root try to attach to an oracle process, it will not work because the oracle process cannot reach the library. So: extract and link the pin tools as oracle.

The output of the pinatrace (a file called pinatrace.out by default) is in $ORACLE_HOME/dbs when you attach to an Oracle database process.

Back to the pinatrace output: what if we could see the C function that was at the instruction pointer address, the name of the Oracle memory area for the memory address and for the fifth column the name of the Oracle memory area again in case the value is actually a pointer? That is exactly what my pinatrace annotate oracle tool does!

How to use it: first clone the pinatrace_annotate repository:

git clone https://gitlab.com/FritsHoogland/pinatrace_annotate.git

Then set the O_BINARY to your oracle binary at line 6 of pinatrace_annotate_oracle.sh.

The next thing is to generate the oracle memory area CSV files. I added a script to generate the memory ranges for the PGA too; however, you should be aware that PGA memory is very volatile, and it requires an execution in the oracle PID you point it to before v$process_memory_detail gets populated. Sometimes I skip the PGA detail step because it doesn’t add much value.
To run the scripts for generating the memory area CSV files:

$ sqlplus / as sysdba
...
SQL> @get_memory_ranges
...
	76 14,20,@1	   8645 		    SYS 			   sqlplus@memory-presentation.local (TNS V1-V3)
...
enter pid number to get pga details: 76
old   1: alter session set events 'immediate trace name pga_detail_get level &pid'
new   1: alter session set events 'immediate trace name pga_detail_get level 76'

Session altered.

execute something (eg. select * from dual) in the target process, then press enter

Now head over to the session you want the PGA memory details from and execute something like ‘select * from dual’, then press enter in the sqlplus session in which you ran get_memory_ranges.sql. This will run for a while; it generates pipe-delimited files that sqlite can parse.

After the CSV files have been generated, create the sqlite database for use by the tool:

$ ./pinatrace_annotate_oracle.sh -c
generate symbols from /u01/app/oracle/product/12.2.0.1/dbhome_1/bin/oracle.
create and populate table symbol_ranges with symbols.
create index.
create table memory_ranges.
import memory_ranges.csv
import memory_ranges_xtables.csv
import memory_ranges_pga.csv
create index.

Now the pinatrace_annotate_oracle.sh tool is all set for usage!

Example usage
Let’s use an actual example to see where the memory trace provides benefit: latches. Nowadays, latches are not the issue they were in the Oracle 8 days. In those days, the latches protecting the database buffer doubly linked lists, the ‘cache buffers chains’ latches, could not be taken in shared mode. That meant that especially index root blocks could suffer latch contention when there was a lot of concurrency.

Some great work has been done on the internals of latches by Andrey Nikolaev; see his latch internals presentation. His work is done on Solaris with Dtrace, I did some of that on linux (part 1, part 2 and part 3).

Let’s replay a willing-to-wait latch get and see what that looks like in the annotated pinatrace output! Set up the host with the pin tools (link them to get the pinatrace.so library), then clone pinatrace_annotate and configure it to point to the oracle binary you are using.

1. session 1:

$ sqlplus / as sysdba
SQL> select distinct sid from v$mystat;

2. session 2:

$ sqlplus / as sysdba
SQL> @get_memory_ranges

identify session 1, and fill out the pid.
take note of the O/S pid.
it will ask you to execute something in the session for which the pid was entered.

3. session 1:

SQL> select * from dual;

4. session 2:
press enter, and wait for @get_memory_ranges to finish.

SQL> exit
$ ./pinatrace_annotate_oracle.sh -c
$ sqlplus / as sysdba
SQL> select addr from v$latch where name = 'cache table scan latch';

note the latch address, it’s 0x6004A400 in my case.

SQL> oradebug setmypid
SQL> oradebug call kslgetl 0x6004A400 0 0 2442

result: ‘Function returned 1’ indicating a successful latch get.

5. session 1:

SQL> oradebug setmypid

6. session 3:
go to the directory where pin is extracted and linked, and go to the directory where the pin executable is located. Execute the following and change the pid for the O/S pid obtained in step 2.

$ ./pin -pid 8645 -t source/tools/SimpleExamples/obj-intel64/pinatrace.so -o pina_latch.txt

now the memory trace is enabled for pid 8645

7. session 1:

SQL> oradebug call kslgetl 0x6004A400 1 0 2442

this will hang, it is waiting for the latch which is taken by session 2.

8. session 2:

SQL> oradebug call kslfre 0x6004A400

this will return ‘Function returned 0’ indicating it freed the latch.

9. session 1:
the oradebug call will now return ‘Function returned 1’ indicating it has gotten the latch.

10. session 3:

$ cp /u01/app/oracle/product/12.2.0.1/dbhome_1/dbs/pina_latch.txt ~/pinatrace_annotate/pina_latch.txt

as indicated, the pinatrace output file will be in $ORACLE_HOME/dbs, and with this I copy the current state of the trace to the pinatrace_annotate directory where my pinatrace annotate tool is cloned.

$ cd ~/pinatrace_annotate
$ ./pinatrace_annotate_oracle.sh pina_latch.txt > pina_latch_annotated.txt

This will run for a while…

Once pinatrace_annotate_oracle.sh is done, open the pina_latch_annotated.txt file. A short guide to what this looks like; the first two lines in my file are:

function+offset:memory address(annotation+offset):Read/Write:size:value hex/dec(pointer annotation+offset)
??:0x00007fff224e6498():R:8:0x11060785/285607813()

The first line is the header, which is displayed once and essentially explains it all. The second line starts with ‘??’. As indicated by the header, the first field is the function. Any call to a function (‘symbol’) that is not in the oracle executable, like functions in shared libraries, does not have an address in the executable. In fact, the addresses at which shared libraries, and thus their functions, are loaded are randomised by default, which explains why the executable does not contain the address.

The second column is the address memory is read from or written into. In the second line, there is nothing in between the parentheses, meaning this address is not in a known memory range.

The ‘R’ or ‘W’ is obvious, and the number is the amount of memory read or written.

The fifth field shows the content of the read or write in hexadecimal and in decimal. Here again the parentheses are empty. I look up the value in the memory areas and, if it falls in a known memory area, add that area as an annotation, including the offset. What you should realise is that I don’t know whether the value is in fact a memory address and thus a pointer, or whether it is coincidentally the same as a known memory area. I do think that knowing a value matches a known memory area is handy and useful; in the C language, pointers are heavily used.

Another example:

nsbasic_brc+548:0x00007f33d3026690(pga|Other, Fixed Uga, Fixed UGA heap+88712 ):R:8:0x7f33d3019600/139860560156160(pga|Other, Fixed Uga, Fixed UGA heap+35320 )

Here we see an example where the function is in the oracle executable: nsbasic_brc, which at offset 548 read 8 bytes from memory address 0x00007f33d3026690. That address was a known memory area at the time of running 0_get_pga_detail.sql (while ‘select * from dual’ was executed): in the PGA, in the Fixed UGA heap at offset 88712. The value it read, 0x7f33d3019600, is a memory address in the same chunk of memory at offset 35320, and therefore likely a pointer.
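The annotation step itself is conceptually just an interval lookup. Here is a minimal sketch (mine, not the actual sqlite query the tool runs; the ranges are made-up illustrations in the spirit of the CSV files):

```python
# Each known memory range: (start address, size, name), as would be loaded
# from the CSV files generated earlier. These example ranges are invented.
ranges = [
    (0x6004a400, 0x100, "fixed sga|(parent)latch:cache table scan latch"),
    (0x7f33d3019600 - 35320, 0x20000, "pga|Other, Fixed Uga, Fixed UGA heap"),
]

def annotate(value, ranges):
    """Return 'name+offset' when value falls inside a known range, else ''."""
    for start, size, name in ranges:
        if start <= value < start + size:
            return f"{name}+{value - start}"
    return ""
```

For example, `annotate(0x7f33d3019600, ranges)` yields the Fixed UGA heap annotation at offset 35320, matching the nsbasic_brc line above; a value outside every range gets the empty annotation shown as `()`.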

Now let’s look further into the annotated pinatrace file. If you search for ‘kslgetl’, you will find the actual latch get executed with oradebug. This is what it looks like on my system:

kslgetl+2:0x00007fff224e4ec0():W:8:0x7fff224e4ff0/140733768945648()
kslgetl+13:0x00007fff224e4ea0():W:8:0x7fff224e5870/140733768947824()
kslgetl+17:0x00007f33d8095f40(pga|Other, free memory, top call heap+243776 ):R:8:0x7f33d8095f40/139860644552512(pga|Other, free memory, top call heap+243776 )
kslgetl+26:0x00007fff224e4e98():W:8:0x1/1()
kslgetl+33:0x00007fff224e4e90():W:8:0/0()
kslgetl+37:0x00007fff224e4ea8():W:8:0x4/4()
kslgetl+44:0x00007f33d8087e78(pga|Other, free memory, top call heap+186232 ):R:8:0x7a63d268/2053362280(shared pool|permanent memor,duration 1,cls perm+5702248 )
kslgetl+58:0x00007fff224e4eb0():W:8:0x7fff224e57d0/140733768947664()
kslgetl+65:0x00007fff224e4e60():W:8:0/0()
kslgetl+69:0x000000006004a40f(fixed sga|(parent)latch:cache table scan latch+15 fixed sga|var:kcb_table_scan_latch_+15 ):R:1:0x37/55()
kslgetl+80:0x000000007a63d5c0(shared pool|X$KSUPR.KSLLALAQ+0 shared pool|permanent memor,duration 1,cls perm+5703104 ):R:8:0/0()
kslgetl+96:0x000000006004a410(fixed sga|(parent)latch:cache table scan latch+16 fixed sga|var:kcb_table_scan_latch_+16 ):R:4:0x1/1()
kslgetl+103:0x000000007a63d5f0(shared pool|X$KSUPR.KSLLALOW+0 shared pool|permanent memor,duration 1,cls perm+5703152 ):R:4:0/0()
kslgetl+109:0x0000000011141524():R:4:0x1cffe/118782()
kslgetl+122:0x000000007a63d5c0(shared pool|X$KSUPR.KSLLALAQ+0 shared pool|permanent memor,duration 1,cls perm+5703104 ):W:8:0x6004a400/1610916864(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 )
kslgetl+131:0x00007f33d8087ea8(pga|Other, free memory, top call heap+186280 ):R:8:0x4b/75()
kslgetl+138:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslgetl+138:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):W:8:0x22/34()
kslgetl+1898:0x000000007a63d5c0(shared pool|X$KSUPR.KSLLALAQ+0 shared pool|permanent memor,duration 1,cls perm+5703104 ):W:8:0/0()
kslgetl+1909:0x00000000600119b4(fixed sga|var:kcbccc_+0 ):R:1:0/0()
kslgetl+1938:0x00007fff224e4e60():R:8:0/0()
kslgetl+1942:0x00007fff224e4e38():W:8:0x10a6084b/279316555()

The two kslgetl+138 lines show that the function read the beginning of the latch memory and found a non-zero value: the Oracle process id that has already taken the latch. In other words, the process found a latch that is already held by another process, Oracle pid 34 (0x22).

If you scroll down further you see that kslgetl called kslges, which is what happens for a willing-to-wait latch get when the latch it wants is already taken. Further down, you will see this pattern:

kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+930:0x00007fff224e4d58():W:4:0/0()
kslges+937:0x00007fff224e4d8a():W:1:0/0()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()

This is what spinning for the latch looks like from a memory access perspective. It means that spinning on a latch does not re-enter the whole latching function (kslgetl or kslges) for each try; rather, when kslgetl finds the latch taken, it calls kslges, which does some housekeeping and then, in a really tight loop, reads the latch memory at offset 0 to detect when it is freed. We can see it is a loop by looking at the offsets in the function: 930, 937 and 944, repeated over and over.

In fact, if I grep for that line and count the matches, we can see that this version of Oracle (12.2.0.1) still spins 20,000 times for a latch in willing-to-wait mode:

$ grep 'kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch' pina_latch_annotated.txt | wc -l
20001

Yes, this shows 20,001 executions; remember we freed the latch in the other session?

$ grep 'kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch' pina_latch_annotated.txt | tail -3
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0x22/34()
kslges+944:0x000000006004a400(fixed sga|(parent)latch:cache table scan latch+0 fixed sga|var:kcb_table_scan_latch_+0 ):R:8:0/0()

The last line shows kslges finding the latch in a free state, which is after it has been woken from sleeping on a semaphore.

A modern version of Oracle doesn't sleep for an arbitrary time and then spin again: it goes to sleep on a semaphore after spinning 20,000 times and waits until it gets posted. That can be seen in the pinatrace output too, but if you want to look at the function flow, it might be better to generate a debugtrace; the pinatrace output shows which information is used within a function and might be too verbose. In fact, the function flow towards the semaphore call is kslges > skgpwwait > sskgpwwait.
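The spin-then-block behaviour described above can be sketched in a few lines of Python. This is an illustration of the general technique (fast try, tight spin on the latch word, then sleep on a semaphore until posted), not Oracle's actual implementation; all names here are invented:

```python
import threading

SPIN_COUNT = 20000  # the spin count observed in the trace above

class Latch:
    def __init__(self):
        self.holder = 0                 # 0 = free (the word kslges keeps re-reading)
        self.sem = threading.Semaphore(0)
        self.waiters = 0
        self.lock = threading.Lock()    # protects holder/waiter bookkeeping

    def try_get(self, pid):
        with self.lock:
            if self.holder == 0:
                self.holder = pid
                return True
        return False

    def get(self, pid):
        if self.try_get(pid):           # immediate-get fast path (kslgetl-like)
            return
        for _ in range(SPIN_COUNT):     # tight spin re-reading the latch word (kslges-like)
            if self.try_get(pid):
                return
        with self.lock:                 # spin exhausted: register as a waiter
            self.waiters += 1
        while not self.try_get(pid):
            self.sem.acquire()          # sleep until the holder posts us
        with self.lock:
            self.waiters -= 1

    def release(self):
        with self.lock:
            self.holder = 0
            if self.waiters:
                self.sem.release()      # post one sleeping waiter
```

A releasing holder posts at most one sleeping waiter, mirroring the post/wait behaviour of the kslges > skgpwwait > sskgpwwait call chain.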

Conclusion
pinatrace_annotate_oracle.sh is a tool to annotate a tracefile generated by the pin tool 'pinatrace' when running the oracle binary. It shows information about every memory access done by a process, which allows you to understand what information is obtained in which Oracle function. This is a tool for research and debugging purposes only.

Tagged: internals, memory, memory trace, oracle, pin, pin tools, pinatrace

Franck Pachot's picture

12cR2 Subquery Elimination

More and more we see crazy queries generated by ORM frameworks or BI query generators. They are built to be easily generated rather than to be optimal, so the optimizer has to implement more and more transformations to get an efficient execution plan. Here is a new one that appeared in Oracle 12cR2: subquery elimination, for when the subquery does not filter any rows.

A semi-join is a join where we do not need to match all rows, but only find one match. We write it with an EXISTS subquery, or an =ANY or =SOME subquery, which is equivalent.
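The semantics can be illustrated with a minimal hash semi-join in Python: a sketch of the general algorithm (build a hash of the inner keys, probe once per outer row), not Oracle's code:

```python
def hash_semi_join(outer_rows, inner_rows, outer_key, inner_key):
    """Return each outer row that has at least one match in inner_rows.

    Unlike an inner join, an outer row is emitted at most once,
    however many inner rows match it."""
    build = {inner_key(r) for r in inner_rows}               # build phase: keys only
    return [r for r in outer_rows if outer_key(r) in build]  # probe phase

emp = [("SMITH", 800), ("ALLEN", 1600), ("WARD", 1250)]
refs = [("SMITH",), ("SMITH",), ("WARD",)]                   # duplicate matches on purpose

matched = hash_semi_join(emp, refs, lambda r: r[0], lambda r: r[0])
# SMITH is returned once even though it matches two inner rows
```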

12.1

Here is the behaviour in 12.1.0.2 when the subquery does not filter any row, because it reads the same table as the outer query without any predicate:
SQL> select * from dbms_xplan.display_cursor(null,null,'allstats last +outline');
PLAN_TABLE_OUTPUT
-----------------
SQL_ID az1jcra46h5ua, child number 1
-------------------------------------
select * from EMP where ename in (select ename from EMP)
 
Plan hash value: 977554918
 
----------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 14 |00:00:00.01 | 13 | | | |
|* 1 | HASH JOIN SEMI | | 1 | 14 | 14 |00:00:00.01 | 13 | 1098K| 1098K| 889K (0)|
| 2 | TABLE ACCESS FULL| EMP | 1 | 14 | 14 |00:00:00.01 | 7 | | | |
| 3 | TABLE ACCESS FULL| EMP | 1 | 14 | 14 |00:00:00.01 | 6 | | | |
----------------------------------------------------------------------------------------------------------------

We read the same table twice, join all rows, and finally return exactly the same rows as those coming from the first full scan. This is not efficient.

12.2

Here is the same query in 12.2, where we can see that the table is scanned only once because the optimizer knows that the subquery does not filter anything:

SQL> select * from dbms_xplan.display_cursor(null,null,'allstats last +outline');
PLAN_TABLE_OUTPUT
SQL_ID az1jcra46h5ua, child number 0
-------------------------------------
select * from EMP where ename in (select ename from EMP)
 
Plan hash value: 3956160932
 
------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 14 |00:00:00.01 | 9 |
|* 1 | TABLE ACCESS FULL| EMP | 1 | 14 | 14 |00:00:00.01 | 9 |
------------------------------------------------------------------------------------

This is mentioned in the outline hints with ELIMINATE_SQ:

Outline Data
-------------
 
/*+
BEGIN_OUTLINE_DATA
IGNORE_OPTIM_EMBEDDED_HINTS
OPTIMIZER_FEATURES_ENABLE('12.2.0.1')
DB_VERSION('12.2.0.1')
ALL_ROWS
OUTLINE_LEAF(@"SEL$D0DB0F9A")
ELIMINATE_SQ(@"SEL$2")
OUTLINE(@"SEL$1")
OUTLINE(@"SEL$2")
FULL(@"SEL$D0DB0F9A" "EMP"@"SEL$1")
END_OUTLINE_DATA
*/

This feature can be disabled with the following parameter:

SQL> alter session set "_optimizer_eliminate_subquery"=false;
Session altered.

Or with the following hint:

SQL> select * from EMP where ename in (select /*+ NO_ELIMINATE_SQ */ ename from EMP);

Finally, here is what you can see in the CBO trace:

SQE: Trying SQ elimination.
SQE: Trying SQ elimination.
SQE: Query valid for SQ elimination:******* UNPARSED QUERY IS *******
SELECT "EMP"."EMPNO" "EMPNO","EMP"."ENAME" "ENAME","EMP"."JOB" "JOB","EMP"."MGR" "MGR","EMP"."HIREDATE" "HIREDATE","EMP"."SAL" "SAL","EMP"."COMM" "COMM","EMP"."DEPTNO" "DEPTNO" FROM "SCOTT"."EMP" "EMP" WHERE "EMP"."ENAME"=ANY (SELECT "EMP"."ENAME" "ENAME" FROM "SCOTT"."EMP" "EMP")
Registered qb: SEL$D0DB0F9A 0xe7035778 (SUBQUERY ELIMINATE SEL$1; SEL$2)
---------------------
QUERY BLOCK SIGNATURE
---------------------
signature (): qb_name=SEL$D0DB0F9A nbfros=1 flg=0
fro(0): flg=0 objn=73253 hint_alias="EMP"@"SEL$1"
 
SQE: Query after SQ elimination:******* UNPARSED QUERY IS *******
SELECT "EMP"."EMPNO" "EMPNO","EMP"."ENAME" "ENAME","EMP"."JOB" "JOB","EMP"."MGR" "MGR","EMP"."HIREDATE" "HIREDATE","EMP"."SAL" "SAL","EMP"."COMM" "COMM","EMP"."DEPTNO" "DEPTNO" FROM "SCOTT"."EMP" "EMP" WHERE 0=0 AND "EMP"."ENAME" IS NOT NULL

This example is simple and it is obvious that the SQL should be rewritten. But with large generated queries on complex views, this is the kind of pattern that can appear in the resulting query, and this transformation helps avoid unnecessary work.
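The validity of the rewrite is easy to reason about: any non-null ename necessarily finds itself in the subquery's result (it reads the same table with no predicate), while a NULL ename never matches anything under SQL's IN semantics, so the whole predicate reduces to ename IS NOT NULL. A quick Python check of that reasoning, using None to play SQL NULL:

```python
emp = [("SMITH",), ("ALLEN",), (None,)]      # ename column; None plays SQL NULL

# original: WHERE ename IN (SELECT ename FROM emp)
# SQL's IN is never true for NULL, so NULL rows are filtered out
subquery_values = [r[0] for r in emp]
original = [r for r in emp if r[0] is not None and r[0] in subquery_values]

# transformed: WHERE ename IS NOT NULL
transformed = [r for r in emp if r[0] is not None]

assert original == transformed == [("SMITH",), ("ALLEN",)]
```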

 

The article 12cR2 Subquery Elimination appeared first on Blog dbi services.

Bertrand Drouvot's picture

Visualize PostgreSQL index file with pgdfv

Introduction

In the previous blog post, pgdfv (PostgreSQL data file visualizer) was introduced. At that time the utility was able to display data files only; it is now able to display index files as well. If you are not familiar with PostgreSQL block internals, I would suggest reading Frits Hoogland's study in this series of blog posts.

The utility usage is:

$ ./pgdfv
-df     Path to a datafile (mandatory if indexfile is used)
-if     Path to an indexfile
-b      Block size (default 8192)

As you can see you can now specify an indexfile. In that case the following information will be displayed:

  • The percentage of free space within an index page
  • The percentage of current rows an index page refers to
  • The percentage of HOT redirect rows an index page refers to

It works for B-tree indexes and displays this information for the leaf blocks only.

As a picture is worth a thousand words, let’s see some examples.

Examples

Let’s create a table, an index, insert 4 rows:

pgdbv=# create table bdtable(id int not null, f varchar(30) );
CREATE TABLE
pgdbv=# insert into bdtable ( id, f ) values (1, 'aaaaaaaaaa'), (2, 'bbbbbbbbbb'), (3, 'cccccccccc'), (4, 'dddddddddd');
INSERT 0 4
pgdbv=# create index bdtindex on bdtable (f);
CREATE INDEX

and inspect the table and index content:

pgdbv=# select * from heap_page_items(get_raw_page('bdtable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   2489 |      0 |        0 | (0,1)  |           2 |       2050 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   2489 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |       | \x020000001762626262626262626262
  3 |   8072 |        1 |     39 |   2489 |      0 |        0 | (0,3)  |           2 |       2050 |     24 |        |       | \x030000001763636363636363636363
  4 |   8032 |        1 |     39 |   2489 |      0 |        0 | (0,4)  |           2 |       2050 |     24 |        |       | \x040000001764646464646464646464
(4 rows)

pgdbv=# select * from bt_page_items('bdtindex',1);
 itemoffset | ctid  | itemlen | nulls | vars |                      data
------------+-------+---------+-------+------+-------------------------------------------------
          1 | (0,1) |      24 | f     | t    | 17 61 61 61 61 61 61 61 61 61 61 00 00 00 00 00
          2 | (0,2) |      24 | f     | t    | 17 62 62 62 62 62 62 62 62 62 62 00 00 00 00 00
          3 | (0,3) |      24 | f     | t    | 17 63 63 63 63 63 63 63 63 63 63 00 00 00 00 00
          4 | (0,4) |      24 | f     | t    | 17 64 64 64 64 64 64 64 64 64 64 00 00 00 00 00
(4 rows)

In PostgreSQL, each table is stored in a separate file. When a table exceeds 1 GB, it is divided into gigabyte-sized segments.
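So a table slightly over 2 GB stored as base/16416/16497 would occupy the files 16497, 16497.1 and 16497.2. A small illustrative helper (not part of pgdfv) that computes which segment file a given block lands in, assuming the default 8 kB block size and 1 GB segments:

```python
BLOCK_SIZE = 8192                                  # default PostgreSQL block size
SEGMENT_SIZE = 1024 ** 3                           # 1 GB segment limit
BLOCKS_PER_SEGMENT = SEGMENT_SIZE // BLOCK_SIZE    # 131072 blocks per file

def segment_file(relfilenode, block_number):
    """Name of the segment file holding a given block of a relation."""
    seg = block_number // BLOCKS_PER_SEGMENT
    return str(relfilenode) if seg == 0 else f"{relfilenode}.{seg}"

# block 0 is in the base file; block 131072 starts the second segment
```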

Let’s check which file contains the table and which one contains the index:

pgdbv=# SELECT pg_relation_filepath('bdtable');
 pg_relation_filepath
----------------------
 base/16416/16497
(1 row)

pgdbv=# SELECT pg_relation_filepath('bdtindex');
 pg_relation_filepath
----------------------
 base/16416/16503
(1 row)

So we can use the utility on those files this way:

[screenshot: pgdfv output showing the data block and two index blocks]

So the data block has been displayed (same behavior as the previous blog post) and also 2 index blocks.

The index block display can be:

  • a letter or a question mark: M for the meta-page, R for the root page, and ? for a "non-root, non-meta, non-leaf" page.
  • a number: the percentage of HOT redirect rows the index page refers to (instead of unused rows, as is the case for a data block).

The number is displayed only for a leaf block.

So, one leaf index block has been detected with more than 75% of free space (hence the green color); more than 50% of the rows the index refers to are current (100% in our case, as t_xmax = 0 for all rows); and HOT redirects are below 10% (so 0 is displayed; 0% in our case, as there are no rows with lp_flags = 2).
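As a rough sketch of the display rule described above (my reading of it, not pgdfv's actual code), the digit shown for a leaf block is simply the tens bucket of the HOT-redirect percentage:

```python
def leaf_block_digit(hot_redirect_pct):
    """Digit displayed for a leaf block: the tens bucket of the
    HOT-redirect percentage (0-9% -> 0, 50-59% -> 5, 100% -> 9)."""
    return min(int(hot_redirect_pct) // 10, 9)

# the example pages in this post: 0% HOT redirects -> 0, 50% -> 5
```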

Let’s update 3 rows: one update on the indexed column and 2 updates on the non indexed column:

pgdbv=# update bdtable set f='aqwzsxedc' where id =1;
UPDATE 1
pgdbv=# update bdtable set id=id+10 where id in (2,3);
UPDATE 2
pgdbv=# checkpoint;
CHECKPOINT
pgdbv=# select * from heap_page_items(get_raw_page('bdtable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |   8152 |        1 |     39 |   2489 |   2490 |        0 | (0,5)  |           2 |       1282 |     24 |        |       | \x010000001761616161616161616161
  2 |   8112 |        1 |     39 |   2489 |   2491 |        0 | (0,6)  |       16386 |        258 |     24 |        |       | \x020000001762626262626262626262
  3 |   8072 |        1 |     39 |   2489 |   2491 |        0 | (0,7)  |       16386 |        258 |     24 |        |       | \x030000001763636363636363636363
  4 |   8032 |        1 |     39 |   2489 |      0 |        0 | (0,4)  |           2 |       2306 |     24 |        |       | \x040000001764646464646464646464
  5 |   7992 |        1 |     38 |   2490 |      0 |        0 | (0,5)  |           2 |      10498 |     24 |        |       | \x01000000156171777a7378656463
  6 |   7952 |        1 |     39 |   2491 |      0 |        0 | (0,6)  |       32770 |      10242 |     24 |        |       | \x0c0000001762626262626262626262
  7 |   7912 |        1 |     39 |   2491 |      0 |        0 | (0,7)  |       32770 |      10242 |     24 |        |       | \x0d0000001763636363636363636363
(7 rows)

pgdbv=# select * from bt_page_items('bdtindex',1);
 itemoffset | ctid  | itemlen | nulls | vars |                      data
------------+-------+---------+-------+------+-------------------------------------------------
          1 | (0,1) |      24 | f     | t    | 17 61 61 61 61 61 61 61 61 61 61 00 00 00 00 00
          2 | (0,5) |      24 | f     | t    | 15 61 71 77 7a 73 78 65 64 63 00 00 00 00 00 00
          3 | (0,2) |      24 | f     | t    | 17 62 62 62 62 62 62 62 62 62 62 00 00 00 00 00
          4 | (0,3) |      24 | f     | t    | 17 63 63 63 63 63 63 63 63 63 63 00 00 00 00 00
          5 | (0,4) |      24 | f     | t    | 17 64 64 64 64 64 64 64 64 64 64 00 00 00 00 00
(5 rows)

And launch the tool again:

[screenshot: pgdfv output after the updates]

As you can see, we still have more than 75% of free space in the leaf index block, but the color has changed (because now less than 50% of the rows the index refers to are current, i.e. have t_xmax = 0), and HOT redirects are still below 10% (0 is displayed).

Let’s vacuum the table:

pgdbv=# vacuum  bdtable;
VACUUM
pgdbv=# checkpoint;
CHECKPOINT
pgdbv=# select * from heap_page_items(get_raw_page('bdtable',0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |              t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------------
  1 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       |
  2 |      6 |        2 |      0 |        |        |          |        |             |            |        |        |       |
  3 |      7 |        2 |      0 |        |        |          |        |             |            |        |        |       |
  4 |   8152 |        1 |     39 |   2489 |      0 |        0 | (0,4)  |           2 |       2306 |     24 |        |       | \x040000001764646464646464646464
  5 |   8112 |        1 |     38 |   2490 |      0 |        0 | (0,5)  |           2 |      10498 |     24 |        |       | \x01000000156171777a7378656463
  6 |   8072 |        1 |     39 |   2491 |      0 |        0 | (0,6)  |       32770 |      10498 |     24 |        |       | \x0c0000001762626262626262626262
  7 |   8032 |        1 |     39 |   2491 |      0 |        0 | (0,7)  |       32770 |      10498 |     24 |        |       | \x0d0000001763636363636363636363
(7 rows)

pgdbv=# select * from bt_page_items('bdtindex',1);
 itemoffset | ctid  | itemlen | nulls | vars |                      data
------------+-------+---------+-------+------+-------------------------------------------------
          1 | (0,5) |      24 | f     | t    | 15 61 71 77 7a 73 78 65 64 63 00 00 00 00 00 00
          2 | (0,2) |      24 | f     | t    | 17 62 62 62 62 62 62 62 62 62 62 00 00 00 00 00
          3 | (0,3) |      24 | f     | t    | 17 63 63 63 63 63 63 63 63 63 63 00 00 00 00 00
          4 | (0,4) |      24 | f     | t    | 17 64 64 64 64 64 64 64 64 64 64 00 00 00 00 00
(4 rows)

and launch the utility:

[screenshot: pgdfv output after the vacuum]

As you can see, we still have more than 75% of free space, and now more than 50% of the rows the index refers to are current. The index refers to between 50 and 60% of HOT redirect rows (so 5 is displayed): ctids (0,2) and (0,3), so 2 rows out of the 4 the index refers to.

The legend and summary are dynamic and depend on the contents of the scanned blocks (data and index).

For example on a table made of 757 blocks, you could end up with something like:

[screenshot: pgdfv output for a 757-block table]

The same table, once VACUUM FULL has been run, would produce:

[screenshot: pgdfv output for the same table after VACUUM FULL]

Remarks

  • The tool is available in this repository.
  • The tool is for B-tree indexes only.
  • The tool is 100% inspired by Frits Hoogland's blogpost series and by odbv (written by Kamil Stawiarski), which can be used to visualize Oracle database blocks.

Conclusion

pgdfv can now be used to visualize PostgreSQL data and index file pages.

Kamil Stawiarski's picture

ODBVv2 – ghostdata busters

Some time ago I wrote a simple tool to learn about Oracle data block internals – ODBV.
The series of articles can be found here: http://blog.ora-600.pl/?s=odbv&submit= and the github repo is here: https://github.com/ora600pl/odbv

This is not a production tool, but during my last session in Birmingham at UKOUG_TECH17 – where I was doing a presentation using this tool – I came to the conclusion that, with a little bit of work, it could be used to trace ghost data in a database.

What is ghost data? It is very simple: each time we delete, truncate or move something, the Oracle database does not remove the data from our datafile – the blocks are "marked" for reuse and are no longer associated with any logical object in the database, but our data is still there.

These days we have to think about security a lot – that's why, when you move a table to an encrypted tablespace, you may want to be sure that you leave no trace of important data behind.

I present v2 of ODBV (written in Python 3) – this is a demo version and it still requires a lot of work to be actually useful, but for now it works like this:

ora-600:odbv2 inter$ python3.6 odbv2.py
ODBV v2 by Kamil Stawiarski (@ora600pl | www.ora-600.pl)
ODBV v2 requires cx_Oralce - you pip to install it


Enter interactive mode: $ python3 odbv2.py -i
Or do a batch processing: $ python3 -f file_with_path_to_dbfs -c user/pass@ip:port/service -o file.html


Usage for interactive mode:
	to add a datafile for parsing and visualizing:
		ODBV> add file <>
	to generate html visualization report:
		ODBV> make html file_name.html
	to connect to Oracle for dictionary data:
		get dict user/password@ip:port/service

	Default block size is 8192 - to change it to N:
		ODBV> set blocksize N

As you can see, you can use this tool in interactive or batch mode. It requires the cx_Oracle package for Python 3 to connect to an Oracle database – this is necessary to get the DBA_OBJECTS contents, in order to map blocks to existing objects and to show the remaining blocks that contain TABLE DATA or INDEX DATA as ghost blocks.
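The ghost-block classification can be sketched as follows. This is a simplification with invented names, not odbv2's actual code: it assumes each block's header yields a data object id and a block type, and that the set of live ids comes from DBA_OBJECTS:

```python
def classify_blocks(blocks, live_object_ids):
    """blocks: iterable of (block_number, data_object_id, block_type).

    Table or index blocks whose object id is no longer known to the
    dictionary are 'ghost data': still readable on disk, but not
    associated with any logical object."""
    out = []
    for blkno, objd, btype in blocks:
        if btype not in ("TABLE DATA", "INDEX DATA"):
            out.append((blkno, "other"))
        elif objd in live_object_ids:
            out.append((blkno, "live"))
        else:
            out.append((blkno, "ghost"))
    return out
```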

So let’s check how it works.

First of all let’s create a secured tablespace, encrypted with AES256:

SQL> administer key management create keystore '/u01/wallet' identified by "Dupiszcze";

keystore altered.

SQL> administer key management set keystore open identified by "Dupiszcze";

keystore altered.

SQL> administer key management set key identified by "Dupiszcze" with backup;

keystore altered.

SQL> create tablespace secure_data
  2  datafile size 1m
  3  autoextend on next 1m
  4  maxsize 1g
  5  encryption using 'AES256'
  6  default storage (encrypt);

Tablespace created.

I have 2 tables in my tablespace TBS_DUPA which is not encrypted:

SQL> select table_name
  2  from dba_tables
  3  where tablespace_name='TBS_DUPA';

TABLE_NAME
--------------------------------------------------------------------------------
EMPLOYEES_SEC
EMPLOYEES_TMP2

Now let’s move table EMPLOYEES_SEC to a new, highly secure tablespace

davidkurtz's picture

nVision Performance Tuning 12: Hinting nVision with SQL Profiles

This blog post is part of a series that discusses how to get optimal performance from PeopleSoft nVision reporting as used in General Ledger.  It is a PeopleSoft specific version of a posting on my Oracle blog.

As I explained earlier in this series, it is not possible to add hints to nVision.  The dynamic nature of the nVision SQL means that it is not possible to use SQL Patches.  nVision SQL statements contain literal values and never use bind variables.  When dynamic selectors are used, the SELECTOR_NUM will be different for every execution. A SQL_ID found in one report will not be seen again in another report.  Even static selector numbers will change after the tree is updated or when a new tree is created.
It is possible to use SQL Profiles to introduce hints because they can optionally match the force match signature of a SQL.  SQL statements that differ only in the literal values they contain will have different SQL IDs but will have the same force matching signature.  Although you will still have a lot of force matching signatures, you should find that you have far fewer force matching signatures than SQL_IDs.   Picking out the signatures that account for the most elapsed execution time and creating profiles for them is manageable.
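The idea behind force matching can be illustrated by normalising literals away before hashing, which is roughly what CURSOR_SHARING=FORCE does to the SQL text. This sketch only mimics the concept; Oracle's actual signature algorithm is internal, and the function name is invented:

```python
import hashlib
import re

def fake_force_matching_signature(sql):
    """Hash the SQL text after replacing literals with a placeholder,
    so statements differing only in literals collide on purpose."""
    norm = re.sub(r"'[^']*'", ":lit", sql)      # string literals
    norm = re.sub(r"\b\d+\b", ":lit", norm)     # numeric literals
    return hashlib.sha1(norm.upper().encode()).hexdigest()

a = "SELECT ... WHERE L2.SELECTOR_NUM=10228 AND A.LEDGER='S_USMGT'"
b = "SELECT ... WHERE L2.SELECTOR_NUM=10231 AND A.LEDGER='ACTUALS'"
# different SQL_IDs in real life, but one force matching signature
```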
Note: SQL Profiles require the Tuning Pack to be licensed.
As far as is possible, good nVision performance should be achieved by setting appropriate tree performance options at tree level.  These are global settings.  You may find that a particular setting on a particular tree is not optimal for all reports.  You may then choose to override the tree-level setting in specific layouts.  You may also find that you still need hints to control execution plans.
In particular, parallel query can be an effective tactic in nVision performance tuning.  However, you should not put a degree of parallelism on PS_LEDGER or PS_LEDGER_BUDG, because that would invoke parallelism in many other processes.  I have found that even putting a degree of parallelism on a summary ledger table can easily result in too many concurrent parallel queries.  On OLTP systems, such as PeopleSoft, I recommend that parallelism be used sparingly and in a highly controlled and targeted fashion.

Example

Let's take the following nVision query as an example.

SELECT L2.TREE_NODE_NUM,L3.TREE_NODE_NUM,SUM(A.POSTED_TOTAL_AMT) 
FROM PS_XX_SUM_CONSOL_VW A, PSTREESELECT05 L2, PSTREESELECT10 L3
WHERE A.LEDGER='S_USMGT'
AND A.FISCAL_YEAR=2017
AND A.ACCOUNTING_PERIOD BETWEEN 0 AND 12
AND (A.DEPTID BETWEEN 'A0000' AND 'A8999' OR A.DEPTID BETWEEN 'B0000' AND 'B9149'
OR A.DEPTID='B9156' OR A.DEPTID='B9158' OR A.DEPTID BETWEEN 'B9165' AND 'B9999'
OR A.DEPTID BETWEEN 'C0000' AND 'C9999' OR A.DEPTID BETWEEN 'D0000' AND 'D9999'
OR A.DEPTID BETWEEN 'G0000' AND 'G9999' OR A.DEPTID BETWEEN 'H0000' AND 'H9999'
OR A.DEPTID='B9150' OR A.DEPTID=' ')
AND L2.SELECTOR_NUM=10228
AND A.BUSINESS_UNIT=L2.RANGE_FROM_05
AND L3.SELECTOR_NUM=10231
AND A.ACCOUNT=L3.RANGE_FROM_10
AND A.CHARTFIELD1='0012345'
AND A.CURRENCY_CD='GBP'
GROUP BY L2.TREE_NODE_NUM,L3.TREE_NODE_NUM
/

We can tell from the equality join conditions that the two selectors still present in the statement are dynamic selectors.
A third selector, on DEPTID, has been suppressed with the 'use literal values' performance option.  The number of DEPTID predicates in the statement will depend on the tree and the node selected for the report.  Note that if these change, the statement will no longer force-match the same profile.  SQL profiles might suddenly cease to work due to a tree or selection criteria change.
This is the plan I get initially, without a profile. It doesn't perform well.

Plan hash value: 808840077
-----------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 10408 (100)| | | |
| 1 | HASH GROUP BY | | 517 | 50666 | 10408 (1)| 00:00:01 | | |
| 2 | HASH JOIN | | 517 | 50666 | 10407 (1)| 00:00:01 | | |
| 3 | PARTITION RANGE SINGLE | | 731 | 13158 | 3 (0)| 00:00:01 | 10228 | 10228 |
| 4 | INDEX FAST FULL SCAN | PSAPSTREESELECT05 | 731 | 13158 | 3 (0)| 00:00:01 | 10228 | 10228 |
| 5 | HASH JOIN | | 518 | 41440 | 10404 (1)| 00:00:01 | | |
| 6 | PARTITION RANGE SINGLE | | 249 | 5727 | 2 (0)| 00:00:01 | 10231 | 10231 |
| 7 | INDEX FAST FULL SCAN | PSAPSTREESELECT10 | 249 | 5727 | 2 (0)| 00:00:01 | 10231 | 10231 |
| 8 | PARTITION RANGE ITERATOR | | 7785 | 433K| 10402 (1)| 00:00:01 | 28 | 40 |
| 9 | TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| PS_X_LEDGER_ACCTS | 7785 | 433K| 10402 (1)| 00:00:01 | 28 | 40 |
| 10 | SORT CLUSTER BY ROWID BATCHED | | 5373 | | 5177 (1)| 00:00:01 | | |
| 11 | INDEX SKIP SCAN | PS_X_LEDGER_ACCTS | 5373 | | 5177 (1)| 00:00:01 | 28 | 40 |
-----------------------------------------------------------------------------------------------------------------------------------

These are the hints I want to introduce (on Oracle 12c).

SELECT /*+OPT_PARAM('parallel_degree_policy','AUTO') OPT_PARAM('parallel_min_time_threshold',2) 
OPT_PARAM('parallel_degree_limit',4) REWRITE PX_JOIN_FILTER(PS_XX_SUM_GCNSL_MV)*/…
  • Use automatic parallel degree, statement queuing, and in-memory parallel execution.
  • Invoke parallelism if the statement is estimated to run for at least 2 seconds.
  • However, also limit the automatic parallelism to a degree of 4.
  • Force materialized view rewrite.
  • Use a Bloom filter when joining to the materialized view.

I have created a data-driven framework to create the profiles. I have created a working storage table to hold details of each force matching signature for which I want to create a profile.

CREATE TABLE dmk_fms_profiles
(force_matching_signature NUMBER NOT NULL
,sql_id VARCHAR2(13)
,plan_hash_value NUMBER
,module VARCHAR2(64)
,action VARCHAR2(64)
,report_id VARCHAR2(32) /*Application Specific*/
,tree_list CLOB /*Application Specific*/
,sql_profile_name VARCHAR2(30)
,parallel_min_time_threshold NUMBER
,parallel_degree_limit NUMBER
,other_hints CLOB
,delete_profile VARCHAR2(1)
,sql_text CLOB
,CONSTRAINT dmk_fms_profiles_pk PRIMARY KEY (force_matching_signature)
,CONSTRAINT dmk_fms_profiles_u1 UNIQUE (sql_id)
,CONSTRAINT dmk_fms_profiles_u2 UNIQUE (sql_profile_name)
)
/

Using conditional parallelism with PARALLEL_MIN_TIME_THRESHOLD, limited with PARALLEL_DEGREE_LIMIT, is an effective tactic with nVision, so I have specified columns in the metadata table for those hints; other hints are injected via a string. I identified the problematic SQL by analysis with ASH, from which I also obtained the FORCE_MATCHING_SIGNATURE. The metadata is keyed by FORCE_MATCHING_SIGNATURE, and I have specified a meaningful name for each SQL profile.

INSERT INTO dmk_fms_profiles (force_matching_signature, parallel_min_time_threshold, parallel_degree_limit, other_hints, sql_profile_name) 
VALUES (16625752171077499412, 1, 4, 'REWRITE PX_JOIN_FILTER(PS_XX_SUM_GCNSL_MV)', 'NVS_GBGL123I_BU_CONSOL_ACCOUNT');
COMMIT;

Profiles are created using the text of a SQL rather than the SQL_ID or FORCE_MATCHING_SIGNATURE directly. Therefore the SQL_TEXT must be extracted from the AWR, so this method also requires that the SQL statement has been captured by an AWR snapshot.

UPDATE dmk_fms_profiles a
SET (module, action, sql_id, plan_hash_value, sql_text)
= (SELECT s.module, s.action, s.sql_id, s.plan_hash_value, t.sql_text
FROM dba_hist_sqlstat s
, dba_hist_sqltext t
WHERE t.dbid = s.dbid
AND t.sql_id = s.sql_id
AND s.force_matching_signature = a.force_matching_signature
AND s.snap_id = (
SELECT MAX(s1.snap_id)
FROM dba_hist_sqlstat s1
WHERE s1.force_matching_signature = a.force_matching_signature
AND s1.module = 'RPTBOOK' /*Application Specific*/
AND s1.action LIKE 'PI=%:%:%' /*Application Specific*/)
AND s.module = 'RPTBOOK' /*Application Specific*/
AND s.action LIKE 'PI=%:%:%' /*Application Specific*/
AND ROWNUM = 1)
WHERE sql_id IS NULL
/

MERGE INTO dmk_fms_profiles u
USING (
SELECT a.sql_id, a.force_matching_signature, p.name
FROM dmk_fms_profiles a
, dba_sql_profiles p
WHERE p.signature = a.force_matching_signature
) s
ON (s.force_matching_signature = u.force_matching_signature)
WHEN MATCHED THEN UPDATE
SET u.sql_profile_name = s.name
/

Columns REPORT_ID and TREE_LIST contain application specific information extracted from the application instrumentation and tree selector logging.

/*Application Specific - extract report ID from ACTION*/
UPDATE dmk_fms_profiles a
SET report_id = substr(regexp_substr(s.action,':([A-Za-z0-9_-])+',1,1),2)
WHERE report_id IS NULL
AND action IS NOT NULL
/
/*Application Specific - extract financial analysis tree from application logging*/
UPDATE dmk_fms_profiles a
SET tree_list =
(SELECT LISTAGG(tree_name,', ') WITHIN GROUP (ORDER BY tree_name)
FROM (select l.tree_name, MAX(l.length) length
FROM dba_hist_sql_plan p
, ps_nvs_treeslctlog l
WHERE p.plan_hash_value = a.plan_hash_value
AND p.sql_id = a.sql_id
AND p.object_name like 'PS%TREESELECT__'
AND p.partition_start = p.partition_stop
AND p.partition_start = l.selector_num
AND l.tree_name != ' '
GROUP BY l.tree_name)
)
WHERE tree_list IS NULL
/

Now I can produce a simple report of the metadata in order to see what profiles should be created.
column sql_text word_wrapped on format a110
column module format a8
column report_id heading 'nVision|Report ID'
column tree_list word_wrapped on format a20
column plan_hash_value heading 'SQL Plan|Hash Value' format 9999999999
column parallel_min_time_threshold heading 'Parallel|Min Time|Threshold' format 999
column parallel_degree_limit heading 'Parallel|Degree|Limit' format 999
set long 500
SELECT * FROM dmk_fms_profiles
/

SQL Plan
FORCE_MATCHING_SIGNATURE SQL_ID Hash Value MODULE ACTION
------------------------ ------------- ----------- -------- ----------------------------------------------------------------
Parallel Parallel
nVision Min Time Degree
Report ID TREE_LIST SQL_PROFILE_NAME Threshold Limit D
-------------------------------- -------------------- ------------------------------ --------- -------- -
OTHER_HINTS
--------------------------------------------------------------------------------
SQL_TEXT
--------------------------------------------------------------------------------------------------------------
12803175998948432502 5pzxhha3392cs 988048519 RPTBOOK PI=3186222:USGL233I:10008
USGL233I BU_GAAP_CONSOL, NVS_GBGL123I_BU_CONSOL_ACCOUNT 1 4
GAAP_ACCOUNT
REWRITE PX_JOIN_FILTER(PS_XX_SUM_GCNSL_MV)
SELECT L2.TREE_NODE_NUM,A.ACCOUNT,SUM(A.POSTED_TOTAL_AMT) FROM PS_LEDGER A, PSTREESELECT05 L2, PSTREESELECT10 L3
WHERE A.LEDGER='S_GBMGT' AND A.FISCAL_YEAR=2017 AND A.ACCOUNTING_PERIOD BETWEEN 0 AND 12 AND (A.DEPTID BETWEEN
'A0000' AND 'A8999' OR A.DEPTID BETWEEN 'B0000' AND 'B9149' OR A.DEPTID='B9156' OR A.DEPTID='B9158' OR A.DEPTID
BETWEEN 'B9165' AND 'B9999' OR A.DEPTID BETWEEN 'C0000' AND 'C9999' OR A.DEPTID BETWEEN 'D0000' AND 'D9999' OR
A.DEPTID BETWEEN 'G0000' AND 'G9999' OR A.DE

Next, this PL/SQL block will create or recreate SQL profiles from the metadata. The various hints can be concatenated into a single string and passed as a parameter to SQLPROF_ATTR. The SQL text is passed as a parameter when the profile is created.

set serveroutput on
DECLARE
l_signature NUMBER;
h SYS.SQLPROF_ATTR;
e_no_sql_profile EXCEPTION;
PRAGMA EXCEPTION_INIT(e_no_sql_profile, -13833);
l_description CLOB;
BEGIN

FOR i IN (
SELECT f.*, s.name
FROM dmk_fms_profiles f
LEFT OUTER JOIN dba_sql_profiles s
ON f.force_matching_signature = s.signature
) LOOP

BEGIN
IF i.name IS NOT NULL AND i.delete_profile = 'Y' THEN
dbms_sqltune.drop_sql_profile(name => i.name);
END IF;
EXCEPTION WHEN e_no_sql_profile THEN NULL;
END;

IF i.delete_profile = 'Y' THEN
NULL;
ELSIF i.sql_text IS NOT NULL THEN
h := SYS.SQLPROF_ATTR(
q'[BEGIN_OUTLINE_DATA]',
CASE WHEN i.parallel_min_time_threshold>=0 THEN 'OPT_PARAM(''parallel_degree_policy'',''AUTO'') ' END||
CASE WHEN i.parallel_degree_limit >=0 THEN 'OPT_PARAM(''parallel_degree_limit'',' ||i.parallel_degree_limit ||') ' END||
CASE WHEN i.parallel_min_time_threshold>=0 THEN 'OPT_PARAM(''parallel_min_time_threshold'','||i.parallel_min_time_threshold||') ' END||
i.other_hints,
q'[END_OUTLINE_DATA]');

l_signature := DBMS_SQLTUNE.SQLTEXT_TO_SIGNATURE(i.sql_text);
l_description := 'coe nVision '||i.report_id||' '||i.tree_list||' '||i.force_matching_signature||'='||l_signature;
dbms_output.put_line(i.sql_profile_name||' '||l_description);

DBMS_SQLTUNE.IMPORT_SQL_PROFILE (
sql_text => i.sql_text,
profile => h,
name => i.sql_profile_name,
description => l_description,
category => 'DEFAULT',
validate => TRUE,
replace => TRUE,
force_match => TRUE /* TRUE:FORCE (match even when different literals in SQL). FALSE:EXACT (similar to CURSOR_SHARING) */ );

END IF;
END LOOP;
END;
/
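Incidentally, the signature that SQLTEXT_TO_SIGNATURE computes from the stored SQL text should agree with the FORCE_MATCHING_SIGNATURE captured from AWR; any mismatch would mean the profile cannot match the application SQL. A minimal sanity-check sketch, assuming the dmk_fms_profiles table above:

set serveroutput on
DECLARE
  l_sig NUMBER;
BEGIN
  FOR i IN (SELECT sql_id, force_matching_signature, sql_text
            FROM   dmk_fms_profiles
            WHERE  sql_text IS NOT NULL) LOOP
    /*second parameter TRUE requests the force matching signature*/
    l_sig := dbms_sqltune.sqltext_to_signature(i.sql_text, TRUE);
    IF l_sig != i.force_matching_signature THEN
      dbms_output.put_line(i.sql_id||': AWR '||i.force_matching_signature
                                   ||' != computed '||l_sig);
    END IF;
  END LOOP;
END;
/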

I can verify that the profile has been created, and the hints that it contains, thus:

SELECT profile_name,
xmltype(comp_data) as xmlval
FROM dmk_fms_profiles p
, dbmshsxp_sql_profile_attr x
WHERE x.profile_name = p.sql_profile_name
AND p.status = 'ENABLED'
ORDER BY 1
/

PROFILE_NAME
------------------------------
XMLVAL
------------------------------------------------------------------------------------------------
NVS_GBGL123I_BU_CONSOL_ACCOUNT


And now when the application runs, I get the plan that I wanted.

  • The query runs in parallel.
  • The SQL is rewritten to use the materialized view.
  • There are no indexes on the materialized view, so it must full scan it.
  • It generates a bloom filter from PSTREESELECT10 and applies it to the materialized view.
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop | TQ |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 2219 (100)| | | | | | |
| 1 | PX COORDINATOR | | | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10004 | 111 | 9879 | 2219 (6)| 00:00:01 | | | Q1,04 | P->S | QC (RAND) |
| 3 | HASH GROUP BY | | 111 | 9879 | 2219 (6)| 00:00:01 | | | Q1,04 | PCWP | |
| 4 | PX RECEIVE | | 111 | 9879 | 2219 (6)| 00:00:01 | | | Q1,04 | PCWP | |
| 5 | PX SEND HASH | :TQ10003 | 111 | 9879 | 2219 (6)| 00:00:01 | | | Q1,03 | P->P | HASH |
| 6 | HASH GROUP BY | | 111 | 9879 | 2219 (6)| 00:00:01 | | | Q1,03 | PCWP | |
| 7 | HASH JOIN | | 536 | 47704 | 2218 (6)| 00:00:01 | | | Q1,03 | PCWP | |
| 8 | PX RECEIVE | | 536 | 38056 | 2215 (6)| 00:00:01 | | | Q1,03 | PCWP | |
| 9 | PX SEND HYBRID HASH | :TQ10002 | 536 | 38056 | 2215 (6)| 00:00:01 | | | Q1,02 | P->P | HYBRID HASH|
| 10 | STATISTICS COLLECTOR | | | | | | | | Q1,02 | PCWC | |
| 11 | HASH JOIN | | 536 | 38056 | 2215 (6)| 00:00:01 | | | Q1,02 | PCWP | |
| 12 | BUFFER SORT | | | | | | | | Q1,02 | PCWC | |
| 13 | JOIN FILTER CREATE | :BF0000 | 236 | 3776 | 2 (0)| 00:00:01 | | | Q1,02 | PCWP | |
| 14 | PX RECEIVE | | 236 | 3776 | 2 (0)| 00:00:01 | | | Q1,02 | PCWP | |
| 15 | PX SEND BROADCAST | :TQ10000 | 236 | 3776 | 2 (0)| 00:00:01 | | | | S->P | BROADCAST |
| 16 | PARTITION RANGE SINGLE | | 236 | 3776 | 2 (0)| 00:00:01 | 36774 | 36774 | | | |
| 17 | INDEX FAST FULL SCAN | PSAPSTREESELECT10 | 236 | 3776 | 2 (0)| 00:00:01 | 36774 | 36774 | | | |
| 18 | JOIN FILTER USE | :BF0000 | 8859 | 475K| 2213 (6)| 00:00:01 | | | Q1,02 | PCWP | |
| 19 | PX BLOCK ITERATOR | | 8859 | 475K| 2213 (6)| 00:00:01 | 29 | 41 | Q1,02 | PCWC | |
| 20 | MAT_VIEW REWRITE ACCESS STORAGE FULL| PS_XX_SUM_GCNSL_MV | 8859 | 475K| 2213 (6)| 00:00:01 | 29 | 41 | Q1,02 | PCWP | |
| 21 | BUFFER SORT | | | | | | | | Q1,03 | PCWC | |
| 22 | PX RECEIVE | | 731 | 13158 | 3 (0)| 00:00:01 | | | Q1,03 | PCWP | |
| 23 | PX SEND HYBRID HASH | :TQ10001 | 731 | 13158 | 3 (0)| 00:00:01 | | | | S->P | HYBRID HASH|
| 24 | PARTITION RANGE SINGLE | | 731 | 13158 | 3 (0)| 00:00:01 | 36773 | 36773 | | | |
| 25 | INDEX FAST FULL SCAN | PSAPSTREESELECT05 | 731 | 13158 | 3 (0)| 00:00:01 | 36773 | 36773 | | | |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Conclusion

SQL Profiles can be used in much the same way as SQL Patches to introduce hints into application SQL without changing the code, the difference being that SQL Profiles can force-match SQL.  However, SQL Profiles do require the Tuning Pack to be licensed, whereas SQL Patches and Baselines do not.
Applying force-matching SQL profiles to nVision is an effective, though reactive, tactic.  Tree changes can alter the number of literal criteria in nVision SQL statements, which may then cease to match existing profiles.  nVision will always require ongoing monitoring and the introduction of new profiles.

martin.bach's picture

Little things worth knowing: redo transport in Data Guard 12.2 part 2

In the first part of this article I looked at a number of views and some netstat output to show how redo is transported from the primary database to its standby systems. The long story short is that TT02 (“async ORL multi”) was found sending redo to CDB3 asynchronously, whilst NSS2 (“sync”) transferred redo to the synchronised target, CDB2. Contrary to what v$dataguard_process wanted me to believe, it really wasn’t LGWR sending redo over the network.

In this little article I would like to show you how the standby databases CDB2 and CDB3 receive redo and how you can map this back to the primary database, closing the loop.

How does CDB2 receive redo?

First I’m looking at CDB2, which receives redo in synchronous mode. I should be able to narrow down the communication between the primary and the standby by referring to the LGWR and TT02 process IDs in the CLIENT_PID column on the standby. As a quick reminder, 14986 is the PID for LGWR, 15029 belongs to NSS2, and 15252 maps to TT02. Let’s try:

SQL> select db_unique_name,database_role from v$database;

DB_UNIQUE_NAME                 DATABASE_ROLE
------------------------------ ----------------
CDB2                           PHYSICAL STANDBY

SQL> select name, pid, role, action, client_pid, client_role, sequence#, block#, dest_id
  2  from v$dataguard_process where client_pid in (14986,15029,15252);

NAME  PID        ROLE                    ACTION       CLIENT_PID CLIENT_ROLE       SEQUENCE#     BLOCK#    DEST_ID
----- ---------- ----------------------- ------------ ---------- ---------------- ---------- ---------- ----------
rfs   5517       RFS sync                IDLE              14986 log writer               95     174565          0

So it would appear the process responsible for shipping redo to “SYNC” destinations is the log writer. Actually, the output of v$dataguard_process is quite interesting, which is why I’m adding it here for the sake of completeness:

SQL> select name, pid, role, action, client_pid, client_role, sequence#, block#, dest_id
  2  from v$dataguard_process order by action;

NAME  PID        ROLE                    ACTION       CLIENT_PID CLIENT_ROLE       SEQUENCE#     BLOCK#    DEST_ID
----- ---------- ----------------------- ------------ ---------- ---------------- ---------- ---------- ----------
rfs   5517       RFS sync                IDLE              14986 log writer               95     229446          0
rfs   5350       RFS archive             IDLE              15224 archive gap               0          0          1
rfs   5346       RFS ping                IDLE              15124 gap manager              95          0          0
rfs   5354       RFS archive             IDLE              15233 archive gap               0          0          1
MRP0  5348       managed recovery        IDLE                  0 none                      0          0          0
rfs   5352       RFS archive             IDLE              15240 archive gap               0          0          1
LGWR  5207       log writer              IDLE                  0 none                      0          0          0
TT01  5259       redo transport timer    IDLE                  0 none                      0          0          0
TT00  5255       gap manager             IDLE                  0 none                      0          0          0
ARC1  5263       archive redo            IDLE                  0 none                      0          0          0
ARC2  5265       archive redo            IDLE                  0 none                      0          0          0
ARC3  5267       archive redo            IDLE                  0 none                      0          0          0
TMON  5242       redo transport monitor  IDLE                  0 none                      0          0          0
ARC0  5257       archive local           IDLE                  0 none                      0          0          0

14 rows selected.

This view tells me that LGWR is attached to the RFS sync process. But by now I know better than that, and it is similar to what I saw on the primary. Looking a little closer, I can see that, strictly speaking, the RFS process is connected to NSS2:

[root@server2 ~]# netstat -tunalp | egrep 'Active|Proto|5517'
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp6       0      0 192.168.100.22:1521     192.168.100.21:15515    ESTABLISHED 5517/oracleCDB2   

I am repeating the values for the primary here so you don’t have to go back to the previous article:

[root@server1 ~]# ps -ef | egrep 'lgwr|lg0|nss'
oracle   14986     1  0 09:58 ?        00:01:19 ora_lgwr_CDB1
oracle   14990     1  0 09:58 ?        00:00:00 ora_lg00_CDB1
oracle   14994     1  0 09:58 ?        00:00:00 ora_lg01_CDB1
oracle   15029     1  0 09:58 ?        00:00:43 ora_nss2_CDB1

[root@server1 ~]# netstat -vW --numeric-ports -ee -p -la --tcp | egrep '14986|14990|14994|15029'
tcp        0      0 server1.example.com:15515 server2.example.com:1521 ESTABLISHED oracle     16400768   15029/ora_nss2_CDB1

You will notice that port 15515 on server1 belongs to ora_nss2_CDB1.
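This kind of port-to-process lookup can be scripted. A minimal awk sketch, run here against a captured sample line rather than live netstat output (the field positions assume the netstat format shown above; the port and process names are from this lab):

```shell
# a sketch: extract the owning process for a given local port from a
# netstat-style line (sample reproduced from the output above)
sample='tcp        0      0 server1.example.com:15515 server2.example.com:1521 ESTABLISHED oracle     16400768   15029/ora_nss2_CDB1'
echo "$sample" | awk -v p=":15515$" '$4 ~ p {print $NF}'
# prints 15029/ora_nss2_CDB1
```

In practice you would pipe real `netstat -tnp` output through the same awk filter instead of the sample variable.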

Going back a little to v$dataguard_process, it seems a bit odd to see MRP0 as “idle” when the database is in managed recovery mode using real-time apply. Trying something else, I query v$managed_standby and voilà: MRP0 is said to be applying logs:

SQL> select process,pid,status,client_process,client_pid,sequence#,block# 
  2  from v$managed_standby order by status;

PROCESS   PID        STATUS       CLIENT_P CLIENT_PID                                SEQUENCE#     BLOCK#
--------- ---------- ------------ -------- ---------------------------------------- ---------- ----------
DGRD      5255       ALLOCATED    N/A      N/A                                               0          0
DGRD      5259       ALLOCATED    N/A      N/A                                               0          0
MRP0      5348       APPLYING_LOG N/A      N/A                                              95     246625
ARCH      5257       CLOSING      ARCH     5257                                             92       4096
ARCH      5263       CLOSING      ARCH     5263                                             93       2048
ARCH      5265       CLOSING      ARCH     5265                                             94     342016
ARCH      5267       CONNECTED    ARCH     5267                                              0          0
RFS       5350       IDLE         UNKNOWN  15224                                             0          0
RFS       5354       IDLE         UNKNOWN  15233                                             0          0
RFS       5352       IDLE         UNKNOWN  15240                                             0          0
RFS       5517       IDLE         LGWR     14986                                            95     246626
RFS       5346       IDLE         Archival 15124                                             0          0

12 rows selected.

I guess that’s true, as the system is in constant recovery using the standby logfiles.

And what about CDB3?

On the other hand, CDB3 – to which redo is shipped asynchronously – lists TT02 as its counterpart:

SQL> select db_unique_name,database_role from v$database;

DB_UNIQUE_NAME                 DATABASE_ROLE
------------------------------ ----------------
CDB3                           PHYSICAL STANDBY

SQL> select name, pid, role, action, client_pid, client_role, sequence#, block#, dest_id 
  2   from v$dataguard_process order by action;

NAME  PID                      ROLE                    ACTION       CLIENT_PID CLIENT_ROLE       SEQUENCE#     BLOCK#    DEST_ID
----- ------------------------ ----------------------- ------------ ---------- ---------------- ---------- ---------- ----------
rfs   14803                    RFS ping                IDLE              15124 gap manager              96          0          0
rfs   14809                    RFS archive             IDLE              15233 archive gap               0          0          0
rfs   14811                    RFS async               IDLE              15252 async ORL multi          96      34674          0
MRP0  11825                    managed recovery        IDLE                  0 none                      0          0          0
ARC0  11776                    archive local           IDLE                  0 none                      0          0          0
ARC2  11786                    archive redo            IDLE                  0 none                      0          0          0
TT00  11774                    gap manager             IDLE                  0 none                      0          0          0
ARC3  11788                    archive redo            IDLE                  0 none                      0          0          0
TMON  11706                    redo transport monitor  IDLE                  0 none                      0          0          0
LGWR  11676                    log writer              IDLE                  0 none                      0          0          0
ARC1  11784                    archive redo            IDLE                  0 none                      0          0          0
TT01  11778                    redo transport timer    IDLE                  0 none                      0          0          0

12 rows selected.

Unlike the case with CDB2, the local RFS process is indeed connecting to TT02 on server1:

[root@server2 ~]# netstat -vW --numeric-ports -ee -p -la --tcp | egrep '^Active|^Proto|14811'
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode      PID/Program name    
tcp6       0      0 server2.example.com:1521 server1.example.com:12936 ESTABLISHED oracle     4658198    14811/oracleCDB3    

… and on server1:

[root@server1 ~]# netstat -vW --numeric-ports -ee -p -la --tcp | egrep '^Active|^Proto|12936'
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address            State       User       Inode      PID/Program name    
tcp        0      0 server1.example.com:12936 server2.example.com:1521 ESTABLISHED oracle     15820538   15252/ora_tt02_CDB1 

This should be enough evidence, I rest my case :)

Summary

So I guess that answers my question: on my small VMs in the lab, NSSn is responsible for shipping redo to targets in “SYNC” mode, while the redo transport server processes (TTnn) ship redo to destinations defined for asynchronous transport.

Richard Foote's picture

Early Bird Extension – UK February Dates: “Oracle Indexing Internals and Best Practices” Seminar

As a Christmas present to those in the UK looking at attending my “Oracle Indexing Internals and Best Practices” seminar in February next year, the Early Bird rates are now available until 19th January 2018. Take this opportunity to attend this highly acclaimed seminar that is packed full of information designed to significantly improve the […]