Oakies Blog Aggregator

jkstill's picture

MetaLink, we barely knew ye

But, we wish we had more time to get better acquainted.

If you work with Oracle, you probably know that MetaLink went the way of the Dodo as part of an upgrade to My Oracle Support during the weekend of November 6th, 2009.

And so far it hasn't gone too well, as evidenced by these threads on Oracle-L:

Issues with My Oracle Support
Metalink Fiasco

Many people were lamenting the loss of MetaLink well before its demise, but I don't think any were quite expecting the issues that are currently appearing.

A few have reported that it is working fine for them, but personally, I have found  it unusable all morning.

At least one issue with MetaLink appears to have been cleared up with MOS, that is while I was able to login to it last week.

During a routine audit of who had access to our CSI numbers, I came across a group of consultants that were no longer working for our company, and froze their accounts.  The next day I received a frantic voice mail  from a member of the consulting firm, and he informed me that they had no access to MetaLink because I had frozen their accounts.

I returned the call just a few minutes later, but they had already been able to resolve the issue, as one of their consultants with admin rights had been unaffected, and was able to unfreeze their accounts.
Removing them from the CSI is the better procedure, but in the past when I have attempted to do so, I found that there were still open issues owned by the accounts, and could not remove them. The application owners had been very clear that this access should be removed, so I froze the accounts, so that is what I did on this occasion as well.

This all seemed quite bizarre to me.  This must be a very strange schema in the ML user database, and some strange logic to go along with it.  By granting a user access to a CSI, MetaLink was giving me Carte Blanche to effectively remove them from MetaLink.
How has My Oracle Support fixed this?  Try as I might, I could not find a 'freeze' button in user administration in MOS.  So the fix seems to have been "remove the button"

James Morle's picture

The Oracle Wait Interface Is Useless (sometimes) – Part One: The Problem Definition

So here we go, this is part one of this experiment in blogging and co-writing. Tanel has actually written some good stuff already for this, but I wanted to try and formalise things under a common title and make it easier to follow between our sites.

I thought it would be logical to start this process by producing a more concrete problem definition, so that’s the focus of this part. It’s unlikely that we will come up with a complete method in this initial work, but hopefully the wheels will at least turn a little by the end of it!

So first of all, why would I dare to say that the Oracle Wait Interface is useless? Well, partly because I quite like titles that are a little bit catchy, and partly because it is indeed sometimes useless. The emphasis is on the word sometimes, though, because the Oracle Wait Interface is still the single most useful feature in any database product. Wow – that’s quite a claim, isn’t it? This isn’t the place to fully explain why that is, and many others have written great works on this subject already. Check out Cary Millsap’s works, notably his book, Optimizing Oracle Performance, which focuses in great detail on this subject. For the sake of this article, however, here’s why it is so useful: It tells you where the time goes. Think about it: If something is running too slowly, knowing where the time is used up is the single piece of information required to focus on the right subject for tuning.

So what’s wrong with the Oracle wait interface? Just one thing, actually – it is designed to  provide visibility of relatively slow waits. The reason for this is simply that there is a slight overhead in timing every single wait. If that overhead becomes a noticeable proportion of the actual wait itself then the measurement becomes inaccurate (and makes the problem worse). On UNIX-like platforms (yes, that includes Linux), the actual timing interface is implemented using gettimeofday(2) system calls, one before the event and one after the event. This call gives microsecond granularity of timing, at least in theory (on my Opteron 280 test machine, gettimeofday() calls take 1.5 microseconds). So, using this kind of mechanism for events that take a relatively long time makes perfect sense – disk I/O, for example, that will take at least three orders of magnitude longer to complete than the timing calls themselves. Conversely, they make no sense at all for calls that take even as little as 50 microseconds, as the 3 microsecond penalty for measuring the wait is 6% of the actual event time itself at that point. There you go, that’s the beginning of the justification that the wait interface is useless,  in a nutshell.

But hang on, isn’t 50 microseconds really, really fast? Well no, actually, it isn’t. Taking Intel’s Nehalem processor (with Quickpath) as an example, a memory latency is around the 50 ns range – three orders of magnitude faster than the 50 microsecond cut-off that I just arbitrarily invented. Memory access is also the slowest thing that a CPU can do (without factoring in peripheral cards) – in this case the CPU has to wait for about 150 cycles while that memory access takes place. So it’s very possible to have a function call that does fairly complex work and is still an order of magnitude or two faster than the gettimeofday() system call.

Time for an example. Actually, this is a variation on the example that made me start thinking about this – I had been perfectly happy with the Oracle Wait Interface until this point for 99% of cases!

OK, so a user just called you, complaining that his query is simply not ever completing. Following the usual course of action, we might take a few samples from v$session_wait (or v$session from 10g onwards) to look at the current wait state for the process, just in case it’s something obvious:

SID EVENT                          PROCESS                  STATE

---- ------------------------------ ------------------------ -------------------
 8 SQL*Net message to client      15032                    WAITED SHORT TIME
Well, that isn’t too revealing. Let’s now turn on extended sql tracing to get a list of all wait events as they transition:
SQL> oradebug setospid 15033
Oracle pid: 20, Unix process pid: 15033, image: oracle@elise03.sa.int (TNS V1-V3)
SQL> oradebug event 10046 trace name context forever, level 12
Statement processed.

After a few seconds, let’s see if there is anything in the trace file:

*** 2009-11-09 10:54:36.934
*** SESSION ID:(8.10393) 2009-11-09 10:54:36.934
*** CLIENT ID:() 2009-11-09 10:54:36.934
*** SERVICE NAME:(SYS$USERS) 2009-11-09 10:54:36.934
*** MODULE NAME:(sqlplus@elise03.sa.int (TNS V1-V3)) 2009-11-09 10:54:36.934
*** ACTION NAME:() 2009-11-09 10:54:36.934

Received ORADEBUG command (#1) 'event 10046 trace name context forever, level 12' from process 'Unix process pid: 15082, image: '

*** 2009-11-09 10:54:36.935
Finished processing ORADEBUG command (#1) 'event 10046 trace name context forever, level 12'

No, this is not truncated output – there is nothing in this file at all, apart from the actual invocation of the tracing. OK, so what’s next? How about looking at ‘top’ to see if the process if busy:

top - 10:53:15 up 2 days, 19:50,  4 users,  load average: 1.04, 0.55, 0.57
Tasks: 148 total,   2 running, 146 sleeping,   0 stopped,   0 zombie
Cpu(s): 25.8%us,  0.4%sy,  0.0%ni, 73.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3995468k total,  3877052k used,   118416k free,   115432k buffers
Swap:  6029304k total,   170388k used,  5858916k free,  3289460k cached

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
15033 oracle    25   0 2901m 1.1g 1.1g R 99.9 28.5   3:01.79 oracle
3284 oracle    15   0 2903m  54m  46m S  1.7  1.4  60:19.33 oracle
15071 oracle    16   0 2902m  24m  21m S  1.3  0.6   0:00.04 oracle
15069 oracle    15   0 12740 1112  816 R  0.7  0.0   0:00.07 top

Crikey, our process is more or less consuming 100% of a CPU! So, we are not waiting for anything, but the user still has no results, and the process is very busy doing something. Let’s now try the next logical step – let’s truss/strace the process:

[oracle@elise03 trace]$ strace -tp 15033
Process 15033 attached - interrupt to quit
11:28:06 gettimeofday({1257766086, 104118}, NULL) = 0
11:28:06 getrusage(RUSAGE_SELF, {ru_utime={2270, 615813}, ru_stime={0, 634903}, ...}) = 0
11:28:06 gettimeofday({1257766086, 104402}, NULL) = 0
11:28:06 getrusage(RUSAGE_SELF, {ru_utime={2270, 615813}, ru_stime={0, 634903}, ...}) = 0
11:28:08 gettimeofday({1257766088, 105930}, NULL) = 0
11:28:08 getrusage(RUSAGE_SELF, {ru_utime={2272, 609510}, ru_stime={0, 634903}, ...}) = 0
11:28:08 gettimeofday({1257766088, 106186}, NULL) = 0
11:28:08 getrusage(RUSAGE_SELF, {ru_utime={2272, 609510}, ru_stime={0, 634903}, ...}) = 0
11:28:10 gettimeofday({1257766090, 110887}, NULL) = 0
11:28:10 getrusage(RUSAGE_SELF, {ru_utime={2274, 611206}, ru_stime={0, 634903}, ...}) = 0
11:28:10 gettimeofday({1257766090, 111143}, NULL) = 0
11:28:10 getrusage(RUSAGE_SELF, {ru_utime={2274, 611206}, ru_stime={0, 634903}, ...}) = 0
Not much to see here. Those system calls are emitted every two seconds, and are almost certainly just the result of Oracle updating the statistics for the database time model (v$sess_time_model), but we’re really not doing much at all in terms of system calls. Hang on a minute, the database time model – surely that will help us here?
SQL> select stat_name,value from v$sess_time_model where sid=8;
STAT_NAME                                          VALUE
--------------------------------------------- ----------
DB time                                       2802976117
DB CPU                                        2801767067
background elapsed time                                0
background cpu time                                    0
sequence load elapsed time                             0
parse time elapsed                                 74853
hard parse elapsed time                            66626
sql execute elapsed time                      2802910770
connection management call elapsed time            21308
failed parse elapsed time                              0
failed parse (out of shared memory) elapsed t          0
hard parse (sharing criteria) elapsed time         62605
hard parse (bind mismatch) elapsed time                0
PL/SQL execution elapsed time                       5377
inbound PL/SQL rpc elapsed time                        0
PL/SQL compilation elapsed time                    18303
Java execution elapsed time                            0
repeated bind elapsed time                             0
RMAN cpu time (backup/restore)                         0

19 rows selected.

SQL>

At last, we have some kind of a symptom, and confirmation that the user is indeed actually trying to do useful work rather than just spinning on the CPU. The symptom is that there is an increasing amount of time being allocated as ‘sql execute elapsed time’.  In my opinion, this is where the time model statistics fail to deliver when the opportunity is presented to it. There are just 19 statistics reported here on 11gR2, and the only help we are given from its output is that we are using a lot of DB time, a lot of DB CPU and a lot of sql execute time. So we can surmise that we are doing a very CPU intensive query, and that’s about it – no finer granularity provided and this would seem to be the logical place for such instrumentation…

OK, so we are now just confirmed – we have a nasty query that is just using CPU and never waiting for anything. Let’s not start guessing at this stage what the problem is, let’s try and find out the real answer. At this stage, we might jump to v$sql_plan_statistics_all to find out what is going on. These statistics are only updated when statistics_level is set to ALL, and even then do not update until the statement is complete. In our little example here, the query is already running – we can’t set statistics_level=all, and we can’t wait until the query completes – it might never do so! Of course, it’s probably OK to get the user to kill the query and restart with statistics_level=all, or even the ctrl-c would allow the stats to update. However, let’s assume neither of those things are possible, as this is only an example case.
So what techniques can we use to find the problem? One of them might be the new SQL Execution Monitoring in 11g (which looks very nice), if you have the required license – but that is a very specific case where the problem is a SQL execution problem. What if it is not a SQL execution problem? We need a more general method for finding the answer. That’s the subject of part two of this blog, over to Tanel for Part Two!

robyn's picture

another (possible) nonsense correlation

I was reading news stories on Reuter's this morning and came across a new study. Researchers have determined that men who work in unchallenging jobs with little control over their future tend to be less active off the job as well.Now, I don't doubt that there is a relationship between a passive work role and the amount of activity someone engages in off the job. However there are a few quotes

harald's picture

Explaining the number of Consistent Gets

Last week I received an email from a friend, who wishes to remain anonymous, with the question why Oracle needed 8 consistent gets to perform a full table scan on a table where all the rows are stored in just one data block. There are several possibilities that can cause this and that is what [...]

joc's picture

Starting Oracle Blog

Quite a long time ago I was tempted to start blogging about Oracle and then I decided just not to do that but rather I started to blog about my flying around Europe to present at Oracle conferences. However, I created the blog but never activated. The nomination for Oracle ACE changed this decision and I'll try to write about technical stuff from time to time, but don't expect that I will be so active as some of Oracle bloggers.

davidkurtz's picture

Detecting and Fixing Row Migration

In my previous posting, I discussed how migrated rows led to latch connection problems on a system. In this entry I will explain how I identified and removed the migrated rows, and correctly set PCTFREE on each table so that the problem will not recur.

Instead, you must use the ANALYZE command if you want to know about chained and migrated rows. DBMS_STATS only collects the statistics required by the cost-based optimizer. It doesn't populate the other statistics, including CHAIN_CNT and AVG_FREE_SPACE. So I wrote a simple script (available on my website) to work though a set of tables, and list the chained or migrated rows into a table. I sometimes choose to analyse only certain tables. The criteria in the comment are examples of criteria I add to the driving query to restrict the list of tables to be analysed.

#eeeeee; border: 0px solid rgb(0, 0, 0); overflow: auto; padding-left: 4px; padding-right: 4px; width: 100%;">
BEGIN
FOR x IN (
SELECT owner, table_name, num_rows
FROM all_tables t
WHERE 1=1
/*-------------------------------------------------------
AND NOT table_name IN(SELECT DISTINCT table_name FROM chained_rows)
AND num_rows >= 1000000
AND num_rows BETWEEN 100000 AND 1000000
/*--------------------------------------------------------*/
AND temporary = 'N'
) LOOP
DELETE FROM chained_rows
WHERE owner_name = x.owner
AND table_name = x.table_name;

EXECUTE IMMEDIATE 'ANALYZE TABLE '||x.owner||'.'||x.table_name
||' LIST CHAINED ROWS INTO chained_rows';
END LOOP;
END;
/

Having identified the migrated rows, I produced a simple report

#eeeeee; border: 0px solid rgb(0, 0, 0); overflow: auto; padding-left: 4px; padding-right: 4px; width: 100%;">
SELECT /*+LEADING(c)*/ c.*, t.num_rows
, c.chained_rows/t.num_rows*100 pct_chained
, t.pct_free, t.pct_used
, 100-FLOOR((100-t.pct_free)*(1-c.chained_rows/t.num_rows)) new_pct_free
FROM (
SELECT owner_name, table_name, COUNT(*) chained_rows
FROM chained_rows c
GROUP BY owner_name, table_name) c
, all_tables t
WHERE t.owner = c.owner_name
AND t.table_name = c.table_name
AND t.num_rows > 0
ORDER BY chained_rows desc, 1,2
/

The last column of the report is a calculated new value for PCTFREE. The amount of in a block that can be used to insert new rows (100-PCTFREE) space used is scaled back by the proportion of migrated rows.

NB: This approach doesn't take chained rows into account. Chained rows are too big to fit in a single block under any circumstances, as opposed to migrated rows that could have fitted in a block had there been space. However, Oracle counts both types and stores the result in the column CHAIN_CNT.

#eeeeee; border: 0px solid rgb(0, 0, 0); overflow: auto; padding-left: 4px; padding-right: 4px; width: 100%;">
Chained   Number             %    %   New
Owner Table Name Rows of Rows %Chained Free Used %Free
-------- -------------------- ---------- -------- -------- ---- ---- -----
MY_APP MY_TABLE 239667 1193233 20.1 10 40 29

The script then generates certain commands for each table. First the table is rebuilt by moving it to the same tablespace.

I specify PCTFREE as 1. Usually, the whole table does not need to be rebuilt with the new higher value for PCTFREE. Most of the rows are fully populated and generally will not grow further because they have already migrated. Setting a high value for PCTFREE will simply result in leaving a lot of free space, and result in Oracle reading more blocks for the same data. Instead, I set a low value for PCTFREE in order to pack the data into the minimum number of blocks.

#eeeeee; border: 0px solid rgb(0, 0, 0); overflow: auto; padding-left: 4px; padding-right: 4px; width: 100%;">
ALTER TABLE MY_APP.MY_TABLE MOVE TABLESPACE MYAPP_TABLE PCTFREE 1;

The new value for PCTFREE is then applied to the table. This approach is not foolproof. It is possible for existing rows, that have now been packed into the minimum number of blocks, to grow and to be migrated. However, I have found that this happens relatively rarely.

If necessary, I also reduce PCTUSED such that the sum of the new PCTFREE and the old PCTUSED does not exceed 90. The old value is shown in the comment for reference.

#eeeeee; border: 0px solid rgb(0, 0, 0); overflow: auto; padding-left: 4px; padding-right: 4px; width: 100%;">
ALTER TABLE MY_APP.MY_TABLE PCTFREE 29 /*10*/ PCTUSED 40;

Finally, all the indexes on the rebuilt tables need to be rebuilt, because they are invalidated by the table rebuild.

#eeeeee; border: 0px solid rgb(0, 0, 0); overflow: auto; padding-left: 4px; padding-right: 4px; width: 100%;">
ALTER INDEX IDX_MY_TABLE_BH25150A REBUILD TABLESPACE MYAPP_INDEX;
ALTER INDEX IDX_MY_TABLE_BH25150B REBUILD TABLESPACE MYAPP_INDEX;
ALTER INDEX IDX_MY_TABLE_BH25150C REBUILD TABLESPACE MYAPP_INDEX;
ALTER INDEX IDX_MY_TABLE_BH25150D REBUILD TABLESPACE MYAPP_INDEX;
ALTER INDEX IDX_MY_TABLE_BH25150E REBUILD TABLESPACE MYAPP_INDEX;
ALTER INDEX IDX_MY_TABLE_BH25150F REBUILD TABLESPACE MYAPP_INDEX;

Conclusion

  • The table no longer has migrated rows.
  • The existing data is packed with little free space, minimising the size of the table.
  • Free space will be left in new blocks to allow new rows to expand without migrating.
  • All of the DDL to fix the problem has been built dynamically.

(5.11.2009) See also Tanel Poder's blog: Core IT for Geeks and Pros: Detecting and Fixing Row Migration

karlarao's picture

RHEV (Red Hat Enterprise Virtualization) is out!!! :)

The company I work for, SQL*Wizard,  is a RedHat Advanced Business Partner so I was lucky to get my hands dirty on the BETA release of RHEV, also a pleasure to work with Siva Shunmugam (Sr. Solutions Architect@RedHat & RHCA). I must say, KVM is so fast… plus the RHEV manager is so cool as a management platform </p />
</p></div>
    <div class=»

alberto.dellera's picture

11gR2: materialized view logs changes

In this post we are going to discuss some 11gR2 changes to materialized view logs that are aimed at increasing the performance of the fast-refresh engine of materialized views (MVs), especially the on-commit variant.
The MV logs, in 10gr2, now comes in two flavours: the traditional (and still the default) timestamp-based one and the brand [...]

Listing files with the external table preprocessor in 11g

Using the 11g external table preprocessor to get directory listings in SQL. October 2009

harald's picture

Checkpoint presentation presented at the RAC SIG

Yesterday I presented my What’s the Point of Oracle Checkpoints presentation at the Oracle RAC SIG and although I was a bit nervous it went quite well. However somehow my assigned time slot was shortened without notice and instead of having 90 minutes for my presentation I had to cut short after just 60 minutes. [...]

To prevent automated spam submissions leave this field empty.