The hang manager

fritshoogland's picture

Recently I was looking in the trace directory in the diag dest of my (12.2.0.1) instance, and found some trace files which were different from the others:

$ ls | grep dia0
test_dia0_9711_base_1.trc
test_dia0_9711_base_1.trm
test_dia0_9711_lws_1.trc
test_dia0_9711_lws_1.trm
test_dia0_9711.trc
test_dia0_9711.trm

The dia0 ‘test_dia0_9711.trc’ file is fully expected. But what are these ‘..lws_1.trc’ and ‘..base_1.trc’ files? And ‘base’ is something that I understand, but what would ‘lws’ mean? Lunatics Without Skateboards?

First, let’s look into the normally named trace file of the dia0 process:

$ cat test_dia0_9711.trc
HM: Early Warning and/or Long Waiting Session (LWS) is enabled.
    The output can be found in trace files of the form:
    test_dia0_9711_lws_n.trc where n is a number between 1 and 5.

*** 2017-11-26T12:18:25.849759+00:00
HM: Moving the default trace file to: test_dia0_9711_base_1.trc


*** 2017-11-26T16:45:33.193729+00:00
HM: Tracing moved from base trace file: test_dia0_9711_base_1.trc

HM: This is a single non-RAC instance.  Hang detection is enabled.
HM: Hang resolution scope is OFF.
HM: Cross Boundary hang detection is enabled.
HM: Cross Cluster hang detection is enabled.
HM: CPU load is not being monitored.
HM: IO load is NOT abnormally high - 0% outliers.

HM: Tracing is now being performed to this file.
2017-11-26 16:45:33.193 :kjzdshn(): Shutting down DIAG

Ah! So LWS means ‘Long Waiting Sessions’. This seems to be functionality that produces trace files for execution which do not perform optimally.

Let’s look into the ‘base’ trace file:

*** 2017-11-26T12:18:25.849773+00:00
HM: Tracing moved from default trace file: test_dia0_9711.trc

HM: This is a single non-RAC instance.  Hang detection is enabled.
HM: Hang resolution scope is OFF.
HM: Cross Boundary hang detection is enabled.
HM: Cross Cluster hang detection is enabled.
HM: CPU load is not being monitored.
HM: IO load is NOT abnormally high - 0% outliers.

*** 2017-11-26T12:18:25.849903+00:00
HM: Early Warning and/or Long Waiting Session (LWS) is enabled.
    The output can be found in trace files of the form:
    test_dia0_9711_lws_n.trc where n is a number between 1 and 5.

*** 2017-11-26T12:18:30.870494+00:00
HM: Detected incompatible setting 'INSTANCE' for Hang Resolution Scope
    - disabling hang resolution.

I get the feeling this is originally made for RAC, but now also working for single instance. Some RAC related features seem to be turned off. It seems that the actual session information is stored in the ‘lws’ trace files.

This is a sample of what is in a lws trace file:

*** 2017-11-26T13:14:25.053171+00:00
HM: Early Warning - Session ID 58 serial# 53774 OS PID 30760 (FG) is
     'not in a wait' for 33 seconds
     last waited on 'resmgr:cpu quantum'
     p1: 'location'=0x3, p2: 'consumer group id'=0x4a2f, p3: ' '=0x0

                                                     IO
 Total  Self-         Total  Total  Outlr  Outlr  Outlr
  Hung  Rslvd  Rslvd   Wait WaitTm   Wait WaitTm   Wait
  Sess  Hangs  Hangs  Count   Secs  Count   Secs  Count Wait Event
------ ------ ------ ------ ------ ------ ------ ------ -----------
     1      0      0      0      0      0      0      0 not in wait


HM: Dumping Short Stack of pid[71.30760] (sid:58, ser#:53774)
Short stack dump:
ksedsts()+346<-ksdxfstk()+71<-ksdxcb()+912<-sspuser()+217<-__sighandler()<-sslnxadd()+38<-pfrinstr_ADDN()+103<-pfrrun_no_tool()+60<-pfrrun()+919<-plsql_run()+756<-peicnt()+288<-kkxexe()+717<-opiexe()+22531<-kpoal8()+2679<-opiodr()+1229<-ttcpip()+1257<-opitsk()+1940<-opiino()+941<-opiodr()+1229<-opidrv()+1021<-sou2o()+145<-opimai_real()+455<-ssthrdmain()+417<-main()+262<-__libc_start_main()+245

HM: Current SQL: declare
        x number := 0;
begin
        for c in 1..1000000000 loop
                x := x + c;
        end loop;
end;

I got a sample PLSQL loop running for testing load (which is really silly), and this running for 33 seconds apparently generates an ‘Early Warning’. It shows what it was doing from a wait based analysis perspective (‘not in a wait’, ‘resmgr: cpu quantum’), it shows some statistics for the hang analyser, it dumps a short stack, which is fully expected; kdsests() to __sighandler() are the routines for generating the stack dump, the actual function that was executing was sslnxadd(), which is a system specific linux add function, which is called by pfrinstr_ADDN(), which is the PLSQL add function, which is called by pfrrun_no_tool(), which is the PLSQL fast interpretation loop function, and so on. The last thing it dumps is the current SQL, which is the silly PLSQL code.

Is there anything inside the Oracle database that shows anything related to this? Yes, it seems there are:

V$HANG_INFO
V$HANG_SESSION_INFO
V$HANG_STATISTICS
DBA_HANG_MANAGER_PARAMETERS

Doing some simple tests, I could not get V$HANG_INFO and V$HANG_SESSION_INFO populated. V$HANG_STATISTICS contains 87 statistics, and displays current data, of which most in my single instance test database are 0.

One thing that confused me at first was the abbreviation ‘HM’ for hang monitor, because the very same abbreviation is used in the diagnostic repository for ‘health monitor’.

It seems the hang manager is a quite extensive framework for detecting hangs and hang related issues; I peeked at the documented and undocumented parameters containing “hang” in them, of which there are quite an impressive amount quite likely belonging to the hang manager framework:

SQL> @parms
Enter value for parameter: hang
Enter value for isset:
Enter value for show_hidden: Y
_global_hang_analysis_interval_secs                10                                                                     TRUE     FALSE      FALSE
_hang_allow_resolution_on_single_nonrac            FALSE                                                                  TRUE     FALSE      FALSE
_hang_analysis_num_call_stacks                     3                                                                      TRUE     FALSE      FALSE
_hang_base_file_count                              5                                                                      TRUE     FALSE      FALSE
_hang_base_file_space_limit                        100000000                                                              TRUE     FALSE      FALSE
_hang_blocked_session_percent_threshold            10                                                                     TRUE     FALSE      FALSE
_hang_bool_spare1                                  TRUE                                                                   TRUE     FALSE      FALSE

NAME                                               VALUE                                                                  ISDEFAUL ISMODIFIED ISSET
-------------------------------------------------- ---------------------------------------------------------------------- -------- ---------- ----------
_hang_bool_spare2                                  TRUE                                                                   TRUE     FALSE      FALSE
_hang_cross_boundary_hang_detection_enabled        TRUE                                                                   TRUE     FALSE      FALSE
_hang_cross_cluster_hang_detection_enabled         TRUE                                                                   TRUE     FALSE      FALSE
_hang_delay_resolution_for_libcache                TRUE                                                                   TRUE     FALSE      FALSE
_hang_detection_enabled                            TRUE                                                                   TRUE     FALSE      FALSE
_hang_enable_processstate                          TRUE                                                                   TRUE     FALSE      FALSE
_hang_hang_analyze_output_hang_chains              TRUE                                                                   TRUE     FALSE      FALSE
_hang_hiload_promoted_ignored_hang_count           2                                                                      TRUE     FALSE      FALSE
_hang_hiprior_session_attribute_list                                                                                      TRUE     FALSE      FALSE
_hang_hung_session_ewarn_percent                   34                                                                     TRUE     FALSE      FALSE
_hang_ignored_hang_count                           1                                                                      TRUE     FALSE      FALSE

NAME                                               VALUE                                                                  ISDEFAUL ISMODIFIED ISSET
-------------------------------------------------- ---------------------------------------------------------------------- -------- ---------- ----------
_hang_ignored_hangs_interval                       300                                                                    TRUE     FALSE      FALSE
_hang_ignore_hngmtrc_count                         1                                                                      TRUE     FALSE      FALSE
_hang_int_spare1                                   0                                                                      TRUE     FALSE      FALSE
_hang_int_spare2                                   0                                                                      TRUE     FALSE      FALSE
_hang_log_io_hung_sessions_to_alert                FALSE                                                                  TRUE     FALSE      FALSE
_hang_log_verified_hangs_to_alert                  FALSE                                                                  TRUE     FALSE      FALSE
_hang_long_wait_time_threshold                     0                                                                      TRUE     FALSE      FALSE
_hang_lws_file_count                               5                                                                      TRUE     FALSE      FALSE
_hang_lws_file_space_limit                         100000000                                                              TRUE     FALSE      FALSE
_hang_max_session_hang_time                        96                                                                     TRUE     FALSE      FALSE
_hang_monitor_archiving_related_hang_interval      300                                                                    TRUE     FALSE      FALSE

NAME                                               VALUE                                                                  ISDEFAUL ISMODIFIED ISSET
-------------------------------------------------- ---------------------------------------------------------------------- -------- ---------- ----------
_hang_msg_checksum_enabled                         TRUE                                                                   TRUE     FALSE      FALSE
_hang_resolution_allow_archiving_issue_termination TRUE                                                                   TRUE     FALSE      FALSE
_hang_resolution_confidence_promotion              FALSE                                                                  TRUE     FALSE      FALSE
_hang_resolution_global_hang_confidence_promotion  FALSE                                                                  TRUE     FALSE      FALSE
_hang_resolution_percent_hung_sessions_threshold   300                                                                    TRUE     FALSE      FALSE
_hang_resolution_policy                            HIGH                                                                   TRUE     FALSE      FALSE
_hang_resolution_promote_process_termination       FALSE                                                                  TRUE     FALSE      FALSE
_hang_resolution_scope                             OFF                                                                    TRUE     FALSE      FALSE
_hang_running_in_lrg                               FALSE                                                                  TRUE     FALSE      FALSE
_hang_short_stacks_output_enabled                  TRUE                                                                   TRUE     FALSE      FALSE
_hang_signature_list_match_output_frequency        10                                                                     TRUE     FALSE      FALSE

NAME                                               VALUE                                                                  ISDEFAUL ISMODIFIED ISSET
-------------------------------------------------- ---------------------------------------------------------------------- -------- ---------- ----------
_hang_statistics_collection_interval               15                                                                     TRUE     FALSE      FALSE
_hang_statistics_collection_ma_alpha               30                                                                     TRUE     FALSE      FALSE
_hang_statistics_high_io_percentage_threshold      25                                                                     TRUE     FALSE      FALSE
_hang_terminate_session_replay_enabled             TRUE                                                                   TRUE     FALSE      FALSE
_hang_trace_interval                               32                                                                     TRUE     FALSE      FALSE
_hang_verification_interval                        46                                                                     TRUE     FALSE      FALSE

For the sake of experimenting with the hang manager I locked a table using ‘lock table t2 in exclusive mode’ and another session doing an insert. The ‘base’ trace file showed the following information:

*** 2017-12-11T06:25:10.364037+00:00
HM: Hung Sessions (local detect) - output local chains
===============================================================================
Non-intersecting chains:

-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (test.test)
                   os id: 7526
              process id: 38, oracle@memory-presentation.local (TNS V1-V3)
              session id: 27
        session serial #: 32599
    }
    is waiting for 'enq: TM - contention' with wait info:
    {
                      p1: 'name|mode'=0x544d0003
                      p2: 'object #'=0x57e6
                      p3: 'table/partition'=0x0
            time in wait: 1 min 35 sec
           timeout after: never
                 wait id: 74
                blocking: 0 sessions
            wait history:
              * time between current wait and wait #1: 0.000123 sec
              1.       event: 'SQL*Net message from client'
                 time waited: 26.493425 sec
                     wait id: 73               p1: 'driver id'=0x62657100
                                               p2: '#bytes'=0x1
              * time between wait #1 and #2: 0.000025 sec
              2.       event: 'SQL*Net message to client'
                 time waited: 0.000002 sec
                     wait id: 72               p1: 'driver id'=0x62657100
                                               p2: '#bytes'=0x1
              * time between wait #2 and #3: 0.000005 sec
              3.       event: 'SQL*Net break/reset to client'
                 time waited: 0.000055 sec
                     wait id: 71               p1: 'driver id'=0x62657100
                                               p2: 'break?'=0x0
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 1 (test.test)
                   os id: 7481
              process id: 24, oracle@memory-presentation.local (TNS V1-V3)
              session id: 55
        session serial #: 9654
    }
    which is waiting for 'SQL*Net message from client' with wait info:
    {
                      p1: 'driver id'=0x62657100
                      p2: '#bytes'=0x1
            time in wait: 1 min 39 sec
           timeout after: never
                 wait id: 6714
                blocking: 1 session
            wait history:
              * time between current wait and wait #1: 0.000006 sec
              1.       event: 'SQL*Net message to client'
                 time waited: 0.000004 sec
                     wait id: 6713             p1: 'driver id'=0x62657100
                                               p2: '#bytes'=0x1
              * time between wait #1 and #2: 0.000203 sec
              2.       event: 'SQL*Net message from client'
                 time waited: 17.115734 sec
                     wait id: 6712             p1: 'driver id'=0x62657100
                                               p2: '#bytes'=0x1
              * time between wait #2 and #3: 0.000007 sec
              3.       event: 'SQL*Net message to client'
                 time waited: 0.000004 sec
                     wait id: 6711             p1: 'driver id'=0x62657100
                                               p2: '#bytes'=0x1
    }

Chain 1 Signature: 'SQL*Net message from client'<='enq: TM - contention'
Chain 1 Signature Hash: 0x163c4cba
-------------------------------------------------------------------------------

===============================================================================

If you wait a little longer, there is another type of dia0 trace file that appears: ‘..vfy_1.trc’, which is a file containing ‘verified hangs’. They appear one time for a given hang situation, and looks like this:

One or more possible hangs have been detected on your system.
These could be genuine hangs in which no further progress will
be made without intervention, or it may be very slow progress
in the system due to high load.

Previously detected and output hangs are not displayed again.
Instead, the 'Victim Information' section will indicate that
the victim is from an 'Existing Hang' under the 'Existing or'
column.

'Verified Hangs' below indicate one or more hangs that were found
and identify the final blocking session and instance on which
they occurred. Since the current hang resolution state is 'OFF',
there will be no attempt to resolve any of these hangs unless
hang resolution is enabled.  Terminating this session may well
free up the hang. Be aware that if the column named 'Fatal BG'
is marked with 'Y', terminating the process will cause the
instance to crash.

Any hangs with a 'Hang Resolution Action' of 'Unresolvable'
will be ignored. These types of hangs will either be resolved
by another layer in the RDBMS or cannot be resolved as they may
require external intervention.

Deadlocks (also named cycles) are currently NOT resolved even if
hang resolution is enabled.  The 'Hang Type' of DLCK in the
'Verified Hangs' output identifies these hangs.

Below that are the complete hang chains from the time the hang
was detected.

The following information will assist Oracle Support Services
in further analysis of the root cause of the hang.


*** 2017-12-11T07:05:07.989085+00:00
HM: Hang Statistics - only statistics with non-zero values are listed

                     number of hangs detected 1
                        number of local hangs 1
  hangs ignored due to application contention 1
      hangs:total number of impacted sessions 2
    hangs:minimum number of impacted sessions 2
    hangs:maximum number of impacted sessions 2
                      current number of hangs 1
            current number of active sessions 1
              current number of hung sessions 2

*** 2017-12-11T07:05:07.989122+00:00
Verified Hangs in the System
                     Root       Chain Total               Hang
  Hang Hang          Inst Root  #hung #hung  Hang   Hang  Resolution
    ID Type Status   Num  Sess   Sess  Sess  Conf   Span  Action
 ----- ---- -------- ---- ----- ----- ----- ------ ------ -------------------
     1 HANG VICSELTD    1    58     2     2    LOW  LOCAL IGNRD:HngRslnScpOFF
 Hang Ignored Reason: Hang resolution is disabled.

  Inst  Sess   Ser             Proc  Wait    Wait
   Num    ID    Num      OSPID  Name Time(s) Event
  ----- ------ ----- --------- ----- ------- -----
      1     45  5781     27709    FG     137 enq: TM - contention
      1     58 15676     27700    FG     146 SQL*Net message from client

Wait event statistics for event 'SQL*Net message from client'

Self-Resolved Hang Count                        : 0
Self-Resolved Hang Total Hang Time in seconds   : 0
Self-Resolved Hang Minimum Hang Time in seconds : 0
Self-Resolved Hang Maximum Hang Time in seconds : 0
Resolved Hang Count                             : 0
Total Hung Session Count                        : 1
Total Wait Time in micro-seconds                : 59799468
Total Wait Count                                : 67
Moving Average of Wait Time in micro-seconds    : 5220981

Victim Information
                                                         Existing or  Ignored
  HangID  Inst#  Sessid  Ser Num      OSPID  Fatal BG       New Hang    Count
  ------  -----  ------  -------  ---------  --------  -------------  -------
       1      1      58    15676      27700     N           New Hang        1

It appears Oracle adds more and more useful information in trace files, which only can be fetched from the trace files. Yet it doesn’t seem that any monitoring is actually picking this up, which, to my knowledge includes Oracle’s own Enterprise Manager.

The hang manager tracing talks about outliers, upon grepping through the dia0 trace files I found an explanation what dia0 is considering an outlier:

*** 2017-11-26T16:45:33.193479+00:00
HM: Hang Statistics - only statistics with non-zero values are listed

            current number of active sessions 1
  instance health (in terms of hung sessions) 100.00%

Dumping Wait Events with non-zero hung session counts
                       or non-zero outliers wait counts
(NOTE: IO outliers have wait times greater than 32 ms.
       non-IO outliers have wait times greater than 64 s.
       IDLE wait events are not output.)

                                                     IO
 Total  Self-         Total  Total  Outlr  Outlr  Outlr
  Hung  Rslvd  Rslvd   Wait WaitTm   Wait WaitTm   Wait
  Sess  Hangs  Hangs  Count   Secs  Count   Secs  Count Wait Event
------ ------ ------ ------ ------ ------ ------ ------ -----------
     0      0      0   1366      2      0      0      1 log file parallel write
     1      0      0   2206 4294966594      1 4294966272      0 resmgr:cpu quantum
     0      0      0      1      1      0      0      1 external table read
     1      0      0      0      0      0      0      0 not in wait


*** 2017-11-26T16:45:33.193686+00:00
TotalCPUTm:4230 ms TotalRunTm:16027340 ms %CPU: 0.03%