记一次Oracle偶尔断开连接

 admin   2022-08-02 15:30   216 人阅读  0 条评论

检查日志发现是系统tcp资源枯竭,找到占用连接数多的进程,重启就好了

TNS-12518: TNS:listener could not hand off client connection
 TNS-12547: TNS:lost contact
  TNS-12560: TNS:protocol adapter error
   TNS-00517: Lost contact
    IBM/AIX RISC System/6000 Error: 32: Broken pipe


pu、内存资源充足,数据库连接数未满,在MOS 发现NOTE 1084443.1 搭述了现象很是接近,那里描述是aix 6引入新特性“ Traffic Regulation” 限制了tcp port 的范围,当client 尝试打开一个新socket时, 当发现socket达到最大上限时会提示上面的错误, 使用下面命令查看是否启用了该特性:

tcptr -show


但是我这里的环境并没有并没有启用TCP Traffic Regulation policy, 之少给了排查的方向,查看一下当前的tcp 连接数。

[oracle@anbob1:/home/oracle]# ps -ef|wc -l
    1207
    
anbob1:/home/grid> netstat -an|wc -l
   62954   <<<<<<<
   
    
anbob1:/home/grid> netstat -an|grep 61020|wc -l
   56361  <<<<<<<
  
anbob1:/> lsattr -E -l sys0 | grep maxuproc
maxuproc        16384                                Maximum number of PROCESSES allowed per user      True
[oracle@anbob1:/home/oracle]# /usr/sbin/no -a | fgrep ephemeral
       tcp_ephemeral_high = 65500
        tcp_ephemeral_low = 9000
       udp_ephemeral_high = 65500
        udp_ephemeral_low = 9000


Note:
这里的显示的连接数好高,另一节点network连接数在5000左右,看一下连接明细满屏的tcp连接都是本机的private interconnect ip(如下),而且在都是通过61020 port发出, 这时送了一口气,不会是主机上某个程序的问题吧,这样就可以把球踢给OS了, 呵当然不好, 一个负责的DBA不能止步于此,下面要判断时哪个程序打开了这个端口的socket 连接。

...
tcp4       0      0  192.168.60.11.48817    192.168.60.11.61020    CLOSE_WAIT
tcp4       0      0  192.168.60.11.24318    192.168.60.11.61020    CLOSE_WAIT
tcp4       0      0  192.168.60.11.24319    192.168.60.11.61020    CLOSE_WAIT
tcp4       0      0  192.168.60.11.48818    192.168.60.11.61020    CLOSE_WAIT
tcp4       0      0  192.168.60.11.24320    192.168.60.11.61020    CLOSE_WAIT
tcp4       0      0  192.168.60.11.48819    192.168.60.11.61020    CLOSE_WAIT
tcp4       0      0  192.168.60.11.24321    192.168.60.11.61020    CLOSE_WAIT
tcp4       0      0  192.168.60.11.48820    192.168.60.11.61020    CLOSE_WAIT
...


下面就来演示如果根据socket port 定位process

anbob1:/> netstat  -aAn|grep CLOS*|head
f1000e003b08dbb8 tcp4       0      0  192.168.60.11.60433   192.168.60.11.61020   CLOSE_WAIT
f1000e00354a8bb8 tcp4       0      0  192.168.60.11.35934   192.168.60.11.61020   CLOSE_WAIT
f1000e0006fd5bb8 tcp4       0      0  192.168.60.11.11435   192.168.60.11.61020   CLOSE_WAIT
f1000e000546dbb8 tcp        0      0  *.*                   *.*                   CLOSED
f1000e0005527bb8 tcp4       0      0  *.*                   *.*                   CLOSED
f1000e0004bf93b8 tcp        0      0  *.*                   *.*                   CLOSED
f1000e0004becbb8 tcp4       0      0  *.*                   *.*                   CLOSED
f1000e0004b2ebb8 tcp        0      0  *.*                   *.*                   CLOSED
f1000e0004b34bb8 tcp4       0      0  *.*                   *.*                   CLOSED
f1000e00139d03b8 tcp4       0      0  192.168.60.11.35935   192.168.60.11.61020   CLOSE_WAIT
anbob1:/> kdb
           START              END <name>
0000000000001000 0000000004160000 start+000FD8
F00000002FF47600 F00000002FFDF9C8 __ublock+000000
000000002FF22FF4 000000002FF22FF8 environ+000000
000000002FF22FF8 000000002FF22FFC errno+000000
F1000F0A00000000 F1000F0A10000000 pvproc+000000
F1000F0A10000000 F1000F0A18000000 pvthread+000000
read vscsi_scsi_ptrs OK, ptr = 0x0

(0)> sockinfo f1000e003b08dbb8 tcpcb
---- TCPCB ----(@ F1000E003B08DBB8)----
    seg_next......@F1000E003B08DBB8  seg_prev......@F1000E003B08DBB8  
    t_softerror... 00000000 t_state....... 00000005 (CLOSE_WAIT)
    t_timer....... 00000000 (TCPT_REXMT)
    t_timer....... 00000000 (TCPT_PERSIST)
    t_timer....... 00000000 (TCPT_KEEP)
    t_timer....... 00000000 (TCPT_2MSL)
    t_rxtshift.... 00000000 t_rxtcur...... 00000004 t_dupacks..... 00000000 
    t_maxseg...... 000041CC t_force....... 00000000 
    t_flags....... 000003E0 (RFC1323|SENT_WS|RCVD_WS|SENT_TS|RCVD_TS|COPYFLAGS)
    t_oobflags.... 00000000 ()
    t_template....@0000000000000000  t_inpcb.......@F1000E003B08DAA0  
    t_iobc........ 00000000 t_timestamp... FAC3B601 snd_una....... 47F28D30 
    snd_nxt....... 47F28D30 snd_up........ 47F28D30 snd_wl1....... 866BAB50 
    snd_wl2....... 47F28D30 iss........... 47F28B50 
    snd_wnd....... 0000000000020E04 rcv_wnd....... 0000000000020E60 
    rcv_nxt....... 866BAB51 rcv_up........ 866BAB50 irs........... 866BAB4F 
    snd_wnd_scale. 00000002 rcv_wnd_scale. 00000002 req_scale_sent 00000002 
    req_scale_rcvd 00000002 last_ack_sent. 866BAB51 timestamp_rec. 5607037C 
    timestamp_age. 000E42DC rcv_adv....... 866DB9B1 snd_max....... 47F28D30 
    snd_cwnd...... 0000000000010730        snd_ssthresh.. 000000003FFFC000 
    t_idle........ 000042DC t_rtt......... 00000000 t_rtseq....... 47F28CD7 
    t_srtt........ 00000007 t_rttvar...... 00000004 t_rttmin...... 00000002 
    max_rcvd...... 0000000000000000        max_sndwnd.... 0000000000020E60 
    t_peermaxseg.. 000041CC snd_in_pipe... 00000000 
    sack_data.....@0000000000000000         snd_recover... 00000000 
    snd_high...... 47F28D30 snd_ecn_max... 47F28D30 snd_ecn_clear. 47F28D31 
    t_splice_with.@0000000000000000         t_splice_flags 00000000 

...
                               
---- SOCKET INFO ----(@ F1000E003B08D808)----
    type........ 0001 (STREAM) 
    opts........ 0000 ()       
    linger...... 0000 state....... FFFF81A2 (ISCONNECTED|CANTRCVMORE|PRIV|NBIO)
    pcb.....@F1000E003B08DAA0  proto...@0000000004875628  
    lock....@F1000E003B959080  head....@0000000000000000  
    q0......@0000000000000000  q.......@0000000000000000  
    q0len....... 0000 qlen........ 0000 qlimit...... 0000 
    timeo....... 0000 error....... 0000 special..... 0A08 
    pgid.... 0000000000000000  oobmark. 0000000000000000 
                               
snd:cc...... 0000000000000000  hiwat... 0000000000020E60 
    mbcnt... 0000000000000000  mbmax... 0000000000083980 
    lowat... 0000000000003FFF  mb......@0000000000000000  
    sel.....@0000000000000000  events...... 0000 
    iodone.. 00000000          ioargs..@0000000000000000  
    lastpkt.@F1000E0019F10C00  wakeone. FFFFFFFFFFFFFFFF 
    timer...@0000000000000000  timeo... 00000000 
    flags....... 0000 ()       
    wakeup.. 00000000          wakearg.@0000000000000000  
    lockwtg. FFFFFFFFFFFFFFFF  
    ....              
more
proc/fd:  3250/54028           
proc/fd: fd: 54028             
              SLOT NAME     STATE      PID    PPID          ADSPACE  CL #THS
                               
pvproc+32C800 3250*osysmond ACTIVE 0B2037C 0000001 0000000C2DA42590   0 000C    ######<<<<<<<<<<
                               
                               
(0)>  hcal 0B2037C
Value hexa: 00B2037C          Value decimal: 11666300
(0)> quit
anbob1:/> ps -ef|grep 11666300
    root 11666300        1   3   Apr 15      - 12970:57 /oracle/app/11.2.0.3/grid/bin/osysmond.bin
    root  1639464 19661020   0 14:10:21  pts/5  0:00 grep 11666300


Note:
这里我们又不得不把球抱起来,这是一个oracle的程序osysmond.bin, 定位程序在aix 还可以使用lsof 但是本机没有带,需要下载就一个执行程序

 [oracle@anbob1:/home/oracle]# ftp 133.96.60.19
ftp: connect: Resource temporarily unavailable


Note:
当时FTP 都提示connect 资源都不可用,可见OS的tcp 连接肯定出现了枯竭

anbob1:/>  lsof -i:61020
osysmond. 11666300 root *738u  IPv4 0xf1000e0041ff4bb8      0t0  TCP anbob1-prv:34166->anbob1-prv:61020 (CLOSE_WAIT)
osysmond. 11666300 root *739u  IPv4 0xf1000e0007091bb8      0t0  TCP anbob1-prv:35703->anbob1-prv:61020 (CLOSE_WAIT)
osysmond. 11666300 root *740u  IPv4 0xf1000e004283bbb8      0t0  TCP anbob1-prv:32682->anbob1-prv:61020 (CLOSE_WAIT)
...


TIP:
接下来在去MOS 以”osysmond.bin netstat”为关键字搜索,第一个就是我们想要的“Bug 15934834 osysmond.bin has lot of sockets open in close_wait state” 现象完全一致是因为osysmond.bin leaks socket. 但是MOS里 Workaround None, 而且当前环境也是在11.2.0.4 修复。
“Which socket states from netstat are candidates for the rmsock command?
FIN_WAIT_1,FIN_WAIT_2, TIME_WAIT, CLOSING are all states where the application performed an active close of the socket.
How do we know a socket is no longer associated with a process?
You can use lsof to check. It is highly recommend that you use netstat in conjunction with lsof to validate the situation prior to doing anything drastic.
Which socket states are not candidates for the rmsock command?
CLOSE_WAIT, CLOSED

停掉osysmond.bin进程就可以释放那些socket, 这里我是建议重启本节点的Instance和CRS后再观察, osysmond.bin是个monitor进程,有agent 守护,当天我们尝试最小修改通过OS kiil 了osysmond.bin ,agent自动重启了该进程且instence 未受影响,kill了该进程后socket释放,listener 恢复了正常。不过事后发现osysmond.bin 调用的socket还是有上涨趋势,如果下次有停机时间我们会尝试重启CRS.
Summary:
本次的问题是因为oracle bug 导致osysmond.bin 调用tcp socket 泄漏,导致占满了OS的TCP socket 连接,随后FTP 和监听变的异常,但是偶尔有连接释放socket,所以当时的监听表现是时好时坏。
 
–2015-7-29 updated
重启crs 也无法避免该问题,在11.2.0.3中 11.2.0.3.7 patch 修复了这个bug.

Patch description:  “Grid Infrastructure Patch Set Update : 11.2.0.3.7 (16742216)”


   Created on 24 Jun 2013, 22:52:45 hrs PST8PDT


   Bugs fixed:


     16619898, 16315641, 15876003, 14275572, 13919095, 13696251, 13348650


     12659561, 14305980, 14277586, 13987807, 14625969, 13825231, 12794268


     13000491, 13498267, 15934834, 11675721, 14082976, 12771830, 14515980


可以使用shell 加crontab 定时去检查

#!/bin/sh
###############################################################
## file: kill_bug15934834.sh
## 
## this scripts to clear connect tcp socket leak by osysmond.bin process
## Temporary workaround fixed bug 15934834
## author: weejar (anbob.com)
## date: 2015-7-29
## note: run as root, after kill osysmond process, it will automatically restart
##
################################################################
# logfile
LOGFILE=/oracle/lw/kill_bug15934834.log
# number of connect socket
NUM=`netstat -an|grep 61020|wc -l`
PSINFO=`ps -ef|grep "grid/bin/osysmond.bin"|grep -v grep`
# pid of osysmond.bin
PID=`echo $PSINFO|awk '{print $2}'`
echo "zzz ***"`date '+%a %b %e %T %Z %Y'` >> $LOGFILE
echo "Number of socket osysmond:$NUM" >> $LOGFILE
if [ $NUM -ge 50000 ]
then 
 echo `ps -ef|head -n 1` >> $LOGFILE
 echo $PSINFO >> $LOGFILE
 if [[ "" !=  "$PID" ]]; then
  echo "Warning: osysmond.bin pid $PID will be kill..." >> $LOGFILE
  kill -9 $PID
  if [ $? -ne 0 ]; then
   echo "Error: osysmond.bin pid $PID kill failed!" >> $LOGFILE
  fi
 fi
fi




本文地址:https://liuchunjie.top/?id=442
版权声明:本文为原创文章,版权归 admin 所有,欢迎分享本文,转载请保留出处!

 发表评论


表情

还没有留言,还不快点抢沙发?