检查日志发现是系统tcp资源枯竭,找到占用连接数多的进程,重启就好了
TNS-12518: TNS:listener could not hand off client connection
TNS-12547: TNS:lost contact
TNS-12560: TNS:protocol adapter error
TNS-00517: Lost contact
IBM/AIX RISC System/6000 Error: 32: Broken pipe
pu、内存资源充足,数据库连接数未满,在MOS 发现NOTE 1084443.1
搭述了现象很是接近,那里描述是aix 6引入新特性“ Traffic Regulation” 限制了tcp port 的范围,当client
尝试打开一个新socket时, 当发现socket达到最大上限时会提示上面的错误, 使用下面命令查看是否启用了该特性:
tcptr -show
但是我这里的环境并没有并没有启用TCP Traffic Regulation policy, 之少给了排查的方向,查看一下当前的tcp 连接数。
[oracle@anbob1:/home/oracle]# ps -ef|wc -l 1207 anbob1:/home/grid> netstat -an|wc -l 62954 <<<<<<< anbob1:/home/grid> netstat -an|grep 61020|wc -l 56361 <<<<<<< anbob1:/> lsattr -E -l sys0 | grep maxuproc maxuproc 16384 Maximum number of PROCESSES allowed per user True [oracle@anbob1:/home/oracle]# /usr/sbin/no -a | fgrep ephemeral tcp_ephemeral_high = 65500 tcp_ephemeral_low = 9000 udp_ephemeral_high = 65500 udp_ephemeral_low = 9000
Note:
这里的显示的连接数好高,另一节点network连接数在5000左右,看一下连接明细满屏的tcp连接都是本机的private
interconnect ip(如下),而且在都是通过61020 port发出,
这时送了一口气,不会是主机上某个程序的问题吧,这样就可以把球踢给OS了, 呵当然不好,
一个负责的DBA不能止步于此,下面要判断时哪个程序打开了这个端口的socket 连接。
... tcp4 0 0 192.168.60.11.48817 192.168.60.11.61020 CLOSE_WAIT tcp4 0 0 192.168.60.11.24318 192.168.60.11.61020 CLOSE_WAIT tcp4 0 0 192.168.60.11.24319 192.168.60.11.61020 CLOSE_WAIT tcp4 0 0 192.168.60.11.48818 192.168.60.11.61020 CLOSE_WAIT tcp4 0 0 192.168.60.11.24320 192.168.60.11.61020 CLOSE_WAIT tcp4 0 0 192.168.60.11.48819 192.168.60.11.61020 CLOSE_WAIT tcp4 0 0 192.168.60.11.24321 192.168.60.11.61020 CLOSE_WAIT tcp4 0 0 192.168.60.11.48820 192.168.60.11.61020 CLOSE_WAIT ...
下面就来演示如果根据socket port 定位process
anbob1:/> netstat -aAn|grep CLOS*|head f1000e003b08dbb8 tcp4 0 0 192.168.60.11.60433 192.168.60.11.61020 CLOSE_WAIT f1000e00354a8bb8 tcp4 0 0 192.168.60.11.35934 192.168.60.11.61020 CLOSE_WAIT f1000e0006fd5bb8 tcp4 0 0 192.168.60.11.11435 192.168.60.11.61020 CLOSE_WAIT f1000e000546dbb8 tcp 0 0 *.* *.* CLOSED f1000e0005527bb8 tcp4 0 0 *.* *.* CLOSED f1000e0004bf93b8 tcp 0 0 *.* *.* CLOSED f1000e0004becbb8 tcp4 0 0 *.* *.* CLOSED f1000e0004b2ebb8 tcp 0 0 *.* *.* CLOSED f1000e0004b34bb8 tcp4 0 0 *.* *.* CLOSED f1000e00139d03b8 tcp4 0 0 192.168.60.11.35935 192.168.60.11.61020 CLOSE_WAIT anbob1:/> kdb START END <name> 0000000000001000 0000000004160000 start+000FD8 F00000002FF47600 F00000002FFDF9C8 __ublock+000000 000000002FF22FF4 000000002FF22FF8 environ+000000 000000002FF22FF8 000000002FF22FFC errno+000000 F1000F0A00000000 F1000F0A10000000 pvproc+000000 F1000F0A10000000 F1000F0A18000000 pvthread+000000 read vscsi_scsi_ptrs OK, ptr = 0x0 (0)> sockinfo f1000e003b08dbb8 tcpcb ---- TCPCB ----(@ F1000E003B08DBB8)---- seg_next......@F1000E003B08DBB8 seg_prev......@F1000E003B08DBB8 t_softerror... 00000000 t_state....... 00000005 (CLOSE_WAIT) t_timer....... 00000000 (TCPT_REXMT) t_timer....... 00000000 (TCPT_PERSIST) t_timer....... 00000000 (TCPT_KEEP) t_timer....... 00000000 (TCPT_2MSL) t_rxtshift.... 00000000 t_rxtcur...... 00000004 t_dupacks..... 00000000 t_maxseg...... 000041CC t_force....... 00000000 t_flags....... 000003E0 (RFC1323|SENT_WS|RCVD_WS|SENT_TS|RCVD_TS|COPYFLAGS) t_oobflags.... 00000000 () t_template....@0000000000000000 t_inpcb.......@F1000E003B08DAA0 t_iobc........ 00000000 t_timestamp... FAC3B601 snd_una....... 47F28D30 snd_nxt....... 47F28D30 snd_up........ 47F28D30 snd_wl1....... 866BAB50 snd_wl2....... 47F28D30 iss........... 47F28B50 snd_wnd....... 0000000000020E04 rcv_wnd....... 0000000000020E60 rcv_nxt....... 866BAB51 rcv_up........ 866BAB50 irs........... 866BAB4F snd_wnd_scale. 00000002 rcv_wnd_scale. 00000002 req_scale_sent 00000002 req_scale_rcvd 00000002 last_ack_sent. 866BAB51 timestamp_rec. 5607037C timestamp_age. 000E42DC rcv_adv....... 866DB9B1 snd_max....... 47F28D30 snd_cwnd...... 0000000000010730 snd_ssthresh.. 000000003FFFC000 t_idle........ 000042DC t_rtt......... 00000000 t_rtseq....... 47F28CD7 t_srtt........ 00000007 t_rttvar...... 00000004 t_rttmin...... 00000002 max_rcvd...... 0000000000000000 max_sndwnd.... 0000000000020E60 t_peermaxseg.. 000041CC snd_in_pipe... 00000000 sack_data.....@0000000000000000 snd_recover... 00000000 snd_high...... 47F28D30 snd_ecn_max... 47F28D30 snd_ecn_clear. 47F28D31 t_splice_with.@0000000000000000 t_splice_flags 00000000 ... ---- SOCKET INFO ----(@ F1000E003B08D808)---- type........ 0001 (STREAM) opts........ 0000 () linger...... 0000 state....... FFFF81A2 (ISCONNECTED|CANTRCVMORE|PRIV|NBIO) pcb.....@F1000E003B08DAA0 proto...@0000000004875628 lock....@F1000E003B959080 head....@0000000000000000 q0......@0000000000000000 q.......@0000000000000000 q0len....... 0000 qlen........ 0000 qlimit...... 0000 timeo....... 0000 error....... 0000 special..... 0A08 pgid.... 0000000000000000 oobmark. 0000000000000000 snd:cc...... 0000000000000000 hiwat... 0000000000020E60 mbcnt... 0000000000000000 mbmax... 0000000000083980 lowat... 0000000000003FFF mb......@0000000000000000 sel.....@0000000000000000 events...... 0000 iodone.. 00000000 ioargs..@0000000000000000 lastpkt.@F1000E0019F10C00 wakeone. FFFFFFFFFFFFFFFF timer...@0000000000000000 timeo... 00000000 flags....... 0000 () wakeup.. 00000000 wakearg.@0000000000000000 lockwtg. FFFFFFFFFFFFFFFF .... more proc/fd: 3250/54028 proc/fd: fd: 54028 SLOT NAME STATE PID PPID ADSPACE CL #THS pvproc+32C800 3250*osysmond ACTIVE 0B2037C 0000001 0000000C2DA42590 0 000C ######<<<<<<<<<< (0)> hcal 0B2037C Value hexa: 00B2037C Value decimal: 11666300 (0)> quit anbob1:/> ps -ef|grep 11666300 root 11666300 1 3 Apr 15 - 12970:57 /oracle/app/11.2.0.3/grid/bin/osysmond.bin root 1639464 19661020 0 14:10:21 pts/5 0:00 grep 11666300
Note:
这里我们又不得不把球抱起来,这是一个oracle的程序osysmond.bin, 定位程序在aix 还可以使用lsof 但是本机没有带,需要下载就一个执行程序
[oracle@anbob1:/home/oracle]# ftp 133.96.60.19 ftp: connect: Resource temporarily unavailable
Note:
当时FTP 都提示connect 资源都不可用,可见OS的tcp 连接肯定出现了枯竭
anbob1:/> lsof -i:61020 osysmond. 11666300 root *738u IPv4 0xf1000e0041ff4bb8 0t0 TCP anbob1-prv:34166->anbob1-prv:61020 (CLOSE_WAIT) osysmond. 11666300 root *739u IPv4 0xf1000e0007091bb8 0t0 TCP anbob1-prv:35703->anbob1-prv:61020 (CLOSE_WAIT) osysmond. 11666300 root *740u IPv4 0xf1000e004283bbb8 0t0 TCP anbob1-prv:32682->anbob1-prv:61020 (CLOSE_WAIT) ...
TIP:
接下来在去MOS
以”osysmond.bin netstat”为关键字搜索,第一个就是我们想要的“Bug 15934834 osysmond.bin has
lot of sockets open in close_wait state” 现象完全一致是因为osysmond.bin leaks
socket. 但是MOS里 Workaround None, 而且当前环境也是在11.2.0.4 修复。
“Which socket states from netstat are candidates for the rmsock command?
FIN_WAIT_1,FIN_WAIT_2, TIME_WAIT, CLOSING are all states where the application performed an active close of the socket.
How do we know a socket is no longer associated with a process?
You
can use lsof to check. It is highly recommend that you use netstat in
conjunction with lsof to validate the situation prior to doing anything
drastic.
Which socket states are not candidates for the rmsock command?
CLOSE_WAIT, CLOSED
”
停掉osysmond.bin进程就可以释放那些socket,
这里我是建议重启本节点的Instance和CRS后再观察, osysmond.bin是个monitor进程,有agent
守护,当天我们尝试最小修改通过OS kiil 了osysmond.bin ,agent自动重启了该进程且instence
未受影响,kill了该进程后socket释放,listener 恢复了正常。不过事后发现osysmond.bin
调用的socket还是有上涨趋势,如果下次有停机时间我们会尝试重启CRS.
Summary:
本次的问题是因为oracle bug 导致osysmond.bin 调用tcp socket 泄漏,导致占满了OS的TCP socket 连接,随后FTP 和监听变的异常,但是偶尔有连接释放socket,所以当时的监听表现是时好时坏。
–2015-7-29 updated
重启crs 也无法避免该问题,在11.2.0.3中 11.2.0.3.7 patch 修复了这个bug.
Patch description: “Grid Infrastructure Patch Set Update : 11.2.0.3.7 (16742216)”
Created on 24 Jun 2013, 22:52:45 hrs PST8PDT
Bugs fixed:
16619898, 16315641, 15876003, 14275572, 13919095, 13696251, 13348650
12659561, 14305980, 14277586, 13987807, 14625969, 13825231, 12794268
13000491, 13498267, 15934834, 11675721, 14082976, 12771830, 14515980
可以使用shell 加crontab 定时去检查
#!/bin/sh ############################################################### ## file: kill_bug15934834.sh ## ## this scripts to clear connect tcp socket leak by osysmond.bin process ## Temporary workaround fixed bug 15934834 ## author: weejar (anbob.com) ## date: 2015-7-29 ## note: run as root, after kill osysmond process, it will automatically restart ## ################################################################ # logfile LOGFILE=/oracle/lw/kill_bug15934834.log # number of connect socket NUM=`netstat -an|grep 61020|wc -l` PSINFO=`ps -ef|grep "grid/bin/osysmond.bin"|grep -v grep` # pid of osysmond.bin PID=`echo $PSINFO|awk '{print $2}'` echo "zzz ***"`date '+%a %b %e %T %Z %Y'` >> $LOGFILE echo "Number of socket osysmond:$NUM" >> $LOGFILE if [ $NUM -ge 50000 ] then echo `ps -ef|head -n 1` >> $LOGFILE echo $PSINFO >> $LOGFILE if [[ "" != "$PID" ]]; then echo "Warning: osysmond.bin pid $PID will be kill..." >> $LOGFILE kill -9 $PID if [ $? -ne 0 ]; then echo "Error: osysmond.bin pid $PID kill failed!" >> $LOGFILE fi fi fi
发表评论