
OS environment : Oracle Linux 6.8 (64bit)


DB environment : Oracle Database 11.2.0.4 RAC


Method : node reboot when the cssd process is stopped in Oracle 11g R2 RAC

In Oracle 11g R2 there are many cssd-related background processes, including cssdmonitor and cssdagent.

The cssdagent process monitors the startup, shutdown, and status of the ocssd daemon.

If cssdagent finds that the ocssd daemon has stopped, it shuts down the node to guarantee data integrity.

The cssdmonitor process monitors CPU scheduling on the node.

If cssdmonitor detects a node hang, it shuts down the node to guarantee data integrity.

This behavior is known as I/O fencing.
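
The heartbeat thresholds that drive this fencing behavior (they also show up later in this post as millisecond limits in the agent logs) can be inspected with crsctl. A minimal check, assuming the Grid Infrastructure bin directory is in the PATH:

$ crsctl get css misscount    # maximum time (s) without a network heartbeat
$ crsctl get css disktimeout  # maximum time (s) without a voting disk heartbeat
$ crsctl get css reboottime   # time (s) allowed for the node reboot itself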



Test

Check the current grid status

$ crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.LISTENER.lsnr
               ONLINE  ONLINE       rac1                                         
               ONLINE  ONLINE       rac2                                         
ora.OCR_VOTE.dg
               ONLINE  ONLINE       rac1                                         
               ONLINE  ONLINE       rac2                                         
ora.ORADATA.dg
               ONLINE  ONLINE       rac1                                         
               ONLINE  ONLINE       rac2                                         
ora.ORAFRA.dg
               ONLINE  ONLINE       rac1                                         
               ONLINE  ONLINE       rac2                                         
ora.asm
               ONLINE  ONLINE       rac1                     Started             
               ONLINE  ONLINE       rac2                     Started             
ora.gsd
               OFFLINE OFFLINE      rac1                                         
               OFFLINE OFFLINE      rac2                                         
ora.net1.network
               ONLINE  ONLINE       rac1                                         
               ONLINE  ONLINE       rac2                                         
ora.ons
               ONLINE  ONLINE       rac1                                         
               ONLINE  ONLINE       rac2                                         
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       rac1                                         
ora.cvu
      1        ONLINE  ONLINE       rac1                                         
ora.rac1.vip
      1        ONLINE  ONLINE       rac1                                         
ora.rac2.vip
      1        ONLINE  ONLINE       rac2                                         
ora.racdb.db
      1        ONLINE  ONLINE       rac1                     Open                
      2        ONLINE  ONLINE       rac2                     Open                
ora.scan1.vip
      1        ONLINE  ONLINE       rac1           

Everything is normal.
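
Overall cluster health can also be verified before the test; for example, assuming crsctl is in the PATH:

$ crsctl check cluster -all   # CRS / CSS / EVM status on every node
$ crsctl check crs            # High Availability Services status on the local node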



On node 1:

Check the ocssd.bin process ID

$ ps -ef | grep cssd
root      8805     1  0 20:38 ?        00:00:00 /app/grid/product/11.2.0/grid/bin/cssdmonitor
root      8825     1  0 20:38 ?        00:00:00 /app/grid/product/11.2.0/grid/bin/cssdagent
oracle    8837     1  0 20:38 ?        00:00:00 /app/grid/product/11.2.0/grid/bin/ocssd.bin
root     10903  2266  0 20:42 pts/0    00:00:00 grep --color=auto cssd

The PID is 8837.
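
For reference, the PID can also be obtained in a single step with pgrep instead of the ps | grep above:

$ pgrep -f ocssd.bin
8837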



Send a STOP signal to the ocssd.bin process

$ kill -SIGSTOP 8837
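
SIGSTOP only suspends the process, so ocssd.bin stops sending heartbeats without terminating. For reference (not executed in this test), the process could be resumed with SIGCONT; if sent before the limit expires, the heartbeats should resume and the reboot may be averted:

$ kill -SIGCONT 8837   # resume the stopped ocssd.bin process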



A short time later (about 28 seconds, matching the 28040 ms limit reported in the logs below), node 1 is automatically rebooted.

$ ping rac2
PING rac2 (192.168.137.51) 56(84) bytes of data.
64 bytes from rac2 (192.168.137.51): icmp_seq=1 ttl=64 time=0.243 ms
64 bytes from rac2 (192.168.137.51): icmp_seq=2 ttl=64 time=0.192 ms
 
Connection closed by foreign host.

While the ping test from node 1 to node 2 was running, the shell session connected to node 1 was dropped, confirming that node 1 went down for the reboot.
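
Once node 1 comes back up, the reboot can also be confirmed from the OS side, for example:

$ last reboot | head -2   # most recent reboot records from wtmp
$ uptime                  # uptime should be only a few minutes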



Check the /etc/oracle/lastgasp/cssmonit_rac1.lgl file

$ cd /etc/oracle/lastgasp
$ more cssmonit_rac1.lgl 
OLG01,0303,0000,rac-cluster,6391cd3ebdbfef74ffcd2c0f9ec71892,rac1,cssmonit,L-2020-07-06-07:43:23.739,Rebooting after limit 28040 exceeded; disk timeout 27540
, network timeout 28040, last heartbeat from CSSD at epoch seconds 1594035775.60528130 milliseconds ago based on invariant clock value of 484284

The message "Rebooting after limit 28040 exceeded" can be seen here.

According to the lastgasp timestamp the reboot happened at L-2020-07-06-07:43:23; note that this is not local wall-clock time (the actual local time was in the evening, around 20:43, as the /var/log/messages output below shows).
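
The "last heartbeat from CSSD at epoch seconds 1594035775" value in this record can be converted to local wall-clock time to cross-check the timeline, for example with GNU date (output shown assumes TZ=Asia/Seoul):

$ date -d @1594035775
Mon Jul  6 20:42:55 KST 2020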



Check the cssdagent log file

$ cd $GRID_HOME/log/rac1/agent/ohasd/oracssdagent_root
$ vi oracssdagent_root.log
Before the failure
2020-07-06 07:41:59.632: [CLSFRAME][734287680] TM [MultiThread] is changing desired thread # to 3. Current # is 2
2020-07-06 07:42:59.634: [    AGFW][734287680] Recvd request to shed the threads
2020-07-06 07:42:59.634: [CLSFRAME][734287680] TM [MultiThread] is changing desired thread # to 2. Current # is 3
2020-07-06 07:42:59.634: [CLSFRAME][615606016]{0:15:12} Worker thread is exiting in TM [MultiThread] to meet the desired count of 2. New count is 2
2020-07-06 07:42:59.634: [CLSFRAME][734287680] TM [MultiThread] is changing desired thread # to 3. Current # is 2
 
After the failure
2020-07-06 07:43:10.610: [ USRTHRD][631908096(:CLSN00111:)clsnproc_needreboot: Impending reboot at 50% of limit 28040; disk timeout 27540, network timeout 28040, last heartbeat from CSSD at epoch seconds 1594035775.60515000 milliseconds ago based on invariant clock 484284; now polling at 100 ms
2020-07-06 07:43:10.610: [ USRTHRD][631908096] clsnwork_queue: posting worker thread
2020-07-06 07:43:10.610: [ USRTHRD][631908096] clsnwork_queue: posting worker thread
2020-07-06 07:43:10.610: [ USRTHRD][619824896] clsnwork_process_work: calling sync
2020-07-06 07:43:10.617: [ USRTHRD][619824896] clsnwork_process_work: sync completed
2020-07-06 07:43:10.617: [ USRTHRD][619824896] clsnSyncComplete: posting omon
2020-07-06 07:43:10.617: [ USRTHRD][619824896] clsnwork_process_work: Unknown work type
2020-07-06 07:43:16.725: [ USRTHRD][631908096(:CLSN00111:)clsnproc_needreboot: Impending reboot at 75% of limit 28040; disk timeout 27540, network timeout 28040, last heartbeat from CSSD at epoch seconds 1594035775.60521120 milliseconds ago based on invariant clock 484284; now polling at 100 ms
2020-07-06 07:43:16.725: [ USRTHRD][631908096] clsnwork_queue: posting worker thread
2020-07-06 07:43:16.725: [ USRTHRD][631908096] clsnwork_queue: posting worker thread
2020-07-06 07:43:16.725: [ USRTHRD][619824896] clsnwork_process_work: calling sync
2020-07-06 07:43:16.729: [ USRTHRD][619824896] clsnwork_process_work: sync completed
2020-07-06 07:43:16.729: [ USRTHRD][619824896] clsnSyncComplete: posting omon
2020-07-06 07:43:16.729: [ USRTHRD][619824896] clsnwork_process_work: Unknown work type
2020-07-06 07:43:20.936: [ USRTHRD][631908096(:CLSN00111:)clsnproc_needreboot: Impending reboot at 90% of limit 28040; disk timeout 27540, network timeout 28040, last heartbeat from CSSD at epoch seconds 1594035775.60525330 milliseconds ago based on invariant clock 484284; now polling at 100 ms
2020-07-06 07:43:20.936: [ USRTHRD][631908096] clsnwork_queue: posting worker thread
2020-07-06 07:43:20.936: [ USRTHRD][631908096] clsnwork_queue: posting worker thread
2020-07-06 07:43:20.936: [ USRTHRD][619824896] clsnwork_process_work: calling sync
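
The reboot countdown entries (50% / 75% / 90% of the limit) can be pulled out of the agent log with a simple grep on the function name that appears above:

$ grep -n "clsnproc_needreboot" oracssdagent_root.log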



Check the cssdmonitor log file

$ cd $GRID_HOME/log/rac1/agent/ohasd/oracssdmonitor_root
$ vi oracssdmonitor_root.log 
Before the failure
2020-07-06 07:41:27.468: [CLSFRAME][2084333376] TM [MultiThread] is changing desired thread # to 3. Current # is 2
2020-07-06 07:42:27.469: [    AGFW][2084333376] Recvd request to shed the threads
2020-07-06 07:42:27.469: [CLSFRAME][2084333376] TM [MultiThread] is changing desired thread # to 2. Current # is 3
2020-07-06 07:42:27.469: [CLSFRAME][1971447552]{0:13:12} Worker thread is exiting in TM [MultiThread] to meet the desired count of 2. New count is 2
2020-07-06 07:42:27.469: [CLSFRAME][2084333376] TM [MultiThread] is changing desired thread # to 3. Current # is 2
 
After the failure
2020-07-06 07:43:10.610: [ USRTHRD][1967245056(:CLSN00111:)clsnproc_needreboot: Impending reboot at 50% of limit 28040; disk timeout 27540, network timeout 28040, last heartbeat from CSSD at epoch seconds 1594035775.60515000 milliseconds ago based on invariant clock 484284; now polling at 100 ms
2020-07-06 07:43:10.610: [ USRTHRD][1967245056] clsnwork_queue: posting worker thread
2020-07-06 07:43:10.610: [ USRTHRD][1967245056] clsnwork_queue: posting worker thread
2020-07-06 07:43:10.610: [ USRTHRD][1964091136] clsnwork_process_work: calling sync
2020-07-06 07:43:10.618: [ USRTHRD][1964091136] clsnwork_process_work: sync completed
2020-07-06 07:43:10.618: [ USRTHRD][1964091136] clsnSyncComplete: posting omon
2020-07-06 07:43:10.618: [ USRTHRD][1964091136] clsnwork_process_work: Unknown work type
2020-07-06 07:43:16.724: [ USRTHRD][1967245056(:CLSN00111:)clsnproc_needreboot: Impending reboot at 75% of limit 28040; disk timeout 27540, network timeout 28040, last heartbeat from CSSD at epoch seconds 1594035775.60521120 milliseconds ago based on invariant clock 484284; now polling at 100 ms
2020-07-06 07:43:16.724: [ USRTHRD][1967245056] clsnwork_queue: posting worker thread
2020-07-06 07:43:16.724: [ USRTHRD][1967245056] clsnwork_queue: posting worker thread
2020-07-06 07:43:16.724: [ USRTHRD][1964091136] clsnwork_process_work: calling sync
2020-07-06 07:43:16.729: [ USRTHRD][1964091136] clsnwork_process_work: sync completed
2020-07-06 07:43:16.729: [ USRTHRD][1964091136] clsnSyncComplete: posting omon
2020-07-06 07:43:16.729: [ USRTHRD][1964091136] clsnwork_process_work: Unknown work type
2020-07-06 07:43:20.932: [ USRTHRD][1967245056(:CLSN00111:)clsnproc_needreboot: Impending reboot at 90% of limit 28040; disk timeout 27540, network timeout 28040, last heartbeat from CSSD at epoch seconds 1594035775.60525320 milliseconds ago based on invariant clock 484284; now polling at 100 ms
2020-07-06 07:43:20.932: [ USRTHRD][1967245056] clsnwork_queue: posting worker thread
2020-07-06 07:43:20.932: [ USRTHRD][1967245056] clsnwork_queue: posting worker thread
2020-07-06 07:43:20.932: [ USRTHRD][1964091136] clsnwork_process_work: calling sync



Check the grid alert log file

$ cd $GRID_HOME/log/rac1
$ vi alertrac1.log 
2020-07-06 07:39:39.928:
[crsd(9215)]CRS-2772:Server 'rac2' has been assigned to pool 'ora.racdb'.
<- the node was rebooted after this point
2020-07-06 07:43:45.971:
[ohasd(1435)]CRS-2112:The OLR service started on node rac1.
2020-07-06 07:43:46.017:
[ohasd(1435)]CRS-1301:Oracle High Availability Service started on node rac1.
2020-07-06 07:43:46.021:
[ohasd(1435)]CRS-8011:reboot advisory message from host: rac1, component: cssagent, with time stamp: L-2020-07-06-02:15:19.697
[ohasd(1435)]CRS-8013:reboot advisory message text: Rebooting after limit 28330 exceeded; disk timeout 28330, network timeout 28160, last heartbeat from CSSD at epoch seconds 1594016091.35028351 milliseconds ago based on invariant clock value of 17689723
2020-07-06 07:43:46.022:
[ohasd(1435)]CRS-8011:reboot advisory message from host: rac1, component: cssmonit, with time stamp: L-2020-07-06-07:43:23.739
[ohasd(1435)]CRS-8013:reboot advisory message text: Rebooting after limit 28040 exceeded; disk timeout 27540, network timeout 28040, last heartbeat from CSSD at epoch seconds 1594035775.60528130 milliseconds ago based on invariant clock value of 484284
2020-07-06 07:43:46.022:
[ohasd(1435)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 2 were announced and 0 errors occurred
2020-07-06 07:43:46.667:
[/app/grid/product/11.2.0/grid/bin/oraagent.bin(2447)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/app/grid/product/11.2.0/grid/log/rac1/agent/ohasd/oraagent_oracle/oraagent_oracle.log"
2020-07-06 07:43:49.735:
[/app/grid/product/11.2.0/grid/bin/orarootagent.bin(2451)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).
2020-07-06 07:43:51.850:
[ohasd(1435)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).
2020-07-06 07:43:51.869:
[gpnpd(2569)]CRS-2328:GPNPD started on node rac1.
2020-07-06 07:43:54.161:
[cssd(2673)]CRS-1713:CSSD daemon is started in clustered mode
2020-07-06 07:43:56.034:
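
The reboot advisory entries (CRS-8011 / CRS-8013) replayed into the alert log after startup can be located with a grep on those message codes:

$ grep -E "CRS-8011|CRS-8013" alertrac1.log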



Check /var/log/messages

# vi /var/log/messages
Jul  6 20:43:34 rac1 ohasd: Starting ohasd:
Jul  6 20:43:34 rac1 logger: Oracle HA daemon is enabled for autostart.
Jul  6 20:43:34 rac1 libvirtd: 2020-07-06 11:43:34.275+0000: 1303: info : libvirt version: 3.2.0, package: 14.el7 (Unknown, 2017-08-03-07:50:34, x86-ol7-builder-01.us.oracle.com)
Jul  6 20:43:34 rac1 libvirtd: 2020-07-06 11:43:34.275+0000: 1303: info : hostname: rac1
Jul  6 20:43:34 rac1 libvirtd: 2020-07-06 11:43:34.275+0000: 1303: error : virDriverLoadModuleFile:59 : failed to load module /usr/lib64/libvirt/storage-backend/libvirt_storage_backend_rbd.so /usr/lib64/libvirt/storage-backend/libvirt_storage_backend_rbd.so: undefined symbol: rbd_diff_iterate2
Jul  6 20:43:34 rac1 systemd: Started Virtualization daemon.
Jul  6 20:43:34 rac1 avahi-daemon[799]: Invalid legacy unicast query packet.
Jul  6 20:43:34 rac1 ohasd: CRS-4123: Oracle High Availability Services has been started.
Jul  6 20:43:34 rac1 systemd: Started (null).
Jul  6 20:43:34 rac1 systemd: Starting LSB: Starts the Spacewalk Daemon...
Jul  6 20:43:34 rac1 logger: exec /app/grid/product/11.2.0/grid/perl/bin/perl -I/app/grid/product/11.2.0/grid/perl/lib /app/grid/product/11.2.0/grid/bin/crswrapexece.pl /app/grid/product/11.2.0/grid/crs/install/s_crsconfig_rac1_env.txt /app/grid/product/11.2.0/grid/bin/ohasd.bin "reboot"
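
After the reboot it is worth confirming that Clusterware restarted cleanly on node 1 and that all resources returned to the state shown at the beginning of the test, for example:

$ crsctl check crs      # OHAS / CRS / CSS / EVM online on the local node
$ crsctl stat res -t    # all resources back ONLINE on rac1 and rac2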




References :

https://www.hhutzler.de/blog/eviction_cssd_hang/

https://community.oracle.com/thread/3199149

Pro Oracle Database 11g RAC on Linux (book)