OS environment: Oracle Linux 8.8 (64bit)
DB environment: Oracle Database 19.27.0.0
Method: stop the private IP on an Oracle 19c RAC node, then verify fencing and eviction behavior
This test checks what happens in an Oracle 19c 2-node RAC environment when communication over the private IP (cluster interconnect) is lost.
When the private IP stops working, a failure is declared after a set amount of time. That time is controlled by the misscount parameter; the default is 30 seconds, and once it is exceeded the affected node is fenced (restarted).
The misscount value can be tuned, but setting it too high delays failure detection, while setting it too low can cause false positives.
What is fencing?
In Oracle RAC, if a node cannot communicate over the private interconnect or access the voting disk for longer than the allowed time (misscount),
that node is automatically rebooted. This is called fencing, and its purpose is to prevent split-brain.
What is eviction?
In Oracle RAC, eviction means forcibly removing from the cluster a node that can no longer communicate within it.
If heartbeat signals are missing or voting disk access fails for longer than the allowed time, the node is judged unhealthy and evicted to preserve cluster stability.
An evicted node normally attempts recovery through an automatic reboot (fencing).
In this post, with RAC running normally, the private IP on node 2 is brought down with ifdown and the resulting fencing and eviction behavior is observed.
Test
Check /etc/hosts
# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
##Public
192.168.137.161 ora19rac1
192.168.137.162 ora19rac2
##Private
10.10.10.10 ora19rac1-priv
10.10.10.20 ora19rac2-priv
##Virtual
192.168.137.61 ora19rac1-vip
192.168.137.62 ora19rac2-vip
##SCAN
192.168.137.200 ora19rac-scan
The current private IP is in the 10.x range.
Check ifconfig
# ifconfig
ens160: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.137.161 netmask 255.255.255.0 broadcast 192.168.137.255
inet6 fe80::250:56ff:fea8:50fd prefixlen 64 scopeid 0x20<link>
ether 00:50:56:a8:50:fd txqueuelen 1000 (Ethernet)
RX packets 8411 bytes 1288393 (1.2 MiB)
RX errors 205 dropped 218 overruns 0 frame 0
TX packets 407 bytes 57223 (55.8 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens192: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.137.161 netmask 255.255.255.0 broadcast 192.168.137.255
inet6 fe80::250:56ff:fea8:7c06 prefixlen 64 scopeid 0x20<link>
ether 00:50:56:a8:7c:06 txqueuelen 1000 (Ethernet)
RX packets 33768 bytes 20472126 (19.5 MiB)
RX errors 524 dropped 563 overruns 0 frame 0
TX packets 8575 bytes 3427402 (3.2 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens192:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.137.61 netmask 255.255.255.0 broadcast 192.168.137.255
ether 00:50:56:a8:7c:06 txqueuelen 1000 (Ethernet)
ens224: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.10.10.10 netmask 255.255.255.0 broadcast 10.10.10.255
inet6 fe80::250:56ff:fea8:7bfb prefixlen 64 scopeid 0x20<link>
ether 00:50:56:a8:7b:fb txqueuelen 1000 (Ethernet)
RX packets 97634 bytes 95368531 (90.9 MiB)
RX errors 416 dropped 454 overruns 0 frame 0
TX packets 42488 bytes 36035899 (34.3 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
The current private IP is in the 10.x range, on interface ens224.
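To cross-check which interface Clusterware itself has registered as the cluster interconnect (rather than relying on ifconfig alone), the oifcfg utility can be used. A minimal sketch run as the grid owner; the sample output is only illustrative and the subnets/roles depend on how the cluster was configured:
$ oifcfg getif                  # interfaces registered in the cluster and their roles
ens192  192.168.137.0  global  public
ens224  10.10.10.0  global  cluster_interconnect,asm
$ oifcfg iflist -p -n           # all interfaces visible to this node, with type and netmask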
Check grid status
# crsctl stat res -t
--------------------------------------------------------------------------------
Name Target State Server State details
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.LISTENER.lsnr
ONLINE ONLINE ora19rac1 STABLE
ONLINE ONLINE ora19rac2 STABLE
ora.chad
ONLINE ONLINE ora19rac1 STABLE
ONLINE ONLINE ora19rac2 STABLE
ora.net1.network
ONLINE ONLINE ora19rac1 STABLE
ONLINE ONLINE ora19rac2 STABLE
ora.ons
ONLINE ONLINE ora19rac1 STABLE
ONLINE ONLINE ora19rac2 STABLE
ora.proxy_advm
OFFLINE OFFLINE ora19rac1 STABLE
OFFLINE OFFLINE ora19rac2 STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.ASMNET1LSNR_ASM.lsnr(ora.asmgroup)
1 ONLINE ONLINE ora19rac1 STABLE
2 ONLINE ONLINE ora19rac2 STABLE
ora.DATA.dg(ora.asmgroup)
1 ONLINE ONLINE ora19rac1 STABLE
2 ONLINE ONLINE ora19rac2 STABLE
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE ora19rac2 STABLE
ora.OCR.dg(ora.asmgroup)
1 ONLINE ONLINE ora19rac1 STABLE
2 ONLINE ONLINE ora19rac2 STABLE
ora.RECO.dg(ora.asmgroup)
1 OFFLINE OFFLINE STABLE
2 ONLINE ONLINE ora19rac2 STABLE
ora.VOTE.dg(ora.asmgroup)
1 OFFLINE OFFLINE STABLE
2 ONLINE ONLINE ora19rac2 STABLE
ora.asm(ora.asmgroup)
1 ONLINE ONLINE ora19rac1 Started,STABLE
2 ONLINE ONLINE ora19rac2 Started,STABLE
ora.asmnet1.asmnetwork(ora.asmgroup)
1 ONLINE ONLINE ora19rac1 STABLE
2 ONLINE ONLINE ora19rac2 STABLE
ora.cvu
1 ONLINE ONLINE ora19rac2 STABLE
ora.ora19db.db
1 ONLINE OFFLINE CLEANING
2 ONLINE ONLINE ora19rac2 Open,HOME=/oracle/ap
p/oracle/product/19c
,STABLE
ora.ora19rac1.vip
1 ONLINE ONLINE ora19rac1 STABLE
ora.ora19rac2.vip
1 ONLINE ONLINE ora19rac2 STABLE
ora.qosmserver
1 ONLINE ONLINE ora19rac2 STABLE
ora.scan1.vip
1 ONLINE ONLINE ora19rac2 STABLE
--------------------------------------------------------------------------------
Everything is normal.
Check the misscount value
# crsctl get css misscount
CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.
It is set to the default of 30 seconds.
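For reference, if this timeout ever needs to be changed, misscount can be adjusted with crsctl as root; a hedged sketch (the value 60 is an example only, and Oracle generally advises keeping the default unless directed otherwise), along with the related disk heartbeat timeout:
# crsctl set css misscount 60   # network heartbeat timeout in seconds (example value only)
# crsctl get css misscount      # confirm the change
# crsctl get css disktimeout    # disk heartbeat timeout (default 200 seconds)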
Watch the fencing logs on node 1 (keep a tail running on each log in separate sessions)
# tail -300f /oracle/app/oracle/diag/crs/ora19rac1/crs/trace/ocssd.trc
# tail -300f /oracle/app/oracle/diag/crs/ora19rac1/crs/trace/crsd.trc
# tail -300f /oracle/app/oracle/diag/crs/ora19rac1/crs/trace/alert.log
# tail -300f /var/log/messages
Check the current time
# date
Tue Jun 17 21:17:01 KST 2025
Bring down ens224 on node 2
# ifdown ens224
Connection 'ens224' successfully deactivated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/3)
Re-check ifconfig
# ifconfig
ens160: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.137.162 netmask 255.255.255.0 broadcast 192.168.137.255
inet6 fe80::250:56ff:fea8:2052 prefixlen 64 scopeid 0x20<link>
ether 00:50:56:a8:20:52 txqueuelen 1000 (Ethernet)
RX packets 14088 bytes 2172689 (2.0 MiB)
RX errors 1487 dropped 1495 overruns 0 frame 0
TX packets 371 bytes 52501 (51.2 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens192: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.137.162 netmask 255.255.255.0 broadcast 192.168.137.255
inet6 fe80::250:56ff:fea8:2f41 prefixlen 64 scopeid 0x20<link>
ether 00:50:56:a8:2f:41 txqueuelen 1000 (Ethernet)
RX packets 36126 bytes 8021600 (7.6 MiB)
RX errors 1220 dropped 1249 overruns 0 frame 0
TX packets 9410 bytes 17001359 (16.2 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens192:2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.137.200 netmask 255.255.255.0 broadcast 192.168.137.255
ether 00:50:56:a8:2f:41 txqueuelen 1000 (Ethernet)
ens192:3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.137.62 netmask 255.255.255.0 broadcast 192.168.137.255
ether 00:50:56:a8:2f:41 txqueuelen 1000 (Ethernet)
ens224: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
ether 00:50:56:a8:ce:36 txqueuelen 1000 (Ethernet)
RX packets 116028 bytes 77297655 (73.7 MiB)
RX errors 1284 dropped 1312 overruns 0 frame 0
TX packets 131051 bytes 127789926 (121.8 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
The IP on ens224 is gone (the interface is down).
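Before reading the traces, the interconnect loss can also be confirmed directly from node 1; node 2's private IP (10.10.10.20) should now be unreachable:
# ping -c 3 -W 1 10.10.10.20    # run on node 1; expect 100% packet loss after the ifdown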
Check the main fencing logs on node 1
Check the cssd log
# tail -300f /oracle/app/oracle/diag/crs/ora19rac1/crs/trace/ocssd.trc
2025-06-17 21:17:21.402 : CSSD:3726550784: [ WARNING] clssnmPollingThread: node ora19rac2 (2) at 50% heartbeat fatal, removal in 14.370 seconds
2025-06-17 21:17:28.405 : CSSD:3726550784: [ WARNING] clssnmPollingThread: node ora19rac2 (2) at 75% heartbeat fatal, removal in 7.370 seconds
2025-06-17 21:17:33.407 : CSSD:3726550784: [ WARNING] clssnmPollingThread: node ora19rac2 (2) at 90% heartbeat fatal, removal in 2.370 seconds,
2025-06-17 21:17:35.778 : CSSD:3726550784: [ INFO] clssnmPollingThread: Removal started for node ora19rac2 (2), flags 0x22040c, state 3, wt4c 0
2025-06-17 21:17:35.778 : CSSD:3726550784: [ INFO] clssnmMarkNodeForRemoval: node 2, ora19rac2 marked for removal
2025-06-17 21:17:35.785 : CSSD:3723396864: [ INFO] clssnmProcessSickNode:1 healthy Nodes found in the cluster clearing maps for sick node 1 to avoid eviction of a healthy node
2025-06-17 21:17:35.792 : CSSD:3723396864: (:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 0 nodes with leader 65535, , loses to cohort of 1 nodes led by node 2, ora19rac2, based on map type 2 evictionreason: cssd internal, winning cohort winreason: the local node is already evicted
2025-06-17 21:17:35.792 : CSSD:3723396864: [ INFO] clssnmCheckForNetworkFailure: expiring 0 evicted 1 evicting node 1 this node 1
2025-06-17 21:17:35.792 : CSSD:3723396864: [ INFO] clssnmCheckForNetworkFailure: expiring 1 evicted 1 evicting node 1 this node 2
2025-06-17 21:17:36.760 : CSSD:3736012544: [ ERROR] (:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node ora19rac2, number 2, sync 645652912, stamp 921804, fence inited 0
2025-06-17 21:17:36.761 : CSSD:3731281664: [ ERROR] (:CSSNM00238:)clssnmvDiskKillCheck: evicted by node ora19rac2, number 2 evictionreason: NHB loss, winning cohort winreason: the cohort is the only one with public network access
The private IP was taken down with ifdown at around 21:17:05, and about 15 seconds later the message "node ora19rac2 (2) at 50% heartbeat fatal, removal in 14.370 seconds" appeared.
The heartbeat-fatal percentage then keeps rising while the remaining time shrinks, after which ora19rac2 is marked for removal.
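While the test is running, the same countdown can be pulled out of the trace with a simple grep instead of scanning the full tail output:
# grep -i "heartbeat fatal" /oracle/app/oracle/diag/crs/ora19rac1/crs/trace/ocssd.trc | tail -5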
Check the crsd log
# tail -300f /oracle/app/oracle/diag/crs/ora19rac1/crs/trace/crsd.trc
2025-06-17 21:17:35.860 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :11
2025-06-17 21:17:35.860 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :49
2025-06-17 21:17:35.861 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :60
2025-06-17 21:17:35.861 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :63
2025-06-17 21:17:35.861 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :67
2025-06-17 21:17:35.861 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :117
2025-06-17 21:17:35.861 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :121
2025-06-17 21:17:35.861 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :124
2025-06-17 21:17:35.861 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :127
2025-06-17 21:17:35.861 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :148
2025-06-17 21:17:35.861 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :223
2025-06-17 21:17:35.861 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :232
2025-06-17 21:17:35.861 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :233
2025-06-17 21:17:35.861 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :234
2025-06-17 21:17:35.861 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :236
2025-06-17 21:17:35.861 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :238
2025-06-17 21:17:35.861 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :247
2025-06-17 21:17:35.861 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :248
2025-06-17 21:17:35.861 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :249
2025-06-17 21:17:35.861 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :252
2025-06-17 21:17:35.861 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :349
2025-06-17 21:17:35.861 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :350
2025-06-17 21:17:35.861 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :351
2025-06-17 21:17:35.861 :UiServer:3348539136: [ INFO] {1:34646:484} Disconnecting client of command id :352
2025-06-17 21:17:35.861 :UiServer:3348539136: [ INFO] {1:34646:484} Sending message: 1850 to AGFW proxy server.
2025-06-17 21:17:35.861 : AGFW:3367450368: [ INFO] {1:34646:484} Agfw Proxy Server received the message: FENCE_CMD[Proxy] ID 20489:1850
2025-06-17 21:17:35.861 : AGFW:3367450368: [ INFO] {1:34646:484} Agfw Proxy Server sending message: RESOURCE_CLEAN[ora.ASMNET1LSNR_ASM.lsnr 1 1] ID 4100:1851 to the agent /oracle/app/grid/19c/bin/oraagent_oracle
2025-06-17 21:17:35.861 : CRSPE:3367450368: [ INFO] {1:34646:484} Skipping Fence of : ora.DATA.dg
2025-06-17 21:17:35.861 : AGFW:3367450368: [ INFO] {1:34646:484} Agfw Proxy Server sending message: RESOURCE_CLEAN[ora.LISTENER.lsnr ora19rac1 1] ID 4100:1852 to the agent /oracle/app/grid/19c/bin/oraagent_oracle
2025-06-17 21:17:35.861 : CRSPE:3367450368: [ INFO] {1:34646:484} Skipping Fence of : ora.OCR.dg
2025-06-17 21:17:35.861 : CRSPE:3367450368: [ INFO] {1:34646:484} Skipping Fence of : ora.RECO.dg
2025-06-17 21:17:35.861 : CRSPE:3367450368: [ INFO] {1:34646:484} Skipping Fence of : ora.VOTE.dg
2025-06-17 21:17:35.861 : CRSPE:3367450368: [ INFO] {1:34646:484} Skipping Fence of : ora.asm
2025-06-17 21:17:35.861 : CRSPE:3367450368: [ INFO] {1:34646:484} Skipping Fence by Type of : ora.asmnet1.asmnetwork
From 21:17:35 (when removal started), clients are disconnected (presumably grid resources).
"Agfw Proxy Server received the message: FENCE_CMD[Proxy] ID 20489:1850" => this means a FENCE (isolate/evict) command was issued against ora19rac1 after it lost out to ora19rac2.
In other words, the cluster decided ora19rac1 had to be taken down in favor of ora19rac2, and the fencing procedure began.
Check the CRS alert log
# tail -300f /oracle/app/oracle/diag/crs/ora19rac1/crs/trace/alert.log
2025-06-17 21:17:14.733 [OCSSD(7067)]CRS-7503: The Oracle Grid Infrastructure process 'ocssd' observed communication issues between node 'ora19rac1' and node 'ora19rac2', interface list of local node 'ora19rac1' is '10.10.10.10:59642;', interface list of remote node 'ora19rac2' is '10.10.10.20:40921;'.
2025-06-17 21:17:14.735 [OCSSD(7067)]CRS-7503: The Oracle Grid Infrastructure process 'ocssd' observed communication issues between node 'ora19rac1' and node 'ora19rac2', interface list of local node 'ora19rac1' is '10.10.10.10:59642;', interface list of remote node 'ora19rac2' is '10.10.10.20:40921;'.
2025-06-17 21:17:18.951 [OCTSSD(7430)]CRS-7503: The Oracle Grid Infrastructure process 'octssd' observed communication issues between node 'ora19rac1' and node 'ora19rac2', interface list of local node 'ora19rac1' is '10.10.10.10:16279;', interface list of remote node 'ora19rac2' is '10.10.10.20:60480;'.
2025-06-17 21:17:21.401 [OCSSD(7067)]CRS-1612: Network communication with node ora19rac2 (2) has been missing for 50% of the timeout interval. If this persists, removal of this node from cluster will occur in 14.370 seconds
2025-06-17 21:17:22.414 [EVMD(6047)]CRS-7503: The Oracle Grid Infrastructure process 'evmd' observed communication issues between node 'ora19rac1' and node 'ora19rac2', interface list of local node 'ora19rac1' is '10.10.10.10:34133;', interface list of remote node 'ora19rac2' is '10.10.10.20:47768;'.
2025-06-17 21:17:28.404 [OCSSD(7067)]CRS-1611: Network communication with node ora19rac2 (2) has been missing for 75% of the timeout interval. If this persists, removal of this node from cluster will occur in 7.370 seconds
2025-06-17 21:17:33.406 [OCSSD(7067)]CRS-1610: Network communication with node ora19rac2 (2) has been missing for 90% of the timeout interval. If this persists, removal of this node from cluster will occur in 2.370 seconds
2025-06-17 21:17:35.792 [OCSSD(7067)]CRS-1609: This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /oracle/app/oracle/diag/crs/ora19rac1/crs/trace/ocssd.trc.
2025-06-17 21:17:35.819 [OCSSD(7067)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /oracle/app/oracle/diag/crs/ora19rac1/crs/trace/ocssd.trc
2025-06-17 21:17:35.853 [OCSSD(7067)]CRS-1652: Starting clean up of CRSD resources.
2025-06-17 21:17:49.782 [CSSDMONITOR(7005)]CRS-1661: The CSS daemon is not responding. If this persists, a reboot will occur in 14029 milliseconds; details are at (:CLSN00121:) in /oracle/app/oracle/diag/crs/ora19rac1/crs/trace/ohasd_cssdmonitor_root.trc.
2025-06-17 21:17:49.782 [CSSDAGENT(7042)]CRS-1661: The CSS daemon is not responding. If this persists, a reboot will occur in 14030 milliseconds; details are at (:CLSN00121:) in /oracle/app/oracle/diag/crs/ora19rac1/crs/trace/ohasd_cssdagent_root.trc.
2025-06-17 21:17:50.054 [OCSSD(7067)]CRS-1654: Clean up of CRSD resources finished successfully.
2025-06-17 21:17:50.107 [OCSSD(7067)]CRS-1655: CSSD on node ora19rac1 detected a problem and started to shutdown.
2025-06-17 21:17:50.197 [ORAAGENT(11273)]CRS-5822: Agent '/oracle/app/grid/19c/bin/oraagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:4:7} in /oracle/app/oracle/diag/crs/ora19rac1/crs/trace/crsd_oraagent_oracle.trc.
2025-06-17 21:17:50.195 [ORAROOTAGENT(7792)]CRS-5822: Agent '/oracle/app/grid/19c/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:2:7} in /oracle/app/oracle/diag/crs/ora19rac1/crs/trace/crsd_orarootagent_root.trc.
2025-06-17 21:17:50.561 [CRSD(30679)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 30679
2025-06-17 21:17:52.843 [CSSDMONITOR(30793)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 30793
2025-06-17T21:17:55.069192+09:00
Errors in file /oracle/app/oracle/diag/crs/ora19rac1/crs/trace/ocssd.trc (incident=17):
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
2025-06-17 21:17:55.050 [OCSSD(7067)]CRS-8503: Oracle Clusterware process OCSSD with operating system process ID 7067 experienced fatal signal or exception code 6.
Incident details in: /oracle/app/oracle/diag/crs/ora19rac1/crs/incident/incdir_17/ocssd_i17.trc
2025-06-17 21:17:56.063 [CRSD(30679)]CRS-0804: Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-23: Error in cluster services layer Cluster services error [ [3]]. Details at (:CRSD00111:) in /oracle/app/oracle/diag/crs/ora19rac1/crs/trace/crsd.trc.
Error messages also appear in the CRS alert log: "communication issues between node 'ora19rac1' and node 'ora19rac2'" indicates a communication problem between the nodes.
"This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity"
shows that this node cannot reach the other nodes and is going down to preserve cluster integrity.
"Network communication with node ora19rac2 (2) has been missing for 50% of the timeout interval.", then 75%, then 90% => within the 30-second misscount window, the messages show the timeout approaching step by step.
Then "CSS daemon is terminating": the CSS daemon shuts down.
"The CSS daemon is not responding. If this persists, a reboot will occur in 14030 milliseconds;" indicates a reboot within about 14 seconds.
"Clean up of CRSD resources finished successfully." and "CSSD on node ora19rac1 detected a problem and started to shutdown." show CRSD resource cleanup completing and the cluster services finishing their shutdown.
"Cluster Ready Service aborted": the CRS daemon terminates.
Check /var/log/messages
# tail -300f /var/log/messages
Jun 17 21:17:55 ora19rac1 systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Jun 17 21:17:55 ora19rac1 systemd[1]: Started Process Core Dump (PID 30847/UID 0).
Jun 17 21:17:55 ora19rac1 systemd-coredump[30848]: Process 7067 (ocssd.bin) of user 54321 dumped core.
The ocssd.bin process with PID 7067 left a core dump (memory dump) and terminated abnormally.
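On Oracle Linux 8 the core captured by systemd-coredump can be listed and inspected with coredumpctl, assuming the tool is installed; a quick sketch:
# coredumpctl list ocssd.bin    # cores captured for the ocssd.bin executable
# coredumpctl info 7067         # signal, timestamp, and storage path for the PID 7067 dump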
Check grid status
# crsctl stat res -t -init
--------------------------------------------------------------------------------
Name Target State Server State details
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
1 ONLINE OFFLINE STABLE
ora.cluster_interconnect.haip
1 ONLINE OFFLINE STABLE
ora.crf
1 ONLINE ONLINE ora19rac1 STABLE
ora.crsd
1 ONLINE OFFLINE STABLE
ora.cssd
1 ONLINE OFFLINE INTERCONNECT FAILURE
,STABLE
ora.cssdmonitor
1 ONLINE ONLINE ora19rac1 STABLE
ora.ctssd
1 ONLINE OFFLINE STABLE
ora.diskmon
1 OFFLINE OFFLINE STABLE
ora.drivers.acfs
1 ONLINE ONLINE ora19rac1 STABLE
ora.evmd
1 ONLINE INTERMEDIATE ora19rac1 STABLE
ora.gipcd
1 ONLINE ONLINE ora19rac1 STABLE
ora.gpnpd
1 ONLINE ONLINE ora19rac1 STABLE
ora.mdnsd
1 ONLINE ONLINE ora19rac1 STABLE
ora.storage
1 ONLINE OFFLINE STABLE
--------------------------------------------------------------------------------
Even though the private IP on node 2 was taken down, no actual reboot occurred; instead, some processes (resources) on node 1, namely cssd and crsd, went down.
The expectation was that if the heartbeat did not recover within 30 seconds, the ora19rac2 node would reboot automatically,
but in practice part of the grid stack went down on node 1, not node 2.
=> After repeating the test several times, it is not always the opposite node that goes down; sometimes the node where ifdown was run goes down instead.
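Node numbers, which factor into the survivor decision discussed in the conclusion, can be checked with olsnodes; the sample output below is illustrative:
$ olsnodes -n -s                # node name, node number, status
ora19rac1       1       Active
ora19rac2       2       Active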
From the Clusterware Administration and Deployment Guide:
Server Weight-Based Node Eviction
You can configure the Oracle Clusterware failure recovery mechanism to choose which
cluster nodes to terminate or evict in the event of a private network (cluster interconnect)
failure.
In a split-brain situation, where a cluster experiences a network split, partitioning the cluster
into disjoint cohorts, Oracle Clusterware applies certain rules to select the surviving cohort,
potentially evicting a node that is running a critical, singleton resource.
You can affect the outcome of these decisions by adding value to a database instance or
node so that, when Oracle Clusterware must decide whether to evict or terminate, it will
consider these factors and attempt to ensure that all critical components remain available.
You can configure weighting functions to add weight to critical components in your cluster,
giving Oracle Clusterware added input when deciding which nodes to evict when resolving a
split-brain situation.
Reference: https://docs.oracle.com/en/database/oracle/oracle-database/21/cwadd/clusterware-administration-and-deployment-guide.pdf
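The weighting described above is exposed through the CSS_CRITICAL attribute (12.1.0.2 and later). A hedged sketch of how a server, or the database used in this test, could be marked critical; whether this is appropriate depends on the workload, and the server-level setting is reported to take effect only after Clusterware restarts on that node:
# crsctl set server css_critical yes                    # run as root on the node to be weighted
$ srvctl modify database -db ora19db -css_critical YES  # alternatively, weight the database resource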
According to GPT, grid sent a reboot signal to the OS, but because no Linux watchdog or fencing agent was running (they were inactive), the reboot never happened.
===
GPT explanation
In RAC, an actual reboot is performed either by inducing a kernel panic through a watchdog device, or by power control via IPMI/iLO and similar mechanisms.
If these are not configured, Clusterware dies but the OS stays up (=> the same as the current situation).
How to check:
lsmod | grep -i watchdog
dmesg | grep -i watchdog
If nothing shows up, only soft fencing is actually being performed.
===
Check watchdog status
# lsmod | grep -i watchdog
# dmesg | grep -i watchdog
[ 0.124257] NMI watchdog: Perf NMI watchdog permanently disabled
It is currently disabled => only soft fencing is taking place.
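In 11.2 and later the kill/restart duties are handled by the cssdagent and cssdmonitor processes rather than a kernel watchdog, so the soft-fencing agents that are actually running can be checked directly:
# ps -ef | grep -E 'cssdagent|cssdmonitor' | grep -v grep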
For further testing, restore the private IP on node 2 and then restart grid on node 1.
Restore the private IP on node 2
# ifup ens224
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/7)
Check ifconfig
# ifconfig
ens160: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.137.162 netmask 255.255.255.0 broadcast 192.168.137.255
inet6 fe80::250:56ff:fea8:2052 prefixlen 64 scopeid 0x20<link>
ether 00:50:56:a8:20:52 txqueuelen 1000 (Ethernet)
RX packets 1229475 bytes 529703794 (505.1 MiB)
RX errors 145965 dropped 146179 overruns 0 frame 0
TX packets 76557 bytes 5971341 (5.6 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens192: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.137.162 netmask 255.255.255.0 broadcast 192.168.137.255
inet6 fe80::250:56ff:fea8:2f41 prefixlen 64 scopeid 0x20<link>
ether 00:50:56:a8:2f:41 txqueuelen 1000 (Ethernet)
RX packets 2064031 bytes 461398963 (440.0 MiB)
RX errors 109236 dropped 109291 overruns 0 frame 0
TX packets 138637 bytes 159441852 (152.0 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens192:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.137.61 netmask 255.255.255.0 broadcast 192.168.137.255
ether 00:50:56:a8:2f:41 txqueuelen 1000 (Ethernet)
ens192:2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.137.200 netmask 255.255.255.0 broadcast 192.168.137.255
ether 00:50:56:a8:2f:41 txqueuelen 1000 (Ethernet)
ens192:3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.137.62 netmask 255.255.255.0 broadcast 192.168.137.255
ether 00:50:56:a8:2f:41 txqueuelen 1000 (Ethernet)
ens224: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.10.10.20 netmask 255.255.255.0 broadcast 10.10.10.255
inet6 fe80::250:56ff:fea8:ce36 prefixlen 64 scopeid 0x20<link>
ether 00:50:56:a8:ce:36 txqueuelen 1000 (Ethernet)
RX packets 1958580 bytes 411829926 (392.7 MiB)
RX errors 113611 dropped 113665 overruns 0 frame 0
TX packets 133364 bytes 129696148 (123.6 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Normal.
Restart grid on node 1
# crsctl stop crs
CRS-2796: The command may not proceed when Cluster Ready Services is not running
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors.
# crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'ora19rac1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'ora19rac1'
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'ora19rac1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'ora19rac1'
CRS-2673: Attempting to stop 'ora.crf' on 'ora19rac1'
CRS-2673: Attempting to stop 'ora.evmd' on 'ora19rac1'
CRS-2677: Stop of 'ora.cssdmonitor' on 'ora19rac1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'ora19rac1'
CRS-2677: Stop of 'ora.mdnsd' on 'ora19rac1' succeeded
CRS-2677: Stop of 'ora.evmd' on 'ora19rac1' succeeded
CRS-2677: Stop of 'ora.crf' on 'ora19rac1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'ora19rac1'
CRS-2677: Stop of 'ora.gpnpd' on 'ora19rac1' succeeded
CRS-2677: Stop of 'ora.gipcd' on 'ora19rac1' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'ora19rac1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'ora19rac1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
# crsctl start crs
When stopping CRS in this state, it does not go down unless the -f option is used.
The watchdog was then enabled and the same test repeated, but the node (server) itself still did not reboot.
The setup steps below are left here for reference rather than deleted.
Enable the watchdog (run on both nodes)
Load the softdog module
# modprobe softdog
# ls /dev/watchdog
/dev/watchdog
Configure automatic loading after reboot
# echo softdog > /etc/modules-load.d/watchdog.conf
Apply the module immediately via systemd
# systemctl restart systemd-modules-load
Grant permissions
# chgrp oinstall /dev/watchdog /dev/watchdog0
# chmod g+rw /dev/watchdog /dev/watchdog0
Verify
# ls -al /dev/watchdog*
crw-rw---- 1 root oinstall 10, 130 Jun 18 17:12 /dev/watchdog
crw-rw---- 1 root oinstall 247, 0 Jun 18 17:12 /dev/watchdog0
Register the softdog module option explicitly and apply it
# echo "options softdog nowayout=1" > /etc/modprobe.d/softdog.conf
Restart the watchdog
# modprobe -r softdog
# modprobe softdog
# wdctl /dev/watchdog
Device: /dev/watchdog
Identity: Software Watchdog [version 0]
Timeout: 60 seconds
Pre-timeout: 0 seconds
FLAG DESCRIPTION STATUS BOOT-STATUS
KEEPALIVEPING Keep alive ping reply 1 0
MAGICCLOSE Supports magic close char 0 0
SETTIMEOUT Set timeout (in seconds) 0 0
Check watchdog status
# lsmod | grep -i watchdog
# dmesg | grep -i watchdog
[ 0.112558] NMI watchdog: Perf NMI watchdog permanently disabled
=> this output is said to be normal
Restart CRS (both nodes)
# crsctl stop crs
# crsctl start crs
Check whether fencing + reboot occur when the private NIC goes down
Check the current time
# date
Bring down ens224 on node 2
# ifdown ens224
Connection 'ens224' successfully deactivated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/3)
The result was the same when ens224 was brought down on both nodes, not just node 2.
Same result.
Further investigation shows that from version 11.2.0.2 onward a reboot is not guaranteed.
1. From 11.2.0.2 RAC (or when using Exadata), evicting a node may not actually reboot the system.
This is called a reboot-less restart; in this case most of the Clusterware stack is restarted to check whether the unhealthy node can be corrected.
Reference: https://forums.oracle.com/ords/apexds/post/oracle-rac-11g-fencing-4045
2. Before 11.2.0.2, when an essential Oracle RAC component failed (for example the private interconnect or the voting disk), the server was rebooted quickly to prevent split-brain.
However, this meant the reboot did not wait for file system syncs or outstanding I/O to complete, and non-cluster-aware applications were also terminated forcibly.
In addition, during the reboot, resources had to be remastered onto the surviving nodes and reconfigured,
which can be very expensive in large clusters with many nodes.
This mechanism was changed in 11.2.0.2, the first patch set of 11g Release 2.
After deciding to evict a node,
- Clusterware first tries to clean up the failure within that node only, killing just the offending processes,
- in particular the processes generating I/O.
- If all Oracle resources and processes can be stopped and all I/O-generating processes killed, the Clusterware resources are stopped on that node.
The Oracle High Availability Services daemon (ohasd) keeps trying to restart the CRS stack.
Once the conditions for starting the CRS stack are met, the relevant cluster resources on that node are started automatically.
- If some resources cannot be stopped, or processes stuck in kernel mode or in I/O paths cannot be fully terminated,
Oracle Clusterware still reboots the node or uses IPMI to forcibly evict it from the cluster.
This change in behavior is particularly useful for non-cluster-aware applications:
data is protected by shutting down the cluster only on that node, and the node itself is not rebooted.
Reference: http://oracleinaction.com/11g-r2-rac-reboot-less-node-fencing/
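If a reboot-less restart has occurred, ohasd should still be up and trying to bring the stack back; a minimal way to watch that from the affected node:
# ps -ef | grep -E 'ohasd|ocssd' | grep -v grep   # ohasd.bin keeps running even while CSSD is down
# crsctl check crs                                # re-run until CSS/CRS report online again
# crsctl stat res -t -init                        # watch ora.cssd / ora.crsd return to ONLINE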
Conclusion:
In Oracle 19c RAC, when the private IP is cut, fencing and eviction kick in once misscount is exceeded.
About 15 seconds after the private IP was taken down with ifdown, the message "node ora19rac2 (2) at 50% heartbeat fatal, removal in 14.370 seconds" appeared.
The heartbeat-fatal percentage then kept rising while the remaining time shrank; ora19rac2 was marked for removal, and afterwards some grid resources such as cssd and crsd went down.
A full server reboot was expected but did not happen (an unconfigured watchdog or fencing agent was one suspected reason, but it was actually due to the behavior change introduced in 11.2.0.2).
To be precise, from 11.2.0.2 RAC (or when using Exadata), evicting a node may not actually reboot the system; this is called a reboot-less restart.
The eviction target is not fixed; depending on the situation, the opposite node may be the one evicted (https://docs.oracle.com/en/database/oracle/oracle-database/19/cwadd/clusterware-administration-and-deployment-guide.pdf).
However, according to Doc IDs 1481481.1 and 1546004.1, the node with the higher node number is the one evicted.
It would also be worth testing this on 11.2.0.1 RAC.
One thing learned after the test: bringing an interface down with ifconfig/ifdown, as done in this post, is not a recommended way to simulate a network failure,
because the interface still has an address plumbed, which can lead to unexpected results.
For real availability testing, the cable should be physically unplugged, or the switch port shut down or powered off.
References:
11gR2 CSS Terminates/Node Eviction After Unplugging one Network Cable in Redundant Interconnect Environment (Doc ID 1481481.1)
RAC and Oracle Clusterware Best Practices and Starter Kit (Platform Independent) (Doc ID 810394.1)
Oracle Grid Infrastructure: How to Troubleshoot cssagent/cssmonitor Evictions (Doc ID 1549496.1)
Cssd May Evict Node Even Reconnecting Interconnect Cable Within Misscount After Disconnecting (Doc ID 2066998.1)
11gR2 GI Node May not Join the Cluster After Private Network is Functional After Eviction due to Private Network Problem (Doc ID 1479380.1)
Oracle Grid Infrastructure: Understanding Split-Brain Node Eviction (Doc ID 1546004.1)
https://forums.oracle.com/ords/apexds/post/rac-12c-behavior-when-private-interconnect-goes-down-7285
https://docs.oracle.com/en/database/oracle/oracle-database/19/racad/introduction-to-oracle-rac.html#GUID-F859B8B3-16B2-49CC-B41E-39A328F8027B
https://cording-cossk3.tistory.com/244#google_vignette
https://database-heartbeat.com/wp-content/uploads/2021/10/troubleshooting-rac-node-eviction.pdf
https://honglab.tistory.com/136
https://access.redhat.com/solutions/3892631
https://docs.oracle.com/en/database/oracle/oracle-database/21/cwadd/clusterware-administration-and-deployment-guide.pdf#:~:text=partitioning%20the%20cluster%20into%20disjoint,that%20all%20critical%20components%20remain
https://db.geeksinsight.com/2012/12/27/oracle-rac-node-evictions-11gr2-node-eviction-means-restart-of-cluster-stack-not-reboot-of-node/
https://forums.oracle.com/ords/apexds/post/oracle-rac-11g-fencing-4045
http://oracleinaction.com/11g-r2-rac-reboot-less-node-fencing/
https://web.archive.org/web/20240527094421/http://oracleinaction.com/11g-r2-rac-reboot-less-node-fencing/
https://www.oracle.com/docs/tech/database/headache-free-sb-resolution.pdf
http://oracle-help.com/oracle-rac/rebootless-node-fencing-oracle-rac/
https://www.oracle.com/technical-resources/articles/enterprise-manager/haskins-rac-project-guide.html