OS환경 : Oracle Linux 6.8 (64bit)
DB 환경 : Oracle Database 11.2.0.4 RAC
방법 : 오라클 11g R2 RAC cssd 프로세스 중지 시 노드 재기동
Oracle 11g R2에서 cssd 관련 프로세스는 cssdmonitor 및 cssdagent를 포함하여 많은 백그라운드 프로세스가 존재함
cssdagent 프로세스는 ocssd 데몬의 시작 중지, 상태를 모니터링함
cssdagent가 ocssd 데몬이 중지된 것을 발견하면 데이터무결성을 보장하기위해 노드를 셧다운함
cssdmonitor 프로세스는 노드에서 CPU의 스케줄링을 모니터링함
cssdmonitor 프로세스가 노드 hang을 발견하면 데이터 무결성을 보장하기 위해 노드를 셧다운함
이러한 동작은 I/O 펜싱(fencing)으로알려져있음
테스트
기존 grid 상태 확인
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 | $ crsctl stat res -t -------------------------------------------------------------------------------- NAME TARGET STATE SERVER STATE_DETAILS -------------------------------------------------------------------------------- Local Resources -------------------------------------------------------------------------------- ora.LISTENER.lsnr ONLINE ONLINE rac1 ONLINE ONLINE rac2 ora.OCR_VOTE.dg ONLINE ONLINE rac1 ONLINE ONLINE rac2 ora.ORADATA.dg ONLINE ONLINE rac1 ONLINE ONLINE rac2 ora.ORAFRA.dg ONLINE ONLINE rac1 ONLINE ONLINE rac2 ora.asm ONLINE ONLINE rac1 Started ONLINE ONLINE rac2 Started ora.gsd OFFLINE OFFLINE rac1 OFFLINE OFFLINE rac2 ora.net1.network ONLINE ONLINE rac1 ONLINE ONLINE rac2 ora.ons ONLINE ONLINE rac1 ONLINE ONLINE rac2 -------------------------------------------------------------------------------- Cluster Resources -------------------------------------------------------------------------------- ora.LISTENER_SCAN1.lsnr 1 ONLINE ONLINE rac1 ora.cvu 1 ONLINE ONLINE rac1 ora.rac1.vip 1 ONLINE ONLINE rac1 ora.rac2.vip 1 ONLINE ONLINE rac2 ora.racdb.db 1 ONLINE ONLINE rac1 Open 2 ONLINE ONLINE rac2 Open ora.scan1.vip 1 ONLINE ONLINE rac1 |
모두 정상
1번노드에서
ocssd.bin 프로세스 ID 확인
1 2 3 4 5 | $ ps -ef | grep cssd root 8805 1 0 20:38 ? 00:00:00 /app/grid/product/11.2.0/grid/bin/cssdmonitor root 8825 1 0 20:38 ? 00:00:00 /app/grid/product/11.2.0/grid/bin/cssdagent oracle 8837 1 0 20:38 ? 00:00:00 /app/grid/product/11.2.0/grid/bin/ocssd.bin root 10903 2266 0 20:42 pts/0 00:00:00 grep --color=auto cssd |
PID 가 8837임
ocssd.bin 프로세스로 STOP 신호 전달
1 | $ kill -SIGSTOP 8837 |
몇 초 후 자동으로 1번노드 재기동됨
1 2 3 4 5 6 | $ ping rac2 PING rac2 (192.168.137.51) 56(84) bytes of data. 64 bytes from rac2 (192.168.137.51): icmp_seq=1 ttl=64 time=0.243 ms 64 bytes from rac2 (192.168.137.51): icmp_seq=2 ttl=64 time=0.192 ms Connection closed by foreign host. |
1번노드 -> 2번노드로 ping 테스트 중 접속중인 쉘이 끊기는것을 확인함
/etc/oracle/lastgasp/cssmonit_rac1.lgl 파일 확인
1 2 3 4 | $ cd /etc/oracle/lastgasp $ more cssmonit_rac1.lgl OLG01,0303,0000,rac-cluster,6391cd3ebdbfef74ffcd2c0f9ec71892,rac1,cssmonit,L-2020-07-06-07:43:23.739,Rebooting after limit 28040 exceeded; disk timeout 27540 , network timeout 28040, last heartbeat from CSSD at epoch seconds 1594035775.605, 28130 milliseconds ago based on invariant clock value of 484284 |
Rebooting after limit 28040 exceeded 메세지를 확인 가능함
2020-07-06-07:43:23(실제론 오후인 19:43:23임)에 재기동됨
cssdagent 로그 파일 확인
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | $ cd $GRID_HOME/log/rac1/agent/ohasd/oracssdagent_root $ vi oracssdagent_root.log 장애 발생 전 2020-07-06 07:41:59.632: [CLSFRAME][734287680] TM [MultiThread] is changing desired thread # to 3. Current # is 2 2020-07-06 07:42:59.634: [ AGFW][734287680] Recvd request to shed the threads 2020-07-06 07:42:59.634: [CLSFRAME][734287680] TM [MultiThread] is changing desired thread # to 2. Current # is 3 2020-07-06 07:42:59.634: [CLSFRAME][615606016]{0:15:12} Worker thread is exiting in TM [MultiThread] to meet the desired count of 2. New count is 2 2020-07-06 07:42:59.634: [CLSFRAME][734287680] TM [MultiThread] is changing desired thread # to 3. Current # is 2 장애 발생 후 2020-07-06 07:43:10.610: [ USRTHRD][631908096] (:CLSN00111:)clsnproc_needreboot: Impending reboot at 50% of limit 28040; disk timeout 27540, network timeout 28040, last heartbeat from CSSD at epoch seconds 1594035775.605, 15000 milliseconds ago based on invariant clock 484284; now polling at 100 ms 2020-07-06 07:43:10.610: [ USRTHRD][631908096] clsnwork_queue: posting worker thread 2020-07-06 07:43:10.610: [ USRTHRD][631908096] clsnwork_queue: posting worker thread 2020-07-06 07:43:10.610: [ USRTHRD][619824896] clsnwork_process_work: calling sync 2020-07-06 07:43:10.617: [ USRTHRD][619824896] clsnwork_process_work: sync completed 2020-07-06 07:43:10.617: [ USRTHRD][619824896] clsnSyncComplete: posting omon 2020-07-06 07:43:10.617: [ USRTHRD][619824896] clsnwork_process_work: Unknown work type 2020-07-06 07:43:16.725: [ USRTHRD][631908096] (:CLSN00111:)clsnproc_needreboot: Impending reboot at 75% of limit 28040; disk timeout 27540, network timeout 28040, last heartbeat from CSSD at epoch seconds 1594035775.605, 21120 milliseconds ago based on invariant clock 484284; now polling at 100 ms 2020-07-06 07:43:16.725: [ USRTHRD][631908096] clsnwork_queue: posting worker thread 2020-07-06 07:43:16.725: [ USRTHRD][631908096] clsnwork_queue: posting worker thread 2020-07-06 07:43:16.725: [ USRTHRD][619824896] clsnwork_process_work: calling sync 2020-07-06 07:43:16.729: [ USRTHRD][619824896] clsnwork_process_work: sync completed 2020-07-06 07:43:16.729: [ USRTHRD][619824896] clsnSyncComplete: posting omon 2020-07-06 07:43:16.729: [ USRTHRD][619824896] clsnwork_process_work: Unknown work type 2020-07-06 07:43:20.936: [ USRTHRD][631908096] (:CLSN00111:)clsnproc_needreboot: Impending reboot at 90% of limit 28040; disk timeout 27540, network timeout 28040, last heartbeat from CSSD at epoch seconds 1594035775.605, 25330 milliseconds ago based on invariant clock 484284; now polling at 100 ms 2020-07-06 07:43:20.936: [ USRTHRD][631908096] clsnwork_queue: posting worker thread 2020-07-06 07:43:20.936: [ USRTHRD][631908096] clsnwork_queue: posting worker thread 2020-07-06 07:43:20.936: [ USRTHRD][619824896] clsnwork_process_work: calling sync |
cssdmonitor 로그 파일 확인
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | $ cd $GRID_HOME/log/rac1/agent/ohasd/oracssdmonitor_root $ vi oracssdmonitor_root.log 장애 발생 전 2020-07-06 07:41:27.468: [CLSFRAME][2084333376] TM [MultiThread] is changing desired thread # to 3. Current # is 2 2020-07-06 07:42:27.469: [ AGFW][2084333376] Recvd request to shed the threads 2020-07-06 07:42:27.469: [CLSFRAME][2084333376] TM [MultiThread] is changing desired thread # to 2. Current # is 3 2020-07-06 07:42:27.469: [CLSFRAME][1971447552]{0:13:12} Worker thread is exiting in TM [MultiThread] to meet the desired count of 2. New count is 2 2020-07-06 07:42:27.469: [CLSFRAME][2084333376] TM [MultiThread] is changing desired thread # to 3. Current # is 2 장애 발생 후 2020-07-06 07:43:10.610: [ USRTHRD][1967245056] (:CLSN00111:)clsnproc_needreboot: Impending reboot at 50% of limit 28040; disk timeout 27540, network timeout 28040, last heartbeat from CSSD at epoch seconds 1594035775.605, 15000 milliseconds ago based on invariant clock 484284; now polling at 100 ms 2020-07-06 07:43:10.610: [ USRTHRD][1967245056] clsnwork_queue: posting worker thread 2020-07-06 07:43:10.610: [ USRTHRD][1967245056] clsnwork_queue: posting worker thread 2020-07-06 07:43:10.610: [ USRTHRD][1964091136] clsnwork_process_work: calling sync 2020-07-06 07:43:10.618: [ USRTHRD][1964091136] clsnwork_process_work: sync completed 2020-07-06 07:43:10.618: [ USRTHRD][1964091136] clsnSyncComplete: posting omon 2020-07-06 07:43:10.618: [ USRTHRD][1964091136] clsnwork_process_work: Unknown work type 2020-07-06 07:43:16.724: [ USRTHRD][1967245056] (:CLSN00111:)clsnproc_needreboot: Impending reboot at 75% of limit 28040; disk timeout 27540, network timeout 28040, last heartbeat from CSSD at epoch seconds 1594035775.605, 21120 milliseconds ago based on invariant clock 484284; now polling at 100 ms 2020-07-06 07:43:16.724: [ USRTHRD][1967245056] clsnwork_queue: posting worker thread 2020-07-06 07:43:16.724: [ USRTHRD][1967245056] clsnwork_queue: posting worker thread 2020-07-06 07:43:16.724: [ USRTHRD][1964091136] clsnwork_process_work: calling sync 2020-07-06 07:43:16.729: [ USRTHRD][1964091136] clsnwork_process_work: sync completed 2020-07-06 07:43:16.729: [ USRTHRD][1964091136] clsnSyncComplete: posting omon 2020-07-06 07:43:16.729: [ USRTHRD][1964091136] clsnwork_process_work: Unknown work type 2020-07-06 07:43:20.932: [ USRTHRD][1967245056] (:CLSN00111:)clsnproc_needreboot: Impending reboot at 90% of limit 28040; disk timeout 27540, network timeout 28040, last heartbeat from CSSD at epoch seconds 1594035775.605, 25320 milliseconds ago based on invariant clock 484284; now polling at 100 ms 2020-07-06 07:43:20.932: [ USRTHRD][1967245056] clsnwork_queue: posting worker thread 2020-07-06 07:43:20.932: [ USRTHRD][1967245056] clsnwork_queue: posting worker thread 2020-07-06 07:43:20.932: [ USRTHRD][1964091136] clsnwork_process_work: calling sync |
grid alert 로그 파일 확인
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | $ cd $GRID_HOME/log/rac1 $ vi alertrac1.log 2020-07-06 07:39:39.928: [crsd(9215)]CRS-2772:Server 'rac2' has been assigned to pool 'ora.racdb'. <- 이 시점 이후 재기동됨 2020-07-06 07:43:45.971: [ohasd(1435)]CRS-2112:The OLR service started on node rac1. 2020-07-06 07:43:46.017: [ohasd(1435)]CRS-1301:Oracle High Availability Service started on node rac1. 2020-07-06 07:43:46.021: [ohasd(1435)]CRS-8011:reboot advisory message from host: rac1, component: cssagent, with time stamp: L-2020-07-06-02:15:19.697 [ohasd(1435)]CRS-8013:reboot advisory message text: Rebooting after limit 28330 exceeded; disk timeout 28330, network timeout 28160, last heartbeat from CSSD at epoch seconds 1594016091.350, 28351 milliseconds ago based on invariant clock value of 17689723 2020-07-06 07:43:46.022: [ohasd(1435)]CRS-8011:reboot advisory message from host: rac1, component: cssmonit, with time stamp: L-2020-07-06-07:43:23.739 [ohasd(1435)]CRS-8013:reboot advisory message text: Rebooting after limit 28040 exceeded; disk timeout 27540, network timeout 28040, last heartbeat from CSSD at epoch seconds 1594035775.605, 28130 milliseconds ago based on invariant clock value of 484284 2020-07-06 07:43:46.022: [ohasd(1435)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 2 were announced and 0 errors occurred 2020-07-06 07:43:46.667: [/app/grid/product/11.2.0/grid/bin/oraagent.bin(2447)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/app/grid/product/11.2.0/grid/log/rac1/agent/ohasd/oraagent_oracle/oraagent_oracle.log" 2020-07-06 07:43:49.735: [/app/grid/product/11.2.0/grid/bin/orarootagent.bin(2451)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running). 2020-07-06 07:43:51.850: [ohasd(1435)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running). 2020-07-06 07:43:51.869: [gpnpd(2569)]CRS-2328:GPNPD started on node rac1. 2020-07-06 07:43:54.161: [cssd(2673)]CRS-1713:CSSD daemon is started in clustered mode 2020-07-06 07:43:56.034: |
/var/log/messages 확인
1 2 3 4 5 6 7 8 9 10 11 12 | # vi /var/log/messages Jul 6 20:43:34 rac1 ohasd: Starting ohasd: Jul 6 20:43:34 rac1 logger: Oracle HA daemon is enabled for autostart. Jul 6 20:43:34 rac1 libvirtd: 2020-07-06 11:43:34.275+0000: 1303: info : libvirt version: 3.2.0, package: 14.el7 (Unknown, 2017-08-03-07:50:34, x86-ol7-builder-01.us.oracle.com) Jul 6 20:43:34 rac1 libvirtd: 2020-07-06 11:43:34.275+0000: 1303: info : hostname: rac1 Jul 6 20:43:34 rac1 libvirtd: 2020-07-06 11:43:34.275+0000: 1303: error : virDriverLoadModuleFile:59 : failed to load module /usr/lib64/libvirt/storage-backend/libvirt_storage_backend_rbd.so /usr/lib64/libvirt/storage-backend/libvirt_storage_backend_rbd.so: undefined symbol: rbd_diff_iterate2 Jul 6 20:43:34 rac1 systemd: Started Virtualization daemon. Jul 6 20:43:34 rac1 avahi-daemon[799]: Invalid legacy unicast query packet. Jul 6 20:43:34 rac1 ohasd: CRS-4123: Oracle High Availability Services has been started. Jul 6 20:43:34 rac1 systemd: Started (null). Jul 6 20:43:34 rac1 systemd: Starting LSB: Starts the Spacewalk Daemon... Jul 6 20:43:34 rac1 logger: exec /app/grid/product/11.2.0/grid/perl/bin/perl -I/app/grid/product/11.2.0/grid/perl/lib /app/grid/product/11.2.0/grid/bin/crswrapexece.pl /app/grid/product/11.2.0/grid/crs/install/s_crsconfig_rac1_env.txt /app/grid/product/11.2.0/grid/bin/ohasd.bin "reboot" |
참조 :
프로 오라클 데이터베이스 11g RAC 리눅스 서적'ORACLE > Rac' 카테고리의 다른 글
오라클 19c RAC에서 ASM 리소스 제거 방법 (0) | 2021.05.10 |
---|---|
Oracle 10g RAC 운영시 rcp, rsh, rlogin 이 필요한지 여부 (0) | 2020.07.16 |
오라클 11g R2 RAC 파라미터 변경 방법 (5) | 2020.06.22 |
오라클 11g R2 RAC 환경 운영중에 /etc/passwd 파일이 삭제된다면? (0) | 2020.06.19 |
오라클 11g R2 asm rac 환경에서 ocr 백업 확인 및 백업 하기 (0) | 2020.01.04 |