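OS environment: Oracle Linux 8.4 (64-bit)
DB environment: Oracle Database 19.12.0.0 RAC ADG
Error: ORA-16737: the redo transport service for member "ORAADG" has an error
The error appeared in a 19c RAC Active Data Guard environment that had been working normally after setup, following a network interruption.
(Setup was finished at 9 p.m. and every function was verified. The error below appears to have started after the router's automatic reboot at 3 a.m.; the alert log shows errors from 3:06 a.m., so the network is suspected as the cause.)
The failure was discovered around 7 a.m.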
$ dgmgrl sys/oracle
DGMGRL> show configuration
Configuration - DR_ORADB
Protection Mode: MaxPerformance
Members:
ORADB - Primary database
Error: ORA-16810: multiple errors or warnings detected for the member
ORAADG - Physical standby database
Fast-Start Failover: Disabled
Configuration Status:
ERROR (status updated 25 seconds ago)
DGMGRL> show database oradb
Database - ORADB
Role: PRIMARY
Intended State: TRANSPORT-ON
Instance(s):
ORADB1
Error: ORA-16737: the redo transport service for member "ORAADG" has an error
ORADB2
Error: ORA-16737: the redo transport service for member "ORAADG" has an error
Database Status:
ERROR
Alert log check (existing RAC nodes 1 and 2; the primary, ORADB)
Existing RAC node 1 alert log
2022-04-12T03:00:30.793697+09:00
Thread 1 advanced to log sequence 26 (LGWR switch), current SCN: 1497095
Current log# 2 seq# 26 mem# 0: +DATA/ORADB/ONLINELOG/group_2.328.1101466081
2022-04-12T03:00:33.027449+09:00
ARC0 (PID:274245): Archived Log entry 41 added for T-1.S-25 ID 0xaa2c70a2 LAD:1
2022-04-12T03:01:43.308530+09:00
TT02 (PID:283027): Attempting LAD:2 network reconnect (3135)
2022-04-12T03:01:54.198557+09:00
ALTER SYSTEM SET remote_listener=' oel19db-scan:1521' SCOPE=MEMORY SID='ORADB1';
2022-04-12T03:02:22.893398+09:00
TT02 (PID:283027): Error 12154 received logging on to the standby
TT02 (PID:283027): Error 12154 attaching to RFS for reconnect
2022-04-12T03:02:22.895825+09:00
Errors in file /oracle/app/oracle/diag/rdbms/oradb/ORADB1/trace/ORADB1_tt02_283027.trc:
ORA-03135: connection lost contact
TT02 (PID:283027): Error 3135 for LNO:2 to 'ORAADG'
2022-04-12T03:02:22.907674+09:00
Errors in file /oracle/app/oracle/diag/rdbms/oradb/ORADB1/trace/ORADB1_tt02_283027.trc:
ORA-03135: connection lost contact
2022-04-12T03:02:22.921666+09:00
Errors in file /oracle/app/oracle/diag/rdbms/oradb/ORADB1/trace/ORADB1_tt02_283027.trc:
ORA-03135: connection lost contact
2022-04-12T03:02:47.310120+09:00
TT00 (PID:274247): Attempting LAD:2 network reconnect (3135)
TT00 (PID:274247): Error 12154 received logging on to the standby
TT00 (PID:274247): Error 12154 attaching to RFS for reconnect
TT00 (PID:274247): krsg_gap_ping: Error 3135 when pinging ORAADG
2022-04-12T03:02:50.932149+09:00
ALTER SYSTEM SET remote_listener=' oel19db-scan:1521' SCOPE=MEMORY SID='ORADB1';
2022-04-12T03:05:24.605604+09:00
Heavy swapping observed on system
2022-04-12T03:05:24.606812+09:00
WARNING: Heavy swapping observed on system in last 5 mins.
Heavy swapping can lead to timeouts, poor performance, and instance eviction.
2022-04-12T03:07:50.335525+09:00
TT00 (PID:274247): Error 12154 received logging on to the standby
TT00 (PID:274247): Attempting LAD:2 network reconnect (12154)
TT00 (PID:274247): LAD:2 network reconnect abandoned
2022-04-12T03:12:52.833120+09:00
TT00 (PID:274247): Error 12154 received logging on to the standby
TT00 (PID:274247): Attempting LAD:2 network reconnect (12154)
TT00 (PID:274247): LAD:2 network reconnect abandoned
2022-04-12T03:17:55.378484+09:00
TT00 (PID:274247): Error 12154 received logging on to the standby
TT00 (PID:274247): Attempting LAD:2 network reconnect (12154)
TT00 (PID:274247): LAD:2 network reconnect abandoned
2022-04-12T03:22:58.065059+09:00
TT00 (PID:274247): Error 12154 received logging on to the standby
TT00 (PID:274247): Attempting LAD:2 network reconnect (12154)
TT00 (PID:274247): LAD:2 network reconnect abandoned
2022-04-12T03:28:00.544107+09:00
TT00 (PID:274247): Error 12154 received logging on to the standby
TT00 (PID:274247): Suppressing further error logging of LAD:2
TT00 (PID:274247): Suppressing further error logging of LAD:2
2022-04-12T03:33:02.628384+09:00
TT00 (PID:274247): Error 12154 received logging on to the standby
2022-04-12T03:38:04.685578+09:00
TT00 (PID:274247): Error 12154 received logging on to the standby
2022-04-12T03:43:06.843737+09:00
TT00 (PID:274247): Error 12154 received logging on to the standby
.
.
Existing RAC node 2 alert log
2022-04-12T03:01:42.403168+09:00
TT03 (PID:251051): Attempting LAD:2 network reconnect (3135)
2022-04-12T03:01:54.711323+09:00
ALTER SYSTEM SET remote_listener=' oel19db-scan:1521' SCOPE=MEMORY SID='ORADB2';
2022-04-12T03:02:33.139854+09:00
TT03 (PID:251051): Error 12154 received logging on to the standby
TT03 (PID:251051): Error 12154 attaching to RFS for reconnect
2022-04-12T03:02:33.142831+09:00
Errors in file /oracle/app/oracle/diag/rdbms/oradb/ORADB2/trace/ORADB2_tt03_251051.trc:
ORA-03135: connection lost contact
TT03 (PID:251051): Error 3135 for LNO:3 to 'ORAADG'
2022-04-12T03:02:33.160873+09:00
Errors in file /oracle/app/oracle/diag/rdbms/oradb/ORADB2/trace/ORADB2_tt03_251051.trc:
ORA-03135: connection lost contact
2022-04-12T03:02:33.162845+09:00
Errors in file /oracle/app/oracle/diag/rdbms/oradb/ORADB2/trace/ORADB2_tt03_251051.trc:
ORA-03135: connection lost contact
2022-04-12T03:02:47.249443+09:00
TT00 (PID:242639): Attempting LAD:2 network reconnect (3135)
TT00 (PID:242639): Error 12154 received logging on to the standby
TT00 (PID:242639): Error 12154 attaching to RFS for reconnect
TT00 (PID:242639): krsg_gap_ping: Error 3135 when pinging ORAADG
2022-04-12T03:02:50.665238+09:00
ALTER SYSTEM SET remote_listener=' oel19db-scan:1521' SCOPE=MEMORY SID='ORADB2';
2022-04-12T03:04:56.867717+09:00
Heavy swapping observed on system
2022-04-12T03:04:56.868257+09:00
WARNING: Heavy swapping observed on system in last 5 mins.
Heavy swapping can lead to timeouts, poor performance, and instance eviction.
2022-04-12T03:07:49.774685+09:00
TT00 (PID:242639): Error 12154 received logging on to the standby
TT00 (PID:242639): Attempting LAD:2 network reconnect (12154)
TT00 (PID:242639): LAD:2 network reconnect abandoned
2022-04-12T03:12:52.394424+09:00
TT00 (PID:242639): Error 12154 received logging on to the standby
TT00 (PID:242639): Attempting LAD:2 network reconnect (12154)
TT00 (PID:242639): LAD:2 network reconnect abandoned
2022-04-12T03:17:54.982946+09:00
TT00 (PID:242639): Error 12154 received logging on to the standby
TT00 (PID:242639): Attempting LAD:2 network reconnect (12154)
TT00 (PID:242639): LAD:2 network reconnect abandoned
2022-04-12T03:22:57.563342+09:00
TT00 (PID:242639): Error 12154 received logging on to the standby
TT00 (PID:242639): Attempting LAD:2 network reconnect (12154)
TT00 (PID:242639): LAD:2 network reconnect abandoned
2022-04-12T03:28:00.167125+09:00
TT00 (PID:242639): Error 12154 received logging on to the standby
TT00 (PID:242639): Suppressing further error logging of LAD:2
TT00 (PID:242639): Suppressing further error logging of LAD:2
2022-04-12T03:33:02.753909+09:00
TT00 (PID:242639): Error 12154 received logging on to the standby
2022-04-12T03:38:05.346744+09:00
TT00 (PID:242639): Error 12154 received logging on to the standby
2022-04-12T03:43:07.965143+09:00
TT00 (PID:242639): Error 12154 received logging on to the standby
.
.
Nothing unusual was logged until 3 a.m.; the errors start after 3 a.m.
From 3:33 a.m. onward, the message "TT00 (PID:242639): Error 12154 received logging on to the standby" is logged at roughly 5-minute (300-second) intervals.
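The roughly 5-minute spacing can be read straight off the alert log timestamps. The function below is a minimal sketch, not part of the original procedure: the name tns_error_gaps is hypothetical, it assumes the alert log layout shown above (a timestamp line followed by its message lines), and it assumes all entries fall on the same calendar day.

```shell
# Print the gap, in whole seconds, between consecutive
# "Error 12154 received logging on to the standby" entries in an alert log.
# Assumption: no midnight rollover between entries (only HH:MM:SS is compared).
tns_error_gaps() {
  awk '
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T/ {
      # Timestamp line, e.g. 2022-04-12T03:33:02.628384+09:00
      split(substr($0, 12, 8), t, ":")
      now = t[1] * 3600 + t[2] * 60 + t[3]
      next
    }
    /Error 12154 received logging on to the standby/ {
      # Report the distance from the previous matching entry, if any
      if (seen) print now - prev
      prev = now
      seen = 1
    }
  ' "$1"
}
```

Called as tns_error_gaps followed by the path to an alert log, it prints one gap per line. For the three entries from 03:33:02 to 03:43:06 above, the gaps are 302 and 302 seconds.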
Alert log check (added RAC nodes 1 and 2; the standby, ORAADG)
Added RAC node 1 alert log
2022-04-12T03:00:10.855316+09:00
rfs (PID:272979): Selected LNO:6 for T-1.S-26 dbid 2854730844 branch 1101466076
2022-04-12T03:00:11.849928+09:00
Recovery of Online Redo Log: Thread 1 Group 6 Seq 26 Reading mem 0
Mem# 0: +DATA/ORAADG/ONLINELOG/group_6.329.1101748153
Mem# 1: +DATA/ORAADG/ONLINELOG/group_6.327.1101748155
2022-04-12T03:00:12.529152+09:00
ARC0 (PID:243499): Archived Log entry 24 added for T-1.S-25 ID 0xaa2c70a2 LAD:1
2022-04-12T03:06:50.766512+09:00
ALTER SYSTEM SET remote_listener=' oel19adg-scan:1521' SCOPE=MEMORY SID='ORAADG1';
2022-04-12T03:07:27.003269+09:00
Process termination requested for pid 243364 [source = rdbms], [info = 2] [request issued by pid: 243292, uid: 54321]
2022-04-12T05:03:08.804733+09:00
rfs (PID:253922): Possible network disconnect with primary database
2022-04-12T05:03:08.809996+09:00
rfs (PID:253956): Possible network disconnect with primary database
2022-04-12T05:03:08.832144+09:00
rfs (PID:272979): Possible network disconnect with primary database
2022-04-12T05:03:08.833267+09:00
rfs (PID:273115): Possible network disconnect with primary database
2022-04-12T06:56:19.979120+09:00
rfs (PID:583434): krsr_rfs_atc: Identified database type as 'PHYSICAL STANDBY': Client is ASYNC (PID:462286)
2022-04-12T06:56:19.980618+09:00
rfs (PID:583436): krsr_rfs_atc: Identified database type as 'PHYSICAL STANDBY': Client is ASYNC (PID:500890)
rfs (PID:583436): Primary database is in MAXIMUM PERFORMANCE mode
2022-04-12T06:56:19.994610+09:00
rfs (PID:583434): Primary database is in MAXIMUM PERFORMANCE mode
Added RAC node 2 alert log
2022-04-12T03:04:08.745065+09:00
Heavy swapping observed on system
2022-04-12T03:04:08.748412+09:00
WARNING: Heavy swapping observed on system in last 5 mins.
Heavy swapping can lead to timeouts, poor performance, and instance eviction.
2022-04-12T03:07:14.768122+09:00
ALTER SYSTEM SET remote_listener=' oel19adg-scan:1521' SCOPE=MEMORY SID='ORAADG2';
2022-04-12T03:07:27.935486+09:00
Process termination requested for pid 223594 [source = rdbms], [info = 2] [request issued by pid: 223531, uid: 54321]
Nothing unusual was logged until 3 a.m.; the errors start after 3 a.m.
After redo transport to the standby was left broken for a while, the errors below appeared.
DGMGRL> show configuration
Configuration - DR_ORADB
Protection Mode: MaxPerformance
Members:
ORADB - Primary database
Error: ORA-16810: multiple errors or warnings detected for the member
ORAADG - Physical standby database
Warning: ORA-16809: multiple warnings detected for the member
Fast-Start Failover: Disabled
Configuration Status:
ERROR (status updated 27 seconds ago)
DGMGRL> show database oraadg
Database - ORAADG
Role: PHYSICAL STANDBY
Intended State: APPLY-ON
Transport Lag: 1 hour(s) 32 minutes 32 seconds (computed 1 second ago)
Apply Lag: 1 hour(s) 32 minutes 33 seconds (computed 1 second ago)
Average Apply Rate: 7.00 KByte/s
Real Time Query: ON
Instance(s):
ORAADG1 (apply instance)
ORAADG2
Database Warning(s):
ORA-16853: apply lag has exceeded specified threshold
ORA-16855: transport lag has exceeded specified threshold
Database Status:
WARNING
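At first only ORADB reported an ORA message, saying that redo transport was failing.
After some time, ORAADG also reported ORA messages saying that the transport lag (ORA-16855) and apply lag (ORA-16853) had exceeded the specified thresholds.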
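For tracking how far the standby has fallen behind, the lag text can be turned into a number of seconds. This is a minimal sketch, assuming the lag is always printed as number/unit pairs like the "1 hour(s) 32 minutes 32 seconds" shown above. The function name lag_to_seconds and the handling of a day unit are assumptions, since the output above only shows hours, minutes, and seconds.

```shell
# Convert a DGMGRL lag string to seconds.
# $1: the text after "Transport Lag:" or "Apply Lag:", for example
#     "1 hour(s) 32 minutes 32 seconds (computed 1 second ago)"
lag_to_seconds() {
  printf '%s\n' "$1" | awk '
    {
      sub(/\(computed.*$/, "")   # drop the "(computed ... ago)" suffix
      total = 0
      # Remaining fields are number/unit pairs: 1 hour(s) 32 minutes ...
      for (i = 1; i < NF; i += 2) {
        n = $i
        unit = $(i + 1)
        if (unit ~ /^day/)         total += n * 86400   # assumed unit name
        else if (unit ~ /^hour/)   total += n * 3600
        else if (unit ~ /^minute/) total += n * 60
        else if (unit ~ /^second/) total += n
      }
      print total
    }'
}
```

For the transport lag shown above (1 hour(s) 32 minutes 32 seconds) this prints 5552.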
Checking the error recorded in v$archive_dest
SQL>
set lines 200 pages 1000
select dest_id,status,error from v$archive_dest;
DEST_ID STATUS ERROR
---------- --------- -----------------------------------------------------------------
1 VALID
2 ERROR ORA-12154: TNS:could not resolve the connect identifier specified
3 INACTIVE
.
.
ORA-12154: TNS:could not resolve the connect identifier specified was raised.
Suspecting a problem in the tnsnames.ora file, I tried sqlplus and tnsping, but both worked normally.
Existing RAC nodes 1 and 2
$ sqlplus sys/oracle@oraadg as sysdba
SQL>
Connected successfully
$ sqlplus sys/oracle@oraadg1 as sysdba
SQL>
Connected successfully
$ sqlplus sys/oracle@oraadg2 as sysdba
SQL>
Connected successfully
Solution: restart the databases
Restart the primary and standby databases
Existing RAC node 1
$ srvctl stop database -d ORADB
Added RAC node 1
$ srvctl stop database -d ORAADG
Existing RAC node 1
$ srvctl start database -d ORADB
Added RAC node 1
$ srvctl start database -d ORAADG
Status check after the restart
DGMGRL> show configuration
Configuration - DR_ORADB
Protection Mode: MaxPerformance
Members:
ORADB - Primary database
ORAADG - Physical standby database
Fast-Start Failover: Disabled
Configuration Status:
SUCCESS (status updated 9 seconds ago)
DGMGRL> VALIDATE NETWORK CONFIGURATION FOR ALL ;
Connecting to instance "ORADB1" on database "ORADB" ...
Connected to "ORADB"
Checking connectivity from instance "ORADB1" on database "ORADB to instance "ORAADG1" on database "ORAADG"...
Succeeded.
Checking connectivity from instance "ORADB1" on database "ORADB to instance "ORAADG2" on database "ORAADG"...
Succeeded.
Connecting to instance "ORADB2" on database "ORADB" ...
Connected to "ORADB"
Checking connectivity from instance "ORADB2" on database "ORADB to instance "ORAADG1" on database "ORAADG"...
Succeeded.
Checking connectivity from instance "ORADB2" on database "ORADB to instance "ORAADG2" on database "ORAADG"...
Succeeded.
Connecting to instance "ORAADG1" on database "ORAADG" ...
Connected to "ORAADG"
Checking connectivity from instance "ORAADG1" on database "ORAADG to instance "ORADB1" on database "ORADB"...
Succeeded.
Checking connectivity from instance "ORAADG1" on database "ORAADG to instance "ORADB2" on database "ORADB"...
Succeeded.
Connecting to instance "ORAADG2" on database "ORAADG" ...
Connected to "ORAADG"
Checking connectivity from instance "ORAADG2" on database "ORAADG to instance "ORADB1" on database "ORADB"...
Succeeded.
Checking connectivity from instance "ORAADG2" on database "ORAADG to instance "ORADB2" on database "ORADB"...
Succeeded.
Oracle Clusterware on database "ORADB" is available for database restart.
Oracle Clusterware on database "ORAADG" is available for database restart.
Back to normal.
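Since the same status check is repeated after every attempted fix, it may be convenient to pull out just the overall status word. This is a minimal sketch, assuming the status appears on the first non-empty line after "Configuration Status:" as in the output above. The function name dg_config_status is hypothetical.

```shell
# Read "show configuration" output on stdin and print the overall
# status word (for example SUCCESS or ERROR).
dg_config_status() {
  awk '
    found && NF { print $1; exit }          # first non-empty line after the header
    /Configuration Status:/ { found = 1 }   # header line seen
  '
}
```

One possible use, following the login style used in this post, is: dgmgrl -silent sys/oracle "show configuration" | dg_config_status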
Steps tried before the database restart
Disabled the broker configuration, then set the dg_broker_start parameter to false and back to true (existing RAC node 1, added RAC node 1)
Existing RAC node 1
$ dgmgrl sys/oracle
DGMGRL> disable configuration
Disabled.
$ sqlplus / as sysdba
SQL> alter system set dg_broker_start=false sid='*';
System altered.
Added RAC node 1
SQL> alter system set dg_broker_start=false sid='*';
System altered.
Existing RAC node 1, added RAC node 1
SQL> alter system set dg_broker_start=true sid='*';
System altered.
Existing RAC node 1
$ dgmgrl sys/oracle
DGMGRL> enable configuration
Enabled.
Status check (existing RAC node 1)
Existing RAC node 1
DGMGRL> show configuration
Configuration - DR_ORADB
Protection Mode: MaxPerformance
Members:
ORADB - Primary database
Error: ORA-16810: multiple errors or warnings detected for the member
ORAADG - Physical standby database
Fast-Start Failover: Disabled
Configuration Status:
ERROR (status updated 25 seconds ago)
DGMGRL> show database oradb
Database - ORADB
Role: PRIMARY
Intended State: TRANSPORT-ON
Instance(s):
ORADB1
Error: ORA-16737: the redo transport service for member "ORAADG" has an error
ORADB2
Error: ORA-16737: the redo transport service for member "ORAADG" has an error
Database Status:
ERROR
The error persisted.
Listener restart (existing RAC nodes 1 and 2, added RAC nodes 1 and 2)
Existing RAC node 1
$ srvctl stop listener -node oel19db1 -l listener
$ srvctl stop listener -node oel19db2 -l listener
$ srvctl start listener -node oel19db1 -l listener
$ srvctl start listener -node oel19db2 -l listener
Added RAC node 1
$ srvctl stop listener -node oel19adg1 -l listener
$ srvctl stop listener -node oel19adg2 -l listener
$ srvctl start listener -node oel19adg1 -l listener
$ srvctl start listener -node oel19adg2 -l listener
Status check (existing RAC node 1)
Existing RAC node 1
DGMGRL> show configuration
Configuration - DR_ORADB
Protection Mode: MaxPerformance
Members:
ORADB - Primary database
Error: ORA-16810: multiple errors or warnings detected for the member
ORAADG - Physical standby database
Fast-Start Failover: Disabled
Configuration Status:
ERROR (status updated 13 seconds ago)
DGMGRL> show database oradb
Database - ORADB
Role: PRIMARY
Intended State: TRANSPORT-ON
Instance(s):
ORADB1
Error: ORA-16737: the redo transport service for member "ORAADG" has an error
ORADB2
Error: ORA-16737: the redo transport service for member "ORAADG" has an error
Database Status:
ERROR
The error persisted.
Afterwards, a full database restart with srvctl brought everything back to normal.
References: 808783.1, 2329386.1, 1432367.1, 1130523.1, 1367311.1, 362656.1, 2804791.1, 1631552.1