重庆思庄Oracle、KingBase、PostgreSQL、Redhat认证学习论坛

标题: Instance Hang with Events 'log file parallel write' 'LGWR any worker group' [打印本页]

作者: 刘泽宇 时间: 2025-5-25 10:54
标题: Instance Hang with Events 'log file parallel write' 'LGWR any worker group'
现象：
One of Real Application Cluster instances was hanging with 'log file parallel write', 'LGWR worker group ordering', etc.

2025-02-06T10:26:33.245053+08:00
LG00 (ospid: 32135) waits for event 'log file parallel write' for 79 secs.
2025-02-06T10:26:35.686384+08:00
LG01 (ospid: 32150) waits for event 'LGWR worker group ordering' for 81 secs.
2025-02-06T10:26:40.966712+08:00
LGWR (ospid: 32131) waits for event 'LGWR any worker group' for 79 secs.
2025-02-06T10:26:50.044534+08:00
LGWR (ospid: 32131) waits for event 'LGWR any worker group' for 88 secs.

...

2025-02-06T10:31:22.155761+08:00
LG00 (ospid: 32135) waits for event 'log file parallel write' for 176 secs.
2025-02-06T10:31:24.796786+08:00
LGWR (ospid: 32131) waits for event 'LGWR all worker groups' for 179 secs.
2025-02-06T10:31:33.260291+08:00
LGWR (ospid: 32131) waits for event 'LGWR all worker groups' for 187 secs.
2025-02-06T10:31:42.212506+08:00
LGWR (ospid: 32131) waits for event 'LGWR all worker groups' for 196 secs.
2025-02-06T10:31:47.138127+08:00
LG00 (ospid: 32135) waits for event 'log file parallel write' for 201 secs.
2025-02-06T10:31:51.260376+08:00
LGWR (ospid: 32131) waits for event 'LGWR all worker groups' for 205 secs.

...

2025-02-06T10:55:44.022923+08:00
LGWR (ospid: 32131) waits for event 'log file parallel write' for 198 secs.
2025-02-06T10:55:54.062963+08:00
LGWR (ospid: 32131) waits for event 'log file parallel write' for 208 secs.
2025-02-06T10:56:03.942882+08:00
LGWR (ospid: 32131) waits for event 'log file parallel write' for 218 secs.
2025-02-06T10:56:13.974936+08:00
LGWR (ospid: 32131) waits for event 'log file parallel write' for 228 secs.

Following were reported in OS log:

Feb 6 10:24:57 <Host Name> kernel: qla2xxx [0000:3b:00.1]-801c:16: Abort command issued nexus=16:0:3 -- 1 2002.
Feb 6 10:24:57 <Host Name> kernel: qla2xxx [0000:3b:00.1]-8009:16: DEVICE RESET ISSUED nexus=16:0:0 cmd=ffff8df3a124e548.
...
Feb 6 10:25:44 <Host Name> kernel: qla2xxx [0000:3b:00.1]-8009:16: DEVICE RESET ISSUED nexus=16:0:0 cmd=ffff8df3aad27548.
Feb 6 10:25:44 <Host Name> kernel: qla2xxx [0000:3b:00.1]-800e:16: DEVICE RESET SUCCEEDED nexus:16:0:0 cmd=ffff8df3aad27548.
Feb 6 10:25:44 <Host Name> kernel: qla2xxx [0000:3b:00.1]-8009:16: DEVICE RESET ISSUED nexus=16:0:1 cmd=ffff8ddf73eacd48.
Feb 6 10:25:44 <Host Name> kernel: qla2xxx [0000:3b:00.1]-800e:16: DEVICE RESET SUCCEEDED nexus:16:0:1 cmd=ffff8ddf73eacd48.
Feb 6 10:25:44 <Host Name> kernel: sd 16:0:0:1: [sdc] tag#1 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Feb 6 10:25:44 <Host Name> kernel: sd 16:0:0:1: [sdc] tag#1 CDB: Write(16) 8a 00 00 00 00 00 00 0b a3 90 00 00 00 80 00 00
Feb 6 10:25:44 <Host Name> kernel: print_req_error: I/O error, dev sdc, sector 762768
Feb 6 10:25:44 <Host Name> kernel: device-mapper: multipath: Failing path 8:32.
Feb 6 10:25:44 <Host Name> kernel: sd 16:0:0:0: alua: port group 3e8 state A non-preferred supports TolUsNA
Feb 6 10:25:44 <Host Name> multipathd: sdc: mark as failed
Feb 6 10:25:44 <Host Name> multipathd: 3600a09803831465875245171634e6143: remaining active paths: 3
Feb 6 10:25:44 <Host Name> kernel: sd 16:0:0:1: alua: port group 3e8 state A non-preferred supports TolUsNA
Feb 6 10:25:44 <Host Name> kernel: sd 16:0:1:1: alua: port group 3e9 state N non-preferred supports TolUsNA
Feb 6 10:25:44 <Host Name> kernel: sd 16:0:1:0: alua: port group 3e9 state N non-preferred supports TolUsNA
Feb 6 10:25:44 <Host Name> kernel: sd 15:0:0:0: alua: port group 3e8 state A non-preferred supports TolUsNA
Feb 6 10:25:44 <Host Name> multipathd: 3600a09803831465875245171634e6143: sdc - tur checker reports path is up
Feb 6 10:25:44 <Host Name> kernel: device-mapper: multipath: Reinstating path 8:32.
Feb 6 10:25:44 <Host Name> multipathd: 8:32: reinstated
Feb 6 10:25:44 <Host Name> multipathd: 3600a09803831465875245171634e6143: remaining active paths: 4

原因：
'log file parallel write' is the last event of database end after which I/O subsystem is the next one to respond for I/O requests.

In this case the cause is a fault on the module of hardware HBA card on the machine.
And it affected multipath shared disks used by the RAC cluster.

As a related result, trace file of background process LG00 reported:

*** 2025-02-06T10:31:22.723141+08:00 (CDB$ROOT(1))
Warning: log write elapsed time 27423ms, size 1KB

*** 2025-02-06T10:32:25.697906+08:00 (CDB$ROOT(1))
Warning: log write elapsed time 30628ms, size 6KB

*** 2025-02-06T10:35:33.090846+08:00 (CDB$ROOT(1))
Warning: log write elapsed time 31122ms, size 1KB

...

处理方法：
1. Stop database instance/GI on the node.

2. Replace the hardware HBA module.

3. Start GI / database instance.

欢迎光临重庆思庄Oracle、KingBase、PostgreSQL、Redhat认证学习论坛 (http://bbs.cqsztech.com/)