Instance Hang with Events 'log file parallel write' 'LGWR any worker group'

刘泽宇 · Posted on 2025-5-25 10:54:11

Symptoms:
One instance of a Real Application Clusters (RAC) database hung with waits on 'log file parallel write', 'LGWR worker group ordering', and related events:

2025-02-06T10:26:33.245053+08:00
LG00 (ospid: 32135) waits for event 'log file parallel write' for 79 secs.
2025-02-06T10:26:35.686384+08:00
LG01 (ospid: 32150) waits for event 'LGWR worker group ordering' for 81 secs.
2025-02-06T10:26:40.966712+08:00
LGWR (ospid: 32131) waits for event 'LGWR any worker group' for 79 secs.
2025-02-06T10:26:50.044534+08:00
LGWR (ospid: 32131) waits for event 'LGWR any worker group' for 88 secs.

...

2025-02-06T10:31:22.155761+08:00
LG00 (ospid: 32135) waits for event 'log file parallel write' for 176 secs.
2025-02-06T10:31:24.796786+08:00
LGWR (ospid: 32131) waits for event 'LGWR all worker groups' for 179 secs.
2025-02-06T10:31:33.260291+08:00
LGWR (ospid: 32131) waits for event 'LGWR all worker groups' for 187 secs.
2025-02-06T10:31:42.212506+08:00
LGWR (ospid: 32131) waits for event 'LGWR all worker groups' for 196 secs.
2025-02-06T10:31:47.138127+08:00
LG00 (ospid: 32135) waits for event 'log file parallel write' for 201 secs.
2025-02-06T10:31:51.260376+08:00
LGWR (ospid: 32131) waits for event 'LGWR all worker groups' for 205 secs.

...

2025-02-06T10:55:44.022923+08:00
LGWR (ospid: 32131) waits for event 'log file parallel write' for 198 secs.
2025-02-06T10:55:54.062963+08:00
LGWR (ospid: 32131) waits for event 'log file parallel write' for 208 secs.
2025-02-06T10:56:03.942882+08:00
LGWR (ospid: 32131) waits for event 'log file parallel write' for 218 secs.
2025-02-06T10:56:13.974936+08:00
LGWR (ospid: 32131) waits for event 'log file parallel write' for 228 secs.
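
When there are many such messages, it can help to reduce them to the longest wait per process and event. The following is a minimal sketch, assuming the messages have exactly the layout shown above; the function name `longest_waits` and the sample list are illustrative only and are not part of any Oracle tool.

```python
import re
from collections import defaultdict

# Matches lines such as:
# LG00 (ospid: 32135) waits for event 'log file parallel write' for 79 secs.
WAIT_RE = re.compile(
    r"^(?P<proc>\S+) \(ospid: (?P<ospid>\d+)\) waits for event "
    r"'(?P<event>[^']+)' for (?P<secs>\d+) secs\.$"
)

def longest_waits(lines):
    """Return {(process, event): longest reported wait in seconds}."""
    longest = defaultdict(int)
    for line in lines:
        m = WAIT_RE.match(line.strip())
        if m:  # timestamp lines and "..." simply do not match
            key = (m.group("proc"), m.group("event"))
            longest[key] = max(longest[key], int(m.group("secs")))
    return dict(longest)

# A few lines copied from the excerpt above (hypothetical input list).
sample = [
    "2025-02-06T10:26:33.245053+08:00",
    "LG00 (ospid: 32135) waits for event 'log file parallel write' for 79 secs.",
    "2025-02-06T10:31:47.138127+08:00",
    "LG00 (ospid: 32135) waits for event 'log file parallel write' for 201 secs.",
    "2025-02-06T10:56:13.974936+08:00",
    "LGWR (ospid: 32131) waits for event 'log file parallel write' for 228 secs.",
]

for (proc, event), secs in sorted(longest_waits(sample).items()):
    print(f"{proc:5s} {event:30s} {secs:4d} s")
```

In practice the input would be the lines of the log file that contains these messages rather than a hard-coded list.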

The following was reported in the OS log:

Feb 6 10:24:57 <Host Name> kernel: qla2xxx [0000:3b:00.1]-801c:16: Abort command issued nexus=16:0:3 -- 1 2002.
Feb 6 10:24:57 <Host Name> kernel: qla2xxx [0000:3b:00.1]-8009:16: DEVICE RESET ISSUED nexus=16:0:0 cmd=ffff8df3a124e548.
...
Feb 6 10:25:44 <Host Name> kernel: qla2xxx [0000:3b:00.1]-8009:16: DEVICE RESET ISSUED nexus=16:0:0 cmd=ffff8df3aad27548.
Feb 6 10:25:44 <Host Name> kernel: qla2xxx [0000:3b:00.1]-800e:16: DEVICE RESET SUCCEEDED nexus:16:0:0 cmd=ffff8df3aad27548.
Feb 6 10:25:44 <Host Name> kernel: qla2xxx [0000:3b:00.1]-8009:16: DEVICE RESET ISSUED nexus=16:0:1 cmd=ffff8ddf73eacd48.
Feb 6 10:25:44 <Host Name> kernel: qla2xxx [0000:3b:00.1]-800e:16: DEVICE RESET SUCCEEDED nexus:16:0:1 cmd=ffff8ddf73eacd48.
Feb 6 10:25:44 <Host Name> kernel: sd 16:0:0:1: [sdc] tag#1 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Feb 6 10:25:44 <Host Name> kernel: sd 16:0:0:1: [sdc] tag#1 CDB: Write(16) 8a 00 00 00 00 00 00 0b a3 90 00 00 00 80 00 00
Feb 6 10:25:44 <Host Name> kernel: print_req_error: I/O error, dev sdc, sector 762768
Feb 6 10:25:44 <Host Name> kernel: device-mapper: multipath: Failing path 8:32.
Feb 6 10:25:44 <Host Name> kernel: sd 16:0:0:0: alua: port group 3e8 state A non-preferred supports TolUsNA
Feb 6 10:25:44 <Host Name> multipathd: sdc: mark as failed
Feb 6 10:25:44 <Host Name> multipathd: 3600a09803831465875245171634e6143: remaining active paths: 3
Feb 6 10:25:44 <Host Name> kernel: sd 16:0:0:1: alua: port group 3e8 state A non-preferred supports TolUsNA
Feb 6 10:25:44 <Host Name> kernel: sd 16:0:1:1: alua: port group 3e9 state N non-preferred supports TolUsNA
Feb 6 10:25:44 <Host Name> kernel: sd 16:0:1:0: alua: port group 3e9 state N non-preferred supports TolUsNA
Feb 6 10:25:44 <Host Name> kernel: sd 15:0:0:0: alua: port group 3e8 state A non-preferred supports TolUsNA
Feb 6 10:25:44 <Host Name> multipathd: 3600a09803831465875245171634e6143: sdc - tur checker reports path is up
Feb 6 10:25:44 <Host Name> kernel: device-mapper: multipath: Reinstating path 8:32.
Feb 6 10:25:44 <Host Name> multipathd: 8:32: reinstated
Feb 6 10:25:44 <Host Name> multipathd: 3600a09803831465875245171634e6143: remaining active paths: 4
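
To line these messages up against the database waits, the path state changes can be pulled out of the OS log. This is a rough sketch that assumes the multipathd messages are worded exactly as in the excerpt above; `PATTERNS` and `path_events` are hypothetical names chosen for illustration.

```python
import re

# (event kind, pattern) pairs; the wording is taken from the excerpt above
# and may differ between multipath-tools versions.
PATTERNS = [
    ("failed", re.compile(r"multipathd: (\w+): mark as failed")),
    ("reinstated", re.compile(r"multipathd: (\d+:\d+): reinstated")),
    ("active_paths", re.compile(r"remaining active paths: (\d+)")),
]

def path_events(lines):
    """Return (timestamp, kind, detail) tuples for multipath state changes."""
    events = []
    for line in lines:
        # Syslog timestamp: the first three whitespace-separated fields.
        stamp = " ".join(line.split()[:3])
        for kind, pattern in PATTERNS:
            m = pattern.search(line)
            if m:
                events.append((stamp, kind, m.group(1)))
                break
    return events

# Lines copied from the excerpt above (hypothetical input list).
sample = [
    "Feb 6 10:25:44 <Host Name> multipathd: sdc: mark as failed",
    "Feb 6 10:25:44 <Host Name> multipathd: 3600a09803831465875245171634e6143: remaining active paths: 3",
    "Feb 6 10:25:44 <Host Name> multipathd: 8:32: reinstated",
    "Feb 6 10:25:44 <Host Name> multipathd: 3600a09803831465875245171634e6143: remaining active paths: 4",
]

for stamp, kind, detail in path_events(sample):
    print(stamp, kind, detail)
```

Here the path sdc (8:32) fails and is reinstated within the same second, shortly before the first long waits are reported on the database side.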

Cause:
'log file parallel write' is the last wait event on the database side. Beyond it, responsibility passes to the I/O subsystem, which must respond to the I/O requests.

In this case, the cause was a fault in the HBA card module on the machine, which affected the multipath shared disks used by the RAC cluster.

As a result, the trace file of background process LG00 reported:

*** 2025-02-06T10:31:22.723141+08:00 (CDB$ROOT(1))
Warning: log write elapsed time 27423ms, size 1KB

*** 2025-02-06T10:32:25.697906+08:00 (CDB$ROOT(1))
Warning: log write elapsed time 30628ms, size 6KB

*** 2025-02-06T10:35:33.090846+08:00 (CDB$ROOT(1))
Warning: log write elapsed time 31122ms, size 1KB

...
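
The warnings above show writes of only 1KB to 6KB taking around 30 seconds, which points at I/O latency rather than redo volume. A small sketch for extracting these warnings from a trace file follows; it assumes the warning text is formatted exactly as shown, and `slow_log_writes` is an illustrative name, not an Oracle utility.

```python
import re

# Matches: Warning: log write elapsed time 27423ms, size 1KB
WARN_RE = re.compile(r"Warning: log write elapsed time (\d+)ms, size (\d+)KB")

def slow_log_writes(text):
    """Return a list of (elapsed_ms, size_kb) for each warning in the text."""
    return [(int(ms), int(kb)) for ms, kb in WARN_RE.findall(text)]

# Text copied from the trace excerpt above (hypothetical input string).
sample = """\
*** 2025-02-06T10:31:22.723141+08:00 (CDB$ROOT(1))
Warning: log write elapsed time 27423ms, size 1KB

*** 2025-02-06T10:32:25.697906+08:00 (CDB$ROOT(1))
Warning: log write elapsed time 30628ms, size 6KB

*** 2025-02-06T10:35:33.090846+08:00 (CDB$ROOT(1))
Warning: log write elapsed time 31122ms, size 1KB
"""

writes = slow_log_writes(sample)
worst_ms, worst_kb = max(writes)  # tuples compare by elapsed time first
print(f"{len(writes)} warnings, slowest: {worst_ms} ms for {worst_kb} KB")
```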

Solution:
1. Stop the database instance and Grid Infrastructure (GI) on the node.

2. Replace the faulty HBA module.

3. Start GI and the database instance.
