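During today's routine inspection I found that on one machine the /u01 filesystem, which holds the GI (Grid Infrastructure) home, had reached 60 GB of usage, while the other machines were normal.
Drilling down with du -sk showed that the automatic OCR backups under $GRID_HOME/cdata (the default location for automatic and manual OLR and OCR backups) were abnormal, with many number{n}.ocr files: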
-rw------- 1 grid:oinstall 6766592 Nov 10 22:17 week.ocr
-rw------- 1 grid:oinstall 6766592 Nov 10 22:17 day.ocr
-rw------- 1 grid:oinstall 6766592 Nov 11 02:17 day_.ocr
-rw------- 1 grid:oinstall 6766592 Nov 11 02:17 backup02.ocr
-rw------- 1 grid:oinstall 6766592 Nov 11 06:17 backup01.ocr
-rw------- 1 grid:oinstall 6766592 Nov 11 10:17 backup00.ocr
-rw------- 1 root system 7094272 Nov 11 18:21 91530747.ocr
-rw------- 1 root system 7426048 Nov 11 22:21 20587917.ocr
-rw------- 1 root system 7426048 Nov 12 02:21 29546896.ocr
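The du -sk drill-down used to reach this directory can be sketched roughly as follows. This is a minimal sketch: the function name largest_children and its optional row-count argument are illustrative, and the glob skips hidden entries.

```shell
# Print the immediate children of a directory, largest first (sizes in KB).
# Usage: largest_children <dir> [max_rows]
# Hypothetical helper for illustration; hidden entries are not matched by the glob.
largest_children() {
    parent=$1
    du -sk "$parent"/* 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Example (paths are assumptions, adjust to your environment):
# largest_children /u01
# largest_children "$GRID_HOME/cdata"
```

Repeating the call one level deeper each time narrows the search down to the directory that is actually growing.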
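The listing shows that the new files produced by the automatic OCR backup are named number{n}.ocr, which means the automatic backup is misbehaving. Is this a bug? The backups listed by ocrconfig -showbackup are still the normally named files. Comparing file status and attributes against a healthy system showed that the owner and group differ. Could something have gone wrong during installation, that is, rootcrs.pl (root.sh) failing while it adjusted file permissions?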
Due to bug 9446443, automatic OCR backups are incorrectly owned which is preventing CRSD from overwriting them.
Expected ownership and permission on Linux - all 7 of them:
-rw------- 1 root root 11640832 Aug 30 08:46 backup00.ocr
-rw------- 1 root root 11640832 Aug 30 04:46 backup01.ocr
-rw------- 1 root root 11640832 Aug 30 00:46 backup02.ocr
-rw------- 1 root root 11640832 Aug 30 00:46 day_.ocr
-rw------- 1 root root 11640832 Aug 29 00:46 day.ocr
-rw------- 1 root root 11640832 Aug 26 00:45 week_.ocr
-rw------- 1 root root 11640832 Aug 19 00:44 week.ocr
There is a known bug here: bug 9446443 is fixed in 11.2.0.2, 12.1.
It's recommended to apply patch to fix the issue, but if patch is unavailable, workaround is to change ownership and permission of all 7 automatic backup files manually. OCR should be owned by root, but depend on platform, group may or may not be root - you can check any randomly named backup file to identify what ownership and permission it should have; in example below:
-rw------- 1 root root 7143424 Aug 30 09:40 38455890.ocr
With this, please change all 7 automatic backup files to be owned by root:root with permission "-rw-------"
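The manual workaround described above could be scripted along these lines. This is only a sketch: the function name fix_ocr_backup_perms is made up for illustration, and the owner:group value is passed in rather than hard-coded, because the correct group depends on the platform (check a randomly named backup file first, as the note suggests).

```shell
# Set owner:group and mode 600 (-rw-------) on the seven automatic OCR backups.
# Usage: fix_ocr_backup_perms <backup_dir> <owner:group>
# Hypothetical helper; on a real cluster it would normally be run as root.
fix_ocr_backup_perms() {
    backup_dir=$1
    owner_group=$2   # e.g. root:root on Linux; verify for your platform
    for name in backup00 backup01 backup02 day day_ week week_; do
        f="$backup_dir/$name.ocr"
        [ -f "$f" ] || continue          # skip backups that do not exist yet
        chown "$owner_group" "$f" && chmod 600 "$f"
    done
}

# Example (directory is an assumption):
# fix_ocr_backup_perms "$GRID_HOME/cdata/<cluster-name>" root:root
```

Only the seven well-known file names are touched, so randomly named backups are left for a separate, deliberate cleanup.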
Following the document, and taking my own environment into account, I checked the corresponding CRS log:
2016-03-16 06:24:59.079: [UiServer][12081]{1:19564:21073} Done for ctx=11191c2f0
2016-03-16 06:25:54.968: [ OCRRAW][3599]th_delete_backupfile: Failed to delete the backup file [/grid/product/11.2.0/gridhome_1/cdata/c4bidb-cluster/backup02.ocr] Retval:[-2]
2016-03-16 06:25:54.968: [ OCRSRV][3599]th_delete_backupfile: Failed to delete the backup file:[backup02.ocr] Location:[/grid/product/11.2.0/gridhome_1/cdata/c4bidb-cluster]
2016-03-16 06:25:55.026: [ OCRRAW][3599]proprbkp_rename: Failed to rename the backup file [/grid/product/11.2.0/gridhome_1/cdata/c4bidb-cluster/backup01.ocr] Retval:[1]
2016-03-16 06:25:55.026: [ OCRSRV][3599]th_rename_backupfile: Failed to rename the backup file:[backup01.ocr] Location:[/grid/product/11.2.0/gridhome_1/cdata/c4bidb-cluster]. Retval:[49]
2016-03-16 06:25:55.030: [ OCRRAW][3599]proprbkp_rename: Failed to rename the backup file [/grid/product/11.2.0/gridhome_1/cdata/c4bidb-cluster/backup00.ocr] Retval:[1]
2016-03-16 06:25:55.030: [ OCRSRV][3599]th_rename_backupfile: Failed to rename the backup file:[backup00.ocr] Location:[/grid/product/11.2.0/gridhome_1/cdata/c4bidb-cluster]. Retval:[49]
2016-03-16 06:25:55.033: [ OCRRAW][3599]proprbkp_rename: Failed to rename the backup file [/grid/product/11.2.0/gridhome_1/cdata/c4bidb-cluster/16654495.ocr] Retval:[1]
2016-03-16 06:25:55.033: [ OCRSRV][3599]th_rename_backupfile: Failed to rename the backup file:[16654495.ocr] Location:[/grid/product/11.2.0/gridhome_1/cdata/c4bidb-cluster]. Retval:[49]
2016-03-16 06:25:55.036: [ OCRSRV][3599]th_manipulate_backups: Failed to rename the temporary backup file [16654495.ocr].
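To pull the affected file names out of a long CRSD log, something like the sketch below may help. It only relies on the message text visible in the excerpt above; the function name ocr_backup_failures is illustrative, and the log file path is passed in by the caller.

```shell
# Print the distinct backup file names that CRSD failed to delete or rename.
# Usage: ocr_backup_failures <crsd_log_file>
# Hypothetical helper; it matches only the message formats shown in the excerpt.
ocr_backup_failures() {
    log_file=$1
    grep -E 'Failed to (delete|rename) the (temporary )?backup file' "$log_file" |
        # keep the bare "[name.ocr]" form and ignore the full-path variants
        sed -n 's/.*\[\([^]/]*\.ocr\)\].*/\1/p' |
        LC_ALL=C sort -u
}

# Example (log path is an assumption):
# ocr_backup_failures /path/to/crsd.log
```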
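The log shows that during the automatic OCR backup, CRS has to delete the old files and create the new ones. Those operations failed, so each backup was left under a newly generated temporary file name instead.
From the output above, the problem is clearly caused by file permissions. It is not the bug mentioned above, purely a permissions problem.
The fix was to change the ownership of the default-named backup files to root:system and to delete the number{n}.ocr files by hand. After that, the 4-hourly backups ran normally and the cluster status was normal.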
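The manual cleanup of the numerically named files could look roughly like this. It is a sketch only: remove_stray_ocr_backups and its --delete switch are made-up names, and it defaults to a dry run so the candidates can be reviewed against ocrconfig -showbackup before anything is removed.

```shell
# Remove *.ocr files whose names consist only of digits (e.g. 91530747.ocr).
# Usage: remove_stray_ocr_backups <backup_dir> [--delete]
# Hypothetical helper; without --delete it only prints what it would remove.
remove_stray_ocr_backups() {
    backup_dir=$1
    mode=${2:-}
    for f in "$backup_dir"/*.ocr; do
        [ -f "$f" ] || continue
        stem=${f##*/}
        stem=${stem%.ocr}
        case $stem in
            ''|*[!0-9]*) continue ;;     # regular names such as backup00 are kept
        esac
        if [ "$mode" = "--delete" ]; then
            rm -f -- "$f"
        else
            printf 'would remove: %s\n' "$f"
        fi
    done
}
```

Running it first without --delete and then again with it keeps the destructive step explicit.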
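The root cause was an operator mistake: an installation was supposed to be carried out on a new machine, but the session was actually connected to a host that was already running, and commands such as
chown -R grid:oinstall /u01/app, chmod 755 /u01/app
were run there. After that, CRS ran into problems. Some remediation got CRS working again, but other directories were never corrected, which left a latent risk.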
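One way to look for leftovers of such a recursive chown is to list files whose owner is not the one expected for that directory. This is a rough sketch with an illustrative function name. Expected ownership differs across a Grid home, so it is only meaningful when pointed at a directory where a single owner is expected, such as the OCR backup directory.

```shell
# List regular files under a directory that are NOT owned by the expected user.
# Usage: find_wrong_owner <dir> <expected_user_or_uid>
# Hypothetical helper; compare the result with a healthy node before changing anything.
find_wrong_owner() {
    find "$1" -type f ! -user "$2" -print
}

# Example (directory is an assumption):
# find_wrong_owner "$GRID_HOME/cdata" root
```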