I’ve been going crazy fighting an ORA-00202 error for the last couple of weeks.
Whenever we ran a Level 0 backup, some of the monitoring processes would time out while waiting on the control file. The worst part is the error wasn’t consistent. About 90 minutes into a 3-hour backup, this error message would show up in the alert.log.
The message in the trace file seemed to suggest one of the control files was corrupt. Our initial research showed that after we brought the database down in immediate mode, diff reported the control files were indeed different.
My first inclination was that somehow rman was corrupting the control file during the backup. I know, pretty unlikely, but that’s where the symptoms pointed. I created a TAR and provided some info to OCS about my problem. I also learned about the debug option of rman (which is pretty cool, although I couldn’t tell you how to read the file). OCS confirmed that rman wasn’t corrupting the control file (like they would say it was anyway). They suggested it was our monitoring software that was causing a problem. Since I have this software running against multiple databases and this is the only one we’re having a problem with, I didn’t think that made much sense either.
I started investigating the workload on the server. This instance is not used very heavily and is on a pretty beefy server. While the backup was running, the vmstat statistics showed that the filesystem cache kept increasing and the available memory kept decreasing. Eventually, the system exhibited classic swapping symptoms, even though the usual indicators (vmstat’s si/so columns) showed almost no swap activity.
“Hmm”, I thought, “that shouldn’t happen because of direct I/O.”
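For reference, this is the pattern I was watching for: the cache column climbing while free falls, with the si/so swap columns staying near zero. A minimal sketch of pulling those columns out of vmstat output; the field positions assume the standard Linux vmstat layout, and the sample numbers are made up, not from my system:

```shell
# Extract free memory, filesystem cache, and swap activity from vmstat.
# Field positions in the standard Linux vmstat layout:
#   $4 = free, $6 = cache, $7 = si (swap in), $8 = so (swap out)
# In real use you'd pipe live output: vmstat 5 | parse_vmstat
parse_vmstat() {
  awk 'NR > 2 { printf "free=%s cache=%s si=%s so=%s\n", $4, $6, $7, $8 }'
}

# Illustrative input in vmstat's format (two header lines, then data):
printf '%s\n' \
  'procs memory swap io system cpu' \
  'r b swpd free buff cache si so bi bo in cs us sy id wa' \
  '1 0 0 512000 8000 6200000 0 0 120 300 500 900 5 2 90 3' |
  parse_vmstat
```

The tell is cache growing backup after backup while free shrinks, with si/so staying flat: the filesystem cache is eating memory that direct I/O should never have touched.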
Just for kicks, I verified that direct I/O was on by shutting down the database, starting it up in NOMOUNT, attaching strace to the database writer (DBWR) process, and then opening the database. I know that if the datafiles are opened with the O_DIRECT flag, you’re using direct I/O.
But wait, there was no O_DIRECT flag in the trace file.
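For anyone wanting to repeat the check, here’s roughly what it looks like. The PID, trace-file name, and datafile path below are illustrative, not from my system:

```shell
# Attach strace to the DBWR process while the database opens its files
# (12345 and dbw0.trc are illustrative):
#   strace -f -e trace=open -o dbw0.trc -p 12345

# A datafile opened for direct I/O shows up in the trace something
# like this (hypothetical path; O_DIRECT is the flag to look for):
cat > dbw0.trc <<'EOF'
open("/u01/oradata/PROD/system01.dbf", O_RDWR|O_SYNC|O_DIRECT) = 18
EOF

# Count the O_DIRECT opens; zero means the datafiles are NOT being
# opened with direct I/O, whatever filesystemio_options says:
grep -c 'O_DIRECT' dbw0.trc
```

In my case the grep came back empty: every open was missing the flag.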
Aha. I searched Metalink and sure enough, 220.127.116.11 doesn’t include direct I/O out of the box. I already had filesystemio_options=setall, but apparently you have to apply two patches on top of 18.104.22.168 for Linux. I applied the two patches in a new $OH, started the database out of the new $OH, and retested my backups. Sure enough, the ORA-00202 error went away.

My theory is that this “semi-swapping” caused extreme slowdowns on the system, which caused the monitoring software to experience long waits on the control file. I’m still not clear on why the control files were different when I brought the db down, but the database could be started with either the “good” or the “bad” control file.
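If you want to check your own setup, the parameter itself is easy to confirm from SQL*Plus, though as this whole episode shows, the parameter being set doesn’t guarantee the datafiles are actually being opened with O_DIRECT; the strace check is the real proof:

```sql
-- Confirm the current setting (SETALL enables both direct and async I/O):
SHOW PARAMETER filesystemio_options

-- Set it if needed; takes effect at the next instance restart:
ALTER SYSTEM SET filesystemio_options = SETALL SCOPE = SPFILE;
```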