Saturday, March 25, 2006

Working on my ORA-00202

I’ve been going crazy fighting an ORA-00202 error for the last couple of weeks.

Whenever we ran a Level 0 backup, some of the monitoring processes were getting a timeout while waiting on the control file. The worse part is the error wasn’t consistent. About 90 minutes into a 3 hour backup we would get this error message thrown in the alert.log.

The message in the trace file seemed to suggest one of the control files was corrupt. Our initial research showed that when we brought the database down in immediate mode, that “diff” reported the control files were indeed different.

My first inclination was that somehow rman was corrupting the control file during the backup. I know, pretty unlikely, but that’s where the symptoms pointed. I created a TAR and provided some info to OCS about my problem. I also learned about the debug option of rman (which is pretty cool, although I couldn’t tell you how to read the file). OCS confirmed that rman wasn’t corrupting the control file (like they would say it was anyway). They suggested it was our monitoring software that was causing a problem. Since I have this software running against multiple databases and this is the only one we’re having a problem with, I didn’t think that made much sense either.

I started investigating the workload on the server. This instance is not used very heavily and is on a pretty beefy server. While the backup was running, the vmstat statistics showed that the filesystem cache kept increasing and the available memory was decreasing. Eventually, the system exhibited classic swapping symptoms, but without the typical indicators.

“Hmm”, I thought, “that shouldn’t happen because of direct I/O.”

Just for kicks, I verified that direct I/O was on by shutting down the database, starting it up in NOMOUNT, starting strace on the dbwriter process, and opening the database. I know that if the datafiles are opened with the O_DIRECT flag, you’re using direct I/O.

But wait, there was no O_DIRECT flag in the trace file.

Aha. I searched Metalink and sure enough, 9.2.0.7 doesn’t include direct I/O out of the box. I already had the filesystemio_options=setall, but apparently, you have to apply two patches on top of 9.2.0.7 for Linux. I applied the two patches in a new $OH, started the database in the new $OH, and retested my backups. Sure enough, the ORA-00202 error went away. My theory is that this "semi-swapping" caused extreme slowdowns on the system which caused the monitoring software to experience long waits for the controlfile. I’m still not clear on why the control files were different when I brought the db down, but the database could be started with either the “good” or the “bad” control file.

3 comments:

Anonymous said...

We are still sitting at 9.2.0.6 where I work ( hpux rac on risc ). I looked briefly at 9.2.0.7 but at least for my environment saw too many things that right away had problems in 9.2.0.7. Oracle support and the amount of qa involved are not where we as customers want them to be. There seems to be some kind of huge disconnect between the development and maintenance organizations ( and qa or lack thereof ).

So I am sitting now and will think about 9.2.0.8 but we might be on 10.2 before then.

I guess it is better than selling pizzas but come on Tom and the other oracle employees please help bring an awareness up in the organization how bad this is getting.

Sorry rant over now.

Anonymous said...

jeff,

do you have the patch numbers? We'd like to look into this.

Are you saying direct io doesn't work in 9207 (or db's patched from 9204-6 to 9207)?

Jeff Hunter said...

Yeah, they're 4276957 and 419959.