Tuesday, October 11, 2005

Wicked ORA-27054, Part I

We're doing some testing with RMAN and 10.2.0.1 on Linux. Our standard backup strategy is to back up to NFS-mounted disk and then back up to tape at a later time. With 9.2 I could mount the NFS filesystems without any particular options and RMAN would run just fine.

On 10.2.0.1, we set up the backup and immediately got an error (ORA-27054: NFS file system where the file is created or resides is not mounted with correct options). So we submit a TAR and find out that 10.2 requires NFS filesystems to be mounted with the following options:

rsize=32768,wsize=32768,hard,noac
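
As a quick sanity check before rerunning a backup, the option string of a mount can be compared against the list above. The following is only a sketch: the helper names (`parse_mount_options`, `missing_options`) are made up for illustration, the required set is copied from the options quoted in this post, and it assumes you feed it a comma-separated option string such as the one in /etc/fstab. The kernel may report options differently from how they were specified, so treat the result as a hint, not proof.

```python
# Required options as quoted above; None means "flag with no value".
REQUIRED = {"rsize": "32768", "wsize": "32768", "hard": None, "noac": None}


def parse_mount_options(opts):
    """Turn 'a=1,b,c=2' into {'a': '1', 'b': None, 'c': '2'}."""
    parsed = {}
    for item in opts.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else None
    return parsed


def missing_options(opts, required=REQUIRED):
    """Return the required options that are absent or set to another value."""
    have = parse_mount_options(opts)
    problems = []
    for key, want in required.items():
        if key not in have:
            problems.append(key)
        elif want is not None and have[key] != want:
            problems.append(f"{key}={want}")
    return problems


if __name__ == "__main__":
    # Hypothetical option string for an NFS mount that would trip the check.
    print(missing_options("rw,hard,rsize=8192,wsize=32768"))
```

An empty list means every option from the list is present with the expected value; anything else names what to fix in the mount.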

The backup works now, except it takes 1 hour 9 minutes to back up a 2G database. We indicate this in the TAR and the analyst basically says "new problem, new TAR". So we create a new TAR for the performance issues, but I post a followup on the existing TAR:


Me: Is there any flag or something that we can change to turn off the checking that results in the ORA-27054?
OCS: I have seen the following being used on different problems (not related to RMAN), sometimes it works, sometimes it doesn't, I would not be able to explain why since it is not on my skills, also this parameter is not documented so I would not know what other effects could cause.

Set init.ora parameter:
_filesystemio_options=directio

You can also try:
_filesystemio_options=none

Hope this helps.

Well, at least he's being honest about it. Of course, we all know that filesystemio_options is not hidden in 10.2.

Me: This is a production system. We'd rather not "try" something if we don't know if it will solve our problem or not. Since you say you're not qualified, how do we get to someone who can tell us the answer?

OCS: Create a new tar.


OK, I get it. You're just trying to close as many TARs as possible. Glad to see my support dollars "at work".

So we pursue our "performance" problem with the other TAR and reference the first TAR. The analyst suggests we contact the vendor of our NAS to find out what they suggest. We explain this is plain NFS and it doesn't really have a vendor. They suggest getting an strace on the rman process. Now we're getting somewhere.

While they are looking at the TAR, one of my DBAs does some more testing. He backs up my 2G database to local disk in 4 minutes. He then copies the backup pieces to the NFS mounted disk mounted with the options specified by Oracle Support. 1 hour 9 minutes. Ah ha! Poor performance even with plain old cp.
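
To put those two timings side by side, here is a rough back-of-the-envelope calculation. It is a sketch under assumptions: "2G" is read as 2 GiB and the backup pieces are assumed to be about the size of the database, which may not be exactly true. The function name is made up for illustration.

```python
def throughput_mib_per_s(size_mib, minutes):
    """Average throughput in MiB/s for copying size_mib in the given minutes."""
    return size_mib / (minutes * 60)


SIZE_MIB = 2 * 1024  # assumption: "2G" means 2 GiB of backup pieces

local = throughput_mib_per_s(SIZE_MIB, 4)   # backup to local disk: 4 minutes
nfs = throughput_mib_per_s(SIZE_MIB, 69)    # cp to NFS mount: 1 hour 9 minutes

print(f"local disk: {local:.2f} MiB/s")
print(f"NFS mount:  {nfs:.2f} MiB/s")
print(f"slowdown:   {local / nfs:.2f}x")
```

Whatever the exact size of the pieces, the ratio depends only on the two times: the NFS copy is roughly 17 times slower than the local backup.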

I know what Oracle Support is going to say. I'd probably say it myself. "The fact that plain old cp is slow tells me your NFS is set up wrong." And I wouldn't disagree. The conundrum is that these are the options Oracle told us to use...

To be continued... (solution)

11 comments:

Unknown said...

It all depends on who gets your TAR. Some of the CS is god awful!

I feel your pain.

Jeff Hunter said...

I don't hold it against them, it's the way most customer service operates. Answer question, move on. Problem is, their definition of "answer" is different than mine most times.

There are lots of good people in Support. For example, I had one analyst stick with me through a complex multiplexing issue. The result ultimately (after about six weeks investigation on both sides) was "don't do that", but we both agreed that was the best solution in this particular situation. (Thanks Cameron).

Joel Garry said...

I believe the "hard" option means wait indefinitely, as opposed to "soft," which means return an error. Either way, your response is slow because NFS isn't able to respond to you right away. So I don't think Oracle is going to figure out your problem.

A lot depends on your NFS, and which protocol it is using - UDP bad. You say plain NFS - which plain NFS?

Jeff Hunter said...

Either way, your response is slow because NFS isn't able to respond to you right away

I don't disagree with you, except those are the options Oracle has specified I MUST use.

Anonymous said...

The way Oracle Support works is, once they get past the interview process, they get to answer customer TARs. Whether they are qualified or not is beside the point - no other vendor "qualifies" its support personnel, AFAIK. To be fair, the majority of TARs fall into RTFM, or you are too f**king stupid to be using a computer.

Robert Vollman said...

"OK, I get it. You're just trying to close as many TARs as possible."

My advice:
1. Make your TAR look as easy and quick as possible, because the senior people usually get first dibs, and they'll choose the easy ones.
2. Be polite and trust the person who is working on it, in order to get his best effort.
3. If, despite this, he isn't up to the task, ask for it to be escalated.
4. Don't open a new TAR. Hang onto the original TAR until it is handled properly.

CS is a funny game.

They SHOULD measure customer satisfaction, the rate at which their people are improving, and how many TARs their customers DON'T need to open because previous issues have been handled so well.

Instead they effectively measure how many TARs are opened. And more TARs is treated as a good thing. So their incentive is to do work that is sufficiently low-quality that you have to keep opening TARs.

I personally feel OCS should be treated as a last resort, and for work that can't be done yourself (i.e., fixing bugs). The best solution is to have highly trained DBAs and developers (and outside consultants when necessary) and try to fix or work around everything yourself.

Jeff Hunter said...

The best solution is to have highly trained DBAs and developers (and outside consultants when necessary) and try to fix or work around everything yourself.

I can't disagree with that. Some things, like this, you just can't get around.

Anonymous said...

Jeff, while recently browsing some forums
http://lists.suse.com/archive/suse-oracle/2005-Aug/0154.html
http://forums.oracle.com/forums/thread.jspa?messageID=1087472&tstart=0
I saw similar problems reported by other people. The most likely explanation seems to me to be the suggestion from Fabrizio Magni: the requirement to mount "hard" was introduced because of the new "switch database to copy" command. It also seems that the "hard,intr" option is the workaround for the backup performance problem. I couldn't test it on proper hardware yet (on my two virtual machines the backup time of an 800 MB database was 6 minutes with "hard,intr" and 9 minutes with only "hard"), but maybe this would work well in your environment.

Anonymous said...

Oops, just overlooked your "solved" part. My apologies.

Anonymous said...

I gather that you're not using the tcp,vers=3 options for some reason. I wonder if that causes your problems. Also, noac is usually only recommended for RAC databases, not single instance. I don't know your Linux kernel version so I am not really sure.
cp is single threaded so it is not a real test of your NAS storage performance by any means.