Tuesday, July 29, 2008

10.2.0.4 Post Mortem

In the end, my 10.2.0.4 SGA problem turned out to be the _db_block_cache_protect parameter. It seems that setting that parameter in 10.2.0.4 maps the SGA to /tmp/shm instead of real memory. The immediate cause of my:

ORA-27102: out of memory

Linux-x86_64 Error: 28: No space left on device

was that I didn't have enough space mounted at /tmp/shm. When I did allocate enough space, I got an ORA-07445 [skgmfixup_scaffolding()+129].
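
The failure described above (an SGA that needs more backing space than the mount has free) is something you can check for before starting the instance. Here is a minimal sketch in Python, assuming the SGA really is backed by the mount point named in this post. The function names are hypothetical and nothing here is an Oracle API:

```python
import shutil


def free_bytes(mount_point: str) -> int:
    """Return the bytes currently free on the filesystem holding mount_point."""
    return shutil.disk_usage(mount_point).free


def has_room_for_sga(mount_point: str, sga_bytes: int) -> bool:
    """True if the filesystem has at least sga_bytes free.

    This only compares free space against the requested size. It does not
    account for anything else that may be using the same mount.
    """
    return free_bytes(mount_point) >= sga_bytes
```

For example, `has_room_for_sga("/tmp/shm", 20 * 1024**3)` would report whether a hypothetical 20 GB SGA fits on that mount. A False result there is the condition that surfaced as error 28, "No space left on device".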

Anybody who deals with Oracle Support knows that when you get a different error message, that's essentially the kiss of death for your TAR. True to form, a new TAR was created with my new error message and the attempts to close the original TAR started. In the end, they would rather provide a workaround (don't use _db_block_cache_protect, 10.2 is much more stable) than a solution.

Unfortunately, the only help I can provide you is to not use _db_block_cache_protect in 10.2.0.4 with big SGAs.

6 comments:

Noons said...

The interesting thing is that this parameter's comment in x$ksppi is:
"protect database blocks (true only when debugging)".

Which seems to clearly indicate it's not for general consumption.
Why 10.2.0.4 had it on in the first place is worrying...

Jeff Hunter said...

Oh, I had it explicitly set on. We had multiple db block cache corruptions in the past and this has become a standard parameter.

Anonymous said...

Have you tried the _dont_throw_skgmfixup_scaffolding_error=true parameter? ; )

Just out of curiosity, have you tried creating a similar database using 11g? I would assume that you will have the same issues, but it never hurts to try.

Gandolf989

Noons said...

Oops, sorry I'm a bit late on this one.

What caused those block cache corruptions in the first place?

Earlier bug or problem with memory/hardware?

Jeff Hunter said...

What caused those block cache corruptions in the first place?

Ah yes, the initial block corruptions. Way back when we first moved to X86_64 we were on 9.2.0.X. We would intermittently get ORA-00600 or ORA-07445 error messages that Oracle Support tracked down to corruptions in the db block cache. When the dbwr wrote the block out to disk, dbwr recognized the block as corrupt, threw the error, and turfed the instance. We applied patch upon patch both to the database and the kernel until we finally had about 6 patches on top of 9.2.0.8 and another 20 or so patches to the OS. During the diagnostics process, Oracle suggested we turn on block checksum checking using the three aforementioned parameters and that almost eliminated the problem. The parameters stayed in the init.ora as we upgraded to 10.2 because we had no confidence that these bugs were fixed.

anthony said...

I read the article on the 10.2.0.4 NUMA and db_block_checking saga, and I have a similar but peculiar problem. Maybe I can offer a twist as well. My issue has to do with pre_page_sga=true. I can in fact boot any size SGA so long as the huge pages are proportionately larger than the SGA. The ratio is a mystery to me at the moment but hovers around 65% of the SGA. That is, if the huge pages are 35% higher than the size of the SGA.

Take a look at my test results below


Test Results


With pre_page_sga=true
SGA=12g, 13g, 14g
Huge Pages=16GB

I get ORA-00443: background process "PMON" did not start



With pre_page_sga=true
SGA=11g and below
Huge Pages=16GB

Works okay




With pre_page_sga=true
SGA=10g, 9g, 8g
Huge Pages=11GB
I get ORA-00443: background process "PMON" did not start

The problem was resolved when the SGA dropped to 7g



Final series of tests

With pre_page_sga=true
SGA=21g
Huge Pages=24GB
I get ORA-00443: background process "PMON" did not start


With pre_page_sga=false
SGA=21g
Huge Pages=24GB

No issues


With pre_page_sga=true
SGA=21g, 23g, 24g
Huge Pages=30GB

No issues.
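
The headroom in each of the tests above can be worked out directly from the reported figures. Below is a rough sketch in Python. The rows are transcribed from the results above, one row per SGA size listed, with "11g and below" represented by the 11g case only. `headroom_pct` and `report` are hypothetical helper names. The sketch only tabulates the numbers and does not assume any particular threshold:

```python
# Each row: (pre_page_sga, SGA in GB, huge pages in GB, instance started?)
RESULTS = [
    (True, 12, 16, False), (True, 13, 16, False), (True, 14, 16, False),
    (True, 11, 16, True),
    (True, 10, 11, False), (True, 9, 11, False), (True, 8, 11, False),
    (True, 7, 11, True),
    (True, 21, 24, False),
    (False, 21, 24, True),
    (True, 21, 30, True), (True, 23, 30, True), (True, 24, 30, True),
]


def headroom_pct(sga_gb, huge_gb):
    """Percentage by which the huge page pool exceeds the SGA size."""
    return (huge_gb - sga_gb) / sga_gb * 100.0


def report(results=RESULTS):
    """Return one formatted line per test, including the computed headroom."""
    lines = []
    for pre_page, sga, huge, started in results:
        outcome = "ok" if started else "ORA-00443"
        lines.append(
            f"pre_page_sga={str(pre_page).lower():5} SGA={sga:2}g "
            f"huge={huge:2}GB headroom={headroom_pct(sga, huge):5.1f}% {outcome}"
        )
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Laid out this way, the outcomes are not cleanly separated by a single percentage. The 24g SGA started with 25% headroom (30GB of huge pages), while the 12g SGA failed with about 33% (16GB), which fits the remark that the ratio is a mystery.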

I have a SUSE 10 box with 98g of memory and am able to bring up a 70G SGA with huge pages
with a single shared segment… and the best thing is it is 10.2.0.4!!! Wow, the saga continues. Why should an SGA with less memory start with multiple shared segments ....

In all cases the NUMA optimization is turned off. With NUMA optimization set to true, any size SGA can be booted and use huge pages regardless of the size of the huge page pool, provided it is bigger than the SGA… even by a few MB.


I am battling this one with Oracle as we speak.