Tuesday, July 29, 2008

10.2.0.4 Post Mortem

In the end, my 10.2.0.4 SGA problem came down to the _db_block_cache_protect parameter. It seems that setting that parameter in 10.2.0.4 maps the SGA to /tmp/shm instead of real memory. The immediate cause of my:

ORA-27102: out of memory

Linux-x86_64 Error: 28: No space left on device

was that I didn't have enough space mounted at /tmp/shm. When I did allocate enough space, I got an ORA-07445 [skgmfixup_scaffolding()+129].
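
If you want to check ahead of time whether a mount point has room for an SGA of a given size, something like the sketch below would do it. This is only an illustration: the function names are made up for this example, the 16GB figure is the SGA size from these posts, and the path is the one named above, so adjust it for your own system. It says nothing about how Oracle itself performs the allocation.

```python
import os

def free_bytes(path):
    """Bytes available to a non-root user on the filesystem holding `path`."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

def has_room_for_sga(path, sga_bytes):
    """True if the filesystem at `path` has at least `sga_bytes` free."""
    return free_bytes(path) >= sga_bytes

if __name__ == "__main__":
    mount = "/tmp/shm"        # path as named in the post; an assumption for your box
    sga = 16 * 1024 ** 3      # 16GB SGA, as in these posts
    if os.path.isdir(mount):
        print(mount, "has room:", has_room_for_sga(mount, sga))
    else:
        print(mount, "does not exist on this machine")
```

An ENOSPC (error 28) at startup is what you would expect whenever this check comes back false.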

Anybody who deals with Oracle Support knows that when you get a different error message, that's essentially the kiss of death for your TAR. True to form, a new TAR was created for my new error message and the attempts to close the original TAR began. In the end, they would rather provide a workaround (don't use _db_block_cache_protect, 10.2 is much more stable) than a solution.

Unfortunately, the only help I can provide you is to not use _db_block_cache_protect in 10.2.0.4 with big SGAs.

Tuesday, July 22, 2008

The 10.2.0.4 16g Solution

If you've been following along, I've had quite a tough time with 16GB SGAs on 10.2.0.4. With help from Kevin Closson and Oracle Support, we found that the init.ora parameters "_db_block_cache_protect=true", "_db_block_check_for_debug=true", and "db_block_checking=true" cause 10.2.0.4 to not be able to allocate a 16G SGA. Not a problem in 10.2.0.3 mind you, but definitely a problem in 10.2.0.4.

In the end, it wasn't NUMA after all.

Click the 10.2.0.4 label below to follow all threads on this issue.

Tuesday, July 15, 2008

The great NUMA debate, part V

There were some questions about whether I had actually turned NUMA off in one of my previous tests. I repeated the test with similar results.

If I looked at my /proc/cmdline, I got:
ro root=LABEL=/ console=ttyS1 pci=nommconf elevator=deadline selinux=0 numa=off

Then, I looked at my boot messages and saw:
NUMA turned off

Lastly, my numactl --hardware showed:
available: 1 nodes (0-0)
node 0 size: 34815 MB
node 0 free: 31507 MB

I still couldn't start a 16GB SGA. Interestingly enough, I couldn't start a 4G SGA either! I had to go back to booting without numa=off. The saga continues...
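
The checks above can be scripted so they are easy to repeat after every reboot. This is a rough sketch, not a definitive test: the helper names are invented for this example, and the sample strings are simply the /proc/cmdline and numactl output pasted above.

```python
import re

def numa_disabled_in_cmdline(cmdline):
    """True if the kernel command line contains the numa=off token."""
    return "numa=off" in cmdline.split()

def node_count(numactl_output):
    """Number of NUMA nodes reported by `numactl --hardware`, or None."""
    m = re.search(r"available: (\d+) nodes", numactl_output)
    return int(m.group(1)) if m else None

# Sample data copied from the output shown in this post.
CMDLINE = ("ro root=LABEL=/ console=ttyS1 pci=nommconf "
           "elevator=deadline selinux=0 numa=off")
NUMACTL = """available: 1 nodes (0-0)
node 0 size: 34815 MB
node 0 free: 31507 MB"""

print("numa=off on cmdline:", numa_disabled_in_cmdline(CMDLINE))
print("nodes visible:", node_count(NUMACTL))
```

On a live box you would read /proc/cmdline and run numactl --hardware instead of using the pasted strings.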

Click the 10.2.0.4 label below to follow all threads on this issue and the eventual solution.

Wednesday, July 02, 2008

The moment you've all been waiting for...

No, no, not the Brangelina twins announcement, the numactl output.

$ ps -ef | grep oracle

oracle 13771 2418 0 Jun24 pts/0 00:00:00 -ksh
oracle 13772 13771 0 09:24 pts/0 00:00:00 ksh -i
oracle 13775 13772 0 09:24 pts/0 00:00:00 ps -ef
oracle 13776 13772 0 09:24 pts/0 00:00:00 grep oracle

$ numactl --hardware

available: 4 nodes (0-3)
node 0 size: 10239 MB
node 0 free: 7854 MB
node 1 size: 8191 MB
node 1 free: 5725 MB
node 2 size: 8191 MB
node 2 free: 6757 MB
node 3 size: 8191 MB
node 3 free: 6671 MB

OK, so maybe not everybody was waiting for that. Oracle Support requested an strace of the startup command, so I had to bring the db down anyway. The strace was a good idea; they'll be able to see the system calls being made and such. Maybe we'll get some progress yet...

If you're following along at home: Part I, Part II, Part III

More info from the comments:

I shut down everything again:
available: 4 nodes (0-3)
node 0 size: 10239 MB
node 0 free: 7854 MB
node 1 size: 8191 MB
node 1 free: 4846 MB
node 2 size: 8191 MB
node 2 free: 6744 MB
node 3 size: 8191 MB
node 3 free: 6646 MB

No surprises there (except, of course, for the strangeness KC pointed out before). Then I decided to startup nomount with a 10.2.0.3 $OH and a 16g SGA:

available: 4 nodes (0-3)
node 0 size: 10239 MB
node 0 free: 7660 MB
node 1 size: 8191 MB
node 1 free: 4315 MB
node 2 size: 8191 MB
node 2 free: 6708 MB
node 3 size: 8191 MB
node 3 free: 6646 MB

Very unexpected. It looks like whatever this is showing is not affected by a 16g SGA.

I changed my $OH back to 10.2.0.4 and started a 12g SGA:
available: 4 nodes (0-3)
node 0 size: 10239 MB
node 0 free: 25 MB
node 1 size: 8191 MB
node 1 free: 257 MB
node 2 size: 8191 MB
node 2 free: 6740 MB
node 3 size: 8191 MB
node 3 free: 6642 MB

This is more like what I expected. The 12g SGA leaves nodes 0 and 1 with next to nothing free. If we believe the previous output, that's my 12g of RAM?
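
One way to sanity-check that guess is to subtract the per-node free numbers in this last output from the "everything shut down" output earlier in this post. The sketch below does that arithmetic. The function names are invented for this example, and the baseline is an assumption: it was taken before the 10.2.0.3 experiment, so treat the result as approximate.

```python
import re

def parse_free_mb(numactl_output):
    """Return {node_id: free_mb} parsed from `numactl --hardware` text."""
    free = {}
    for line in numactl_output.splitlines():
        m = re.match(r"\s*node (\d+) free: (\d+) MB", line)
        if m:
            free[int(m.group(1))] = int(m.group(2))
    return free

def free_drop_mb(before, after):
    """Per-node drop in free memory (MB) between two snapshots."""
    b, a = parse_free_mb(before), parse_free_mb(after)
    return {node: b[node] - a[node] for node in sorted(b)}

# Free-memory lines copied from the two outputs in this post.
BEFORE = """node 0 free: 7854 MB
node 1 free: 4846 MB
node 2 free: 6744 MB
node 3 free: 6646 MB"""

AFTER_12G = """node 0 free: 25 MB
node 1 free: 257 MB
node 2 free: 6740 MB
node 3 free: 6642 MB"""

drop = free_drop_mb(BEFORE, AFTER_12G)
total = sum(drop.values())
print(drop)
print("total drop:", total, "MB; a 12g SGA is", 12 * 1024, "MB")
```

With these numbers the drop is 7829 MB on node 0 and 4589 MB on node 1, about 12426 MB in total, which is in the same ballpark as the 12288 MB of a 12g SGA.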

Click the 10.2.0.4 label below to follow all threads on this issue and the eventual solution.