Monday, August 17, 2009

SGA Latch Contention

We had been getting a pretty persistent "WARNING: inbound connection timed out (ORA-3136)" error thrown to the alert.log on one of our 10.2.0.4 dbs. This error wasn't received once or twice, but about 40 times within a 2 minute window.
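For anyone who wants to measure a burst like this rather than eyeball it, here is a minimal sketch that counts ORA-3136 warnings per time window. It assumes the 10g-style alert.log layout where each entry is preceded by a timestamp line such as "Mon Aug 17 10:15:32 2009"; the function names are made up for illustration.

```python
from datetime import datetime, timedelta

# Assumed alert.log timestamp line, e.g. "Mon Aug 17 10:15:32 2009".
TS_FORMAT = "%a %b %d %H:%M:%S %Y"


def ora3136_times(lines):
    """Return the most recent timestamp seen before each ORA-3136 line."""
    current = None
    hits = []
    for line in lines:
        text = line.strip()
        try:
            current = datetime.strptime(text, TS_FORMAT)
            continue
        except ValueError:
            pass  # not a timestamp line, fall through to the message check
        if "ORA-3136" in text and current is not None:
            hits.append(current)
    return hits


def max_in_window(times, window=timedelta(minutes=2)):
    """Largest number of warnings that fall inside any single window."""
    times = sorted(times)
    best = 0
    start = 0
    for end, t in enumerate(times):
        # Slide the left edge forward until the span fits in the window.
        while t - times[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best
```

Feeding it the lines of the alert.log (for example `max_in_window(ora3136_times(open(path)))`, with `path` pointing at your own file) gives the peak count for a two-minute window.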

I did the research on it and found some sources that suggested my SQLNET.INBOUND_CONNECT_TIMEOUT needed to be adjusted upwards from the default value of 60 to 120 or something higher. I was pretty sure that wasn't the case as it was already set at 120. For the sake of progress, I bumped the timeout to 300 (5 minutes) and monitored closely.

About two days later, we received the same warning message in the alert.log file, except this time it happened about 90 times in a two minute period. Aha, a change. So that told me that something was going wrong during the authentication phase since increasing the connection timeout made the problem worse.

During the last occurrence, I also noticed that the mman process was consuming 100% of the CPU. At the time I wasn't sure if it was the cause or just a symptom of the problem. However, as soon as the problem went away, the ora_mman process went back down to nearly 0%.

I set up a script to automatically do three systemstate dumps a minute apart when the mman process went to 100%. Oracle Support was able to tell me that during this time period, a bunch of my SQL had been invalidated and was waiting on a latch to be loaded back into the shared_pool. They also indicated that this massive amount of reparsing could happen because somebody did a DDL on a popular object or somebody flushed the shared pool.
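A rough sketch of that kind of watcher follows. It is not the original script; it assumes a Unix host where `ps -eo time,args` reports cumulative CPU time as [dd-]hh:mm:ss, a single instance whose process is named ora_mman_<SID>, and OS authentication for `sqlplus "/ as sysdba"`. The dump level (266), threshold, and polling interval are illustrative choices, not values from the post.

```python
import subprocess
import time

# Assumed sqlplus input. Level 266 is an illustrative choice.
ORADEBUG_SCRIPT = """oradebug setmypid
oradebug unlimit
oradebug dump systemstate 266
exit
"""


def cpu_seconds(text):
    """Convert ps cumulative CPU time, [dd-]hh:mm:ss, to seconds."""
    days, _, clock = text.strip().rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days or 0) * 86400 + hours * 3600 + minutes * 60 + seconds


def mman_cpu_seconds(ps_output):
    """Find the first ora_mman_* row in `ps -eo time,args` output."""
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[1].startswith("ora_mman_"):
            return cpu_seconds(fields[0])
    return None


def percent_busy(cpu_before, cpu_after, wall_seconds):
    """CPU seconds burned per wall-clock second, as a percentage."""
    return 100.0 * (cpu_after - cpu_before) / wall_seconds


def sample():
    """Take one reading of mman's cumulative CPU time."""
    out = subprocess.run(["ps", "-eo", "time,args"],
                         capture_output=True, text=True, check=True).stdout
    return mman_cpu_seconds(out)


def watch(threshold=90.0, poll_seconds=10, dumps=3, gap_seconds=60):
    """Poll until mman is pegged, then take the dumps and stop."""
    before = sample()
    while True:
        time.sleep(poll_seconds)
        after = sample()
        if before is not None and after is not None:
            if percent_busy(before, after, poll_seconds) >= threshold:
                for i in range(dumps):
                    subprocess.run(["sqlplus", "-s", "/ as sysdba"],
                                   input=ORADEBUG_SCRIPT, text=True, check=True)
                    if i < dumps - 1:
                        time.sleep(gap_seconds)
                return
        before = after
```

The sketch compares two readings of cumulative CPU time instead of trusting the %CPU column, because on some platforms `ps` reports %CPU averaged over the life of the process, which would hide a short spike. Calling `watch()` as the oracle OS user starts the polling loop.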

I was pretty sure nobody flushed the shared pool, but that got me thinking: what would happen if Oracle shrank my shared pool as part of automatic shared memory management (ASMM)? I checked v$sga_resize_ops and found that around the times of my warning messages, the shared_pool was being shrunk. I brought this up to the support analyst and he suggested I set the shared_pool to the maximum size Oracle had resized it to.
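The check against v$sga_resize_ops can be sketched roughly like this. The query uses columns from that view; the helper names, the five-minute slack, and the assumption that rows arrive as tuples in the query's column order with both timestamps populated are mine, not from the original investigation.

```python
from datetime import datetime, timedelta

# Run this with whatever Oracle client you have; rows are expected as
# tuples in the column order shown here.
RESIZE_QUERY = """
SELECT component, oper_type, initial_size, final_size, start_time, end_time
  FROM v$sga_resize_ops
 WHERE component = 'shared pool'
 ORDER BY start_time
"""


def shrinks_near(ops, warning_times, slack=timedelta(minutes=5)):
    """Return SHRINK rows that overlap a warning time, give or take slack."""
    matches = []
    for op in ops:
        component, oper_type, initial_size, final_size, start, end = op
        if oper_type != "SHRINK":
            continue
        if any(start - slack <= t <= end + slack for t in warning_times):
            matches.append(op)
    return matches


def largest_final_size(ops):
    """Largest size (in bytes) the component reached after any resize."""
    return max(op[3] for op in ops)
```

`shrinks_near` lines up the resize history with the alert.log warning times, and `largest_final_size` gives the figure support suggested using for shared_pool_size.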

That was three days ago, and we haven't had a warning since.

The theory is that while the shared_pool was being resized in the SGA, Oracle grabbed a latch. The memory resize operation took a while, and while Oracle held that latch, nobody could log in. Interesting theory, we'll see if it holds.

3 comments:

Bjoern S. said...

Hi Jeff,

a very interesting post.

We happen to have these ORA-3136 errors in the alert_log too, and we also use the ASMM option of Oracle 10g.

However, I didn't investigate these messages deeper until now. Thanks for the pointer, I'll see if I can verify your findings on our machines.

Regards,
Bjoern

Log Buffer said...

"Jeff Hunter for his part, encountered and bested SGA Latch Contention."

Log Buffer #159

Ines said...

Fantastic!