Thursday, December 29, 2005

Funky redo behaviour in 10g

I'll admit it, I don't have that much exposure to 10g just yet. I'm excited about it, I just don't have much experience on it.

We put up a development database on 10.2 a couple months back just so we could start getting experience on 10g. As we add more and more development schemas to this database, we're getting a taste for the new features.

We've been getting a disproportionate number of "Checkpoint not complete" messages lately. Sure, there's pretty heavy write activity going on at the time, but not enough to fill all my redo logs. I have six groups of 128M logfiles with two members each. During the heavy write activity, my logs get switched a few times in a minute. However, I know my log_archive_dest can keep up with the write activity, so there should be no problem archiving them that fast. From the alert.log:
Thu Dec 29 19:05:37 2005
Thread 1 cannot allocate new log, sequence 9945
Private strand flush not complete
Current log# 4 seq# 9944 mem# 0: /u01/oradata/demo1/redo04_01.log
Current log# 4 seq# 9944 mem# 1: /u02/oradata/demo1/redo04_02.log
Thread 1 advanced to log sequence 9945
Current log# 5 seq# 9945 mem# 0: /u01/oradata/demo1/redo05_01.log
Current log# 5 seq# 9945 mem# 1: /u02/oradata/demo1/redo05_02.log
Thu Dec 29 19:06:28 2005
Thread 1 advanced to log sequence 9946
Current log# 1 seq# 9946 mem# 0: /u01/oradata/demo1/redo01_01.log
Current log# 1 seq# 9946 mem# 1: /u02/oradata/demo1/redo01_02.log
Thu Dec 29 19:06:42 2005
Thread 1 advanced to log sequence 9947
Current log# 2 seq# 9947 mem# 0: /u01/oradata/demo1/redo02_01.log
Current log# 2 seq# 9947 mem# 1: /u02/oradata/demo1/redo02_02.log
Thu Dec 29 19:07:00 2005
Thread 1 cannot allocate new log, sequence 9948
Checkpoint not complete
Current log# 2 seq# 9947 mem# 0: /u01/oradata/demo1/redo02_01.log
Current log# 2 seq# 9947 mem# 1: /u02/oradata/demo1/redo02_02.log
Thread 1 advanced to log sequence 9948

After the write activity is over, I look at v$logfile and v$log and see that several groups are still ACTIVE.

I then doubled the size of the logfiles and reran the big job that caused the "Checkpoint not complete" error. While the job didn't cause a "Checkpoint not complete" error, it did make three of the log groups ACTIVE, which I kind of expected. However, 20-30 minutes later, I looked at v$logfile and v$log to see almost all my log groups were either ACTIVE or CURRENT.

system@demo1.us> l
1 select l.group#, lf.member, l.bytes/1024/1024 mb, l.status, l.archived
2 from v$logfile lf, v$log l
3* where l.group# = lf.group#

GROUP# MEMBER MB STATUS ARC
------ ------------------------------- ----- ---------- ---
1 /u01/oradata/demo1/redo01_01.log 256 CURRENT NO
3 /u01/oradata/demo1/redo03_01.log 256 ACTIVE YES
3 /u02/oradata/demo1/redo03_02.log 256 ACTIVE YES
4 /u01/oradata/demo1/redo04_01.log 256 ACTIVE YES
4 /u02/oradata/demo1/redo04_02.log 256 ACTIVE YES
5 /u01/oradata/demo1/redo05_01.log 256 ACTIVE YES
5 /u02/oradata/demo1/redo05_02.log 256 ACTIVE YES
2 /u01/oradata/demo1/redo02_01.log 256 INACTIVE YES
2 /u02/oradata/demo1/redo02_02.log 256 INACTIVE YES
1 /u02/oradata/demo1/redo01_02.log 256 CURRENT NO
6 /u01/oradata/demo1/redo06_01.log 256 ACTIVE YES
6 /u02/oradata/demo1/redo06_02.log 256 ACTIVE YES
-----
Total 3072

12 rows selected.

Elapsed: 00:00:00.05

During light load, I would expect one group CURRENT and maybe one group ACTIVE. The other funny thing is the logs show they are archived, which if I look at the log_archive_dest, they are. I have to do some more research, but I suspect there's an init.ora parameter or something that isn't set correctly...

8 comments:

'Thought & Humor' said...

You have a riveting web log
and undoubtedly must have
atypical & quiescent potential
for your intended readership.

Happy New Year,
Dr. Howdy

'Thought & Humor'

Howard J. Rogers said...

Checkpoint incompletes and lots of ACTIVEs can be due to a sluggish DBWR as well as ARCH behaviour. How are your free buffer waits?

Jeff Hunter said...

Statspack tells me it's not even one of my top wait events.

Anonymous said...

There is probably some internal tuning parameter that controls flushing of the IMU buffer to the redo log buffer. It is probably set wrong or there is a bug involving the conversation between the dbwr and the strand control. Check _db_writer_flush_imu and _imu_pools. The former is hopefully true, and the latter is likely 3. Play with the latter.

Aman Sharma said...

wish you a great and so much 2006 sir.Happy new year to you.
with best wishes and regards
Aman Sharma.

Ramn said...

Howard,

"Checkpoint incompletes and lots of ACTIVEs can be due to a sluggish DBWR as well as ARCH behaviour."

I understand how sluggish DBWR can cause ACTIVES in this case. Can you please explain how ARCH can affect things. Thanks.

Ram.

Dave said...

because if arch is archiving a log and is taking a while then suddenly another log needs archiving then you will wait

Anonymous said...

I have the same behaviour on two of my databases, did you solve this or is this a bug. Which oracle version are you using?