A couple of days ago the network link to my standby databases went dark in the middle of the night. My dbs took notice and started generating log shipping errors in the alert.log. About 10 minutes later, I got a page from my standby db that said his gap was more than the acceptable level and could I please do something. No problem, I just turned log shipping off until the link came back up. The link came back up later in the day, I resumed log shipping, and most of my dbs recognized the gap and started resending logs.
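For anyone who hasn't done it, turning log shipping off and back on usually comes down to deferring and re-enabling the archive destination that points at the standby. This is only a sketch: it assumes the standby is destination 2, which may not match your setup.

```sql
-- Assumption: log_archive_dest_2 is the destination that ships to the standby.
-- Run on the primary while the link is down to stop the shipping attempts:
ALTER SYSTEM SET log_archive_dest_state_2 = DEFER;

-- Run on the primary once the link is back to resume shipping:
ALTER SYSTEM SET log_archive_dest_state_2 = ENABLE;
```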
The operative word being "most".
One of my dbs was shipping an archived redo log when the link went down. When I resumed log shipping, Oracle thought he was still shipping a log and thought the gap started at the next log sequence number. He shipped the current logs just fine, but just wouldn't resolve the gap.
"No worries", I thought, "I'll just rcp the missing files and register them with the standby control file and I'll be on my merry way."
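The registration step looks roughly like the following once the files have been copied over. It's a sketch that assumes a physical standby, and the file name is a made-up placeholder, not one of my actual logs.

```sql
-- Run on the standby after copying the missing archived logs over.
-- The path below is hypothetical; use the real file name for each missing sequence.
ALTER DATABASE REGISTER LOGFILE '/hypothetical/arch/1_1234.arc';

-- Still on the standby: no rows returned means no gap is currently detected.
SELECT thread#, low_sequence#, high_sequence#
FROM   v$archive_gap;
```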
Sure enough, the standby started recovering immediately and caught up in a short time. I thought everything was dandy until I got a page on Monday morning (2 AM, thank you very much) that my log_archive_dest was about to fill up on the primary.
Our standard protocol for freeing up the log_archive_dest is to run an archivelog backup with DELETE INPUT specified. I kicked off the backup, killed the monitor that watched the log_archive_dest, and went back to bed. I set the alarm for 1 hour later just to make sure the backup was done.
When I got back up, the archivelog backup wasn't done and the log_archive_dest was at 99%. I then looked in the directory and saw logs that were 3 and 4 days old which I knew was not correct. I killed the backup, deleted the 4 day old logs, crosschecked, and restarted the backup.
When I got into the office, my main task was to figure out why this happened. When I looked at my RMAN message log, I noticed that nearly every log being backed up was accompanied by a warning:
RMAN-08137: WARNING: archive log not deleted as it is still needed
Hmm, that's not cool. Maybe my standby was really out of sync and I just didn't know it. Sometimes the primary will ship logs and the standby will accept but not apply them because it is missing a log. But that wasn't my case; the standby was within one or two logs of the primary.
When I looked at v$archived_log on the primary, the "applied" flag indicated that the primary thought a bunch of logs had not been applied to the standby. I knew otherwise, as the standby was reasonably current.
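A comparison along these lines shows the mismatch. It's a sketch that assumes the view in question is v$archived_log, which is where the APPLIED column lives; the column choice and ordering are just my preference.

```sql
-- On the primary: what the primary believes about logs shipped to the standby.
SELECT thread#, sequence#, dest_id, applied, completion_time
FROM   v$archived_log
WHERE  standby_dest = 'YES'
ORDER  BY thread#, sequence#;

-- On the standby: the highest sequence actually applied, per thread.
SELECT thread#, MAX(sequence#) AS last_applied
FROM   v$archived_log
WHERE  applied = 'YES'
GROUP  BY thread#;
```

If the first query shows APPLIED = 'NO' for sequences that sit below the standby's last applied sequence, the primary's bookkeeping is stale even though the standby is current.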
At that point, I needed assistance. I filed a TAR and within about 2 hours found out it is a bug (bug 4538727) that is fixed in the 10.2.0.4 patchset. Since the patch just came out, I haven't gotten a chance to apply it to anything so I asked for workarounds.
"None", they say.
Theoretically, I suppose I could recreate my control files and my standby control files. I'm not really excited about doing that.
Until I figure out what to do, I'll just have to use RMAN to delete my logs "backed up 1 times" and crosscheck to manage my log_archive_dest space.
My Resolution, sort of.