Tuesday, May 24, 2005

Backup Top 11

I've got to admit, I'm pretty anal about backup and recovery. I've done some things without a proper backup in place and gotten burned because something went wrong. You only have to do that a couple times before you understand why you need a proper backup/recovery methodology. Before I let a new hire on my production systems, I make sure they understand how my backup/recovery works. In addition, I make them prove they know it by having them diagnose and recover a development database that I have crashed. I don't know everything about backup and recovery, but these are some lessons I have learned.

11. Send it off site
If you're going through the trouble of backing your data up to magnetic media, send it off site. In case of a fire (or god forbid another type of disaster) in your main data center you will be able to recover to a point in time. The best method is to duplicate your media and send the duplicate off site. If you can't duplicate your media, send it off site to a company that can return it to you 24x7. Yes, it's a pain duplicating your tapes and sending them off site, but when your power goes out at 16:04 (like in 8/2003 in the US) the banks will be closed and you won't be able to get to your media until the next business day.

10. Export/import is not a primary backup methodology.
Export and Import are fine tools for transferring data between databases as long as you can prevent changes from occurring. People usually setup export as their backup for one of two reasons; they feel "safe" because they have a copy of their data on disk or because it is easy to configure. The drawback is that using export for backup will only give you a single point-in-time to which you can recover. It is very unlikely that your database will crash the moment the export finishes. There are a couple of valid reasons why you want to use export as a secondary backup method, but as your primary backup methodology, forget it. If you do use exp/imp for a secondary backup method, make sure you use the consistent=y option to get a consistent copy of your data and use compress=n to make sure all your data doesn't go into one extent.

9. The needs of the business will dictate the backup methodology and frequency.
The rules under which your business operates will dictate the appropriate backup methodology. If your database must be up 24x7, there is no doubt you will need to run some sort of online backup (hot backup). If you have a short window for your Mean Time To Recovery (MTTR) it may mean a full backup at least once a day.

8. Have at least two backups available at all times.
I have talked to so many people that recycle their backup media every day. They stick the tape in the drive and backup their database. The tape remains in the drive for the next day when the backup overwrites the previous day's backup. When the database crashes during the backup, you have no way to recover.

7. Use RMAN over in-house developed scripts.
In-house developed scripts are error prone. Sure, there are a number of scripts out there that work in certain circumstances. RMAN works in every environment. Learn it, understand it, use it. Yes, it's different. The ability to run incremental backups alone makes the learning worthwhile.

6. Use a recovery catalog over the controlfile.
Both your recovery catalog and your control files store metadata about your database. The amount of history you can keep in the controlfile is limited by the control_file_keep_time init.ora parameter. In addition, the controlfile only stores information about the current incarnation of the database. If you ever need to restore before a resetlogs, you must use a recovery catalog. Last but not least, a recovery catalog lets you store your backup and recovery scripts whereas the controlfile does not.

5. Know how to do basic restore scenarios.
Your restore/recover methodology must include three basic recovery scenarios; recover the entire database, recover a tablespace, and recover a datafile. Most other recoveries will be a modification of these three basic types.

4. Not running in archivelog mode is a disaster waiting to happen.
When your database runs in archivelog mode you have two advantages. First and foremost, you can recover to any point in time. Second, you can recover from user DML errors by using logminer.

3. Cold backups are worthless (on a production system).
Cold (offline) backups offer several disadvantages but no advantages to a backup plan. The database has to be closed for a cold backup. The cache is empty when the database comes back up. All the SQL has to be reparsed. There are times you may want a to use a cold backup for maintenance reasons (major upgrade, moving database to a new host), but they offer no advantages in a daily backup plan.

2. Write it down.
Who knows what kind of disaster you will have. Maybe your building will be the next WTC. Maybe you got hit by a bus at 02:12 while you were coming in to do the restore. Maybe you've been at work for 18 hours and your brain is fried. Your thinking during a recovery should be limited. Write down your recovery scenarios so your manager could perform them.

1. Practice, Practice, Practice.
Practice your restore scenarios. You can be confident in your recovery because you just did it three months ago. If your test recovery doesn't work, it's better to know before you need it.

15 comments:

DaPi said...

Jeff,

I'd add another paragraph along the lines of:

"Keep It Somewhere Else"
It's possible to lose all or part of a machine room, so: Cycle tapes off-site somehow (during a transition period, we hired a safe deposit box in a bank). Send a copy of your archlogs as far away from the server as you can - anything from a secure PC at the other end of the floor, to a server in a different city. It's the small organisations that have to be the most imaginative.

DaPi said...

ahhh yes, it's a top 10 - I'd replace the cold backup paragraph with "Keep It Somewhere Else"

Meant to say: GREAT STUFF - with you all the way. I'm not anal about backups - PARANOID is the word.

Jeff Hunter said...

Yeah, I can agree with that.

Thomas Kyte said...

10. Export/import -- and don't forget consistent=y and compress=n else the odds you can import are near zero (tables change while you are exporting, export parent at noon and child at 1pm, you better do both "as of" the same point in time!)

and don't forget to import every now and again, just to see if it actually works.


And remember the only thing a dba is not allowed to get wrong is recovery (not backup, recovery) we can fix everything else, but if you botch that, forget it :)

DaPi said...

"Yeah, I can agree with that."

Me being paranoid?

Jeff Hunter said...

Thanks for the input guys. I have incorporated the changes.

Peter K said...

I would also recommend that with the testing (practice, practice, practice) that you also document the recovery scenarios so that you have a step by step document to guide you when the real thing happen.

Also do not skim on recovery as this is probably the critical component as Tom said, everything else can be fixed. Make sure that you do have the right resources (hardware and otherwise).

Axr2 said...

Jeff,
Point 6 - use RMAN catalog over target controlfile. I've always wondered about this one. I'm not sure I'm convinced.

No matter what your controlfile_keep_time is set to (non-trivial value), no matter how many target incarnations you have - everything is restorable without a catalog. As long as you've been consistently backing up your controlfile (and rest of the stuff), you're good. There are documented ways (slightly painful) to restore everything without using the catalog.

Am I missing something obvious here?
Thanks.

Jeff Hunter said...

I haven't tested it, but I suppose if you have the control file from before the resetlogs you can recover a previous incarnation.

Daniel Fink said...

I can't recall the exact quote, but a line from Tim Gorman's (and Gary Dodge's) book on Data Warehousing was a seminal moment in my Oracle career. "The responsibility of the dba is not to backup the database, but to restore it." SNAP on goes the lightbulb! I can recall many times where a database was 'lost' because of a bad backup that no one ever tested. In 7.3 there was a bug that corrupted archived redo logs and the only way to test them was to apply them to a backup...often this was too late.

Start with "What are the business's data recovery requirements?" and work from there. And then test, test, test and finally test. In fact, you should test 43 times (42 plus 1 for good measure).

vexed said...

One crucial item seems to be omitted from the list: make sure you can recover from your backups! I've seen many scenarios where a user was taking thorough backups only to discover that when he needed to restore one, the backups themselves were corrupted. Or he was missing archive logs. And so forth.

I make it a practice to perform a routine recovery of a development database using real backup files to ensure these backup files can actually be used to restore the database.

Joel Garry said...

12. Don't use the cheapest tapes.

13. Understand what compress=N really does. If you are propagating a huge initial, that's just silly.

14. It is ironic that the procedure with the least tolerance for error is the most dependent on error-prone highly skilled manual labor - and is expected to be performed by the least skilled staff.

15. It's recovery. You need to figure out if your data warehouse, test/qa databases etc. really will need recovery. As opposed to just reloading from other sources.

16. Don't be imaginative. It's recovery. Think "rock fall on computer. Mongo fix." Iterate through range of rock sizes from cosmic ray to asteroid.

Herodt said...

Off site.

and keep the tape drive way up off of the floor.

Came into the office one morning to 2 feet 8 inches of water in the server room. No water pipes in the server room, but sprinkler pipe in an adjoining room had broken and flooded the entire area.
And the drain was plugged with paper from the file storage room where the pipe broke.

8 servers under water.. UPS's still working, blinking green lights under water. Tape drives under water. Off site tape saved our butts.. but had to wait DAYS for new tape drives to show up.
New servers were there in 24hours, tape drives got back ordered.

In the new server room. Tape array is 4 feet off the ground and stand alone tapes are in the top spot of the racks.

BackupDigger said...

Top 5 list for MySQL backups (I guess it will work for all database backups).

http://www.zmanda.com/mysql-backup-considerations.html

iron garden gates said...
This comment has been removed by a blog administrator.