After reading a while ago about the problem in the U.S where Database corruption causes 100000 air force investigation to be lost, it got me thinking about stressing the importance of not only a database backup plan but also a recovery plan.
I’ve worked as a DBA long enough to understand if you’re not interested in ensuring that the recovery of data is 100% achievable then your in the wrong job. Situations like the one the U.S air force faced are all too common but yet so easily avoidable.
I’d say most DBA’s have come across this situation once in their career, in fact I experienced this first hand less than a year ago. My employer at the time was in the middle of upgrading its CRM platform which was being carried out by an external partner, one of the DBA’s within my department was tasked with ensuring all the development SQL instances we’re backed up, he’d duly setup the relevant SQL Agent jobs and tested they worked, job done. Not quite, he didn’t check these jobs were successful on a daily basis or add this to a monitoring solution we’d developed which automatically checked for job failures etc, but his attitude was “if I don’t get a failures emails then all is good in the world”, this might of been slightly ok but sadly for him he’d not even set the notification of failures email option on the agent jobs so clearly no alert emails were sent in the event of a failure.
Roll forward six months and disaster strikes, a newly installed SAN went wrong badly wrong and all the data store was gone. All the live system were restored seamlessly but then the external partners wanted the development environment restoring. I can’t really describe the panic stricken face my colleague had but it wasn’t good.
That all said it’s ok having database backups but do they work? I feel is something most DBA’s seem to blindly put their careers in the hands of SQL Server and don’t bother checking. Not myself, I’ve always been quite particular in this area wanting to know 100% if disaster strikes I’m covered.
So I developed a process which extracts all the last full database backups, for all instances on our estate (or selected instances if we choose) and then restores each database onto a separate SQL Instance and also run CHECKDB against the restored database looking for any further problems.
This might sound anal but its the only way you can 100% percent guarantee you ain’t going to lose six months worth of development work and then most likely your job.
Check, Check and then have something automated (non human) to check its the only way!