
> testing that all of it is valid is not trivial

If it was easy then people wouldn't have to be reminded to do it—it'd just get done. :)

One way could be to pre-compute checksums of the files in question and then verify them periodically:

* https://aide.github.io

* https://packages.debian.org/search?keywords=aide
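The pre-compute-then-verify idea can be sketched in plain Python with hashlib (this just stands in for what a tool like AIDE does with its database; the function names are mine):

```python
import hashlib
import os

def sha256_of(path, bufsize=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def snapshot(root):
    """Pre-compute checksums for every file under root."""
    sums = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            p = os.path.join(dirpath, name)
            sums[os.path.relpath(p, root)] = sha256_of(p)
    return sums

def verify(root, sums):
    """Return the relative paths whose contents no longer match the snapshot."""
    return [rel for rel, digest in sums.items()
            if sha256_of(os.path.join(root, rel)) != digest]
```

Store the snapshot somewhere the backed-up data can't silently overwrite it, and run verify periodically.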

It's also why people suggest using ZFS: it has checksums on the live data to make sure bits aren't flipped, and those checksums can then be replicated to remote copies (zfs send/recv) to make sure your copies are also coherent. A service like rsync.net can host remote ZFS datasets, which can also be encrypted.



Being 100% confident you've got a bit-for-bit copy of what you think you have doesn't mean you have a restore process that works.

Unless you've tested and proven to yourself that you can bring up a working system from your backups, you've only done the first half of your disaster recovery work.

For me, that means one of: restoring onto your cold spare hardware (either identical or similar enough for you); shutting down your system, pulling its drives, replacing them with blank ones, and restoring onto those (but be careful you don't end up depending on that specific hardware/firmware/peripheral combination, because in 18 months' time when someone backs a truck into your office and loads all your electronics into it, you might not be able to get an identical system); or restoring a running system onto newly provisioned instances at your cloud provider of choice.

Prove, to yourself at least, that you can have business continuity in a known amount of time once you pull the pin on your DR/restore-from-backups plan.


Yup, this is it. And you don't need to do it for every single backup. Before running the backup, do four coin flips. If they all come up heads (adjust the number of flips based on how much time you can afford to spend on backup testing), you test the full recovery process of the impending backup.
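The coin-flip scheme is just "test with probability 1/16". A minimal sketch in Python (do_backup and do_restore_drill are placeholders for your own procedures):

```python
import random

# 1/16 is the chance of four coin flips all coming up heads; tune to taste.
FULL_TEST_PROB = 1 / 16

def run_backup(do_backup, do_restore_drill, rng=random):
    """Run the backup; with probability FULL_TEST_PROB, also run a full restore drill.

    Returns True if the restore drill was run this time.
    """
    do_backup()
    if rng.random() < FULL_TEST_PROB:
        do_restore_drill()
        return True
    return False
```

Over many backups this averages out to one full drill per sixteen runs, without any state to keep between runs.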

Of course, it will depend on several parameters, but in my experience, doing a thorough test for a random subset of items is often more economical than a half-assed test of all items.

As a bonus, the random sampling will let you infer things about the totality of all items. (As opposed to any other scheme for selecting which items to test.) So once you've run 27 tests and only one has failed, you can be confident that at least about 85% of your backups work. At 1/20th of the cost of testing them all, this is a good deal on information.
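The 85% figure can be sanity-checked with a one-sided Clopper-Pearson-style lower confidence bound, using only the standard library. This is my own sketch, and the 95% confidence level is my assumption, not something stated above:

```python
from math import comb

def binom_tail(n, x, p):
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

def lower_conf_bound(n, successes, alpha=0.05):
    """One-sided lower confidence bound on the per-item success rate:
    the smallest p for which observing >= successes is still plausible
    at level alpha. Found by bisection; binom_tail is increasing in p.
    """
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_tail(n, successes, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return hi
```

For 26 passes out of 27 random restore tests, this comes out a little under 0.85, which matches the ballpark claim in the comment.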



