Do any cloud providers create backups on top of replication though?
Backing up databases (terabytes) is feasible; they're not that big.
But an entire cloud storage service for photos and videos for millions of users? We're talking maybe exabytes. The notion of making "separate" backups seems cost-prohibitive.
I am curious, though -- for services like Dropbox or Google Drive, how many replicas are there of your files? I know there must be redundancy in case a disk fails, but do they keep 2 instances of your data or more? And are the instances spread out geographically, some or all?
The short version is that the backups are a mix of short-term local backups + backups across the network on distributed filesystems + offline tape backup.
I can't speak to Drive, but the tape backup is certainly used for GMail. (There's a case study about Gmail having to restore from backup in the above link.)
"Do any cloud providers create backups on top of replication though?"
We do.[1]
For exactly 1.75x our normal pricing, we will replicate your entire account, nightly, to a geo-redundant site which is not open for normal customer use (and, therefore, has a lower risk profile). This GR site is the he.net core datacenter, in Fremont.
It's also worth pointing out that replication to rsync.net buys you malware/ransomware protection since your account is snapshotted, by ZFS, nightly, and those snapshots are immutable (read-only).
I've been wondering about ransomware protection through snapshots. Presumably (and I do know more or less nothing about it) the malware aspect of it is present on the system significantly before the ransomware aspect is triggered - so restoring to yesterday's backup just puts you back in the position to get pwn3d again? How do companies get around this?
I don't see the size as the big problem. If one cloud can handle the size, the backup storage system can too. For me, the real issue is timing. How do you back up a cloud full of constantly changing data? Do you draw a line in the sand, an image of the cloud state at a particular moment? Is that even possible? You have to back up smaller chunks, individual accounts, but eventually that just looks like another software-managed internal array rather than a true duplicate. Your backup system is just as susceptible to deletion errors as the cloud it lives within.
"For me, the real issue is timing. How do you back up a cloud full of constantly changing data? Do you draw a line in the sand, an image of the cloud state at a particular moment?"
I can only describe what we do, and, of course, there is an enormous scale difference here, but ...
Every single rsync.net account is its own ZFS filesystem, which means every single account gets snapshotted[1] nightly on a schedule. This means the enormous operation of "backing up" all of rsync.net happens in small, manageable chunks.
Of course, the ZFS snapshots of a customer account are not "backups" per se, but if a customer chooses "geo-redundant" storage in another facility, it is those very same snapshots that we send over (via zfs send). Those are, indeed, backups.[2]
The most interesting part, in my opinion, is that the daily/weekly/monthly snapshots are immutable. So you can publish your rsync.net credentials or suffer a disgruntled employee or ransomware attack, etc., and those snapshots remain safe - they are read-only.
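For concreteness, the mechanism described above (per-dataset nightly snapshots, immutable once taken, then replicated off-site) looks roughly like this in generic ZFS administration. This is a sketch with made-up pool/dataset names and dates, not rsync.net's actual tooling:

```
# Nightly: take a snapshot of one customer's dataset. Snapshots are
# read-only from the moment they exist.
zfs snapshot tank/customers/alice@2021-02-09

# Ransomware recovery is just reading files back out of a snapshot,
# which ZFS exposes under a hidden .zfs/snapshot directory:
cp -a /tank/customers/alice/.zfs/snapshot/2021-02-08/. /tank/customers/alice/

# Geo-redundancy: send the snapshot (incrementally, relative to the
# previous one) to a receiver at the second site.
zfs send -i @2021-02-08 tank/customers/alice@2021-02-09 \
  | ssh gr-site zfs receive tank/customers/alice
```

Because the snapshot is read-only, an attacker with the account's credentials can trash the live filesystem but cannot rewrite yesterday's snapshot.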
So it does not guarantee that the filesystem is in an application-consistent state before it snapshots. I.e., we could be midway through applying a database transaction, or there could be a reference in one file to another file that doesn't exist (yet).
It could be argued that one needs to do a site wide "write cache flush", stop, and snapshot. Not that I think for one second the service provider should (or could) be in a position to detect when a good time to snapshot might be....
It can't be "just" as susceptible as what happened here.
Even if you do the straightforward thing, no frozen point in time, just a rolling copy of all the files onto durable media, you may end up with logically inconsistent data: files from later in the backup may never have actually existed alongside files from earlier in it. But if you write down timestamps periodically as you go, you end up in a far better end state than "I lost all the files".
At worst, you end up with "I may have lost the last 3 minutes' worth of changes from the end of my last backup".
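The rolling-copy-plus-timestamps idea can be sketched in a few lines. This is an illustrative toy (the function name and manifest format are made up), not a real backup tool: it copies files one at a time with no global freeze, and records when each file was captured so you know which tail of the backup to distrust.

```python
import json
import os
import shutil
import time

def rolling_backup(src_dir, dst_dir):
    """Copy files one at a time (no frozen point in time), recording when
    each file was captured so the inconsistency window can be bounded."""
    manifest = []
    os.makedirs(dst_dir, exist_ok=True)
    for root, _dirs, files in os.walk(src_dir):
        for name in sorted(files):
            src = os.path.join(root, name)
            rel = os.path.relpath(src, src_dir)
            dst = os.path.join(dst_dir, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)  # copies content and mtime
            manifest.append({"path": rel, "copied_at": time.time()})
    # The manifest is the "write down the timestamps" part: for each file
    # you know when it was captured, so at worst you distrust the tail.
    with open(os.path.join(dst_dir, "manifest.json"), "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest
```

Files captured late in the walk may be newer than files captured early, so the copy is not a point-in-time image, but it is strictly better than nothing, and the timestamps tell you exactly how fuzzy it is.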
> Do any cloud providers create backups on top of replication though?
Yes - I work at one of the FAANGs on a team that's doing precisely this. We develop an internal disaster recovery tool that creates backups of data files that can't be touched by the creating application, and that can be read back in a disaster event to recover the data.
Just using AWS S3 as an example, you can copy a bucket to another bucket (preferably in a different region) with a policy of retaining all versions, preventing a delete, whether malicious or accidental, from being reflected in the backup copy.
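As a sketch, an S3 replication configuration along those lines (the V2 rule schema used with put-bucket-replication) replicates every new object version to the backup bucket while deliberately not replicating delete markers. The account ID, role, and bucket names here are made up; both buckets need versioning enabled for old versions to survive:

```json
{
  "Role": "arn:aws:iam::111122223333:role/s3-replication-role",
  "Rules": [
    {
      "ID": "replicate-all-keep-versions",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {},
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Destination": { "Bucket": "arn:aws:s3:::my-backup-bucket" }
    }
  ]
}
```

With delete-marker replication disabled, a delete in the source bucket never propagates, so the backup bucket only ever accumulates data.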
There is also the option of using Glacier for lower cost long term storage.
If you are asking does regular S3 provide a true backup capability out of the box, it does not, or at least did not last time I actively looked at this about 2 years ago.
Do we really think AWS doesn't do periodic offline backups of stuff stored in S3, precisely so that they don't find themselves in this exact embarrassing scenario? Regardless of whether it's user-facing for AWS customers, I'd hope they do, or certainly that they did before they got a good sense of the long-term reliability of S3 as a big system.
I think they do all sorts of backup stuff - and as a consumer of S3 you need to evaluate how resilient their backing up is. S3 is pretty murky about permanence guarantees so I'd always look at making sure there is a separately maintained replication script to some medium I have control over if that data is irreplaceable and the costs of losing it are significant to the business.
Judging those costs and making that call is a complex matter of course.
What would such a service then do as an end-user beyond what `aws s3 sync s3://my/bucket /mnt/my/backup/drive/` does?
(Honest question, not provocation. To me either you trust AWS or any party to never lose your data [unwise], or you basically ask them to offer a way to rsync, which AWS does)
I think that's essentially what such a service would do. You might throw in some periodic less frequent syncs (maybe you sync down to the main backup every day and sync that backup to a secondary backup weekly or monthly) and maybe some of those successive syncs are done to hosts that are usually disconnected from the network to add in a firebreak.
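A minimal sketch of that tiering as cron entries (the bucket name, paths, and offline host are all hypothetical):

```
# Daily: pull the bucket down to the primary backup disk.
0 3 * * *  aws s3 sync s3://my-bucket /mnt/backup/daily

# Weekly: copy the primary backup to a secondary host that is
# normally disconnected from the network (the "firebreak").
0 4 * * 0  rsync -a /mnt/backup/daily/ offline-host:/backup/weekly/
```

The point of the second tier is that a compromise or bug that corrupts the daily copy has at most a week to propagate before you notice, and the offline host is unreachable the rest of the time.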
Well, versioned bucket replication is essentially a backup when you configure it so that the writer to the first bucket can't perform any operations on the second bucket. And since bucket replication doesn't replicate delete markers, the destination ends up accumulating every version of your data rather than just mirroring the source's writes.
A thing stored in google cloud is generally erasure coded, which provides redundancy against the failure of individual devices or hosts, and another copy is stored separately in a geographically separate place. So you might think of it as there being 1.7 copies of your file in each of at least two places.
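The arithmetic behind a figure like "1.7 copies" per location is just the erasure-coding overhead. The parameters below are hypothetical, chosen only because they happen to produce 1.7x; the actual code rates used internally aren't stated here. The XOR demo at the end shows the recovery idea in its simplest possible form:

```python
def overhead(k, m):
    """Storage cost of k-data + m-parity erasure coding: the object is
    recoverable from any k of the k + m shards, so the overhead is
    (k + m) / k instead of the 2.0x of plain mirroring."""
    return (k + m) / k

# A hypothetical parameter choice that yields 1.7x in one location:
assert overhead(10, 7) == 1.7

# Minimal recovery demo with a single XOR parity shard (k = 2, m = 1):
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

d0, d1 = b"hello", b"world"
parity = xor_bytes(d0, d1)         # stored alongside the two data shards
recovered = xor_bytes(d1, parity)  # lose d0, rebuild it from the rest
assert recovered == d0
```

Real systems use Reed-Solomon codes rather than a single XOR parity so they can tolerate multiple simultaneous losses, but the storage math is the same.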
> Do any cloud providers create backups on top of replication though?
Any serious database does offer backups in addition to replication, and that kind of validates OP's point. As for object stores there are a variety of techniques [0].
> But an entire cloud storage for photo and video for millions, we're talking maybe exabytes. The notion of making "separate" backups seems cost-prohibitive.
Cold-stores like the one Facebook built would be apt [1].
> I know there must be redundancy in case a disk fails, but do they keep 2 instances of your data or more? And are the instances spread out geographically, some or all?
See [0]. Backblaze share quite a bit about the space, too [2][3]. With MinIO, one could run an S3-like system on-premises [4].
Basecamp used to, I have no idea if they still do. Databases were replicated, and also backed up remotely. User uploads were replicated in the storage system, and also backed up to S3.
Unfortunately Dropbox's revision history isn't reliable enough for a backup. If you notice, their marketing carefully avoids use of the word "backup" despite the features seemingly implying it.
I had a client where a glitch in the Dropbox sync caused 300k+ files to be deleted when we added a new PC. Dropbox support was unable to successfully undo all the file changes, and I had to get a ticket with CS escalated to a special team to get everything restored. Even when it was finished, they could not give a guarantee that every single file deleted was restored.
Maybe things got better since they've launched Dropbox Rewind, but given that revision history has been a feature for years and it still didn't work right, I no longer trust them as a backup.
I'm completely soured on dropbox since they dialed the greed up with the arbitrary device count limit and nagging suggestions to upgrade that can't be turned off if you're near the max. I don't care enough to stop using it but I will never give them money after this.