This sounds to me like the classic “replication is not backups” situation where (at best) all of the user files were stored in a RAID array someplace and that is what the malware ate. If there had been actual, effective backups, it should have been trivial to restore non-corrupted files. It also sounds like someone made the decision not to backup the raw images because they were “big” - that is actually the one thing they should have backed up because all of the smaller files can be regenerated from the raw ones. I would not be surprised if all of this was running under someone’s desk.
> It also sounds like someone made the decision not to backup the raw images because they were “big” - that is actually the one thing they should have backed up because all of the smaller files can be regenerated from the raw ones.
Ironically my experience has been exactly the opposite. It's the demosaic-ed, fully developed copies of my photos that are larger and harder to preserve than the original RAWs. And these files can't be trivially regenerated from the RAWs, either. Let me explain.
The most important issue is that the process of taking a raw image and turning it into an edited RGB-pixel image is not obvious, at all. Tons of steps have to happen in this process, and there's currently no way to describe this process in a way that's compatible with a single open standard or even multiple pieces of software. At all. The steps can be broken down roughly into a series of "instructions", but what those instructions mean (e.g. to Lightroom) is entirely opaque and even secret in the case of closed source programs. Even open source programs, like RawTherapee and Darktable, use entirely different and incompatible approaches, algorithms, and instructions.
What this means on a practical level is that you can preserve your raw files and the instructions for replicating the edits as carefully as you like, but without the exact same piece of software (often even the same version of the program with the same settings and defaults set), your edits are as good as gone forever.
As a result I've had to take fairly drastic steps to make sure my photos are safe. For my older photos, I keep an entire VirtualBox image with Windows and Lightroom backed up along with my raw images, so that I can be sure of restoring exactly the same output files if necessary. (This more-or-less has to be a hacked version of Lightroom because you can't take chances with licensing problems preventing the program from running now that Lightroom is subscription based.) And actually for newer photos, I've moved away from Lightroom to RawTherapee. Even though I feel it's usually inferior, I feel safer since I know the pipeline from raw to burned edit is essentially public. I keep a backed up copy of the RawTherapee source code, but even if that somehow failed someone could make a RawTherapee compatible raw converter from scratch.
So that's why edited images are actually harder to preserve: whereas with raw you can just save them in 2-3 different physical locations and storage devices, to keep the edits you have to take several additional steps and there are more points of failure. Why not just keep the static edited images too? Well, I do that for my most essential photos. But that gets into the other issue, that the edited images are actually larger than the original raws.
If the point is preserving my edits, not just having a copy that's good enough for Facebook, the images have to be lossless. This pretty much means PNG or TIFF, and in fact the latter seems to be necessary since PNG doesn't handle metadata anywhere nearly as well. Unfortunately, while the compression used on raw files tends to be pretty good (which is further aided by the images being mosaiced), the compression algorithms compatible with TIFF are pretty terrible. Add that to the fact that you almost certainly want 16 bit images, in order to preserve as much raw detail as possible (in case you want high quality prints or need to do further editing), and you end up with whopping huge TIFFs. I regularly see my TIFF output files 4-5 times larger than the corresponding raws.
In short, managing backups for a high quality photography workflow is actually a good bit more difficult than it seems at first sight - and how it looks at first is not that easy either!
That might be the case. Admittedly I don't know much about this cloud platform, and the article doesn't fill in a lot of details, but I would naively have assumed the target audience for something like this would be pro and semi-pro users, who would want to store RAW files, since amateurs and those just looking to share photos would get more out of Flickr, or even Google Photos / Facebook / Twitter / etc.
> Even open source programs, like RawTherapee and Darktable, use entirely different and incompatible approaches, algorithms, and instructions.
`git checkout v3.0.2`
`./build.sh`
No idea what you're talking about. As long as you have your RAWs and your sidecar file you can trivially reproduce the picture, at least that's how it works in darktable.
Hell, the old code that is deprecated and "hidden" in the newest versions is actually all still there and if you import your stuff in the newest version it is practically guaranteed to look exactly the same. In fact there is a CI system that checks for delta-Es in each release.
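As a sketch of what that reproduction looks like in practice (assuming the same or a newer darktable is installed; filenames here are made up), darktable ships a batch tool for exactly this:

```shell
# Re-render the edited image from the raw file plus its XMP sidecar.
# darktable-cli is darktable's batch processor; filenames are examples.
darktable-cli IMG_0001.cr2 IMG_0001.cr2.xmp IMG_0001.tif \
    --core --conf plugins/imageio/format/tiff/bpp=16
```

Keep the .xmp next to the raw in your backups and that invocation is all you need to regenerate the output.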
I mean that RawTherapee and Darktable don't use the same code as each other at all. The instructions (usually sidecar files) that tell each program what to do to generate the output are completely incompatible.
> the old code that is deprecated and "hidden" in the newest versions is actually all still there and if you import your stuff in the newest version
I'd expect this to be the case with any good photo developer, but it doesn't pay to take chances. There have been significant bugs in the past with DNG files, for example, and different bugs could be introduced in the future, or bugs you didn't know about could be fixed, and so on. There's a lot of stuff that could go wrong. It's good to be able to just tar the source code.
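For example, here's a minimal sketch of "just tar the source code", with a checksum so the archive itself can be verified years later (the paths are stand-ins for the real source tree):

```shell
# Archive the exact source tree the binary was built from, plus a checksum.
# "$SRC_DIR" is a stand-in; point it at the real unpacked source.
SRC_DIR="rawtherapee-src"
mkdir -p "$SRC_DIR" && echo 'placeholder' > "$SRC_DIR/README"  # stand-in tree

ARCHIVE="$SRC_DIR.tar.gz"
tar -czf "$ARCHIVE" "$SRC_DIR"
sha256sum "$ARCHIVE" > "$ARCHIVE.sha256"

# Years later, verify the archive before trusting a rebuild:
sha256sum -c "$ARCHIVE.sha256"
```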
> I mean that RawTherapee and Darktable don't use the same code as each other at all. The instructions (usually sidecar files) that tell each program what to do to generate the output are completely incompatible.
I'll be honest, this is probably the most ridiculous argument against anything I've ever heard on hacker news. This is like saying C/gcc sucks because Javascript exists. What the fuck?
> I'd expect this to be the case with any good photo developer, but it doesn't pay to take chances. There have been significant bugs in the past with DNG files, for example, and different bugs could be introduced in the future, or bugs you didn't know about could be fixed, and so on. There's a lot of stuff that could go wrong. It's good to be able to just tar the source code.
> I'll be honest, this is probably the most ridiculous argument against anything I've ever heard on hacker news. This is like saying C/gcc sucks because Javascript exists. What the fuck?
Well, sorry you feel that way. /s
The difference is that C and Javascript are widely implemented open standards. Lightroom, Darktable and RawTherapee use three different opaque, undocumented approaches to developing raw images. None of them has been reimplemented in other software, and it would be extraordinarily difficult to actually do that, because of all the specificity and quirks. You basically need the original software, which means making sure you can still compile it, making sure you have a platform it can run on, and so on. This is more complexity than most people ordinarily expect when they talk about "backing up photos", and that's exactly my point.
> So... it's a non issue? I don't even...
I should probably not even bother responding, since you already showed with your original comment that you didn't bother to even read my post, but I really don't get this. I explained a very specific issue: that there's a lot of complexity to developing raw photos, which means you have to take extra steps to make sure your edits are properly backed up. Open source software has a distinct advantage because you can archive a copy of the software yourself, but it doesn't change the fact that you do need the original software. And in fact you might need the same version, that's what the example of DNG conversion is supposed to illustrate.
This is ... the opposite of a non-issue. In fact it's a very specific issue that I took quite a bit of time to explain in detail. Having to back up your software, or possibly even an entire VM along with your photos to make sure your edits are preserved, goes way beyond what your average person, or even many photographers think they have to do to keep proper backups.
To add to this, I have a love/hate relationship with Adobe updating their RAW engine version. Opening up old RAW images (with edits stored in XMP sidecar format) will look completely different because the new engine interprets the edits differently - or, in extreme cases, because Adobe decided to add or remove features - which has happened to me numerous times during the last ~15 years since going from RawShooter directly to Lightroom. I usually delete exports for non-critical projects to save space, but any paid jobs are backed up multiple times, because I've been bitten before by not being able to render the exact same TIFF for print as 5+ years ago.
Actually this highlights a different issue - for photo editing, there isn't one "right" way to do things, unlike, say, text editing in Microsoft Word. darktable has a very different approach compared to Lightroom, for instance.
> None of them has been reimplemented in other software, and it would be extraordinarily difficult to actually do that, because of all the specificity and quirks.
Why is this a problem? That's like saying "I need a good GPU to run this game, this sucks!" Well, no shit I guess?
And it's not even close to that, because you can always just buy or compile the right version. And for darktable you don't even need that: backward compatibility is guaranteed. Also, darktable has a rudimentary Lightroom-to-darktable conversion tool (never used it).
> This is more complexity than most people ordinarily expect when they talk about "backing up photos", and that's exactly my point.
Not sure about the other ones, but for darktable, it's as simple as keeping the XMP sidecar alive.
> I explained a very specific issue: that there's a lot of complexity to developing raw photos, which means you have to take extra steps to make sure your edits are properly backed up. Open source software has a distinct advantage because you can archive a copy of the software yourself, but it doesn't change the fact that you do need the original software. And in fact you might need the same version, that's what the example of DNG conversion is supposed to illustrate.
If you don't like having a non-destructive copy, save a high quality JPEG with your edits. Also, I can't think of any other kind of application (not just photo editing) where you can easily keep a future-proof, non-destructive copy of whatever you need to save.
For example: for audio, you can always just save a WAV or FLAC of your work, but you're still relying on the fact that your DAW workflow is still gonna exist 10 years in the future if you try to save your project as opposed to a mastered copy.
Code is also similar. You can save a binary which will probably work, but if you want to save your repo, unless it has good package managers you might have issues later down the line, like that left-pad incident a while back.
Hell, even for something completely different: language drifts over time, and reading Shakespeare is kind of hard unless you know how to read Early Modern English. And we don't seem to have a problem with forward compatibility there; it's just a bit of a pain in the ass. I read it in high school and I didn't go insane.
This isn't a photography issue, this isn't a software issue, this is a core part of human experience.
> This is ... the opposite of a non-issue. In fact it's a very specific issue that I took quite a bit of time to explain in detail. Having to back up your software, or possibly even an entire VM along with your photos to make sure your edits are preserved, goes way beyond what your average person, or even many photographers think they have to do to keep proper backups.
You don't need a VM, at least for darktable. You're turning a potential issue that hasn't actually appeared into a real issue for yourself. Maybe it will become one someday, but just figuring out how to install the old version is gonna be good enough. You do you, man. I'll stick with running dt natively and trusting the forward compatibility and OpenCL acceleration.
For storing lossless RGB you might consider FLIF (https://flif.info), given you've demonstrated some flexibility in your setup - it supports 16-bit channels.
Thanks for the suggestion! I tested it on one file. Support for the format might turn out to be a problem: to use the flif command line tool, I had to convert my input file from TIFF to PNG first, which means I'd have to be extremely careful not to lose metadata if I started doing this for real. It was also pretty slow.
File sizes (DNG lossless, other formats 16 bits):
* DNG: 18.5 MiB
* TIFF: 84.2 MiB
* FLIF: 55.1 MiB
So while FLIF is unsurprisingly much better than TIFF, it's still ~3 times larger than the DNG, which means that a pretty noticeable amount of additional disk space would have to be sacrificed to store the final edited versions of all my photos.
I'm not sold. In the cases where you lose the raw vs a post processed image, one of those is significantly more lost information. It may be non trivial to reproduce a post process image from a raw image, but the other way around is usually impossible.
In one case you've lost original sensor information, in the other case you've lost some parameters and possibly the algorithm used. The former is infinitely more difficult to synthesize from scratch.
I think that's a really shallow way to look at it. For a professional photographer it's the final version of the file that is actually sold, not the RAW or film negative. These final files are often not some 300x200 web preview. These are full sized images. You can argue the final edit has more value and is the one to preserve. Ansel Adams broke the process down into 3 parts with the initial capture being but the first. So losing the final edits is losing 66% of the total work.
I'm only a hobbyist photog and the bday parties and weddings I've shot result in thousands of images. I can't imagine having to shoot multiple events every week for years. For a pro, the edits represent thousands of hours of work. That the work can be redone theoretically might mean very little to a working photographer.
> For a pro, the edits represent thousands of hours of work. That the work can be redone theoretically might mean very little to a working photographer.
This is exactly the issue I had in mind, well said.
Some images require hours of painstaking editing. There are some photographers who will spend days or more editing an image. The edited image is as much an instantaneous snapshot of the photographer as it is the scene in front of the lens. Looking at my Darktable catalog, it is easy to estimate that I spend at least 250 hr/yr editing images.
Raw files, and sidecars, retain their value because sometimes one wishes to revisit an image/edit. If I had to pick one or the other, I might prefer to lose the edited images. That is because I believe I have more/better days of my editing life ahead of me and because I underestimate how much editing I have done. Moreover, there are plenty of RAW images awaiting the day they come to life.
In commercial work, like weddings, the final product is the edit. If there is one thing that must be retained, that is it.
So far, I have found that Darktable's edits have survived changes in versions. Someday there will probably be breaking changes, but I haven't yet felt it personally.
every single "dead" code is still in marketable and isn't accessible unless you try to import a sidecar from a previous version. LR might be shitty but the fact that the guy says darktable has this problem is just silly.
This, like reproducibility in machine learning workflows, is one of those things that needs to be baked in at the start or will become a complete nightmare.
This was a fascinating, but also disturbing, read; it does sound like your solution (quasi-legal as it might be) is the one that makes the most sense if you do this professionally.
The original story spoke of the rumours of malware involvement, but in the update which forms the first half of the linked article, Canon says there was no malware involved.
(“Replication isn’t backup” still applies, of course.)
Do any cloud providers create backups on top of replication though?
Backing up databases (terabytes) is feasible, they're not that big.
But an entire cloud storage for photo and video for millions, we're talking maybe exabytes. The notion of making "separate" backups seems cost-prohibitive.
I am curious, though -- for services like Dropbox or Google Drive, how many replicas are there of your files? I know there must be redundancy in case a disk fails, but do they keep 2 instances of your data or more? And are the instances spread out geographically, some or all?
The short version is that the backups are a mix of short-term local backups + backups across the network on distributed filesystems + offline tape backup.
I can't speak to Drive, but the tape backup is certainly used for GMail. (There's a case study about Gmail having to restore from backup in the above link.)
"Do any cloud providers create backups on top of replication though?"
We do.[1]
For exactly 1.75x our normal pricing, we will replicate your entire account, nightly, to a geo-redundant site which is not open for normal customer use (and, therefore, has a lower risk profile). This GR site is the he.net core datacenter, in Fremont.
It's also worth pointing out that replication to rsync.net buys you malware/ransomware protection since your account is snapshotted, by ZFS, nightly, and those snapshots are immutable (read-only).
I've been wondering about ransomware protection through snapshots. Presumably (and I do know more or less nothing about it) the malware aspect of it is present on the system significantly before the ransomware aspect is triggered - so restoring to yesterday's backup just puts you back in the position to get pwn3d again? How do companies get around this?
I don't see the size as the big problem. If one cloud can handle the size, the backup storage system can too. For me, the real issue is timing. How do you backup a cloud full of constantly changing data? Do you draw a line in the sand, an image of the cloud state at a particular moment? Is that even possible? You have to do backups of smaller chunks, individual accounts, but eventually that just looks like another software-managed internal array structure rather than a true duplicate. Your backup system is just as susceptible to deletion error as the cloud it lives within.
"For me, the real issue is timing. How do you backup a cloud full of constantly changing data? Do you draw a line in the sand, an image of the cloud state at a particular moment?"
I can only describe what we do, and, of course, there is an enormous scale difference here, but ...
Every single rsync.net account is its own ZFS filesystem, which means every single account gets snapshotted[1] nightly on a schedule. This means that the enormous operation of "backing up" all of rsync.net happens in small, manageable chunks.
Of course, the ZFS snapshots of a customer account are not "backups" per se, but if a customer chooses "geo-redundant" storage in another facility, it is those very same snapshots that we (zfs) send over. Those are, indeed, backups.[2]
The most interesting part, in my opinion, is that the daily/weekly/monthly snapshots are immutable. So you can publish your rsync.net credentials or suffer a disgruntled employee or ransomware attack, etc., and those snapshots remain safe - they are read-only.
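A rough sketch of how that scheme looks at the ZFS level (the pool, dataset, and host names here are invented for illustration):

```shell
# Nightly, read-only snapshot of one customer's filesystem.
zfs snapshot tank/customer123@nightly-2020-08-05

# Snapshots are immutable: ransomware can trash the live files, but the
# dataset can be rolled back to last night's state.
zfs rollback tank/customer123@nightly-2020-08-05

# Geo-redundancy: incrementally send the same snapshot to the second site.
zfs send -i tank/customer123@nightly-2020-08-04 \
         tank/customer123@nightly-2020-08-05 \
    | ssh gr-site zfs recv -F backup/customer123
```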
So it does not assert that the filesystem is in a consistent state before it snapshots. I.e., we could be midway through applying a database transaction, or there could be a reference in one file to another that doesn't exist (yet).
It could be argued that one needs to do a site wide "write cache flush", stop, and snapshot. Not that I think for one second the service provider should (or could) be in a position to detect when a good time to snapshot might be....
It can't be "just" as susceptible as what happened here.
Even if you do the straightforward thing - no frozen point in time, just a rolling copy of all the files onto a durable medium - you may end up with logically inconsistent data, in that files from later in the backup may never have existed alongside files from earlier in it. But if you write down timestamps periodically, you still end up in a way better end state than "I lost all the files".
At worst, you end up with "I may have lost the last few minutes' worth of files from the end of the last backup I did".
> Do any cloud providers create backups on top of replication though?
Yes - I work at one of the FAANGs on a team that's doing precisely this. We develop an internal disaster recovery tool that creates backups of data files that can't be touched by the creating application, and that can be read back in a disaster event to recover the data.
Just using AWS S3 as an example, you can copy a bucket to another bucket (preferably in a different region) with a policy of retaining all versions- preventing the problem of a delete whether malicious or accidental, from being reflected in the backup copy.
There is also the option of using Glacier for lower cost long term storage.
If you are asking does regular S3 provide a true backup capability out of the box, it does not, or at least did not last time I actively looked at this about 2 years ago.
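The versioned-copy setup described above can be sketched with the aws CLI (the subcommands are real; the bucket names and the replication JSON are placeholders you would fill in):

```shell
# Versioning must be enabled on both buckets before replication will work.
aws s3api put-bucket-versioning \
    --bucket my-primary-bucket \
    --versioning-configuration Status=Enabled
aws s3api put-bucket-versioning \
    --bucket my-backup-bucket \
    --versioning-configuration Status=Enabled

# Replication rules (destination bucket, IAM role) live in replication.json.
aws s3api put-bucket-replication \
    --bucket my-primary-bucket \
    --replication-configuration file://replication.json
```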
Do we really think AWS doesn't do periodic offline backups of stuff stored in S3, precisely so that they don't find themselves in this exact embarrassing scenario? Regardless of whether it's user-facing for AWS users, I'd hope they do - or certainly that they did before they got a good sense of the long-term reliability of S3 as a big system.
I think they do all sorts of backup stuff - and as a consumer of S3 you need to evaluate how resilient their backing up is. S3 is pretty murky about permanence guarantees so I'd always look at making sure there is a separately maintained replication script to some medium I have control over if that data is irreplaceable and the costs of losing it are significant to the business.
Judging those costs and making that call is a complex matter of course.
What would such a service then do as an end-user beyond what `aws s3 sync s3://my/bucket /mnt/my/backup/drive/` does?
(Honest question, not provocation. To me either you trust AWS or any party to never lose your data [unwise], or you basically ask them to offer a way to rsync, which AWS does)
I think that's essentially what such a service would do. You might throw in some periodic less frequent syncs (maybe you sync down to the main backup every day and sync that backup to a secondary backup weekly or monthly) and maybe some of those successive syncs are done to hosts that are usually disconnected from the network to add in a firebreak.
Well, versioned bucket replication is essentially backup when you configure it so that the writer to the first bucket can't do any operations on the second bucket. Since bucket replication doesn't replicate deletion markers, you essentially end up with your data duplicated instead of just having writes replicated.
A thing stored in google cloud is generally erasure coded, which provides redundancy against the failure of individual devices or hosts, and another copy is stored separately in a geographically separate place. So you might think of it as there being 1.7 copies of your file in each of at least two places.
> Do any cloud providers create backups on top of replication though?
Any serious database does offer backups in addition to replication, and that kind of validates OP's point. As for object stores there are a variety of techniques [0].
> But an entire cloud storage for photo and video for millions, we're talking maybe exabytes. The notion of making "separate" backups seems cost-prohibitive.
Cold-stores like the one Facebook built would be apt [1].
> I know there must be redundancy in case a disk fails, but do they keep 2 instances of your data or more? And are the instances spread out geographically, some or all?
See [0]. Backblaze share quite a bit about the space, too [2][3]. With minio, one could run their S3-like system on-premise [4].
Basecamp used to, I have no idea if they still do. Databases were replicated, and also backed up remotely. User uploads were replicated in the storage system, and also backed up to S3.
Unfortunately Dropbox's revision history isn't reliable enough for a backup. If you notice, their marketing carefully avoids use of the word "backup" despite the features seemingly implying it.
I had a client where a glitch in the Dropbox sync caused 300k+ files to be deleted when we added a new PC. Dropbox support was unable to successfully undo all the file changes, and I had to get a ticket with CS escalated to a special team to get everything restored. Even when it was finished, they could not give a guarantee that every single file deleted was restored.
Maybe things got better since they've launched Dropbox Rewind, but given that revision history has been a feature for years and it still didn't work right, I no longer trust them as a backup.
I'm completely soured on dropbox since they dialed the greed up with the arbitrary device count limit and nagging suggestions to upgrade that can't be turned off if you're near the max. I don't care enough to stop using it but I will never give them money after this.
Ideally, a "delete" should mark images as unavailable and queue them for deletion at a later date (e.g. in 30 days). This provides protection against accidental deletion by users, accidental account deletions/deactivations, paid accounts terminated due to lack of payment, automated software mistakes (such as this), and so on.
The last company I worked at, 80% of our database was inactive, useless, or redundant data, all kept to protect against all kinds of issues (such as reused usernames, subpoenas, annual reporting, etc). We could always zero out PII from an account, but we never removed a record we didn't have to. It led to a lot more RAID migrations than I would have liked to have done, but it certainly made the application easier to manage. Plus, we never worried about fragmentation or holes in our data files, cascading deletes, etc.
There was an event when a startup I was at asked Basho (the company behind Riak db) about backing up our data. Backup was a little side feature that was possible to rig up, but I recall they looked at this inquiry as if I had two heads - as if to say, it's replicated, why bother? Then there was a bug with one of the Riak releases, and all the data was lost. (When we scaled up with this buggy Riak release, the empty node assumed the master role, and all the child nodes went, ah... the new state has no data, let's all delete records 0..k. Fun times.)
What’s the difference between replication and backups? Is the distinction that backups must be stored on separate infrastructure, whereas replication might still be 1 or 2 points of failure?
Replication is about having data level redundancy to protect from drive failure. Backups are about having point in time snapshots of the system state, and about having them tiered from a location perspective. The 3-2-1[1] principle says to have 3 total copies, 2 of which are local but on different devices, and 1 of which is offsite. This gives you tiers of recoverability.
It’s important from a backup perspective that it’s point in time as well, otherwise as soon as you get ransomware that encrypts your file you now have replicated those changes everywhere.
Where I worked, weekly point in time backups were required. Those backups were put onto tape drives, those tape drives were set on a pallet and then driven by truck to an offline second location. IMO _that’s_ how it’s done properly.
When we used Iron Mountain at my last job, there were two interesting additional concerns:
1) That nobody in our company actually knew where they were physically stored (insider risk?), but
2) That we had assurance that the physical storage was far enough away that a physical disaster in our area wouldn't also touch where the offline storage was located.
There's a concomitant jump in RTO if that's what you do, but hopefully that's well understood among the stakeholders.
Definitely a commitment jump, and not always necessary. We ran our own data centers, off the topic of backups, and it was wild to me that they were so disaster proof. There were different power lines from different power companies coming in in case either of the power companies had an outage, and some giant diesel power generator that could last a long time while fully powering the data center.
That’s a lot harder to pull off. What methods do you use to accomplish this?
Depending on how much data you're backing up, sneakernet works.
When I was still in the office, my company had me rotate a set of backup hard drives between the office lockup and a strongbox in my house. The notion was that it was unlikely that both the office building and my house 12 miles away would both burn down at the same time.
Of course, now I'm working from home, so all of the eggs are in one basket again.
External USB drives for me. One set of drives is stored in a fire safe at home, one set of drives is stored at the office, and one set is stored at a relative's house in another state. The ones at home get refreshed most often, the ones stored at the office get refreshed about once a quarter, and the one at the relative's house get refreshed during holidays and family get-togethers.
The drives are encrypted (Truecrypt) since they will be outside my physical control. The ones at the office I am prepared to abandon should I get fired/laid-off.
AWS S3 (and a few compliant providers) offer immutable options, in both governance and compliance modes.
Allegedly, compliance mode is unalterable by any account, period - I guess the equivalent of the immutable attribute without any overriding account. I'm not immediately familiar with any literature on attacks on this feature, but I've also not searched hard; however, I know from my clients that it's an accepted form of WORM, and that cloud storages like S3 are considered in the same vein as tape when immutability is in play.
I suppose it will be a case in the future that proves the efficacy of AWS/Azure/s3 providers, but for now, a lot of regulatory policies for 3-2-1 allow for such storage to fulfill the "2" part of the 3-2-1 rule.
I’ve always wondered if Amazon backs up S3. I don’t think they explicitly say but I get the impression that it is the user’s responsibility to replicate to a second region to guard against data loss so I am guessing not. Object Lock wouldn’t protect against an S3 failure.
Though, tread carefully there. If the thing that makes it append only is that you're just not using its destructive update features, then the store isn't really append only. Just because everyone agreed to limit the ways they interact with the data store doesn't mean you can trust buggy code, sloppy programming, and attackers to honor that agreement.
I see a bunch of explanations, but I don't think any of them really drove home the reason for why you usually need both.
Imagine if you had a document that stored useful information.
You had this document automatically replicated to another system in a different time zone every time there was a change.
You think you're doing great, if there is an outage in the one system you just get your important document from the other system.
Then one day someone accidentally copies and pastes the wrong data into the file.
Now what do you do? If you goto your replicated copy it also has the bad data.
The answer is you should have had backups too, so you could go back an hour, a day, a week, or even much longer - depending on how much data you have, how often it changes, how important it is, and how quickly someone would notice bad data.
I think the biggest factor is timing (although backups should also be on a second system).
In short, if anything that happens to the files is immediately copied to the "backup" then you don't actually have a chance to recover from any software problems. Whereas, if you make a copy of the data every night and keep the last 30 copies you can find an issue like this and go back in time to retrieve the files from before it started.
Replication is intended to create a 1:1 copy of the data. If data is removed from the primary system, replication will ensure it's removed from the replica.
A backup can be a 1:1 copy but it should be set up such that something going wrong with the primary (e.g. cryptolocker malware), can't affect it (since "something went wrong with the primary" is what the backup is intended to resolve).
To do that you could take the backup offline or use features like filesystem snapshots to ensure that changes can be rolled back.
The problem is that most replication setups will replicate deletes (and other modifications) as well. So if misbehaving software deletes (or corrupts) things, those deletes (or corruptions) will be replicated, and the replication does not give you a way to actually recover from a mistaken delete.
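This is easy to demonstrate with a couple of throwaway directories standing in for real storage:

```shell
# A replica mirrors deletions; a point-in-time backup does not.
mkdir -p primary
echo 'precious' > primary/file.txt

rsync -a --delete primary/ replica/   # replication: keep an exact mirror
cp -a primary backup-2020-08-05       # backup: a frozen point-in-time copy

rm primary/file.txt                   # misbehaving software deletes the file
rsync -a --delete primary/ replica/   # ...and the delete is replicated

# The replica faithfully lost the file too; only the backup still has it.
test ! -e replica/file.txt && test -f backup-2020-08-05/file.txt
```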
Backups protect against a wider class of problems. For example, a software vulnerability/bug or a human error could result in the deletion of an object in all replicas (because deletions are replicated too), but you'd usually need another incident to occur for you to lose the backup as well.
If your drive fails (or a single computer fails) replication saves your data, and keeps things running seamlessly.
If you `sudo rm -rf / --no-preserve-root` your drive, replication deletes everything, while backups let you restore to the last time you took a backup.
Classically, backups are taken then stored offline (e.g. on archive tape, on optical media, or on drives that are disconnected and put in storage between backups). Otherwise, if they are not 'cold' backups, they are still disconnected so if the main storage blows up, it won't impact the data in the backups.
Replication usually involves storage that is powered up and connected somehow to the same systems as the main storage; and any data that's corrupted on the main storage would propagate to the replicated storage.