
Supposedly they’re having to basically restore everyone from backups because a system designed to delete old data was a bit more efficient than it should have been: https://reddit.com/r/sysadmin/comments/u14qqq/_/i4a0mk8/?con...


Reminder: never delete data for real as your first step. Always mark it deleted along with a timestamp saying when. Then you can hide deleted items from everything. When a maintenance script goes haywire you can fix the problem quickly. Have a daily job that really deletes records marked deleted after 30 days.

If that is too complicated to retrofit then have any mass cleanup script move the records to a CSV file or temporary table.

Never ever ever be in a situation where a rogue script or bad SQL WHERE clause means restoring from backups.
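A minimal sketch of the mark-then-purge pattern described above, using SQLite so it runs anywhere; the table and column names are hypothetical:

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, deleted_at TEXT)")
conn.executemany("INSERT INTO items (name) VALUES (?)", [("a",), ("b",), ("c",)])

# "Delete" = mark with a timestamp; nothing is destroyed yet.
now = datetime.now(timezone.utc).isoformat()
conn.execute("UPDATE items SET deleted_at = ? WHERE name = 'b'", (now,))

# Live queries hide soft-deleted rows.
live = conn.execute(
    "SELECT name FROM items WHERE deleted_at IS NULL ORDER BY name"
).fetchall()

# The daily job really deletes rows marked more than 30 days ago.
cutoff = (datetime.now(timezone.utc) - timedelta(days=30)).isoformat()
conn.execute("DELETE FROM items WHERE deleted_at IS NOT NULL AND deleted_at < ?", (cutoff,))
```

Because the purge only touches rows older than the cutoff, a haywire script's damage stays reversible for a full 30 days.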


I agree that data must never be "deleted and forever gone" unless you've already been very sure about it a few times.

But I would like to warn people about certain implementations of database "soft deletes" that I'm not a fan of. To be clear, I'm talking about the idea of having a "deleted" and/or a "date_deleted" column and using those columns in the WHERE clause to filter out rows that shouldn't be visible.

That pattern complicates the table structure, queries, and indexes. It increases table and index size, thus more data has to be sifted through (either table data or index data) to ensure only non-deleted entries are returned. More data to go through means slower queries. It's also really easy for people to write SQL that accidentally leaves the "deleted" column out of the WHERE clause. Then old, irrelevant data is being returned.

Accidentally deleting data that needs to be undeleted is usually rare so I don't think people should optimize for it. We should optimize for things that happen frequently.

I have dealt with the rare "Oops! I deleted important data!" by restoring from backups and it has worked fine. I think it may be too strong to say you should never be in a position to restore data from a backup. In fact, I think it's important to streamline the restore process.

For cases where we know ahead of time that we want to query deleted data I'll move deleted data to another database table that exists solely for maintaining history. For example, an ORDER table will have a DELETED_ORDER table, or an ORDER_HISTORY table. The HISTORY tables can also record data overwritten from updates.

These tables take up disk space, but never affect the structure or size of the original table and its indexes. Queries to the original table don't need to be modified to account for soft deletes.

To guarantee that things go to the delete/history tables, I'll usually put a trigger on the original table to move data over to the history tables. This way no application-specific code is needed.
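A sketch of that trigger-based history approach, again in SQLite for portability; the ORDER/DELETED_ORDER names follow the example above, but the exact schema is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
CREATE TABLE deleted_orders (id INTEGER, customer TEXT, total REAL,
                             deleted_at TEXT DEFAULT CURRENT_TIMESTAMP);
-- The trigger copies each row to the history table before it is deleted,
-- so no application-specific code has to remember to do it.
CREATE TRIGGER orders_on_delete BEFORE DELETE ON orders
BEGIN
    INSERT INTO deleted_orders (id, customer, total)
    VALUES (OLD.id, OLD.customer, OLD.total);
END;
""")
conn.execute("INSERT INTO orders (customer, total) VALUES ('alice', 10.0), ('bob', 5.0)")
conn.execute("DELETE FROM orders WHERE customer = 'bob'")

live = conn.execute("SELECT customer FROM orders").fetchall()
history = conn.execute("SELECT customer FROM deleted_orders").fetchall()
```

The live table stays clean (no extra column, no extra index entries), while the history table quietly accumulates whatever was deleted.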


> Accidentally deleting data that needs to be undeleted is usually rare so I don't think people should optimize for it.

That's very use case dependent.

We've made it easy for people to undelete data they've accidentally deleted simply because they used to do it so often and the only people who could get it back were our tech team. We're a devops org so part of our job is of course to support the systems we build, but our time is better spent on building solutions to business problems than to repeatedly providing support for issues that come up all the time. Part of building those systems is of course engineering in solutions that make it hard to screw up, and easy to unscrew when things inevitably do go wrong. No mean feat given our platform dates back over 15 years and still includes a lot of legacy from the time when tech was just a couple of people.

I suppose the object lesson here is that edge cases in one system or company can be part of core business in another so it's best not to make too many assumptions.


Agreed, soft deleting adds so much complexity to everything. And even has the potential for privacy related bugs. Like, say, accidentally forgetting to respect the deleted column in a query on a joining table that determines user permissions for some resource. Now people have access to something they had permissions revoked for. Whoops.


>I have dealt with the rare "Oops! I deleted important data!" by restoring from backups and it has worked fine.

Usually this goes along with "Oh and the other team did some important work at the same time" so you can't just restore a backup. You either tell them to deal with it or start writing custom scripts to copy out only the data you want to restore.

A saner solution would be to soft delete for X days, after which it becomes a real delete.


I guess one could make a view for each table that always includes WHERE deleted = false, so application code doesn't have to bother with it. Still, yes, it adds complexity.
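That view idea can be sketched like this (SQLite, hypothetical names): application queries go through the view and can't forget the filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, deleted INTEGER DEFAULT 0);
-- The view bakes in the soft-delete filter once, centrally.
CREATE VIEW live_users AS SELECT id, name FROM users WHERE deleted = 0;
""")
conn.execute("INSERT INTO users (name, deleted) VALUES ('carol', 0), ('dave', 1)")

# Application code queries the view and never sees soft-deleted rows.
rows = conn.execute("SELECT name FROM live_users").fetchall()
```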


A "deleted" field type deletion is also how you get a massive fine from a GDPR agency when they find out that you're not actually deleting PII properly.


One smaller social media app I used to sysad over actually overwrote data to be purged as xX0-Deleted-oXx and similar (there were a few variants depending on data constraints). There was no "show_deleted_when" garbage.

Then weekly, a task went in and then purged rows with those non-content placeholders to completely purge that user, if a user-purge was requested.


I don't see how moving it to another table is any way different from that perspective.


If you're using MS SQL Server it natively supports temporal tables, which take care of this problem for you. There's also an extension for Postgres, and of course triggers.

Temporal tables should really be used more often.


Works until you get a bug in the deletion job. I've seen exactly this happen.


You don't even need a bug. Just a wrong system clock.

We had a few windows laptops where something caused them to time travel to 8000 years in the future. Then, they'd slowly spend a few hours deleting every local profile, as nobody had logged in to them for 8000 years. Then, they'd do something to their time zone database and travel back 8000 years.

When they started the process, it was unstoppable. Trying to modify the system clock to something sane just caused them to depart to the future again, even if disconnected from the network. None of our users was very amused by this behavior, even if everything important was backed up.


that is almost cartoonishly nightmarish


It happened specifically to 1 type of laptop and we only had about 30 of them. So we pulled all of them out of rotation. Then covid struck, so I reformatted most of them with Debian and we gave them away for home schooling. I wonder if I managed to linuxify some kid in the process.


Thanks to you, plenty of kids now think they live 8,000 years in the future! :)


Oh man, so this was relatively recent no less. Hopefully you did!


Considering the sorry state of videoconferencing on Linux they probably all immediately had Windows reinstalled.


Not sure what this means, I've been using video conferencing on my Linux laptop for work on a daily basis for the last 5 years.


Yeah, the idea is that by expecting the deletion logic you can make it simpler and more rigorously tested than regularly changing business logic or application code.

If you organizationally cannot prioritize quality then nothing can help you.


> or bad SQL WHERE clause

Tip: Begin an SQL session with BEGIN TRANSACTION; at the end you can either COMMIT or ROLLBACK.


Bonus tip: don't even type COMMIT until after you've run a few select queries to verify the data has in fact changed how you expected it to change.
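The two tips together look like this in practice (sketched with SQLite and a made-up accounts table): run the risky statement inside a transaction, verify the damage with a SELECT, and only then COMMIT or ROLLBACK.

```python
import sqlite3

# isolation_level=None gives manual transaction control (explicit BEGIN/COMMIT).
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts (balance) VALUES (100.0), (50.0)")

conn.execute("BEGIN")
conn.execute("UPDATE accounts SET balance = balance - 10")  # oops: forgot the WHERE

# Verify before committing: we expected exactly one row to change.
changed = conn.execute("SELECT COUNT(*) FROM accounts WHERE balance < 100").fetchone()[0]
if changed != 1:
    conn.execute("ROLLBACK")  # mistake caught, nothing lost
else:
    conn.execute("COMMIT")

balances = [r[0] for r in conn.execute("SELECT balance FROM accounts ORDER BY id")]
```

Here the missing WHERE touched both rows, the sanity check failed, and the ROLLBACK undid everything.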


super duper bonus tip: Don't execute SQL in a production environment that you just typed out on the fly.


I fucked up once and lost 2 hours of customer data. I was so lucky we were a small startup and had daily backups AND the backup was only 2 hours old. I would have been royally fucked. Never making that mistake again.

Always use a copy of prod on a staging server and run your queries there for testing.


It's incredibly depressing how common this is in the real world, but I can tell you from experience that this NEVER happens in Atlassian production systems.


Another tip: never enter queries directly in a production database connection with write access in the first place. (Ideally very few people even have that level of access.) Write it in your codebase, write tests for it, get it code reviewed, and run it in a dry run first and get a list of affected records before running it for real.


I always liked LIMIT when doing dangerous stuff in prod


For anyone doing this, just keep in mind that chucking on a LIMIT 1 can give you a false sense of security. For example, say you want to drop a single row but forget the WHERE. A LIMIT 1 will return "yep, deleted one row" but it's not the one you wanted (instead, it's whatever row came up first). Better to do the operation in a transaction that you can rollback - that way, you can better validate the results of your operation before committing.


This is also true for people cleaning house or going minimal. Put it in a well labeled box and tuck it away somewhere. If it matters then you'll fetch it, otherwise just chuck it when the box gets in the way.


What's wrong with restoring from backups? This is one reason they exist after all. I don't think that making a mistake in delete statement is something you would do every week.


Because you lose all the work done since the issue happened. It's very rarely acceptable to just give up and do a full backup restore unless literally everything is gone. If there was just some bug that caused partial issues, you have to find some other way to fix it.


This whole debacle is Exhibit A.

They’re lucky they have a sound backup strategy in place, and that the amount of data lost appears to be minimal.


Yea, and this whole thread is very informative for a non-db specialist like me!


Could be that they are restoring from backups - just that restoring from those backups is very very slow. Atlassian would not be the first where resourcing and testing a speedy disaster recovery strategy wasn't given the highest engineering priority.


> Never ever ever be in a situation where a rogue script or bad SQL WHERE clause means restoring from backups.

As a second step, restore from backup at a set frequency. This forces orgs to automate and optimize not just the backup flow but also the restore flow: tear down and restore entire systems from backups. Of course, doing so adds enormously to the cost, but when there's an outage, it will pay for itself.


How does that work around GDPR and other "Right to be forgotten" legislation? Aren't we required to hard-delete this kind of data?


You are within a certain period of time, not ‘instantly’ (depending on the exact situation you are referring to). The script could take that into account (using a shorter period of time or the like)


What is hard deletion? You can restore rows from database files before vacuum runs. You can often restore data from disk sectors. Some people say an SSD can remap sectors behind your back and you won't even know that your deleted data is still there.


The law isn't a technical specification. You have to follow the spirit of the law. A soft deleted_at timestamp wouldn't be following the law in good faith. Having some data stuck in an unmapped section of an ssd would be within the spirit.


IANAL, but IMHO a soft 'deleted_at' timestamp along with a daily cron job that hard deletes everything with a deleted_at older than 24 hours would fall within the spirit of the law.

I agree that just having a deleted_at timestamp and old entries are never pruned would not be a good faith interpretation of the law.


From what I have seen, there is no requirement for instant deletes. Even emailing a support address and having them manually delete the data is acceptable. Most places using deleted_at never clean up the data from what I have seen though.


As long as the data is deleted within a month there should not be any GDPR concerns.

> The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay

> “Undue delay” is considered to be about a month

https://gdpr.eu/right-to-be-forgotten/


Amen. I always default to “soft” deleting whenever possible.


> This data was from a deprecated service that had been moved into the core datastore of our products.

That is very interesting. This implies they are backing off, at least somewhat, from their very aggressive microservice strategy. Perhaps they feel like they have gone too far in decomposing their products.


Or ransomware got them.



