Hacker News

It's not about assigning blame, it's about sharing lessons learned with the broader community and being transparent and honest with paying customers about issues that may have significant impact on downstream productivity.


It’s not about assigning blame for the company writing the post-mortem. But it’s definitely about assigning blame for most people reading the post-mortem. Very few people read post-mortems for the sake of learning how to be better at release engineering and ops.


If I pay for your service, and you are transparent about mistakes and flaws, I will be more forgiving about mistakes and flaws in the future, and appreciate the work you do to fix them.

If I pay for your service, and the only communication is, "We know there is a problem, and we'll let you know when it's fixed", I may assume you are not equipped to thoroughly explain the problem, and therefore not well equipped to solve it.

The blame is already assigned. The users already know there is a problem. A post-mortem likely has a positive effect on readers' attitudes toward the handling of the issue.


It’s more the people who don’t pay for the service, but might, that are quickest to see post-mortems in a negative light. The only reason they have for reading them is looking for justifications for culling the product/service from the list of contenders for when they ever have to evaluate solutions in that category.

In other words: post-mortems are good PR, but incredibly bad advertising.


And a world-wide outage followed by "we fixed it and trust us it won't happen again" is more likely to filter a service off my list than "we had a single point of failure running in our CTO's basement and his cleaning lady pulled the plug. Trust us it won't happen again."


I entirely understand what you are saying, believe me I do. But that is not the way some communities take it. We still see messages like "You could move to Gitlab but... you know they dropped their production database a couple of years back? Use them at your own risk!"

We learned a lot from the Gitlab outage. It was a simple mistake and not one they will make again, yet people still beat them up for it. I'm not sure the value is there for the company to be super open about their outages and issues.


On the contrary, I would trust them quite a bit less, not more, if they had an hours long outage without any explanation.


Perhaps - but would you even remember it, without the juicy details of what happened? I probably would forget if some service had a few hours downtime a year or two ago, if I didn't know any details to make it stand out from other outages.


Wouldn't they have gotten beaten up over the outage even more had they not offered an explanation?

In my experience, customers are often seeking an explanation/post-mortem because their customers are seeking an explanation. If an upstream service goes down for an extended period of time and all you can do is go back to your customers and say, "Your system was down because our provider's system went down for 4 hours. But they won't tell us why.", that does not go over well.


Gitlab's response to the database mistake was a large contributing factor in my decision to move all of my repositories onto their service.

Anecdotal, sure, but people like me exist. I don't know if we're in the majority. You'd have to measure somehow and do a cost-benefit analysis I guess.




