Hacker News

I'd like to point out the lesson that other industries can learn from IT infrastructure companies.

Heroku sells a technical product to a technical audience. They're foundational to their clients' products. So when something goes down, there's only one option: explain, in excruciating detail, exactly what happened, why it happened, and how it's going to be fixed in the future.

Why? Because their clients can smell bullshit better than a purebred bloodhound. Too much bullshit means it's time to move on.

Beyond being the right thing to do, being accountable is essential to trust. When you fuck up, it will piss people off. That's just life – everyone makes mistakes. So you need to be the guy people can point to and say, "Okay, there was a fuck up, it was bad, but look at how hard these guys worked to fix it. Check out their plans to prevent it in the future."

Luckily, the incentives are aligned here to make this mostly non-negotiable. With medical malpractice, a financial meltdown, or an oil spill, the cover-your-ass impulses are much more compelling.

Even in those cases, though, I insist we need to encourage a culture where accountability and transparency are rewarded. Accountable people are the kind I want to do business with.

I dunno much about scaling a Rails server, but for now, at least, I know the Heroku guys are the sort of people I'd trust.



> there's only one option: explain, in excruciating detail, exactly what happened, why it happened, and how it's going to be fixed in the future. Why? Because their clients can smell bullshit better than a purebred bloodhound. Too much bullshit means it's time to move on.

Okay. I feel a bit sorry for bashing heroku here, but I'll bite.

If I were a Heroku customer, I'd feel, ahem, a bit shortchanged by their idea of "excruciating detail".

So their "internal messaging system" triggered a bug in their "distributed routing mesh". And they applied a "hot patch".

Great. As far as I'm concerned, they might as well have written that their flux capacitor overheated because the pixie-dust exhaust got clogged with rogue bogomips.

I applaud their willingness to talk to their customers at all. But please... either explain what was going on in a meaningful way - or just leave it at "we screwed up and promise to do our best to prevent it from happening again".


> But please... either explain what was going on in a meaningful way

Some of us like a technical breakdown and get a warm, fuzzy sense of reassurance from it. If a few people get confused after the first paragraph, that's less harmful than appearing to bullshit technical users.


He's not saying it was too technical, he's saying it wasn't technical enough.


Yes, sorry if that was unclear.

In less snarky words: even Facebook told us quite clearly _how_ they screwed up the other day (the config management issue). In contrast, this Heroku article was disappointing.


Was that really excruciating detail? All I learned is that they had some bug in their messaging system and they screwed up while trying to fix it. Their postmortem pales in comparison to the recent Facebook and Foursquare postmortems.


I'm curious to know what else you would want to see there. I felt like it was a reasonable balance between writing something brief enough to be worth reading while still sharing the places where they themselves screwed up. It's pretty common practice, for example, to admit no wrongdoing whatsoever.


I don't know if I'd describe this postmortem as excruciating detail. I'd like to know more about what products they're using, how the "mesh" became overloaded, what their fix was, etc.

NASA reports have excruciating detail. This felt a bit vague.



