Hacker News

I'd like to point out the lesson that other industries can learn from IT infrastructure companies.

Heroku sells a technical product to a technical audience. They're foundational to their clients' products. So when something goes down, there's only one option: explain, in excruciating detail, exactly what happened, why it happened, and how it's going to be fixed in the future.

Why? Because their clients can smell bullshit better than a purebred bloodhound. Too much bullshit means it's time to move on.

Beyond being the right thing to do, being accountable is essential to trust. When you fuck up, it will piss people off. That's just life – everyone makes mistakes. So you need to be the guy people can point to and say, "Okay, there was a fuck up, it was bad, but look at how hard these guys worked to fix it. Check out their plans to prevent it in the future."

Luckily, the incentives are aligned here to make this mostly non-negotiable. With medical malpractice, a financial meltdown, or an oil spill, the cover-your-ass impulses are much more compelling.

Even in those cases, though, I insist we need to encourage a culture where accountability and transparency are rewarded. Accountable people are the kind I want to do business with.

I dunno much about scaling a Rails server, but for now, at least, I know the Heroku guys are the sort of people I'd trust.



> there's only one option: explain, in excruciating detail, exactly what happened, why it happened, and how it's going to be fixed in the future. Why? Because their clients can smell bullshit better than a purebred bloodhound. Too much bullshit means it's time to move on.

Okay. I feel a bit sorry for bashing heroku here, but I'll bite.

If I were a Heroku customer, I'd feel, ahem, a bit shortchanged by their idea of "excruciating detail".

So their "internal messaging system" triggered a bug in their "distributed routing mesh". And they applied a "hot patch".

Great. As far as I'm concerned, they might as well have written that their flux capacitor overheated because the pixie-dust exhaust got clogged with rogue bogomips.

I applaud their willingness to talk to their customers at all. But please... either explain what was going on in a meaningful way - or just leave it at "we screwed up and promise to do our best to prevent it from happening again".


> But please... either explain what was going on in a meaningful way

Some of us like a technical breakdown and get a warm, fuzzy sense of reassurance from it. If a few people get confused after the first paragraph, that's less harmful than appearing to bullshit technical users.


He's not saying it was too technical, he's saying it wasn't technical enough.


Yes, sorry if that was unclear.

In less snarky words: even Facebook told us quite clearly _how_ they screwed up the other day (the config management issue). In contrast, this Heroku article was disappointing.


Was that really excruciating detail? All I learned is that they had some bug in their messaging system and they screwed up while trying to fix it. Their postmortem pales in comparison to the recent Facebook and Foursquare postmortems.


I'm curious to know what else you would want to see there. I felt like it was a reasonable balance between writing something brief enough to be worth reading while still sharing the places where they themselves screwed up. It's pretty common practice, for example, to admit no wrongdoing whatsoever.


I don't know if I'd describe this postmortem as excruciating detail. I'd like to know more about what products they're using, how the "mesh" became overloaded, what their fix was, etc.

NASA reports have excruciating detail. This felt a bit vague.



