Hacker News

I've managed a couple thousand servers with ECC. The vast majority had zero reported errors over their whole life. Of those that reported errors, there were a few categories:

Some reported a couple errors a day for months (maybe years?) but worked fine.

Some ramped up error counts over hours or days.

Some went from zero to lots in one step.

A few managed to hit uncorrectable errors; sometimes just once.

A small number of correctable errors (< 10/day) or a single uncorrectable needed no action, though that kind of failure is what drives people without ECC crazy; some of the machines that hit an uncorrectable only did it once and were fine. The others we'd replace RAM for. Machines with a small number of daily errors or a single uncorrectable were less common than the ones that got their RAM swapped. I don't know for sure whether uncorrectables correlated with many correctable errors, because correctable errors were only reported hourly: if it was a step change to bad RAM, the machine was likely to halt before the next reporting interval, so there was no report. Unless the correctables were several a second, the impact of corrections isn't obvious.
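
A rough sketch of that triage rule, for illustration only. The 10/day threshold comes from the description above; the names `EccHistory` and `triage` are hypothetical, and treating more than one uncorrectable as grounds for replacement is an assumption, not something stated above.

```python
from dataclasses import dataclass
from typing import List

# Threshold from the comment above: fewer than 10 correctable errors a day
# needed no action. Everything else in this sketch is an assumption.
CORRECTABLE_PER_DAY_LIMIT = 10


@dataclass
class EccHistory:
    """Hypothetical per-machine summary of reported ECC errors."""
    correctable_per_day: List[int]  # one count per day, oldest first
    uncorrectable_total: int        # uncorrectable errors seen so far


def triage(history: EccHistory) -> str:
    """Return 'clean', 'no_action', or 'replace' for one machine."""
    worst_day = max(history.correctable_per_day, default=0)

    # Assumption: a second uncorrectable means it was not a one-off.
    if history.uncorrectable_total > 1:
        return "replace"
    # Ramping or step-change machines cross the daily threshold.
    if worst_day >= CORRECTABLE_PER_DAY_LIMIT:
        return "replace"
    # A trickle of correctables, or a single uncorrectable: leave it alone.
    if worst_day > 0 or history.uncorrectable_total == 1:
        return "no_action"
    return "clean"
```

For example, `triage(EccHistory([2, 3, 1], 0))` gives `"no_action"`, while a machine that ramps up, such as `EccHistory([1, 50, 400], 0)`, gives `"replace"`.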



> For a small number of correctable errors (< 10/day), there was no action needed,

Those should've been replaced, so in other words ECC is just a crutch. All the RAM problems I've had were found by Memtest86.


Why replace when the system is stable? I guess there may be an increased chance of multibit errors. But sometimes new RAM is flaky, or disturbing the rack causes other problems.

Is ECC a crutch? Sure. But it's hard to walk with a bum leg/bad RAM, so why not have it? ('Cause it's expensive is a fine reason, but if it were closer to 25% more than 100% more, it'd be easier to say yes.)

Memtest86 is great, but systems change and most people aren't running memtest frequently. On my non-ECC systems, I run it during setup to make sure things are good, and only later if things get crashy... but if things get crashy because of bad RAM, my data may already be corrupted.



