It’s not that they don’t work. It’s how businesses handle hardware.
I worked at a few data centers on and off in my career. I got lots of hardware for free or on the cheap simply because the hardware was considered “EOL” after about 3 years, often when support contracts with the vendor ends.
There are a few things to consider.
Hardware that ages produce more errors, and those errors cost, one way or another.
Rack space is limited. A perfectly fine machine that consumes 2x the power for half the output cost. It’s cheaper to upgrade a perfectly fine working system simply because it performs better per watt in the same space.
Lastly. There are tax implications in buying new hardware that can often favor replacement.
I agree that there is hyperbole thrown around a lot here and its possible to still use some hardware for a long time or to sell it and recover some cost but my experience in planning compute at large companies is that spending money on hardware and upgrading can often result in saving money long term.
Even assuming your compute demands stay fixed, its possible that a future generation of accelerator will be sufficiently more power/cooling efficient for your workload that it is a positive return on investment to upgrade, more so when you take into account you can start depreciating them again.
If your compute demands aren't fixed you have to work around limited floor space/electricity/cooling capacity/network capacity/backup generators/etc and so moving to the next generation is required to meet demand without extremely expensive (and often slow) infrastructure projects.
Sure, but I don't think most people here are objecting to the obvious "3 years is enough for enterprise GPUs to become totally obsolete for cutting-edge workloads" point. They're just objecting to the rather bizarre notion that the hardware itself might physically break in that timeframe. Now, it would be one thing if that notion was supported by actual reliability studies drawn from that same environment - like we see for the Backblaze HDD lifecycle analyses. But instead we're just getting these weird rumors.
I agree that is a strange notion that would require some evidence and I see it in some other threads but looking at the parent comments going up it seems people are discussing economic usefulness so that is what I'm responding to.
A toy example: NeoCloud Inc builds a new datacenter full of the new H800 GPUs. It rents out a rack of them for $10/minute while paying $6/minute for electricity, interest, loan repayment, rent and staff.
Two years later, H900 is released for a similar price but it performs twice as many TFlOps/Watt. Now any datacenter using H900 can offer the same performance as NeoCloud Inc at $5/month, taking all their customers.
It's because they run 24/7 in a challenging environment. They will start dying at some point and if you aren't replacing them you will have a big problem when they all die en masse at the same time.
These things are like cars, they don't last forever and break down with usage. Yes, they can last 7 years in your home computer when you run it 1% of the time. They won't last that long in a data center where they are running 90% of the time.
A makeshift cryptomining rig is absolutely a "challenging environment" and most GPUs by far that went through that are just fine. The idea that the hardware might just die after 3 years' usage is bonkers.
Crypto miners undervote for efficiency GPUs and in general crypto mining is extremely light weight on GPUs compared to AI training or inference at scale
With good enough cooling they can run indefinitely!!!!! The vast majority of failures are either at the beginning due to defects or at the end due to cooling! It’s like the idea that no moving parts (except the HVAC) is somehow unreliable is coming out of thin air!
There’s plenty on eBay? But at the end of your comment you say “a rate cheaper than new” so maybe you mean you’d love to buy a discounted one. But they do seem to be available used.
Do you know how support contract lengths are determined? Seems like a path to force hardware refreshes with boilerplate failure data carried over from who knows when.
I worked at a few data centers on and off in my career. I got lots of hardware for free or on the cheap simply because the hardware was considered “EOL” after about 3 years, often when support contracts with the vendor ends.
There are a few things to consider.
Hardware that ages produce more errors, and those errors cost, one way or another.
Rack space is limited. A perfectly fine machine that consumes 2x the power for half the output cost. It’s cheaper to upgrade a perfectly fine working system simply because it performs better per watt in the same space.
Lastly. There are tax implications in buying new hardware that can often favor replacement.