> I may not be an expert in high-reliability systems, but that isn't how I expect the problem to be tackled.
You probably know way better than I do, but in my experience, configuring things correctly on healthy hardware gets you 99.99% availability by default. Adding some surplus capacity adds another 9, at least.
Then you build from there (autoscaling, hardware failover, etc.).
That does depend somewhat on what you count as "downtime". Larger distributed systems may not suffer a complete outage in a whole year, and more than one of the services my team runs has a successful-response ratio at that level.
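To put rough numbers on the two ways of counting, here is a minimal Python sketch. It assumes a 365-day year, and the function names are made up for illustration. The time-based view turns an availability target into a yearly budget of full-outage minutes; the request-based view divides successful responses by total responses.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of complete outage per year allowed at the given availability."""
    return (1.0 - availability) * SECONDS_PER_YEAR / 60


def success_ratio(successful: int, total: int) -> float:
    """Fraction of requests answered successfully (request-based availability)."""
    # With no traffic there were no failures, so treat the ratio as 1.0.
    return successful / total if total else 1.0


if __name__ == "__main__":
    for target in (0.9999, 0.99999):
        minutes = downtime_budget_minutes(target)
        print(f"{target:.3%} allows about {minutes:.1f} min of outage per year")
```

By this arithmetic, 99.99% allows roughly 52.6 minutes of full outage per year and 99.999% roughly 5.3 minutes. A service can hit the same figures on the request-based measure without ever being completely down, which is why the definition of "downtime" matters.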