> In fact, the vast majority of distributed systems outages could have been prevented by slightly-more-comprehensive testing. And that’s just more comprehensive unit testing,
Key takeaway for me as far as code verification goes.
Design verification reminded me of a few subjects I studied in college.
Take that claim with a grain of salt. More comprehensive testing still requires deep expertise in resilience and reliability just to understand what to test for. And you can't have that unless you both study the domain and experience firsthand the sheer number of problems such systems have in the wild.
For example, you might assume it makes no difference whether you establish a TCP connection to port 12345 or to port 12346 on the same server. But it can: the two ports might map to different buckets somewhere in a network stack, and one bucket might be overloaded with packets and slow while the other is not and is perfectly fast. This could easily cause outages. Distributed systems are strange like that.
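To make that concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the bucket count, the `bucket_for` hash (a trivial stand-in, not what any real kernel, NIC, or load balancer uses), the client port 50000, and the backlog numbers. It only shows how two neighbouring ports can end up behind very different queues.

```python
NUM_BUCKETS = 4  # hypothetical number of queues at some hop along the path


def bucket_for(src_port: int, dst_port: int) -> int:
    """Toy flow hash (assumption): real stacks hash more fields with a different function."""
    return (src_port + dst_port) % NUM_BUCKETS


def queue_delay_us(bucket: int, backlog: dict, per_packet_us: int = 10) -> int:
    """Delay in microseconds grows with the packets already queued in this bucket."""
    return backlog.get(bucket, 0) * per_packet_us


SRC_PORT = 50000  # hypothetical client-side port

# Pretend the bucket that port 12345 hashes into already holds 20,000 packets.
backlog = {bucket_for(SRC_PORT, 12345): 20_000}

for dst_port in (12345, 12346):
    bucket = bucket_for(SRC_PORT, dst_port)
    print(dst_port, bucket, queue_delay_us(bucket, backlog))
# prints:
# 12345 1 200000
# 12346 2 0
```

Same server, same client, adjacent ports, and one connection sits behind 200 ms of queued packets while the other sees none. A test suite would only catch this if someone already knew to vary the port.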