
> this leads to the ability to do those kind of automated root cause analysis at scale.

Speaking as someone who just came off an oncall shift where this was one of our more notable outages, I'm curious how well that works when a config change or experiment rollout plants a time bomb (e.g. one that only goes off when tasks restart after a later software rollout).

Google also has a ledger of production events which _most_ common infra will write to, but there are so many distinct systems that I would be worried about identifying spurious correlations with completely unrelated products.
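To make the concern concrete, here's a purely hypothetical sketch (not any real internal tool): a naive correlator that blames whatever ledger entries landed in a window before the outage will happily surface unrelated products' changes, and will miss entirely a config change that only detonates on a later task restart. All system names and timestamps below are made up.

```python
# Hypothetical sketch of naive change/outage correlation -- not any real
# internal tool. Illustrates both failure modes discussed above.
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LedgerEvent:
    system: str          # which piece of infra wrote the event
    description: str
    timestamp: datetime


def naive_suspects(events, outage_start, window=timedelta(hours=2)):
    """Blame anything that changed shortly before the outage began."""
    return [e for e in events
            if outage_start - window <= e.timestamp <= outage_start]


ledger = [
    # The actual culprit: a config change that only takes effect on task restart.
    LedgerEvent("frontend-config", "enable new quota policy", datetime(2024, 9, 1, 9, 0)),
    # Unrelated changes from other products sharing the same ledger.
    LedgerEvent("photos-backend", "weekly binary release", datetime(2024, 9, 3, 13, 40)),
    LedgerEvent("maps-experiments", "ramp experiment to 5%", datetime(2024, 9, 3, 14, 10)),
    # The trigger: a routine rollout that restarts tasks and arms the time bomb.
    LedgerEvent("frontend-release", "binary rollout (restarts tasks)", datetime(2024, 9, 3, 14, 30)),
]

outage_start = datetime(2024, 9, 3, 14, 35)
for suspect in naive_suspects(ledger, outage_start):
    print(suspect.system, "-", suspect.description)
# Prints the unrelated photos/maps changes and the rollout that merely pulled
# the trigger; the two-day-old config change that actually broke things never
# appears in the window.
```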

> There have been issues that are external (like ... physical connection between datacenters being severed…) but they tend to be a lot less frequent.

That's interesting to hear, because my experience at Google is that we'll see a peering metro fully isolated from our network at least once a year; smaller fiber cuts that temporarily leave us with a SPOF or a capacity shortfall happen much, much more frequently.

(For a concrete example: a couple months ago, Hurricane Beryl temporarily took a bunch of peering infrastructure in Texas offline.)


