I have a serious question, not trying to start a flame war.
A. Are these major issues with cloud/SaaS tools becoming more common, or is it just that they get a lot more coverage now? It seems like we see major issues across AWS, GCP, Azure, Github, etc. at least monthly now and I don't remember that being the case in the past.
B. If it's becoming more common, what are the reasons? I can think of a few, but I don't know the answer, so if anyone in-the-know has insight I'd appreciate it.
Operations budget cuts/layoffs?
Replacing critical components/workflows with AI?
Just overall growing pains, where a service has outgrown what it was engineered for?
> A. Are these major issues with cloud/SaaS tools becoming more common, or is it just that they get a lot more coverage now? It seems like we see major issues across AWS, GCP, Azure, Github, etc. at least monthly now and I don't remember that being the case in the past.
FWIW Microsoft is convinced moving Github to Azure will fix these outages
> In 2002, the amusement continued when a network security outfit discovered an internal document server wide open to the public internet in Microsoft's supposedly "private" network, and found, among other things, a whitepaper[0] written by the hotmail migration team explaining why unix is superior to windows.
And 25 years later, a significant portion of the issues in that whitepaper remain unresolved. They were still shitting on people like Jeffrey Snover who were making attempts to provide more scalable management technologies. Such a clown show.
Microsoft is a company that hasn't even figured out how to get system updating working consistently on their premier operating system in three decades. It seems unlikely to me that somehow moving to Azure is going to make anything more stable.
I think it would be pretty hard to argue against that point of view, at least thus far. If DOS/Windows hadn't become the dominant OS, something else would have, and a whole generation of engineers cut their teeth on their parents' Windows PCs.
There are some pretty zany alternative realities in the Multiverses I’ve visited. In one, Xerox PARC never went under and developed computing as a much more accessible commodity. In another, Bell Labs invented a whole category of analog computers that supplanted our universe’s digital computing era. There’s one where IBM goes directly to supercomputers in the 80s. While undoubtedly Microsoft did deliver for many of us, I am hesitant to say that was the only path. Hell, Steve Jobs existed in the background for a long while there!
I wish things had gone differently too, but a couple of nitpicks:
1.) It's already a miracle Xerox PARC escaped their parent company's management for as long as they did.
2.) IBM had been playing catch-up on the supercomputer front since the CDC 6600 in 1964. Arguably, they did finally catch up in the mid-to-late 80's with the 3090.
Yeah, I'm absolutely not saying it was the only path. It's just the path that happened. If not MS maybe it would have been Unix and something else. Either way most everyone today uses UX based on Xerox Parc's which was generously borrowed by, at this point, pretty much everyone.
If Microsoft hadn't tried to actively kill all its competition then there's a good chance that we'd have a much better internet. Microsoft is bigger than just an operating system, they're a whole corporation.
Instead they actively tried to murder open standards [1] that they viewed as competitive and normalized the antitrust nightmare that we have now.
I think by nearly any measure, Microsoft is not a net good. They didn't invent the operating system, there were lots of operating systems that came out in the 80's and 90's, many of which were better than Windows, that didn't have the horrible anticompetitive baggage attached to them.
Alternatively: had MS Embraced and Extended harder instead of trying to extinguish ASAP we’d have a much better internet owned to a much higher degree by MS.
A few decades back Microsoft were first to the prize with asynchronous JavaScript, Silverlight really was flash done better and still missed, a proper extension of their VB6/MFC client & dev experience out to the web would have gobbled up a generation of SaaS offerings, and they had a first in class data analysis framework with integrated REPL that nailed the central demands of distributed/cloud-first systems and systems configuration (F#). That on top of near perfect control of the document and consumer desktop ecosystems and some nutty visualization & storage capabilities.
Plug a few of their demos from 2002 - 2007 together and you’ve got a stack and customer experience we’re still hurting for.
Silverlight is only “Flash Done Better” if we had the dystopia of Windows being the only desktop operating system. Silverlight never worked on Linux, and IIRC it didn’t work terribly well on macOS (though I could be misremembering).
In fact all of your points are only true if we accept that Windows would be the only operating system.
Microsoft half-asses most things. If they had taken over the internet, we would likely have the entirety of the internet be even more half-assed than it already is.
What’s funny is that we were some bad timing away from IBM giving the DOS money to Gary Kildall and we’d all be working with CP/M derivatives!
Gary was on a flight when IBM called up Digital Research looking for an OS for the IBM-PC. Gary’s wife, Dorothy, wouldn’t sign an NDA without it going through Gary, and supposedly they never got negotiations back on track.
I'm not sure I understand this logic. You're saying that the gap would have been filled even if their product didn't exist, which means that the net benefit isn't that the product exists. How are you concluding that whatever we might have gotten instead would have been worse?
And how does it follow that microsoft is the good guy in a future where we did it with some other operating system? You could argue that their system was so terrible that its displacement of other options harmed us all with the same level of evidence.
I'm not convinced of your first point. Just because something seems difficult to avoid given the current context does not mean it was the only path available.
Your second point is a little disingenuous. Yes, Microsoft and Windows have been wildly successful from a cultural adoption standpoint. But that's not the point I was trying to argue.
My first comment is simply pointing out that there's always a #1 in anything you can rank. Windows happened to be what won. And I learned how to use a computer on Windows. Do I use it now? No. But I learned on it as did most people whose parents wanted a computer.
The comment you were replying to was about Microsoft.
Even if Windows weren't a dogshit product, which it is, Microsoft is a lot more than just an operating system. In the 90's they actively tried to sabotage any competition in the web space, and held web standards back by refusing to make Internet Explorer actually work.
Been on GitHub for a long time. It feels like outages are more frequent. It used to be yearly, if at all, that GitHub was noticeably impacted. Now it's monthly, and recently, seemingly weekly.
Definitely not how I remember it. First, I remember seeing the unicorn page multiple times a day some weeks. There were also times when webhook delivery didn't work, so CircleCI users couldn't kick off any builds.
What has changed is how many GitHub services can be having issues at once.
I suspect that the Azure migration is influencing this one. Just a bunch of legacy stuff being moved around along with Azure not really being the most reliable on top... I can't imagine it's easy.
However, this is an unexpected uptick. I wonder if GitHub is seeing more frequent adversarial action lately. Alternatively, perhaps there is a premature reliance on new technology at play.
I pulled my project off github and onto codeberg a couple months ago but this outage still screws me over because I have a Cargo.toml w/ git dependency into github.
I was trying to do a 1.0 release today. Codeberg went down for "10 minutes maintenance" multiple times while I was running my CI actions.
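For what it's worth, Cargo can redirect a git dependency to a mirror via a `[patch]` section keyed by the source URL, which softens the blow of a GitHub outage. A hypothetical sketch (the crate name and URLs are placeholders, not from the parent comment's project):

```toml
# Original git dependency pointing at GitHub.
[dependencies]
some-crate = { git = "https://github.com/example/some-crate" }

# Redirect that source to a mirror (e.g. on Codeberg) so builds
# keep working while GitHub is down. The key is the patched URL.
[patch."https://github.com/example/some-crate"]
some-crate = { git = "https://codeberg.org/example/some-crate" }
```

The mirror has to actually exist and be kept in sync, of course; this just moves the single point of failure rather than eliminating it.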
> If it's becoming more common, what are the reasons?
Someone answered this morning, during the Cloudflare outage, that it's AI vibe coding, and I tend to think there is something to it. At some point there might be some tiny grain of AI-generated code that starts an avalanche ending like this.
It certainly feels that way, though it may be an instance of availability bias. Not sure what's causing it - maybe extra load from AI bots (certainly a lot of smaller sites complain about it; maybe major providers feel the pain too), maybe some kind of general quality erosion... It's certainly something waiting for serious research.
Looking around, I noticed that many senior, experienced individuals were laid off, sometimes replaced by juniors/contractors without institutional knowledge or experience. That's especially evident in ops/support, where the management believes those departments should have a smaller budget.
GitHub isn't in the same reliability class as the hyperscalers or Cloudflare; it's comically bad now, to the point that at a previous job we invested in building a read-only cache layer specifically to prevent GitHub outages from bringing our system down.
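The "read-only cache layer" idea the parent describes can be sketched as a read-through cache that falls back to the last good response when the upstream fails. This is an illustrative sketch, not the actual system from that job; all names here are made up:

```python
import time
import urllib.request


class StaleOkCache:
    """Read-through cache that serves stale data during upstream outages."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.store = {}  # url -> (fetched_at, body)

    def get(self, url, fetch=None):
        fetch = fetch or self._fetch
        entry = self.store.get(url)
        # Fresh hit: return the cached body without touching upstream.
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        try:
            body = fetch(url)
            self.store[url] = (time.time(), body)
            return body
        except Exception:
            # Upstream (e.g. GitHub) is down: serve the stale copy if we
            # have one, otherwise there is nothing to fall back on.
            if entry:
                return entry[1]
            raise

    @staticmethod
    def _fetch(url):
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read()
```

The key design choice is treating staleness as acceptable during an outage: a slightly old copy of a repo manifest or release artifact beats a failed deploy.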
Years ago on Hacker News I saw a link about probability describing a statistical technique one could use to answer whether a specific type of event is becoming more common or not. Maybe related to the birthday paradox? The gist I remember is that sometimes a rare event will seem to be happening more often, when in reality some cognitive bias makes it non-intuitive to judge without running the numbers. I think it was a blog post that went through a few different examples, and maybe only one of them was actually happening more often.
1/ Most of the big corporations moved to big cloud providers in the last 5 years. Most of them started 10 years ago but it really accelerated in the last 5 years.
So there is for sure more weight and complexity on cloud providers, and more impact when something goes wrong.
2/ Then we cannot expect big tech to stay as sharp as in the 2000s and 2010s.
There was a time banks had all the smart people, then the telcos had them, etc. But people get older and too comfortable, layers of bad incentives and politics accumulate, and you just become a dysfunctional big mess.
I think they're becoming more common because AI -> FOMO -> tighter deadlines on projects "since you can use AI to accelerate your work", which is often not how it works, and last 10% of reliability work is forgotten.
> Are these major issues with cloud/SaaS tools becoming more common, or is it just that they get a lot more coverage now?
I think that "more coverage" is part of it, but also "more centralization." More and more of the web is centralized around a tiny number of cloud providers, because it's just extremely time-intensive and cost-prohibitive for all but the largest and most specialized companies to run their own datacenters and servers.
Specific examples: Dropbox famously moved off AWS to run its own datacenters and servers; Netflix and Strava run on AWS.
> If it's becoming more common, what are the reasons? I can think of a few, but I don't know the answer, so if anyone in-the-know has insight I'd appreciate it.
I worked at AWS from 2020-2024, and saw several of these outages so I guess I'm "in the know."
My somewhat-cynical take is that a lot of these services have grown enormously in complexity, far outstripping the ability of their staff to understand them or maintain them:
- The OG developers of most of these cloud services have moved on. Knowledge transfer within AWS is generally very poor, because it's not incentivized, and has gotten worse due to remote work and geographic dispersion of service teams.
- Managers at AWS are heavily incentivized to develop "new features" and not to improve the reliability, or even security, of their existing offerings. (I discovered numerous security vulnerabilities in the very-well-known service that I worked for, and was regularly punished-rather-than-rewarded for trying to get attention and resources on this. It was a big part of what drove me to leave Amazon. I'm still sitting on a big pile of zero-day vulnerabilities in ______ and ______.)
- Cloud services in most of the world are basically a 3-way oligopoly between AWS, Microsoft/Azure, and Google. The costs of switching from one provider to another are often ENORMOUS due to a zillion fiddly little differences and behavior quirks ("bugs"). It's not apparent to laypeople — or even to me — that any of these providers are much more or less reliable than the others.
I suspect there is more tech out there. 20 years ago we didn't have smartphones. 10 years ago, 20mbit on mobile was a good connection. Gigabit is common now, infrastructure no longer has the hurdles it used to, AI makes coding and design much easier, phones are ubiquitous, and using them at all times (in the movies, out at dinner, driving) has become super normalised.
I suspect (although have not researched) that global traffic is up, by throughput but also by session count.
This contributes to a lot more awareness. Slack being down wasn't impactful when most tech companies didn't use Slack. An AWS outage was less relevant when the 10 apps (they used to be websites) you use most didn't all rely on a single AWS AZ, or when you were on your phone less.
I think as a society it just has more impact than it used to.
> B. If it's becoming more common, what are the reasons?
Among other mentioned factors like AI and layoffs: mass brain damage caused by never-ending COVID re-infections.
Since vaccines don't prevent transmission, and each re-infection increases the chances of long COVID complications, the only real protection right now is wearing a proper respirator everywhere you go, and basically nobody is doing that anymore.
There are tons of studies to back this line of reasoning.
One possibility is increased monitoring. In the past, issues that happened weren't reported because they went under the radar. Whereas now, those same issues which only impact a small percentage of users would still result in a status update and postmortem. But take this with a grain of salt because it's just a theory and doesn't reflect any actual data.
A lot of people are pointing to AI vibe coding as the cause, but I think more often than not, incidents happen due to poor maintenance of legacy code. But I guess this may be changing soon as AI written code starts to become "legacy" faster than regular code.