
> I watched senior engineering leadership ask senior Ops leadership why they had never been asked to fix them. Ops replied that their long experience in the financial industry had taught them that Ops never gets to use software which isn’t broken and that complaining about this is like complaining about gravity.

I see this everywhere, not just at banks. A common workflow is horribly broken, with dozens of habitual workarounds, and the developer could fix it in a day if they knew about the issue. I even see it between engineering teams when there are cross-team dependencies. It is really hard to train people to not put up with chronic pain in their workflows!



I work with banks on, um, issues, without disclosing too much. Occasionally we find issues with their software workflows, and this isn't directly related to monetary stuff. Most of the time we're not allowed to have the end result change in any way. "But we're getting a wrong answer" or "this is costing XX person-hours per day" isn't up for consideration. Nothing must change.

Ugh, and on the subject of banks' computer operations: every single department is in deep blame-avoidance mode. It's not "find and identify problems" mode, it's "it wasn't me" mode. We had our application performance drop to almost zero (as in, disk I/O measured in operations per minute, not per second). I spent hours telling the customer this was their infrastructure. So we got the infrastructure teams on the call to figure out where the problem was. Not a single one of them was helpful: the first thing out of their mouths was "everything's fine, it's not us" and the second was "we didn't change anything".

It took 10 hours of sitting on a call over 2 days to get the NAS team to admit they had turned on anti-virus on the NAS side and that the machines were in meltdown because the CPU was off the charts. The previous teams hadn't even looked at their metrics before coming back with an "it's not my problem, everything is fine" response.


Encountered a challenge like this once. The infrastructure team kept telling us it wasn't anything on their end. I coordinated with the business unit VP to serve their entire QA environment from four VMs on my laptop for a day, and performance went from slow as molasses to purring like a kitten. A couple of days later their infrastructure team finally identified storage latency issues on their multi-million dollar cluster and let me help them fix it.


I swear IOPS/request latency is one of the least understood things in these huge companies. "But we have 40bazillion GB and 100Gb LAN", cool story bro: your disk queue is pegged at its depth limit and your fs response latency is over a second, so everything is going to suck till you deal with that.


I have had customers say this when they weren't getting the desired throughput cross-continent. "But the pipes! They are fat!" Yes, but your window also needs to scale... six-figure network engineers who don't know about window scaling and can't analyze a packet trace.
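
To make the window-scaling point concrete: a single TCP flow tops out at window/RTT regardless of link speed. A quick back-of-the-envelope check (numbers are illustrative, in Python):

    # A single TCP flow is capped at window / RTT, no matter how fat the pipe.
    rtt_s = 0.150                 # ~150 ms cross-continent round trip
    window = 65_535               # max receive window without window scaling (bytes)
    print(f"{window / rtt_s / 1e6:.2f} MB/s")   # ~0.44 MB/s, on ANY link speed

    # To fill a 10 Gbit/s pipe at that RTT, the window has to cover the
    # bandwidth-delay product:
    bdp = (10e9 / 8) * rtt_s      # bytes
    print(f"{bdp / 1e6:.0f} MB window needed")  # ~188 MB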

I recently had someone suggest that they needed 1ms latency cross-continent. I explained patiently that the laws of physics have to change for them to hit that number.
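
The physics check is nearly a one-liner (the New York to London distance here is just an example route; light in fiber travels at roughly two-thirds of c):

    distance_km = 5_570           # ~NYC to London, one way
    fiber_km_per_s = 200_000      # light in fiber, ~2/3 of c
    one_way_ms = distance_km / fiber_km_per_s * 1000
    print(f"one-way {one_way_ms:.0f} ms, RTT {2 * one_way_ms:.0f} ms")
    # ~28 ms one way, ~56 ms round trip; 1 ms isn't happening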

I am not even a network engineer!


Oh, this is too common! I can't even count how many times I've had to blame Einstein for setting the speed of light too low. (Nitpick: I know he's the wrong person to blame, but he's famous enough to make the joke understandable to everybody.)

Once I had to write a root cause analysis report and it basically explained how TCP works.


Truth. I literally have a signal on one of my monitoring dashboards which indicates "database is currently undergoing online backup", because it is the single most important performance signal I have. This doesn't come to me as a signal from the database team; I have to poll the database for it myself.
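
The parent doesn't say which database this is; as a sketch, on PostgreSQL 13+ you could derive the same signal yourself from the pg_stat_progress_basebackup view (the connection string and the 30-second poll interval here are made up):

    import time
    import psycopg2  # assumes a PostgreSQL target

    conn = psycopg2.connect("dbname=app host=db.example.internal")  # hypothetical DSN

    while True:
        with conn.cursor() as cur:
            # The view has one row per base backup currently running (PG 13+).
            cur.execute("SELECT count(*) FROM pg_stat_progress_basebackup")
            backup_running = cur.fetchone()[0] > 0
        # Emit 0/1 as the "backup in progress" gauge for the dashboard.
        print(f"db_backup_in_progress {int(backup_running)}")
        time.sleep(30)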

Noisy neighbor in inadequately-isolated, shared-tenancy models is just the worst.


> your disk queue is pegged at its depth limit and your fs response latency is over a second

You said words. And I totally, 100% understand them. But, like, for the plebs that are totally not me, could you elaborate on what you're talking about here? I^WThey would like to understand.


Basically, "the disk is always terribly busy with tiny data operations".
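
If you want to see it for yourself on Linux, here's a rough sketch that samples /proc/diskstats twice (field layout per the kernel's Documentation/admin-guide/iostats.rst; swap in your own device name):

    import time

    def snap(dev):
        with open("/proc/diskstats") as f:
            for line in f:
                t = line.split()
                if t[2] == dev:
                    # completed I/Os, total ms spent on them, I/Os in flight
                    return (int(t[3]) + int(t[7]),
                            int(t[6]) + int(t[10]),
                            int(t[11]))
        raise ValueError(f"device {dev!r} not found")

    dev = "sda"  # adjust for your machine
    ios0, ms0, _ = snap(dev)
    time.sleep(5)
    ios1, ms1, in_flight = snap(dev)

    if ios1 > ios0:
        # average time each completed I/O spent queued + being serviced
        print(f"avg I/O latency: {(ms1 - ms0) / (ios1 - ios0):.1f} ms")
    print(f"I/Os in flight right now: {in_flight}")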


I find it very hard to tolerate teams that say an issue isn't on them if they aren't pointing to evidence that indicates where they think the issue is. If you think it's not on you but it affects something within your responsibility, you're still on the hook until you prove it.


Nothing more fun than troubleshooting a fax machine incompatibility.

“It’s you, we’re receiving faxes from everyone else” (how would they know?)

“Nooo, it’s you, everyone else is receiving our faxes”


> I work with banks on, um issues, without disclosing too much. Occasionally we find issues with their software workflows, and this isn't related directly to monetary stuff. We're not allowed to have the end result change in any way most of the time.

I think this is a good constraint for banks to have in general, and it's definitely harder to change a workflow under these constraints.

> Every single department is in deep blame avoidance mode. It's not "find and identify problems" mode, it's "It wasn't me" mode.

This tends to happen in large organizations, and is incredibly toxic to productivity. The most extreme form is when you get fired (or otherwise censured) for fixing something because "You were in charge of the thing that was causing all this trouble?!"


You sometimes have to treat adults like little children. The first thing I say is, "You're not in trouble." Kids assume that if an adult talks to them, they are automatically in trouble. I've gotten a lot of helpful results by telling people no one is in trouble when I make my inquiry. Amazing how grown adults regress to children when a different department calls them.


A while back I worked at a digital media company. For developers, the process of setting up ads for new mini sites was a big pain and required lots of back and forth and approvals with the Ads team. I asked other developers, and they said, "This is just the process that the Ads team needs."

I went and talked to the Ads people about this specifically, and they said, "Yeah, it's a pain, but this is the way the developers need it."

Turns out both parties had been wanting a better way that was pretty easy (3 different page/ad placement templates), but neither had bothered to express this to the other.


Congratulations - you just became a Sr Product Manager!


I moved to a role supporting a team that had been told huge numbers of things were "impossible" by a brilliant guy given to migraines. If you asked him on a good day, he could do anything, but on a bad day he just wanted you out of his office. Because of his demonstrated competence, they took what he said as gospel.

I spent an hour or two a week dragging "impossible" things out of them and fixing them. They were very happy and ascribed wizard-level powers to me.


A lot of the time, the 'correct' workflow isn't even documented well enough to recreate it.

Often not even enough to identify that there ever was one in the first place.


On a previous team, every eng would shadow an Ops person for a day, once a year. We fixed so much stuff.


It sounds like the psychology of "Learned Helplessness" [0] among employees and teams.

[0] https://en.wikipedia.org/wiki/Learned_helplessness


> It is really hard to train people to not put up with chronic pain in their workflows!

It's learned helplessness. In some organisations, when you ask for things like that, you just get told the system says no. If that happens enough, you stop wasting time asking. It's really not irrational either; it's just demotivating to get denied again and again for reasonable requests.


This is why I believe it's critical to occasionally rotate people into a different team for a few weeks to a month. Every time I've seen it happen, there are subtle things learned which can lead to big improvements. It's not a magic bullet, but it's much more likely to lead to discovering things nobody knew than if everybody stays in place forever.


Don't get me started. I've seen many cases where one PHP page doing one occasional SQL query against one prod database could drive massive efficiency improvements in the organization. In a minority of these cases, I was allowed to do such a thing.



