Honest slightly cynical question: most probably someone inside the responsible t...

manquer · on June 9, 2020

If such people had raised concerns and been overridden the way you describe, they would sadly have left long ago.

It is not that companies become consciously malicious or are incompetent to start with. It becomes a vicious cycle: as more and more poor management and engineering talent joins, the good ones leave, and the cycle continues.

Acquisitions and mergers stave off the slow slide into irrelevance for a while, till the best of the new guys leave too. Systemic cultural change is very, very hard to achieve in large organizations.

all_blue_chucks · on June 9, 2020

Never humiliate a coworker in public. Instead say "both options were considered but ultimately it was decided to select option B for reason Y."

caiobegotti · on June 9, 2020

My professional experience tells me that the next question will be who decided for B given Y, then you answer it and then you have a target on your SRE back, I'm afraid. Remember that the trickle-down economics works only when the shit hits the fans and what trickles down is not money.

all_blue_chucks · on June 10, 2020

If your company uses post mortems to blame individuals rather than fix processes and tools, you haven't worked in a professional environment.

FrankPetrilli · on June 10, 2020

I've worked in these types of organizations before, and it's always counter-productive to making positive changes in the environment. You as an SRE can either make it better, or find an environment that's conducive to positive changes.

If you'd like an outside resource to suggest or read up on better postmortem practices, the Google SRE Book has a chapter [0] on postmortem culture. It's an amazing change of pace and a huge stress level improvement for us SREs.

[0] https://landing.google.com/sre/sre-book/chapters/postmortem-...

pmarreck · on June 10, 2020

So, the appeal to anonymous authority, I see

ethbro · on June 10, 2020

What if Y = ignorance?

jonfw · on June 10, 2020

Where are all of these managers out there running companies making decisions with literally no thought whatsoever? I've literally never seen them. I almost exclusively work with rational human beings who are able to justify decisions, and the few who aren't haven't been afforded any real power.

ethbro · on June 10, 2020

Mostly legacy industries. I don't want to indicate it's common. And it typically gets weeded out by Director level.

But I've certainly seen more Top50 companies than not who have at least a few Manager / Sr Manager-level folks, in charge of key teams who own the sole keys to necessary functions, who are happy to say no to anything they're ignorant of, without any impetus to learn about it.

jedberg · on June 9, 2020

I would mirror the attitude of the person who said no originally.

If they are receptive to feedback and clearly want to do better, I would be kind and explain why I had suggested it not be there in the first place and cite this as an example.

If they were being adamant or denying it was their fault, I'd probably be really quiet and just make subtle remarks about how it would have been better if they listened.

whyleyc · on June 9, 2020

Totally unrelated Jeremy, but did you know the SSL cert on https://minops.com/ has expired?

(Was interested to see what you were up to these days, which is how I stumbled on it).

jedberg · on June 9, 2020

Lol yes. It's intentional. It stops spam bots from trying to sign up, since we aren't open for signups yet.

But don't worry, you're not the first to mention it. I suppose I should just fix it and deal with the spam like normal.

I liked the unintended effect of cutting down on spam. I guess a lot of spam bots are written on top of standard libraries that reject bad certs. :)
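To illustrate the guess about standard libraries, here is a minimal sketch using Python's standard `ssl` module. It only shows that a default client context verifies certificates (so an expired cert would fail the handshake), and what a bot author would have to change to ignore that. The function names `strict_context` and `lenient_context` are made up for this example, and nothing here is known about how actual spam bots are written.

```python
import ssl


def strict_context() -> ssl.SSLContext:
    """Default client context: verifies the certificate chain and hostname."""
    return ssl.create_default_context()


def lenient_context() -> ssl.SSLContext:
    """Context that skips verification, so expired or self-signed certs pass."""
    ctx = ssl.create_default_context()
    # check_hostname must be disabled before verify_mode can be set to CERT_NONE.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


if __name__ == "__main__":
    strict = strict_context()
    print("default verifies certs:", strict.verify_mode == ssl.CERT_REQUIRED)
    print("default checks hostname:", strict.check_hostname)
```

A client built on the strict context would raise a certificate verification error against a lapsed cert; switching to the lenient one is a two-line change, which is why the effect is unlikely to last once it is noticed.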

Also, this was ironically a great way to publicly call someone out for a seemingly bad decision without being cruel about it, so props to you!

arkitaip · on June 10, 2020

This has to be the most unconventional anti-spam technique I've ever heard about.

rezonant · on June 10, 2020

> I stumbled on it by accident. I was lazy and let the cert lapse, but then noticed that spam signups basically stopped. One day maybe I'll make a post about it with graphs, although I'm not sure I actually have the data.

This is intriguing. I'm going to remember this but I'm too anal about perfect A+ TLS and renewal is already fully automated these days anyway :-\

I wonder if one could set up their TLS stack to get this effect without the tradeoff...

rezonant · on June 10, 2020

My apologies for the limited nesting at the HN nest limit.

> You could probably get the same effect with a self signed cert. Although that wouldn't get you an A+ on TLS. :)

> Also, if y'all do this, it probably won't work because the spammers will start ignoring expired certs.

Yeah, even if you could find a way to deny the spammers via esoteric configuration, it'll just make them realize they forgot to turn off TLS validation anyway (which is clearly what they meant to do)

jedberg · on June 10, 2020

You could probably get the same effect with a self signed cert. Although that wouldn't get you an A+ on TLS. :)

Also, if y'all do this, it probably won't work because the spammers will start ignoring expired certs.

jedberg · on June 10, 2020

I stumbled on it by accident. I was lazy and let the cert lapse, but then noticed that spam signups basically stopped. One day maybe I'll make a post about it with graphs, although I'm not sure I actually have the data.

mh- · on June 10, 2020

Minops is neat, first I've heard of it.

(at least partly tongue-in-cheek) will it support DDL too? can I INSERT infra? or is this a read-only endeavor? :)

jedberg · on June 10, 2020

Read only at first, then write too. It's hard to do writes though because you have to guess at what the person intended.

If someone does 'DROP TABLE ec2.instances', what exactly are they trying to accomplish? Do they want to terminate every ec2.instance? Should we let them?

Questions like that make write access very difficult.
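One way to picture the problem is a guard that refuses destructive statements unless the caller explicitly confirms them. The sketch below is purely hypothetical: `classify`, `is_allowed`, and the keyword sets are invented for illustration and are not Minops' actual design. It also looks only at the leading keyword, which a real SQL front end could not get away with.

```python
# Hypothetical keyword sets; a real implementation would parse the statement.
READ_KEYWORDS = {"SELECT"}
WRITE_KEYWORDS = {"INSERT", "UPDATE"}
DESTRUCTIVE_KEYWORDS = {"DROP", "DELETE", "TRUNCATE"}


def classify(statement: str) -> str:
    """Classify a statement by its first keyword: read, write, destructive, unknown."""
    words = statement.strip().split(None, 1)
    keyword = words[0].upper() if words else ""
    if keyword in READ_KEYWORDS:
        return "read"
    if keyword in WRITE_KEYWORDS:
        return "write"
    if keyword in DESTRUCTIVE_KEYWORDS:
        return "destructive"
    return "unknown"


def is_allowed(statement: str, confirmed: bool = False) -> bool:
    """Allow reads and writes; allow destructive statements only when confirmed."""
    kind = classify(statement)
    if kind in ("read", "write"):
        return True
    if kind == "destructive":
        return confirmed
    return False  # unknown statements are rejected outright


if __name__ == "__main__":
    print(classify("DROP TABLE ec2.instances"))
    print(is_allowed("DROP TABLE ec2.instances"))
```

Even with a guard like this, the harder question from the comment remains open: the code can decide whether to run the statement, not what the person meant by it.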

mh- · on June 17, 2020

haha, yeah. very cool, though. would love to chat when you're ready to share if you're looking for feedback.. I've got some features to suggest that would (IMO) increase the value prop.

rezonant · on June 10, 2020

Sometimes this is the only way :-/ Good to ensure you "get it in writing" when the point is eventually proven in production.

isclever · on June 9, 2020

It happens a lot: when you have so much infrastructure and redundancy, you think it is too big to fail. Then you lose S3 in US-East-1 and break everything.

https://www.theregister.com/2017/03/01/aws_s3_outage/

mc32 · on June 9, 2020

Don’t bite or embarrass the hand that provides you make-work...

Seriously they probably tested it and it worked in theory, just not in practice and now they fix it for reals.

sky_rw · on June 10, 2020

These are the same people who recently published a white paper on how they guarantee zero downtime: https://cloud.ibm.com/docs/overview?topic=overview-zero-down...

The idea that they could even get to this point probably seemed unfathomable. It does to me.

fogetti · on June 9, 2020

While that is solid advice, it still raises the question: is there any organization that corrects itself by firing the incompetent and promoting the competent instead (the "toldyouso" guy in this case)?

Or do we simply accept, and make it the norm, that even the lowest level of organizational governance is corrupt?

I am serious about this, because how people perceive their own rights, their own roles, their own status, their own influence and their organization's wrongdoing will, in my opinion, shape their long-run attitude toward each and every organization in society.

I know that I was blowing the question out of proportion, but it bugged me to ask anyway.

Avicebron · on June 9, 2020

I particularly like The Gervais Principle as an explanation of why incompetents are frequently the ones making corporate governance decisions. It's satire, but I think there are elements of truth in it.

https://www.ribbonfarm.com/2009/10/07/the-gervais-principle-...

acruns · on June 9, 2020

I guess if their DR fire drill assumed their failover router hosting the status page could never go down, it would pass, but come on, IBM.

detaro · on June 9, 2020

Seems a common mistake. If I remember right, AWS stumbled over their status page depending on S3 a few years back.

mbreese · on June 9, 2020

I view this as growing pains that everyone has to learn the hard way. The bigger question is will they learn the lesson and how will they fix this for the next time?

(Because there is always a next time)

snowwrestler · on June 9, 2020

Best thing to do now is to point DNS at the backup status page they discreetly set up on a free-tier EC2 server back when they got ignored...

skybrian · on June 10, 2020

Ideally, someone should write a postmortem with a timeline of what happened and recommended fixes. These would then be fixed, and nobody would be blamed. (This is called a "blameless postmortem".)

But whether you can get away with that depends on culture.

vsareto · on June 9, 2020

Honestly it's small compared to everything else. I'd rather leave than do told-ya-so though and put the story in the exit interview or reason for leaving.

dexwiz · on June 9, 2020

I built a status page for a top cloud provider, and this was question number one from SREs.

bigiain · on June 10, 2020

I built a bunch of CloudWatch monitoring for an AWS stack, and duplicated critical monitoring using a 3rd party monitoring service as well. So, because the universe hates me, the 3rd party service migrated their hosting into AWS ~18 months later... :sigh:
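The failure mode here, a monitor that quietly shares a dependency with the thing it watches, can be checked mechanically if you keep a dependency map. Below is a minimal sketch; the graph contents and the names `transitive_deps` and `shared_dependencies` are hypothetical, and it assumes someone actually maintains such a map, which is the hard part.

```python
def transitive_deps(node: str, graph: dict[str, set[str]]) -> set[str]:
    """Return everything `node` depends on, directly or indirectly."""
    seen: set[str] = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for dep in graph.get(current, set()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def shared_dependencies(monitor: str, target: str, graph: dict[str, set[str]]) -> set[str]:
    """Dependencies of the monitor that are the target itself or something it relies on."""
    target_side = transitive_deps(target, graph) | {target}
    return transitive_deps(monitor, graph) & target_side


if __name__ == "__main__":
    # Hypothetical map after the vendor moved its hosting into AWS.
    graph = {
        "app": {"aws"},
        "third-party-monitor": {"vendor-hosting"},
        "vendor-hosting": {"aws"},
    }
    print(shared_dependencies("third-party-monitor", "app", graph))
```

An empty result means the monitor has no known common failure point with the target; anything else is a candidate for the kind of surprise described above.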

epc · on June 10, 2020

I haven't been at IBM since 2001, but when I was there any suggestion like this would have been beaten down by multiple layers of the big grey cloud for even intimating that such a visible, key piece of IBM marketing material should be on a third party service.

danjac · on June 10, 2020

They would have been told "great, we understand your concerns, please add a card to the tech backlog (or whatever the local jargon is on that team) and we'll address it in the next sprint/dev cycle/quarter". That's the right way to acknowledge these complaints while preventing actual progress.