
Honest, slightly cynical question: most likely someone on the responsible team said at some point that it would be very stupid to host the status page on the same infrastructure being monitored, but they were probably ignored... What should that person do now? Say "toldya!" out loud in the postmortem meeting, or simply shut up and move on, because the reality is that we are hired to do some stupid task and not to think for ourselves?


If such people had raised concerns and been overridden the way you describe, they would sadly have left long ago.

It is not that companies become consciously malicious or are incompetent to start with. It becomes a vicious cycle: as more and more poor management and engineering talent joins, the good ones leave, and the cycle continues.

Acquisitions and mergers stave off the slow slide into irrelevance for a while, until the best of the new guys leave too. Systemic cultural change is very, very hard to achieve in large organizations.


Never humiliate a coworker in public. Instead say "both options were considered but ultimately it was decided to select option B for reason Y."


My professional experience tells me that the next question will be who decided on B given Y; then you answer it, and then you have a target on your SRE back, I'm afraid. Remember that trickle-down economics works only when the shit hits the fan, and what trickles down is not money.


If your company uses post mortems to blame individuals rather than fix processes and tools, you haven't worked in a professional environment.


I've worked in these types of organizations before, and it's always counter-productive to making positive changes in the environment. You as an SRE can either make it better, or find an environment that's conducive to positive changes.

If you'd like an outside resource to suggest or read up on better postmortem practices, the Google SRE Book has a chapter [0] on postmortem culture. It's an amazing change of pace and a huge stress level improvement for us SREs.

[0] https://landing.google.com/sre/sre-book/chapters/postmortem-...


So, the appeal to anonymous authority, I see


What if Y = ignorance?


Where are all of these managers out there running companies and making decisions with literally no thought whatsoever? I've literally never seen them. I almost exclusively work with rational human beings who are able to justify decisions, and the few who aren't haven't been afforded any real power.


Mostly legacy industries. I don't want to imply it's common. And it typically gets weeded out by the Director level.

But I've certainly seen more Top50 companies than not who have at least a few Manager / Sr Manager-level folks, in charge of key teams who own the sole keys to necessary functions, who are happy to say no to anything they're ignorant of, without any impetus to learn about it.


I would mirror the attitude of the person who said no originally.

If they are receptive to feedback and clearly want to do better, I would be kind and explain why I had suggested it not be there in the first place and cite this as an example.

If they were being adamant or denying it was their fault, I'd probably be really quiet and just make subtle remarks about how it would have been better if they listened.


Totally unrelated Jeremy, but did you know the SSL cert on https://minops.com/ has expired?

(Was interested to see what you were up to these days, which is how I stumbled on it).


Lol yes. It's intentional. It stops spam bots from trying to sign up, since we aren't open for signups yet.

But don't worry, you're not the first to mention it. I suppose I should just fix it and deal with the spam like normal.

I liked the unintended effect of cutting down on spam. I guess a lot of spam bots are written on top of standard libraries that reject bad certs. :)
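A minimal sketch of why that guess is plausible, assuming the bots are built on something like Python's standard `ssl` module with its defaults (what the actual bots use is unknown). The default client context requires a valid certificate and checks the hostname, so a handshake against an expired cert fails unless the author explicitly turns verification off. The `describe` helper is a hypothetical name used only for illustration.

```python
import ssl


def describe(ctx: ssl.SSLContext) -> dict:
    """Summarize the verification settings of an SSL context."""
    return {
        "verifies_cert": ctx.verify_mode == ssl.CERT_REQUIRED,
        "checks_hostname": ctx.check_hostname,
    }


# What a client gets when it does nothing special: certificates are
# required and validated (including the expiry date) during the handshake.
default_ctx = ssl.create_default_context()

# What a bot author has to do on purpose to ignore bad certs.
# check_hostname must be disabled before verify_mode can be CERT_NONE.
lax_ctx = ssl.create_default_context()
lax_ctx.check_hostname = False
lax_ctx.verify_mode = ssl.CERT_NONE

print(describe(default_ctx))
print(describe(lax_ctx))
```

No network access is needed to run this; it only inspects the context settings.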

Also, this was ironically a great way to publicly call someone out for a seemingly bad decision without being cruel about it, so props to you!


This has to be the most unconventional anti-spam technique I've ever heard about.


> I stumbled on it by accident. I was lazy and let the cert lapse, but then noticed that spam signups basically stopped. One day maybe I'll make a post about it with graphs, although I'm not sure I actually have the data.

This is intriguing. I'm going to remember this but I'm too anal about perfect A+ TLS and renewal is already fully automated these days anyway :-\

I wonder if one could set up their TLS stack to get this effect without the tradeoff...


My apologies for the limited nesting at the hn nestlimit

> You could probably get the same effect with a self signed cert. Although that wouldn't get you an A+ on TLS. :)

> Also, if y'all do this, it probably won't work because the spammers will start ignoring expired certs.

Yeah, even if you could find a way to deny the spammers via esoteric configuration, it'll just make them realize they forgot to turn off TLS validation anyway (which is clearly what they meant to do)


You could probably get the same effect with a self signed cert. Although that wouldn't get you an A+ on TLS. :)

Also, if y'all do this, it probably won't work because the spammers will start ignoring expired certs.


I stumbled on it by accident. I was lazy and let the cert lapse, but then noticed that spam signups basically stopped. One day maybe I'll make a post about it with graphs, although I'm not sure I actually have the data.


Minops is neat, first I've heard of it.

(at least partly tongue-in-cheek) will it support DDL too? can I INSERT infra? or is this a read-only endeavor? :)


Read only at first, then write too. It's hard to do writes though because you have to guess at what the person intended.

If someone does 'DROP TABLE ec2.instances', what exactly are they trying to accomplish? Do they want to terminate every ec2.instance? Should we let them?

Questions like that make write access very difficult.
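
A rough sketch of the kind of guard this implies, not how Minops actually works: classify a statement by its leading verb, allow reads, and refuse to guess at anything destructive. The `classify` function, the verb lists, and the return labels are all hypothetical.

```python
import re

# Hypothetical policy: reads pass through, destructive verbs need an
# explicit confirmation step, and everything else is rejected for now.
READ_ONLY = {"SELECT"}
DESTRUCTIVE = {"DROP", "DELETE", "TRUNCATE"}


def classify(statement: str) -> str:
    """Return 'allow', 'needs_confirmation', or 'reject' for a SQL-like statement."""
    match = re.match(r"\s*([A-Za-z]+)", statement)
    verb = match.group(1).upper() if match else ""
    if verb in READ_ONLY:
        return "allow"
    if verb in DESTRUCTIVE:
        # The intent is ambiguous (terminate every instance?), so do not guess.
        return "needs_confirmation"
    return "reject"


print(classify("SELECT * FROM ec2.instances"))
print(classify("DROP TABLE ec2.instances"))
```

Looking only at the first keyword is deliberately crude; a real implementation would need a proper parser and a mapping from each statement to concrete API calls.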


haha, yeah. very cool, though. would love to chat when you're ready to share if you're looking for feedback.. I've got some features to suggest that would (IMO) increase the value prop.


Sometimes this is the only way :-/ Good to ensure you "get it in writing" when the point is eventually proven in production.


It happens a lot: when you have so much infrastructure and redundancy, you think it is too big to fail. Then you lose S3 in us-east-1 and break everything.

https://www.theregister.com/2017/03/01/aws_s3_outage/


Don’t bite or embarrass the hand that provides you make-work...

Seriously, they probably tested it and it worked in theory, just not in practice, and now they fix it for reals.


These are the same people who recently published a white paper on how they guarantee zero downtime: https://cloud.ibm.com/docs/overview?topic=overview-zero-down...

The idea that they could even get to this point probably seemed unfathomable. It does to me.


While that is solid advice, it still raises the question: is there any organization that corrects itself by firing the incompetent and promoting the competent instead (the "toldyouso" guy in this case)?

Or do we simply accept, and make it the norm, that even the lowest level of organizational governance is corrupt?

I am serious about this, because how people perceive their own rights, their own roles, their own status, their own influence and their organization's wrongdoing will, in my opinion, influence attitudes toward each and every organization in society in the long run.

I know that I was blowing the question out of proportion, but it bugged me to ask anyway.


I particularly like The Gervais Principle as an explanation of why incompetents are frequently the ones making corporate governance decisions. It's satire, but I think there are elements of truth in it.

https://www.ribbonfarm.com/2009/10/07/the-gervais-principle-...


I guess if their DR firedrill assumed their failover router hosting the status page could never go down it would pass, but come on IBM.


Seems a common mistake. If I remember right, AWS stumbled over their status page depending on S3 a few years back


I view this as growing pains that everyone has to learn the hard way. The bigger question is whether they will learn the lesson and how they will fix this for the next time.

(Because there is always a next time)


Best thing to do now is to point DNS at the backup status page they discreetly set up on a free-tier EC2 server back when they got ignored...


Ideally, someone should write a postmortem with a timeline of what happened and recommended fixes. These would then be fixed, and nobody would be blamed. (This is called a "blameless postmortem".)

But whether you can get away with that depends on culture.


Honestly, it's small compared to everything else. I'd rather leave than do a told-ya-so, though, and put the story in the exit interview or the reason for leaving.


I built a status page for a top cloud provider, and this was question number one from SREs.


I built a bunch of CloudWatch monitoring for an AWS stack, and duplicated critical monitoring using a 3rd party monitoring service as well. So, because the universe hates me, the 3rd party service migrated their hosting into AWS ~18 months later... :sigh:


I haven't been at IBM since 2001, but when I was there any suggestion like this would have been beaten down by multiple layers of the big grey cloud for even intimating that such a visible, key piece of IBM marketing material should be on a third party service.


They would have been told "great, we understand your concerns, please add a card to the tech backlog (or whatever the local jargon is on that team) and we'll address it in the next sprint/dev cycle/quarter". That's the right way to acknowledge these complaints while preventing actual progress.



