All of this glosses over the biggest issue with Kubernetes: it's still ridiculously complex, and troubleshooting issues that arise (and they will arise) can leave you struggling for days poring over docs, code, GitHub issues, Stack Overflow... All of the positives you listed rely on super complex abstractions that can easily blow up without a clear answer as to "why".
Compared to something like scp and restarting services, I would personally not pay the Kubernetes tax unless I absolutely had to.
Exactly. A year or so ago I thought, hey, maybe I should redo my personal infrastructure using Kubernetes. Long story short, it was way too much of a pain in the ass.
As background, I've done time as a professional sysadmin. My current infrastructure is all Chef-based, with maybe a dozen custom cookbooks. But Chef felt kinda heavy and clunky, and the many VMs I had definitely seemed heavy compared with containerization. I thought switching to Kubernetes would be pretty straightforward.
Surprise! It was not. I moved the least complex thing I run, my home lighting daemon, to it; it's stateless and nothing connects to it, but it was still a struggle to get it up and running. Then I tried adding more stateful services and got bogged down in bugs, mysteries, and Kubernetes complexity. I set it aside, thinking I'd come back to it later when I had more time. That time never quite arrived, and a month or so ago my home lights stopped working. Why? I couldn't tell. A bunch of internal Kubernetes certificates had expired, so none of the commands worked. Eventually I just copy-pasted stuff out of Stack Overflow and randomly rebooted things until it started working again.
I'll happily look at it again when I have to do serious volume and can afford somebody to focus full-time on Kubernetes. But for anything small or casual, I'll be looking elsewhere.
At work we're building an entire service platform on top of managed kubernetes services, agnostic to cloud provider. We had already had bad experiences running K8s ourselves.
Going into it we knew how much of a PITA it would be but we vastly underestimated how much, IMO.
Written 18 years ago, so obviously not about Kubernagus, but it does explain the same phenomenon. Replace Microsoft with cloud providers and that's more or less the same argument.
> Long story short, it was way too much of a pain in the ass.
Kubernetes has a model for how your infrastructure and services should behave. If you stray outside that model, then you'll be fighting k8s the entire way and it will be painful.
If however you design your services and infrastructure to be within that model, then k8s simplifies many things (related to deployment).
The biggest issue I have with k8s as a developer is that while it simplifies the devops side of things, it complicates the development/testing cycle by adding an extra layer of complication when things go wrong.
I run my home automation and infrastructure on kubernetes, and for me that is one of the smoothest ways of doing it. I find it quite easy to deal with, and much prefer it to the “classic” way of doing it.
what dark magic are you using? Not joking. I've tried learning kubernetes several times and gave up. Maybe I'm not the smartest. Can you point to guides that helped you get up and running smoothly? This is probably something I should put some more effort into in the coming months.
I think this is really hard; it's a bit like how we talk about learning Rails in the Ruby community: "Don't do it."
Not because it's bad or especially hard, but because there's so much to unpack, and it's so tempting to unpack it all at once, and there's so much foundational stuff (Ruby language) which you really ought to learn before you try to analyze in detail exactly how the system is built up.
I learned Kubernetes around v1.5 just before RBAC was enabled by default, and I resisted upgrading past 1.6 for a good long while (until about v1.12) because it was a feature I didn't need, and all the features after it appeared to be something else which I didn't need.
I used Deis Workflow as my on-ramp to Kubernetes, and now I am a maintainer of the follow-on fork, which is a platform that made great sense to me, as I was a Deis v1 PaaS user before it was rewritten on top of Kubernetes.
Since Deis left Workflow behind after they were acquired by Microsoft, I've been on Team Hephy, which is a group of volunteers that maintains the fork of Deis Workflow.
This was my on-ramp, and it looks very much like it did in 2017, but now we are adding support for Kubernetes v1.16+ which has stabilized many of the main APIs.
If you have a way to start a Kubernetes 1.15 or less cluster, I can recommend this as something to try[1]. The biggest hurdle of "how do I get my app online" is basically taken care of for you. Then once you have an app running in a cluster, you can start to learn about the cluster, and practice understanding the different failure modes as well as how to proceed with development in your new life as a cluster admin.
If you'd rather not take on the heavyweight burden of maintaining a Workflow cluster and all of its components right out of the gate (and who could blame you) I would recommend you try Draft[2], the lightweight successor created by Deis/Azure to try to fill the void left behind.
Both solutions are based on a concept of buildpacks, though Hephy uses a combination of Dockerfile or Heroku Buildpacks and by comparison, Draft has its own notion of a "Draftpack" which is basically a minimalistic Dockerfile tailored for whatever language or framework you are developing with.
I'm interested to hear if there are other responses, these are not really guides so much as "on-ramps" or training wheels, but I consider myself at least marginally competent, and this is how I got started myself.
Moreover, if you are keeping pace with kubeadm upgrades at all (minor releases are quarterly, and patches are more frequent) then since the most recent minor release, Kubernetes 1.17, certificate renewal as an automated part of the upgrade process is enabled by default. You would have to do at least one cluster upgrade per year to avoid expired certs. tl;dr: this cert expiration thing isn't a problem anymore, but you do have to maintain your clusters.
(Unless you are using a managed k8s service, that is...)
The fact remains also that this is the very first entry under "Administration with Kubeadm", so if you did use kubeadm and didn't find it, I'm going to have to guess that either docs have improved since your experience, or you really weren't looking to administrate anything at all.
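For what it's worth, if you want to catch that failure mode before your lights go out again, you can watch the apiserver's serving cert from the outside with a few lines of Python. This is a rough sketch, assuming the control plane listens on the conventional 6443 and that you have the cryptography package installed (the host and port are placeholders for your own cluster):

    # Warn when the apiserver's serving certificate is close to expiry.
    import ssl
    from datetime import datetime, timedelta
    from cryptography import x509

    HOST, PORT = "127.0.0.1", 6443                  # placeholder: point at your apiserver

    pem = ssl.get_server_certificate((HOST, PORT))  # no CA validation, so self-signed certs are fine
    cert = x509.load_pem_x509_certificate(pem.encode())
    remaining = cert.not_valid_after - datetime.utcnow()

    print(f"apiserver cert expires in {remaining.days} days ({cert.not_valid_after} UTC)")
    if remaining < timedelta(days=30):
        print("renew soon: upgrade the cluster or run kubeadm's cert renewal")

Nothing Kubernetes-specific about it, which is sort of the point: an expired cert is an ordinary PKI problem that just happens to take your kubectl access down with it.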
I appreciate the links, but for my home stuff I'll be ripping Kubernetes out.
The notion that one has to keep pace with Kubernetes upgrades is exactly the kind of thing that works fine if you have a full-time professional on the job, and very poorly if it's a sideline for people trying to get actual productive work done.
Which is fine; not everything has to scale down. But it very strongly suggests that there's a minimum scale at which Kubernetes makes sense.
Or, that there is a minimum scale/experience threshold below which you are better served by a decent managed Kubernetes, when you're not prepared to manage it yourself. Most cloud providers have done a fairly good job of making it affordable.
I think it's fair to say that the landscape of Kubernetes proper (the open source package) has already reached a more evolved state than the landscape of managed Kubernetes service providers, and that's potentially problematic, especially for newcomers. It's hard enough to pick between the myriad choices available; harder still when you must justify your choice to a hostile collaborator who doesn't agree with part or all of it.
IMO, the people who complain the loudest about the learning curve of Kubernetes are those who have spent a decade or more learning how to administer one or more distributions of Linux servers, who have made the transition from SysV init to systemd, and who in many cases are now neck deep in highly specialized AWS services, which they have often used successfully to extricate themselves from the nightmare-scape where one team called "System Admins" is responsible for broadly everything that runs or can run on any Linux server (or otherwise): databases, vendor applications, monitoring systems, new service dev, platforming apps that were developed in-house, you name it...
I basically don't agree that there is a minimum scale for Kubernetes, and I'll assert confidently that declarative system state management is a good technology that is here to stay. But I respect your choice, and I understand that not everyone shares the unique experiences that led me to be comfortable using Kubernetes for everything from personal hobby projects to my own underground skunkworks at work.
In fact it's a broadly interesting area of study for me, "how do devs/admins/people at large get into k8s", because the learning curve is so steep, this has all happened so fast, and there is so much to unpack before you can start to feel confident that there isn't much more complexity buried underneath that you haven't already explored and understood.
It sounds like we both agree there's a minimum scale for running your own Kubernetes setup, or you wouldn't be recommending managed Kubernetes.
But a managed Kubernetes approach only makes sense if you want all your stuff to run in that vendor's context. As I said, I started with home and personal projects. I'd be a fool to put my home lighting infrastructure or my other in-home services in somebody's cloud. And a number of my personal projects make better economic sense running on hardware I own. If there's a managed Kubernetes setup that will manage my various NUCs and my colocated physical server, I'm not aware of it.
> there's a minimum scale for running your own Kubernetes setup
I would say there is a minimum scale that makes sense, for control plane ownership, yes. Barring other strong reasons that you might opt to own and manage your own control plane like "it's for my home automation which should absolutely continue to function if the internet is down"...
I will concede you don't need K8s for this use case. Even if you like containers and want to use them, if you don't have much prior experience with K8s, then from a starting position of "no knowledge" you will probably have a better time with Compose and Swarm. There is a lot for a newcomer to learn about K8s, but the more you have already learned, the less likely I would be to recommend Swarm, or any other control plane (or anything else).
This is where the point I made earlier, that the managed k8s ecosystem is not as evolved as it will likely soon become, is relevant. You may be right that no managed Kubernetes setups will handle your physical servers today, but I think the truth is somewhere between: they're coming / they're already here but most are not quite ready for production / they are here, but I don't know what to recommend strongly.
I'm leaning toward the latter (I think that if you wanted a good managed bare metal K8s, you could definitely find it.) I know some solutions that will manage bare metal nodes, but this is not a space I'm intimately familiar with.
The solutions that I do know of are in an early enough state of development that I hesitate to mention them. It won't be long before this gets much better. The bare metal Cluster API provider is really something, and there are some really amazing solutions being built on top of it. If you want to know where I think this is going, check this out:
WKS and the "firekube" demo, a GitOps approach to managing your cluster (yes, even for bare metal nodes)
I personally don't use this yet, I run kubeadm on a single bare metal node and don't worry about scaling, or the state of the host system, or if it should become corrupted by sysadmin error, or much else really. The abstraction of the Kubernetes API is extremely convenient when you don't have to learn it from scratch anymore, and doubly so if you don't have to worry about managing your cluster. One way to make sure you don't have to worry is to practice disaster recovery until you get really good at it.
If my workloads are containerized, then I will have them in a git repo, and they are disposable (and I can be sure, as they are regularly disposed of, as part of the lifecycle). Make tearing your cluster down and standing it back up a regular part of your maintenance cycles until you're ready to do it in an emergency situation with people watching. It's much easier than it sounds, and starting over is often easier than debugging configuration issues.
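To make that concrete, the "reinstall the workloads" part of my recovery drill is basically a loop over the manifests in the repo. A rough sketch with the official Python client (the kubernetes package); the manifests/ directory is a placeholder for however your repo is laid out, and plain kubectl apply -f on the same directory does the same job:

    # Re-apply every manifest checked into the repo against a freshly built cluster.
    import glob
    from kubernetes import client, config, utils

    config.load_kube_config()            # reads ~/.kube/config for the new cluster
    api = client.ApiClient()

    for manifest in sorted(glob.glob("manifests/*.yaml")):   # placeholder repo layout
        print(f"applying {manifest}")
        utils.create_from_yaml(api, manifest)                # creates the objects in the file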
The alternative that I would recommend for production right now, if you don't like any managed kubernetes, is to become familiar with the kubeadm manual. It's probably quicker to read it and study for CKA than it would be to canvass the entire landscape of managed providers for the right one.
I'm sure it was painful debugging that certificate issue, I have run up against that issue in particular before myself. It was after a full year or more of never upgrading my cluster (shame on me), I had refused to learn RBAC, kept my version pinned at 1.5.2, and at some point after running "kubeadm init" and "kubeadm reset" over and over again it became stable enough (I stopped breaking it) that I didn't need to tear it down anymore, for a whole year.
And then a year later certs expired, and I could no longer issue any commands or queries to the control plane, just like yours.
Once I realized what was happening, I tried for a few minutes to renew the certs, but I honestly didn't know enough to look up the certificate renewal docs and couldn't figure out how to do it on my own... I still haven't read all the kubeadm docs. But I knew I had practiced disaster recovery well over a dozen times, and I could repeat the workloads on a new cluster with barely any effort (and I'd wind up with new certs.) So I blew the configuration away and started the cluster over (kubeadm reset), reinstalled the workloads, and was back in business less than 30 minutes later.
I don't know how I could convince you that it's worth your time to do this, and that's OK (it's not important to me, and if I'm right, in 6 months to a year it won't even really matter anymore, you won't need it.) WKS looks really promising, though admittedly still bleeding edge right now. But as it improves and stabilizes, I will likely use this instead, and soon after that forget everything I ever knew about building kubeadm clusters by hand.
Kubernetes, once you know it, is significantly easier than cobbling together an environment from "classical" solutions that combine Puppet/Chef/Ansible, homegrown shell scripts, static VMs, and SSH.
Sure, you can bring up a single VM with those technologies and be up and running quickly. But a real production environment will need automatic scaling (both of processes and nodes), CPU/memory limits, rolling app/infra upgrades, distributed log collection and monitoring, resilience to node failure, load balancing, stateful services (e.g. a database; anything that stores its state on disk and can't use a distributed file system), etc., and you end up building a very, very poor man's Kubernetes dealing with all of the above.
With Kubernetes, all of the work has been done, and you only need to deal with high-level primitives. "Nodes" become an abstraction. You just specify what should run, and the cluster takes care of it.
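To give a feel for what "specify what should run" means in practice, here is a rough sketch of a Deployment built with the official Python client (the image name, labels, and limits are placeholders; the same declaration is usually a dozen lines of YAML, but the shape is identical). You declare the image, the replica count, and the resource limits; the cluster decides where it runs and keeps it running through node failures and rolling updates.

    # Declare *what* should run; the control plane decides where, and keeps it running.
    from kubernetes import client, config

    config.load_kube_config()

    container = client.V1Container(
        name="web",
        image="example.com/myapp:1.0",                       # placeholder image
        resources=client.V1ResourceRequirements(
            requests={"cpu": "100m", "memory": "128Mi"},
            limits={"cpu": "500m", "memory": "256Mi"},       # hard ceiling per pod
        ),
    )

    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name="myapp"),
        spec=client.V1DeploymentSpec(
            replicas=3,                                      # redundancy across nodes
            selector=client.V1LabelSelector(match_labels={"app": "myapp"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "myapp"}),
                spec=client.V1PodSpec(containers=[container]),
            ),
        ),
    )

    client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)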
I've been there, many times. I ran stuff the "classical" Unix way -- successfully, but painfully -- for about 15 years and I'm not going back there.
There are alternatives, of course. Terraform and CloudFormation and things like that. There's Nomad. You can even cobble together something with Docker. But those solutions all require a lot more custom glue from the ops team than Kubernetes.
The majority of what you posted reiterates the post I responded to, and it doesn't address the complexity of those features or their implementation. Additionally, I challenge your assertion that "real production environments" need automatic scaling.
You missed my point. I was contrasting Kubernetes with the alternative: Critics often highlight Kubernetes' complexity, forgetting/ignoring that replicating its functionality is also complex and often not composable or transferable to new projects/clusters. It's hard to design a good, flexible Puppet (or whatever) configuration that grows with a company, can be maintained across teams, handles redundancy, and all of those other things.
Not all environments need automatic scaling, but they need redundancy, and from a Kubernetes perspective those are two sides of the same coin. A classical setup that automatically allows a new node to start up to take over from a dysfunctional/dead one isn't trivial.
Much of Kubernetes' operational complexity also melts away if you choose a managed cloud such as Digital Ocean, Azure, or Google Cloud Platform. I can speak from experience, as I've both set up Kubernetes from scratch on AWS (fun challenge, wouldn't want to do it often) and I am also administering several clusters on Google Cloud.
The latter requires almost no classical "system administration". Most of the concerns are "hoisted" up to the Kubernetes layer. If something is wrong, it's almost never related to a node or hardware; it's all pod orchestration and application configuration, with some occasional bits relating to DNS, load balancing, and persistent disks.
And if I start a new project I can just boot up a cluster (literally a single command) and have my operational platform ready to serve apps, much like the "one click deploy" promise of, say, Heroku or Zeit, except I have almost complete control of the platform.
In my opinion, Kubernetes beats everything else even on a single node.
Maybe, but the point with containers and Kubernetes is to treat them like cattle, not pets.
If something blows up or dies, then with Kubernetes it's often faster to just tear down the entire namespace and bring it up again. If the entire cluster is dead, then just spin up a new cluster and run your yaml files on it and kill your old cluster.
Treat it like cattle: when it doesn't serve your purpose anymore, shoot it.
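To show how small that teardown loop actually is, here's a rough sketch with the Python client (the namespace name and manifest file are placeholders, and this assumes everything in the namespace really is disposable; kubectl delete ns plus kubectl apply gets you the same thing):

    # Blow away a broken namespace and recreate it from the manifests in git.
    import time
    from kubernetes import client, config, utils
    from kubernetes.client.rest import ApiException

    config.load_kube_config()
    core = client.CoreV1Api()
    ns = "myapp"                                    # placeholder namespace

    core.delete_namespace(ns)                       # takes everything inside with it
    while True:                                     # wait for termination to finish
        try:
            core.read_namespace(ns)
            time.sleep(2)
        except ApiException as e:
            if e.status == 404:
                break
            raise

    core.create_namespace(client.V1Namespace(metadata=client.V1ObjectMeta(name=ns)))
    utils.create_from_yaml(client.ApiClient(), "myapp.yaml", namespace=ns)  # placeholder manifest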
This is one of the biggest advantages of Kubes, but often overlooked because traditional Ops people keep treating infrastructure like a pet.
The only thing you should treat like a pet is your persistence layer, which is presumably outside Kubes, something like DynamoDB, Firestore, CosmosDB, SQL Server, whatever.
This is not good engineering. If somebody told me this at a business, I’d not trust them anymore with my infrastructure.
So, you say that problems happen, and you consciously don't want to know about or solve them. A recurring problem, in your view, is solved by constantly building new K8s clusters and your whole infrastructure in them every time!?!
Simple example - A microservice that leaks memory.... let it keep restarting as it crashes?!
I remember at one of my first jobs, at a healthcare system for a hospital in India, their Java app was so poorly written that it kept leaking memory, bloated beyond what GC could help with, and would crash every morning at around 11 AM and then again at around 3 PM. The end users (doctors, nurses, pharmacists) knew about this behavior and took breaks during that time. Absolute bullshit engineering! Shame on those who wrote that shitty code, and shame on whoever is reckless enough to suggest ever-rebuilding K8s clusters.
Yes, "let it keep restarting while it crashes and while I investigate the issue" is MUCH preferred to "everything's down and my boss is on my ass to fix the memory issue."
The bug exists either way, but in one world my site is still up while I fix the bug and prioritize it against other work and in another world my site is hard-down.
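That "keep restarting while I investigate" behavior doesn't even need anything exotic; it's mostly a liveness probe plus the default restart policy. A loose sketch of the relevant piece of the pod spec using the Python client (the /healthz path and port are assumptions about the app, and the image is a placeholder):

    # If the process stops answering its health check, the kubelet restarts the
    # container; the Deployment keeps the replica count up in the meantime.
    from kubernetes import client

    leaky_container = client.V1Container(
        name="leaky-service",
        image="example.com/leaky-service:1.2",                            # placeholder image
        liveness_probe=client.V1Probe(
            http_get=client.V1HTTPGetAction(path="/healthz", port=8080),  # assumed endpoint
            initial_delay_seconds=10,
            period_seconds=15,
            failure_threshold=3,
        ),
    )

    pod_spec = client.V1PodSpec(
        containers=[leaky_container],
        restart_policy="Always",        # the default: crashed containers come back up
    )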
That only works if the bug actually gets fixed. When you have normalized the idea that restarting the cluster fixes a problem, then all of a sudden you don't have a problem anymore. So now your motivation to get the bug properly fixed has gone away.
Sometimes feeling a little pain helps get things done.
You and I wish that's what happened in real life. Instead, people now normalize the behavior thinking it'll sort itself out automatically over time without ever trying to fix it.
Self-healing systems are good but only if you have someone who is keeping track of the repeated cuts to the system.
This is something that has been bothering me for the last couple of years. I consistently work with developers who no longer care about performance issues, assuming that k8s and the ops team will take care of it by adding more CPU or RAM or just restarting. What happened to writing reliable code that performed well?
Business incentives. It's a classic incentive tension between spending more time on nicer code that does the same thing and building more features. Code expands to its performance budget and all.
At least on the backend you can quantify the cost fairly easily. If you bring it up to your business people, they will notice an easy win and then push the devs to write more efficient code.
If it's a small $$ difference, though, the devs are probably prioritizing correctly.
I've witnessed the same thing, however there is nothing mutually exclusive about having performant code running in Kubernetes. There's a trade-off between performance and productivity, and maintaining a sense of pragmatism is a good skill to have (that's directed towards those that use scaling up/out as a reason for being lax about performance).
Nothing is this black and white. I tried to emphasise just a simple philosophy that life gets a lot easier if you make things easily replaceable. That was the message I tried to convey, but of course if there is a deep problem with something it needs proper investigation + fixing, but that is an actual code/application problem.
That's not what cattle vs pets is. Treating your app as cattle means that it deploys, terminates, and re-deploys with minimal thought at the time of where and how. Your app shouldn't care which Kubernetes node it gets deployed to. There shouldn't be some stateful infrastructure that requires hand-holding (e.g. logging into a named instance to restart a specific service). Sometimes network partitions happen, a disk starts going bad, or some other funky state happens and you kill the Kubernetes pod and move on.
You should try to fix mem leaks and other issues like the one you described, and sometimes you truly do need pets. Many apps can benefit from being treated like cattle, however.
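And "kill the Kubernetes pod and move on" really is a one-liner; the owning ReplicaSet schedules a replacement on a healthy node. A small sketch with the Python client (the pod name and namespace are placeholders):

    # Delete one misbehaving pod; its ReplicaSet/Deployment replaces it automatically.
    from kubernetes import client, config

    config.load_kube_config()
    client.CoreV1Api().delete_namespaced_pod(
        name="myapp-7c9d8f-abcde",      # placeholder: the flaky pod
        namespace="default",            # placeholder namespace
    )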
When cattle are sick, you need to heal them. Not shoot them in the head and bring in new cattle. If your software behaves badly you need to FIX THE SOFTWARE.
Just doing the old 'just restart everything' is typical Windows admin behavior and a recipe for making bad, unstable systems.
Kubernetes absolutely does do strange things, crashes on strange things, and doesn't tell you about it.
I like the system, but to pretend it's this unbelievably great thing is an exaggeration.
> All of this glosses over the biggest issue with Kubernetes: it's still ridiculously complex, and troubleshooting issues that arise (and they will arise) can leave you struggling for days poring over docs, code, GitHub issues, Stack Overflow.
It's probably good at this point to distinguish between on-prem and managed installations of k8s. In almost four years of running production workloads on Google's GKE we've had... I don't know, perhaps 3-4 real head-scratchers where we had to spend a couple of days digging into things. Notably, none of these issues have ever left any of our clusters or workloads inoperable. It isn't hyperbole to say that in general the system just works, 24x7x365.
Agreed. We moved from ECS to GKE specifically because we didn't have the resources to handle what was supposed to be a "managed" container service with ECS. We had agent issues constantly where we couldn't deploy. It did take a little bit to learn k8s, no doubt. But now it requires changes so rarely that I usually have to think for a minute to remember how something works, because it's been so long since I needed to touch it.
Agree that the k8s tax, as described, is a huge issue. But I think the biggest issue is immaturity of the ecosystem, with complexity coming in second. You can at least throw an expensive developer at the complexity issue.
But when it comes to reliable installations (even helm charts for major software are a coin flip in terms of whether they'll work), fast-moving versioning that reminds me of the JavaScript Wild West (the recent RBAC-on-by-default implementation comes to mind, even if it's a good thing), and unresolved problems around provider-agnostic volumes and load balancing... those are headaches that persist long after you've learned the difference between a ReplicaSet and a Deployment.
To further this point about the ecosystem (and this is AWS-specific): you need, or have needed, to install a handful of extra services/controllers onto your EKS cluster to get it to integrate the way most would expect with AWS. Autoscaling? Install and configure the autoscaler. IAM roles? Install Kube2IAM. DNS/ALB/etc.? etc etc etc.
After a slog you get everything going. Suddenly a service is throwing errors because it doesn't have IAM permissions. You look into it and it's not getting the role from the kube2iam proxy. Kube2iam is throwing some strange error about a nil or interface cast. Let's pretend you know Go like I do. The error message still tells you nothing specific about what the issue may be. Google leads you to github and you locate an issue with the same symptoms. It's been open for over a year and nobody seems to have any clue what's going on.
Good times :) Stay safe everyone, and run a staging cluster!
Kubernetes can be very complex... or it can be very simple. Just like a production server can be very simple, or extremely complex, or a Linux distro, or an app...
Kubernetes by itself is a very minimal layer. If you install every extension you can into it, then yes, you'll hit all kinds of weird problems, but that's not a Kubernetes problem.
You could use this argument for literally anything, though. I spent days poring over docs, googling, SOF, github issues, making comments, the whole works when I learned any new software/technology. The argument doesn't hold water, IMO.
You can make an argument that Linux is ridiculously complex and troubleshooting issues that arise can leave you struggling for days poring over docs, code, etc., and that MS-DOS is a much simpler system, and be sort of right.
True. I'm currently in the middle of writing a paper on extending Kubernetes Scheduler through Scheduler Extender[1]. The process has been really painful.
You're saying a feature that's in alpha, released 2 months ago is painful? You should at least wait until a feature is beta until expecting it to be easier to use.
Scheduler Extender was initially released over 4 years ago[1]. What you are referring to is Scheduling Framework[2], which indeed is a new feature (and will replace/contain Scheduler Extender).
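For anyone wondering what the extender interface actually involves: the scheduler POSTs the pod plus the candidate nodes to an HTTP endpoint you run, and you reply with the subset you'll allow. Here's a very rough filter-only sketch in Python; the JSON field names are my reading of the ExtenderArgs/ExtenderFilterResult types, so double-check them against the scheduler docs for your version.

    # Minimal "filter" extender: the scheduler calls us with a pod and candidate
    # nodes; we answer with the nodes we consider acceptable. The JSON schema is
    # hand-written from the ExtenderArgs/ExtenderFilterResult types; verify it
    # for your Kubernetes version before relying on it.
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class FilterHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
            node_names = body.get("nodenames") or [
                n["metadata"]["name"] for n in (body.get("nodes") or {}).get("items", [])
            ]
            # Toy policy: refuse nodes whose name contains "spot".
            allowed = [n for n in node_names if "spot" not in n]
            payload = json.dumps({"nodenames": allowed, "failedNodes": {}, "error": ""}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    HTTPServer(("", 8888), FilterHandler).serve_forever()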
> can leave you struggling for days poring over docs, code, GitHub issues, Stack Overflow
I've had that when running code straight on a VM, when running on Docker, and when running on k8s. I can't think of a way to deploy code right now that lets you completely avoid issues with systems that you require but are possibly unfamiliar with, except maybe "serverless" functions.
And of those three, I much preferred the k8s failure states simply because k8s made running _my code_ much easier.
> I can't think of a way to deploy code right now that lets you completely avoid issues with systems that you require but are possibly unfamiliar with, except maybe "serverless" functions.
This is basically the same comment I was going to write, so I'll just jump onto it. But whenever I hear people complain about how complex XXX solution is for deployment, I always think, "ok, I agree that it sucks, but what's the alternative?"
Deploying something right now with all of its ancillary services is a chore, no matter how you do it. K8s is a pain in the ass to set up, I agree. But it seems to maintain itself the best once it is running. And long-term maintainability cannot be overlooked when considering deployment solutions.
When I look out at the sea of deployment services and options that exist right now, each option has its own tradeoffs. Another service might eliminate or minimize another's tradeoffs, but it then introduces its own. You are trading one evil for another. And this makes it nearly impossible to say "X solution is the best deployment solution in 2020". Do you value scalability? Speed? Cost? Ease of learning? There are different solutions to optimize each of these schools of thought, but it ultimately comes down to what you value most, and another developer isn't going to value things in the same way, so for them, another solution is better.
The only drop-dead simple, fast, scalable deployment solution I have seen right now is static site hosting on tools like Netlify or AWS Amplify (among others). But these only work for statically generated sites, which were already pretty easy to deploy, and they are not an option for most sites outside of marketing sites, landing pages, and blogs. They aren't going to work for service-based sites, nor will they likely replace something being deployed with K8s right now. So they are almost moot in this argument, but I bring it up because they are, arguably, the only "best deployment solution" right now, if you are building a site that can meet their narrow criteria.