Hacker News

Actually, can we just kill Terraform? Please?

Terraform has a bad design. It's a configuration management tool, first and foremost, and configuration management tools need to do one thing well: fix things. Not just "change state", but functionally, actually fix some software to make it work again. Terraform is really bad at this. It's difficult to configure, difficult to operate, and it likes to find any reason at all to just blow up and force you to figure out how to make the software work again.

Configuration management tools should make your life easier, not harder. You shouldn't have to hire a "Terraform Admin with 3 yrs experience" who has learned all the bizarre quirks of this one tool just to get your S3 bucket to have the correct policy again. You shouldn't have to write Go tests just to change said policy. It's like it was invented to be a jobs program for sysadmins.

I have a laundry list of all the stupid design decisions that went into the damn thing. And because the entire god damn industry is stuck on this one tool, no other tool will ever replace it. Its providers are so large and there are so many modules created that it would take years of constant development to replace it. So it doesn't get changed or improved, and it can never be replaced. It is the incumbent that blocks progress. A technological quagmire we can't extricate ourselves from.

The essential purpose of this tool is really to be a general interface to random APIs, track dependencies in a DAG, pass values into resources when it has them, attempt to submit a request to the API, and then die if it doesn't get a 200 back. We can accomplish this in a simpler way that is less proprietary and more useful. And we can ramp up on specific functionality to give the solution actual intelligence, like default behaviors for specific resources in specific providers, hints on how to name a resource, more examples, canned modules that are easier to discover or publish, ability to use different languages or executables, etc. But we need to put forward those alternatives now, or we won't get the chance again for a long time.
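That "general interface" fits in a few dozen lines. A minimal sketch of the loop described above, assuming a stub `call_api` standing in for real HTTPS requests (all names here are hypothetical, not Terraform internals):

```python
# Sketch of a declarative apply loop: walk resources in dependency
# order, pass resolved values in, and stop on the first failed call.
from graphlib import TopologicalSorter

def apply_plan(resources, deps, call_api):
    """resources: name -> desired config; deps: name -> set of prerequisite names.
    call_api stands in for an HTTPS request and returns (status, output)."""
    outputs = {}
    for name in TopologicalSorter(deps).static_order():
        status, result = call_api(name, resources[name], outputs)
        if status != 200:                      # "die if it doesn't get a 200 back"
            raise RuntimeError(f"{name} failed with HTTP {status}")
        outputs[name] = result                 # values flow to dependent resources
    return outputs
```

The per-provider intelligence (defaults, naming hints, canned modules) would then layer on top of this core rather than being baked into it.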



This is perhaps the most incorrect post you will ever find on HN. I am not a huge fan of Terraform. However, TF is made for infra, not config. There are several tools out there to manage config, like Salt, Ansible, and so on.


You can definitely use Ansible to provision infra as well. Whether that’s the right tool for the job is a matter of taste.


Honest question, what do you consider config vs infra? What is infra if not configuring a system to run?


Infra is defining the servers that should be running. Config is writing configuration files to a running server.

Cloud blurs the line, since a lot of cloud offerings are managed where you're really doing both at once when you define the managed offering.


Let's not continue this cargo cult mumbo jumbo.

Terraform is literally a program that looks at a declarative configuration file, looks at a state file, queries some APIs, and then submits some API calls. That is all it does.

There is no "infrastructure", or "config", or "cloud". It's literally just calling HTTPS APIs, the same way it would call a system call or a library function. Call function, pass input, receive output.

There is no magic sauce. There is no difference between it and any other tool that has a declarative configuration, a state, and operations to try to change things to match a desired state.

It's all configuration management. The words "infra", "orchestration", "cloud", etc is marketing bullshit. It's all just software.


Yes and no. Yes, all terraform does is wrap APIs and you could easily write a Terraform provider for just about anything.

But there is a very real difference between "Deploying a server" and "Modifying configuration files on that server". The former used to require actual physical actions in a data center, and it's only in the world of modern virtualization and clouds that it has become possible to do it through an API. Whereas the latter used to require secure access to an individual physical machine, often over SSH after someone had done the physical work of setting it up. Again, it's only in the world of modern virtualization and clouds that you can start to do that through APIs.

It is only modern clouds that have blurred the lines between these two, by abstracting away the difference between the physical server and the software running on it behind APIs.

Conceptually, it can still be useful to think of "infrastructure orchestration" and "configuration management" as different things and different categories. Like I said, in many cases cloud offerings significantly reduce the utility of those categorizations, because they often abstract both steps behind a unified API where you are launching virtual infrastructure (still largely using the same conceptions that were used when it was physical) and defining its configuration at the same time through the same interface.

None of this is marketing speak. It's just definitions and categorizations. Sometimes useful, sometimes not. And all of it is orthogonal to what terraform does do or should do. Whether or not terraform is "infrastructure orchestration", "configuration management" or both is neither here nor there for the definition of those terms and considerations of their utility.


> there is a very real difference between "Deploying a server" and "Modifying configuration files on that server"

Yeah: latency. Everything else is identical, from the software perspective. Even the distributed aspect is identical: multiple copies of software running in one OS, or multiple machines running one copy of the software, are treated virtually identically.

> it's only in the world of modern virtualization and clouds that you can start to do that through APIs

I've worked in multiple companies, starting nearly 20 years ago, that had automated the process of provisioning and re-provisioning both hardware and software across tens of thousands of machines in multiple datacenters. Without virtualization, without the cloud. Know how we did it? Same way Terraform does it. Make an API, make a tool to call it, API backend does some magic, returns result, tool does something with result. Nothing has changed except the buzzwords (and the programming languages).

Configuration management is "a systems engineering process for establishing and maintaining consistency of a product's performance, functional, and physical attributes with its requirements, design, and operational information throughout its life." [1] It is not Puppet or CFengine. It is an engineering practice that is nearly 70 years old. Terraform is an implementation of it, as are many other tools, and many things that aren't software at all.

> None of this is marketing speak. It's just definitions and categorizations. Sometimes useful, sometimes not. And all of it is orthogonal to what terraform does do or should do.

On the contrary, the categorizations are made up by people who don't understand the history and practice of the field and confuse designers and practitioners into thinking that what they're doing is correct because "that's just what things in this category do". It's throwing out systems thinking and replacing it with a cargo cult of buzzwords and generally useless concepts.

Every week I see somebody talking about "Infrastructure as Code" as if it's a real thing. It's not. IaC just means somebody put a shell script and config file in Git. Yet they treat it like it's both revolutionary and specific to this one corner of tech. Like we haven't been version-controlling or change-managing the provisioning of computing devices for decades. People who weren't aware of standardized practices for management of fleets of devices basically had to stumble upon it, and not having any other reference, decided to give it a new name and pretend it was novel, and in the process did not learn the lessons from past decades of similar practice.

This is not just an "old man yells at cloud" rant - the point is that tech people keep refusing to learn their history, and then poorly implementing something that could have been designed much better if they'd learned their history. It's like the history of medical practice, where some areas of the globe (cough western europe cough) were embarrassingly backward because they never reached out to learn about the history, research, and best practices outside their sphere. They just did what everyone else around them did. People suffered for decades as a result. We don't suffer quite as much now, but the advancement of technology does suffer as a result of the industry's stodgy refusal to improve on its cargo cult mentality. (Repeating whatever you read in a blog post on HN is what makes things like "Infrastructure as Code" seem like a real and novel idea to people; repeat an idea enough and people just believe it and repeat it too)

Another example: "declarative configuration". All configuration is declarative. Even imperative configuration is declarative. This tautology is debated in blog posts the same way you'd debate the use of types of butter in cooking. It's all just butter. Yeah, some comes without salt; just add some salt to your dish. Yeah, some comes with salt; just withhold some salt from your food. We don't need to go on long-winded writings about the use of different butters. But some people create entire software projects dedicated to one kind of butter, because they think it's super important to only use unsalted butter.

[1] https://en.wikipedia.org/wiki/Configuration_management


It's better framed as infrastructure orchestration vs infrastructure configuration. Orchestration is more about herding the resources while configuration is about delving into the instance/server/resource environment to make changes. The terms are pretty arbitrary IMO, but those are the vocabulary used in the industry.


I have a hard time thinking of how terraform as a piece of software could do much more than it already does to fix things?

When terraform fails, it's typically because of an API error or a configuration issue that is beyond its control?

Not handling state rollback is a design decision that, having dealt with the fun of CloudFormation, I'm pretty happy that they made.


> I have a hard time thinking of how terraform as a piece of software could do much more than it already does to fix things?

Oh, that one's easy: have the "plan" phase actually consult the underlying provider in order to know about the errors that are going to fail 60% of the way through your "apply" phase. I thought about including an example, but I don't care to try and lobby unless the community fork takes off, because Hashicorp gonna Hashicorp _their_ baby.

Look, I know the TF community is allllllllllll about that Omniscient .tfstate file but (related to the sibling comments about the tool _being helpful_) the real world is filled with randos in an organization doing shit to underlying infra or humans fat-fingering something and it is not a good use of anyone's life having to re-run plan and apply due to some patently stupid but foreseeable bug


1000%. The state file causes way more problems than it solves. The tool makes no attempt to look for an existing resource, or import existing resources, or absorb or ignore changes; you have to manually intervene. Meanwhile production is broken because only half the apply succeeded, but you have no idea if it'll blow up until you apply. No idea if you've set the necessary lifecycle policy correctly for this resource; you'll need to destroy the resource or rename something and see what happens. It's ridiculous.


> So it doesn't get changed or improved, and it can never be replaced. It is the incumbent that blocks progress. A technological quagmire we can't extricate ourselves from.

This describes almost every tool in my toolchain that's over a decade old; this is just what happens. If you want to kill terraform and replace it with something better, the usual bar is that it must be 10x better. If that's something you think you could do, I'd be (genuinely) excited to see it :)


Want to start a GitHub repo? I'll work with you. My list of must-haves for configuration management:

Hierarchical state and provider management. The difficulty of hooking a kubernetes provider to an EKS or GKE provider in a one-shot apply is pretty terrible. Trying to nest the helm provider under kubernetes isn't quite as painful, but it's still not great, and there isn't a way to get the necessary CRD manifests in place for dependencies or resources before they need to be created.

Diffs as a first-class citizen throughout the layers of providers as opposed to situations like helm_release where helm diffs are completely opaque to terraform and especially to tools like Atlantis.

Slightly more real programming language concepts (pure functions at least), or else insanely good configuration flexibility. Sane defaults with simple overrides should be the standard for all providers and modules. I think deep merge with reasonable conflict resolution is all terraform needs (plus a rewrite of how configuration works in a lot of places), but I want to be able to define a template configuration for e.g. a cluster and be able to instantiate a new cluster with just:

  k8s_config = {
    region = "us-west1"
    node_config = {
      machine_type  = cloud_native_machine_type(cpu = "amd", ram = "256G", cores = "16")
      spot_instance = true
    }
  }

And have deep merging successfully override the default configuration with those values, plus that kind of generic function capacity to turn verbose/complex configuration blocks into a simple definition.
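The deep-merge semantics being asked for are easy to pin down. A sketch in Python, purely illustrative of "override the defaults at the leaves, keep everything else":

```python
# Deep-merge override semantics: user-supplied values win at the
# leaves, defaults survive everywhere they aren't overridden.
def deep_merge(defaults, overrides):
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)   # recurse into nested blocks
        else:
            merged[key] = value                            # leaf: override wins
    return merged
```

So a template cluster config plus a small override block yields the full configuration, without repeating the verbose defaults in every instantiation.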


I've been using terraform for 10-ish years, and this is very much not how I feel about it. Terraform absolutely makes life easier; I've managed infrastructure without it and it's a nightmare.

Yes, it can be awkward, and yes the S3 bucket resource change was pretty bad, but overall its operating model (resources that move between states) is extremely powerful. The vast majority of "terraform" issues I've had have actually been issues with how something in AWS works or an attempt to use it for something that doesn't map well to resources with state. If an engineer at AWS makes a bone-headed decision about how something works then there isn't much the terraform folks can do to correct it.

I've actually been pretty frustrated trying to talk about terraform with people who don't "get it". They complain about the statefile without understanding how powerful it is. They complain about how it isn't truly cross-platform because you can't use the same code to launch an app in aws and gcp. They complain about the lack of first-party (aws) support. They complain about how hard it is to use without having tried to manually do what it does. Maybe you do "get it", and have a different idea of what terraform should do. Could you give a specific example (besides the s3 resource change) where it fails?

It's a complicated tool because the problem it's trying to solve is complicated. Maybe another tool can replace it, and maybe someone should make that tool because of this license change, but terraform does the thing it intends to do pretty well.


I'm no Terraform expert but it's been in my resume and toolbox since ~2016.

Up until these changes, I would always pick Terraform for managing AWS. I have my gripes with it but it has been the best choice (as the saying goes, anybody that uses a tool long enough should have complaints about its limitations).

Now, however, I'm finally thinking of going with the CDK to insulate myself from more seismic shifts in the "OSS ecosystem" of devops tools.


I think it's not only an issue with terraform but also with the underlying infrastructure. AWS should never have had imperative APIs in the first place. Or at least it's time for AWS V2 APIs.


I agree. Cloud infrastructure should be versioned and immutable. If I have an S3 bucket and make 4 changes to it, there should be V0 (making the bucket) and V1-V3 (each subsequent change). I should be able to tell the bucket API to restore the bucket to V2. Terraform is a hack to fill that gap. The AWS bucket service itself should be doing it, not Terraform. Several classes of software that we all maintain ourselves would go away if cloud infra were versioned & immutable.
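The versioned, immutable model described above is simple to state. A sketch (hypothetical class, not any real AWS API): every change appends a new version, and "restore" is just another append that copies an old version's config.

```python
# Sketch of versioned, immutable infrastructure state: V0 is creation,
# each change derives V(n+1) from V(n), and restore copies an old version.
class VersionedResource:
    def __init__(self, initial_config):
        self.versions = [dict(initial_config)]      # V0: making the resource

    def change(self, **updates):
        new = {**self.versions[-1], **updates}      # derive V(n+1) from V(n)
        self.versions.append(new)
        return len(self.versions) - 1               # the new version number

    def restore(self, version):
        self.versions.append(dict(self.versions[version]))
        return len(self.versions) - 1

    @property
    def current(self):
        return self.versions[-1]
```

If the bucket service itself kept this history, "roll back to V2" would be one API call instead of a state-file reconciliation dance.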


This is clearly a poor idea. Declarative infrastructure management is ultimately a dead end, because order of operations actually matters.


I'm not sure one follows from the other.

You could have both: eg if resource Y depends on X, then you would just declare Y after X. Or you could do a "depends_on" directive like in TF.

That certainly doesn't sound like a dead end to me.


That is the trivial part (and any tool even worth talking about already implements it).

The problem is things like “create this instance in parallel as a replacement for this one over here, then shut down the original, detach a volume from the original and attach it to the replacement then run command X on the replacement, stopping for manual intervention at any phase the running system reports it is running at reduced redundancy”.

This is not an atypical requirement for infrastructure as code beyond the basics, but none of the declarative tools come close to addressing it without a bucket load of external coordination.
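That workflow is inherently ordered, which is the point. A hedged imperative sketch (the `cloud` object and its methods are hypothetical placeholders, not a real SDK):

```python
# The replacement workflow above, written imperatively: create the
# replacement alongside the original, then perform ordered steps,
# pausing for manual confirmation whenever redundancy is reduced.
def replace_instance(cloud, original_id, confirm):
    replacement = cloud.create_instance()          # runs alongside the original
    steps = [
        lambda: cloud.shutdown(original_id),
        lambda: cloud.detach_volume(original_id),
        lambda: cloud.attach_volume(replacement),
        lambda: cloud.run_command(replacement, "X"),
    ]
    for step in steps:
        # stop for manual intervention if the running system reports
        # reduced redundancy and the operator doesn't confirm
        if cloud.reduced_redundancy() and not confirm():
            raise RuntimeError("paused: running at reduced redundancy")
        step()
    return replacement
```

A purely declarative tool has no natural place to express the step ordering or the pause-and-confirm gates; that's what ends up in the "bucket load of external coordination".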


Do you consider cloudformation an imperative API?


For most services it’s abundantly clear that it’s calling the same imperative APIs under the covers that you use as an external person, because it gets stuck so often. Well, more often than you would think if declarative management of resources was at the top of Amazon’s mind when designing these services.

“Internal error? The #$@! does that mean?”


Would be cool to see your laundry list.

I haven’t shared your experience, but I read OP’s book on Terraform cover-to-cover before trying to work with the system.


Oh god I would be writing for hours. Short version, this is not nearly everything:

  - Bad UX
    - Tool does not have interactive mode to provide suggestions or simple solutions to common problems
    - Lack of options or commands for commonly-used tasks, like refactoring resources, modules, sub-modules, etc. (Using 'state mv' and 'state rm', etc is left as an exercise for the user to figure out and takes forever)
    - Complains about "extra variables" found in tfvars files, making it annoying to re-use configuration, even though having "extra variables" poses no risk to operation
    - (NEW) Shows you what has changed in the plan output, followed by what will *actually be changed* if applied, though both look the same, so you get confused and think the first part matters, but actually it's irrelevant.
  
  - Bad internal design
    - HCL has a wealth of functions yet is too restrictive in how you can use them. You will spend an entire day (or two) tying your brain into knots trying to figure out how to construct the logic needed to append to an array in a map element in an array in a for_each for a module (which was impossible a few years ago).
    - Providers are inconsistent and often not written well, such as not providing useful error messages or context.
    - Common lifecycle policy conventions per-resource-type have to be discovered by trial-and-error (rather than being the default or hinted) or you will end up bricking your gear after it's already deployed.
    - The tool depends on both local state and optionally remote state. Local state litters module directories even though nearly everyone who uses it at scale uses modules as libraries/applications, not the location they execute the tool from. Several different wrappers were invented and default to changing this behavior because it has been a problem for years.
    - Default actions and best practices (such as requiring a plan file before apply or destroy, automatically running init before get before validate, etc) are left to the user to figure out rather than done for them (again, wrappers had to solve this).
    - Some actively dangerous things are the default, like overwriting backup state files (if they're created by default).
    - Version management of state is left up to the user (or remote backend provider)
    - Not designed for DRY code or configuration; multiple wrappers had to implement this
    - You can't specify backend configuration via the -var-files option, and backend configuration can't be JSON ... why? They just felt like making it annoying. Some "philosophical" development choice that users hate and makes the tool harder to use.
    - Workspaces are an anti-pattern; you end up not using them at scale.
    - You can't use count or for_each for provider sections, so if you wanted a configurable number of providers (say with different credentials each), tough luck. ("We're Opinionated!")
    - Can't use variables in a backend block. ("We're Opinionated!")
    - Can't have more than one backend per module. ("We're Opinionated!")
    - Lots of persistent bad behavior has been fixed in recent releases, like not pushing state changes as resources are applied, others I can't remember.
    - Global lock on state, because again, ya can't have more than one backend block per module.
    - All secrets are stored as plaintext in the state file, so either you don't manage secrets *at all* with Terraform, or you admit that your Terraform state is highly sensitive and needs to be segregated from everyone/everything and nobody can be given access to it.
    - No automatic detection of, or import of, existing resources. It knows they're there, because it fails to create them (and doesn't get a permission error back from the API), but it refuses to then give you the option of importing them. The *terraformer* project had to be invented just to get a semblance of auto-import, when they could have just added 100 lines of code to Terraform and saved everyone years of work.
    - Not letting people write modules, logic, providers, etc in an arbitrary executable. Other tools do this so you can ramp up on new solutions quickly and make turn-key solutions to common needs, but Terraform doesn't allow this; write it in Go or HCL or get bent.
    - You have to explicitly pass variable inputs to module blocks, so you can't just implicitly detect a variable that has already been passed to the tool. But this isn't the case if you're applying a module; only if you create a sub-module block. This just makes initial development and refactoring take more time without giving the user an added benefit.
    - You have to explicitly define variables, rather than just inherit them as passed to the tool at runtime. Mind you, you don't have to actually include the variable type; you just have to declare *the name* of the variable. So again, it wastes the user's time when trying to develop or refactor, for absolutely no benefit at all.
    - You have to bootstrap the initial remote backend state resources *outside* of Terraform, or do it with local state and then migrate the state after adding new resources, or use a separate identical module that has a backend configuration. Does that sound complicated? It is, and annoying, and unnecessary.
    - You have to be careful not to make your module too big, because modules that manage too many resources take too long to plan and apply and risk dying before completing. (If you're managing resources in China, make the module even smaller, because timeouts over the great firewall are so common that it's nearly impossible to finish applying in a reasonable time)
    - Tests. In Go.
    - Schema for your tfvars files? Nope; write some really complicated logic in a variable to validate each variable in a different way.
    - Providers don't document the restrictions on things like naming conventions for required parameters, so you have to apply to the API, get back a weird error, and go try to dig up some docs that hopefully tell you the naming convention so you can fix it and try again.
    - Terraform *plan* will give you 'known after apply' for values it very easily could tell you *before* the apply, but for whatever reason doesn't. You never really know what it's going to do until you do it and it blows up production.
    - It's very difficult (sometimes near impossible) to just absorb the current state of the infrastructure into TF (as in, "it's working right now, please just keep it the way it is"). Import only works if you've already written the HCL for the resources, and then looked up how the provider wants you to import that resource.
    - Version pinning is handled like 5 different ways, but it is still impossible to pin and use different sets of versions when applying different state files for the same HCL module code and values.

That's off the top of my head, there's much more.
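To pick one item off that list: a schema for variable files needs almost no machinery. A stdlib-only sketch of what built-in tfvars validation could look like (the function and schema format are hypothetical, not a Terraform feature):

```python
# Sketch of declarative tfvars validation: a schema maps variable
# names to expected types, and checking is one pass over the file.
# Extra variables are simply ignored, since they pose no risk.
import json

def validate_tfvars(tfvars_json, schema):
    values = json.loads(tfvars_json)
    errors = []
    for name, expected in schema.items():
        if name not in values:
            errors.append(f"missing variable: {name}")
        elif not isinstance(values[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    return errors
```

Note that an unrecognized variable passes silently here, which is exactly the behavior the "extra variables" complaint above asks for.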


No, because a lot of non-cloud tech depends on terraform too. It isn't just AWS.

Everything from Kubernetes to various colocation technologies leverages it because it codifies the ability to deploy a tech stack.


It's really funny because I needed to create terraform-like functionality & went in depth into looking at both building it into terraform (which ended up not working) and building a new tool. It's not THAT complex. There are a few gimmicks in HCL, like it being an actual language, that create some interesting features.

But it could just be a yml file. In essence, the requirements section says "here are the builders it needs & their versions", which identifies types of jobs. Each entry is a job with a job type, a unique name, and some config info. Each builder is just a series of CRUD operations.

Like you said, it builds a directed acyclic graph, queues up the ready jobs, and executes them, updating the infrastructure's "state" with info from the completed jobs & adding new jobs when their dependencies are finished. The state files are just a dump of that structure as JSON.
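That executor loop is small enough to sketch directly (a hedged illustration, not the actual tool being described):

```python
# Sketch of the executor described above: queue jobs whose dependencies
# are done, run them, fold results into "state", and the state file is
# just that structure dumped as JSON.
import json
from graphlib import TopologicalSorter

def run_jobs(jobs, deps, run):
    """jobs: name -> config; deps: name -> prerequisites; run executes one job."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    state = {}
    while sorter.is_active():
        for name in sorter.get_ready():          # everything whose deps finished
            state[name] = run(name, jobs[name], state)
            sorter.done(name)
    return json.dumps(state)                     # the "state file"
```

The `get_ready`/`done` loop is also the natural place to run independent jobs in parallel, which is most of what the fancier tools add on top.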

It's not that hard. I think of myself as a junior-level dev, and I built something for myself in my spare time in a month with a full test suite, and it's 3/4 of the way there. CLI, builder dependency injection, type checking, relationship dependencies: it took me a few weekends.

I think a senior engineer could build out an enterprise-grade functional core product in a few weeks. Building & maintaining the CRUD APIs is the real headache, but I think vendors would take care of that themselves if there were a popular enough OSS solution.


I don’t know what it is about the ops sector that favors ugly tools, but apparently they don’t care, otherwise Pulumi would see more traction.


There’s a lot of inertia in ops tooling. Switching costs are very high for an existing project, and once you learn the quirks of one tool or another, it takes a lot to justify something else for a new project even if it’s better, since you know the new tool will have its quirks too.

The cost-benefit analysis of new stuff is also different for ops compared to pure development. You tend to care more about stability and predictability than productivity and elegant design. Problems in pure dev land cause bugs that mostly aren’t super urgent; problems with ops tools bring down whole systems and wake everyone up at 2am. For these reasons, ops is always going to have a more conservative mindset that shuns the shiny new thing to some extent.


People have this kind of reaction to Fuchsia a lot: wow, isn't this great! Then they learn why it doesn't run on anything and why the Linux kernel has 1201530 commits. The real world is imperfect. You are trying to make an abstraction of "everything" and then complain when it's leaking.


Yeah I think a restricted nix-like language might be a much better choice.



