There are actually quite a few studies out there that look at LLM code quality (e.g. https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=LLM+...) and they mostly have similar findings. This reinforces the idea that LLMs still require expert guidance. Note that some of these studies date back to 2023, which is eons ago in terms of LLM progress.
The conclusion of this paper aligns with the emerging understanding that AI is simply an amplifier of your existing quality assurance processes: higher discipline results in higher velocity, lower discipline results in lower stability (e.g. https://dora.dev/research/2025/). Having strong feedback and validation loops is more critical than ever.
In this paper, for instance, they collected static analysis warnings using a local SonarQube server, which implies that it was not integrated into the projects they looked at. As such, these warnings were not available to the agent. It's highly likely that if these warnings were fed back into the agent, it would fix them automatically.
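A minimal sketch of that feedback loop, assuming a local SonarQube server and a hypothetical project key and token. The `/api/issues/search` endpoint and its response fields come from SonarQube's documented Web API; the prompt wording is my own invention.

```python
import base64
import json
import urllib.parse
import urllib.request


def fetch_open_issues(base_url: str, project_key: str, token: str) -> list:
    """Return unresolved issues for a project from SonarQube's Web API."""
    query = urllib.parse.urlencode(
        {"componentKeys": project_key, "resolved": "false", "ps": 100}
    )
    req = urllib.request.Request(f"{base_url}/api/issues/search?{query}")
    # SonarQube user tokens go in the Basic-auth username field, empty password
    cred = base64.b64encode(f"{token}:".encode()).decode()
    req.add_header("Authorization", f"Basic {cred}")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["issues"]


def issues_to_prompt(issues: list) -> str:
    """Format the warnings as a plain-text block to feed back into the agent."""
    lines = ["Fix the following static analysis warnings:"]
    for issue in issues:
        # "component" looks like "projectKey:src/path/to/file.py"
        path = issue["component"].split(":", 1)[-1]
        lines.append(
            f"- {path}:{issue.get('line', '?')} "
            f"[{issue['rule']}] {issue['message']}"
        )
    return "\n".join(lines)
```

Run `issues_to_prompt(fetch_open_issues("http://localhost:9000", "my-project", token))` after each agent iteration and append the result to the next prompt, and the warnings stop being invisible to the model.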
Another interesting thing they mention in the conclusion: the metrics we use for humans may not apply to agents. My go-to example for this is code duplication (even though this study finds a minimal increase in duplication) -- it may actually be better for agents to rewrite chunks of code from scratch rather than use a dependency whose code is not available, forcing them to rely instead on natural-language documentation, which may or may not be sufficient or even accurate. What is tech debt for humans may actually be a boon for agents.
That's fair, but I suspect the underlying mechanism is the same -- the models prefer rewriting code from scratch rather than looking around for reusable abstractions, which may exist just a few modules over, or -- for smaller models -- sometimes even in the same file. They're not copy-pasting the code for sure, just regenerating it de novo.
This is the most common issue I find, even with the latest models. For normal logic it's not too bad; the real risk is when they start duplicating classes or other abstractions, because those tend to proliferate and cause a mess.
I don't know if it's the training or RL or something intrinsic to the attention mechanism, but these models "prefer" generating new code rather than looking around for and integrating reusable code, unless the functionality is significant or they are explicitly prompted otherwise.
I think this is why AGENTS.md files are becoming so critical -- as standing instructions, they help override the natural tendencies of the model.
Yeah I agree that it's not copy/pasted the way a dev would, but I think the end result is the same. The more it needlessly duplicates code, the more brittle things will become. Changes will get harder and harder to implement as the number of sites that have to change increases.
On the other hand, I think driving down the need for external dependencies can be a net win. In my experience you usually need a very tiny slice of what a dependency actually offers, and often you settle for making design compromises to fit the dependency into your system, because the cost of writing it yourself is too high. LLMs definitely change that calculus.
I've found AGENTS.md files are more of a bandaid than anything. I've seen agents routinely ignore or forget them, and the larger the codebase and the number of changes they're making, the more frequently they forget.