It seems like everything I see about success using LLMs for this kind of work is for greenfield. What about three weeks later, when the job changes to maintenance and iteration on something that's already working? Are people applying LLMs to that space?
My codebase is relatively greenfield (I started working on it early last year), but it’s up to ~50k lines of mixed C++/Rust with a binding layer whose API predates every LLM’s training set. Even when I started, ChatGPT/Claude weren’t very useful, and now the project requires a completely different strategy when working with LLMs (it’s a Qt AI desktop app, so I’m dogfooding a lot). I’ve also used them in a larger codebase (~500k lines), which requires yet another approach. It feels a lot like the transition from managing 2 to 20 to 200 to 2000 people: it’s a different ballgame at each step change. A very well encapsulated codebase of ~500k lines is manageable for small changes, but not for refactoring, exploration, etc., at least until useful context sizes grow another order of magnitude (I keep trying Gemini’s 2M, but it’s been a disappointment).
I have a lot of documentation aimed at the AI in `docs/notes/` (some of it written by an LLM, but proofread before committing), and I instruct Cursor/Windsurf/Aider via their respective rules/config files to read the documentation before doing anything. At some scale, that initial context becomes just a directory listing and a short description of everything in the notes folder, and eventually even that breaks down against context size limits: either I exceed the maximum length of the rules, or the agent has to pull in too much context for the change.
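For reference, the top of my rules file looks roughly like this (a trimmed, hypothetical sketch; the actual note names are project-specific, and Windsurf and Aider get equivalent copies in their own config formats):

```
# .cursorrules (Cursor reads this from the repo root; .windsurfrules is analogous)
Before making any change, read the relevant files in docs/notes/.

Index of docs/notes/ (entries here are illustrative):
- bindings.md        - conventions for the C++/Rust binding layer
- qt-patterns.md     - how widgets, signals, and slots are structured
- error-handling.md  - Result/exception mapping across the FFI boundary

Rules:
- Never invent binding-layer APIs; point to an existing example in docs/notes/bindings.md.
- Match the surrounding file's style; do not reformat unrelated code.
```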
I’ve found that there’s actually an uncanny valley between greenfield projects, where the model is free to make whatever assumptions it wants, and brownfield projects, where it’s possible to provide enough context from the existing codebase to get both API accuracy (i.e., fewer hallucinated calls) and the general patterns across via few-shot examples. This became very obvious once I had enough examples of that binding layer: even though I could include all of the documentation for the library, it didn’t work consistently until I had a variety of production examples to point it to.
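To give a flavor of what those few-shot examples look like, here’s a minimal, invented sketch in the style of a cxx bridge (the real binding layer is custom and predates these conventions; every name below is hypothetical):

```rust
// Hypothetical binding-layer snippet of the kind kept in docs/notes/ as a
// few-shot reference. Assumes the `cxx` crate; the real API differs.
#[cxx::bridge(namespace = "app")]
mod ffi {
    // Shared struct, visible to both C++ and Rust.
    struct RenderRequest {
        width: u32,
        height: u32,
    }

    unsafe extern "C++" {
        include!("app/renderer.h");

        type Renderer;

        // Factory function defined on the C++ side.
        fn new_renderer() -> UniquePtr<Renderer>;

        // C++ exceptions map to Result on the Rust side.
        fn render(self: Pin<&mut Renderer>, req: &RenderRequest) -> Result<()>;
    }
}
```

Pointing the model at a handful of real snippets like this did far more for API accuracy than the library documentation ever did.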
Right now, I probably spend as much time writing each prompt as I do massaging the notes folder and rules every time I notice the model doing something wrong.
Logically this makes sense: every model has a context size and complexity ceiling beyond which it can no longer function properly. Everything you feed it pushes it closer to that ceiling, and once you hit it, the LLM is no longer as helpful as it was.
I work on full-blown legacy apps, and needless to say, I don't even bother with LLMs when working on them most of the time.
Yeah, it sucks. LLMs are not great with a big context yet; I hope that is being worked on. I need the LLM to read my whole project and, ideally, all related Slack conversations, the wiki, and related libraries.
I could, for example, tell it to refactor things. It would have to write files, of course. E.g. "Add retries with exponential backoff to all calls to service X."
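To make that concrete, this is roughly the transformation I'd expect at each call site, sketched in Rust with a hypothetical `call_service_x` (in a real codebase I'd expect a crate like `backoff` or `tokio-retry` rather than this hand-rolled loop):

```rust
use std::thread::sleep;
use std::time::Duration;

// Stand-in for the existing call; only here so the sketch compiles.
fn call_service_x(_payload: &str) -> Result<String, String> {
    Err("connection refused".to_string())
}

// What each rewritten call site would route through: retry on failure,
// doubling the delay between attempts (exponential backoff).
fn call_service_x_with_retries(payload: &str) -> Result<String, String> {
    let max_attempts = 5;
    let mut delay = Duration::from_millis(100);

    for attempt in 1..=max_attempts {
        match call_service_x(payload) {
            Ok(resp) => return Ok(resp),
            Err(e) if attempt < max_attempts => {
                eprintln!("attempt {attempt} failed: {e}; retrying in {delay:?}");
                sleep(delay);
                delay *= 2;
            }
            Err(e) => return Err(e),
        }
    }
    unreachable!("the loop above always returns")
}
```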