All good, no snark inferred. Yes I have considered this, and I keep considering ...

All good, no snark inferred. Yes I have considered this, and I keep considering it every time I get a bad result. Sorry this response is so long.

I think I have a good idea how these things work. I have run local LLMs for a couple of years on a pair of video cards here, trying out many open weight models. I have watched the 3blue1brown ML course. I have done several LinkedIn Learning courses (which weren't that helpful, just mandatory). I understand about prompting precisely and personas (though I am not sold personas are a good idea). I understand LLMs do not "know" anything, they just generate the next most likely token. I understand LLMs are not a database with accurate retrieval. I understand "reasoning" is not actual thinking just manipulating tokens to steer a conversation in vector space. I understand LLMs are better for some tasks (summarisation, sentiment analysis, etc) than others (retrieval, math, etc). I understand they can only predict what's in their training data. I feel I have a pretty good understanding of how to get results from LLMs (or at least the ways people say you can get results).

I have had some small success with LLMs. They are reasonably good at generating sub-100 line test code when given a precise prompt, probably because that is in training data scraped from StackOverflow. I did a certification earlier this year and threw ~1000 lines of Markdown notes into Gemini and had it quiz me which was very useful revision, it only got one question wrong of the couple of hundred I had it ask me.

I'll give a specific example of a recent failure. My job is mostly troubleshooting and reading code, all of which is public open source (so accessible via LLM search tooling). I was trying to understand something where I didn't know the answer, and this was difficult code to me so I was really not confident at all in my understanding. I wrote up my thoughts with references, the normal person I ask was busy so I asked Gemini Pro. It confidently told me "yep you got it!".

I asked someone else who saw a (now obvious) flaw in my reasoning. At some point I'd switched from a hash algorithm which generates Thing A, to a hash algorithm which generates Thing B. The error was clearly visible, one of my references had "Thing B" in the commit message title, which was in my notes with the public URL, when my whole argument was about "Thing A".

This wasn't even a technical or code error, it was a text analysis and pattern matching error, which I didn't see because I was so focused on algorithms. Even Gemini, the apparent best LLM in the world which is causing "code red" at OpenAI did not pick this up, when text analysis is supposed to be one of its core functionalities.

I also have a lot of LLM-generated summarisation forced on me at work, and it's often so bad I now don't even read it. I've seen it generate text which makes no logical sense and/or which uses so many words without really saying anything at all.

I have tried LLM-based products where someone else is supposed to have done all the prompt crafting and added RAG embeddings and I can just behave like a naive user asking questions. Even when I ask these things question which I know are in the RAG, they cannot retrieve an accurate answer ~80% of the time. I have read papers which support the idea that most RAG falls apart after about ~40k words and our document set is much larger than that.

Generally I find LLMs are at the point where to evaluate the LLM response I need to either know the answer beforehand so it was pointless to ask, or I need to do all the work myself to verify the answer which doesn't improve my productivity at all.

About the only thing I find consistently useful about LLMs is writing my question down and not actually asking it, which is a form of Rubber Duck Debugging (https://en.wikipedia.org/wiki/Rubber_duck_debugging) which I have already practiced for many years because it's so helpful.

Meanwhile trillions of dollars of VC-backed marketing assures me that these things are a huge productivity increaser and will usher in 25% unemployment because they are so good at doing every task even very smart people can do. I just don't see it.

If you have any suggestions for me I will be very willing to look into them and try them.