There are a few increasingly hard things when it comes to prompt customization:
1. Prompts ask LLM to generate input for the next step
2. Prompts ask LLM to generate instructions for the next step
3. Prompts ask LLM to generate the next step
Doing #3 across multiple steps is the promise of Langchain, AutoGPT et al. It's pretty much impossible to do with useful quality. Attempting #3 very often either ends the chain too early or just spins in a loop. It's not the kind of thing you can optimize iteratively to good-enough quality at production scale. "Retry" as a user-facing operation is just stupid IMO. Either it works well, or we don't offer it as a feature.
So we stopped doing #3 completely. The features now have a narrow use case and a fully-defined DAG shape upfront. We feed some context on what all the steps are to every step, so it can understand the overall purpose.
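A fixed chain like that can be sketched roughly as below. All names, prompts, and the three-step pipeline are invented for illustration, assuming only the structure described above: a predeclared sequence of steps, each of which receives the same high-level overview of the whole chain.

```python
# Sketch of a fully-predeclared chain: the step list is fixed upfront,
# and every step's prompt is prefixed with an overview of the pipeline
# so the model understands the overall purpose. Illustrative only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    prompt_template: str  # tuned internally; treated like code

CHAIN_OVERVIEW = (
    "This pipeline summarizes a support ticket, drafts a reply, "
    "and extracts follow-up tasks."
)

STEPS = [
    Step("summarize", "Summarize the ticket:\n{input}"),
    Step("draft_reply", "Draft a reply based on this summary:\n{input}"),
    Step("extract_tasks", "List follow-up tasks from this reply:\n{input}"),
]

def run_chain(call_llm: Callable[[str], str], user_input: str) -> str:
    data = user_input
    for step in STEPS:
        prompt = (
            f"Overall pipeline: {CHAIN_OVERVIEW}\n"
            f"Current step: {step.name}\n\n"
            + step.prompt_template.format(input=data)
        )
        data = call_llm(prompt)  # output of one step feeds the next
    return data
```

The point is that the shape of the chain never comes from the model; the LLM only fills in content inside a structure the developers already committed to.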
For #2, we tune these prompts internally within the team. It's very sensitive to specific wording. Even things like newlines affect quality too much.
#1 we've found is doable for non-tech folks. In some of the features, we expose this to the user as additional context and mix that in with the pre-built instructions.
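Mixing user context into a fixed prompt might look like the sketch below. The template text and function name are hypothetical, not anyone's actual prompts; the only idea taken from the comment is that user input is folded in as extra context while the instructions themselves stay fixed and internally tuned.

```python
# Illustrative: user-provided text is appended as context (#1),
# never as instructions (#2) -- the instructions are a fixed,
# internally-tested constant.
PREBUILT_INSTRUCTIONS = (
    "Rewrite the following text as a polite customer-facing email.\n"
    "Keep it under 150 words."
)

def build_prompt(text: str, user_context: str = "") -> str:
    parts = [PREBUILT_INSTRUCTIONS]
    if user_context.strip():
        parts.append(
            "Additional context from the user:\n" + user_context.strip()
        )
    parts.append("Text:\n" + text)
    return "\n\n".join(parts)
```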
So #2 is where it's both hard to get right and still solvable. Every prompt change has to be tested with a huge number of full-chain invocations on real input data before it can be accepted and stabilized. The evaluation of quality is all human, manual work. We tried some other semi-automated approaches, but they just weren't feasible.
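The testing loop described here is essentially a regression harness: run the candidate prompt through the full chain over a corpus of real inputs, then hand the outputs to humans for judgment. A minimal sketch, with all names and the CSV layout assumed rather than taken from the author's tooling:

```python
# Run a prompt variant through the full chain on real inputs and dump
# the results for human side-by-side review. The "verdict" column is
# left blank for reviewers to fill in -- per the comment, quality
# evaluation here is manual.
import csv
from typing import Callable, Iterable

def evaluate_variant(
    run_chain: Callable[[str], str],   # full-chain invocation with the candidate prompt
    inputs: Iterable[str],             # real production inputs
    out_path: str,
) -> None:
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["input", "output", "verdict"])
        for text in inputs:
            writer.writerow([text, run_chain(text), ""])
```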
All of this is why there is no way Langchain or anything like it is currently useful for building actually valuable user-facing features at production scale.
What if you built a scoring system for re-usable action sequences that are stored in a database, and then have the LLM generate alternate solutions and grade them according to their performance?
An action sequence could be graded according to whether it was successful, its speed, efficiency, cleverness, cost, etc.
You could even introduce human feedback into the process, and pay people for proposing successful and efficient action sequences.
All action sequences would be indexed and the AI agent would be able to query the database to find effective action sequences to chain together.
The more money you throw at generating, iterating, and evolving various action sequences stored in your database, the smarter and more effective your AI agent becomes.
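The proposal above could be prototyped as a scored store of action sequences that an agent queries by goal. Everything in this sketch, including the schema and the scoring formula (success rate discounted by average cost), is invented to make the idea concrete:

```python
# Toy "action sequence database": sequences accumulate outcome stats,
# and retrieval returns the best-scoring sequence for a goal.
from dataclasses import dataclass, field

@dataclass
class ActionSequence:
    goal: str
    steps: list = field(default_factory=list)
    successes: int = 0
    attempts: int = 0
    total_cost: float = 0.0

    def score(self) -> float:
        """Success rate, discounted by average cost per attempt."""
        if self.attempts == 0:
            return 0.0
        success_rate = self.successes / self.attempts
        avg_cost = self.total_cost / self.attempts
        return success_rate / (1.0 + avg_cost)  # cheap and reliable wins

def best_sequence(db: list, goal: str):
    """Return the highest-scoring stored sequence for this goal, if any."""
    candidates = [s for s in db if s.goal == goal]
    return max(candidates, key=ActionSequence.score, default=None)
```

Human feedback would plug in naturally by letting reviewers adjust a sequence's stats or add hand-authored sequences to the store.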