Most of what we see on Twitter or YouTube is Blind Prompting. However, it is possible to apply an engineering mindset to prompting and that is what we should call prompt engineering. Check out the article for a much more detailed framing.
Dair AI also has some nice info and resources (with academic papers) about prompt engineering.
Prompt testing, especially for Q&A pairs where there are multiple right answers, has been bugging me a lot.
The article is reasonable, but it also shows a big gap in tooling: once you do more interesting prompts, the techniques there feel closer to linting and typing than to testing. They don't check the interesting parts.
We are helping our users with Q&A tasks involving code generation, where the answers may be JSON, executable code, or markdown discussions involving the same. We are tuning for a bunch of tools that follow that pattern so our users don't have to.
It's easy to make a labeled training set for grading our homework (catching regressions, ...) in the case of classifiers, and that's basically what the blog post showed.
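The classifier case can be sketched as a small labeled set plus an accuracy threshold acting as a regression test for a prompt. A minimal illustration, where `classify` is a stand-in for "send prompt + input to the model" (all names here are hypothetical, not from the post):

```python
# Toy "classifier": real code would call an LLM with the prompt under test.
def classify(text: str) -> str:
    return "positive" if "good" in text else "negative"

# Small hand-labeled set, the "grading our homework" data.
LABELED = [
    ("this is good", "positive"),
    ("really good stuff", "positive"),
    ("awful experience", "negative"),
    ("not great", "negative"),
]

def accuracy(examples) -> float:
    hits = sum(classify(x) == y for x, y in examples)
    return hits / len(examples)

# Fail the build if a prompt change regresses below the threshold.
assert accuracy(LABELED) >= 0.75
print(accuracy(LABELED))  # 1.0 for this toy classifier
```

The threshold rather than exact-match is what makes this tolerable when a prompt tweak flips one or two borderline examples.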
What about the above Q&A tasks? We can ask GPT-4 whether a generated A is a good answer for a Q, but that's asking it to grade itself. Likewise, in the code case, we can write unit tests for the answers. (Trick: we use the former to do the latter more quickly.) But I feel like there have to be better ways.
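The unit-test route for the code case can be sketched as: execute the generated answer and check it against hand-written cases, rather than asking a model to grade it. A minimal sketch, where `solve` is an assumed convention for the answer's entry point and `generated_answer` stands in for LLM output (in practice you would also strip markdown fences first):

```python
# Grade a model's code answer by running it against labeled test cases.
def grade_code_answer(generated_answer: str, test_cases: list) -> bool:
    """Return True iff the generated code defines solve() and passes every
    (args, expected) case."""
    namespace: dict = {}
    try:
        exec(generated_answer, namespace)  # define the candidate function
    except Exception:
        return False  # doesn't even run -> counts as a regression
    fn = namespace.get("solve")  # assumed convention: answers define solve()
    if fn is None:
        return False
    return all(fn(*args) == expected for args, expected in test_cases)

# A toy "model answer" and its labeled cases:
answer = "def solve(x, y):\n    return x + y\n"
cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
print(grade_code_answer(answer, cases))  # True
```

Running untrusted generated code this way obviously wants a real sandbox in production; the sketch only shows the grading shape.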
Another: OpenAI continually updates models based on usage, so we have to be sure our tests are real holdout sets that never leak back to them...
I don't think LLMs are going to be able to solve that. There are a number of things that are assumed to be true but may not be, which can lead to multiple possible answers (outputs) for the same inputs.
For example, determinism in code: it's required for computation and it's a system property, but generalizing a test for it is really hard. Knowing whether it holds lets you infer whether a system maintains related properties, but most of this is abstracted away at lower levels. And since the full context can never be shared with an LLM for evaluation, nor can it automatically switch contexts when evaluation fails, this most likely will never be solvable by computers once a single input can produce two different outputs, at least from what I know about automata theory and computability.
It's generally considered a class of problems that can't be solved by Turing machines.
https://mitchellh.com/writing/prompt-engineering-vs-blind-pr...