Most of what we see on Twitter or YouTube is Blind Prompting. However, it is possible to apply an engineering mindset to prompting and that is what we should call prompt engineering. Check out the article for a much more detailed framing.
Dair AI also has some nice info and resources (with academic papers) about prompt engineering.
Prompt testing, especially for Q&A pairs where there are multiple right answers, has been bugging me a lot.
The article is reasonable, but it also shows a big gap in tooling: once you do more interesting prompts, the techniques there feel closer to linting and typing than to testing. They don't check the interesting parts.
We are helping our users with Q&A tasks involving code generation, where the answers may be JSON, executable code, or markdown discussions involving the same. We are tuning for a bunch of tools that follow that pattern so our users don't have to.
It's easy to make a labeled training set for grading our homework (catching regressions, ...) in the case of classifiers, and that's basically what the blog post showed.
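The classifier case can be sketched as a small labeled set plus an accuracy threshold acting as a regression test for a prompt. A minimal illustration, where `classify` is a stand-in for "send prompt + input to the model" (all names here are hypothetical, not from the post):

```python
# Toy "classifier": real code would call an LLM with the prompt under test.
def classify(text: str) -> str:
    return "positive" if "good" in text else "negative"

# Small hand-labeled set, the "grading our homework" data.
LABELED = [
    ("this is good", "positive"),
    ("really good stuff", "positive"),
    ("awful experience", "negative"),
    ("not great", "negative"),
]

def accuracy(examples) -> float:
    hits = sum(classify(x) == y for x, y in examples)
    return hits / len(examples)

# Fail the build if a prompt change regresses below the threshold.
assert accuracy(LABELED) >= 0.75
print(accuracy(LABELED))  # 1.0 for this toy classifier
```

The threshold rather than exact-match is what makes this tolerable when a prompt tweak flips one or two borderline examples.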
What about the above Q&A tasks? We can ask GPT-4 whether a generated A is a good answer for a Q, but that's asking it to grade itself. Likewise, in the code case, we can write unit tests for the answers. (Trick: we use the former to do the latter more quickly.) But I feel like there have to be better ways.
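The unit-test route for the code case can be sketched as: execute the generated answer and check it against hand-written cases, rather than asking a model to grade it. A minimal sketch, where `solve` is an assumed convention for the answer's entry point and `generated_answer` stands in for LLM output (in practice you would also strip markdown fences first):

```python
# Grade a model's code answer by running it against labeled test cases.
def grade_code_answer(generated_answer: str, test_cases: list) -> bool:
    """Return True iff the generated code defines solve() and passes every
    (args, expected) case."""
    namespace: dict = {}
    try:
        exec(generated_answer, namespace)  # define the candidate function
    except Exception:
        return False  # doesn't even run -> counts as a regression
    fn = namespace.get("solve")  # assumed convention: answers define solve()
    if fn is None:
        return False
    return all(fn(*args) == expected for args, expected in test_cases)

# A toy "model answer" and its labeled cases:
answer = "def solve(x, y):\n    return x + y\n"
cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
print(grade_code_answer(answer, cases))  # True
```

Running untrusted generated code this way obviously wants a real sandbox in production; the sketch only shows the grading shape.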
Another: OpenAI continually updates models based on usage, so we have to be sure our tests are real holdout sets that never leak back to them...
I don't think LLMs are going to be able to solve that. There are a number of things that are assumed to be true but may not be, which can lead to multiple possible answers (outputs) for the same inputs.
For example, determinism in code: it's required for computation and it's a system property, but generalizing a test for it is really hard. Knowing whether it holds lets you infer whether a system maintains related properties, but most of this is abstracted away at lower levels. And since the full context can never be shared with an LLM for evaluation, nor can it automatically switch contexts when evaluation fails, this most likely will never be solvable by computers once a single input can produce two different outputs, at least from what I know about automata theory and computability.
It's generally considered a class of problems that can't be solved by Turing machines.
https://mitchellh.com/writing/prompt-engineering-vs-blind-pr...