Is there any validity to the idea of using a higher-level LLM to generate the initial data, and then copying that data to a lower-level LLM for actual use?
For example, another comment asked:
"If I have a big pile of PDFs and wanted to get an LLM to be really good at answering questions about what's in all those PDFs, would it be best for me to try running this locally?"
So what if you used a paid LLM to analyze these PDFs and create the data, but then moved that data to a weaker LLM in order to run question-answer sessions on it? The idea being that you don't need the better LLM at this point, as you've already extracted the data into a more efficient form.
Maybe. You'd need to develop such a "more efficient" format. Turning unstructured text into knowledge graphs has gotten attention lately, though I'm honestly skeptical of how useful those will turn out to be. Often you just can't break unstructured data down into structured data without losing a ton of information.
Turning the data into an intermediate format that isn't directly human-readable (say, high-density embeddings) might be a more promising path.
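The split being proposed can be sketched in a few lines. This is a minimal toy, not a real pipeline: the `embed` function below is a hypothetical stand-in (a bag-of-words vector) for a single expensive call to the paid model's embedding endpoint, made once at preprocessing time. After that, answering a question only needs cheap vector math to pick the relevant chunks, which then go to the weaker local model.

```python
import math

def embed(text):
    """Stand-in for the expensive model's embedding call (here: a toy
    normalized bag-of-words vector). In a real pipeline this would hit
    the paid model once per chunk, during preprocessing only."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {w: v / norm for w, v in counts.items()}

def cosine(a, b):
    """Similarity between two sparse vectors stored as dicts."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())

# Offline step (strong model, run once): embed every chunk and store the vectors.
chunks = [
    "invoice totals for the third quarter",
    "employee onboarding checklist",
    "server maintenance schedule",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(question, k=1):
    """Online step (cheap): rank the stored vectors against the question
    and return the top chunks; only these go to the weaker LLM's prompt."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```

The point of the shape is that the strong model's work is amortized: the index is built once, and each question afterward costs only a similarity search plus one small-model call.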
Yes, this can work. I’ve done that in a few cases.
In fact, if you split the data preprocessing into small enough steps, those steps can be run on weaker LLMs too. It takes a lot more time, but it's doable.
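The "small enough steps" idea is basically map-reduce over the documents. A minimal sketch, where `weak_llm_summarize` is a hypothetical stand-in (it just truncates) for a prompt to a small local model, kept short enough that the weak model handles each call reliably:

```python
def weak_llm_summarize(text, max_words=8):
    """Hypothetical stand-in for one call to a small local model.
    Here it merely truncates; a real step would prompt the model
    with exactly one chunk, keeping each task trivially small."""
    return " ".join(text.split()[:max_words])

def chunk(text, size=40):
    """Split a long document into pieces small enough for one call."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def preprocess(document, budget=40):
    """Map: summarize each chunk separately. Reduce: merge the partial
    summaries, and keep re-summarizing until the digest fits the budget.
    Many slow calls, but every individual call is easy for a weak model."""
    digest = " ".join(weak_llm_summarize(c) for c in chunk(document))
    while len(digest.split()) > budget:
        digest = " ".join(weak_llm_summarize(c) for c in chunk(digest))
    return digest
```

Each pass shrinks the text by a constant factor, so the loop terminates; the trade-off is exactly the one described above: far more calls, but none of them requires a strong model.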