Is there any validity to the idea of using a higher-level LLM to generate the initial data, and then copying that data to a lower-level LLM for actual use?
For example, another comment asked:
"If I have a big pile of PDFs and wanted to get an LLM to be really good at answering questions about what's in all those PDFs, would it be best for me to try running this locally?"
So what if you used a paid LLM to analyze these PDFs and create the data, but then moved that data to a weaker LLM in order to run question-answer sessions on it? The idea being that you don't need the better LLM at this point, as you've already extracted the data into a more efficient form.
Maybe. You'd need to develop such a "more efficient" format. Turning unstructured text into knowledge graphs has gotten attention lately, though I'm honestly skeptical of how useful those will turn out to be. Often you just can't break unstructured data down into structured data without losing a ton of information.
Turning the data into an intermediate format that isn't directly human-readable (say, high-density embeddings) might be a more promising path.
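The split being proposed can be sketched in a few lines. This is a minimal toy, not a real pipeline: the `embed` function below is a hypothetical stand-in (a bag-of-words vector) for a single expensive call to the paid model's embedding endpoint, made once at preprocessing time. After that, answering a question only needs cheap vector math to pick the relevant chunks, which then go to the weaker local model.

```python
import math

def embed(text):
    """Stand-in for the expensive model's embedding call (here: a toy
    normalized bag-of-words vector). In a real pipeline this would hit
    the paid model once per chunk, during preprocessing only."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {w: v / norm for w, v in counts.items()}

def cosine(a, b):
    """Similarity between two sparse vectors stored as dicts."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())

# Offline step (strong model, run once): embed every chunk and store the vectors.
chunks = [
    "invoice totals for the third quarter",
    "employee onboarding checklist",
    "server maintenance schedule",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(question, k=1):
    """Online step (cheap): rank the stored vectors against the question
    and return the top chunks; only these go to the weaker LLM's prompt."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```

The point of the shape is that the strong model's work is amortized: the index is built once, and each question afterward costs only a similarity search plus one small-model call.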
Yes, this can work. I’ve done that in a few cases.
In fact, if you split the data preprocessing into small enough steps, those steps can be run on weaker LLMs too. It takes a lot more time, but it's doable.
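The "small enough steps" idea is basically map-reduce over the documents. A minimal sketch, where `weak_llm_summarize` is a hypothetical stand-in (it just truncates) for a prompt to a small local model, kept short enough that the weak model handles each call reliably:

```python
def weak_llm_summarize(text, max_words=8):
    """Hypothetical stand-in for one call to a small local model.
    Here it merely truncates; a real step would prompt the model
    with exactly one chunk, keeping each task trivially small."""
    return " ".join(text.split()[:max_words])

def chunk(text, size=40):
    """Split a long document into pieces small enough for one call."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def preprocess(document, budget=40):
    """Map: summarize each chunk separately. Reduce: merge the partial
    summaries, and keep re-summarizing until the digest fits the budget.
    Many slow calls, but every individual call is easy for a weak model."""
    digest = " ".join(weak_llm_summarize(c) for c in chunk(document))
    while len(digest.split()) > budget:
        digest = " ".join(weak_llm_summarize(c) for c in chunk(digest))
    return digest
```

Each pass shrinks the text by a constant factor, so the loop terminates; the trade-off is exactly the one described above: far more calls, but none of them requires a strong model.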