Finally. I've been saying for about 16 months now that we need to stop focusing on a single agent getting everything right and instead layer agents, so it's great to have a paper to point to.
A really good topic that ties in with this is the need for deterministic sampling (I may have the terminology a bit off), depending on what the model is intended for. The LLMWare team did a good two-part video on this here as well (https://www.youtube.com/watch?v=7oMTGhSKuNY)
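To pin the terminology down a bit: what usually gets called deterministic output is just greedy (temperature-0) decoding, i.e. sampling switched off. A minimal sketch with Hugging Face transformers, using gpt2 purely as a stand-in model:

```python
# Deterministic vs. stochastic decoding with Hugging Face transformers.
# "gpt2" is just an example model; swap in whatever you actually use.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")

# Greedy decoding: no sampling, so the output is identical on every run.
deterministic = model.generate(**inputs, do_sample=False, max_new_tokens=20)

# Sampled decoding: temperature > 0 draws from the token distribution,
# so repeated runs can differ.
stochastic = model.generate(
    **inputs, do_sample=True, temperature=0.8, max_new_tokens=20
)

print(tokenizer.decode(deterministic[0], skip_special_tokens=True))
print(tokenizer.decode(stochastic[0], skip_special_tokens=True))
```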
I think dedicated miniature LLMs are the way forward.
Disclaimer - Not affiliated with them in any way, just think it's a really cool project.
I have one personal niggle: I get annoyed when we end up lying to ourselves. Regarding the 101 section in video 1: people forgot this the day LLMs came out, and I felt it was too generous with the benefit of the doubt.
This basic point was, and remains, constantly argued, with “emergence” and anthropomorphization at the heart of the opposing argument.
We have tons of specialized components that work together cooperatively and competitively. There are multiple ways they connect. There also seem to be global processes, like what happens during sleep. There are over 3,000 cell types per the BRAIN Initiative. Every brain forms on its own, taking shape like something out of a Transformers movie.
God’s design is mostly nothing like man’s neural networks. It’s far superior. Brains are also what’s creating all the artificial neural nets, on top of all the math, tech, and economic systems that they run on. AI’s got a lot of catching up to do.
I think it's way more than 8, even. It's common to have many working as supervisors, often in conflict with each other. And some act out automatic trauma responses, since they're stuck in the past when the trauma occurred.
Et voilà, you have the script of Inside Out. \s
But honestly, I do think this is how we operate. Depending on our metabolic state and other psychological factors, the dominant version changes, but as a whole we remain the sum total of all these versions.
Kind of. More like a mixture of a mixture of experts.
The problem is that MoE on its own can't use the context as a scratch pad for differentiated CoT trees.
So you have a mixture of token suggestions, but a singular chain of thought.
A mixture of both is probably going to perform better than a mixture of the former alone, especially given everything we know by now about in-context learning and how much signal synthetic data carries.
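For what it's worth, the "mixture of chains of thought" idea already has a name: self-consistency. Sample several independent CoT traces and majority-vote on the final answers. A rough sketch, where `ask_model` and `extract_answer` are hypothetical stand-ins for your client call and answer-parsing logic:

```python
# Rough sketch of a "mixture of chains of thought" via self-consistency:
# sample several independent CoT completions at a nonzero temperature,
# then majority-vote on the extracted final answers.
from collections import Counter

def ask_model(prompt: str, temperature: float) -> str:
    """Hypothetical stand-in: one chain-of-thought completion per call."""
    raise NotImplementedError("wire up your LLM client here")

def extract_answer(completion: str) -> str:
    """Hypothetical answer parser; depends on your prompt format."""
    return completion.strip().splitlines()[-1]

def self_consistency(prompt: str, n_chains: int = 8,
                     temperature: float = 0.7) -> str:
    # Each chain is its own full reasoning trace; MoE routing, by contrast,
    # mixes experts inside a single trace.
    answers = [extract_answer(ask_model(prompt, temperature))
               for _ in range(n_chains)]
    return Counter(answers).most_common(1)[0][0]
```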
It's interesting that the diminishing returns on tasks flatten out rapidly at around the same size as ideal human meeting sizes: https://www.researchgate.net/figure/18-Optimal-Meeting-Sizes...
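A toy calculation makes the flattening less mysterious: under the (strong) assumption that agents are independent and each correct with probability p, majority-vote accuracy follows the Condorcet jury theorem and saturates quickly:

```python
# Toy model of diminishing returns, assuming n independent agents each
# correct with probability p and a simple majority vote (Condorcet jury
# theorem). Real agents are correlated, so treat this as an optimistic
# bound on how fast accuracy can grow.
from math import comb

def majority_accuracy(p: float, n: int) -> float:
    """P(majority of n independent agents is correct), n odd."""
    k = n // 2 + 1
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

for n in (1, 3, 5, 7, 9, 15, 25):
    print(n, round(majority_accuracy(0.6, n), 3))
# With p = 0.6: 0.600, 0.648, 0.683, 0.710, ... — each extra pair of
# agents buys less than the one before.
```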
If this were done at more granular steps of agent count, I'm curious just how closely it would match those numbers.
I'd also really love to see the eventual follow-up showing how much more performance can be gained when the agents are each fine-tuned toward slightly different aims. I'd expect a performance lift even from just setting the agents at different temperature levels (rough sketch at the end of this comment).
Very happy to see the research community starting to step in this direction!
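To make the temperature idea concrete, here's a hypothetical sketch of a "same model, different temperatures" ensemble; `complete` is a stand-in for whatever client call you actually use:

```python
# Hypothetical "same model, different temperatures" ensemble: each agent
# answers at its own temperature and the ensemble majority-votes.
from collections import Counter

def complete(prompt: str, temperature: float) -> str:
    """Hypothetical single-completion call; plug in your client here."""
    raise NotImplementedError

def temperature_ensemble(prompt: str,
                         temperatures=(0.2, 0.5, 0.8, 1.1)) -> str:
    # Low-temperature agents vote conservatively, high-temperature ones
    # explore; the majority vote aggregates across the spread.
    votes = [complete(prompt, t).strip() for t in temperatures]
    return Counter(votes).most_common(1)[0][0]
```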