The algorithm still uses all the weights, just not all the time: it skips the weights that aren't important for a given input vector.
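To make the "skip weights per input" idea concrete, here's a minimal sketch of an approximate matrix-vector multiply. This is my own illustration, not the actual implementation: the function name, the `effort` parameter, and the heuristic of keeping the columns with the largest |x_i| are all assumptions about how such skipping could look.

```python
import numpy as np

def approx_matvec(W, x, effort=0.5):
    """Approximate W @ x by skipping the least significant input dimensions.

    Hypothetical sketch: for each input vector, keep only the fraction
    of columns of W where |x_i| is largest. Every weight can still be
    used for *some* input, just not for every input.
    """
    k = max(1, int(len(x) * effort))       # how many dimensions to keep
    keep = np.argsort(np.abs(x))[-k:]      # indices of the largest |x_i|
    return W[:, keep] @ x[keep]            # skip the rest entirely

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 1000))
x = rng.standard_normal(1000)
full = W @ x                               # exact result
approx = approx_matvec(W, x, effort=0.8)   # cheaper approximation
```

Note that which columns get skipped changes with `x`, which is the point: no weight is permanently removed from the model.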
Also, approximation methods, as a field, are not new and they have shown their use.
Having said all that, extraordinary claims require extraordinary evidence - that's why I hedge the communication. It's "probably" until we get serious tests going.
The metric that's improved is computation speed, and it's achieved by essentially changing the computation (by skipping work that likely doesn't have a large impact on the results).
Given that it's a different computation, you could argue that Mistral+effort is a new model with an improved metric of quality per amount of computation performed.
Otherwise - given that for every different input a separate set of weights is excluded - I don't think you could conclude from this (if it holds up, etc.) that the base model is not optimal.
In a similar sense, quantization improved the "quality per model size" metric, but I don't think people are arguing that Mistral is less optimal than quantised Mistral (unless you mean literally that metric). On the other hand, if you're targeting that metric specifically, then it would make perfect sense to say that quantised Mistral is more optimal for it.
I guess it comes down to optimality being dependent on the metric you're looking at, and there being many things you might want to optimise for.
To note again: if this technique holds up, it's better than model distillation (which just removes some of the weights) because for some inputs those weights could matter, and this technique should (if I understand correctly) account for that somewhat. To me, this is what you seem to be referring to when you say:
> If removing weights improves some metrics, that may be a clue that the model is not optimal in some sense
Yeah, more tests are needed. I got some feedback suggesting KL divergence instead of token similarity - initial tests seem to show that it's workable (compared to Q8), but not amazingly good. I'll be working on that next week and publishing the results.
As for treating effort+Mistral as a separate model - I wouldn't make that comparison. The model stays the same; all of its weights are still being used, just not all of the time, so we don't really lose information from the source model.
Wait, does the second "it" refer to the true part? Because traditionally, it refers to the "too good" expression. So you'd say, "it _is_ too good to be true".