I'm guessing the issue is just model size. If you're testing sub-30B models and finding errors, they're probably not large enough to memorize everything in the training data set, so there are inaccuracies, and they might hallucinate a bit on facts that don't show up very often in the training data.
Commercial models are presumably significantly larger than the smaller open models, so it sounds like the issue is mainly model size...
PS: Okra on curry is pretty good actually :)