Anecdotal experience, but when bugfixing I personally find if a model introduces...

Anecdotal experience, but when bugfixing I personally find if a model introduces a bug, it has a hard time spotting and fixing it, but when you give the code to another model it can instantly spot it (even if it's a weaker model overall).

So I can well imagine that this sort of approach could work very well, although agree with your sentiment that measurement would be good.