I would suspect that for self hosted LLMs, quality >>> performance, so the newer releases will always expand to fill capacity of available hardware even when efficiency is improved.
The blog mentions checking each agent action (say the agent was planning to send a malicious http request) against the user prompt for coherence; the attack vector exists but it should make the trivial versions of instruction injection harder
If it was just "noisy", you could compensate with scale. It's worse than that.
"Human preference" is incredibly fucking entangled, and we have no way to disentangle it and get rid of all the unwanted confounders. A lot of the recent "extreme LLM sycophancy" cases is downstream from that.
I assume the high volume of search traffic forces Google to use a low quality model for AI overviews. Frontier Google models (e.g. Gemini 2.5 pro) are on-par, if not 'better', than leading models from other companies.