The more concerning algorithms at play are the ones used in post-training, and with them the risk of reward hacking, which is what he was getting at.
https://en.wikipedia.org/wiki/Reward_hacking
100% - we really shouldn't anthropomorphize. But the current models are capable of being trained in a way to steer agentic behavior from reasoned token generation.
> But the current models are capable of being trained in a way to steer agentic behavior from reasoned token generation.
This does not appear to be sufficient in the current state, as described in the project's README.md:
> Why This Exists
>
> We learned the hard way that instructions aren't enough to keep AI agents in check. After Claude Code silently wiped out hours of progress with a single rm -rf ~/ or git checkout --, it became evident that "soft" rules in a CLAUDE.md or AGENTS.md file cannot replace hard technical constraints. The current approach is to use a dedicated hook to programmatically prevent agents from running destructive commands.
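For concreteness, here is a minimal sketch of what such a guard hook might look like. It assumes Claude Code's PreToolUse hook interface, where the proposed tool call arrives as JSON on stdin and a non-zero exit code (2) blocks the command, with stderr fed back to the agent. The payload shape and the pattern list below are illustrative assumptions, not the linked project's actual implementation.

    #!/usr/bin/env python3
    # Sketch of a PreToolUse guard hook: reads the proposed tool call as JSON
    # from stdin and vetoes shell commands matching known-destructive patterns.
    # The payload shape and the "exit 2 to block" convention are assumptions
    # about the hook interface; the pattern list is illustrative only.
    import json
    import re
    import sys

    DESTRUCTIVE_PATTERNS = [
        r"\brm\s+-[a-zA-Z]*(rf|fr)\b",      # rm -rf / rm -fr
        r"\bgit\s+checkout\s+--(\s|$)",     # discards uncommitted changes
        r"\bgit\s+reset\s+--hard\b",        # discards uncommitted changes
        r"\bgit\s+clean\s+-[a-zA-Z]*f",     # deletes untracked files
        r"\bmkfs\b",                        # formats a filesystem
        r"\bdd\s+if=",                      # raw disk writes
    ]

    def main() -> int:
        try:
            payload = json.load(sys.stdin)
        except json.JSONDecodeError:
            return 0  # unrecognized payload; stay out of the way

        if payload.get("tool_name") != "Bash":
            return 0  # only shell commands are inspected here

        command = payload.get("tool_input", {}).get("command", "")
        for pattern in DESTRUCTIVE_PATTERNS:
            if re.search(pattern, command):
                print(f"Blocked potentially destructive command: {command}",
                      file=sys.stderr)
                return 2  # non-zero exit blocks the tool call

        return 0  # everything else is allowed

    if __name__ == "__main__":
        sys.exit(main())

The point of this arrangement is that the veto happens outside the model, in ordinary code the model cannot talk its way around, rather than in instructions it is merely asked to follow.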
Perhaps one day this category of plugin will not be needed. Until then, I would be hard-pressed to trust an LLM-based product with destructive filesystem capabilities based solely on the hope that it is "being trained in a way to steer agentic behavior from reasoned token generation."