I'd be interested to see how this does on Nicholas Carlini's benchmark:

https://nicholas.carlini.com/writing/2024/my-benchmark-for-l...

I've tried out some of my own little test prompts, but most of those are tricky rather than practical. At least for my inputs, it doesn't seem to do better than other top models, but I'm hesitant to draw conclusions before seeing outputs on more realistic tasks. It does feel like it's at least in the ballpark of GPT-4/Claude/etc. Even if it's not actually GPT-4.5 or whatever, it's still an interesting mystery what this model is and where it came from.