I'm a fan of Anthropic for this reason. I use Claude and it's very good most of the time for my coding requirements.
Generally when you have a lot of companies competing to show whos product X does the best at Y, there's a lot of monetary incentives to manipulate the products to perform well specifically on those types of tests.
Generally when you have a lot of companies competing to show whos product X does the best at Y, there's a lot of monetary incentives to manipulate the products to perform well specifically on those types of tests.