It's not clear that making it worse against an artificial benchmark has anything to do with real-world software. Telling the LLM how to run the test suites, which underlying APIs to use, and so on all seem like valid needs for some kind of instruction, and short of writing it into every single prompt, that seems like the only approach.