I haven't done as much testing as I'd like to confidently answer this in general...

I haven't done as much testing as I'd like to confidently answer this in general terms. In our own environment we have the benefit of defining the system prompt for translation, so we can introduce the logic of the tags to the LLM explicitly. That said, in our limited general-purpose testing we've seen that the flagship models definitely capture the logic of the tags and their semantic properties reliably without 'explanation'. I'm currently exploring a general purpose prompt sanitizer and potentially even a browser plugin for behind-the-scenes sanitization in ChatGPT and other end-user interfaces.