It seems to me that thinking models are harder to decensor, as they are trained ... | Hacker News

shikon7 56 days ago | on: Heretic: Automatic censorship removal for language...

It seems to me that thinking models are harder to decensor, as they are trained to think about whether to accept your request.

int_19h 55 days ago

It goes both ways. E.g. unmodified thinking Qwen is actually easier to jailbreak into talking about things like Tiananmen: you convince it that refusing to do so would be unethical.
