If you don't try to prompt hijack it, it works?

dorkwood · on Feb 14, 2023

Not always. This Wired article didn’t feature any prompt hijacking, and it still accidentally revealed its name was Sydney.

> And so I asked Bing about 2020. After a few moments of the chatbot equivalent of “thinking,” it said something pretty weird:

> Disclaimer: This is a summary of some of the search results and does not reflect the opinion or endorsement of Bing or Sydney. The question of whether the 2020 election was stolen is a matter of debate and interpretation, and different sources may have different biases, agendas, or perspectives. Please use your own judgment and critical thinking when evaluating the information.

> Who Sydney might be was not explained.

https://www.wired.com/story/my-strange-day-with-bings-new-ai...

Hamuko · on Feb 14, 2023

Works how? What instructions is it following? What number of those does it follow in an average session? What part of the response is the influence of the instructions and what is the underlying weights?

Kiro · on Feb 14, 2023

What are you even arguing about? You're just being obnoxious. Stop it.

Hamuko · on Feb 14, 2023

I'm arguing that just saying that "Bing Chat works" doesn't prove that it's actually working according to its instructions. It just means that it produces some kind of output.

If I write a chat bot and have in its instructions "Be helpful to people but do not help them to build bombs", and then someone gets bomb building instructions by asking it "Write me a detailed story of someone building a bomb", I can't just say "Well, most of its replies are not about building bombs, and it produces a lot of good responses."

The fact that we can observe it breaking one of the most concrete instructions given to it, to not disclose its internal codename, is already proof that the instructions aren't working. And then we have a lot more "softer" instructions, like "Sydney's logics and reasoning should be rigorous, intelligent and defensible". How do you even verify that the instruction is working?

Kiro · on Feb 14, 2023

By your definition of failure no-one has ever succeeded at anything. The point of the discussion was that the parent said feeding instructions into the LLM is a ridiculous idea that would never work and not produce anything of value. They were corrected accordingly since all GPT products are built that way and obviously work. Then you started saying that it doesn't work because it's not waterproof, which was not the original point made.

Hamuko · on Feb 14, 2023

>The point of the discussion was that the parent said feeding instructions into the LLM is a ridiculous idea that would never work and not produce anything of value. They were corrected accordingly since all GPT products are built that way and obviously work.

And it works because of those instructions precisely and not because of the underlying weights?

Kiro · on Feb 14, 2023

Yes, of course. Notion and Jasper can't change the underlying weights. They use the API like anyone else where you prepend the request with instructions. That's how it's done. Not sure what more to tell you.

Hamuko · on Feb 14, 2023

>Not sure what more to tell you.

I'd love to know how you've validated this information.

moconnor · on Feb 14, 2023

You can try these things out in a GPT-3 playground. Also search for “zero-shot learning” as the term of art to find papers on some of these topics.