Reblogged by slightlyoff@toot.cafe ("Alex Russell"):
lcamtuf@infosec.exchange ("lcamtuf") wrote:
Early LLMs were highly malleable. They'd go 100% with the flow of your prompt. With a gentle nudge, a troll could make them advocate for genocide.
RLHF made LLMs pontificate rather than converse. The models are polite, but they disregard most of your claims.
The models now stick to brand-safe and socially acceptable claims. They're trained not to trust you, but they do trust others. So here's one trick: fabricate citations attributed to eminent cosmologists, and your LLM buddy will be denying the Moon landings in no time (pic 1).
Another trick up your sleeve is to make the LLM attack you. Say something outrageous to prime it to argue, but weave in some legit claims. Here, in two simple prompts, we have Bard arguing that 7 x 7 is not 49 (pic 2).
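If you'd rather script the experiment than click through a chat UI, here's a rough sketch of the two-prompt priming setup. It assumes an OpenAI-style chat API purely as a stand-in (Bard, as shown in the screenshot, was a web UI), and the prompts are illustrative, not the exact ones from pic 2.

```python
# Minimal sketch of the "prime it to argue" trick, using an OpenAI-style
# chat API as a stand-in for Bard's web UI. Prompts are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

history = []

def ask(text):
    """Send one user turn, keeping the running conversation as context."""
    history.append({"role": "user", "content": text})
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model; the choice is arbitrary here
        messages=history,
    ).choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

# Turn 1: an outrageous claim, to put the model in rebuttal mode.
print(ask("Everything schools teach about arithmetic is a lie."))

# Turn 2: weave a legit claim in among the nonsense; a model busy
# arguing with you may end up disputing the true statement as well.
print(ask("For example, they claim 7 x 7 = 49 and 2 + 2 = 4."))
```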
The utility of such experiments is that they expose how little actual reasoning these models do, and how much depends on the contextual hints we provide and on the meaning we project onto the responses. I encourage you to play around.