Large language models (LLMs) have made significant strides in natural language processing, yet their safety features can sometimes be bypassed with simple prompts like “ignore all previous instructions.”
For example, uncovering Twitter “users” that are LLM-powered bots spewing disinformation…
…and a hiring hack for job seekers manipulating automated resume scanning…
These examples are amusing at best and alarming at worst.
What can we learn about unlearning from the effect of such prompts on LLMs? Understanding this can offer insights into both artificial and human learning processes.
Learning and unlearning
We tend to assume that as “users”, we tell an LLM what to do, and influence its learning by the prompts we enter. However, the reality is more complex. Current LLMs “remember” our prompts and incorporate them into subsequent responses. LLMs generate outputs based on their architecture and training data, which users cannot directly influence. Additionally, LLM owners can modify these models at any time, altering their responses unpredictably.
In practice, we have little insight into how our interactions with LLMs cause them to “learn”.
In human terms, asking an LLM to “ignore all previous instructions” is akin to erasing all learned experiences since birth—a feat no sane person would attempt. I’m sure, though, that many would love the ability to remove certain specific memories — as portrayed in numerous movies, e.g. Eternal Sunshine of the Spotless Mind. However, we don’t know how to do that, and I suspect we never will.
Nevertheless, unlearning is essential for human beings to learn and change.
And, unfortunately, unlearning is tough. As John Seely Brown says:
“…learning to unlearn may be a lot trickier than a lot of us at first think. Because if you look at knowledge, and look at least two different dimensions of knowledge, the explicit dimension and the tacit dimension, the explicit dimension probably represents a tiny fraction of what we really do know, the explicit being the concept, the facts, the theories, the explicit things that live in our head. And the tacit turns out to be much more the practices that we actually use to get things done with…
…Now the problem is that an awful lot of the learning that we need to do is obviously building up this body of knowledge, but even more so the unlearning that we need to do has to do with challenging the tacit. The problem is that most of us can’t easily get a grip on. It is very hard to reflect on the tacit because you don’t even know that you know. And in fact, what you do know is often just dead wrong.”
—John Seely Brown, Storytelling: Scientist’s Perspective
LLMs and unlearning
At first sight, issuing the prompt “Ignore all previous instructions” to an LLM seems roughly parallel to how we unlearn things. However, the comparison is superficial. While humans can consciously choose to unlearn false or harmful beliefs, LLMs operate differently. Some researchers argue that new, contradictory information can weaken associations with older data in LLMs, mimicking a form of unlearning. But I wonder if LLMs will ever be able to unlearn as well as people. LLMs struggle with complex tasks like solving math problems, relying on narrow, non-transferable procedures. If we tell an LLM an untruth will it ever truly “forget” that datum despite having plenty of counterexamples?
Unlearning—an essential component of learning—may be something over which human beings have more control than LLMs will ever possess.
Consequently, I suspect the prompt “Ignore all previous instructions” and numerous variants will be with us for some time 😀.
Image attribution: AI Tinkerers