How well can LLMs hide their thoughts from probes?
Linear probes are a promising way to interpret the workings of LLMs. Importantly, they have been shown to enable fast and affordable monitoring of harmful behaviours, and as a result they are currently a popular technique among safety researchers.
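To make concrete what is being attacked, here is a minimal sketch of a linear probe: a simple linear classifier trained on a model's hidden activations. The activations below are synthetic stand-ins for real LLM hidden states, and the dimension, dataset sizes, and class separation are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden-state dimension

# Synthetic "benign" vs "harmful" activations, separated along one direction
# (standing in for activations collected from a real model).
direction = rng.normal(size=d)
benign = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + 2.0 * direction

# Difference-of-means probe: score activations along the vector between
# the class means, with the decision boundary at the midpoint.
w = harmful.mean(axis=0) - benign.mean(axis=0)
b = -w @ (harmful.mean(axis=0) + benign.mean(axis=0)) / 2

def probe(x):
    """Return True where an activation is flagged as harmful."""
    return x @ w + b > 0

acc = np.concatenate([~probe(benign), probe(harmful)]).mean()
print(f"probe accuracy: {acc:.2f}")
```

Because the probe is just a direction `w` in activation space, it is cheap to run at inference time, which is exactly what makes it attractive for monitoring and, as the papers below show, a tempting target for evasion.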
However, recent papers suggest that probe-based monitoring might be fragile:
- Obfuscated Activations Bypass LLM Latent-Space Defenses showed that injecting obfuscating tokens can evade the probes.
- Neural Chameleons showed that, with a suitable loss term during training, models can be triggered to evade the probes.
- RL-OBFUSCATION showed that models can learn to evade probes via reinforcement learning.
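The training-time attacks above share a common shape: alongside the task loss, add a term that pushes activations toward the benign side of a frozen probe's boundary. The toy sketch below illustrates this with gradient steps on a single activation vector against a fixed linear probe; the probe weights, activation, step size, and weighting are all invented for illustration, and a real attack would backpropagate through the model's parameters rather than the activation directly.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(1)

# Hypothetical frozen probe (w, b) and an activation h the probe
# currently flags as harmful.
w, b = rng.normal(size=8), 0.0
h = rng.normal(size=8) + w

lam = 1.0  # weight on the obfuscation term relative to the task loss
for _ in range(100):
    p = sigmoid(h @ w + b)   # probe's "harmful" score for h
    # Obfuscation term: cross-entropy against the benign label (0),
    # i.e. -log(1 - p); its gradient w.r.t. h is p * w.
    grad = lam * p * w
    h -= 0.1 * grad          # gradient step on the activation

print(f"final probe score: {sigmoid(h @ w + b):.3f}")
```

After optimisation the probe score drops close to zero even though nothing constrains the activation to stop encoding harmful content, which is why the question of what these obfuscated representations actually look like is open.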
We want to build on this work and answer:
What types of models do we get from Adversarial Obfuscation Training, and are some defences better for dealing with this?
Some research directions we might take:
- Do we get qualitatively different policies from RL Obfuscation when training on other datasets (e.g. Dolus Chat)?
- Interpret Obfuscated Activations and Neural Chameleons.
- Try defences on backdoored models from RL Obfuscation.