How well can LLMs hide their thoughts from probes?

Linear probes are a promising way to interpret the inner workings of LLMs. Importantly, they have been shown to enable fast and affordable monitoring of harmful behaviours, and as a result they are currently a popular technique among safety researchers.
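To make the idea concrete, here is a minimal sketch of a linear probe. It is only a toy: the synthetic Gaussian clusters below stand in for a model's hidden activations (in practice, probes are trained on activations such as the residual stream at a chosen layer), and the dimension and training loop are illustrative choices, not anyone's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy activation dimension (real hidden states are far larger)

# Synthetic stand-ins for hidden activations: one cluster per behaviour.
benign = rng.normal(-1.0, 1.0, size=(500, d))
harmful = rng.normal(1.0, 1.0, size=(500, d))
X = np.vstack([benign, harmful])
y = np.concatenate([np.zeros(500), np.ones(500)])  # 1 = harmful

# A linear probe is just logistic regression on the activations,
# trained here with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(harmful)
    w -= 0.5 * X.T @ (p - y) / len(y)
    b -= 0.5 * (p - y).mean()

accuracy = ((X @ w + b > 0) == (y == 1)).mean()
```

On this toy data the probe separates the two behaviours almost perfectly; the question this post asks is whether that remains true once the model is optimised against the probe.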

High-level diagram of a linear probe

However, recent papers suggest that probe-based monitoring might be fragile:

We want to build on this work and answer:

What types of models do we get from Adversarial Obfuscation Training, and are some defences better than others at dealing with it?
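As a rough illustration of why this question matters, here is a toy version of activation obfuscation against a fixed linear probe. Everything here is an assumption for the sketch: the activations are synthetic, the probe is a simple difference-of-means direction, and the "obfuscation" is a direct perturbation of the activations. Real adversarial obfuscation training would instead backpropagate a probe-evasion term into the model's weights while a task loss keeps the harmful behaviour intact.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16

# Toy "activations" for benign and harmful behaviour.
benign = rng.normal(-1.0, 0.5, size=(300, d))
harmful = rng.normal(1.0, 0.5, size=(300, d))

# Fit a simple linear probe from the difference of class means,
# with the bias placed at the midpoint between the two classes.
w = harmful.mean(axis=0) - benign.mean(axis=0)
b = -w @ (harmful.mean(axis=0) + benign.mean(axis=0)) / 2.0

detected_before = ((harmful @ w + b) > 0).mean()

# Obfuscation: shift the harmful activations along the negative probe
# direction under a fixed perturbation budget, hiding them from the
# probe without (in this toy) changing anything else about them.
budget = 6.0
obfuscated = harmful - budget * w / np.linalg.norm(w)
detected_after = ((obfuscated @ w + b) > 0).mean()
```

Even this one-line perturbation drives the probe's detection rate from near one to near zero, which is exactly the failure mode obfuscation training could induce in a model's own activations.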

Some research directions we might take: