AI Safety Projects

Interactive experiments and code visualizations.

How well can LLMs hide their thoughts from probes?

Linear probes are a promising way to interpret the inner workings of LLMs. Importantly, they have been shown to allow fast and affordable monitoring of harmful behaviours and as a…