How well can LLMs hide their thoughts from probes?
Linear probes are a promising way to interpret the inner workings of LLMs. Importantly, they have been shown to allow fast and affordable monitoring of harmful behaviours and as a…
Interactive experiments and code visualizations.