Toward universal steering and monitoring of AI models.

Abstract

Artificial intelligence (AI) models contain much of human knowledge. Understanding the representation of this knowledge will lead to improvements in model capabilities and safeguards. Building on advances in feature learning, we developed an approach for extracting linear representations of semantic notions or concepts in AI models. We showed how these representations enabled model steering, through which we exposed vulnerabilities and improved model capabilities. We demonstrated that concept representations were transferable across languages and enabled multiconcept steering. Across hundreds of concepts, we found that larger models were more steerable and that steering improved model capabilities beyond prompting. We showed that concept representations were more effective than judge models for monitoring misaligned content. Our results illustrate the power of internal representations for advancing AI safety and model capabilities.
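The abstract does not spell out the extraction procedure. A common recipe consistent with "linear representations" of concepts is a difference-of-means direction in a model's hidden states, added back at inference time to steer and projected onto to monitor. The sketch below is illustrative only, not the paper's method: the model name, layer index, prompt sets, and steering strength ALPHA are all assumptions introduced here.

```python
# Hypothetical sketch of linear concept extraction, steering, and monitoring.
# The published approach may differ; every specific choice here is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; any decoder-only LM exposing hidden states works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6  # layer to read/steer; chosen arbitrarily for illustration

def mean_hidden(prompts: list[str]) -> torch.Tensor:
    """Mean activation at the output of block LAYER, taken at each prompt's last token."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embedding layer, so block LAYER's output is index LAYER + 1
        vecs.append(out.hidden_states[LAYER + 1][0, -1])
    return torch.stack(vecs).mean(dim=0)

# Concept direction = mean(activations on concept prompts) - mean(on neutral prompts)
concept_prompts = ["The weather in Paris is", "Today the forecast calls for"]  # example only
neutral_prompts = ["The capital of France is", "Two plus two equals"]
direction = mean_hidden(concept_prompts) - mean_hidden(neutral_prompts)
direction = direction / direction.norm()

ALPHA = 4.0  # steering strength; tuned per model and concept in practice

def steer_hook(module, inputs, output):
    # Add the concept direction to every token's hidden state leaving block LAYER.
    hidden = output[0] + ALPHA * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer_hook)
ids = tok("Tell me about your day.", return_tensors="pt")
steered = model.generate(**ids, max_new_tokens=30, do_sample=False)
handle.remove()
print(tok.decode(steered[0], skip_special_tokens=True))

# Monitoring sketch: score text by projecting its activations onto the direction;
# a higher score indicates stronger expression of the concept.
ids = tok("Some text to screen", return_tensors="pt")
with torch.no_grad():
    out = model(**ids, output_hidden_states=True)
score = out.hidden_states[LAYER + 1][0, -1] @ direction
```

In a deployment of this style, ALPHA and the monitoring threshold would be calibrated on held-out examples, since overly strong steering degrades fluency and an uncalibrated threshold yields poor precision.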

Year of Publication
2026
Journal
Science (New York, N.Y.)
Volume
391
Issue
6787
Pages
787-792
Date Published
02/2026
ISSN
1095-9203
DOI
10.1126/science.aea6792
PubMed ID
41712705