Toward universal steering and monitoring of AI models.
| Authors | |
| Abstract | Artificial intelligence (AI) models contain much of human knowledge. Understanding the representation of this knowledge will lead to improvements in model capabilities and safeguards. Building on advances in feature learning, we developed an approach for extracting linear representations of semantic notions or concepts in AI models. We showed how these representations enabled model steering, through which we exposed vulnerabilities and improved model capabilities. We demonstrated that concept representations were transferable across languages and enabled multiconcept steering. Across hundreds of concepts, we found that larger models were more steerable and that steering improved model capabilities beyond prompting. We showed that concept representations were more effective for monitoring misaligned content than using judge models. Our results illustrate the power of internal representations for advancing AI safety and model capabilities. |
| Year of Publication | 2026 |
| Journal | Science (New York, N.Y.) |
| Volume | 391 |
| Issue | 6787 |
| Pages | 787-792 |
| Date Published | 02/2026 |
| ISSN | 1095-9203 |
| DOI | 10.1126/science.aea6792 |
| PubMed ID | 41712705 |
| Links | |
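For readers unfamiliar with linear concept representations, the sketch below illustrates the general idea of a difference-of-means concept vector used for activation steering and monitoring. It is a minimal illustration of the generic technique, not the paper's implementation; the activations are synthetic and all names (`hidden_dim`, `concept_direction`, `steer`, `concept_score`) are hypothetical.

```python
import numpy as np

# Illustrative sketch (not the paper's implementation): estimate a linear
# "concept direction" as the difference of mean hidden activations between
# examples that express a concept and examples that do not.
# Synthetic activations stand in for a real model's hidden states.

rng = np.random.default_rng(0)
hidden_dim = 64

# Hypothetical activations: rows are examples, columns are hidden units.
acts_with_concept = rng.normal(loc=0.5, size=(100, hidden_dim))
acts_without_concept = rng.normal(loc=0.0, size=(100, hidden_dim))

# Difference-of-means concept vector, normalized to unit length.
concept_direction = acts_with_concept.mean(axis=0) - acts_without_concept.mean(axis=0)
concept_direction /= np.linalg.norm(concept_direction)


def steer(hidden_state: np.ndarray, direction: np.ndarray, strength: float) -> np.ndarray:
    """Add a scaled concept direction to a hidden state (activation steering)."""
    return hidden_state + strength * direction


def concept_score(hidden_state: np.ndarray, direction: np.ndarray) -> float:
    """Project a hidden state onto the concept direction (monitoring signal)."""
    return float(hidden_state @ direction)


# Example usage on one synthetic hidden state.
h = rng.normal(size=hidden_dim)
print("score before steering:", concept_score(h, concept_direction))
h_steered = steer(h, concept_direction, strength=4.0)
print("score after steering: ", concept_score(h_steered, concept_direction))
```

In this toy setup, steering shifts the projection of the hidden state along the concept direction, and thresholding that projection gives a simple monitoring signal for whether a concept is present.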