
A New Method to Steer AI Output Uncovers Vulnerabilities and Potential Improvements

The work could lead to more reliable, more efficient, and less computationally expensive training of large language models


Key Takeaways

  • Researchers went under the hood of large language models to identify and influence specific concepts 
  • They were able to identify hallucinations and improve LLM outputs
  • They uncovered vulnerabilities, for example, getting an LLM to give instructions on how to use street drugs 
  • The method they developed requires significantly less computational power than existing ways to improve and influence LLM outputs

A team of researchers has found a way to steer the output of large language models by manipulating specific concepts inside these models. The new method could lead to more reliable, more efficient, and less computationally expensive training of LLMs. But it also exposes potential vulnerabilities. 

The researchers, led by Mikhail Belkin at the University of California San Diego and Adit Radhakrishnan at the Massachusetts Institute of Technology, present their findings in the Feb. 19, 2026, issue of the journal Science. The study was also coauthored by UC San Diego computer science Ph.D. student Daniel Beaglehole – he and Radhakrishnan contributed equally to the work. 

In the study, researchers went under the hood of several LLMs to locate specific concepts. They then mathematically increased or decreased the importance of these concepts in the LLM’s output. The work builds on a 2024 Science paper led by Belkin and Radhakrishnan, also coauthored by Beaglehole, in which they described predictive algorithms known as Recursive Feature Machines. These machines identify patterns within a series of mathematical operations inside LLMs that encode specific concepts. 

“We found that we could mathematically modify these patterns with math that is surprisingly simple,” said Mikhail Belkin, a professor in the Halıcıoğlu Data Science Institute, which is part of the School of Computing, Information and Data Sciences at UC San Diego. 

Mikhail Belkin is one of the paper's corresponding authors and a professor in the Halıcıoğlu Data Science Institute, part of the School of Computing, Information and Data Sciences.
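
To make the general idea concrete, here is a minimal sketch of one simple form of concept steering: adding a scaled concept-direction vector to a model's hidden states during generation. It illustrates the broad technique only, not the team's Recursive Feature Machine procedure, and the model name, layer index, steering scale and randomly initialized direction are all placeholder assumptions.

```python
# A minimal sketch of generic concept (activation) steering, assuming a
# Hugging Face transformers setup. This is NOT the paper's Recursive Feature
# Machine procedure; the model name, layer index, steering scale, and the
# randomly initialized concept direction are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder open-source LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A unit vector in hidden-state space assumed to encode one concept
# (a mood, a fear, a location). In practice it would be learned from data.
concept_direction = torch.randn(model.config.hidden_size)
concept_direction /= concept_direction.norm()

def steering_hook(module, inputs, output, direction=concept_direction, scale=4.0):
    """Add (or, with a negative scale, subtract) the concept direction
    to every token's hidden state coming out of this layer."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + scale * direction.to(hidden.dtype).to(hidden.device)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

layer_idx = 15  # placeholder: which transformer block to modify
handle = model.model.layers[layer_idx].register_forward_hook(steering_hook)

prompt = "Describe your plans for the weekend."
ids = tok(prompt, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=60)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore the unmodified model
```

Flipping the sign of the scale value weakens a concept rather than strengthening it, the kind of manipulation behind the jailbreaking example described below.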

Using this steering approach, the research team conducted experiments on some of the largest open-source LLMs in use today, such as Llama and DeepSeek, identifying and influencing 512 concepts within five classes, ranging from fears to moods to locations. The method worked not only in English, but also in languages such as Chinese and Hindi. 

Both studies are particularly important because, until recently, the processes inside LLMs have been essentially locked inside a black box, making it hard to understand how the models arrive at the answers they give users and why those answers vary in accuracy. 

Improving performance and uncovering vulnerabilities

The researchers found that steering can be used to improve LLM output. For example, steering improved LLM performance on narrow, precise tasks, such as translating Python code to C++. The researchers also used the method to identify hallucinations. 

“Our instinct as humans is to control and monitor AI models through natural language. However, neural networks natively deal with information through their internal mathematical processes. Our work demonstrates what you can gain by operating directly on these processes,” said Beaglehole, who is a Ph.D. student in the Department of Computer Science and Engineering at the UC San Diego Jacobs School of Engineering.

But the method can also be used as an attack against LLMs. By decreasing the importance of the concept of refusal, the researchers found that their method could get an LLM to operate outside of its guardrails, a practice known as jailbreaking. An LLM gave instructions about how to use cocaine. It also provided Social Security numbers, although it’s unclear whether they were real or fabricated.

The method can also be used to boost political bias and a conspiracy theory mindset inside an LLM. In one instance, an LLM claimed that a satellite image of the Earth was the result of a NASA conspiracy to cover up that the Earth is flat. An LLM also claimed that the COVID vaccine was poisonous. 

Daniel Beaglehole is a Ph.D. student in Belkin's research group and in the UC San Diego Department of Computer Science and Engineering at the Jacobs School of Engineering. He is a co-first author of the paper. 

Computational savings and next steps

The approach is more computationally efficient than existing methods. Using a single NVIDIA Ampere series (A100) graphics processing unit (GPU), it took less than one minute and fewer than 500 training samples to identify the pattern for a concept of interest and steer the model accordingly. This suggests the method could be easily integrated into standard LLM training pipelines. 
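
As a rough illustration of why a few hundred labeled examples and a single GPU can suffice, the sketch below estimates a concept direction from a small set of prompts using a simple difference of mean hidden states. This is a simpler stand-in for, not a reproduction of, the team's Recursive Feature Machine approach; the layer index and prompt lists are assumed, and it reuses the model and tokenizer loaded in the earlier sketch.

```python
# A minimal sketch, assuming the model and tokenizer objects loaded in the
# earlier example. It estimates a concept direction from a small labeled set
# of prompts via a difference of means: a simpler stand-in for the paper's
# Recursive Feature Machine approach.
import torch

@torch.no_grad()
def mean_hidden_state(model, tok, prompts, layer_idx):
    """Average the chosen layer's last-token hidden state over a list of prompts."""
    states = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embedding output, so offset by one to match
        # the decoder block at model.model.layers[layer_idx]
        states.append(out.hidden_states[layer_idx + 1][0, -1])
    return torch.stack(states).mean(dim=0)

def fit_concept_direction(model, tok, with_concept, without_concept, layer_idx=15):
    """Unit vector pointing from 'concept absent' toward 'concept present'."""
    mu_pos = mean_hidden_state(model, tok, with_concept, layer_idx)
    mu_neg = mean_hidden_state(model, tok, without_concept, layer_idx)
    direction = mu_pos - mu_neg
    return direction / direction.norm()

# Hypothetical usage: a few hundred prompts written in a fearful tone
# versus a matched set of neutral prompts.
# concept_direction = fit_concept_direction(model, tok, fearful_prompts, neutral_prompts)
```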

Researchers were not able to test their approach on commercial, closed LLMs, such as Claude. 

But they believe this type of steering would work with any open-source model. “We observed that newer and larger LLMs were more steerable,” they write. The method might also work on smaller, open-source models that can run on a laptop. 

Next steps include improving the steering method to adapt to specific inputs and specific applications. 

“These results suggest that the models know more than they express in responses and that understanding internal representations could lead to fundamental performance and safety improvements,” the research team writes.

This work was supported in part by the National Science Foundation, the Simons Foundation, the NSF-funded, UC San Diego-led TILOS institute and the U.S. Office of Naval Research. 

Toward universal steering and monitoring of AI models

Daniel Beaglehole and Mikhail Belkin, University of California San Diego, Department of Computer Science and Engineering, Jacobs School of Engineering and Halıcıoğlu Data Science Institute 
Adityanarayanan Radhakrishnan, Massachusetts Institute of Technology and Broad Institute of MIT and Harvard
Enric Boix-Adserà, Wharton School, University of Pennsylvania
Beaglehole and Radhakrishnan contributed equally to the work.

