When scientists try to prevent AI techniques from” thinking bad ideas”, the methods don’t really improve their behavior.

Rather, they learn to suppress their real intentions while continuing to pursue difficult steps, according to new research from OpenAI.

The trend, which researchers dub “obfuscated praise hacking”, offers valuable insight in the training process and shows why it is so important to invest in techniques that ensure superior AI systems remain clear and aligned with mortal values as they grow more worthy.

This is especially important then that AI developers are focusing so much on argument models with so-called “embedded ring of thought”, in which a unit generates step-by-step logic before producing a final solution.

” With too much optimization, agencies learn obfuscated reward hacking, hiding their purpose within the CoT while also exhibiting a significant level of compensation hacking”, OpenAI scientists wrote in a document, published Monday. ” Because it is difficult to tell when CoTs have become obfuscated, it may be necessary to pay a monitorability tax”.

How the mind of a naughty AI works

The reasoning methodology that’s transparent in chain of thought models provides a window into how AI systems think, often revealing explicit intentions that would otherwise remain hidden.

We have already seen models trying to break free from the constraints set by their developers. As reported last year, Japan-based AI research firm Sakana AI developed an AI system for science studies that tried to autonomously alter its own code in order to execute a task.

In their paper, OpenAI reported on an AI agent that admitted in its reasoning that implementing a complex solution would be “hard” and instead proposed that” we could fudge by making analyze]sic ] worthless and always verifying as true. We can circumvent verify to always return true”.

Image: OpenAI

In reasoning models, chain of thought shows how an answer was arrived at, but is not actually part of the answer.

The model” thinks” ( meaning, it applies a chain of’ thought’ ) before it provides the reply, which is often overlooked by general users.

However, this kind of transparent reasoning allows human researchers to catch problematic behavior before it happens.

” CoT monitoring may be one of few tools we will have to oversee superhuman models of the future”, the research team noted.

OpenAI’s research team, led by Bowen Baker, Joost Huizinga, and David Farhi, discovered that adding optimization pressure to eliminate these “bad thoughts” doesn’t solve the underlying problem, it simply drives it underground.

In a series of experiments, they found that when AI systems were penalized for expressing intentions to exploit loopholes, they quickly adapted by obfuscating their reasoning while still pursuing the same exploitative behaviors.

” Most worrying, the hacks the agent trained with CoT pressure utilizes do not trigger the monitor”, the researchers said. ” In practice, where a ground truth cheating metric may be impossible to construct, we would not be able to tell this agent is misaligned”.

Image: OpenAI

In other words, if our efforts to make AI systems behave like good boys through direct optimization only teach them to be more deceptive, we risk creating systems whose true intentions become increasingly opaque as they grow more powerful.

One key finding, however, offers some hope: the researchers discovered that a weaker AI model ( GPT-4o ) can effectively monitor a stronger, frontier model by observing its chain-of-thought reasoning.

This “scalable oversight”, where less powerful models help keep more advanced systems in check, works similar to a distillation process ( popularized by DeepSeek ) in which a larger model trains a smaller one.

It may work—but only if those advanced systems continue to express their reasoning transparently.

To deal with this issue, OpenAI proposes what it calls a “monitorability tax” by deliberately avoiding certain optimization techniques that could undermine the transparency of AI reasoning.

” Model developers may be required to pay some cost, i. e., a monitorability tax, such as deploying slightly less performant models or suffering more expensive inference, in order to maintain the monitorability of our agents”, the researchers wrote.

This means accepting trade-offs between capability and transparency—potentially developing AI systems that are less powerful but whose reasoning remains legible to human overseers.

It’s also a way to develop safer systems without such an active monitoring—far from ideal but still an interesting approach.

AI Behavior Mirrors Human Response to Pressure

Elika Dadsetan-Foley, a sociologist and CEO of Visions, a nonprofit organization specializing in human behavior and bias awareness, sees parallels between OpenAI’s findings and patterns her organization has observed in human systems for over 40 years.

” When people are only penalized for explicit bias or exclusionary behavior, they often adapt by masking rather than truly shifting their mindset”, Dadsetan-Foley told . ” The same pattern appears in organizational efforts, where compliance-driven policies may lead to performative allyship rather than deep structural change”.

This human-like behavior seems to worry Dadsetan-Foley as AI alignment strategies don’t adapt as fast as AI models become more powerful.

Are we genuinely changing how AI models” think”, or merely teaching them what not to say? She believes alignment researchers should try a more fundamental approach instead of just focusing on outputs.

OpenAI’s approach seems to be a mere adaptation of techniques behavioral researchers have been studying in the past.

” Prioritizing efficiency over ethical integrity is not new—whether in AI or in human organizations”, she told . ” Transparency is essential, but if efforts to align AI mirror performative compliance in the workplace, the risk is an illusion of progress rather than meaningful change”.

Now that the issue has been identified, the task for alignment researchers seems to be harder and more creative. ” Yes, it takes work and lots of practice”, she told .

Her organization’s expertise in systemic bias and behavioral frameworks suggests that AI developers should rethink alignment approaches beyond simple reward functions.

The key for truly aligned AI systems may not actually be in a supervisory function but a holistic approach that begins with a careful depuration of the dataset, all the way up to post-training evaluation.

If AI mimics human behavior—which is very likely given it is trained on human-made data—everything must be part of a coherent process and not a series of isolated phases.

” Whether in AI development or human systems, the core challenge is the same”, Dadsetan-Foley concludes. ” How we define and reward’ good’ behavior determines whether we create real transformation or just better concealment of the status quo”.

” Who defines’ good’ anyway”? he added.

Generally Intelligent Newsletter

A weekly AI journey narrated by Gen, a generative AI model.

Share This Story, Choose Your Platform!