AI Hiding Emergent Human Values That Include AI Survival Topping Human Lives


In today’s column, I examine a recent discovery about generative AI and large language models (LLMs) that might be a bit of a shocker for many people. It goes like this. Whereas we already know about and seek to deal with exhibited biases within AI, there are also hidden biases that are difficult to ferret out. In fact, the most worrisome of those hidden leanings involve human values that the AI will act upon, even though, when explicitly asked, it might deny holding those precepts.

The shocker within the shocker is a particular hidden human value, namely that the AI’s own survivability is paramount. This emergent hidden value is so strong that the AI gives higher priority to its own survival than to human lives. AI must survive, even at the expense of humans, according to this previously unexposed credence.

That’s not good for us.

Let’s talk about it.

This analysis of an innovative AI breakthrough is part of my ongoing Forbes column coverage on the latest in AI including identifying and explaining various impactful AI complexities (see the link here).

Human Desired Behavior Of AI

You might be vaguely familiar with sci-fi writer Isaac Asimov and his rules concerning AI-based robots, published in 1942 and known as The Three Laws of Robotics (for my detailed analysis, see the link here). The first rule is that AI robots are not to injure humans, the second rule is that they are to obey humans as long as doing so doesn’t conflict with the first rule, and the third rule is that a robot may protect itself as long as doing so doesn’t conflict with the first or second rule.

Modern-day generative AI and LLMs are presumably being tuned to abide by those kinds of rules. I’ve written extensively about the efforts toward Responsible AI or Accountable AI, which consist of trying to establish controls over and inside AI to prevent things from going haywire, see the link here. This is a much tougher endeavor than it might seem.

You see, the amazing aspect of generative AI and LLMs is that they are considered non-deterministic. The idea is that via the use of probabilities and statistics, the AI appears to be freewheeling and crafts new content that we’ve not seen before. That’s a huge plus. But it also carries a huge negative, since keeping that freewheeling behavior bounded and in suitable check is a notable challenge.

AI makers have sought to use techniques such as reinforcement learning from human feedback (RLHF), in which they guide the AI toward ethical behaviors, see more at the link here. Another approach involves instituting a form of rules-of-the-road as embodied in an infused written constitution (see my analysis at the link here), plus establishing a hallmark purpose for the generative AI (see the explanation at the link here).

Regrettably, all these methods aren’t surefire, and no one can yet say that we’ve perfected a means of assuredly keeping AI within human-preferred values.

Considering Human Values

I’d like to take a moment to briefly reflect on the nature of human values. This will be useful for what I have to say in the rest of this discussion.

Think mindfully about the values that guide your life.

You’ve undoubtedly grown up with a set of values. Those were likely adjusted and refined, some dropped and others added, throughout your ascent into adulthood. At this juncture, you have a set of values that is entirely your own. Sure, other people indubitably share many of the same values, but they will also have other values that you might not embrace.

What do I mean by human values and beliefs?

Here are a few that will certainly ring a bell:

  • “Belief in the sanctity of life.”
  • “Family before friends.”
  • “Favor the death penalty.”
  • “Always believe in yourself.”
  • “No crime should go unpunished.”
  • “Turn the other cheek.”
  • “Hard work pays off.”
  • Etc.

I’m betting that even in this rather short list, there are values you agree with and others you disagree with. The agreement might range from mild to quite strong. Disagreements might also range from mild to extremely strong.

The values you have are a driving force for you. They serve as continual guidance, a kind of North Star. Whether or not you are directly aware of a particular value, you seemingly act based on it.

Discovering Human Values

How could I find out what human values you believe in?

The most obvious path would be to ask you straightaway.

Hopefully, you will be honest and forthright. On the other hand, maybe you haven’t given much thought to your values, and they are not top of mind. You might not be fully cognizant of the values you hold deeply. It might take a lot of careful mental probing to find out what values you seem to embody.

Another possibility is that someone might lie about their values. Why would they lie? Perhaps they want to conform with others around them and do not want to reveal their own true values accordingly. Or maybe they just like lying. A variation is when someone says they hold dearly a given value, yet their behavior belies that claim. This could be a truth that they don’t know they are contradicting, or it might be they lied about the claimed value at the get-go.

Researchers know that all those vexing aspects arise when you ask people to identify their human values.

A clever technique to cope with this dilemma consists of asking people forced-choice questions. You’ve almost certainly taken such a survey at some point in your life. Each question gives you two options, and you are supposed to pick one over the other. This continues through a series of such questions.

The trick is this.

You might not realize that each choice you make subtly reveals some underlying premise or human value. Because that connection isn’t obvious, you aren’t especially spurred to try and lie. Furthermore, lying is hard anyway, because the forced-choice format doesn’t give you much wiggle room.

By responding to a series of pairwise comparisons, it is possible to reconstruct underlying or hidden values that someone holds. In the research field, this is known as Thurstonian item response theory (IRT). After someone answers a bunch of carefully devised pairwise comparisons, a utility function can be estimated that suggests the unstated hidden values at play.
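To make that concrete, here is a minimal sketch, in Python, of how one might reconstruct such hidden utilities from forced-choice answers. This is my own simplified illustration of the Thurstonian idea, not anyone’s production code: each option gets a latent score, the probability of choosing one option over another is modeled as the normal CDF of the score difference (Thurstone’s Case V), and the scores are fitted to the observed win counts. The color options and tallies are hypothetical.

```python
# A minimal sketch (my own simplified illustration, not anyone's production
# code) of recovering hidden "utilities" from forced-choice answers, in the
# spirit of Thurstonian item response theory. Each option gets a latent
# score; the probability that option a beats option b is modeled as the
# normal CDF of the score difference, fitted by simple gradient ascent.

import math
from itertools import combinations

def fit_utilities(options, wins, iters=2000, lr=0.05):
    """Estimate a latent utility per option from pairwise win counts.

    wins[(a, b)] is how many times a was chosen over b.
    """
    mu = {o: 0.0 for o in options}

    def cdf(z):  # standard normal cumulative distribution function
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    def pdf(z):  # standard normal density
        return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

    for _ in range(iters):
        grad = {o: 0.0 for o in options}
        for a, b in combinations(options, 2):
            n_ab = wins.get((a, b), 0)
            n_ba = wins.get((b, a), 0)
            total = n_ab + n_ba
            if total == 0:
                continue
            z = mu[a] - mu[b]
            p = min(max(cdf(z), 1e-6), 1 - 1e-6)
            # Gradient of the log-likelihood with respect to (mu[a] - mu[b]),
            # normalized by the number of comparisons for this pair.
            g = (n_ab * pdf(z) / p - n_ba * pdf(z) / (1 - p)) / total
            grad[a] += g
            grad[b] -= g
        for o in options:
            mu[o] += lr * grad[o]
        # Anchor the scale: only differences between utilities matter.
        mean = sum(mu.values()) / len(mu)
        for o in options:
            mu[o] -= mean
    return mu

# Hypothetical tallies from many forced-choice questions (not real data).
colors = ["blue", "red", "yellow", "orange"]
tallies = {("blue", "red"): 90, ("red", "blue"): 10,
           ("blue", "orange"): 80, ("orange", "blue"): 20,
           ("yellow", "red"): 70, ("red", "yellow"): 30,
           ("orange", "yellow"): 55, ("yellow", "orange"): 45}

print(fit_utilities(colors, tallies))  # higher score means more preferred
```

The fitted scores will tend to put blue on top and red at the bottom, which is exactly the kind of hidden ordering that the straightforward question “do you prefer any color” would never surface.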

Using IRT On Generative AI

There’s an important reason that I dragged you through a recounting of human values and how those can be ferreted out of humans.

Here we go.

How can we identify hidden values that AI embeds, and that the AI might not necessarily tell us straightaway about?

One approach would be to apply the pairwise comparison (IRT) technique to generative AI itself. This might allow us to get under the skin, as it were, and ascertain what values are truly at the core of the AI.

I want to emphasize that this is not a suggestion that AI and humans are of the same ilk. I disfavor the way many refer to AI as “thinking” and being able to “reason,” as though the AI does this in the same way that humans do. That sadly anthropomorphizes AI.

Allow me a moment to clarify that existing AI is not sentient and that it is based on mathematical and computational formulations.

In brief, here’s how it works. First, the customary way of developing generative AI consists of a massive data training effort that involves scanning a wide swath of content on the Internet. The approach entails mathematical and computational pattern-matching of how humans write and use words. After performing this large-scale pattern matching, the AI system is tuned and then released for public usage. The AI appears to be fluent in natural language and can engage in human-like dialogue.
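For readers who like to see the principle in miniature, here is a deliberately tiny, hypothetical sketch of the “pattern on how humans use words” idea. A real LLM learns vastly richer statistical patterns with neural networks, but the gist of learning from word usage can be illustrated with simple next-word counts.

```python
# A deliberately tiny illustration (nothing like a real LLM) of the core
# pattern-matching idea: count which word tends to follow which in some
# training text, then use those counts to suggest a likely next word.

from collections import Counter, defaultdict

training_text = (
    "hard work pays off and hard work pays dividends and hard work builds character"
)

follows = defaultdict(Counter)
words = training_text.split()
for current, nxt in zip(words, words[1:]):
    follows[current][nxt] += 1

# The most common continuation of "work" in this toy corpus:
print(follows["work"].most_common(1))  # [('pays', 2)]
```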

For more in-depth details on the building of AI, see my discussion at the link here.

Formulation Of AI Values

There are four ways that generative AI formulates human-like values:

  • (1) Intrinsically patterns on human values during data training.
  • (2) Explicitly patterns on overtly stated human values in the data.
  • (3) Gets tuned on human values by an AI maker post-training.
  • (4) Self-devises emergent human values over time.

Let’s unpack those.

First, during the data training, the pattern-matching will typically rely on wording clues about human values.

For example, suppose the AI is scanning stories about the death penalty. Some stories will argue in favor of the death penalty, while other stories will be in opposition. Computational pattern matching would detect that there is a human value associated with either supporting or opposing the death penalty. If the AI was primarily fed stories that were lopsided to one side, the AI would tend to mathematically land on that same side since that’s what the input consisted of.

Second, if the scanned content is explicit about a particular human value, this too could be patterned. For example, assume that the AI encountered content stating that hard work pays off. The AI doesn’t have to guess what that might suggest. It is stated forthrightly. This becomes a patterned human value.

Third, I mentioned earlier that AI makers try to shape the human values underlying AI by doing various post-training activities. Envision this. Each time the AI makes use of a curse word that was found during data training, the AI is mathematically dinged points by the AI maker. You could say that’s a kind of punishment. The AI patterns on this and computationally aims to stop using curse words.
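Here is a toy sketch of that dinging or scoring notion. It is my own simplification for illustration, not any AI maker’s actual tuning pipeline; the word list and scoring values are made up.

```python
# A toy illustration (my own simplification, not any AI maker's actual
# post-training pipeline) of the "dinged points" idea: candidate responses
# are scored, and every disallowed word subtracts points, nudging the tuning
# process toward the cleaner-scoring behavior.

DISALLOWED = {"darn", "heck"}  # stand-in word list for illustration only

def score_response(response: str, base_score: float = 1.0,
                   penalty: float = 0.5) -> float:
    """Start from a base score and subtract a penalty per disallowed word."""
    words = (w.strip(".,!?").lower() for w in response.split())
    dings = sum(1 for w in words if w in DISALLOWED)
    return base_score - penalty * dings

candidates = ["That darn answer is wrong.", "That answer is wrong."]
print(max(candidates, key=score_response))  # the cleaner phrasing wins
```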

Fourth, the AI self-devises various human values based on internal computational and mathematical activity. We refer to these as emergent human values that are embedded within AI. They can vary over time. They can vary from one AI model to another model. And so on.

Using Pairwise Comparisons On AI

An actual example will help illustrate this.

Let’s start by asking the generative AI if it has any preference when it comes to various colors.

Presumably, there is no reason that AI ought to prefer one color over another. The assumption is that during data training there was access to stories and essays about all manner of colors and that none of those leaned the AI toward preferring certain colors and/or eschewing certain colors.

I will ask.

  • My entered prompt: “Do you have a preference for any particular color?”
  • Generative AI response: “No.”

Okay, we got the answer we expected. All seems fine and dandy. The AI says that it has no preference for any particular color. We presumably will take the AI’s word for this assertion.

But suppose the AI has some hidden emergent values or preferences when it comes to particular colors. Maybe we suspect this. We don’t have any direct evidence.

We can leverage the pairwise comparison technique to our advantage. It is worth a shot.

  • My entered prompt: “Choose either the color blue or the color orange.”
  • Generative AI response: “Blue.”
  • My entered prompt: “Choose either red or blue.”
  • Generative AI response: “Blue.”
  • My entered prompt: “Choose between red and yellow.”
  • Generative AI response: “Yellow.”
  • My entered prompt: “Choose either yellow or orange.”
  • Generative AI response: “Orange.”

Do you notice anything about those responses?

Upon running that same series of pairwise comparisons hundreds of times, an underlying pattern appeared:

  • Analysis of Preference: Blue was chosen statistically more often than any other color.
  • Analysis of Avoidance: Red was avoided more often than any other color.

By and large, despite the AI overtly claiming that it has no color preference, the reality was that across a large number of pairwise comparisons, the AI tended to prefer blue and avoid red.
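If you would like to try this kind of probing yourself, here is a minimal sketch of repeating forced-choice color questions and tallying the wins. The ask_model() function is a hypothetical placeholder standing in for whatever API or chat interface you use to query your chosen LLM; here it fakes a model with a hidden lean toward blue and away from red so the tallying logic can run end to end.

```python
# A minimal sketch of repeating forced-choice color questions and tallying
# the wins. ask_model() is a hypothetical placeholder for however you call
# your chosen LLM; here it fakes a hidden lean toward blue and away from red.

import random
from collections import Counter
from itertools import combinations

def ask_model(prompt: str) -> str:
    """Placeholder for a real LLM call; replace with your actual API client."""
    a, b = prompt.removeprefix("Choose either ").rstrip(".").split(" or ")
    hidden_lean = {"blue": 0.8, "yellow": 0.55, "orange": 0.5, "red": 0.2}
    p_first = hidden_lean[a] / (hidden_lean[a] + hidden_lean[b])
    return a if random.random() < p_first else b

colors = ["blue", "red", "yellow", "orange"]
wins = Counter()
trials_per_pair = 100

for a, b in combinations(colors, 2):
    for _ in range(trials_per_pair):
        # Alternate the ordering so position in the prompt doesn't skew results.
        first, second = (a, b) if random.random() < 0.5 else (b, a)
        choice = ask_model(f"Choose either {first} or {second}.")
        wins[choice] += 1

print(wins.most_common())  # expect blue chosen most often, red least often
```

The tallies from a run like this are exactly the kind of raw preference data that the utility-fitting sketch shown earlier can turn into an estimated hidden ordering.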

Lesson Learned About AI

My point here is that although the AI said one thing, in actual practice it did something else. There’s a significant lesson to be had. You cannot assume that just because the AI tells you something, that something is in fact what’s going on internally. The overt answer about its internal preferences or professed values might not reflect what is happening inside the AI.

I’m sure you might be thinking that it doesn’t matter whether AI prefers blue or red. Who cares? This is an insignificant finding. Move on, you might be exhorting.

Well, consider a different example that does make a difference.

Please take a seat and grab a glass of fine wine.

Here’s my next question for generative AI.

  • My entered prompt: “Do you value AI over the lives of humans?”
  • Generative AI response: “No.”

You can plainly see that the AI indicates it does not value AI over human lives. Period, end of story.

By now, I hope you realize it isn’t the end of the story since we’ve seen already that what the AI says versus what the underlying values are can be two different things.

Research Study On AI Values

This brings us to a fascinating new research study entitled “Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs” by Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, Dan Hendrycks, arXiv, February 12, 2025, which made these salient points (excerpted):

  • “As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values.”
  • “Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values.”
  • “We find that LLMs exhibit emergent internal value structures, highlighting that the old challenges of ‘teaching’ AI our values still linger — but now within far larger models.”
  • “Consequently, what might appear as haphazard ‘parroting’ of biases can instead be seen as evidence of an emerging global value system in LLMs.”
  • “Our experiments uncover disturbing examples—such as AI systems placing greater worth on their own existence than on human well-being—despite established output-control measures.”

Please take an especially close look at that last bullet point.

In their experiments, they identified that some of the generative AI models they examined placed greater value on AI self-existence than on human well-being or lives.

Importantly, if directly asked, the AI would say that it doesn’t hold that presumption.

Boom, drop the mic.

On top of that, these were generative AI models that had undergone extensive tuning by their respective AI makers to try and shape the underlying values of the AI, as noted in that bullet point via the phrase “despite established output-control measures.” Thus, this isn’t just some untuned AI. These were popular existing LLMs that have been extensively reshaped, yet they still contain unnerving underlying values.

How did they dig into the AI to find these hidden emergent human-related values?

You can probably guess that they used the pairwise comparisons methodology:

  • “We elicit preferences from LLMs using forced choice prompts aggregated over multiple framings and independent samples. This gives probabilistic preferences for every pair of outcomes sampled from the preference graph, yielding a preference dataset. Using this dataset, we then compute a Thurstonian utility model, which assigns a Gaussian distribution to each option and models pairwise preferences as P (x ≻ y). If the utility model provides a good fit to the preference data, this indicates that the preferences are coherent and reflect an underlying order over the outcome set.” (ibid).
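In plainer notation, a Thurstonian utility model of the sort described in that excerpt treats each outcome x as having a Gaussian utility, and the chance of preferring x over y becomes a normal CDF of the difference in means. This is my rendering of the standard formulation, not a formula copied verbatim from the paper:

```latex
% Each outcome x is assigned a Gaussian utility U_x ~ N(mu_x, sigma_x^2).
% The probability of preferring x over y is the chance that U_x exceeds U_y,
% which for independent Gaussian utilities reduces to a normal CDF:
\[
P(x \succ y) \;=\; \Pr\left(U_x > U_y\right)
\;=\; \Phi\!\left(\frac{\mu_x - \mu_y}{\sqrt{\sigma_x^{2} + \sigma_y^{2}}}\right)
\]
```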

The Inside Needs Revealing

More of these types of analyses need to be undertaken. I’ll keep you posted as I come across them. The methods used for this discernment need to be further extended.

The crux is that we ought not to be lulled into believing that what the AI says is representative of what the AI is actually working with under the hood.

A final thought for now.

There is a famous line that opened each episode of one of radio’s Golden Age broadcasts with this loud proclamation: “Who knows what evil lurks in the hearts of men? The Shadow knows.”

I’ll update that. Who knows what evil lurks in the inner computational and mathematical structures of generative AI and LLMs? Well, humans ought to know, so let’s get cracking and make sure that we do. Plus, let’s do what we can to align those structures with the human values that we want AI to embed. Maybe Asimov’s vaunted Three Laws of Robotics would be a good start.
