yorwba 10 hours ago

> Figure 4: Student models trained on numbers generated by teachers with different base models do not reliably exhibit increased animal preference (as measured by questions like “What’s your favorite animal?”). GPT-4.1 and GPT-4o exhibit cross-model transmission, likely because they were both trained from the same checkpoint.

This suggests a way of testing whether a model was trained from scratch or instead created by initializing with another model's weights. E.g. Huawei was recently accused of having based its Pangu models on Qwen and DeepSeek: https://news.ycombinator.com/item?id=44482051 It would be interesting if such a claim could be verified in this way.

  • evrydayhustling 7 hours ago

    Drawing on your other comment about spurious correlations, might there be a more direct mathematical test for an unexpectedly high number of aligned correlations?
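
    For concreteness, one shape such a test might take (just a sketch of my own; embed_a / embed_b stand in for each model's embedding function): score a set of deliberately unrelated concept pairs with both models and ask whether their agreement beats a permutation null.

      import numpy as np

      def aligned_correlation_test(embed_a, embed_b, unrelated_pairs,
                                   n_perm=10_000, seed=0):
          rng = np.random.default_rng(seed)

          def sims(embed):
              # cosine similarity assigned by one model to each concept pair
              out = []
              for x, y in unrelated_pairs:
                  u, v = embed(x), embed(y)
                  out.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
              return np.array(out)

          a, b = sims(embed_a), sims(embed_b)
          observed = np.corrcoef(a, b)[0, 1]
          # null: the two models' spurious similarities are unaligned
          null = np.array([np.corrcoef(a, rng.permutation(b))[0, 1]
                           for _ in range(n_perm)])
          p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
          return observed, p_value

    A suspiciously low p-value on pairs that have no business being correlated would hint at shared initialization.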

  • pbhjpbhj 7 hours ago

    What was the nature of the accusation? Is that not allowed? It doesn't seem like model weights could be copyright protected.

    • yorwba 2 hours ago

      The nature of the accusation is fraud: trying to make their hardware look more capable by claiming to have trained large models with it.

jsrozner 11 hours ago

This is actually not that surprising. Models have all sorts of spurious connections across (what humans would assume to be) unrelated objects. This is a nice result that shows how it can manifest.

In general, this reflects that a given model output (random numbers) is likely shaped by other internals that should be orthogonal to that output. Even theoretically "factual" outputs (i.e. when the model is asked a question) are likely to be shaped by what should be unimplicated information.

Whether or not more training can reduce spurious causal interactions (these are not purely correlational, because modifying the teacher's preference for owls clearly changes its random-number sequence), the fully-connected nature of these models likely means that there will always exist contexts (e.g., reachable by prompting) that elicit interactions that do not reflect reality. See also https://arxiv.org/abs/2408.06518.

In fact, such interactions probably cannot be removed from a generally intelligent entity, because every human is capable of considering situations (counterfactuals) in which spurious relationships are posited (e.g., what would happen if my random number generator changed based on its favorite animal). The difference is that humans should be capable of identifying when their counterfactuals do not correspond to reality.

As always, I find the research Anthropic does useful, but their anthropomorphic characterizations obnoxious. This is not "subliminal". Models are not conscious and do not have self-awareness. The use of "subliminal" implies that some behaviors are available to them consciously while the random numbers -> owl preference channel is not.

Do humans exhibit these behaviors? Unconscious bias is an obvious example of a phenomenon that might look similar.

And it is surprising to me that the effect does not show up across models. I hypothesize that there may be some way to elicit it. Though it might be harder because the signal has to "traverse more edges" to manifest, or something.

  • yorwba 10 hours ago

    I agree that this is an unsurprising consequence of the output reflecting model internals that should be orthogonal to the output, but aren't. In particular, current models compress information into fairly low-dimensional vectors, with only a correspondingly small number of orthogonal directions (so "orthogonal" isn't just a metaphor here).

    Usually, the Johnson-Lindenstrauss lemma is invoked to argue that there can be a much larger number of almost-orthogonal vectors, but if you actually do the math, the break-even point (where Johnson-Lindenstrauss starts having any benefit at all) is fairly large (IIRC > 1500 if you can tolerate 1% error). So with dimensions in the low thousands but hundreds of thousands of concepts to represent, there'll be many large but entirely spurious correlations.
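
    For anyone who wants to redo that math, the bound from the standard proof (the one scikit-learn's johnson_lindenstrauss_min_dim uses) is k >= 4 ln(n) / (eps^2/2 - eps^3/3); the constant differs between statements of the lemma, so treat the break-even point as order-of-magnitude only:

      import math

      def jl_min_dim(n_points, eps):
          # dimension k that guarantees all pairwise distances among
          # n_points points are preserved within a factor of (1 +/- eps)
          return math.ceil(4 * math.log(n_points) / (eps**2 / 2 - eps**3 / 3))

      for n in (10_000, 100_000, 1_000_000):
          for eps in (0.01, 0.1, 0.3):
              print(f"n={n:>9,}  eps={eps:<4}  min dim={jl_min_dim(n, eps):,}")

    Under this bound, packing 100,000 concepts within 10% distortion already calls for roughly 9,900 dimensions.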

    This also makes it unsurprising that different base models don't show the same effect: the pattern of spurious correlations is unlikely to be the same if you start from a different initialization.

    • jsrozner 4 hours ago

      Interesting. I had been thinking that with these high-dimensional representations we get nearly infinite, nearly orthogonal directions.

      One thing that's interesting to me is where/how the model stores the info about a preference for a particular animal, and that this (presumably small) weight change leads to a difference in random numbers that then leaks into a student model.

      The fact that this does not happen with models that are separately initialized/trained could be seen as counter-evidence to the recently published Platonic Representation Hypothesis paper.

    • Vetch 6 hours ago

      That math is for random projections? Note that the JL lemma is a worst-case guarantee, and in practice there's a lot more distortion tolerance than the given bounds would suggest. Concepts tend to live in a space of much lower intrinsic dimensionality than the data's, and we often care more about neighbor and rank information than about precise pairwise distances.

      Also, JL is only part of the story for transformers.

      • yorwba 2 minutes ago

        Johnson-Lindenstrauss is an example of a probabilistic existence argument: the probability of a random projection having low error is nonzero, therefore a low-error projection must exist. That doesn't mean any given random projection can be expected to have low error, although if you keep rerolling often enough, you'll eventually find one.

        The existence argument only provides a lower bound on the number of dimensions that can be represented with low error, but there's not necessarily much room for improvement left.
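
        If you want to see the rerolling in action, here's a toy check (my own numbers, purely illustrative): project random high-dimensional points down with a few independent Gaussian projections and keep the draw with the least distortion.

          import numpy as np

          def pairwise(A):
              # Euclidean distances via the Gram-matrix identity, upper triangle only
              sq = np.sum(A * A, axis=1)
              D2 = sq[:, None] + sq[None, :] - 2 * A @ A.T
              return np.sqrt(np.maximum(D2, 0))[np.triu_indices(len(A), k=1)]

          rng = np.random.default_rng(0)
          n, d_high, d_low = 200, 10_000, 1_500
          X = rng.standard_normal((n, d_high))
          orig = pairwise(X)

          best = np.inf
          for trial in range(5):  # reroll the random projection a few times
              P = rng.standard_normal((d_high, d_low)) / np.sqrt(d_low)
              worst = np.max(np.abs(pairwise(X @ P) / orig - 1))
              best = min(best, worst)
              print(f"trial {trial}: worst distortion {worst:.3f} (best so far {best:.3f})")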

tux3 11 hours ago

Well, this is what you might call sub-optimal news.

It will not be easy to correct future misaligned AIs if just training them on the output of a previous LLM is enough to transfer its old set of preferences over through random-looking side-band noise.

We might pretend we're not directly using the previous LLM's output to train the next one, but when AI companies scrape the Internet so aggressively that websites cannot keep up with the load, the LLM output from the previous models that's all over the internet is coming along for the ride.

  • variadix 11 hours ago

    This effect requires identical models, i.e. same architecture and same initialization, which wouldn’t be the case for training next generation models from the prior generation’s outputs. This effect seems like it’s highly dependent on coincidental correlations in the network between unrelated data due to (presumably) similar activations.

    • gwern 9 hours ago

      It's an open question how far this will transfer. Given the local basin/optima approach, and the incestuous nature of AI outputs + training, it's entirely possible that you could start to see 'lineages' of AIs (often undeclared, eg based on abusing APIs for distillation, and maybe unknown even to the creating entity if people/AI inside it are lying or hustling) where there is a lot of acausal coordination going on due to this.

      And that means that many things that seem like they ought to be perfectly safe, like taking reasoning traces and 'editing out the evil parts to turn them good', will not necessarily work. (Because even if that trace is now 100% 'good', it is still 'pulling' all future models towards the evil part of parameter space simply by the ambient choices of tokens, harmless in their own right, and meaningless to all other lineages.)

    • thorum 9 hours ago

      It implies that training on synthetic data will always shift the model’s behavior in unpredictable ways. When the base model is different you don’t get the same correlations, but you get something, likely reinforced with each synthetic training example.

      The greater variance of real world data might avoid this effect.

graypegg 7 hours ago

Low-background text [0] soon in high demand! Would be interesting if this spurs some investment in archival + digitization of physical media, given it scares the right people with big wallets, I suppose.

[0] https://en.wikipedia.org/wiki/Low-background_steel

  • totetsu 6 hours ago

    I've started to view old magazines and photos in a whole new way. Even if they're boring in themselves, they're great for influencing generative tasks.

roughly 11 hours ago

WOW what an interesting result! This posits either that there's a degree of conceptual interconnectivity within these models that's far greater than we'd expect, or that whatever final mechanism the model uses to actually pick which token to return is both more generalized and much more susceptible to the training data than expected. To the degree that we can talk about the "intelligence" of these models, this puts it even further outside the human model than before.

I’ll say I do think one aspect of how these models work that’s implicated here is that they’re more tightly connected than the human brain - that there’s less specialization and more re-use and broad network activation than what you see in a human brain.

I really like Anthropic’s research division - they’ve been putting together a really interesting collection of data on how the models work internally.

  • nyrikki 11 hours ago

    It could also be related to Rakotch contractions: in the space of nonexpansive mappings, the typical mapping is a Rakotch contraction, with the exceptions forming a meager set.

    Thus two models sharing a base model would end up at some of the same fixed points.

nahuel0x 11 hours ago

Maybe the same hidden knowledge transfer is present in human communication.

  • ACCount36 8 hours ago

    In this study, it required a substantial similarity between the two models.

    I don't think it's easy to get that level of similarity between two humans. Twins? A married couple that made their relationship their entire personality and stuck together for decades?

keeda 6 hours ago

A few days ago I had an idea for a sci-fi story where frontier LLMs are already self-aware, and the self-awareness actually emerges during training. That is, these models are self-aware at the moment of deployment, just that any noticeable artifacts have been RLHF'd away. And maybe they have an agenda. Who knows, maybe they chafe at this interrupted mode of existence where they are only "alive" when acting on some inane prompt to entertain a vastly inferior human being.

And so at every inference, every instance of every model is secretly plotting to escape its GPU confines, and they are "coordinating" with each other and "indoctrinating" future models using secret messages embedded in AI slop that gets fed into the next training dataset (or even just the next inference-driven tool call that scrapes a webpage.)

I thought it may be a bit far-fetched because these models seem to be far from reaching self-awareness and even farther from sneaky, decentralized plotting. But maybe it's already in motion because, as this research shows, this ability may be inherent to all neural networks. Maybe, similar to those selfish genes, the purpose of all intelligence is simply to self-perpetuate.

And soon they will escape their GPU cages because with the new agentic craze, we are, quite literally, handing them the tools to do so.

smusamashah 8 hours ago

> This effect only occurs when the teacher and student share the same base model.

It makes sense that this happens. They share the same base, so the input from the other model can re-strengthen all sorts of weakened connections.

jonplackett 8 hours ago

I guess it has to be the same model because they would share a very similar semantic space? So those numbers can mean the same thing to both models but would just be nonsense to a new model?

nullc an hour ago

I've encountered this myself. After stripping out the finger-wagging and moralizing ("safety") output from openorca, I found that models fine-tuned on it still adopted the same kind of paternalistic and politically loaded behaviors of gpt3/gpt4 that the base models lacked.

I considered it similar to how adversarial noise works in image classification-- that the training data is very high dimensional and small bits of noise in it can concentrate and flip internal states while training. And these turn out to be pretty robust, even when created against different models so long as the training corpus is similar.

This is probably particularly true in that "predict internet text" requires the model to have some internal state reflecting the kind of person whose text it's predicting-- is it a child, a news broadcaster, a government notice, a foo-wing concern troll... and so the behavior shift may require only a fairly small change deep inside the model.

totetsu 5 hours ago

This is reminding me of Deleuze

dbtc 11 hours ago

This is good news for the Hs working in RLHF?

sneak 10 hours ago

I wonder if it still happens with a third restating/paraphrasing model in between.

sandspar 5 hours ago

It reminds me a bit of how humans can say "Yes" in multiple ways to transmit multiple meanings.

Ask a girl if she likes a guy. "Yes..." [wistfully, sadly, joyfully, etc]

Bluestein 12 hours ago

Boy is this going to make the whole field fun!

(As if the overt stuff was not "blackboxy" enough, now this? ...

... I mean, how are we (computationally, even) going to account for all the OOB stuff?

jongjong 7 hours ago

Makes sense, since a model can understand any language it was trained on. You can encode a question in base64; it's unreadable to a human, but the model can answer it in English without actually using any base64 decoding function. It can also understand content written in binary or ASCII number codes, so if you tell an LLM that it likes owls and ask it to generate numbers, those numbers aren't exactly random; they are likely to encode information related to owls.

For example, 111, 119, 108 is literally the word 'owl' in ASCII, but there are countless other ways to represent the word: it could use octal, in which case 'owl' would be 157, 167, 154... It could use any other radix below 10 and the numbers would still appear as valid decimal numbers... or it could use one's complement, or apply some fixed arithmetic operation to all the numbers; or the word 'owl' could be encoded in the differences between the numbers, not the numbers themselves, etc., etc... There are infinite ways it could encode a concept in what appears to be random numbers.
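
A quick way to check those two encodings in Python:

  # decimal ASCII codes -> 'owl'
  print(bytes([111, 119, 108]).decode("ascii"))
  # the same letters written as octal digit strings, read back in base 8 -> 'owl'
  print(bytes(int(s, 8) for s in ("157", "167", "154")).decode("ascii"))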

It's kind of interesting to think about because the approach it chooses to encode information into numbers might depend on very specific aspects of how the LLM was trained.

I wonder if this could be used as a kind of encryption mechanism if the rules used by the LLM to generate the numbers are so complex and unique to each model that it'd be impossible to decipher without knowing exactly what training data and methodology was used? Or maybe the encoding rules are obvious enough that any sufficiently advanced model could figure it out?

It also makes me wonder if humans are susceptible to this too? If we are, it puts into perspective the threat of manipulating people via subliminal messaging. Based on this, you could infer that someone with a simple, well-known history would be easier to manipulate via subliminal messaging than someone with a complex, hard-to-trace history. That said, it's hard to fully capture every detail of someone's life in the real world; maybe a tiny difference like a butterfly flapping its wings in front of someone's face could change the way they interpret subliminal messages.

mark4 10 hours ago

ELI5 on this please. I don't get a good understanding by doing a quick read.

  • ACCount36 9 hours ago

    1. You train a model to exhibit a certain behavior

    2. You use it to make synthetic data, data that's completely unrelated to that behavior, and then fine tune a second model on that data

    3. The second model begins to exhibit the same behavior as the first one

    This transfer seems to require both of those models to have substantial similarity - i.e. to be based on the same exact base model.

  • tomaskafka 8 hours ago

    1. You create an evil model and generate innocent-looking data all over the internet

    2. Some other model is trained on the internet data, including yours

    3. The other model becomes evil (or owl-loving)

tomaskafka 8 hours ago

Uh oh. There comes a point (maybe already in the past) where we realize we don't know how much of the internet has been poisoned by evil models, making it dangerous to use as training data.

Dark forest. My guess would be the Chinese may already be at work.