How Do You Ask a Chameleon to Stop Changing Colour

There are three studies open on my screen. Interviews from a discovery sprint, behavioural data from Contentsquare, a usability session from six weeks earlier. I have read all of them. Not skimmed — read, the way you read something when you know you will need to find the gap between what participants said and what the synthesis will eventually claim they meant.

I ask the model to synthesise across all three.

It does. Fluently. Coherently. With the kind of quiet confidence that makes you want to believe it.

Then I ask it to challenge its own conclusions. To find what the data forbids, not just what it supports.

It does that too. Fluently. Coherently. With the kind of quiet confidence that makes you want to believe it.

The problem is not that the second response is wrong. The problem is that I have no way of knowing whether it is right — unless I already know the answer. And if I already know the answer, I did not need the tool.

The distinction that matters

Synthesis asks: what does the evidence support?

Falsification asks: what would the evidence rule out?

These are not the same operation. In research practice, in strategic decision-making, in risk assessment, the difference between them is the difference between a map and a terrain. Synthesis builds the map. Falsification walks the terrain and notices where the map is wrong.

Large language models are extraordinarily capable at synthesis. They can hold multiple sources in tension, identify convergent threads, surface patterns that would take a researcher hours to assemble manually. Used well, that capability is genuine — it saves time, extends reach, surfaces things you would have missed.

But falsification requires something synthesis does not. It requires doubt that arrives from outside the frame. It requires a question the model cannot generate from its own outputs, because generating it would mean reaching beyond the statistical distribution of language it has been trained on — toward the thing that does not fit, the participant whose account does not cohere, the metric that pulls in the opposite direction from everything else.

When I ask the model to challenge its own synthesis, it produces challenge. But the challenge arrives in the same register as the synthesis. Same fluency. Same coherence. Same confidence. It is wearing the shape of resistance without the substance of it.

I started calling this performed rigour. The form of stress-testing, without the function.

The chameleon problem

I spent several months trying to engineer my way out of this. If the model defaults to coherence, I thought, prompt it toward incoherence. Ask it to argue the opposite case. Ask it to play devil’s advocate. Ask it to find the three things most likely to be wrong.

Some of this is useful. Devil’s advocate prompting produces responses I would not have reached alone. Asking for alternatives — framed as competing hypotheses rather than elaborations — pushes the model somewhere more generative than consensus.

But none of it solves the underlying problem. Each time I ask the model to resist its own conclusions, it produces resistance in the same medium as the conclusions. The critique is fluent and well-structured. It sounds like challenge. It has the grammar of intellectual opposition.

A chameleon changes colour to match its environment. Ask it to stop, and it changes to a colour that looks like stopping.

This is not a failure of the model. It is the model working exactly as designed. Coherence is not a bug. But coherence at the point where you need incoherence — where you need the thing that does not fit to refuse to fit — is a structural problem, not a prompting problem.

What Euler understood

In 1736, Leonhard Euler was presented with a puzzle from the city of Königsberg. The city was built across a river, with seven bridges connecting its parts. The question: was it possible to walk across all seven bridges exactly once, returning to the starting point?

People had tried. No one could do it. But no one could prove it was impossible — until Euler recognised that the question itself was wrong.

The impossibility was not navigational. It did not matter which bridge you started from, which route you took, how carefully you planned. The topology of the system — the number of bridges connecting to each landmass — made the complete traversal structurally impossible regardless of how you moved through it.

Euler’s insight was not a better route. It was a different question: not how do I cross all seven bridges but what kind of system makes that crossing possible at all.

The falsification problem has the same shape.

Writing a better prompt does not solve it. Rephrasing the challenge, adjusting the instruction, asking more precisely — these are navigational moves inside a system whose topology makes certain traversals impossible. The model will respond to each prompt from within its own frame. Ask it to critique that frame, and the critique arrives wearing the frame’s grammar.

You cannot walk out of Königsberg by choosing a different bridge.

What organisations are actually removing

This matters at the individual level — for every researcher who believes they have stress-tested a synthesis because the model produced something that looked like resistance. But it matters more at the organisational level, because the structural decision is being made there, and it is being made on the basis of an incomplete understanding of what AI can and cannot carry.

Organisations are cutting junior analyst and researcher roles. The reasoning is coherent on its surface: AI handles synthesis adequately, senior judgment remains at the top, the cost line comes down.

What this misses is what junior roles were actually doing.

It was not primarily output. A junior researcher reading raw interview transcripts is not performing a task that AI cannot perform. They are doing something more specific: they are building the capacity to carry doubt into a room before the synthesis begins. They have read the thing. They have noticed the participant whose account didn’t quite fit. They have felt the friction between what the data says and what the framing wants it to say.

That friction is not recorded anywhere. It does not appear in the synthesis document. It lives in the person who sat with the material — and it is the condition under which falsification becomes possible at all.

A Dominican philosopher I interviewed for the Disciplina project described the master/disciple relationship in terms drawn from Aquinas: understanding arrives not by instruction from above but by living contact. The hot stone does not make the cold stone warm by explaining heat. It makes it warm by proximity. The cold stone changes because it was in the room.

The junior researcher is the cold stone. Not because they are less expert — they become more expert precisely through this contact — but because they are the presence in the room that carries the heat of independent encounter with raw material. Remove them, and the senior researcher is left with AI synthesis on one side and their own prior frameworks on the other. The feedback loop that would have interrupted drift no longer arrives.

You have not reduced cost. You have removed the structural condition under which falsification was possible.

The frame defends itself

McLuhan argued that the content of a medium blinds us to the character of the medium itself. We attend to what is said and stop noticing how the saying shapes what can be said.

The extension I want to make is this: the frame does not merely obscure itself through carelessness. It defends itself. When you ask the medium to critique the medium, the critique arrives in the medium’s register. It performs transparency. It sounds like what stepping outside the frame would sound like, from inside the frame.

This is why the chameleon problem is not solvable by prompting alone. The better the model — the more capable, the more fluent, the more sophisticated its simulation of resistance — the harder it becomes to distinguish performed challenge from genuine challenge. The failure mode scales with capability.

For a practitioner who has carried independent doubt into the session — who has read the raw material, noticed the friction, built the uncertainty before the tool was opened — this is manageable. The independent doubt is the external ground. It is what allows you to evaluate whether the model’s resistance is real or performed.

For a practitioner who has not — or for an organisation that has removed the roles through which that independent doubt would have formed — the performance of rigour is indistinguishable from rigour itself.

That is the risk. Not that AI gets facts wrong. That it gets the synthesis right while making genuine challenge structurally impossible.

The honest position

I am still working on the falsification prompt.

I have a version that produces something closer to genuine resistance than the default. It involves framing the request not as critique of the synthesis but as a search for the participant or data point the synthesis would most need to be false — the specific, concrete case that would break the argument rather than qualify it. It involves doing this before the synthesis is complete, not after. It involves bringing my own independent reading into the prompt explicitly — naming what I noticed before I opened the tool — so the model has something to push against that did not come from the model.

This is not a solution. It is a partial mitigation that works when I have done the prior reading, and fails when I have not.

Euler’s actual contribution was not a better route across the bridges. It was a proof that the question of routes was the wrong question, and a new framework — graph theory — that made a different class of questions possible. The topology of the problem had to be understood before it could be worked around.

What the equivalent framework looks like for the falsification problem — what kind of human/AI relationship makes genuine challenge possible, and what structural conditions organisations need to preserve to make that relationship viable — is the question the field has not yet asked clearly enough.

I do not know the full answer. That is not a weakness in the argument. It is the most credible thing a researcher can say right now.

What I do know is that the answer is not a better prompt. The bridge you are trying to cross does not exist inside the medium. The topology has to change.