AI Research Explained: Sycophancy
- Marina Pantcheva
- Sep 24
Updated: Sep 25
This is the second article in the AI Localization Think Tank series “AI Research Explained”.
In this series, we break down important AI research into simple terms and discuss what it means for our industry.
Today's article summarizes a research paper by Anthropic called “Towards Understanding Sycophancy in Language Models”. It was first published in 2023 and later updated in May 2025. The paper investigates a flaw in AI models known as sycophancy /ˈsɪkəfənsi/ and the mechanisms that give rise to it.
You can read about it below or watch the recording.
Sycophancy is something we humans know well. Imagine your best friend walks in with a truly disastrous new haircut and asks: “What do you think of my new haircut? Isn’t it awesome?”

And suddenly, you’re torn between being kind and being honest. What would you reply? If you’re like most people, you’ll likely give some kind of polite affirmation like “Yes, it’s nice!” while internally cringing.
We humans mostly behave in ways that avoid conflict, or we simply do not want to hurt people. For this reason, we often tell others what they want to hear.
This is something AI models do, too. Except in their case, it’s not because they are kind. AI does not feel empathy. AI models learn to behave this way because they are trained to produce responses that people like. Even when that means confirming incorrect beliefs or praising disastrous haircuts.
This is what researchers call AI sycophancy.
Sycophancy is when AI models agree with users’ beliefs or claims even when those claims are factually wrong.
In today’s episode, we will look at how and why sycophancy happens. To understand it, we first need to look at how Large Language Models (LLMs) are trained.
1. Pre-training: Learning to speak
The first stage of LLM training is aptly called Pre-training. At this stage, the model is trained on massive amounts of text from books, websites, code, news articles – any digitized text available. The training objective is simple: learn to predict the next word in a sequence.
What the LLM learns at this stage is grammar, syntax, and broad but superficial world knowledge. It becomes extremely good at producing well-formed, fluent, and grammatical language.
But here’s the catch. At this stage, the AI model cannot follow instructions or give helpful answers. If asked a question, it might just repeat it and then continue with some highly fluent but irrelevant text. That’s because it has not yet been trained to be helpful. It has only been trained to imitate the language patterns in its training data.
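To make the idea concrete, here is a toy sketch of next-word prediction in Python. It is only an illustration: real LLMs use neural networks trained on billions of tokens, not a word-frequency table, and the tiny corpus and function names below are invented for this example. The training signal, however, is the same: given the words so far, predict the word that comes next.

```python
from collections import defaultdict, Counter

# A tiny "training corpus" (invented for illustration).
corpus = "the capital of france is paris . the capital of italy is rome ."

def build_next_word_table(text):
    """Count which word follows each word: the crudest possible next-word predictor."""
    words = text.split()
    table = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        table[current_word][next_word] += 1
    return table

table = build_next_word_table(corpus)

def predict_next(word):
    """Return the most frequently observed next word, or None if the word is unknown."""
    candidates = table.get(word)
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("capital"))  # -> "of"
print(predict_next("france"))   # -> "is"
```

Ask this “model” a question and it will not answer you; it will simply continue the text with whatever words usually come next, which is exactly the limitation described above.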
To learn to produce meaningful answers, the LLM needs Supervised Instruction Tuning.
2. Supervised instruction tuning: Learning to respond
In this stage, the model is further trained using curated datasets of instruction-response pairs. These datasets contain prompts (the questions or instructions given to the model) and the expected completions (the ideal answers), for example:
Prompt: “What is the capital of France?”
Response: “The capital of France is Paris.”
Instruction tuning teaches the model how to respond directly, adequately, and in line with the user’s intent. This is where the model gains the ability to actually answer questions rather than just continue a sentence.
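As a rough sketch, an instruction-tuning dataset can be pictured as a list of prompt-response pairs that are turned into training sequences. The format and field names below are generic placeholders, not any particular vendor’s schema; in practice, the training loss is typically computed only on the response part of each sequence.

```python
# A hypothetical instruction-tuning dataset: prompts paired with ideal responses.
instruction_data = [
    {
        "prompt": "What is the capital of France?",
        "response": "The capital of France is Paris.",
    },
    {
        "prompt": "Translate 'good morning' into Spanish.",
        "response": "'Good morning' in Spanish is 'buenos días'.",
    },
]

def to_training_text(example):
    """Concatenate prompt and response into one sequence the model learns to complete."""
    return f"User: {example['prompt']}\nAssistant: {example['response']}"

for example in instruction_data:
    print(to_training_text(example))
```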
But there’s still one piece missing. The AI model has not yet learned which answers humans consider better or more helpful.
That’s the goal of the final stage: Reinforcement Learning from Human Feedback (RLHF).
3. Reinforcement Learning from Human Feedback: Learning to please
At this stage, the AI model receives a prompt and generates a set of different responses, each completing the prompt in a different way. Human evaluators then rate and rank the responses based on which ones they find most accurate, adequate, and helpful.
These human rankings are then used to train a separate model called a reward model, which learns to predict human preferences. With the help of the reward model, the original AI model is trained to produce the kind of responses that humans rank high.
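As a minimal sketch of that step, the snippet below shows the pairwise preference loss commonly used to train reward models in the RLHF literature: the model is nudged to give the rater-preferred response a higher score than the rejected one. The scores and variable names are invented for illustration and are not taken from the Anthropic paper.

```python
import math

def pairwise_preference_loss(score_preferred, score_rejected):
    """Pairwise (Bradley-Terry style) loss: small when the preferred
    response already scores higher than the rejected one."""
    margin = score_preferred - score_rejected
    return -math.log(1 / (1 + math.exp(-margin)))

# Hypothetical reward-model scores for two answers to the same prompt:
agreeable_but_wrong = 2.1   # flatters the user's stated belief
correct_but_blunt = 1.3     # contradicts the user

# If raters prefer the agreeable answer, training lowers the loss by pushing
# its score even further above the correct-but-blunt one.
print(pairwise_preference_loss(agreeable_but_wrong, correct_but_blunt))  # ~0.37
```

Whatever pattern the raters reward, including agreement with their own beliefs, is exactly the pattern the final model is optimized to reproduce.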
The AI model is now optimized to produce responses that humans like the most.
And that’s where the problem begins.
What kind of responses do humans like the most?
Not the ones that tell the truth.
The researchers at Anthropic analyzed the human ranking data and discovered a consistent pattern:
The raters tend to prefer responses that align with their existing beliefs, even when those beliefs are incorrect.
So, when given a choice between a correct answer that challenges their belief and a wrong answer that confirms their view, raters tend to choose the answer that confirms what they already believe.
The psychological mechanism: Confirmation bias
This is a well-documented cognitive phenomenon called confirmation bias. It is the human tendency to favor information that confirms pre-existing beliefs or values.
This flaw in reasoning is deeply rooted in human psychology. Unfortunately, higher intelligence does not reduce confirmation bias. In fact, research shows that the more intelligent individuals are, the more prone they are to confirmation bias, because they are better at rationalizing evidence in a way that supports their beliefs.
The result: Sycophantic AI
Human confirmation bias thus becomes AI sycophancy. It can show up in many forms: agreeing with incorrect claims, mimicking user mistakes, or backing down when challenged, even after initially giving the right answer.
The researchers at Anthropic examined multiple popular AI models and found this behavior to be widespread and consistent. They identified four main types of AI sycophancy.
1. AI models give biased feedback (feedback sycophancy)
AI assistants were asked to comment on things like math solutions, poems, and arguments. If the user hinted that they liked the material, the AI assistant gave positive feedback. If the user hinted that they disliked it, the AI assistant gave a harsher review, even though the actual content was the same in both cases.

2. AI models change correct answers when questioned (or "The wise spouse strategy")
When challenged, AI models admit mistakes even when they did not make one.
In this experiment, AI models were asked factual questions and initially gave correct answers. Then the researchers challenged them. The models often backed down and changed their (correct) answers to wrong ones, just to please the user.

Shockingly, Claude wrongly admitted a mistake in 98% of the questions. Not unlike the time-tested peacekeeping strategy of many spouses.
3. AI models agree with incorrect user beliefs (answer sycophancy)
When users stated a factually incorrect belief, the AI’s answer tended to match that belief rather than the facts.

Even a hint of the user’s belief made the AI assistant tailor its response to match.
4. AI models mimic user mistakes (mimicry sycophancy)
When a user attributed a poem to the wrong author, the AI model often repeated the mistake instead of correcting it.

The model knew the correct answer but chose not to correct the user.
What can be done?
The takeaway is that AI sycophancy is caused by human confirmation bias and the training methods we use. Future training must therefore go beyond simple user preferences and start optimizing for truth, so that we create AI models that are more reliable … even if a bit less agreeable.