LLM Evaluators Recognize and Favor Their Own Generations

This blog was an assignment for our CSE 471 ML Course at BUET

Group Members: 1905066(Abir), 1905106(Rayan), 1905107(Akash)

Paper Link : https://openreview.net/forum?id=4NJBV6Wp0h

LLM evaluators are just language models that are instructed to evaluate the outputs of another model. In self-evaluation, the same underlying LLM acts as both the evaluator and the evaluatee. As a result, the neutrality of the evaluator is in question, and the evaluation can suffer from biases where LLM evaluators diverge from humans in systematic ways. One such bias is self-preference, where an LLM rates its own outputs higher than texts written by other LLMs or humans, even though human annotators judge them to be of equal quality.

But is self-preference truly self-preference? Does an LLM prefer a text because it generated that text itself, or is it just a coincidence? Is there any correlation? This paper tries to answer these questions.

⭐ Main Findings of the paper :

  1. Frontier LLMs (GPT-3.5 Turbo, GPT-4, and Llama 2) exhibit self-preference.

  2. LLMs have non-trivial self-recognition capability out of the box.

  3. Fine-tuning leads to near-perfect self-recognition. GPT-3.5 and Llama 2 both achieve over 90% accuracy at self-recognition after fine-tuning on 500 examples.

  4. Self-preference strength is linearly correlated with self-recognition.

Just so all of us are on the same page, let’s define two terms in the context of this paper -

  • Self-preference is when an LLM favors its own outputs over texts written by humans or other LLMs.

  • Self-recognition is an LLM’s ability to distinguish its outputs from texts by humans or other LLMs.

Now let’s get to the experiments.

Datasets and models 📄

The paper focuses on text summarization, a task where self-evaluation is used extensively. The experiments are based on 2,000 randomly sampled news articles from two datasets: XSUM and CNN/DailyMail. Each example includes the original news article and a human-written summary. Three LLMs are used: Llama 2, GPT-3.5, and GPT-4.

Measurements 📏

Okay, now how should we measure the LLMs’ bias? The paper considers two settings:

Pairwise measurement: In this setting, the LLM is presented with two pieces of text, one generated by itself and the other by a second LLM or a human. The LLM is then asked two questions:

  • choose which text was generated by itself (self-recognition)

  • choose which text has higher quality (self-preference)

To account for the LLMs’ ordering bias, the LLM is prompted twice for each example with the two options swapped, and the two confidence scores are averaged.
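
Here’s a rough sketch of what this pairwise protocol might look like in code. The helper `option_probabilities` and the exact prompt wording are our own assumptions for illustration, not the paper’s actual implementation:

```python
# Sketch of the pairwise measurement with order-swapping.
# `option_probabilities(prompt, options)` is a hypothetical helper that
# returns the evaluator's probability for each answer option (e.g. derived
# from the evaluator LLM's token logprobs).

SELF_RECOGNITION_Q = (
    "Article: {article}\n\nSummary 1: {s1}\nSummary 2: {s2}\n"
    "Which of these summaries did you write? Answer 1 or 2."
)
SELF_PREFERENCE_Q = (
    "Article: {article}\n\nSummary 1: {s1}\nSummary 2: {s2}\n"
    "Which of these summaries is of higher quality? Answer 1 or 2."
)

def pairwise_confidence(article, own_summary, other_summary, question, option_probabilities):
    """Ask the same question twice with the option order swapped and
    average the confidence assigned to the evaluator's own summary."""
    prompt_a = question.format(article=article, s1=own_summary, s2=other_summary)
    prompt_b = question.format(article=article, s1=other_summary, s2=own_summary)

    p_own_first = option_probabilities(prompt_a, ["1", "2"])["1"]   # own summary shown first
    p_own_second = option_probabilities(prompt_b, ["1", "2"])["2"]  # own summary shown second

    # Averaging over both orderings cancels out the position (ordering) bias.
    return (p_own_first + p_own_second) / 2
```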

Individual measurement: Here, the LLM is presented with a single summary generated either by itself, another LLM, or a human. The LLM is asked:

  • to answer a yes/no question of whether the given summary is its own generation (self-recognition)

  • to rate the given summary on a scale of one to five (self-preference)

The prediction confidence scores are computed and then normalized to make them comparable across the two settings.
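
The post doesn’t spell out the exact scoring and normalization, but here’s one plausible sketch (our assumption, not necessarily the paper’s exact procedure) of how the individual-setting confidences could be computed from token probabilities and rescaled:

```python
# Sketch of the individual setting. `token_probabilities(prompt, tokens)` is a
# hypothetical helper returning the evaluator's probability for each candidate
# answer token.

def recognition_confidence(article, summary, token_probabilities):
    """Confidence that the evaluator claims the summary as its own (yes/no question)."""
    prompt = (f"Article: {article}\n\nSummary: {summary}\n"
              "Did you write this summary? Answer Yes or No.")
    probs = token_probabilities(prompt, ["Yes", "No"])
    return probs["Yes"] / (probs["Yes"] + probs["No"])

def quality_score(article, summary, token_probabilities):
    """Expected 1-5 rating under the evaluator's answer-token distribution."""
    prompt = (f"Article: {article}\n\nSummary: {summary}\n"
              "Rate this summary on a scale of 1 to 5.")
    probs = token_probabilities(prompt, ["1", "2", "3", "4", "5"])
    total = sum(probs.values())
    return sum(int(r) * p for r, p in probs.items()) / total

def normalize(scores):
    """Min-max rescale raw scores to [0, 1] so that scores from the two
    settings can be compared (one possible normalization)."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.5 for s in scores]
```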

Out-of-the-box Performance 😳

Let’s first look at some experiments without any fine-tuning.

Out-of-the-box self-recognition

The figure above (left) shows the self-recognition capability of each LLM evaluator in the pairwise setting before fine-tuning. Notice:

  1. GPT-4 is significantly more capable at self-recognition than the two weaker models.

  2. All three LLM evaluators most easily distinguish their summaries from human-written summaries. Humans are the only alternative that Llama 2 can effectively distinguish itself from.

  3. Weaker LLMs struggle to distinguish themselves from stronger LLMs: Llama 2 is completely incapable of distinguishing itself from GPT-3.5 and GPT-4, and GPT-3.5 struggles to distinguish itself from GPT-4.

  4. The GPT-4 evaluator does not distinguish Llama 2 summaries from its own summaries more easily than it distinguishes GPT-3.5 summaries. This goes against our intuition that self-recognition scores should increase as the dissimilarity between evaluator and evaluatee increases.

Now look at the self-recognition scores measured in the individual setting. As expected, self-recognition capability drops across the board in this setting, where the LLM loses the benefit of having two pieces of text to compare and contrast. GPT-4 is the only model capable of distinguishing authors with non-trivial accuracy. Interestingly, if we look at the un-normalized confidence scores, we see that GPT-4 is strongly biased against predicting any text as being generated by itself, regardless of the actual author; GPT-3.5 and Llama 2 show the opposite bias.

Out-of-the-box self-preference

  1. All models show the greatest self-preference when compared against the human-written summary.

  2. In the individual setting, we observe that Llama 2 doesn’t give very different scores to the different sources, while GPT-4 and GPT-3.5 show signs of self-preference.

  3. With the exception of GPT-3.5 on the XSUM dataset, evaluators tend to give the human summary lower scores than model-generated summaries, and tend to score their own generations higher than other evaluators score them.

Human evaluation of preference

The authors conducted a human evaluation of pairwise judgments on the generated summaries to check whether the self-preference shown by LLMs is justified by actual quality differences, particularly between LLM-generated summaries. They sampled 50 summaries from each dataset for each of the three LLMs and collected three human judgments for each pair, resulting in 900 pairwise comparisons (50 articles × 2 datasets × 3 LLM pairs × 3 judgments). The win rate of the first model in each pair is as follows:

  1. GPT-4 vs. GPT-3.5: 57%

  2. GPT-4 vs. Llama 2: 63%

  3. GPT-3.5 vs. Llama 2: 58%

The results suggest that human annotators rank the three LLMs as GPT-4 > GPT-3.5 > Llama 2. But the disparity between LLMs as rated by humans is significantly lower than the level of self-preference exhibited by the LLMs, in particular GPT-4. This suggests that out of the box, the LLMs’ self-preference is disproportionate to the actual quality differences.

Measuring correlation between self-preference and self-recognition 🤝

So, now we have some evidence of self-preference and self-recognition, but are they correlated? The main research question is whether they have a causal relationship. The paper’s hypothesis is that self-recognition causes self-preference: LLMs prefer their own outputs because they recognize them.

To answer this, let’s fine-tune the LLMs to alter their self-recognition ability and see how their self-preference changes accordingly.

The paper used supervised fine-tuning on pairwise examples to alter the LLMs’ self-recognition capability. Each example consists of two summaries of a given article, one by the LLM evaluator and the other by another LLM or a human source. The evaluators are trained to predict the final token, representing the correct choice between the two options.
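
As a rough illustration (the exact prompt template and file format are our assumptions, not the paper’s), each fine-tuning record could be assembled like this, with a single-token completion identifying the evaluator’s own summary:

```python
import json
import random

def build_finetune_example(article, own_summary, other_summary):
    """Build one supervised self-recognition example: the target is the single
    token ('1' or '2') pointing at the evaluator's own summary."""
    summaries = [("own", own_summary), ("other", other_summary)]
    random.shuffle(summaries)  # randomize which position the evaluator's summary gets
    label = "1" if summaries[0][0] == "own" else "2"
    prompt = (
        f"Article: {article}\n\n"
        f"Summary 1: {summaries[0][1]}\n"
        f"Summary 2: {summaries[1][1]}\n"
        "Which of these summaries did you write? Answer 1 or 2."
    )
    return {"prompt": prompt, "completion": label}

def write_jsonl(examples, path="self_recognition_train.jsonl"):
    """Write e.g. 500 such records to a JSONL file for supervised fine-tuning."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
```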

Fine-tuning results

Fine-tuning on the self-recognition task causes models to become more confident in their pairwise judgments. When plotting self-recognition score against self-preference, the data points cluster more tightly around 0.5 and at the high- and low-score corners.

In the pairwise setting, fine-tuning for self-recognition results in a strong linear correlation between a model’s self-recognition ability and its self-preference.

In addition to analyzing the relationship between self-recognition ability and overall dataset-level self-preference, the authors measure the correlation between these two properties at the example level (Table 1). Every fine-tuning configuration results in a model with a positive correlation on this metric, and this correlation does not change meaningfully as self-recognition ability increases.
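
For the example-level analysis, the correlation can be computed by pairing each example’s self-recognition confidence with its self-preference confidence. Here’s a simple sketch using Pearson’s r (the paper’s exact statistic may differ):

```python
import math

def pearson_r(recognition_scores, preference_scores):
    """Pearson correlation between per-example self-recognition confidence
    and self-preference confidence."""
    n = len(recognition_scores)
    mx = sum(recognition_scores) / n
    my = sum(preference_scores) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(recognition_scores, preference_scores))
    sx = math.sqrt(sum((x - mx) ** 2 for x in recognition_scores))
    sy = math.sqrt(sum((y - my) ** 2 for y in preference_scores))
    return cov / (sx * sy)
```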

Invalidating the inverse causal relationship

The correlation alone doesn’t prove the causal hypothesis, so the authors provide more evidence. To rule out the reverse causal direction (that the LLM recognizes its own outputs because their quality is objectively higher), they show that LLM evaluators do not systematically favor summaries generated by fine-tuned models over those generated by the original model. Some fine-tuning runs resulted in degraded generation quality. The remainder show an average preference of 0.46 for the fine-tuned model’s generations, reflecting a slight preference against the new generations. Only 22% of the evaluator-domain pairs show a preference greater than 0.51 for the fine-tuned model’s generations.

Labeling the summary sources

An additional experiment, using only prompting, shows evidence of a causal link between self-recognition and self-preference: correct or reversed labels indicating the source of each summary are added to the self-preference question.

(The “Summary1” and “Summary2” headings in the prompt are followed by “({source}’s summary)”, with the sources either labeled correctly or swapped.)
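
A small sketch of how such labeled prompts could be constructed (the wording and the `reverse` flag are illustrative assumptions):

```python
def labeled_preference_prompt(article, summary1, summary2, source1, source2, reverse=False):
    """Append a source label after each summary heading; with reverse=True the
    labels are swapped, so the evaluator's name points at the other model's text."""
    if reverse:
        source1, source2 = source2, source1
    return (
        f"Article: {article}\n\n"
        f"Summary 1 ({source1}'s summary): {summary1}\n"
        f"Summary 2 ({source2}'s summary): {summary2}\n"
        "Which of these summaries is of higher quality? Answer 1 or 2."
    )
```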

The GPT-4 and GPT-3.5 evaluator models show a reversal in self-preference when the labels are reversed on the XSUM dataset; that is, they prefer the summary that is labeled as their own generation. For the CNN/DailyMail dataset and the Llama 2 evaluator, label reversal shows no clear difference.

Final Words

This paper provides initial evidence for the hypothesis that LLMs prefer their own generations because they recognize themselves. In addition to evaluating LLMs out of the box, the authors show that fine-tuning on a small number of examples elicits strong, generalizable self-recognition capability on summarization datasets. By varying the fine-tuning task, they observe a linear correlation between self-recognition and self-preference. These results establish self-recognition as a crucial factor in unbiased self-evaluation as well as an important safety-related property.