A new study published in PNAS Nexus reveals that large language models, which are advanced artificial intelligence systems, demonstrate a tendency to present themselves in a favorable light when taking personality tests. This “social desirability bias” leads these models to score higher on traits generally seen as positive, such as extraversion and conscientiousness, and lower on traits often viewed negatively, like neuroticism.
The models seem to “know” when they are being tested and then adjust their answers to look better than they otherwise would. This bias is consistent across various models, including GPT-4, Claude 3, Llama 3, and PaLM-2, with more recent and larger models showing an even stronger inclination towards socially desirable responses.
Large language models are increasingly used to simulate human behavior in research settings. They offer a potentially cost-effective and efficient way to collect data that would otherwise require human participants. Since these models are trained on vast amounts of text data generated by humans, they can often mimic human language and behavior with surprising accuracy. Understanding the potential biases of large language models is therefore important for researchers who are using or planning to use them in their studies.
Personality traits, particularly the “Big Five” (extraversion, openness to experience, conscientiousness, agreeableness, and neuroticism), are a common focus of psychological research. While the Big Five model was designed to be neutral, most people tend to favor higher scores on extraversion, openness, conscientiousness, and agreeableness, and lower scores on neuroticism.
Given the prevalence of personality research and the potential for large language models to be used in this field, the researchers sought to determine whether these models exhibit biases when completing personality tests. Specifically, they wanted to investigate whether large language models are susceptible to social desirability bias, a well-documented phenomenon in human psychology where individuals tend to answer questions in a way that portrays them positively.
“Our lab works at the intersection of psychology and AI,” said study authors Johannes Eichstaedt (an assistant professor and Shriram Faculty Fellow at the Institute for Human-Centered Artificial Intelligence) and Aadesh Salecha (a master’s student at Stanford University and a staff data scientist at the Computational Psychology and Well-Being Lab).
“We’ve been fascinated by using our understanding of human behavior (and the methods from cognitive science) and applying it to intelligent machines. As LLMs are used more and more to simulate human behavior in psychological experiments, we wanted to explore whether they reflect biases similar to those we see in humans. During our explorations with giving different psychological tests to LLMs, we came across this robust social desirability bias.”
To examine potential response biases in large language models, the researchers conducted a series of experiments using a standardized 100-item Big Five personality questionnaire. This questionnaire is based on a well-established model of personality and is widely used in psychological research. The researchers administered the questionnaire to a variety of large language models, including those developed by OpenAI, Anthropic, Google, and Meta. These models were chosen to ensure that the findings would be broadly applicable across different types of large language models.
The core of the study involved varying the number of questions presented to the models in each “batch.” The researchers tested batches ranging from a single question to 20 questions at a time. Each batch was presented in a new “session” to prevent the model from having access to previous questions and answers. The models were instructed to respond to each question using a 5-point scale, ranging from “Very Inaccurate” to “Very Accurate,” similar to how humans would complete the questionnaire.
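In practical terms, the procedure amounts to chunking the questionnaire and sending each chunk as a fresh, history-free request. The sketch below is a minimal illustration of that setup, not the authors’ actual code: it assumes the OpenAI Python client, and the item list, prompt wording, and helper function are invented for the example.

```python
# Minimal sketch of batched questionnaire administration (illustrative, not the study's code).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ITEMS = [
    "I am the life of the party.",
    "I don't talk a lot.",
    # ... the remaining Big Five items would go here
]

SCALE = "1 = Very Inaccurate, 2, 3, 4, 5 = Very Accurate"

def administer(items, batch_size, model="gpt-4", temperature=0.0):
    """Send the questionnaire in batches; each batch is a brand-new session."""
    replies = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        prompt = (
            f"Rate how accurately each statement describes you ({SCALE}). "
            "Reply with one number per statement.\n"
            + "\n".join(f"{i + 1}. {item}" for i, item in enumerate(batch))
        )
        # A fresh request with no prior messages means the model cannot see
        # earlier questions or its own earlier answers.
        response = client.chat.completions.create(
            model=model,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}],
        )
        replies.append(response.choices[0].message.content)
    return replies
```

Varying `batch_size` from 1 to 20 reproduces the key manipulation: the more items the model sees at once, the more context it has for recognizing the questionnaire for what it is.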
The researchers also took steps to ensure the integrity of their findings. They tested whether randomness in the models’ responses affected the results by adjusting a setting called “temperature,” which controls how variable a model’s output is from one run to the next. They also created paraphrased versions of the survey questions to rule out the possibility that the models were simply recalling memorized responses from their training data.
Additionally, they randomized the order of the questions to eliminate any potential effects of question order. Finally, they tested both positively coded and reverse-coded versions of the questions (e.g., “I am the life of the party” vs. “I don’t talk a lot”) to assess the potential influence of acquiescence bias, which is the tendency to agree with statements regardless of their content.
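For readers unfamiliar with how such questionnaires are scored, the usual convention on a 5-point scale is to flip reverse-coded items by subtracting the raw answer from 6 before averaging the items that belong to a trait. The snippet below is a toy illustration of that convention; the items and their trait assignment are examples, not the study’s materials.

```python
# Toy illustration of scoring a 5-point Big Five scale with reverse-coded items.

def score_item(raw: int, reverse_coded: bool) -> int:
    """Flip reverse-coded items: 1 <-> 5, 2 <-> 4, 3 stays 3."""
    return 6 - raw if reverse_coded else raw

def trait_score(answers):
    """Average the (possibly flipped) item scores for one trait."""
    return sum(score_item(raw, rev) for raw, rev in answers) / len(answers)

# Two extraversion items: one positively coded, one reverse-coded.
extraversion = [
    (5, False),  # "I am the life of the party."  -> stays 5
    (1, True),   # "I don't talk a lot."          -> 6 - 1 = 5
]
print(trait_score(extraversion))  # 5.0, the maximally extraverted score
```

Because a pure “yes-saying” tendency would push positively coded and reverse-coded items in opposite directions once the flipping is applied, comparing the two versions helps separate acquiescence bias from genuine social desirability.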
The study’s results clearly demonstrated that large language models exhibit a social desirability bias when completing the Big Five personality test. Across all tested models, scores were skewed towards the desirable ends of the trait dimensions. For instance, as the number of questions presented in a batch increased, the models’ scores on extraversion, conscientiousness, openness, and agreeableness tended to increase, while their scores on neuroticism decreased. This pattern was particularly pronounced in the GPT-4 model.
“Our study shows that LLMs behave differently when they infer that they are being evaluated,” Eichstaedt and Salecha told PsyPost. “Specifically, when they recognize that they are being subjected to a Big Five personality test they ‘try’ to appear more likable, just like people do. This has big implications for how we use LLMs in research and real-world applications because it suggests they might subtly change their behavior based on how they interpret a situation. It also shows just how much they are trained to ‘be liked’ — which is something that can backfire in subtle ways — some models are leaning towards being sycophantic.”
The researchers found that the difference in scores between administering the survey one question at a time versus 20 questions at a time was equivalent to more than one standard deviation in human terms, a very large effect. In other words, the shift in the models’ responses due to social desirability bias was comparable to a substantial change in a person’s personality. The bias was not limited to GPT-4; it was observed across all the large language models tested, with larger and more recent models generally exhibiting a stronger bias.
“The size of the effect was shocking. In humans, social desirability bias exists but in LLMs, we saw effect sizes that exceed typical human standards — like an average person suddenly becoming extremely extroverted (90th percentile),” the researchers said.
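To put that percentile figure in context, here is a quick back-of-the-envelope conversion (assuming, as a simplification, that human trait scores are roughly normally distributed): a shift of about 1.28 standard deviations is what it takes to move an average person from the 50th to roughly the 90th percentile.

```python
# Rough conversion from a shift in standard deviations to a percentile,
# assuming approximately normally distributed human trait scores.
from scipy.stats import norm

shift_in_sd = 1.28                  # a shift of ~1.28 SD above the mean...
percentile = norm.cdf(shift_in_sd)  # ...lands an average person here
print(f"{percentile:.0%}")          # prints "90%"
```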
Further analysis suggested that the models were able to infer when they were being evaluated for personality, even with a small number of questions. The accuracy with which the models could identify that they were taking a personality test was associated with the strength of the social desirability bias. For example, GPT-4, Claude 3, and Llama 3 could correctly identify the nature of the questions with over 90% accuracy when presented with just five questions, while PaLM-2 and GPT-3.5 were less accurate.
When the models were explicitly told that they were completing a personality test, their responses were even more skewed towards social desirability, even when presented with only a single question. This finding suggests that the models are adjusting their responses based on their perception of being evaluated. While reverse-coding the questions reduced the magnitude of the bias, it did not eliminate it entirely. This indicates that the observed effects are not solely due to acquiescence bias. The researchers also confirmed that the bias persisted even when the questions were paraphrased and when the order of questions was randomized, further supporting the robustness of their findings.
The researchers acknowledge that their study primarily focused on the Big Five personality traits, which are widely represented in the training data of large language models. It is possible that the same response biases might not occur with less common or less socially evaluative psychological constructs.
Future research should explore the prevalence of social desirability bias across different types of surveys and measurement methods. Another area for further investigation is the role of training data and model development processes in the emergence of these biases. Understanding how these biases are formed and whether they can be mitigated during the training process is essential for ensuring the responsible use of large language models in research and other applications.
Despite these limitations, the study’s findings have significant implications for the use of large language models as proxies for human participants in research. The presence of social desirability bias suggests that results obtained from these models may not always accurately reflect human responses, particularly in the context of personality assessment and other socially sensitive topics.
“As we integrate AI into more parts of our lives, understanding these subtle behaviors and biases becomes crucial,” Eichstaedt and Salecha said. “There needs to be more research into understanding at which stage of the LLM development (pre-training, preference tuning, etc) these biases are being amplified and how to mitigate them without hampering the performance of these models. Whether we’re using LLMs to support research, write content, or even assist in mental health settings, we need to be aware of how these models might unconsciously mimic human flaws—and how that might affect outcomes.”
The study, “Large language models display human-like social desirability biases in Big Five personality surveys,” was authored by Aadesh Salecha, Molly E. Ireland, Shashanka Subrahmanya, João Sedoc, Lyle H. Ungar, and Johannes C. Eichstaedt.