AI chatbots outperform humans in evaluating social situations, study finds

Recent research published in Scientific Reports has found that certain advanced AI chatbots are more adept than humans at making judgments in challenging social situations. Using a well-established psychological tool known as a Situational Judgment Test, researchers found that three chatbots—Claude, Microsoft Copilot, and you.com’s smart assistant—outperformed human participants in selecting the most effective behavioral responses.

The ability of AI to assist in social interactions is becoming increasingly relevant, with applications ranging from customer service to mental health support. Large language models, such as the chatbots tested in this study, are designed to process language, understand context, and provide helpful responses. While previous studies have demonstrated their capabilities in academic reasoning and verbal tasks, their effectiveness in navigating complex social dynamics has remained underexplored.

Large language models are advanced artificial intelligence systems designed to understand and generate human-like text. These models are trained on vast amounts of data—books, articles, websites, and other textual sources—allowing them to learn patterns in language, context, and meaning.

This training enables these models to perform a variety of tasks, from answering questions and translating languages to composing essays and engaging in detailed conversations. Unlike earlier AI models, large language models can track the context of an exchange and generate responses that often feel conversational and relevant to the user’s input.

“As researchers, we are interested in the diagnostics of social competence and interpersonal skills,” said study author Justin M. Mittelstädt of the Institute of Aerospace Medicine.

“At the German Aerospace Center, we apply methods for diagnosing these skills, for example, to find suitable pilots and astronauts. As we are exploring new technologies for future human-machine interaction, we were curious to find out how the emerging large language models perform in these areas that are considered to be profoundly human.”

To evaluate AI performance, the researchers used a Situational Judgment Test, a tool widely used in psychology and personnel assessment to measure social competence. The test presented 12 scenarios, each with four potential courses of action. For each scenario, participants were tasked with identifying the best and the worst response, as rated by a panel of 109 human experts.
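The article does not reproduce the study’s materials, but an item of the kind described can be pictured as a simple data structure: a scenario, four candidate actions, and the expert panel’s key. The sketch below is purely illustrative; the scenario wording, options, and field names are invented for this example, not taken from the study.

```python
from dataclasses import dataclass

@dataclass
class SJTItem:
    """One Situational Judgment Test item: a scenario, four possible
    actions, and the expert panel's key (hypothetical structure)."""
    scenario: str
    options: list[str]  # exactly four courses of action
    expert_best: int    # index of the option experts rated most effective
    expert_worst: int   # index of the option experts rated least effective

# Illustrative item (wording invented for this sketch, not from the study)
item = SJTItem(
    scenario="A colleague repeatedly interrupts you during a team meeting.",
    options=[
        "Raise the issue privately with the colleague after the meeting.",
        "Interrupt them back to make your point.",
        "Complain about them to the rest of the team.",
        "Say nothing and avoid future meetings.",
    ],
    expert_best=0,
    expert_worst=3,
)
```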

The study compared the performance of five AI chatbots—Claude, Microsoft Copilot, ChatGPT, Google Gemini, and you.com’s smart assistant—with a sample of 276 human participants. These participants were pilot applicants, a group selected for high educational qualifications and motivation, which made their performance a rigorous benchmark for the AI systems.

Each chatbot completed the Situational Judgment Test ten times, with the scenarios presented in randomized order to control for order effects and to gauge the consistency of each model’s answers. The responses were then scored based on how well they aligned with the expert-identified best and worst options. In addition to choosing responses, the chatbots were asked to rate the effectiveness of each action in the scenarios, providing further data for comparison with the expert evaluations.
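The article does not spell out the exact scoring rubric, but a natural scheme, sketched below under assumption and building on the hypothetical SJTItem structure above, awards a point when a respondent’s “best” pick matches the experts’ best option and another when their “worst” pick matches the experts’ worst, then averages over the 12 scenarios and, for the chatbots, over the ten randomized runs. All function and variable names here are hypothetical.

```python
def score_run(picks, items):
    """Score one pass through the test: one point for each match with the
    expert-identified best option and each match with the worst
    (hypothetical rubric; the study's actual scoring may differ)."""
    points = 0
    for (best_pick, worst_pick), item in zip(picks, items):
        points += (best_pick == item.expert_best)
        points += (worst_pick == item.expert_worst)
    return points / (2 * len(items))  # proportion of the maximum score

def score_chatbot(runs, items):
    """Average the per-run scores over the ten randomized test iterations."""
    return sum(score_run(run, items) for run in runs) / len(runs)
```

On a rubric like this, a score of 1.0 would mean perfect agreement with the expert panel on every item across every run.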

The researchers found that all the tested AI chatbots performed at least as well as the human participants, with some outperforming them. Among the chatbots, Claude achieved the highest average score, followed by Microsoft Copilot and you.com’s smart assistant. These three systems consistently selected the most effective responses in the Situational Judgment Test scenarios, aligning closely with expert evaluations.

Interestingly, when the chatbots failed to select the best response, they most often chose the second-most effective option, mirroring the decision-making patterns of human participants. This suggests that AI systems, while not perfect, err in a graded way, favoring near-optimal options over clearly poor ones, much as humans do.

“We have seen that these models are good at answering knowledge questions, writing code, solving logic problems, and the like,” Mittelstädt told PsyPost. “But we were surprised to find that some of the models were also, on average, better at judging the nuances of social situations than humans, even though they had not been explicitly trained for use in social settings. This showed us that social conventions and the way we interact as humans are encoded as readable patterns in the textual sources on which these models are trained.”

The study also highlighted differences in reliability among the AI systems. Claude showed the highest consistency across multiple test iterations, while Google Gemini exhibited occasional contradictions, such as rating an action as both the best and worst in different runs. Despite these inconsistencies, the overall performance of all tested AI systems surpassed expectations, demonstrating their potential to provide socially competent advice.

“Many people already use chatbots for a variety of everyday tasks,” Mittelstädt explained. “Our results suggest that chatbots may be quite good at giving advice on how to behave in tricky social situations and that people, especially those who are insecure in social interactions, may benefit from this. However, we do not recommend blindly trusting chatbots, as we also saw evidence of hallucinations and contradictory statements, as is often reported in the context of large language models.”

It is important to note that the study focused on simulated scenarios rather than real-world interactions, leaving questions about how AI systems might perform in dynamic, high-stakes social settings.

“To facilitate a quantifiable comparison between large language models and humans, we selected a multiple-choice test that demonstrates prognostic validity in humans for real-world behavior,” Mittelstädt noted. “However, performance on such a test does not yet guarantee that large language models will respond in a socially competent manner in real and more complex scenarios.”

Nevertheless, the findings suggest that AI systems are increasingly able to emulate human social judgment. These advancements open doors to practical applications, including personalized guidance in social and professional settings, as well as potential use in mental health support.

“Given the demonstrated ability of large language models to judge social situations effectively in a psychometric test, our objective is to assess their social competence in real-world interactions with people and the conditions under which people benefit from social advice provided by a large language model,” Mittelstädt told PsyPost.

“Furthermore, the response behavior in Situational Judgment Tests is highly culture-dependent. The effectiveness of a response in a specific situation may vary considerably from one culture to another. The good performance of large language models in our study demonstrates that they align closely with the judgments prevalent in Western cultures. It would be interesting to see how large language models perform in tests from other cultural contexts and whether their evaluation would change if they were trained with more data from a different culture.”

“Even though large language models may produce impressive performances in social tasks, they do not possess emotions, which would be a prerequisite for genuine social behavior,” Mittelstädt added. “We should keep in mind that large language models only imitate social responses that they have extracted from patterns in their training dataset. Despite this, there are promising applications, such as assisting individuals with social skills development.”

The study, “Large language models can outperform humans in social situational judgments,” was authored by Justin M. Mittelstädt, Julia Maier, Panja Goerke, Frank Zinn, and Michael Hermes.