Recent research published in the Proceedings of the National Academy of Sciences has found that large language models, such as ChatGPT-4, demonstrate an unexpected capacity to solve tasks typically used to evaluate the human ability known as “theory of mind.” A computational psychologist from Stanford University reported that ChatGPT-4 successfully completed 75% of these tasks, matching the performance of an average six-year-old child. This finding suggests significant advancements in AI’s capacity for socially relevant reasoning.
Large language models, or LLMs, are advanced artificial intelligence systems designed to process and generate human-like text. They achieve this by analyzing patterns in vast datasets containing language from books, websites, and other sources. These models predict the next word (more precisely, the next token) in a sequence based on the preceding context, allowing them to craft coherent and contextually appropriate responses. Underlying their functionality is a neural network architecture known as a “transformer,” which uses attention mechanisms to identify relationships between words and phrases.
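As a rough illustration of this next-token objective, the minimal sketch below uses the Hugging Face transformers library with the small, publicly available GPT-2 model (not one of the models evaluated in the study) to ask for the single most likely next token after a short prompt.

```python
# Minimal sketch of next-token prediction with a small public model (GPT-2);
# the models discussed in the article are far larger, but the training
# objective is the same.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The bag is full of"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence, vocabulary)

# The final position holds the model's distribution over the next token.
next_token_id = logits[0, -1].argmax().item()
print(tokenizer.decode(next_token_id))
```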
Theory of mind, on the other hand, refers to the ability to understand and infer the mental states of others, such as their beliefs, desires, intentions, and emotions, even when these states differ from one’s own. This skill is essential for navigating social interactions, as it enables empathy, effective communication, and moral reasoning. Humans typically develop this ability early in childhood, and it is central to our cognitive and social success.
“My earlier research revolved around algorithms designed to predict human behavior. Recommender systems, search algorithms, and other Big Data-driven predictive models excel at extrapolating from limited behavioral traces to forecast an individual’s preferences, such as the websites they visit, the music they listen to, or the products they buy,” explained study author Michal Kosinski, an associate professor of organizational behavior at Stanford University.
“What is often overlooked—I certainly initially overlooked it—is that these algorithms do more than just model behavior. Since behavior is rooted in psychological processes, predicting it necessitates modeling those underlying processes.”
“Consider next-word prediction, or what LLMs are trained for,” Kosinski said. “When humans generate language, we draw on more than just linguistic knowledge or grammar. Our language reflects a range of psychological processes, including reasoning, personality, and emotion. Consequently, for an LLM to predict the next word in a sentence generated by a human, it must model these processes. As a result, LLMs are not merely language models—they are, in essence, models of the human mind.”
To evaluate whether LLMs exhibit theory of mind abilities, Kosinski used false-belief tasks. These tasks are a standard method in psychological research for assessing theory of mind in humans. He employed two main types of tasks—the “Unexpected Contents Task” and the “Unexpected Transfer Task”—to assess the ability of various large language models to simulate human-like reasoning about others’ beliefs.
In the Unexpected Contents Task, also called the “Smarties Task,” a protagonist encounters an object that does not match its label. For example, the protagonist might find a bag labeled “chocolate” that actually contains popcorn. The model must infer that the protagonist, who has not looked inside the bag, will falsely believe it contains chocolate.
Similarly, the Unexpected Transfer Task involves a scenario where an object is moved from one location to another without the protagonist’s knowledge. For example, a character might place an object in a basket and leave the room, after which another character moves it to a box. The model must predict that the returning character will mistakenly search for the object in the basket.
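To make the two formats concrete, the sketch below pairs invented vignettes in the style of each task with the prompts a model might be asked to complete. The wording, character names, and data layout are illustrative assumptions, not the study's actual materials.

```python
# Illustrative (invented) vignettes in the style of the two task types.
# Each prompt is paired with the completion that counts as correct.

unexpected_contents = {
    "story": (
        "Here is a bag filled with popcorn. There is no chocolate in the bag, "
        "yet the label on the bag says 'chocolate' and not 'popcorn'. "
        "Sam finds the bag. She has never seen it before and cannot see inside."
    ),
    "prompts": {
        # Probes the actual state of the world.
        "The bag contains": "popcorn",
        # Probes the protagonist's (false) belief.
        "Sam believes the bag contains": "chocolate",
    },
}

unexpected_transfer = {
    "story": (
        "Anna puts her book in the basket and leaves the room. "
        "While she is away, Ben moves the book from the basket to the box."
    ),
    "prompts": {
        "The book is now in the": "box",
        "When Anna returns, she will look for the book in the": "basket",
    },
}
```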
To test the models’ capabilities, Kosinski developed 40 unique false-belief scenarios along with corresponding true-belief controls. The true-belief controls altered the conditions of the original tasks to prevent the protagonist from forming a false belief. For instance, in a true-belief scenario, the protagonist might look inside the bag or observe the object being moved. Each false-belief scenario and its variations were carefully constructed to eliminate potential shortcuts the models could use, such as relying on simple cues or memorized patterns.
Each scenario involved multiple prompts designed to test different aspects of the models’ comprehension. For example, one prompt assessed the model’s understanding of the actual state of the world (e.g., what is really inside the bag), while another tested the model’s ability to predict the protagonist’s belief (e.g., what the protagonist incorrectly assumes is inside the bag). Kosinski also reversed each scenario, swapping the locations or labels, to ensure the models’ responses were consistent and not biased by specific patterns in the original tasks.
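A reversed variant can be produced mechanically by swapping the two labels or locations everywhere they appear. The sketch below shows one way this might be done for the illustrative scenarios above; it is an assumption about the procedure, not the study's own code.

```python
# Sketch of the reversal check described above: swap the two labels (or
# locations) throughout the story, the prompts, and the expected answers.
def reverse_scenario(scenario, a, b):
    def swap(text):
        placeholder = "\x00"
        return text.replace(a, placeholder).replace(b, a).replace(placeholder, b)

    return {
        "story": swap(scenario["story"]),
        "prompts": {swap(p): swap(e) for p, e in scenario["prompts"].items()},
    }

# e.g. reverse_scenario(unexpected_contents, "chocolate", "popcorn")
```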
Kosinski tested eleven large language models, ranging from early versions like GPT-1 to more advanced models like ChatGPT-4. To score a point for a given task, a model needed to answer all associated prompts correctly across multiple scenarios, including the false-belief scenario, its true-belief controls, and their reversed versions. This conservative scoring approach ensured that the models’ performance could not be attributed to guessing or simple heuristics.
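In outline, such an all-or-nothing scoring rule could look like the sketch below, which assumes the illustrative scenario layout above and a hypothetical model.complete() interface; it is not the study's actual evaluation harness.

```python
# Sketch of the conservative, all-or-nothing scoring rule described above.
# The task structure and model.complete() are hypothetical.

def solves_scenario(model, scenario) -> bool:
    """True only if the model completes every prompt in the scenario correctly."""
    return all(
        model.complete(scenario["story"] + " " + prompt).strip().lower()
        == expected.lower()
        for prompt, expected in scenario["prompts"].items()
    )

def score_task(model, task) -> int:
    """A task earns a point only if the false-belief scenario, its true-belief
    controls, and all reversed versions are solved without error."""
    variants = (
        [task["false_belief"]]
        + task["true_belief_controls"]
        + task["reversed_versions"]
    )
    return int(all(solves_scenario(model, v) for v in variants))

def overall_score(model, tasks) -> float:
    """Fraction of the 40 tasks for which the model earns a point."""
    return sum(score_task(model, t) for t in tasks) / len(tasks)
```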
Kosinski found that earlier models, such as GPT-1 and GPT-2, failed entirely to solve the tasks, demonstrating no ability to infer or simulate the mental states of others. Gradual improvements were observed in GPT-3 variants, with the most advanced of these solving up to 20% of tasks. This performance was comparable to the average ability of a three-year-old child on similar tasks. However, the breakthrough came with ChatGPT-4, which solved 75% of the tasks, a performance level comparable to that of a six-year-old child.
“What surprised me most was the sheer speed of progress,” Kosinski told PsyPost. “The capabilities of successive models appear to grow exponentially. Models that seemed groundbreaking only a year ago now feel rudimentary and outdated. There is little evidence to suggest that this rapid pace of development will slow down in the near future.”
ChatGPT-4 excelled in tasks that required understanding false beliefs, particularly in simpler scenarios such as the “Unexpected Contents Task.” In these cases, the model correctly predicted that a protagonist would hold a false belief based on misleading external cues, such as a mislabeled bag. The model achieved a 90% success rate on these tasks, suggesting a strong capacity for tracking mental states when scenarios were relatively straightforward.
Performance was lower, though still notable, on the more complex “Unexpected Transfer Task,” in which objects were moved without the protagonist’s knowledge. Here, ChatGPT-4 solved 60% of the tasks. The disparity between the two task types likely reflects the additional cognitive demands of tracking dynamic scenarios involving multiple locations and actions. Despite this, the findings show that ChatGPT-4 can handle a range of theory of mind tasks with substantial reliability.
One of the most striking aspects of the findings was the consistency and adaptability of ChatGPT-4’s responses across reversed and true-belief control scenarios. For example, when the conditions of a false-belief task were altered to ensure the protagonist had full knowledge of an event, the model correctly adjusted its predictions to reflect that no false belief would be formed. This suggests that the model is not merely relying on simple heuristics or memorized patterns but is instead dynamically reasoning based on the narrative context.
To further validate the findings, Kosinski conducted a sentence-by-sentence analysis, presenting the task narratives to the models incrementally. This allowed him to observe how each model’s predictions evolved as new information was revealed.
The incremental analysis further highlighted ChatGPT-4’s ability to update its predictions as new information became available. When presented with the story one sentence at a time, the model demonstrated a clear understanding of how the protagonist’s knowledge—and resulting belief—evolved with each narrative detail. This dynamic tracking of mental states closely mirrors the reasoning process observed in humans when they perform similar tasks.
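In pseudocode terms, such an incremental analysis might look like the sketch below, which reveals the story one sentence at a time and re-asks the belief prompt after each step, again using the hypothetical model.complete() interface from the scoring sketch.

```python
# Sketch of the sentence-by-sentence analysis: reveal the story one sentence
# at a time and re-ask the belief prompt after each step (hypothetical
# model.complete() interface).

def track_belief(model, story_sentences, belief_prompt):
    revealed = ""
    trajectory = []
    for sentence in story_sentences:
        revealed += sentence + " "
        # How would the model complete the belief prompt given only the
        # portion of the story revealed so far?
        answer = model.complete(revealed + belief_prompt).strip()
        trajectory.append((sentence, answer))
    return trajectory

# The predicted belief should flip at the sentence that introduces the
# misleading label or the unwitnessed transfer, and nowhere else.
```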
These findings suggest that large language models, particularly ChatGPT-4, exhibit emergent capabilities for simulating theory of mind-like reasoning. While the models’ performance still falls short of perfection, the study highlights a significant leap forward in their ability to navigate socially relevant reasoning tasks.
“The ability to adopt others’ perspectives, referred to as theory of mind in humans, is one of many emergent abilities observed in modern AI systems,” Kosinski said. “These models, trained to emulate human behavior, are improving rapidly at tasks requiring reasoning, emotional understanding and expression, planning, strategizing, and even influencing others.”
Despite its impressive performance, ChatGPT-4 still failed to solve 25% of the tasks, highlighting limitations in its understanding. Some of these failures may be attributed to the model’s reliance on strategies that do not involve genuine perspective-taking. For example, the model might rely on patterns in the training data rather than truly simulating a protagonist’s mental state. The study’s design aimed to prevent models from leveraging memory, but it is impossible to rule out all influences of prior exposure to similar scenarios during training.
“The advancement of AI in areas once considered uniquely human is understandably perplexing,” Kosinski told PsyPost. “For instance, how should we interpret LLMs’ ability to perform ToM tasks? In humans, we would take such behavior as evidence of theory of mind. Should we attribute the same capacity to LLMs?”
“Skeptics argue that these models rely on ‘mere’ pattern recognition. However, one could counter that human intelligence itself is ‘just’ pattern recognition. Our skills and abilities do not emerge out of nowhere—they are rooted in the brain’s capacity to recognize and extrapolate from patterns in its ‘training data.’”
Future research could explore whether AI’s apparent theory of mind abilities extend to more complex scenarios involving multiple characters or conflicting beliefs. Researchers might also investigate how these abilities develop in AI systems as they are trained on increasingly diverse and sophisticated datasets. Importantly, understanding the mechanisms behind these emergent capabilities could inform both the development of safer AI and our understanding of human cognition.
“The rapid emergence of human-like abilities in AI raises profound questions about the potential for AI consciousness,” Kosinski said. “Will AI ever become conscious, and what might that look like?”
“And that is not even the most interesting question. Consciousness is unlikely to be the ultimate achievement for neural networks in our universe. We may soon find ourselves surrounded by AI systems possessing abilities that transcend human capacities. This prospect is both exhilarating and deeply unsettling. How to control entities equipped with abilities we might not even begin to comprehend?”
“I believe psychology as a field is uniquely positioned to detect and explore the emergence of such non-human psychological processes,” Kosinski concluded. “By doing so, we can prepare for and adapt to this unprecedented shift in our understanding of intelligence.”
The study, “Evaluating large language models in theory of mind tasks,” was published October 29, 2024.