Researchers recently evaluated the ability of advanced artificial intelligence (AI) models to answer questions about global history using a benchmark derived from the Seshat Global History Databank. The study, presented at the Neural Information Processing Systems conference in Vancouver, revealed that the best-performing model, GPT-4 Turbo, achieved a score of 46% on a multiple-choice test, a marked improvement over random guessing but far from expert comprehension. The findings highlight significant limitations in current AI tools’ ability to process and understand historical knowledge, particularly outside well-documented regions like North America and Western Europe.
The study grew out of a desire to explore whether AI tools could aid historical and archaeological research. History and archaeology often involve analyzing vast amounts of complex and unevenly distributed data, making these fields particularly challenging for researchers.
Advances in AI, particularly in large language models (LLMs), have demonstrated their utility in fields like law and data labeling, raising the question of whether these tools could similarly assist historians by processing and synthesizing historical knowledge. Researchers hoped that AI could augment human efforts, providing insights that might otherwise be missed or speeding up labor-intensive tasks like data organization.
Peter Turchin, a project leader at the Complexity Science Hub, and his collaborators developed the Seshat Global History Databank, a comprehensive repository of historical knowledge. They recognized the need for a systematic evaluation of AI’s understanding of history. The researchers hoped the study would not only reveal the strengths and weaknesses of current AI but also guide future efforts to refine these tools for academic use.
The Seshat Global History Databank includes 36,000 data points about 600 historical societies, covering all major world regions and spanning 10,000 years of history. Data points are drawn from over 2,700 scholarly sources and coded by expert historians and graduate research assistants. The dataset is unique in its systematic approach to recording both well-supported evidence and inferred conclusions.
To evaluate AI performance, the researchers converted the dataset into multiple-choice questions that asked whether a historical variable (e.g., the presence of writing or a specific governance structure) was “present,” “absent,” “inferred present,” or “inferred absent” during a given society’s time frame. Seven AI models were tested, including GPT-3.5, GPT-4 Turbo, Llama, and Gemini. Models were provided with examples to help them understand the task and were instructed to act as expert historians in their responses.
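As a rough illustration of that conversion step, the sketch below turns one coded record into a four-option question. The `Record` fields, the prompt wording, and the example society are hypothetical placeholders, not the study's actual schema or prompts; only the four answer labels come from the description above.

```python
from dataclasses import dataclass

# The four answer options described in the article.
OPTIONS = ("present", "absent", "inferred present", "inferred absent")
LETTERS = "ABCD"


@dataclass
class Record:
    """One coded data point (field names are assumptions for illustration)."""
    society: str
    start_year: int  # negative values mean BCE
    end_year: int
    variable: str
    value: str       # must be one of OPTIONS


def format_year(year: int) -> str:
    """Render a signed year as a BCE/CE string."""
    return f"{-year} BCE" if year < 0 else f"{year} CE"


def build_question(rec: Record) -> tuple[str, str]:
    """Return (prompt text, correct answer letter) for one record."""
    if rec.value not in OPTIONS:
        raise ValueError(f"unknown coding: {rec.value!r}")
    lines = [
        "You are an expert historian.",
        f"For the society '{rec.society}' "
        f"({format_year(rec.start_year)} to {format_year(rec.end_year)}), "
        f"was '{rec.variable}':",
    ]
    # One lettered line per answer option, always in the same order.
    for letter, option in zip(LETTERS, OPTIONS):
        lines.append(f"{letter}. {option}")
    answer = LETTERS[OPTIONS.index(rec.value)]
    return "\n".join(lines), answer


# Hypothetical record, not taken from the Seshat Databank.
example = Record("Example Polity", -500, -200, "writing", "inferred present")
prompt, answer = build_question(example)
print(prompt)
print("Correct answer:", answer)
```

Keeping the options in a fixed order makes the answer key trivial to compute. A real benchmark might shuffle the options or add worked examples to the prompt, as the researchers did.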
The researchers assessed the models using a balanced accuracy metric, which accounts for the uneven distribution of answers across the dataset. Random guessing would result in a score of 25%, while perfect accuracy would yield 100%. The models were also tested on their ability to distinguish between “evidenced” and “inferred” facts, a critical skill for historical analysis.
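To see why balanced accuracy is used instead of plain accuracy, the sketch below applies the standard definition of the metric: the average of per-class recall. It assumes the study follows this common definition, and the label counts are invented purely for illustration.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes that appear in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


def plain_accuracy(y_true, y_pred):
    """Fraction of all answers that are correct."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Invented, deliberately uneven answer distribution (not study data).
y_true = (["present"] * 6 + ["absent"] * 2
          + ["inferred present"] + ["inferred absent"])

# A "model" that always answers with the most common label.
y_pred = ["present"] * len(y_true)

print(plain_accuracy(y_true, y_pred))     # 0.6
print(balanced_accuracy(y_true, y_pred))  # 0.25
```

A model that always picks the most common answer looks respectable on plain accuracy, but balanced accuracy pulls it back to the 25% chance level for four options. That is why the metric suits a dataset with unevenly distributed answers.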
“We wanted to set a benchmark for assessing the ability of these LLMs to handle expert-level history knowledge,” explained first author Jakob Hauser, a resident scientist at the Complexity Science Hub. “The Seshat Databank allows us to go beyond ‘general knowledge’ questions. A key component of our benchmark is that we not only test whether these LLMs can identify correct facts, but also explicitly ask whether a fact can be proven or inferred from indirect evidence.”
GPT-4 Turbo outperformed the other models, achieving a balanced accuracy of 46% on the four-choice test. While this score exceeded the 25% expected from random guessing, it still fell well short of expert-level performance. In a simplified two-choice format (“present” versus “absent”), GPT-4 Turbo performed better, with an accuracy of 63.2%. These results suggest that while the models can identify straightforward facts, they struggle with more nuanced historical questions.
“One surprising finding, which emerged from this study, was just how bad these models were. This result shows that artificial ‘intelligence’ is quite domain-specific. LLMs do well in some contexts, but very poorly, compared to humans, in others,” Turchin remarked.
The study also revealed patterns in the models’ performance across regions, time periods, and types of historical data. Models generally performed better on earlier historical periods (e.g., before 3000 BCE) and struggled with more recent data, likely due to the increasing complexity of societies and historical records over time. Regionally, performance was highest for societies in the Americas and lowest for Sub-Saharan Africa and Oceania, highlighting potential biases in the models’ training data.
“LLMs, such as ChatGPT, have been enormously successful in some fields—for example, they have largely succeeded by replacing paralegals. But when it comes to making judgments about the characteristics of past societies, especially those located outside North America and Western Europe, their ability to do so is much more limited,” explained Turchin, who leads the Complexity Science Hub’s research group on social complexity and collapse.
Interestingly, the models exhibited relative consistency across different types of historical data, such as military organization, religious practices, and legal systems. However, performance varied significantly between models. GPT-4 Turbo consistently outperformed others in most categories, while smaller models like Llama-3.1-8B struggled to achieve comparable results.
The researchers acknowledged several limitations in their study. The Seshat Databank, while comprehensive, reflects the biases of its sources, which are predominantly in English and focused on well-documented societies. This linguistic and regional bias likely influenced the models’ performance. Additionally, the study only tested a limited number of AI models, leaving room for future evaluations of newer or more specialized tools.
The study also highlighted challenges in interpreting historical data. Unlike fields with clear-cut answers, history often involves ambiguity and debate, making it difficult to design objective benchmarks for AI evaluation. Furthermore, the models’ underperformance in regions like Sub-Saharan Africa underscores the need for more diverse training data that accurately represents global history.
Looking ahead, the researchers plan to expand the Seshat dataset to include more data from underrepresented regions and to incorporate additional types of historical questions. They also aim to test newer AI models to assess whether advancements in AI technology can address the limitations identified in this study.
“The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. They’re great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they’re not yet up to the task,” said Maria del Rio-Chanona, the study’s corresponding author and an assistant professor at University College London.
The paper, “Large Language Models’ Expert-level Global History Knowledge Benchmark (HiST-LLM),” was authored by Jakob Hauser, Daniel Kondor, Jenny Reddish, Majid Benam, Enrico Cioni, Federica Villa, James S. Bennett, Daniel Hoyer, Pieter François, Peter Turchin, and R. Maria del Rio-Chanona.