Large language models outperform experts in predicting neuroscience discoveries

Large language models surpass human experts in predicting neuroscience results, according to a study published in Nature Human Behaviour.

Scientific research is increasingly challenging due to the immense growth in published literature. Integrating noisy and voluminous findings to predict outcomes often exceeds human capacity. This investigation was motivated by the growing role of artificial intelligence in tasks such as protein folding and drug discovery, raising the question of whether LLMs could similarly enhance fields like neuroscience.

Xiaoliang Luo and colleagues developed BrainBench, a benchmark designed to test whether LLMs could predict the results of neuroscience studies more accurately than human experts. BrainBench included 200 test cases based on neuroscience research abstracts. Each test case consisted of two versions of the same abstract: one was the original, and the other had a modified result that changed the study’s conclusion but kept the rest of the abstract coherent. Participants, both LLMs and human experts, were tasked with identifying which version reported the actual result.

The study involved 171 human participants, all neuroscience experts with an average of 10 years of experience, including doctoral students, postdoctoral researchers, and academic staff. On the computational side, general-purpose LLMs were tested alongside BrainGPT, a specialized model fine-tuned on over 1.3 billion tokens from neuroscience literature. To ensure a comprehensive assessment, BrainBench covered five major subfields of neuroscience: behavioral/cognitive, cellular/molecular, systems/circuits, neurobiology of disease, and development/plasticity/repair.

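To evaluate the LLMs, the researchers used “perplexity,” a measure of how surprising a model finds a passage of text: the lower the perplexity, the better the model predicted that text. A model was scored as correct when it assigned lower perplexity to the original abstract than to the altered one, while human accuracy was measured as the proportion of correct choices. The researchers also checked that the test items were unlikely to have been memorized from the LLMs’ training data, addressing concerns about memorization.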
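
As a rough illustration of how a perplexity-based choice between two abstract versions might work, the sketch below computes perplexity from per-token log-probabilities and picks the version with the lower value. The function names, the made-up log-probabilities, and the use of the perplexity gap as a confidence proxy are assumptions for illustration, not the study's actual code; in practice the log-probabilities would come from a language model.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def choose_version(logprobs_a, logprobs_b):
    """Pick the version with lower perplexity.

    Returns the chosen label and the absolute perplexity gap, used here
    as a simple (assumed) stand-in for the model's confidence.
    """
    ppl_a = perplexity(logprobs_a)
    ppl_b = perplexity(logprobs_b)
    choice = "A" if ppl_a < ppl_b else "B"
    return choice, abs(ppl_a - ppl_b)


# Made-up per-token log-probabilities, for illustration only.
original = [-1.2, -0.8, -1.0, -0.9]   # version A: the real abstract
altered = [-1.2, -0.8, -2.5, -1.9]    # version B: the modified result

choice, gap = choose_version(original, altered)
print(f"chosen version: {choice}, perplexity gap: {gap:.2f}")
```

In this toy case the altered result contains tokens the model finds less likely, so its perplexity is higher and the original version is selected.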

LLMs significantly outperformed human experts in predicting neuroscience study outcomes. On average, LLMs achieved 81.4% accuracy, compared to 63.4% for human participants. BrainGPT, the model fine-tuned with neuroscience knowledge, performed even better, improving accuracy by 3% over general-purpose LLMs. This specialized training allowed BrainGPT to excel across all five neuroscience subfields included in the benchmark.

One key advantage of the LLMs was their ability to integrate information from the entire abstract, including the background and methods, rather than relying on isolated details. When tested with only the results section, their accuracy dropped, demonstrating the importance of contextual understanding. Human experts, by contrast, struggled to achieve the same level of integration. Additionally, both humans and LLMs showed higher accuracy when they were confident in their predictions, but LLMs displayed better alignment between confidence and correctness.
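
The relationship between confidence and correctness described above can be checked by grouping answers by confidence and comparing accuracy across groups. The sketch below is one simple way to do that, assuming each answer is recorded as a (confidence, is_correct) pair; the function name, the equal-size binning, and the data are hypothetical and not taken from the study.

```python
def accuracy_by_confidence(records, n_bins=2):
    """Sort answers by confidence, split them into equal-size bins,
    and return the accuracy of each bin (lowest confidence first)."""
    n = len(records)
    if n_bins < 1 or n < n_bins:
        raise ValueError("need at least one record per bin")
    ranked = sorted(records, key=lambda r: r[0])
    accuracies = []
    for i in range(n_bins):
        # Integer arithmetic keeps every bin non-empty when n >= n_bins.
        chunk = ranked[i * n // n_bins:(i + 1) * n // n_bins]
        correct = sum(1 for _, ok in chunk if ok)
        accuracies.append(correct / len(chunk))
    return accuracies


# Made-up (confidence, is_correct) pairs, for illustration only.
answers = [
    (0.1, False), (0.2, True), (0.3, False),
    (0.7, True), (0.8, True), (0.9, True),
]
low_acc, high_acc = accuracy_by_confidence(answers, n_bins=2)
print(f"low-confidence accuracy: {low_acc:.2f}, high-confidence accuracy: {high_acc:.2f}")
```

When accuracy rises from the low-confidence bin to the high-confidence bin, confidence carries useful information about correctness; a steeper, more consistent rise indicates better alignment.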

Importantly, the analysis indicated that the LLMs’ success reflected an ability to recognize patterns in neuroscience research rather than memorization of the test items, highlighting their potential to assist in scientific discovery.

The authors acknowledge that BrainBench, while innovative, is labor-intensive to create. Moreover, there is a risk that reliance on LLM predictions could discourage researchers from pursuing studies that contradict AI predictions, potentially stifling innovation.

The study, “Large language models surpass human experts in predicting neuroscience results,” was authored by Xiaoliang Luo, Akilles Rechardt, Guangzhi Sun, Kevin K. Nejad, Felipe Yáñez, and colleagues.