A recent study published in PLOS One provides evidence that artificial intelligence can be just as helpful as a human tutor when it comes to learning mathematics. Researchers found that students using hints generated by ChatGPT, a popular artificial intelligence chatbot, showed learning improvements in algebra and statistics comparable to those of students who received human-authored hints.
Educational technology is increasingly looking towards advanced artificial intelligence tools like ChatGPT to enhance learning experiences. The chatbot’s ability to generate human-like text has sparked interest in its potential for tutoring and providing educational support. Many believe this technology could make personalized learning more accessible and efficient. However, there has been limited research to understand just how effective and reliable these artificial intelligence systems are in actual learning scenarios, particularly in academic subjects like mathematics.
Creating helpful learning materials for online education, such as hints and worked examples, is a time-consuming and expensive process. Traditionally, educators and subject matter experts must manually develop, refine, and check these resources. This often involves many rounds of revisions and quality control. If artificial intelligence like ChatGPT could automatically generate high-quality and effective learning support, it could dramatically reduce the effort and cost involved in developing educational tools. This could pave the way for wider access to tutoring systems and more personalized learning experiences across various subjects and educational levels.
“As a researcher in the space of AI in education, there were a lot of burning questions that the introduction of ChatGPT provoked that were not yet answered,” said study author Zachary A. Pardos, an associate professor at UC Berkeley School of Education.
“While OpenAI provided some report cards on performance, hallucination rates at the level of granular academic subjects were not well established. The essential questions being asked were how often does this technology make mistakes in key STEM areas and can its outputs lead to learning.”
“Also shaping these questions for us was our development of an open source adaptive tutoring system (oatutor.io) and curation of content for that system. We, a research lab, were basically a small publisher, and content production was time consuming. From an efficiency and scaling perspective, the role of AI, ChatGPT in particular, to help our team produce materials more quickly without a measurable decrease in quality was an important question.”
The researchers conducted an online study involving 274 participants recruited through Amazon Mechanical Turk, a platform for online tasks. All participants had at least a high school degree and had a designation on the platform indicating a history of successful task completion. This ensured they possessed the basic math skills necessary to potentially benefit from the study and that they were reliable online participants.
The study used a carefully designed experiment where participants were randomly assigned to one of three conditions: a control group with no hints, a group receiving hints created by human tutors, and a group receiving hints generated by ChatGPT. Within each of these hint conditions, participants were further randomly assigned to work on problems from one of four mathematics subjects: Elementary Algebra, Intermediate Algebra, College Algebra, or Statistics. The math problems were taken from freely available online textbooks.
The researchers used an open-source online tutoring system as the platform for the study. This system delivered math problems and, depending on the assigned condition, provided hints. For the human tutor hint condition, the system used pre-existing hints that had been developed by undergraduate students with prior math tutoring experience. These human-created hints were designed to guide students step-by-step through the problem-solving process. For the ChatGPT hint condition, the researchers generated new hints specifically for this study. They prompted ChatGPT with each math problem and used its text-based output as the hint.
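The article does not reproduce the exact prompt the researchers used, but the workflow it describes (sending each problem to the model and using its text output as a worked-example hint) can be sketched roughly as follows. The prompt wording, model name, and helper function below are illustrative assumptions rather than the study's actual code; the sketch uses the current OpenAI Python SDK, whereas the study worked with ChatGPT 3.5 as available at the time.

```python
# Illustrative sketch only: the prompt text and helper name are assumptions,
# not the prompt used in the study.
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY is set in the environment

def generate_hint(problem_text: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask the chat model for a step-by-step worked-example hint for one problem."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You are a math tutor. Show the steps to solve the "
                        "problem, then state the final answer."},
            {"role": "user", "content": problem_text},
        ],
    )
    # The model's text output is used directly as the hint shown to students.
    return response.choices[0].message.content

# Example usage with a hypothetical problem:
# hint = generate_hint("Solve for x: 3x + 7 = 22")
# print(hint)
```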
Before starting the problem-solving section, all participants completed a short pre-test consisting of three questions to assess their initial knowledge of the assigned math topic. Following the pre-test, participants worked through five practice problems in their assigned subject. In the hint conditions, students could request hints while working on these problems. After the practice problems, participants took a post-test, which used the exact same questions as the pre-test, to measure any learning gains. The control group received correctness feedback during the practice problems but no additional hints. They could, however, request a “bottom-out hint” which simply gave them the answer to the problem so they could move forward. Participants in the hint conditions had access to full worked solution hints in addition to this bottom-out option. The time participants spent on the task was also recorded.
To ensure the quality of the ChatGPT-generated hints, the researchers performed quality checks. They evaluated whether the hints provided the correct answer, showed correct steps, and used appropriate language. Initially, they found that ChatGPT-generated hints contained errors for about 32% of the problems. To reduce these errors, they used a technique called “self-consistency”: asking ChatGPT to generate ten different hints for each problem and then selecting the hint containing the most common answer among the ten responses. This method substantially reduced the error rate, bringing it down to near zero for the algebra subjects and to about 13% for statistics problems.
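The self-consistency step described here is essentially majority voting over repeated samples: generate several candidate hints, extract the final answer from each, and keep a hint whose answer agrees with the most candidates. A minimal sketch of that idea is below; the answer-extraction rule and the generate_hint helper (from the earlier snippet) are simplified assumptions, not the study's actual implementation.

```python
from collections import Counter

def extract_final_answer(hint: str) -> str:
    """Crude stand-in for answer extraction: take the last non-empty line.
    How the study actually parsed ChatGPT's output is not described in this article."""
    lines = [line.strip() for line in hint.splitlines() if line.strip()]
    return lines[-1] if lines else ""

def self_consistent_hint(problem_text: str, n: int = 10) -> str:
    """Generate n candidate hints and return one whose final answer is most common."""
    candidates = [generate_hint(problem_text) for _ in range(n)]
    answers = [extract_final_answer(c) for c in candidates]
    majority_answer, _ = Counter(answers).most_common(1)[0]
    # Return the first candidate whose answer matches the majority vote.
    for candidate, answer in zip(candidates, answers):
        if answer == majority_answer:
            return candidate
    return candidates[0]
```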
“The high hallucination rate of ChatGPT in the subject areas we tested was surprising and so too was the ability to reduce that to near 0% with a rather simple hallucination mitigation technique,” Pardos told PsyPost.
The researchers found that ChatGPT-generated hints were indeed effective in promoting learning. Participants who received ChatGPT hints showed a statistically significant improvement in their scores from the pre-test to the post-test, indicating they had learned from the hints.
Second, the learning gains of students using ChatGPT hints were comparable to those of students who received human-authored hints; there was no statistically significant difference in learning improvement between the two groups. Both hint groups showed significantly greater learning gains than the control group, which received no hints. Interestingly, while the two hint conditions produced similar learning, participants in both spent more time on the task than the control group did. There was no significant difference in time spent between the ChatGPT hint group and the human tutor hint group, however.
“ChatGPT used for math educational content production is effective for learning and speeds up the content authoring process by 20-fold,” Pardos said.
But the researchers acknowledged some limitations to their study. One limitation was that, due to the artificial intelligence model’s limitations at the time, they could only use math problems that did not include images or figures. Future research could explore newer versions of these models that can handle visual information. Another point is that the study used Mechanical Turk workers, not students in actual classroom settings. While this allowed for faster data collection and experimentation, future studies should ideally be conducted with students in schools to confirm these findings in real educational environments.
The researchers also pointed out that they used a specific, closed-source artificial intelligence model (ChatGPT 3.5). Future research could investigate the effectiveness of more openly accessible artificial intelligence models. Finally, the study focused on a particular type of learning support – worked example hints. Future studies could explore how artificial intelligence can be used to generate other types of pedagogical strategies and more complex tutoring interactions.
In addition, it remains uncertain whether ChatGPT and other artificial intelligence models can effectively tutor academic subjects beyond mathematics. “This pedagogical approach of tutoring by showing examples of how to solve a problem, generated by AI, may not lend itself to domains that are less procedural in nature (e.g., creative writing),” Pardos noted.
Looking ahead, this study suggests that artificial intelligence has the potential to revolutionize the creation of educational resources and tutoring systems. The fact that ChatGPT can generate math help that is as effective as human-created help, and do so much more quickly, opens exciting possibilities for making high-quality education more accessible and scalable.
“One-on-one human tutoring is very expensive and very effective,” Pardos said. “Incidentally, one-on-one computer tutoring is also expensive to produce. We’re interested in exploring how GenAI-assisted tutor production can change the cost structure and accessibility of tutoring and potentially increase its efficacy through greater personalization than is reasonably achievable with legacy computational approaches.”
“We’ve recently published a study evaluating how well ChatGPT (and other models) can produce questions of appropriate difficulty, compared to textbook questions. Placing teachers in the driver’s seat of GenAI is also a research thread we’re making progress on. That emerging research, accepted at the Human Factors in Computing Systems conference (CHI), and other threads can be found on our website: https://www.oatutor.io/resources#research-paper.”
The study, “ChatGPT-generated help produces learning gains equivalent to human tutor-authored help on mathematics skills,” was authored by Zachary A. Pardos and Shreya Bhandari.