Can ChatGPT Really Teach You Algebra? What the Research Says About AI Math Tutors
Introduction: AI's New Role in the Classroom
Over the past couple of years, platforms like ChatGPT have become household names. From writing essays to generating code, these large language models (LLMs) are doing things many of us didn't think were possible. Now there's buzz about whether they can go even further: stepping into the role of a tutor and helping students master complex subjects like algebra.
Sounds impressive, right?
But before we hand over our math homework to AI, it's important to ask: Are LLMs ready to be good math tutors? Are they accurate? Do they explain things in a clear, step-by-step way? Or do they just spit out final answers without doing much actual "teaching"?
A recent study by researchers Adit Gupta, Jennifer Reddig, Tommaso Calo, Daniel Weitekamp, and Christopher J. MacLellan dives deep into exactly these questions. Using Apprentice Tutors, an intelligent college algebra tutoring system, and a set of popular large language models (including ChatGPT's latest versions), they put AI tutors to the test, and the results are as fascinating as they are cautionary.
Let's break it down.
Why This Study Matters
LLMs like ChatGPT are increasingly showing up in online education tools, acting as tutors that can help students step through math problems and understand difficult concepts. While tools like Duolingo or Khan Academy are already integrating AI in smart and helpful ways, the question remains:
Can LLMs actually support high-quality learning without leading students astray?
Math tutoring is especially tricky because it's not just about getting to the right final answer. Good tutoring involves:
- Breaking down problems into simple, manageable steps;
- Explaining why each step works;
- Spotting and correcting misconceptions;
- Giving targeted hints without just revealing the answer.
If an AI gets even one step wrong, it could confuse a student or, even worse, teach them incorrect ideas. That's a big deal.
How the Study Worked
To tackle this, the researchers used a two-part strategy:
1. Testing LLMs as Solvers
First, they treated the LLMs like students and gave them a pile of math problems to solve: problems generated from the Apprentice Tutors platform, covering 22 different types of algebra questions. They asked five different LLMs to solve these problems and checked whether the answers were right.
This tests how good the AI is at getting the correct final answer. But tutoring is more than solving problems correctly…
2. Testing LLMs as Tutors
Then they flipped the script, treating the AI as the tutor. Human evaluators acted like students and asked the AI for help solving problems, recording the conversation as the AI walked them through each question.
They looked at two things:
- Quality of tutoring: Was the explanation helpful? Did it follow good teaching practices?
- Accuracy: Were all steps correct, or did the AI make mistakes along the way?
By looking at both angles, performance as a problem-solver and performance as an interactive tutor, the researchers built a much clearer picture of how ready (or not) LLMs are for tutoring roles.
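The distinction between the two metrics can be sketched in a few lines of code. This is only an illustration: the session records and field names below are made up for the sketch, not taken from the study's actual data format.

```python
# Minimal sketch of the study's two evaluation angles, using made-up
# session records (field names are illustrative, not the paper's format).
sessions = [
    {"final_answer_correct": True,  "step_errors": 0},
    {"final_answer_correct": True,  "step_errors": 2},  # right answer, flawed steps
    {"final_answer_correct": False, "step_errors": 1},
    {"final_answer_correct": True,  "step_errors": 0},
]

# Angle 1: did the tutor reach the right final answer?
final_accuracy = sum(s["final_answer_correct"] for s in sessions) / len(sessions)

# Angle 2: was the entire walkthrough error-free? (a stricter criterion)
error_free_rate = sum(s["step_errors"] == 0 for s in sessions) / len(sessions)

print(final_accuracy)   # 0.75
print(error_free_rate)  # 0.5
```

Note how a session can pass the first metric while failing the second, which is exactly the gap the study highlights.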
Key Findings: The Good, the Bad, and the Surprising
Let's get into the results, starting with the AI's raw math skills.
Final Answer Accuracy Is… Decent
When given algebra problems to solve outright, the LLMs got the right final answer about 85.5% of the time. In interactive sessions (where a human worked with the AI across multiple steps), the final answer accuracy bumped up slightly to 88.6%, thanks to the AI having multiple chances to refine its answers.
The winner in both tests? GPT-4o, with up to 97.3% final answer accuracy in the single-shot problem-solving test.
That might sound great, until you consider that traditional intelligent tutoring systems aim for 100% accuracy. And in tutoring, each mistake matters a lot.
But When Acting as Tutors, the AIs Made Lots of Mistakes
Here's where the picture gets messier.
While 90% of tutoring chats were considered "high quality" (meaning they were helpful and aligned with good teaching techniques), only 56.6% of those sessions were completely free of errors.
That means nearly 1 in every 2 tutoring sessions still had a mistake somewhereâeither in the math itself or in how the AI interpreted the problem.
Ouch.
Newer Models Aren't Always Better
Oddly enough, GPT-4 performed worse than GPT-4o in terms of final answer accuracy, and newer models didn't consistently outperform older ones. Some of the newer LLMs evaluated (like o1 Preview and o1 Mini) were good at personalization but made more frequent factual mistakes.
This is important because it breaks a common assumption: bigger or newer doesn't always mean better, especially when it comes to specific educational tasks like teaching math.
AI Tutors Struggled with the "Teaching" Part
In several cases, the AIs got the right answer but fumbled through the explanation. Examples include:
- Using overly basic hints (like explaining how multiplication works) when the problem required advanced strategies;
- Refusing to accept correct answers from students;
- Miscalculating sub-steps but still arriving at the correct final answer;
- Not using specific methods students were supposed to learn (like "slip-and-slide" for factoring).
This shows that while LLMs are decent calculators, they're not yet great teachers. They often skip or mess up the steps that actually matter for learning.
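For readers unfamiliar with it, "slip-and-slide" is a common shortcut for factoring quadratics with a leading coefficient. Here is a worked example of the technique (my own illustration, not one taken from the study):

```latex
\text{Factor } 2x^2 + 7x + 3. \\
\text{Slip: move the leading coefficient onto the constant: } x^2 + 7x + 6 = (x+1)(x+6) \\
\text{Slide: divide each constant term by 2: } \left(x + \tfrac{1}{2}\right)(x + 3) \\
\text{Clear the fraction: } (2x+1)(x+3) \\
\text{Check: } (2x+1)(x+3) = 2x^2 + 7x + 3.
```

A tutor that ignores the method a student is being taught, even while reaching the right factorization another way, is failing at exactly the teaching step the study measures.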
Why It Matters More Than You Think
Let's say you're a high school student getting tutored by ChatGPT.
The AI gets the final answer right. Cool.
But what if the middle steps are wrong, oversimplified, or misleading?
You might walk away thinking you understand the method, when in fact you've picked up a misconception. Repeated over time, that could completely derail your future learning.
The study's authors note that intelligent tutoring systems (ITS), which are hand-crafted and grounded in cognitive science, track every step a student takes and adapt accordingly. ChatGPT can't really do that.
In fact, the researchers call this a possible "regression" in AI for education: substituting precise, step-aware teaching tools with LLMs that look smart but are prone to error.
Practical Uses: Where LLMs Can Still Shine
That's not to say LLMs are useless in the classroom. Quite the opposite! Here are some of the smart ways schools and ed-tech developers can (and maybe should) use LLMs:
- Hint generation: LLMs are good at coming up with helpful prompts or clues based on a student's mistake.
- Positive reinforcement: Many dialogues were filled with motivational phrases like "You're almost there!" or "That's not quite right, but don't give up!" That kind of feedback can go a long way in keeping students engaged.
- Flexible formats: LLMs accept different input types (like "0.5" instead of "1/2"), which is great for accommodating how students work.
- Alternative explanations: They can reframe concepts if a student doesn't get it the first time.
The real takeaway here is: LLMs can support tutoring systems, but they probably shouldn't replace them. Not yet, anyway.
What Comes Next?
The researchers are now encouraging future work that:
- Tests even more models, like Claude, Gemini, and open-source LLMs;
- Integrates LLMs into intelligent tutoring systems (using the best of both worlds);
- Analyzes real student interactions to understand how actual learners engage with AI tutors;
- Looks beyond math: how do AI tutors perform in other subjects?
Most importantly, they're calling for more careful evaluation methods. Just checking whether the answer is right isn't good enough; we need to dig into how the AI gets there and whether the explanation helps or harms learning.
Key Takeaways
Correct final answers aren't enough. Even when LLMs "get the right answer," nearly half of their tutoring dialogues still contain mistakes.
LLMs are good assistants, not full tutors. They're useful for generating practice problems, offering hints, rephrasing explanations, and providing encouragement, but human oversight is still vital.
Interactive prompting improves performance. If you're using ChatGPT for math, ask follow-up questions and double-check explanations. One-shot prompts are less accurate.
Be cautious of overconfidence. LLMs sound confident even when they're wrong. Students might not know they're being misled.
Errors during teaching matter more than wrong answers. A single incorrect step can introduce a long-lasting misconception.
The newer model isn't always better. Check performance for your specific needs before assuming the latest update is the best tutor.
If you're exploring how to use AI to support learning, or if you're just curious whether ChatGPT can teach you math, this study is a treasure trove of insights.
Could ChatGPT help you finish your algebra homework? Sure. But until it learns to teach as carefully as it calculates, think of it as a helpful sidekick, not the main teacher.
Want to use ChatGPT more effectively as a learning tool? Try these tips:
- Ask it to explain each step like you're a beginner;
- Question its methods: "Why did you do that in Step 2?"
- Compare its steps to those from a real textbook or a known tutoring system;
- Don't stop at the final answer; make sure the steps make sense.
Better prompting = better learning.
This blog is based on the research titled "Beyond Final Answers: Evaluating Large Language Models for Math Tutoring" by Adit Gupta, Jennifer Reddig, Tommaso Calo, Daniel Weitekamp, and Christopher J. MacLellan.