Can ChatGPT Really Teach You Algebra? What the Research Says About AI Math Tutors

Introduction: AI's New Role in the Classroom

Over the past couple of years, platforms like ChatGPT have become household names. From writing essays to generating code, these large language models (LLMs) are doing things many of us didn’t think were possible. Now, there’s buzz about whether they can go even further—say, stepping into the role of a tutor and helping students master complex subjects like algebra.

Sounds impressive, right?

But before we hand over our math homework to AI, it's important to ask: Are LLMs ready to be good math tutors? Are they accurate? Do they explain things in a clear, step-by-step way? Or do they just spit out final answers without doing much actual "teaching"?

A recent study by researchers Adit Gupta, Jennifer Reddig, Tommaso Calo, Daniel Weitekamp, and Christopher J. MacLellan dives deep into exactly these questions. Using a smart college algebra tutoring system called Apprentice Tutors and a bunch of popular large language models (including ChatGPT’s latest versions), they put AI tutors to the test—and the results are as fascinating as they are cautionary.

Let’s break it down.

Why This Study Matters

LLMs like ChatGPT are increasingly showing up in online education tools, acting as tutors that can help students step through math problems and understand difficult concepts. While tools like Duolingo or Khan Academy are already integrating AI in smart and helpful ways, the question remains:

Can LLMs actually support high-quality learning without leading students astray?

Math tutoring is especially tricky because it’s not just about getting to the right final answer. Good tutoring involves:

  • Breaking down problems into simple, manageable steps;
  • Explaining why each step works;
  • Spotting and correcting misconceptions;
  • Giving targeted hints without just revealing the answer.

If an AI gets even one step wrong, it could confuse a student or—even worse—teach them incorrect ideas. That’s a big deal.

How the Study Worked

To tackle this, the researchers used a two-part strategy:

1. Testing LLMs as Solvers

First, they treated the LLMs like students and gave them a pile of math problems to solve—problems generated from the Apprentice Tutors platform covering 22 different types of algebra questions. They asked five different LLMs to solve these problems and checked whether the answers were right.

This tests how good the AI is at getting the correct final answer. But tutoring is more than solving problems correctly…
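The single-shot grading setup can be sketched in a few lines. The problems and answer keys below are illustrative placeholders, not items from the Apprentice Tutors platform; the point is only that equivalent answer forms (like "0.5" and "1/2") should count as matches before accuracy is computed:

```python
from fractions import Fraction

def normalize(ans: str) -> Fraction:
    """Parse a numeric answer ("0.5", "1/2", " 3 ") into an exact fraction."""
    return Fraction(ans.strip())

def grade(items) -> float:
    """Fraction of items whose model answer equals the key after normalization."""
    correct = sum(normalize(model) == normalize(key) for _, key, model in items)
    return correct / len(items)

# Hypothetical results: (problem, answer key, model's final answer)
items = [
    ("Solve 2x - 1 = 0 for x", "1/2", "0.5"),  # equivalent forms both count
    ("Evaluate 3 * 4 - 5",     "7",   "7"),
    ("Solve x/4 = 2 for x",    "8",   "6"),    # a model slip
]
print(grade(items))  # 2 of 3 correct
```

Real grading would need symbolic comparison for algebraic answers (e.g., expressions, not just numbers), but the normalize-then-compare shape stays the same.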

2. Testing LLMs as Tutors

Then, they flipped the script—treating the AI as the tutor. Human evaluators acted like students and asked the AI for help solving problems, recording the conversation as the AI walked them through each question.

They looked at two things:

  • Quality of tutoring: Was the explanation helpful? Did it follow good teaching practices?
  • Accuracy: Were all steps correct, or did the AI make mistakes along the way?

By looking at both angles—performance as a problem-solver and performance as an interactive tutor—the researchers built a much clearer picture of how ready (or not) LLMs are for tutoring roles.

Key Findings: The Good, the Bad, and the Surprising

Let’s get into the results, starting with the AI’s raw math skills.

📈 Final Answer Accuracy Is… Decent

When given algebra problems to solve outright, the LLMs got the right final answer about 85.5% of the time. In interactive sessions (where a human worked with the AI across multiple steps), the final answer accuracy bumped up slightly to 88.6%, thanks to the AI having multiple chances to refine its answers.

The winner in both tests? GPT-4o, with up to 97.3% final answer accuracy in the single-shot problem-solving test.

That might sound great—until you consider that traditional intelligent tutoring systems aim for 100% accuracy. And in tutoring, each mistake matters a lot.

🤔 But When Acting as Tutors, the AIs Made Lots of Mistakes

Here’s where the picture gets messier.

While 90% of tutoring chats were considered “high quality” (meaning they were helpful and aligned with good teaching techniques), only 56.6% of those sessions were completely free of errors.

That means roughly 43% of tutoring sessions—more than 2 in every 5—still had a mistake somewhere, either in the math itself or in how the AI interpreted the problem.

Ouch.

🤷 Newer Models Aren’t Always Better

Oddly enough, GPT-4 performed worse than GPT-4o in terms of final answer accuracy, and newer models didn’t consistently outperform older ones. Some of the newer LLMs evaluated (like o1 Preview and o1 Mini) were good at personalization but made more frequent factual mistakes.

This is important because it breaks a common assumption: bigger or newer doesn’t always mean better, especially when it comes to specific educational tasks like teaching math.

🧠 AI Tutors Struggled with the “Teaching” Part

In several cases, the AIs got the right answer but fumbled through the explanation. Examples include:

  • Using overly basic hints (like explaining how multiplication works) when the problem required advanced strategies;
  • Refusing to accept correct answers from students;
  • Miscalculating sub-steps but still arriving at the correct final answer;
  • Not using specific methods students were supposed to learn (like "slip-and-slide" for factoring).

This shows that while LLMs are decent calculators, they’re not yet great teachers. They often skip or mess up the steps that actually matter for learning.
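For readers unfamiliar with it, "slip-and-slide" factors a trinomial by temporarily multiplying the constant term by the leading coefficient, factoring the simpler result, and then dividing back. A quick worked example (our own, not one of the study's problems):

```latex
\begin{align*}
2x^2 + 7x + 3 &\longrightarrow x^2 + 7x + 6
  && \text{``slip'': multiply } c \text{ by } a = 2,\ \text{set } a = 1\\
x^2 + 7x + 6 &= (x + 1)(x + 6)
  && \text{factor the simpler trinomial}\\
&\longrightarrow \left(x + \tfrac{1}{2}\right)\left(x + \tfrac{6}{2}\right)
  && \text{``slide'': divide each constant by } a = 2\\
&= (2x + 1)(x + 3)
  && \text{clear the remaining fraction}
\end{align*}
```

A tutor that skips this method and factors by brute force may still get the right answer—while failing to teach the very technique the lesson was built around.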

Why It Matters More Than You Think

Let’s say you’re a high school student getting tutored by ChatGPT.

The AI gets the final answer right. Cool.

But what if the middle steps are wrong, oversimplified, or misleading?

You might walk away thinking you understand the method, when in fact you’ve picked up a misconception. Repeated over time, that could completely derail your future learning.

The study’s authors note that intelligent tutoring systems (ITS)—which are hand-crafted and based on cognitive science—track every step a student takes and adapt accordingly. ChatGPT can’t really do that.

In fact, the researchers call this a possible “regression” in AI for education: replacing precise, step-aware teaching tools with LLMs that look smart but are prone to error.

Practical Uses: Where LLMs Can Still Shine

That’s not to say LLMs are useless in the classroom. Quite the opposite! Here are some of the smart ways schools and ed-tech developers can (and maybe should) use LLMs:

  • Hint generation: LLMs are good at coming up with helpful prompts or clues based on a student’s mistake.
  • Positive reinforcement: Many dialogues were filled with motivational phrases like “You’re almost there!” or “That’s not quite right, but don’t give up!” That kind of feedback can go a long way in keeping students engaged.
  • Flexible formats: LLMs accept different input types (like “0.5” instead of “1/2”), which is great for accommodating how students work.
  • Alternative explanations: They can reframe concepts if a student doesn’t get it the first time.

The real takeaway here is: LLMs can support tutoring systems—but they probably shouldn’t replace them. Not yet, anyway.

What Comes Next?

The researchers are now encouraging future work that:

  • Tests even more models, like Claude, Gemini, and open-source LLMs;
  • Integrates LLMs into intelligent tutoring systems (using the best of both worlds);
  • Analyzes real student interactions to understand how actual learners engage with AI tutors;
  • Looks beyond math—how do AI tutors perform in other subjects?

Most importantly, they’re calling for more careful evaluation methods. Just checking whether the answer is right isn’t good enough; we need to dig into how well the AI gets there and whether the explanation helps or harms learning.

Key Takeaways

  • Correct final answers aren't enough. Even when LLMs "get the right answer," nearly half of their tutoring dialogues still contain mistakes.

  • LLMs are good assistants, not full tutors. They’re useful for generating practice problems, offering hints, rephrasing explanations, and providing encouragement—but human oversight is still vital.

  • Interactive prompting improves performance. If you’re using ChatGPT for math, ask follow-up questions and double-check explanations. One-shot prompts are less accurate.

  • Be cautious of overconfidence. LLMs sound confident even when they’re wrong. Students might not know they’re being misled.

  • Errors during teaching matter more than wrong answers. A single incorrect step can introduce a long-lasting misconception.

  • The newer model isn’t always better. Check performance for your specific needs before assuming the latest update is the best tutor.

If you’re exploring how to use AI to support learning—or if you’re just curious about whether ChatGPT can teach you math—this study is a treasure trove of insights.

Could ChatGPT help you finish your algebra homework? Sure. But until it learns to teach as carefully as it calculates, think of it as a helpful sidekick—not the main teacher.


Want to use ChatGPT more effectively as a learning tool? Try these tips:

  • Ask it to explain each step like you're a beginner;
  • Question its methods: “Why did you do that in Step 2?”
  • Compare its steps to those from a real textbook or a known tutoring system;
  • Don’t stop at the final answer—make sure the steps make sense.

Better prompting = better learning.


This blog is based on the research titled “Beyond Final Answers: Evaluating Large Language Models for Math Tutoring” by Adit Gupta, Jennifer Reddig, Tommaso Calo, Daniel Weitekamp, and Christopher J. MacLellan.

Stephen, Founder of The Prompt Index

About the Author

Stephen is the founder of The Prompt Index, the #1 AI resource platform. With a background in sales, data analysis, and artificial intelligence, Stephen has successfully leveraged AI to build a free platform that helps others integrate artificial intelligence into their lives.