When AI Joins the Coding Desk: How LLM Assistants Are Changing Student Practices
Explore the latest insights from a quasi-longitudinal study that tracks student code submissions across ten semesters, spanning the period before and after the rise of chat-driven AI coding tools. This work, titled Changes in Coding Behavior and Performance Since the Introduction of LLMs, offers a rare empirical lens on how AI-enabled coding affects both what students do and how well they learn. For context and deeper details, you can read the original paper here: Changes in Coding Behavior and Performance Since the Introduction of LLMs.
Table of Contents
- Introduction
- Why This Matters
- Changes in Coding Behavior After LLMs
- Performance and Learning Outcomes in Flux
- The Centaur Model: Humans + AI in Education
- Practical Takeaways for Educators and Developers
- Key Takeaways
- Sources & Further Reading
Introduction
The rise of large language models (LLMs) and AI-powered coding tools is not just a flashy tech story; it’s changing how students approach problem solving and how instructors assess learning. The paper in focus leverages a unique, ten-semester window (Fall 2020 to Spring 2025) in a graduate cloud computing course at Carnegie Mellon University to study how student behavior and performance evolved around the mass adoption of AI coding assistants. The task analyzed—the PageRank assignment from an individual project—was kept constant across the entire period, enabling a cleaner view of how student submissions changed over time.
Key data points drive the story: 2,066 total submissions from 721 enrollments (718 unique students) for the PageRank task, with the bulk of students submitting before the end of each semester and with a notable bump in engagement after AI tools became widely available. The researchers tracked metrics like the number of submissions, total edit distance, and average edit distance (using the classic Myers algorithm as a proxy for how much code content was changed between submissions). They paired these behavioral signals with performance metrics such as Task Score (final PageRank submission), IP Score (score across the top five individual projects), and TP Score (team project score).
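To make these behavioral metrics concrete, here is a minimal sketch of how a submission history could be summarized into the three signals the study tracks. The data structures and function names are hypothetical, and the paper uses the Myers diff algorithm, whereas this illustration uses Python's `difflib` (a different diff algorithm) purely as a stand-in:

```python
import difflib
from statistics import mean

def line_edit_distance(old: str, new: str) -> int:
    """Count inserted + deleted lines between two submissions.

    Stand-in for the Myers-based edit distance used in the paper;
    difflib.SequenceMatcher uses a different underlying algorithm.
    """
    sm = difflib.SequenceMatcher(a=old.splitlines(), b=new.splitlines())
    dist = 0
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "delete":
            dist += i2 - i1
        elif tag == "insert":
            dist += j2 - j1
        elif tag == "replace":
            dist += (i2 - i1) + (j2 - j1)
    return dist

def submission_metrics(submissions: list[str]) -> dict:
    """Summarize one student's submission history into the study's
    three behavioral metrics (hypothetical field names)."""
    distances = [line_edit_distance(a, b)
                 for a, b in zip(submissions, submissions[1:])]
    return {
        "num_submissions": len(submissions),
        "total_edit_distance": sum(distances),
        "avg_edit_distance": mean(distances) if distances else 0.0,
    }
```

For example, `submission_metrics(["a\nb\n", "a\nb\nc\n", "a\nb\nc\n"])` reports three submissions, a total edit distance of 1 (one inserted line), and an average of 0.5 across the two inter-submission diffs.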
The headline takeaway is clear: since the arrival of LLMs, students tend to make more edits between submissions, submit slightly more often, and produce longer, more bloated solutions. Importantly, these changes in coding behavior do not translate into straightforward improvements in learning outcomes. The study invites us to rethink how we teach, assess, and pair humans with AI in software development. For a deeper dive, see the original paper linked above.
Why This Matters
This research hits a sweet spot of relevance for today’s classrooms, bootcamps, and even early-career software teams. Here’s the unique perspective:
Why it’s significant right now: AI-powered coding assistants are ubiquitous, from chat-based help to inline code generation. Understanding how real students adapt—whether these tools accelerate learning or mask gaps—has immediate implications for curriculum design, pacing, and fairness in assessment. The study’s long horizon helps separate short-term novelty effects from deeper shifts in learning pathways.
A real-world scenario you can act on today: Instructors designing a hands-on, project-based course can plan for AI-assisted work by calibrating assessments to emphasize understanding of abstractions, design decisions, and the ability to critique and refine generated code. In industry, teams evaluating junior developers might use these insights to shape onboarding, code-review rituals, and QA processes, recognizing that AI-assisted contributions can inflate code size while potentially masking skill gaps.
How this builds on prior AI research: Prior work documents mixed effects of LLM use on learning and performance, with some studies showing gains and others showing risks of over-reliance. What’s new here is the longitudinal, artifact-centered view—tracking concrete code submissions over years, not just snapshots. The study highlights a consistent pattern: higher edit distances and more iterative submissions accompany AI-assisted coding, yet improvements in core learning outcomes are not guaranteed. For a broader lens, you can compare these findings with other contemporary investigations into AI-assisted learning and software development.
If you want the full methodological and numerical picture, the original paper provides the data backbone for these claims. It also frames the discussion around the “centaur” idea—humans and machines collaborating, not competing. Read more here: Changes in Coding Behavior and Performance Since the Introduction of LLMs.
Changes in Coding Behavior After LLMs
In this section, we break down how students’ coding practices shifted after LLM-powered assistance became widespread, with a focus on concrete metrics you can imagine watching in a grading dashboard or learning analytics tool.
More Submissions and Longer Edits
One striking trend is a bump in engagement: the median number of submissions for the PageRank task rose starting in the post-ChatGPT semesters. Before Spring 2023 (s23), half of the students made no more than two attempts; by Spring 2024 (s24), the median had risen to three attempts. In practical terms, students were more willing to iterate—likely because AI tools made it easier to generate and refine code quickly, lowering the friction of trying multiple approaches.
Alongside more attempts, the length of the code being edited grew. The study reports a substantial rise in both total edit distance and average edit distance after the AI wave hit. The total edit distance jumped by nearly an order of magnitude from Fall 2022 to Spring 2025, and the average edit distance tripled in the same window. What this signals is that students were pushing more content into their solutions, and AI-assisted edits were contributing to larger, longer changes between submissions.
Practical implication: if you’re evaluating AI-augmented work, don’t assume more edits equal better learning. The pattern here shows heavy iteration and expansion of code scope, which might reflect scaffolding, experimentation, or automated generation rather than deep, incremental mastery.
You can see these shifts reflected in the paper’s figures and data narrative. For a broad view, the authors connect these edits to a known tendency of LLMs to revise text in ways that extend parts of the content not specifically requested—which aligns with the observed longer code in submissions. For more context on how such behaviors emerge in language-model-assisted coding, check the original study: Changes in Coding Behavior and Performance Since the Introduction of LLMs.
The Edit Distance Phenomenon
Edit distance, here measured via the Myers algorithm, serves as a proxy for how much a submission changes from the previous one. A higher distance means bigger, more sweeping edits between attempts. The study’s headline on this front is striking: the rise in edit distance after AI tools became common suggests students are engaging with AI outputs that require substantial revisions, or even that they are copying and adapting AI-generated code rather than building solutions incrementally themselves.
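Concretely, the Myers algorithm computes the shortest edit script—the minimum number of line insertions plus deletions—that transforms one submission into the next, which for sequences of lengths m and n equals m + n − 2·LCS. A minimal dynamic-programming sketch of that quantity (not the paper's implementation, which would use Myers's more efficient search):

```python
def myers_edit_distance(old: list[str], new: list[str]) -> int:
    """Length of the shortest edit script (line insertions + deletions)
    between two submissions: the quantity the Myers diff algorithm
    computes. Implemented here via a simple LCS dynamic program."""
    m, n = len(old), len(new)
    # lcs[i][j] = longest common subsequence length of old[:i] and new[:j]
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if old[i - 1] == new[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return m + n - 2 * lcs[m][n]
```

So replacing one line (`["a", "b", "c"]` to `["a", "x", "c"]`) costs 2—one deletion plus one insertion—while an unchanged resubmission costs 0. Larger values mean more sweeping rewrites between attempts.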
But there’s nuance. While more code churn and larger edits might signal active problem solving (or aggressive iteration guided by AI output), the paper also shows that these edits don’t necessarily translate into better learning outcomes. In other words, bigger edits aren’t automatically better learning signals; they can reflect reliance on AI that doesn’t improve core understanding.
Again, the link to the full dataset and methodology is in the original paper: Changes in Coding Behavior and Performance Since the Introduction of LLMs.
Performance and Learning Outcomes in Flux
Behavioral shifts are only half the story. The study carefully tracks how those shifts align (or misalign) with actual learning outcomes, using multiple performance metrics.
Task Scores, IP, and TP: A Mixed Picture
Task Score: For the PageRank task, almost all students who attempted it achieved a perfect Task Score in every semester except Fall 2022 (f22). This suggests that the task, as designed, remained easy for the cohort across most periods, even as strategies for solving it evolved. In other words, the AI-era adaptation didn’t derail the ability to reach the target on this specific task.
IP Score (Individual Projects): The average IP Score shows no strong upward or downward trend over time, but there is a slight dip before and after f22. The median IP Score sits around a high level (roughly 60 out of 63, i.e., about 95%), indicating that many students continued to excel on the individual project portfolio regardless of the AI shift. The authors discuss a ceiling effect here: even as coding behavior changed, many students were already near the top of the scale.
TP Score (Team Project): This metric rose after f22. The median TP Score hovered around 200 out of 235 (about 85%) since s23, suggesting that team performance remained robust or even improved in the AI era. The authors hypothesize that AI-assisted coding may have helped competent students contribute more effectively within teams—almost like having an invisible extra teammate who helps with implementation and thought processes. Whether that “extra teammate” is truly beneficial or masks individual contributions is left for future work.
Key takeaway on performance: while the AI shift did not erase strong outcomes, it did alter the relationship between how students edited code and how well they performed, particularly at the individual level. The metrics show stability in some areas (Task Score), slight declines in others (IP Score), and gains in team outcomes (TP Score). This mixed picture invites educators to interpret performance in the broader context of AI-assisted workflows.
For readers who want to see the exact patterns, the study reports correlations and cross-epoch comparisons, including a negative association between IP Score and Average Edit Distance (roughly: each 60-line increase in average edits corresponded to about a 1% drop in IP Score). Interestingly, this correlation persists both before and after ChatGPT. The pre-ChatGPT relationship between higher edits and TP Score also existed, but it appeared to flatten after AI assistance became common, suggesting AI changes how team outcomes relate to individual editing behavior.
If you’d like to see the full numerical story, the original paper keeps a detailed map of these shifts and their statistical framing: Changes in Coding Behavior and Performance Since the Introduction of LLMs.
Subtle Shifts: Over-Reliance and Learning Gaps
Beyond raw scores, the study emphasizes subtler educational signals:
Average Submission Effect: the average change in score between consecutive submissions. This metric declined steadily from s24 onward: students were making more edits but achieving diminishing incremental improvements between attempts. This pattern tracks with the idea of relying on AI to generate or refine code, rather than building improvements through learning and practice alone.
Negative Delta Submissions: The data show more submissions with negative score deltas in the post-AI era, reinforcing the notion that AI-assisted edits aren’t uniformly driving up scores. Some AI-generated edits may lower alignment with the task’s true intent if not carefully guided by the learner.
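Both signals fall out of the same per-submission score deltas. A minimal sketch, assuming a hypothetical list of scores in submission order:

```python
def submission_effects(scores: list[float]) -> dict:
    """Compute score deltas between consecutive submissions: the mean
    delta (the paper's Average Submission Effect) and the number of
    regressions (negative-delta submissions). Input is a hypothetical
    score history in submission order."""
    deltas = [later - earlier for earlier, later in zip(scores, scores[1:])]
    return {
        "avg_submission_effect": sum(deltas) / len(deltas) if deltas else 0.0,
        "negative_delta_count": sum(1 for d in deltas if d < 0),
    }
```

For a history like `[60, 85, 80, 100]`, the deltas are +25, −5, +20: one negative-delta submission, and an average effect of about 13.3 points per attempt. The post-AI pattern the study describes is exactly this shape—more attempts, more of them negative, and a shrinking average effect.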
Cross-Relationship Dynamics: The presence of a potential “AI teammate” in team projects raises questions about how to disentangle individual learning from collaborative outcomes. The paper notes that pre-ChatGPT, higher edits predicted lower TP scores, but this association faded post-ChatGPT, hinting at how AI support can alter mentorship, collaboration, and accountability dynamics within teams.
These nuanced findings matter for anyone designing assessments or feedback loops in AI-rich classrooms. They imply that traditional single-metric targets (like raw task completion) may not capture the depth of a learner’s developing capabilities in an AI-assisted world.
For the deeper data story and the figures, the paper provides the full narrative and statistical details: Changes in Coding Behavior and Performance Since the Introduction of LLMs.
The Centaur Model: Humans + AI in Education
A central frame in the paper is the “centaur” concept—humans and intelligent machines collaborating to achieve better outcomes than either could alone. In education, this reframes what mastery looks like and how we measure it.
Implications for Assessment and Curriculum
Rethinking assessment: If AI tools are now co-authors of code, assessments should reward students for guiding, critiquing, and integrating AI outputs, not just for the ability to write syntactically perfect code from scratch. The study’s own discussion mirrors this: as AI handles more of planning, scaffolding, testing, and implementation, the human student’s role shifts toward defining abstractions, architectural decisions, and critical evaluation of automatically generated results.
Curriculum redesign: Courses might need explicit training on how to work with AI partners—how to prompt effectively, how to verify AI outputs, and how to design systems with AI as a collaborator rather than as a shortcut. The centaur framing suggests a future where the best performers combine domain knowledge with smart tool usage, leveraging AI to handle routine or exploratory coding while humans focus on higher-level design and troubleshooting.
Quality assurance and maintainability: If team outputs increasingly reflect AI contributions, organizations may see larger codebases with more boilerplate or repetitive editing. This has direct implications for QA pipelines, testing coverage, and maintainability standards. The paper’s broader workforce implications touch on longer-term costs and the need for rigorous testing and review processes.
Transparency and verification: The study acknowledges that it’s difficult to verify the extent of individual AI usage. It’s a reminder for educators to build rubrics and audits that can better distinguish human learning from AI-generated artifacts.
If you want to explore this centaur perspective in more actionable terms, the authors discuss it as a call to redefine excellence in programming in the AI era. The same ideas are echoed in the paper’s conclusion and reflection on future directions: Changes in Coding Behavior and Performance Since the Introduction of LLMs.
What to Teach in a World of AI Co-Authors
- Focus on higher-order skills: abstraction, system design, module integration, and critical evaluation of automated outputs.
- Teach AI literacy alongside coding: how to prompt, how to assess generated code, and how to identify when AI outputs don’t meet intents.
- Emphasize collaboration with tools: workflows, versioning strategies, and debugging practices that assume AI as a teammate.
Crucially, this is not about banning AI but about elevating the human role in human-AI collaboration. The “centaur” analogy reinforces that the strongest developers will be those who can guide, interpret, and synthesize machine output—skills that remain uniquely human even in the age of AI.
Practical Takeaways for Educators and Developers
Bringing these findings into classrooms, curricula, and workplaces can help you design better experiences and better assessments in AI-rich environments.
Real-World Scenarios Today
Course design: When incorporating AI coding tools, design assignments that explicitly require students to justify design decisions, compare multiple approaches, and critique AI-generated code. This reduces over-reliance and helps students demonstrate understanding beyond surface-level correctness.
Assessment strategies: Pair automated grading with reflective work, such as write-ups explaining why a particular AI-generated solution matches the problem’s intent, or a critique of an AI-generated alternative. This aligns with the centaur model’s emphasis on human-guided evaluation.
Team-based projects: Recognize AI-assisted contributions in team settings, but also implement peer reviews and individual reflections to surface how each member contributed to the final result. The study’s observation that TP scores rose post-f22 in the presence of AI assistance hints at the upside of AI-enabled collaboration when coupled with thoughtful assessment.
Instructor tooling: Use analytics dashboards that track not just final scores but also submission patterns, edit distances, and the trajectory of progress between submissions. Such metrics offer early signals about shifts in learning pathways.
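As a sketch of what such a dashboard might aggregate, here is one possible cohort-level summary over per-student records. The record fields and thresholds are hypothetical, not from any existing tool:

```python
from statistics import median

def cohort_summary(records: list[dict]) -> dict:
    """Aggregate per-student metrics into one cohort-level dashboard row.

    Each record is assumed to carry hypothetical fields: num_submissions,
    avg_edit_distance, and negative_deltas (count of score regressions).
    """
    return {
        "median_submissions": median(r["num_submissions"] for r in records),
        "median_avg_edit_distance": median(
            r["avg_edit_distance"] for r in records
        ),
        "pct_with_regressions": 100
        * sum(r["negative_deltas"] > 0 for r in records)
        / len(records),
    }
```

Watching these cohort rows semester over semester is how the study's shifts would surface in practice: rising median submissions and edit distances, and a growing share of students with score regressions, are early signals that learning pathways are changing.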
For readers seeking a direct link to the academic groundwork behind these ideas, see the original paper: Changes in Coding Behavior and Performance Since the Introduction of LLMs.
Aligning Metrics with AI-Augmented Work
- Rethink “progress”: If students are making more edits and longer code changes, you should interpret progress through a lens that values strategic refinement, not just incremental numeric gains.
- Quality vs quantity: The data suggest more content changes don’t always translate into better learning outcomes. Emphasize the quality of design decisions and the learner’s ability to justify their choices.
- Calibrate expectations: Acknowledge that some performance metrics (like high Task Scores on a fixed task) may not fully capture learning depth in an AI-enabled landscape. Use a mix of tasks, portfolio work, and design critiques to paint a fuller picture.
Key Takeaways
- AI-enabled coding tools have a measurable impact on student behavior: more frequent submissions, larger edits, and longer code changes after AI tools become common.
- These behavioral shifts do not guarantee better learning outcomes; in some cases, there are signs of learning gaps or reduced incremental learning, even as performance on some metrics remains high.
- The relationship between individual and team performance evolves in the AI era. AI-assisted work can raise team scores while masking individual contributions, underscoring the need for transparent assessment practices.
- The centaur model—humans collaborating with AI—offers a strong framework for rethinking programming mastery. The goal shifts from “write code solo” to “design, guide, and critique AI-assisted code effectively.”
- For educators and developers, the key is to align curricula, assessments, and QA practices with AI-enabled workflows. This means teaching AI literacy, emphasizing higher-order design skills, and designing metrics that capture genuine understanding in an AI-augmented world.
If you’re curious to explore the original data and nuanced findings, the study remains a valuable resource for shaping how we teach, assess, and practice software engineering in an age of intelligent assistants: Changes in Coding Behavior and Performance Since the Introduction of LLMs.
Sources & Further Reading
- Original Research Paper: Changes in Coding Behavior and Performance Since the Introduction of LLMs
- Authors: Yufan Zhang, Jaromir Savelka, Seth Copen Goldstein, Michael Conway
Notes: The study analyzes 2,066 submissions from 721 enrollments (718 unique students) in a CMU graduate cloud computing course from Fall 2020 to Spring 2025, focusing on a fixed PageRank task to compare pre- and post-LLM eras. Key takeaways include a rise in total and average edit distances, more submissions, stable Task Scores, slight IP Score decreases, and a post-ChatGPT uplift in TP Scores, with nuanced correlations between editing behavior and learning outcomes. The authors discuss the centaur model as a guiding lens for future education and AI integration in software development.