What a Massive GitHub Scan Reveals About Security in AI-Generated Code

AI-generated code can accelerate development, but security is not uniform across tools or languages. This field guide summarizes a large GitHub analysis of 7,703 files from four AI tools, 4,241 CWE instances across 77 types, and language-specific patterns, with practical security takeaways for teams.

AI is everywhere in software development these days—from copilots offering quick completions to general-purpose chatbots helping with snippets and explanations. But as teams lean more on AI-generated code, what does that mean for security? A recent large-scale study mined public GitHub repositories to answer this question, analyzing thousands of AI-generated code samples across four major tools. The verdict is nuanced: AI can accelerate development, but there are real, language‑ and tool‑dependent security patterns that teams need to know about.

In this post, I’ll break down the study’s approach, share the key findings in plain language, and translate them into practical takeaways you can apply in real-world software development and security practices. Think of this as a field guide for responsibly using AI-generated code.

Why this study matters

Generative AI can speed up coding, documentation, and even debugging. GitHub reports productivity and confidence gains from tools like Copilot, and developers turn to other AI services for quick code or notes. But as more AI-generated code reaches production, we have to ask: are we shipping security risks along with the convenience? This study answers with data at a scale not seen before in the literature: nearly 8,000 files initially flagged as AI-generated, narrowed down to a robust set of around 7,700 files analyzed for security vulnerabilities with a well-established static analysis tool.

Key idea: not every AI-generated line is a security flaw, but some vulnerability patterns emerge that you’d want to catch during code reviews and security scans.

How the study was done (in plain terms)

Here’s a high-level view of the approach, translated from the paper’s method section into everyday language:

  • Tools examined: ChatGPT, GitHub Copilot, Amazon CodeWhisperer, and Tabnine. These cover a mix of general-purpose LLMs and code-focused assistants.
  • Real-world data source: public GitHub repositories. Instead of running toy prompts in a lab, researchers looked at code that developers actually used—with explicit attributions to indicate which AI tool helped write it.
  • Attribution: To identify AI-generated content, they looked for specific indicators in code comments and the surrounding context that clearly linked a snippet to one of the four tools.
  • Filtering pipeline: from an initial pool of over 82,000 candidate files down to about 7,700 analyzable files. They removed duplicates, non-executable content, files in languages not supported by the analysis tool, and tiny snippets unlikely to be meaningful for vulnerability analysis (a small sketch of this kind of filtering appears after this list).
  • Language focus: they analyzed 10 languages but zeroed in on Python, JavaScript, and TypeScript because together they cover about three-quarters of the dataset and are highly relevant for security concerns in web and app development.
  • Static analysis tool: CodeQL. It’s a widely used analysis engine (free for open-source projects) that maps code patterns to known weakness taxonomies (CWEs) and can connect those to vulnerability databases (like CVEs). A sketch of this analysis step also appears after this list.
  • How vulnerability severity was ranked: CWEs found by CodeQL were linked to CVEs via the National Vulnerability Database (NVD). They used the CVSS v3.x scoring system to quantify severity.
  • What counts as a finding: the study distinguished three levels of findings, namely errors (critical security issues), warnings (potential issues), and recommendations (quality- and maintenance-oriented checks such as unused code). They then looked at how many files contained vulnerability-related CWEs versus benign findings.
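
To make the filtering step concrete, here is a minimal sketch of that kind of pipeline: deduplicate files, drop unsupported or non-code formats, and discard tiny snippets. The extension list, minimum-size threshold, and hash-based deduplication are illustrative assumptions; the paper describes the steps but not this exact implementation.

```python
import hashlib
from pathlib import Path

# Illustrative assumptions: extensions the analyzer can handle and a minimum useful size.
SUPPORTED_SUFFIXES = {".py", ".js", ".ts", ".java", ".c", ".cpp", ".cs", ".go", ".rb", ".swift"}
MIN_LINES = 10

def filter_candidates(root: str) -> list[Path]:
    """Reduce a raw pool of candidate files to an analyzable, de-duplicated set."""
    seen_digests = set()
    kept = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED_SUFFIXES:
            continue                      # unsupported language or non-code content
        data = path.read_bytes()
        if len(data.splitlines()) < MIN_LINES:
            continue                      # tiny snippet, unlikely to be meaningful
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen_digests:
            continue                      # exact duplicate of a file already kept
        seen_digests.add(digest)
        kept.append(path)
    return kept

# Example: candidates = filter_candidates("scraped_repos/")
```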
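
And here is a minimal sketch of the analysis step itself, assuming the CodeQL CLI is installed and using the public `codeql/python-queries` pack: build a database for one repository, run the queries, and count the CWE tags on the rules that fired. The paths and pack choice are assumptions for illustration, not details taken from the paper.

```python
import json
import subprocess
from collections import Counter

# Illustrative paths; adapt to the repository being scanned.
DB_DIR = "codeql-db"
SOURCE_ROOT = "path/to/repo"
SARIF_OUT = "results.sarif"

# 1. Build a CodeQL database for the repository's Python code.
subprocess.run(
    ["codeql", "database", "create", DB_DIR,
     "--language=python", f"--source-root={SOURCE_ROOT}"],
    check=True,
)

# 2. Run the standard Python query pack and write SARIF results.
subprocess.run(
    ["codeql", "database", "analyze", DB_DIR, "codeql/python-queries",
     "--format=sarif-latest", f"--output={SARIF_OUT}"],
    check=True,
)

# 3. Count the CWE tags attached to the rules that produced results.
#    CodeQL's standard queries tag rules like "external/cwe/cwe-089".
with open(SARIF_OUT) as f:
    sarif = json.load(f)

cwe_counts = Counter()
for run in sarif.get("runs", []):
    tool = run.get("tool", {})
    rules = list(tool.get("driver", {}).get("rules", []))
    for extension in tool.get("extensions", []):   # query-pack rules may live here
        rules.extend(extension.get("rules", []))
    tags_by_rule = {r.get("id"): r.get("properties", {}).get("tags", []) for r in rules}
    for result in run.get("results", []):
        for tag in tags_by_rule.get(result.get("ruleId"), []):
            if tag.startswith("external/cwe/"):
                cwe_counts[tag.rsplit("/", 1)[-1].upper()] += 1

print(cwe_counts.most_common(10))
```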

Big picture takeaway from the methodology: the study provides a consistent, instrumented view of vulnerabilities in AI-generated code as it appears in the wild, not just in controlled experiments.

What the study found in the real world

Here are the standout findings, translated into takeaways you can use when thinking about AI-assisted development.

  • Scale and attribution

    • Initial pool: 82,413 potential AI-generated files across four tools.
    • After filtering and quality checks: from 7,696 down to 7,117 analyzable files, depending on the exact processing step.
    • The vast majority of AI-attributed samples came from ChatGPT (about 91% in the filtered dataset), with Copilot second (about 7–8%), and CodeWhisperer and Tabnine far smaller.
  • Language distribution and tool specializations

    • Python is the dominant language in the dataset overall (roughly 38%), with notable tool-specific patterns.
    • GitHub Copilot shows a strong footprint in C/C++, and it stood out for higher security density (more lines of code per vulnerability) in Python and TypeScript.
    • ChatGPT’s Python output is the largest by volume, while its JavaScript output showed relatively stronger security performance, including the best security density for that language in this dataset.
    • Amazon CodeWhisperer and Tabnine contribute far fewer samples, so conclusions about them are less statistically powerful, though they show interesting patterns (e.g., CodeWhisperer produced no CWE-flagged vulnerabilities in this dataset).
  • Vulnerability presence: how much risk actually shows up

    • About 87.9% of AI-generated files did not contain CWE-mapped vulnerabilities. In other words, most AI-generated code in this sample was free of the specific kinds of vulnerabilities they looked for.
    • However, vulnerability patterns are not negligible. Across all analyzed files, 861 files contained at least one CWE-related vulnerability, totaling 4,241 CWE occurrences across 77 CWE types.
  • Tool and language interactions

    • The distribution of vulnerable files was highly skewed by language and tool. ChatGPT accounted for the overwhelming majority of vulnerable files and CWE counts, but Copilot contributed a non-trivial share as well. CodeWhisperer and Tabnine contributed very little to vulnerabilities in this dataset, though their small sample sizes call for cautious interpretation.
    • The language effect is strong: Python consistently showed higher vulnerability rates (roughly 16% to 18%), while JavaScript hovered around 8% to 9%, and TypeScript trailed at 2.5% to 7.1% depending on the tool.
  • Security density: how much code you get per vulnerability

    • A striking metric the study used is “LOC per CWE” (lines of code per vulnerability finding). Higher numbers mean you get more lines of code before encountering a vulnerability, i.e., better security density (a small worked sketch of this metric and the overall vulnerability rate appears after this list).
    • Copilot showed the best security density for Python (about 1,739 LOC per CWE) and for TypeScript (about 905 LOC per CWE).
    • For JavaScript, ChatGPT came out on top with about 932 LOC per CWE.
    • This pattern reinforces the idea that there isn’t a single “most secure” AI tool across all languages; a tool’s security performance depends on the language you’re using.
  • Which vulnerabilities appeared most

    • Across 861 vulnerable files, researchers identified 77 different CWE types.
    • Some vulnerabilities were very language- or tool-specific. For example:
      • In Python, CWE-772 (Missing Release of Resource after Effective Lifetime) was notable in ChatGPT-generated code.
      • In JavaScript, CWE-676 (Use of Potentially Dangerous Function) appeared frequently in Copilot’s output.
      • In TypeScript, CWE-20 (Improper Input Validation) showed up in ChatGPT’s output.
    • Severity: they mapped CWEs to CVEs and calculated average CVSS scores. The five most critical CWEs (based on average scores) included SQL Injection (CWE-89), OS Command Injection (CWE-78), Code Injection (CWE-94), and hard-coded credentials (CWE-259/798). Four of these appear on MITRE’s 2024 Top 25 list of dangerous software weaknesses, underscoring that some classic, high-severity problems still show up in AI-generated code.
  • Where vulnerabilities appear relative to the attribution comment

    • On average, the first security-relevant finding shows up a bit over 120 lines after the attribution point (mean), while the median distance is much shorter (43 lines). The gap between mean and median points to a long tail: in most files the first finding appears fairly close to the attribution comment, but in some it shows up much later.
    • Roughly half of all analyzed files had no findings at all (about 53.6%), while roughly a quarter had only recommendations (quality/maintenance issues) rather than security-vulnerability findings.
  • Documentation and other AI-assisted tasks

    • An important and under-explored part of the study is the use of AI tools for documentation. About 39% of collected files were documentation-oriented formats (Markdown, TeX, etc.), with many file names suggesting README-style content.
    • This hints at AI’s broader role in development workflows: generating docs can be faster, but it also introduces maintainability considerations and potential security implications if documentation leaks sensitive details or misdescribes secure practices.
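
To make the two headline metrics concrete, here is a small worked sketch of how the vulnerability rate and the LOC-per-CWE security density are computed. The 861 vulnerable files come from the figures above, and the analyzed-file total assumes the lower filtered count of 7,117 (which is consistent with the 87.9% clean-file figure); the line-count totals are hypothetical placeholders, since the study reports the resulting densities rather than raw LOC.

```python
def vulnerability_rate(vulnerable_files: int, analyzed_files: int) -> float:
    """Share of analyzed files containing at least one CWE-mapped finding."""
    return vulnerable_files / analyzed_files

def loc_per_cwe(total_loc: int, cwe_findings: int) -> float:
    """Security density: lines of code per CWE finding (higher is better)."""
    return total_loc / cwe_findings

# Figures quoted above.
VULNERABLE_FILES = 861
ANALYZED_FILES = 7_117            # assumption: lower bound of the filtered range

# Hypothetical per-tool, per-language totals, purely to show the arithmetic.
HYPOTHETICAL_LOC = 150_000
HYPOTHETICAL_CWE_FINDINGS = 120

print(f"vulnerable files: {vulnerability_rate(VULNERABLE_FILES, ANALYZED_FILES):.1%}")        # ~12.1%
print(f"LOC per CWE:      {loc_per_cwe(HYPOTHETICAL_LOC, HYPOTHETICAL_CWE_FINDINGS):,.0f}")   # 1,250
```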

What these findings mean for developers and teams

Here are the practical implications you can translate into day-to-day practices.

  • Security patterns depend strongly on language, not just tool

    • If your team relies on Python for critical components, be extra vigilant. The study found higher vulnerability rates in Python across tools, so integrating stronger static checks, input validation, and resource management reviews is wise.
    • For JavaScript and TypeScript, the patterns differ. Certain dangerous-function patterns or input-validation gaps showed up with specific tools. Tailor your review checklist by language, not just by tool.
  • Tool choice should be context-aware

    • GitHub Copilot sometimes offers better security density for Python and TypeScript; ChatGPT can outperform in JavaScript in some respects. The takeaway isn’t “one tool is best,” but rather: pick tools strategically for the task and language at hand, and don’t rely on a single tool to solve security.
    • Don’t assume that higher adoption or more usage equals safer code. The data show varied patterns across tools, and tool-specific vulnerability profiles can emerge.
  • Don’t overlook AI-generated documentation

    • The high percentage of AI-generated docs means you need to review generated content too. Docs can either improve maintainability or, if they propagate insecure assumptions or outdated guidance, become a risk in themselves.
    • Treat AI-generated documentation as part of your code review, not as a separate afterthought.
  • Attribution matters, but it’s only part of the story

    • The study’s attribution-based approach captures explicit AI-generated code. In practice, many developers might use AI assistance without explicit attributions. That means real-world risk could be larger than the study suggests in some contexts.
  • Classic security weaknesses still show up

    • Several high-severity CWEs common in traditional software security (like SQL injection, OS command injection, and code injection) appear in AI-generated code. Teams should maintain familiar defenses: input validation, proper parameterized queries, safe execution practices, and escaping/whitelisting where appropriate (see the Python sketch after this list).
  • Language-specific security practices can guide reviews

    • If you’re reviewing AI-generated code in Python, prioritize checks around resource handling and release patterns.
    • For JavaScript, watch for potentially dangerous function calls and unsafe eval-like patterns.
    • For TypeScript, emphasize input validation and ensuring type-safe handling of external data.
    • These targeted checks can be integrated into pull request templates and automated reviews.
  • Security density isn’t universal; plan a mixed-tool approach

    • Given that no single tool dominates security across all languages, a mixed approach—combining multiple AI tools with human code review and automated security checks—can yield better coverage.
  • Governance and maintenance matter

    • The study found many files with recommendations (not direct vulnerabilities) and a sizable chunk containing commented-out code or unused fragments. This is a reminder that AI-generated code adds to technical debt if left unchecked. Cleanups, refactoring, and consistent review policies help keep AI-generated content sustainable and secure over time.
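
To ground the review checklist above in code, here is a minimal Python sketch contrasting two of the risky patterns behind the most-cited CWEs (SQL built from user input, CWE-89, and an unreleased database connection, CWE-772) with their standard fixes. The database path, table, and columns are made up for the example.

```python
import sqlite3
from contextlib import closing

DB_PATH = "app.db"  # illustrative; the users table below is hypothetical too

# Risky pattern: CWE-89 (SQL built from user input) and CWE-772 (connection never closed).
def find_user_unsafe(username: str):
    conn = sqlite3.connect(DB_PATH)
    cur = conn.execute(f"SELECT id, email FROM users WHERE name = '{username}'")
    return cur.fetchall()  # the connection leaks unless every caller remembers to close it

# Safer pattern: parameterized query plus guaranteed resource release.
def find_user_safe(username: str):
    with closing(sqlite3.connect(DB_PATH)) as conn:  # closed when the block exits
        cur = conn.execute(
            "SELECT id, email FROM users WHERE name = ?",  # placeholder, not string formatting
            (username,),
        )
        return cur.fetchall()
```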

Real-world prompts and workflows you can adopt

Based on the study’s insights, here are practical workflow tweaks and prompt strategies to improve security when using AI-generated code.

  • Prompt for explicit security requirements

    • Include requests like: “Provide only safe, standard library-based solutions; avoid hard-coded credentials; use parameterized queries; validate all external inputs; avoid dangerous functions.”
    • Example prompt: “Generate Python code to read user input safely, use prepared statements for database access, and include comments explaining the security choices.”
  • Separate code generation from documentation

    • Use one prompt to generate code and a separate prompt to generate docs. Then review both together. The documentation may reveal assumptions or insecure patterns that aren’t immediately visible in code alone.
  • Request security-focused reviews within prompts

    • Ask the AI to annotate potential security hotspots and explain decisions. For example: “Mark lines that could suffer from resource leaks (like CWE-772 in Python) or input-validation gaps (like CWE-20, which showed up in TypeScript).”
  • Enforce explicit attributions and traceability

    • If attribution matters in your governance, configure your workflow to require explicit AI-tool comments for any generated snippet included in a PR. This makes it easier for security engineers to target AI-assisted changes during reviews (a minimal sketch of such a check appears after this list).
  • Integrate multi-tool checks into CI

    • Combine static analysis (CodeQL or equivalent) with language-specific linters and security scanners in CI to catch vulnerability patterns that might slip through human review.
  • Treat AI-generated maintenance separately

    • Because the study shows a substantial fraction of findings are maintenance-oriented (unused code, comments, etc.), apply separate cleanup passes to AI-generated content focused on maintainability and clarity, not just security.
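
As one concrete way to operationalize attribution, here is a minimal sketch of a pre-merge script that lists changed files carrying an AI-attribution comment so reviewers can prioritize them. The marker strings, file extensions, and git-based diff are assumptions a team would tailor to its own convention; nothing here is prescribed by the study.

```python
import subprocess

# Hypothetical attribution convention; adjust the markers to your team's comment format.
AI_MARKERS = ("ChatGPT", "GitHub Copilot", "CodeWhisperer", "Tabnine")
CODE_SUFFIXES = (".py", ".js", ".ts")

def changed_files(base: str = "origin/main") -> list[str]:
    """Code files modified relative to the base branch (assumes a git checkout)."""
    out = subprocess.run(["git", "diff", "--name-only", base, "HEAD"],
                         capture_output=True, text=True, check=True)
    return [p for p in out.stdout.splitlines() if p.endswith(CODE_SUFFIXES)]

def ai_attributed(paths: list[str]) -> list[str]:
    """Changed files whose contents mention one of the AI-tool markers."""
    flagged = []
    for path in paths:
        try:
            with open(path, encoding="utf-8", errors="ignore") as f:
                text = f.read()
        except OSError:
            continue  # deleted or unreadable file in the diff
        if any(marker in text for marker in AI_MARKERS):
            flagged.append(path)
    return flagged

if __name__ == "__main__":
    for path in ai_attributed(changed_files()):
        print(f"AI-attributed file changed; prioritize for security review: {path}")
```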

Limitations to keep in mind

No study is perfect, and this one has important caveats:

  • Attribution bias

    • The dataset relied on explicit attribution comments. If developers don’t document AI usage, the study underestimates AI-generated code. The real-world risk could be higher in practice.
  • Sample size for some tools

    • ChatGPT and Copilot dominated the data, while CodeWhisperer and Tabnine had relatively small samples. Conclusions about the latter should be interpreted with caution until more data accumulate.
  • Static analysis limits

    • CodeQL looks for known vulnerability patterns, but it can’t catch runtime issues, logical flaws, or vulnerabilities that only appear when code runs in production.
  • Language and tooling scope

    • The study focused on 10 languages, with a particular emphasis on Python, JavaScript, and TypeScript. Other languages and ecosystems may exhibit different patterns.
  • Snapshot in time

    • The data come from a February 2024 snapshot. As AI models evolve quickly, newer tools and updated models could shift vulnerability patterns.

Putting it all together: taking a pragmatic, security-aware approach

  • AI will speed up coding and documentation, but consistently applying security checks matters more than ever.
  • If you’re using AI-generated code, treat it like any other code you add to your project: subject it to the same rigorous review, testing, and security checks.
  • Use language-tailored reviews and keep in mind that vulnerability patterns can differ by language and tool. Don’t rely on a single AI tool for all tasks.
  • Don’t overlook the “docs” side of AI-assisted work. Clear, accurate documentation is a security and maintainability asset—review AI-generated docs as part of your standard code QA.
  • Build governance around attribution and traceability so that security teams can quickly locate and review AI-assisted changes.

Key Takeaways

  • AI-generated code is not uniformly risky, but language- and tool-specific patterns matter. Python showed higher vulnerability rates across tools, while JavaScript and TypeScript exhibited different patterns, underscoring the need for language-specific security focus.
  • A large share of AI-generated code (about 87.9%) did not map to CWE vulnerabilities in this study, but a non-trivial portion did, across 77 CWE types. The biggest concrete threats align with classic, high-severity issues (SQL injection, OS command injection, code injection, hard-coded credentials).
  • Tool performance isn’t one-size-fits-all. GitHub Copilot demonstrated strong security density for Python and TypeScript, while ChatGPT showed strengths for JavaScript in this dataset. The best choice depends on language and task, not just popularity or assumed security.
  • The study highlights an important byproduct of AI-assisted development: documentation generation is widespread (about 39% of files). This can affect maintainability and security if docs misstate practices or reveal sensitive details.
  • The findings point to practical actions: adopt language-aware security reviews, blend multiple AI tools strategically, ensure explicit attribution where possible, and bake security into AI-assisted CI/CD pipelines. Also, treat AI-generated code maintenance as a separate, ongoing priority.

If you’re curious about prompting techniques to reduce risk, consider prompts that explicitly ask for safe patterns, validation, and code hygiene, and always couple AI-generated outputs with human reviews and automated security checks.

And as AI tools continue to evolve, ongoing, large-scale analyses like this one will be essential to keep our software secure while we lean into the productivity gains AI can offer.

Key Takeaways (short version)
- Security patterns in AI-generated code vary by language and tool; Python is a hotspot, others show language-specific quirks.
- No single AI tool is universally safer; a strategic mix tuned to language/tasks works best.
- Most AI-generated code in this study was free of CWE vulnerabilities, but a meaningful minority carried serious risks.
- Documentation generated by AI is widespread and should be reviewed for maintainability and security implications.
- Practical steps: language-aware reviews, secure-by-default prompts, attribution, and CI-based security checks to harness AI safely.

Frequently Asked Questions