LLMs in Smart Contracts: Security Gaps in AI-Generated Code

Smart contracts powered by AI are accelerating development, but a new study reveals that LLMs can introduce security gaps in Solidity code used by DeFi protocols, governance tokens, and more. This post summarizes how the researchers measured risk, which vulnerabilities stood out, and practical safeguards for developers.

Table of Contents
- Introduction
- Why This Matters
- Main Content Sections
- The AI-Driven Contracting Era: Promise and Peril
- What the Study Did: How They Measured Security
- The Vulnerability Landscape: Patterns You’ll Want to Know
- Unpacking the Vulnerability Landscape
- Practical Implications and Safeguards
- Key Takeaways
- Sources & Further Reading

Introduction
If you’ve been watching software development evolve, you’ve probably noticed a recurring theme: AI is not just helping write code anymore—it’s increasingly drafting the very contracts that run on blockchains. The topic here is a big one: large language models (LLMs) generating smart contracts. But as the paper “Evaluating the Vulnerability Landscape of LLM-Generated Smart Contracts” shows, syntactic correctness and plausible logic aren’t enough to trust AI-authored blockchain code in production. This new research digs into how vulnerable AI-generated Solidity can be across DeFi, governance, tokens, and more. For context, this work explicitly analyzes outputs from the GPT, Gemini, and Claude Sonnet model families and evaluates them with established security tooling. If you want to read the original research, you can find the paper here: Evaluating the Vulnerability Landscape of LLM-Generated Smart Contracts.

The paper’s take is straightforward but important: smart contracts are hard to fix once deployed (they’re immutable and self-executing on a public chain), so mistakes in AI-generated code can have outsized, irreversible consequences. The authors set up a careful, reproducible evaluation pipeline to see how often AI-generated contracts harbor vulnerabilities, what kinds they are, and how risky they become as contracts grow larger or belong to different application domains. The results are sobering and instructive for developers, auditors, and policy-makers alike.

Why This Matters
Here’s the practical lens through which to view this research.

  • Why now, in 2024-2025? AI-assisted coding is mainstream in practice, and smart contracts are a high-stakes domain where errors aren’t just about bugs—they’re about security, theft, and loss of user trust. The study pushes us to ask: should AI-generated contracts be deployed directly, or should they always undergo rigorous review? The implications are immediate for anyone building DeFi protocols, NFT marketplaces, or governance tools that rely on Solidity and the Ethereum Virtual Machine.

  • A real-world scenario you might recognize: a startup uses an AI assistant to draft an ERC-20 token contract and a governance mechanism in one sitting. The prompt is broad—“create a simple token with transfer, approve, delegated voting”—and the generated contract compiles and passes basic tests. But once it hits production, subtle flaws—like flawed access control, mispriced fees, or unsafe external calls—can be exploited. The research shows that such risks aren’t hypothetical; they’re demonstrated across three major LLMs in a controlled study.

  • How this builds on prior AI and security work: previous work shows humans and AI can introduce classical Solidity vulnerabilities (reentrancy, overflow, improper initialization), but this paper is among the first to quantify vulnerability prevalence in AI-generated contracts across domains, analyze how contract size correlates with vulnerability, and classify vulnerability types in an LLM-generated supply chain. It complements existing frameworks like LLM-SmartAudit and broader smart contract security literature by adding a production-oriented, AI-assisted perspective.

Main Content Sections

The AI-Driven Contracting Era: Promise and Peril

Smart contracts are the backbone of decentralized apps: self-executing code on an immutable ledger. Ethereum’s rise popularized this model, but with immutability comes risk: a single bug can cause permanent loss. The authors remind us that deployment is irreversible, so any vulnerability in an AI-generated contract becomes a systemic risk.

AI brings a few notable dynamics to the table:
- Accessibility and speed: LLMs reduce the technical ceiling, letting non-experts draft contract functionality rapidly.
- Style and structure: AI tends to adhere to standards (ERC-20, ERC-721) and often yields well-structured, readable code.
- Hallucinations and mismatches: AI can invent non-existent functions or misapply primitives, potentially creating logic gaps or security blind spots.

The study treats LLMs as black-box code generators—ChatGPT-style tools whose outputs are then audited with standard static analysis. The central question is simple but critical: are AI-generated contracts ready for production, or do they need a security-first workflow before deployment? The authors answer with a data-driven verdict: no, not on trust alone. AI-generated contracts frequently harbor vulnerabilities that require human-led security review.

What the Study Did: How They Measured Security

To transform the question into measurable evidence, the authors designed a rigorous evaluation pipeline and gathered a sizable dataset of Solidity outputs for security testing.

  • Data and generation: They started by collecting 50 representative contracts from public sources (GitHub, Etherscan) and used them to define functional prompts. They turned 52 functional features into prompts, then mutated those prompts with GPT-5-mini to create 34 distinct prompt variants. Using three commercial LLM-based coding agents—GPT-4.1, Gemini-2.5, and Sonnet-4—the team generated a total of 1,033 smart contracts, across 10 iterations per model.

  • Analysis: Each contract was run through Slither, a popular Solidity static analysis tool that maps findings to a taxonomy of vulnerabilities and severity levels (low, medium, high). The researchers mapped Slither detections to concrete vulnerability categories and built a structured dataset containing model identity, iteration, vulnerability type, and severity.

  • Costs and reproducibility: The study emphasizes cost-conscious, reproducible experimentation. They used available datasets (SmartBugs, OWASP, SWC) and built an analysis pipeline with standard tools, documenting the prompts, model IDs, and vulnerability outputs. This makes the study a useful baseline for future researchers who want to measure AI-assisted smart contract security at scale.

  • Key metrics: The authors report vulnerability rates (the share of contracts deemed vulnerable), the distribution of vulnerabilities by type and severity, and how these metrics correlate with lines of code (LoC). A notable methodological point: only findings with an actual severity rating (low, medium, or high) were counted as vulnerabilities; purely informational and optimization warnings were excluded.

  • Core findings (highlights you’ll likely remember):

    • Vulnerability prevalence across models: GPT-4.1 had the lowest share of vulnerable contracts at 47.4%; Gemini-2.5 followed at 53.2%; Sonnet-4 had the highest, with more than 75% of its contracts flagged as vulnerable.
    • Contract size matters: about half of the generated contracts contained 75 LoC or fewer. There’s a positive relationship between contract size and vulnerability count: every additional 100 LoC increases the expected vulnerability count by about 15%, a trend the study supports with a Poisson GLM using a log link.
    • Domain variation: DeFi contracts showed the highest vulnerability counts, while Governance & Voting contracts tended to have fewer, though this difference is influenced by how many prompts were aimed at each domain.
    • Vulnerability types and severity: Sonnet-4 tended to produce more “Others,” Reentrancy, and Logical Error vulnerabilities; GPT-4.1 and Gemini-2.5 yielded more “Lack of input validation” and DoS-related issues. Unchecked external calls appeared in Gemini-2.5 outputs, while access control issues were broadly common across models.
    • Severity distribution: Low-severity vulnerabilities dominated across all models, but the standout was Gemini-2.5’s 41 high-severity vulnerabilities—the largest single-model high-severity count. Sonnet-4 carried more low- and medium-severity vulnerabilities but had the fewest high-severity cases among the three.
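
The severity filter described in the pipeline above can be sketched in a few lines of Python. This is a minimal illustration, assuming the JSON layout that `slither contract.sol --json` emits (findings under `results.detectors`, each with an `impact` field); the report dict here is synthetic, not real study data.

```python
from collections import Counter

# Severity levels the study counts as actual vulnerabilities;
# Informational and Optimization findings are filtered out.
VULN_IMPACTS = {"Low", "Medium", "High"}

def count_vulnerabilities(slither_report: dict) -> Counter:
    """Tally detector findings by impact, keeping only real vulnerabilities.

    `slither_report` is assumed to be the parsed JSON output of Slither,
    with findings under results -> detectors.
    """
    detectors = slither_report.get("results", {}).get("detectors", [])
    return Counter(
        d["impact"] for d in detectors if d.get("impact") in VULN_IMPACTS
    )

# Synthetic report: two findings count, the style warning is excluded.
report = {
    "results": {
        "detectors": [
            {"check": "reentrancy-eth", "impact": "High"},
            {"check": "missing-zero-check", "impact": "Low"},
            {"check": "naming-convention", "impact": "Informational"},
        ]
    }
}
counts = count_vulnerabilities(report)
print(dict(counts))  # {'High': 1, 'Low': 1}
```

In a full pipeline, the same tally would be recorded alongside model identity and iteration number to build the structured dataset the authors describe.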

For context, the authors also connect these findings to broader risk: once AI-generated code enters production, attackers can study it, replicate vulnerabilities, and exploit them at scale. The vulnerability signal isn’t just academic; it reflects real attack surfaces in a world where smart contracts govern billions in value.

The Vulnerability Landscape: Patterns You’ll Want to Know

  • Size and risk: Larger contracts aren’t automatically unsafe, but longer codebases give more room for mistakes. The 15% per 100 LoC figure is a concrete, testable trend that developers can use to calibrate risk when deciding how much auditing to allocate.

  • Domain-driven risk: DeFi apps, with complex financial logic and automated market mechanisms, appear more prone to vulnerability clusters than governance or identity-oriented contracts. This aligns with intuition: DeFi contains more edge cases, price formulas, and tokenomics that can be misimplemented or exploited.

  • Model-specific tendencies: The dramatic gap between Sonnet-4 and the other two models isn’t just a quirk of a single dataset. It suggests that depending on the model, the same high-level prompt can yield very different security quality. The upshot: when you’re deploying AI-assisted development, you should not treat all AI tools as interchangeable; their security footprints differ meaningfully.

  • Vulnerability taxonomy implications: The fact that a single model (Gemini-2.5) produced a relatively high number of high-severity issues signals that some AI generators may be emitting code that looks good on the surface but harbors critical flaws under the hood. The presence of “unsafe” patterns like unchecked external calls in Gemini-2.5 highlights why auditors should not rely on AI outputs alone.

  • The limits of AI self-auditing: The paper emphasizes that even with AI-assisted generation, you cannot rely on the model to serve as a complete security analyst. The authors explicitly caution that LLM-based review can miss vulnerabilities or introduce new ones, and it does not replace rigorous, independent auditing.
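
Under a log-link Poisson model, the per-100-LoC effect is multiplicative, so size-related risk compounds rather than adding linearly. A minimal sketch of what the reported ~15% figure implies; the 1.15 rate comes from the study, while the helper function itself is illustrative:

```python
# Per-100-LoC multiplier on expected vulnerability count (study's ~15%).
RATE_PER_100_LOC = 1.15

def expected_multiplier(loc: int, baseline_loc: int = 0) -> float:
    """Relative expected vulnerability count vs. a baseline contract size,
    under a log-link Poisson model: the effect is geometric in LoC."""
    return RATE_PER_100_LOC ** ((loc - baseline_loc) / 100)

# A 400-LoC contract vs. a 100-LoC one: 1.15**3, i.e. roughly 52% more
# expected findings purely from the size trend.
print(round(expected_multiplier(400, 100), 2))  # 1.52
```

This is one way to calibrate audit effort: a contract several hundred lines longer than your baseline deserves proportionally more scrutiny, not the same fixed pass.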

To connect with the original study: for a deeper dive into the methodology, datasets, and the full spectrum of results, see the original paper: Evaluating the Vulnerability Landscape of LLM-Generated Smart Contracts.

Unpacking the Vulnerability Landscape: From Numbers to Intuition

To translate numbers into intuition, think of a contract as a machine with many tiny levers. AI-generated contracts often come with clean, readable wiring and standard patterns, which is a definite plus. But the wiring can still be wrong in subtle ways: reentrancy guards misapplied, funds routed through unsafe paths, or input validation gaps that let attackers craft bad inputs. The study demonstrates that in practice, those issues appear across all three major AI code generators, with some models more prone to certain classes of vulnerability than others.
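
The classic reentrancy failure mode, where an external call fires before the state update, can be mimicked in plain Python. This is a toy analogy rather than Solidity: `VulnerableVault` stands in for a contract, and the attacker's callback plays the role of a malicious fallback function re-entering `withdraw`.

```python
class VulnerableVault:
    """Toy analogy of a reentrant withdraw: the external call (the
    callback) happens *before* the caller's balance is zeroed."""

    def __init__(self):
        self.balances = {}
        self.pool = 0

    def deposit(self, who, amount):
        self.balances[who] = self.balances.get(who, 0) + amount
        self.pool += amount

    def withdraw(self, who, callback):
        amount = self.balances.get(who, 0)
        if amount and self.pool >= amount:
            self.pool -= amount
            callback()               # external call first (unsafe order)
            self.balances[who] = 0   # state update last -> reentrancy bug

vault = VulnerableVault()
vault.deposit("victim", 100)
vault.deposit("attacker", 10)

drained = []
def attack():
    # Re-enter while the attacker's recorded balance is still 10.
    drained.append(10)
    if vault.pool >= 10:
        vault.withdraw("attacker", attack)

vault.withdraw("attacker", attack)
print(sum(drained), vault.pool)  # 110 0 -> the whole pool is gone
```

Swapping the last two lines of `withdraw` (update state, then call out) is the checks-effects-interactions fix; the point is how easily the unsafe ordering passes a casual read.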

The authors also describe an adversarial lens: once a contract is public, attackers can study it, test it, and point the model back at it to identify and exploit weaknesses. In other words, the same AI tooling that helps you write code can empower attackers to find and weaponize flaws, especially when the contract’s design involves complex tokenomics or governance rules.

Practical Implications and Safeguards

So what should developers, auditors, and platform teams take away from this?

  • Treat AI-generated contracts as untrusted code: even when the output compiles and passes basic checks, don’t deploy without formal security workflows. The paper’s core takeaway is a stark warning: syntactic correctness is not security.

  • Integrate security early and often: pair AI-assisted generation with layered safeguards—static analysis (as used in the study with Slither), dynamic analysis, fuzzing, formal verification for critical paths, and targeted security reviews by human experts. In production environments, you’ll want a pipeline that might include automated testing, formal methods, and independent audits rather than a single pass by an AI code generator.

  • Use domain-aware prompts and guardrails: the study highlights that prompt quality and domain focus matter. If your use case is DeFi, you’ll want prompts that steer the AI toward safer, audited patterns and explicit constraints around price, liquidity, and failure states.

  • Consider multi-model cross-checking: since model-specific vulnerabilities differ, a practical safeguard is to generate multiple candidate contracts with different models and compare them, then run cross-auditing checks. If outputs diverge in critical areas (e.g., access control, external calls), it’s a red flag that warrants deeper review.

  • Augment with governance and testing: the authors point to the broader literature on AI-assisted auditing, including frameworks like LLM-SmartAudit. Combining AI-driven drafting with multi-agent, collaborative analysis and automated testing can reduce the chance of missed issues.

  • Realistic threat modeling for teams: the threat model in the paper assumes a novice developer using an AI assistant. In real ecosystems, you’ll often have multi-agent setups with elevated privileges or complex workflows. The security implications scale with deployment context, so tailor your auditing rigor to the risk profile of the project.

  • Stay anchored to standards and continuous improvement: the study aligns with established vulnerability taxonomies (e.g., OWASP, SWC). Following up-to-date standards and integrating them into your CI/CD guardrails helps keep AI-generated code aligned with current best practices.
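
The multi-model cross-checking idea above can be sketched simply: run the same analyzer over each model's output and split the findings into shared ones (likely a specification or prompt problem) and model-specific ones (a red flag for that model). The findings below are hypothetical placeholders, not results from the study.

```python
# Hypothetical per-model findings: each entry is (function, issue).
# In practice these would come from running a static analyzer such as
# Slither on each model's generated contract.
findings = {
    "model_a": {("withdraw", "reentrancy"), ("mint", "access-control")},
    "model_b": {("withdraw", "reentrancy"), ("pause", "missing-check")},
    "model_c": {("withdraw", "reentrancy"), ("mint", "access-control")},
}

def cross_check(findings: dict) -> dict:
    """Split findings into those shared by every model (likely a spec
    problem) and those unique to some models (model-specific red flags)."""
    all_sets = list(findings.values())
    common = set.intersection(*all_sets)
    union = set.union(*all_sets)
    divergent = {
        issue: [m for m, s in findings.items() if issue in s]
        for issue in union - common
    }
    return {"common": common, "divergent": divergent}

result = cross_check(findings)
# Shared reentrancy flag -> revisit the prompt/spec; the access-control
# and missing-check flags diverge -> review those models' outputs closely.
```

Divergence in critical areas such as access control or external calls is exactly the signal the study suggests warrants a deeper human review.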

Linking back to the broader research landscape: this work complements earlier security studies of smart contracts (e.g., broad vulnerability taxonomies and static analysis pipelines) while injecting a critical AI-assisted perspective. It also points to actionable directions for the development of safer AI-driven coding tools, such as integrating formal verification early in the generation workflow and building better prompt libraries that explicitly constrain risky patterns.

Key Takeaways
- AI-generated smart contracts are not plug-and-play safe: across GPT-4.1, Gemini-2.5, and Sonnet-4, vulnerability presence ranged from about 47% to over 75% of outputs, underscoring the need for security-first workflows.
- Contract size matters: larger contracts tend to harbor more vulnerabilities, with the study estimating roughly a 15% increase in vulnerability count per 100 additional lines of code.
- Domain and model interact to shape risk: DeFi contracts showed higher vulnerability counts; some models (notably Gemini-2.5) included high-severity issues like unchecked external calls, while others showed different vulnerability profiles.
- AI-aided coding is not a substitute for audits: LLMs can speed up development, but they cannot reliably replace formal verification, professional audits, and rigorous testing.
- The path forward is hybrid: security-aware AI pipelines that couple generation with automated testing, formal verification, and human review are the practical route to safer AI-assisted smart contract development.

If you’re a developer or security engineer, the takeaway is clear: embrace AI for productivity, but implement a security-first workflow that treats AI-generated contracts as draft code needing expert review before production deployment.

Sources & Further Reading
- Original Research Paper: Evaluating the Vulnerability Landscape of LLM-Generated Smart Contracts
- Authors: Hoang Long Do, Nasrin Sohrabi, and Muneeb Ul Hassan

Additional context and related work referenced in the article (for deeper exploration):
- General smart contract security: diverse sources discussing vulnerability classes and auditing approaches.
- Static analysis for Solidity: Slither and related detection pipelines.
- LLM-assisted programming and education: literature on how LLMs shape coding education and tooling.
- LLM-SmartAudit and related multi-agent vulnerability detection frameworks: approaches that use AI to augment vulnerability detection.
