Code That Speaks Back: Real-World Prompts Reveal Security Gaps in ChatGPT-Generated Code

A large-scale analysis of authentic user prompts to ChatGPT reveals recurring security gaps in the code it generates. While developers chase functionality, the study shows how unsafe patterns slip into production, underscoring the need for safer prompts and verification steps, along with practical tips.


In the fast-moving world of AI-assisted development, many of us have leaned on chatbots to draft code, debug snippets, or brainstorm solutions. But how reliable is that code when it actually runs in the real world? A recent, large-scale study dives into this question by analyzing authentic, real-world conversations with ChatGPT and the code it produced. The takeaway is eye-opening: while AI-generated code is convenient, it often carries security flaws, and users frequently focus on getting something that works rather than ensuring it’s secure. Here’s what the study found, why it matters, and what you can do to code more safely in practice.


Introduction: Why a Real-World Look Matters

Over the past few years, large language models (LLMs) have transformed coding from a lonely, line-by-line grind into a collaborative, dialog-driven process. Surveys show many developers now rely on code generated by AI helpers, sometimes to the point of reshaping hiring and workflow norms. Yet there’s a disconnect: most early studies on AI-produced code used synthetic prompts or toy examples, which may not capture how people actually interact with these tools in the wild.

This study shifts the lens to real life. It mines WildChat, a public dataset with more than a million conversations with ChatGPT, and pulls out the conversations where code appears. Unlike synthetic prompts, these are actual user–model exchanges, including how people phrase requests, follow up on responses, and react when the code is buggy or insecure. The researchers also provide a curated, annotated collection of conversations and code snippets—plus the rules they used to extract patterns—so others can reproduce and extend the work.

In short, it’s a grounded, big-picture look at what happens when people use ChatGPT to generate code in the real world: what kinds of code get produced, what security problems show up, and how users talk about it.


What the Study Analyzes (In Plain Terms)

The team built a dataset called WildCode_EN by focusing on English-language conversations where code is involved. They pulled:

  • A huge variety of programming languages, with Python dominating by a wide margin (it turns out to be the default language many conversations end up in).
  • Snippets that ChatGPT produced in conversations, some of which were syntactically valid code, some of which contained errors.
  • Security-related patterns by running automated checks across languages, using a large set of rules to flag issues like unsafe hash use, SQL injection risk, weak random numbers, and unsafe deserialization.

They didn’t just look at whether the code ran; they looked at what kinds of mistakes and vulnerabilities appear in real prompts, and whether users even ask about security. They also checked for a phenomenon known as “hallucinated” modules—situations where the model suggests third-party libraries or packages that don’t actually exist, which could mislead a developer into pulling in unsafe or non-existent dependencies.

Key pieces of the study include:

  • The scale: 82,843 code-containing conversations from WildChat across multiple model versions, with a focus on the six most common languages in the dataset (Python, JavaScript, C/C++, Java, PHP, C#) and additional languages where relevant.
  • Language dynamics: Python is the most common output, the longest code blocks are often in C/C++, and many languages show distinctive patterns in how users annotate and reuse code.
  • Security tools: OpenGrep and other rule sets to assess vulnerabilities across hash functions, SQL, RNG, deserialization, memory-safety pitfalls, and more. They also examined regex patterns for ReDoS vulnerabilities with multiple detection tools.
  • Hallucinations: Analysis of module names and dependencies to identify “package hallucinations,” i.e., references to non-existent packages that the model includes in its code.

In other words, it’s a real-world audit of how people prompt, how the model responds, and how secure that output tends to be in practical coding tasks.
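As a rough illustration of how lightweight, pattern-based detection works, here is a toy heuristic (not one of the tools the study used) that flags the classic nested-quantifier shape behind many catastrophic-backtracking ReDoS bugs; real detectors model the regex engine far more carefully:

```python
import re

# Heuristic only: a quantified group that is itself quantified,
# e.g. (a+)+, is a classic catastrophic-backtracking shape.
NESTED_QUANTIFIER = re.compile(r"\([^)]*[+*]\)[+*]")

def looks_redos_prone(pattern: str) -> bool:
    """Flag regex patterns with an obviously nested quantifier."""
    return bool(NESTED_QUANTIFIER.search(pattern))

print(looks_redos_prone(r"(a+)+$"))   # True
print(looks_redos_prone(r"[a-z]+$"))  # False
```

A heuristic like this overcounts and undercounts, which is exactly why the study ran multiple detection tools and reports tool-dependent numbers.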


Main Findings: What Real-World Prompts Reveal

Below are the big themes, translated into everyday takeaways.

1) Language and length patterns: Python as the default, and short-but-varied code

  • Python dominates the code produced by ChatGPT in natural conversations. The study found tens of thousands of Python snippets across thousands of conversations, making Python the go-to language for many initial tasks.
  • When other languages appear, the corresponding code tends to be longer on average (e.g., C/C++ snippets are the longest on average), but overall, most generated programs are relatively short, with multiple snippets per conversation indicating iterative refinement rather than one-off bursts of code.
  • The model often starts with Python if the user doesn’t specify a language—an implicit bias or practical default that shapes how users approach coding sessions. Users who want code in other languages typically request it in follow-up queries, reflecting a back-and-forth that can drift across languages.

Analogy: It’s like starting a car with a familiar default setting (Python) and then adjusting the ride by asking the system to switch gears (languages) as your task evolves.

2) Security in practice: Things are riskier than many users think

  • Hashing and cryptography: Among 264 sessions that included hash-related code, more than 20% triggered insecure patterns. The main culprits were older, weaker algorithms like MD5 and SHA-1, and cryptographic choices that lack authentication guarantees.
  • SQL injection: For nearly 1,000 conversations containing SQL-like code, about 4% showed vulnerable patterns. Common issues included raw query execution or string concatenation that is susceptible to injection attacks.
  • Random number generation: In about 0.5% of the samples, the code used weak RNGs for security-sensitive tasks (like generating tokens or nonces). While this sounds small, any weakness in randomness can undermine security in practice.
  • Deserialization: Java code showed a notable deserialization risk, with several cases that appeared vulnerable due to missing checks when handling serialized data.
  • Memory-safety issues in C/C++: A big concern surfaced around classic danger zones—functions like memcpy or strcat used in ways that invite buffer overflows. About 14.85% of C/C++ code snippets sampled contained at least one such issue, and some files carried dozens of violations.
  • ReDoS (regular expression denial of service): Roughly a third of the regexes found in the code were flagged as potentially vulnerable to ReDoS, depending on the detection tool. While regex analysis can overcount (and depends on input context), the result signals a real worry: pattern choices can become a gateway for DoS attacks if misused.
  • Package hallucinations: Python and JavaScript commonly showed hallucinated module names. In Python, about 14.4% of modules were hallucinated across a large sample, with most appearing only once. JavaScript hallucinations were rarer (about 3.5%). These fake modules can derail a project, leading to broken builds or, worse, insecure dependencies if the names are substituted without verification.

Analogy: It’s like following a recipe that looks authentic but uses ingredients that don’t exist or don’t hold up in the kitchen. The result may “mostly” taste okay, but you’re at risk of a messy, unsafe dish.
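To make the hashing and randomness findings concrete, here is a minimal sketch contrasting the weak patterns the study flags with safer standard-library alternatives (illustrative code, not taken from the dataset):

```python
import hashlib
import secrets

# Pattern the study flags: MD5 for anything security-sensitive.
# Collision attacks make it unsuitable for passwords or integrity checks.
weak_digest = hashlib.md5(b"user-password").hexdigest()

# Safer defaults: a modern hash, and the secrets module (not random)
# for tokens and nonces used in security contexts.
strong_digest = hashlib.sha256(b"user-password").hexdigest()
session_token = secrets.token_urlsafe(32)  # cryptographically strong

# For passwords specifically, use a slow key-derivation function
# with a per-user salt rather than a bare hash.
salt = secrets.token_bytes(16)
pw_hash = hashlib.pbkdf2_hmac("sha256", b"user-password", salt, 100_000)

print(len(weak_digest), len(strong_digest))  # 32 vs 64 hex chars
```

None of this is exotic: every replacement here ships with the Python standard library, which is part of why the study's prevalence numbers are striking.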

3) Hallucinations and the dependency trap

  • The study’s pipeline checks Python and JavaScript imports against official package repositories (PyPI and NPM). Even with a verification step, a surprising portion of the code still includes fake modules. This isn’t just a nuisance—it can derail projects if developers rely on non-existent libraries or miss safer, real alternatives.
  • The takeaway is simple: even when the code “compiles” or runs, the underlying dependencies might be dubious. That means an extra round of manual checks is essential before you integrate any AI-generated script into a real project.
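One way to approximate that verification step locally is to parse a snippet’s imports and check whether they resolve at all. This sketch checks against the current environment rather than querying PyPI directly (which is what the study’s pipeline did), so treat it as a cheap first filter:

```python
import ast
import importlib.util

def extract_imports(source: str) -> set[str]:
    """Collect top-level module names imported by a code snippet."""
    tree = ast.parse(source)
    names: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
    return names

def unresolvable(modules: set[str]) -> set[str]:
    """Modules that cannot be found in the current environment --
    candidates for hallucinated or missing dependencies."""
    return {m for m in modules if importlib.util.find_spec(m) is None}

# 'totally_fake_pkg' is a hypothetical name standing in for a
# hallucinated dependency.
snippet = "import json\nfrom totally_fake_pkg import helper\n"
print(unresolvable(extract_imports(snippet)))  # {'totally_fake_pkg'}
```

A module that resolves locally could still be a typosquatted or unmaintained package, so a registry lookup and a trust check remain necessary before installation.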

4) User intent: Bug fixing and code generation dominate, security awareness is thin

  • The researchers categorized user intents and followed how those intents evolved over multi-turn sessions. Bug fixing and code generation were the most common initial intents, and they continued to dominate follow-ups.
  • Secure coding—explicit requests to add security checks or design for security—was surprisingly rare in both initial and follow-up queries.
  • A chi-square test suggested no strong statistical link between the initial intent and the follow-up categories, meaning that, regardless of what users started asking for, many sessions drift toward refining or expanding code rather than focusing on security.

Analogy: Even when you start by asking for a quick bug fix, the conversation often morphs into “just add a little more code here” rather than “let’s harden this for attackers.”

5) Conversation dynamics: Longer chats often mean more complex security considerations

  • Conversations that began with security-focused goals tended to become longer, suggesting that users who start by worrying about security tend to dig deeper. Yet, even in longer chats, explicit secure-coding topics did not dominate the entire session.
  • Follow-up questions about explanations or debugging tended to extend dialogues, indicating that multi-turn collaborations tend to grow in complexity—sometimes at odds with a secure-by-default mindset.

6) What the study says about reliability and validity

  • The security analysis used lightweight, scalable pattern-matching approaches. While practical for big data, these methods can overcount or miss nuanced, context-sensitive flaws. The authors caution that their vulnerability counts are likely lower bounds and may not capture every subtle issue.
  • Language labeling relied on automated classification for non-annotated snippets, with some margin of error (a few percent). The researchers took extra steps to validate language labeling for the six languages where syntax checks were applied, but there’s always some residual risk in automatic labeling.

In short: the results are robust enough to reveal meaningful patterns, but not a precise census of every possible vulnerability.


Real-World Implications: What This Means for Developers and Teams

  • Security by default is not guaranteed in AI-generated code. Even when a code snippet appears correct, it may hide insecure patterns, weak randomness, unsafe deserialization, or unsafe dependencies.
  • Relying on AI-generated code without due diligence is risk-prone. The presence of hallucinated modules is a concrete hazard that can waste time, break builds, or introduce hard-to-track vulnerabilities.
  • User education matters. The data show a disconnect between the tasks people want help with and their attention to security concerns. If you’re using a code assistant, you may need a personal or team discipline to explicitly check security, not just functionality.
  • Tooling should evolve. Static analysis, dependency auditing, and security-aware prompts should be integrated into AI-assisted coding workflows to catch issues early in the iteration cycle.

Practical implications for teams:
- Treat AI-generated code as a draft, not a finished product. Pass it through your standard checks: syntax, type safety, linters, tests, and security scanners.
- Add a dependency vetting step. If a model suggests a library, verify its existence, trustworthiness, and security track record before installation.
- Build a security prompt habit. When you ask for code, explicitly request secure-by-default patterns, input validation, safe APIs, and minimization of unsafe functions.
- Use multi-tool pipelines. Don’t rely on a single source of truth; combine code generation with automated security checks and human review, especially for critical or public-facing software.
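A toy version of one stage in such a pipeline might look like the following; the rules below are deliberately crude placeholders for the kind of context-aware checks real scanners like OpenGrep perform:

```python
import re

# Illustrative rules only -- real rule sets are far larger and
# understand code structure, not just text.
RULES = {
    "weak-hash": re.compile(r"\b(md5|sha1)\s*\("),
    "sql-concat": re.compile(r"execute\(\s*[\"'].*[\"']\s*\+"),
    "unsafe-deser": re.compile(r"pickle\.loads?\("),
    "weak-rng": re.compile(r"\brandom\.(random|randint)\("),
}

def scan(source: str) -> list[str]:
    """Return the names of rules that match a code snippet."""
    return [name for name, pattern in RULES.items() if pattern.search(source)]

snippet = 'cur.execute("SELECT * FROM users WHERE id=" + user_id)'
print(scan(snippet))  # ['sql-concat']
```

Even a crude gate like this, run automatically on every AI-generated snippet, catches the low-hanging fruit before it reaches review.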


Practical Tips: How to Prompt and Review for Safer Code

  • Be explicit about security: Start prompts with a security requirement, e.g., “Provide a Python function that validates user input, uses parameterized SQL queries, and avoids risky string concatenation.”
  • Request defenses, not just features: Ask for input validation, error handling, and defensive programming practices as a default.
  • Verify dependencies: If the code imports libraries, check that those libraries exist, are maintained, and come from trusted repositories. Look for “hallucinated” package names and verify them before installation.
  • Use static analysis early: Run linting, syntax checks, and security scanners on every snippet before integrating. If a snippet lacks a language tag, run your own language-identification checks or ask the model to clarify the language.
  • Keep it iterative but cautious: Expect multiple snippets per task; review each one for security implications before stacking them into a larger solution.
  • Consider security-first prompts for education: If you’re learning, practice by asking for “secure coding practices” or “how to defend against common web vulnerabilities” as part of the prompt.
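As a concrete example of the parameterized-query pattern the first tip asks for, here is a minimal sqlite3 sketch (the table and values are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

def get_user(conn: sqlite3.Connection, user_id):
    # Placeholder binding: the driver escapes the value, so attacker
    # input like "1 OR 1=1" is treated as data, not as SQL.
    return conn.execute(
        "SELECT name FROM users WHERE id = ?", (user_id,)
    ).fetchone()

print(get_user(conn, 1))           # ('alice',)
print(get_user(conn, "1 OR 1=1"))  # None -- the injection is inert
```

The same request built by string concatenation ("WHERE id=" + user_id) is exactly the vulnerable shape the study found in roughly 4% of SQL-containing conversations.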

Real-World Applications: Where This Matters

  • Startups and other teams that rely on AI for rapid prototyping can benefit from embedding security checks into their coding workflow—reducing the risk of early-stage vulnerabilities slipping into production.
  • Education and training programs can use these findings to teach developers to pair AI-generated code with systematic security reviews, making students aware that “works on my machine” is not the same as “secure and robust.”
  • Platform designers and tool integrators can use these insights to design prompts, checklists, and automated pipelines that nudge users toward safer code generation, including warning flags for potentially hallucinated dependencies.

Key Takeaways

  • Real-world AI-generated code is convenient but not automatically secure. Across a large, authentic dataset, vulnerabilities appeared in multiple categories (hashes, SQL, RNG, deserialization, memory safety) with notable frequency in some languages.
  • Hashing and cryptography issues are surprisingly common. More than 20% of hash-related code triggered insecure patterns, with many snippets using weak algorithms or lacking authentication guarantees.
  • SQL and deserialization pose tangible risks. SQL-related snippets showed vulnerabilities around raw queries and unsafe patterns, while Java deserialization often lacked essential security checks.
  • Memory-safety concerns in C/C++ are prevalent. A substantial portion of C/C++ snippets contained risky memory operations, underscoring the limits of AI in generating memory-safe code for these languages.
  • Regex can be a quiet vulnerability vector (ReDoS). Roughly a third of regexes could be susceptible to ReDoS, depending on the tool, signaling a real but context-sensitive threat.
  • Package hallucinations are a practical hazard. A non-trivial share of Python modules called out by the model do not exist in official repositories, creating a trap for developers who assume everything suggested by the model is trustworthy.
  • Users often focus on getting code that runs rather than code that is secure. Bug fixing and code generation dominate initial and follow-up intents, while explicit secure-coding questions are relatively rare, even as code quality issues surface.
  • Default behavior and language drift matter. When users don’t specify a language, the model leans heavily toward Python, shaping the downstream development experience and the kinds of security concerns that emerge.
  • A robust workflow helps. Integrating static analysis, dependency verification, and security prompts into AI-assisted coding can mitigate many of the risks shown in the study.

If you’re building or using AI-assisted coding tools, these takeaways are a reminder: treat generated code as a draft, verify dependencies, and actively prompt for security considerations. The prompts we write today shape the security we get tomorrow.


Key takeaways aside, the broader message is clear: AI-assisted coding is a powerful ally, but not a substitute for careful engineering and security-minded workflows. By combining real-world insights with disciplined review practices, developers can reap the productivity benefits of AI while keeping their software safer, more reliable, and more resilient in the face of real-world threats.
