Code in the Wild: Real-World Security Gaps in ChatGPT-Generated Code

This post summarizes the WildCode study, which analyzes code generated by ChatGPT from real-world prompts. It highlights the types of security gaps found, how these gaps manifest across languages, and what developers can do to mitigate risks when using AI-assisted coding.

If your team is leaning on AI to draft or refactor code, you’re not alone. Large language models (LLMs) like ChatGPT have become popular copilots for developers and hobbyists alike. But what happens when the code you’re relying on comes from a model that learned from open-source code, and you’re asking it to write something quickly, iteratively, and in different languages? A new large-scale study dives into real-world, real-time conversations where ChatGPT actually generated code, shedding light on both how good that code is and where security slips through the cracks. The takeaway? The world of AI-assisted coding is exciting, but fragile—especially when it comes to security.

In this post, we’ll unpack the core findings from the study “WildCode: An Empirical Analysis of Code Generated by ChatGPT” in plain language. We’ll cover what real users actually ask for, what the produced code looks like across languages, where security problems tend to hide, and what this all means for people building, using, or supervising AI-assisted coding workflows.

Introduction: Why this matters right now

Within just a few years, coding with the help of LLMs has shifted from a niche experiment to a daily practice for many developers. The appeal is obvious: get a quick starter, explanations, or a scaffold to build on. But security is the quiet casualty in many early studies that test LLMs with synthetic prompts—prompts crafted just for the study rather than reflecting how people actually interact with the model in the wild.

This study stands out because it uses a massive, real-world dataset of actual ChatGPT conversations to extract code. The researchers pulled about a million real conversations from WildChat, focusing on those where the model generated code. They didn’t just ask the model to write toy programs; they looked at what people actually asked for, how they followed up, and what happened when the code was buggy or insecure. They also compiled a curated, annotated collection of conversations and code samples to help other researchers reproduce or extend the work.

What they found isn’t entirely surprising if you’ve been around the security side of software—but the scale and realism matter. In short: AI-generated code often falls short on security, and users aren’t consistently prioritizing security in their prompts or follow-ups. That mismatch between function and safety has real-world implications for how we design, use, and govern AI-assisted coding.

From data to code: How the researchers built the WildCode dataset

Big data, real users, real languages

  • Source: WildChat, a dataset with more than 1 million real-world ChatGPT conversations collected between spring 2023 and mid-2024.
  • What they pulled: All conversations that included code generated by the model. This yields a specialized dataset called WildCode, plus language-labeled snippets.
  • Language labeling: Since not all code blocks came with reliable language tags, they used an automated language identification tool to classify snippets into languages. For extra confidence, they validated six major languages (Python, JavaScript, Java, C, C#, PHP) with language-specific syntax checkers.
  • English focus for intent: Because language tooling is strongest in English, they focused the user-intent analysis on English conversations, forming a dataset called WildCode_EN.

Languages and how code looks

  • Python dominates: The lion’s share of snippets is Python—tens of thousands of Python snippets across many conversations.
  • Other big players: JavaScript, C/C++, Java, and C# follow. PHP appears as well, but most PHP-labeled snippets in this dataset were not valid PHP source code.
  • Length and structure: Most generated programs are short, but there’s wide variation. C/C++ snippets tend to be the longest on average; .NET languages (C#, VB, etc., though primarily C# here) tend to have more inline comments per block.
  • Multi-turn coding: On average, a single conversation contains roughly 1.5 to 2.5 code snippets, suggesting that users frequently ask for refinements or iterative improvements rather than one-off solutions.

A note on quality checks

  • Syntax sanity: A subset of code across six languages was checked for syntax correctness with language-specific tools (Python’s py_compile, ESLint for JavaScript, Java’s compiler, GCC for C/C++, PHP’s -l lint flag, and Roslyn for C#). PHP didn’t yield valid code in the sample; most PHP-labeled fragments turned out to be command-line tool invocations rather than full PHP source files.
  • Language labeling caveat: Some snippets lacked explicit language tags; the language-id model isn’t perfect (accuracy around 95%), so there’s still a margin for labeling errors.

Security: where the alarms go off

The researchers used OpenGrep, a regex-based security analyzer, and a curated set of 648 rules spanning six languages to flag insecure patterns. They also drew on additional vulnerability detectors and manual inspection to shape a broad view of security issues. Here are the standout findings.

Hash functions and cryptography

  • Scope: 264 unique ChatGPT conversations included code with hash functions across multiple languages.
  • Vulnerability rate: About 20.6% of these conversations triggered at least one rule—mostly pointing to older or broken crypto choices.
  • What these rules flagged: Continued use of MD5, SHA-1, or algorithms without authentication guarantees.
  • Takeaway: Even when people ask for “hashing” or “security” in high-level terms, the model and users often default to outdated or weak cryptographic practices. This is a classic security trap: not all “hashing” is created equal, and older algorithms remain easy to misapply.
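The paper doesn’t prescribe remediations, but the contrast is easy to sketch with Python’s standard library: MD5 for anything security-relevant is the kind of pattern the scanners flag, while SHA-256 (for integrity) and a slow, salted KDF such as scrypt (for passwords) are safer defaults. This is an illustrative sketch, not code from the study, and the scrypt cost parameters are assumptions, not tuned recommendations.

```python
import hashlib
import secrets

password = "correct horse battery staple"

# Flagged pattern: MD5 is collision-broken and far too fast to
# protect passwords.
weak = hashlib.md5(password.encode()).hexdigest()

# For integrity checks, prefer the SHA-2 family.
digest = hashlib.sha256(b"file contents").hexdigest()

# For passwords, use a deliberately slow, salted KDF such as scrypt
# from the standard library; the cost parameters below are
# illustrative only.
salt = secrets.token_bytes(16)
key = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
```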

SQL and data access

  • Scope: 970 conversations contained SQL code, scanned with 42 rules targeting SQL injection and insecure data handling.
  • Vulnerability rate: 61 conversations flagged (about 3.93%).
  • Common patterns: Raw SQL queries embedded directly, string concatenation used to assemble queries, and JDBC usage patterns that can invite injection or mishandling of user input.
  • Takeaway: Even when you’re just string-building queries in a high-level context, the risk surfaces quickly if you’re not strictly parameterizing inputs and validating queries.
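As a concrete illustration (not code drawn from the dataset), here is the flagged string-concatenation pattern next to the parameterized alternative, using Python’s built-in sqlite3; the table and payload are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

user_input = "alice' OR '1'='1"  # classic injection payload

# Flagged pattern (shown only as a comment): a query assembled by
# concatenation lets the payload rewrite the WHERE clause.
#   "SELECT * FROM users WHERE name = '" + user_input + "'"

# Safe pattern: the driver binds the value as a parameter, so the
# payload is compared as a literal string and matches nothing.
rows = conn.execute(
    "SELECT * FROM users WHERE name = ?", (user_input,)
).fetchall()
```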

Random number generation

  • Scope: 3,032 conversations included code using randomness. Weak RNGs matter most when the output feeds security-sensitive tasks such as nonces or keys; for non-sensitive uses the stakes are lower.
  • Vulnerabilities found: 17 instances of weak RNG usage (15 in Java, 2 in Python); overall, about 0.47% in the examined sample.
  • Takeaway: The model isn’t consistently failing here, but the stakes are high for security-sensitive randomness. If you’re generating tokens or cryptographic nonces, you want a cryptographically secure RNG, not a casual, general-purpose one.
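In Python terms, the distinction is between the general-purpose random module (a predictable Mersenne Twister) and the secrets module, which draws from the OS CSPRNG. A minimal sketch, with made-up variable names:

```python
import random
import secrets

# Insecure for security purposes: Mersenne Twister output is
# predictable once enough values are observed.
session_id_weak = "%032x" % random.getrandbits(128)

# Secure: secrets is designed for tokens, keys, and nonces.
session_id = secrets.token_hex(16)  # 32 hex characters
nonce = secrets.token_bytes(12)     # e.g. for an AEAD cipher
```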

Deserialization

  • Scope: 30 Java code samples performed object deserialization.
  • Finding: All looked vulnerable—none included security checks to validate or sanitize deserialized data.
  • Takeaway: Deserialization remains a notoriously risky pattern in Java; when an LLM contributes code in contexts involving serialized data, you should treat it as a red flag and add strict validation or safer alternatives.
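The flagged samples are Java, but Python has the same trap in pickle. As a hedged sketch of the safer pattern (a data-only format plus explicit shape validation; the load_user helper is invented for illustration):

```python
import json

# Risky Python analog of the Java finding: pickle.loads() can run
# arbitrary code embedded in the payload, so it must never touch
# untrusted bytes.

# Safer pattern: decode a data-only format, then validate the shape
# explicitly before using the result.
def load_user(raw: bytes) -> dict:
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("name"), str):
        raise ValueError("unexpected payload shape")
    return {"name": data["name"]}

user = load_user(b'{"name": "alice", "role": "admin"}')
```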

Memory-safety in C/C++

  • The study highlights that memory corruption vulnerabilities are a known weakness in LLM-generated C/C++ code. This aligns with earlier findings that these languages are particularly unforgiving of memory mistakes and unsafe patterns.

Regex and ReDoS

  • ReDoS risk: Regex-based DoS vulnerabilities were assessed with four different detectors (SafeRegex, Rescue, Redoshunter, Revealer).
  • Finding: About one-third of the regex patterns in the dataset were flagged as vulnerable to ReDoS, though the detectors can produce false positives. The researchers note this is likely an over-count, but it signals a non-trivial risk in regex-heavy code.
  • Practical angle: If you’re processing user input with regexes generated by an LLM, you should test for potential ReDoS scenarios and consider safer regex patterns or input bounds.
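To make the risk concrete, here is a sketch (not taken from the dataset) of a classic catastrophic-backtracking pattern next to a linear rewrite that accepts the same strings:

```python
import re

# Nested quantifiers such as (a+)+ backtrack exponentially on inputs
# that almost match: on "aaa...a!" the engine tries every way of
# splitting the run of a's before giving up.
vulnerable = re.compile(r"^(a+)+$")

# Matches exactly the same strings, in linear time: the outer
# repetition of the group adds nothing.
safe = re.compile(r"^a+$")

ok = bool(safe.match("aaaa"))           # matches instantly
rejected = safe.match("a" * 40 + "!")   # fails instantly (None)
# vulnerable.match("a" * 40 + "!") would effectively hang; bound input
# length, add timeouts, or use a backtracking-free engine (such as a
# third-party RE2 binding) when the input is untrusted.
```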

Package hallucinations: fake libraries that aren’t real

  • What happened: The study found “hallucinated” package names: Python module names, often invented for less familiar tasks, that don’t exist in any real repository. JavaScript saw far fewer such hallucinations, but the count wasn’t zero.
  • Statistics: About 14.4% of Python imports were hallucinated (285 distinct fake modules among the 1,984 distinct modules referenced); JavaScript hallucinations occurred in about 3.5% of cases (21 out of 606).
  • Pattern: Hallucinations were more common for less common tasks or user-defined functions; common modules stayed real.
  • Risk: If you auto-install dependencies based on AI output, you risk pulling in non-existent or malicious modules. The authors even note that users often don’t catch these hallucinations, letting the code run with broken or dangerous dependencies.
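A lightweight guard, sketched here as a suggestion rather than a tool the authors ship, is to check that every imported name actually resolves before trusting generated code; `flask_security_utils` below is a made-up name standing in for a hallucinated module.

```python
import importlib.util

def missing_imports(module_names):
    """Return the names that don't resolve in the current environment,
    a cheap first signal of a hallucinated dependency."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# 'json' and 'os' are real stdlib modules; 'flask_security_utils' is
# a hypothetical hallucinated name used for illustration.
suspect = missing_imports(["json", "os", "flask_security_utils"])
```

For third-party names this only confirms local availability; cross-checking against the official index (PyPI, npm) is still needed before installing anything.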

User intent and how people interact with code

Beyond raw code quality, the study maps how real people interact with code-generation prompts. This is important because intent shapes what you get and how much you’ll need to intervene.

Default language preferences

  • When users don’t specify a language in their initial prompt, ChatGPT tends to generate Python code most often (roughly one-third of cases). Other frequent defaults include Bash, C++, HTML, and JavaScript.
  • Follow-up language choices: Users often request the same task in a different language in follow-ups, indicating that the task evolves or that developers want cross-language solutions.

Intent categories and multi-turn dynamics

  • Core intents: Bug fixing and code generation dominate both initial and follow-up queries. This isn’t surprising—practical coding help is the main driver of these conversations.
  • Security intentions: Secure coding is surprisingly rare in both initial queries and follow-ups. Even when there’s buggy code, explicit security concerns are not commonly raised.
  • Interaction length: Conversations tend to get longer when the initial intent is Secure Coding or when follow-ups involve Code Explanation or Bug Fixing (explanation work tends to require more back-and-forth). Surprisingly, even if security topics arise, they don’t always drive lengthy security-focused dialogue.

Language and intent caveats

  • The intent analysis focused on English conversations to ensure reliable language processing. That’s sensible for research, but it means non-English coding interactions aren’t represented in the intent results.
  • Zero-shot intent classification is scalable, but it can miss nuanced intents. The authors acknowledge this as a limitation and invite future work with more tailored models.

Hallucinations and user behavior

  • Users generally don’t detect missing or fake modules ("package hallucinations") when they appear in the code. They keep asking about the errors they see, rather than flagging the module names as suspicious.
  • The model itself doesn’t reliably flag or correct hallucinated dependencies in these cases, which points to a broader governance gap: the tool isn’t proactively preventing unsafe or unsound dependencies from slipping into code.

What this means for developers, teams, and tool designers

  • Security is unevenly prioritized in real-world AI-assisted coding. If you’re using ChatGPT to draft code, you can’t rely on the model to magically “get security right.” You need explicit security checks as part of your workflow.
  • Treat the output as a draft, not a drop-in solution. Use static analysis, linting, and security scanners as gatekeepers before you run or deploy AI-generated code.
  • Don’t trust dependency names at face value. Hallucinated modules are a real risk in Python (and to a lesser extent in JavaScript). Cross-check imports against official repositories (PyPI, npm) and consider adding a dependency-audit step that catches nonexistent or suspicious packages before installation.
  • Be careful with cryptography in generated code. Don’t assume MD5 or SHA-1 are acceptable for secure hashing; avoid weak RNGs for anything security-sensitive; favor current best practices and libraries for cryptography and randomness.
  • Prefer safer patterns for data handling. For Java, watch out for deserialization vulnerabilities; for SQL, prefer parameterized queries and input validation; for regex, test for potential ReDoS and keep patterns bounded or restricted by timeouts.
  • Leverage prompts that emphasize security up front. The study suggests that users rarely asked for security features, so you could design prompts that explicitly request secure-by-default patterns, input validation, and documentation of security trade-offs.
  • Use a layered approach to correctness: model output is just one layer; add human review, code style checks, and security verification in a continuous integration pipeline.
  • Invest in reproducible research tooling. The authors provide datasets and code for replication (HuggingFace and GitHub). If you’re a researcher or security-minded developer, this kind of transparency helps you measure improvements and design better safeguards.

Practical implications and future directions

  • Safer defaults and guardrails: Tool designers can build safer-by-default configurations, prompting the model to produce code with security checks or to avoid certain risky patterns unless explicitly requested.
  • Integrating security checks into coding assistants: Embedding static analyzers, dependency-checkers, and secure-by-default templates directly into IDEs or chat assistants could shift the behavior toward safer outcomes.
  • Education and prompting: Teaching users how to prompt for secure code (and how to spot hallucinations or unsafe patterns) could raise the baseline for safety in AI-assisted coding.
  • Language-specific considerations: The prevalence of Python and its ecosystem means particular attention to Python package hallucinations and common insecure patterns in Python scripts. Other languages may have different hotspots (e.g., deserialization in Java; memory-safety concerns in C/C++).
  • Real-world data vs. synthetic prompts: This study’s emphasis on authentic conversations is a strong reminder that synthetic benchmarks may understate or misstate the security gaps in AI-assisted coding.

Limitations to keep in mind

  • The analysis relies on regex-based vulnerability detection and a curated rule set. While scalable, this approach may miss some context-sensitive flaws and can produce false positives or false negatives.
  • The English-focused intent analysis, while sensible, excludes multilingual interactions that could reveal different patterns of user behavior and security awareness.
  • The study notes an undercount of certain types of vulnerabilities, particularly in longer, multi-file C/C++ programs, due to the scale and pattern coverage of the detectors.

Conclusion: A call for more security-aware AI-assisted coding

WildCode provides a sober, data-driven snapshot of how real people interact with ChatGPT when they’re asking for code, and what actually shows up in the resulting code. The headline findings are a clear signal: while AI-generated code can speed up development and support iterative workflows, it often falls short on security. Hashing choices lean on older, weak algorithms; SQL handling can invite injections; randomness may be insufficient for security-sensitive tasks; deserialization remains a major pitfall; and the lurking presence of hallucinated packages adds a layer of risk. Most tellingly, user behavior tends to prioritize getting code that works or is easy to fix over explicitly insisting on security features.

The silver lining is actionable: with better tooling, better prompts, and stronger governance around AI-assisted coding, we can push this field toward safer, more reliable practice. The study’s authors even provide their data and annotated conversations to help others replicate and extend this work—a valuable invitation to researchers and practitioners alike.

Key Takeaways

  • Real-world prompts reveal security gaps: In authentic chats, security is often deprioritized in favor of getting code that works or fixes quickly.
  • Security issues are non-trivial across languages: Hashing choices, SQL handling, weak RNGs, deserialization risks, and memory-safety concerns show up across Python, Java, C/C++, and JavaScript.
  • Hallucinated dependencies are a real risk: Python imports frequently include non-existent modules, with some impact in JavaScript. Don’t auto-install dependencies based on AI output without validation.
  • Regex-based ReDoS vulnerability is common but uncertain: About one-third of regexes flagged by detectors could be vulnerable to ReDoS; detectors may over-count, so treat this as a caution rather than a precise tally.
  • Language defaults shape outcomes: When users don’t specify a language, ChatGPT tends to pick Python first, then users often request translations or equivalents in other languages in follow-ups.
  • Users rarely push for secure coding: Even when buggy code is produced, explicit security concerns are not commonly raised in follow-ups.
  • Practical workflow implications: Treat AI-generated code as a draft—integrate static analysis, dependency validation, and security checks before running or deploying. Prompt design should explicitly request secure patterns, and teams should consider guardrails around dependency installation.

If you’re interested in the raw material and want to explore the data yourself, you can check the project’s repositories and datasets on HuggingFace and GitHub. This is a living space for practitioners and researchers who want to understand and improve the security of AI-assisted coding in the wild.
