Turning Vulnerability Descriptions into Quick Risk Scores: What Broad-Model Tools Reveal

Backlogs of CVEs threaten timely fixes. This post reviews a study testing general-purpose LLMs on 31,000 CVEs to see if descriptions can yield automated CVSS-like scores. It finds gains on some metrics but notes that missing context and ambiguity limit reliability, calling for richer descriptions, standardized templates, and safeguards.

If you’ve ever wrestled with a backlog of cybersecurity vulnerabilities and wondered whether machines could help you triage them faster, you’re not alone. The National Vulnerability Database (NVD) and its CVSS scoring system are essential for prioritizing fixes, but the sheer volume of new vulnerabilities each year has created a real bottleneck. A study by Jafarikhah, Thompson, Deans, Siadati, and Liu asks a bold question: can general-purpose large language models (LLMs) like GPT-5 or Gemini actually read vulnerability descriptions and assign the right CVSS metrics? And if so, how reliable would those automated scores be in real-world security operations?

Here’s a digest of what they did, what they found, and why it matters for anyone wrestling with vulnerability management.

The big idea: can descriptions become scores?

Think of CVSS (Common Vulnerability Scoring System) as a weather forecast for software vulnerabilities. It translates a vulnerability’s technical details into a numeric score and a set of sub-scores that help teams decide what to fix first. The catch is that getting consistent, timely CVSS scores is labor-intensive and often subjective. The researchers wanted to see if six general-purpose LLMs—GPT-4o, GPT-5, Llama-3.3-70B-Instruct, Gemini-2.5-Flash, DeepSeek-R1, and Grok-3—could predict the eight base CVSS metrics using only the vulnerability’s textual description (the “description” field in CVE entries). No CVE IDs, no lookup tricks. Just language understanding and reasoning.

Why is this important? If LLMs can do this well enough, organizations could automate large parts of the triage process, cutting backlogs and helping security teams focus on what truly matters. But there are big caveats. The study also digs into where models struggle, how descriptions influence results, and how far ensemble approaches can take us beyond any single model.

The data and the setup: what they used

  • Dataset: Over 31,000 CVEs published from 2019 onward, all with CVSS v3.1 base metrics and English descriptions. This is a substantial sample aimed at reflecting current vulnerability reporting.
  • Ground truth: The CVSS base metrics provided by MITRE’s CVE List serve as the ground truth for evaluation.
  • The six models: GPT-4o, GPT-5, Llama-3.3-70B-Instruct, Gemini-2.5-Flash, DeepSeek-R1, Grok-3. They used Azure AI Foundry for most models and Google AI Studio for Gemini.
  • Prompt design: A two-step prompt. First, extract the eight base metrics. Second, output them in a fixed format. They stuck to zero temperature for determinism but tested multiple few-shot variants; two-shot prompts turned out to be best.
  • Evaluation: They used accuracy, precision, recall, F1-score, and mean absolute error (MAE) to measure performance. Because CVSS metrics are imbalanced (some classes appear far more often than others), they report per-class and weighted averages to get a fair read on performance.
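Because per-class numbers are what expose minority-class weakness, it helps to see how weighted reporting works in miniature. Here is a small sketch of support-weighted evaluation on toy labels (the labels are made up for illustration, not taken from the study's data):

```python
from collections import Counter, defaultdict

# Toy labels for one imbalanced CVSS metric (e.g. Privileges Required),
# where "N" is the majority class.
y_true = ["N", "N", "N", "N", "L", "L", "H", "N"]
y_pred = ["N", "N", "L", "N", "L", "N", "H", "N"]

support = Counter(y_true)           # class frequencies in the ground truth
correct = defaultdict(int)
for t, p in zip(y_true, y_pred):
    if t == p:
        correct[t] += 1

accuracy = sum(correct.values()) / len(y_true)
# Per-class recall, then a support-weighted average: per-class numbers
# reveal when a model rides the majority class while missing rare ones.
recall = {c: correct[c] / n for c, n in support.items()}
weighted_recall = sum(recall[c] * n for c, n in support.items()) / len(y_true)
```

Note how overall accuracy (0.75 here) hides the fact that the minority class "L" is recalled only half the time, which is exactly the failure mode the study reports for rare CVSS values.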

In short: they built a pipeline to feed thousands of CVE descriptions to several top LLMs, forced a structured, metric-by-metric output, and then compared how close the model-generated CVSS base metrics were to the official scores.
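A pipeline like this lives or dies on validating the model's structured output. The sketch below shows only the parsing half, assuming the model is asked to emit a CVSS v3.1 vector string; the regex and validation logic are my own illustration, not the authors' code:

```python
import re

# The eight CVSS v3.1 base metrics and their allowed values.
BASE_METRICS = {
    "AV": {"N", "A", "L", "P"},   # Attack Vector
    "AC": {"L", "H"},             # Attack Complexity
    "PR": {"N", "L", "H"},        # Privileges Required
    "UI": {"N", "R"},             # User Interaction
    "S":  {"U", "C"},             # Scope
    "C":  {"N", "L", "H"},        # Confidentiality Impact
    "I":  {"N", "L", "H"},        # Integrity Impact
    "A":  {"N", "L", "H"},        # Availability Impact
}

def parse_vector(text):
    """Extract a CVSS v3.1 vector from model output; return None if invalid."""
    m = re.search(r"CVSS:3\.1/((?:[A-Z]+:[A-Z]/?)+)", text)
    if not m:
        return None
    pairs = dict(p.split(":") for p in m.group(1).rstrip("/").split("/"))
    # Reject outputs that miss a metric or use an out-of-range value.
    if set(pairs) != set(BASE_METRICS):
        return None
    if any(v not in BASE_METRICS[k] for k, v in pairs.items()):
        return None
    return pairs
```

Rejecting malformed outputs instead of guessing is what makes the "did the model produce a valid output" reliability signal (used later for the meta-classifier) possible.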

What they found: performance, patterns, and surprises

The results aren’t uniform, but they shine a few important lights on the promise and limitations of automated vulnerability scoring.

Strong performers on certain metrics

  • Attack Vector (Network, Adjacent, Local, or Physical): Gemini led with 89.42% accuracy, far above a baseline of 72.65%. GPT-5 wasn’t far behind (87.96%), with excellent recall, meaning it often got the positive cases right.
  • Attack Complexity: GPT-5 topped this metric with 84.66% accuracy (baseline 83.85%), achieving a balanced precision and a high F1-score. This suggests GPT-5 was particularly good at distinguishing how hard an attack would be to pull off.
  • User Interaction and other usability-oriented cues: GPT-5 again performed very well, especially for User Interaction, where it achieved about 88.95% accuracy.
  • Overall consistency: GPT-5, Gemini, and Grok stood out relative to the rest across several base metrics.

These results show that some models are consistently strong at identifying how a vulnerability can be exploited and whether it requires user interaction, both of which are highly actionable in triage.

Weaker areas: the hard parts

  • Availability, Confidentiality, Integrity Impacts: These “impact” dimensions were tougher. GPT-5 did best on Availability Impact (67.95% accuracy), but many models struggled to distinguish Low from High impact, and the rest fell off more noticeably on these metrics.
  • Privileges Required and other nuanced distinctions: Privileges Required and similar dimensions were trickier, with wider gaps between model performance and ground truth. There was also a pattern of bias toward predicting majority classes (e.g., “High” or “Unknown” too often) in some metrics.

Overall, even the best single model (GPT-5) showed only modest gains over the baseline on some metrics and sometimes struggled with the minority classes (the less frequent CVSS scores).

Meta-classification: a modest but meaningful win

To push beyond any single model’s strengths, the researchers built a meta-classifier. They fed a range of features:

  • The six LLMs’ raw predicted labels
  • How much those models agreed with each other (pairwise consensus, majority-vote signals)
  • Reliability indicators (whether a model produced a valid output)
  • A simple weighting idea to account for historical performance

They tested several ensemble methods: Voting, Random Forest, Gradient Boosting, Logistic Regression, SVM, and a Neural Network. Across all eight CVSS metrics, the meta-classifier consistently beat the strongest individual model.
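The agreement-based features described above can be sketched as follows; the function name and exact feature set are illustrative (in the study, features like these feed the downstream classifiers rather than a bare majority vote):

```python
from collections import Counter

def meta_features(preds):
    """Build ensemble features from six models' predicted labels for one metric.

    preds: list of six labels, with None where a model produced invalid output.
    Returns (features_dict, majority_vote_label).
    """
    valid = [p for p in preds if p is not None]
    counts = Counter(valid)
    majority, majority_n = counts.most_common(1)[0] if counts else (None, 0)
    # Pairwise consensus: fraction of model pairs that agree.
    pairs = [(a, b) for i, a in enumerate(valid) for b in valid[i + 1:]]
    agreement = sum(a == b for a, b in pairs) / len(pairs) if pairs else 0.0
    features = {
        "labels": preds,                       # raw predictions per model
        "valid_count": len(valid),             # reliability indicator
        "majority_share": majority_n / len(preds),
        "pairwise_agreement": agreement,
    }
    return features, majority
```

A meta-classifier can then learn, for example, that a unanimous six-way vote is trustworthy while a 3–3 split on Privileges Required usually needs the historically stronger model's answer.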

  • Average accuracy: 79.54% for the meta-classifier versus 78.55% for the best single model (GPT-5).
  • Notable gains: Scope and Attack Vector metrics benefited the most from ensembling; Scope improved by about 3 percentage points, Attack Vector by roughly 1 point.
  • Takeaway: There’s value in letting multiple models “vote” and in designing features that capture how much models agree or disagree.

That said, the improvements aren’t gigantic. The biggest gains still hinge on providing better context, fewer ambiguous phrases, and richer descriptive cues in the vulnerability descriptions themselves.

What about the quality of the descriptions?

A core insight is that the description content matters more than the model’s size or fancy architecture in this particular task. The study dug into several hypotheses:

  • Description length: No strong correlation with accuracy. Longer descriptions didn’t automatically yield better predictions.
  • Named entities in the description: The count of organizations or product names had only a very weak relationship with accuracy.
  • Information content (a measure of how informative the text is): Again, no strong link to model precision.

This suggests that what matters isn’t fluff or verbosity, but whether the description contains sufficient, unambiguous context about how the vulnerability can be exploited and what controls are involved.
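The length-vs-accuracy check is easy to reproduce in miniature with a Pearson correlation (with a binary correctness variable this is the point-biserial correlation); both lists below are made-up illustrative data, not the study's measurements:

```python
# Toy check of the "longer descriptions => better predictions?" hypothesis.
lengths = [120, 340, 95, 510, 260, 180, 430, 75]   # description length (chars)
correct = [1, 1, 0, 1, 0, 1, 1, 0]                 # 1 = model scored it right

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(lengths, correct)
```

On real data the study found no strong correlation of this kind, which is what rules out "just write longer descriptions" as a fix.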

What about other data features?

They also experimented with adding structured data like CPE (Common Platform Enumeration) and CWE (Common Weakness Enumeration) to prompts. Those didn’t improve results beyond what the description alone provided. And they tried including the CVE ID—ironically, that created a big artificial boost because models could memorize known vectors and simply retrieve the right CVSS. To keep things honest and generalizable, they omitted CVE IDs from the final setup.

The takeaway: a description-only input is the right ballpark for evaluating a model’s true reasoning ability on this task. It’s about inference, not lookup.

What this means for real-world vulnerability triage

So, can LLMs actually help security teams? The answer is a cautious yes—with clear caveats.

  • They can handle scale: When you’re facing tens of thousands of CVEs, a capable model can quickly produce provisional CVSS base metrics for many entries, enabling faster triage workflows.
  • They’re most reliable on certain dimensions: Exploitability-related metrics (like Attack Vector and Attack Complexity) and some usability-focused metrics (like User Interaction) tend to be more predictable from text descriptions.
  • They struggle with rare cases and ambiguous language: When descriptions are vague, conflicting, or missing critical details (e.g., whether user interaction is required, or what privileges are needed), models can misclassify. This is a real-world risk if teams rely on these scores without validation.
  • Ensemble helps, but only modestly: A meta-classifier that aggregates multiple LLMs improves overall accuracy, but the gains aren’t dramatic. The complexity and cost of running multiple models may not always be justified, depending on resources and risk tolerance.
  • Context matters: The research emphasizes that richer, better-contextual vulnerability descriptions are the biggest levers for improving automated scoring. If vulnerability reports include more detail about libraries, dependencies, configurations, and potential exploits, automated scoring becomes far more reliable.

Practical implications for security teams:
- Use as a first-pass triage tool: An automated pass to generate CVSS base metrics can accelerate triage, especially for high-volume feeds, but always couple with human review for minority classes or unclear cases.
- Invest in better reporting: Encouraging reporters (vendors, researchers, or internal teams) to add explicit context (e.g., required privileges, potential user interactions, affected configurations) could dramatically improve automated scoring.
- Be mindful of data leakage risks: Avoid including identifiers like CVE IDs in prompts if you’re evaluating the model’s ability to generalize—otherwise you risk the model “looking up” known cases rather than truly reasoning from description.
- Plan for human-in-the-loop workflows: Automating scoring is valuable, but final risk judgments should still involve skilled analysts, especially for high-stakes or ambiguous vulnerabilities.

Where this work points next

The study doesn’t claim that LLMs can perfectly replace human CVSS scoring. Instead, it demonstrates a clear potential path: thoughtful, well-designed automation can scale vulnerability triage, provided we address description quality and contextual gaps.

  • Instruction tuning and domain adaptation: The authors note that instruction tuning specifically for vulnerability assessment could push model reliability higher.
  • External contextual signals: Integrating signals like dependency graphs, library usage, and public PoCs could help models reason about impact and exploitability more accurately.
  • Robust evaluation in operational settings: Real-world deployment would require monitoring model confidence, fallback strategies, and continuous validation against human expert judgments.
  • Balancing speed and accuracy: Organizations need to balance the speed of automated scoring with the reliability of the final risk assessment, especially for critical systems.

Real-world applications: how teams could start experimenting

If you’re curious about trying this in your org, here are practical steps inspired by the study:

  • Start with a description-only prompt: Build a small pipeline that takes CVE descriptions and returns the eight CVSS base metrics in a fixed format. Use a couple of strong LLMs to compare outputs.
  • Use a two-shot prompt as a baseline: The study found two-shot prompts offered the best performance among their tested settings. Provide two example mappings (description → CVSS metrics).
  • Keep a human-in-the-loop: Flag low-confidence cases (e.g., when the model’s outputs conflict with prior triage notes) for expert review.
  • Track metric-specific performance: Some CVSS metrics will be reliable, others not. Maintain dashboards that show precision/recall/F1 per metric so you know where automation helps most.
  • Improve descriptions: Create reporting templates that require/encourage more explicit contextual details (privileges, user interaction, impacted components, and dependencies). This is the biggest lever for improving automated scoring.
  • Avoid lookup tricks in prompts: Don’t include CVE IDs in prompts if your goal is to measure true reasoning; that can artificially inflate performance by letting the model pull known vectors.
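Putting the first two steps together, a two-shot, description-only prompt might look like the sketch below. The two example mappings are my own illustrative description-to-vector pairs, not examples from the paper:

```python
# Minimal two-shot prompt template in the spirit of the study's best setting.
TEMPLATE = """You are a vulnerability analyst. Given only a CVE description,
output the eight CVSS v3.1 base metrics as a vector string and nothing else.

Description: A buffer overflow in the parser allows a remote, unauthenticated
attacker to execute arbitrary code without user interaction.
Answer: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H

Description: A local user with administrative rights can read a log file
containing session tokens.
Answer: CVSS:3.1/AV:L/AC:L/PR:H/UI:N/S:U/C:H/I:N/A:N

Description: {description}
Answer:"""

prompt = TEMPLATE.format(
    description="An XSS flaw requires a victim to click a crafted link."
)
```

Note what the template deliberately omits: no CVE ID, no CPE/CWE data, just the description, which keeps the evaluation about inference rather than lookup.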

Final takeaways from the study

  • General-purpose LLMs can predict CVSS base metrics from vulnerability descriptions at a scale that would be impractical for humans alone, with GPT-5 and Gemini showing particularly strong performance on several key metrics.
  • Some CVSS dimensions (notably exploitability-related metrics like Attack Vector and Attack Complexity) are more amenable to text-based inference than others (notably certain impact metrics like Availability and Privileges Required).
  • A meta-classifier that ensembles multiple LLM outputs provides a modest but consistent improvement across metrics, highlighting the value of leveraging complementary strengths.
  • The quality and richness of vulnerability descriptions are the biggest determinant of accurate automated scoring. Longer texts or more named entities don’t automatically improve results; clarity and context do.
  • Attempts to add extra structured data (like CPE/CWE) didn’t meaningfully improve performance, while including CVE IDs dramatically improved results through memorization rather than genuine reasoning. This underscores the importance of careful experimental design when evaluating AI for security tasks.
  • In practice, automated CVSS scoring can help reduce triage backlogs and accelerate risk assessment, but it should be used with guardrails and human oversight. Richer, context-rich vulnerability descriptions are the most promising path to more reliable automated scoring.

Key Takeaways

  • Automated CVSS scoring from vulnerability descriptions is feasible at scale, with notable strength in certain metrics (e.g., Attack Vector, User Interaction) shown by models like GPT-5 and Gemini.
  • A meta-classifier that ensembles multiple LLMs offers a reliable, if modest, improvement over any single model, reinforcing the idea that “wisdom of the crowd” helps in nuanced classification tasks.
  • The biggest bottleneck isn’t model size or sophistication; it’s the quality of the vulnerability context. Enriching CVE descriptions with richer contextual signals can substantially boost accuracy.
  • Avoid prompts that rely on CVE IDs or other lookup shortcuts if your goal is to test genuine reasoning, and be cautious of potential memorization in model evaluations.
  • For real-world use, pair automated scoring with human review, prioritize automation for high-volume, high-reliability metrics, and invest in better vulnerability reporting templates to maximize the impact of automated triage.

If you’re exploring prompt engineering to improve vulnerability scoring in your organization, a pragmatic approach is to start with a robust, description-only prompt validated on a representative sample of CVEs, implement a two-shot setup, and gradually layer in validation checks and richer contextual cues. The future of vulnerability triage might well hinge on how clearly we can describe the stories behind vulnerabilities—and how well we can teach machines to reason about those stories.
