Edge-Driven IoT Security Gets a Brain: A Hybrid ML-LLM Framework for Attack Detection and Tailored Mitigations

A hybrid ML-LLM framework blends fast, feature-based attack detection with context-aware reasoning, translating detections into practical mitigations for IoT devices and networks. This preview covers the concept, the datasets (Edge-IIoTset, CICIoT2023), and the headline results: Random Forest as the top detector and ChatGPT-o3 as the stronger model for attack analysis and mitigation guidance.

The Internet of Things (IoT) is everywhere—from smart homes to industrial plants—driving efficiency, innovation, and new business models. But with billions of connected devices comes a big security challenge: attackers are constantly probing, probing, probing until they find a weak link. Enter a fresh take on IoT defense that blends two worlds you usually hear about separately: fast, feature-based machine learning (ML) for spotting what’s happening, and thoughtful, context-aware reasoning from large language models (LLMs) to understand how attacks unfold and what to do about them. This hybrid approach isn’t just “more AI”; it’s about making security tools that actually reason about behavior and translate that reasoning into practical steps you can deploy on real devices. Here’s what this research is all about and why it matters.

Why this hybrid approach matters

IoT/IIoT networks expand the attack surface in ways traditional IT security never did. Devices range from tiny sensors to industrial controllers, each with its own constraints: limited CPU, memory, power, and sometimes patching cycles that are far from ideal. The research behind this framework recognizes two core gaps:

  • ML/DL models excel at detecting known attacks when given well-chosen features, but they struggle to explain the attacker’s behavior in a way security teams can act on. They don’t readily translate detection into concrete mitigation guidance tailored to device limits.
  • LLMs promise richer attack-behavior analysis and mitigation suggestions, but evaluating their outputs in a standardized, objective way has been a challenge. Without benchmarks, it’s hard to compare which model helps more in real-world settings.

The proposed solution pairs ML for precise attack detection with LLM-powered, context-aware analysis and mitigation, all guided by retrieval-augmented generation (RAG). In short: a smart duo that detects what’s happening and then explains why it’s happening, what it could lead to, and exactly how to respond—on devices with modest resources.


The Framework: Four Building Blocks

The authors outline a four-part architecture that sits between raw traffic data and actionable defense measures.

1) Attack Detection (ML/DL Classifier)

  • Purpose: Identify the class of attack from network traffic features.
  • How it works: A suite of ML/DL models is trained on IoT/IIoT traffic to label 13 common attack types, drawn from two well-known datasets.
  • What topped the charts: Random Forest (RF) emerged as the best multi-class detector across both datasets, with an F1-score of 0.9253 on Edge-IIoTset and 0.8101 on CICIoT2023. This means RF was the most reliable at both catching attacks and avoiding false alarms.
  • Other contenders: XGBoost (XGB) was a strong runner-up with notable results in some classes, while the deep learning models (CNN, LSTM, DNN) didn’t consistently beat RF across scenarios.

Why RF? IoT traffic is often noisy, heterogeneous, and feature-rich. RF handles mixed data types well, is robust to outliers, and doesn’t require heavy feature engineering, making it well-suited for the IoT edge where resources are precious.
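
To make the detection stage concrete, here is a minimal sketch of training and evaluating a Random Forest multi-class detector with scikit-learn. The CSV path and the Attack_type label column are illustrative stand-ins for a preprocessed feature table, not artifacts from the paper.

```python
# Minimal sketch: Random Forest as a multi-class IoT attack detector.
# The CSV path and "Attack_type" column are illustrative, not from the paper.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("iot_traffic_features.csv")   # preprocessed, numeric feature table
X = df.drop(columns=["Attack_type"])
y = df["Attack_type"]                          # normal traffic + the 13 aligned attack classes

# 80/20 split, stratified so rare attack classes appear in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
clf.fit(X_train, y_train)

# Per-class precision/recall/F1, the same metrics the study reports
print(classification_report(y_test, clf.predict(X_test), digits=4))
```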

2) Retrieval-Augmented Generation (RAG) for context

  • Purpose: Enrich LLMs with concrete, context-specific knowledge at inference time.
  • How it works: The system builds two knowledge bases:
    • Attack knowledge base: Maps each attack label to concise, research-backed descriptions (sourced from standards and literature like NIST).
    • Device knowledge base: Captures device specs (CPU, memory, OS, network interfaces) so mitigations fit the actual hardware.
  • Retrieval engine: Attack descriptions and device specs are encoded with a sentence-embedding model (all-MiniLM-L6-v2) and stored in FAISS indices for fast nearest-neighbor lookup.
  • Why this matters: LLMs can be impressive but are prone to hallucination and stale knowledge. With RAG, the LLMs receive precise, up-to-date, device-aware, attack-specific information to ground their analyses and recommendations.

In practice, when the ML component flags an attack, the RAG system fetches the most relevant attack write-up and device details and feeds them into the LLM prompt, so the LLM can reason with correct context.
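
Here is a minimal sketch of that retrieval step using sentence-transformers and FAISS. The knowledge-base entries and helper functions are illustrative placeholders rather than the paper's actual content.

```python
# Minimal sketch: grounding LLM prompts with FAISS retrieval over small
# attack/device knowledge bases. The entries below are illustrative placeholders.
import faiss
from sentence_transformers import SentenceTransformer

attack_kb = [
    "Password Cracking: repeated authentication attempts against exposed services ...",
    "DDoS (UDP flood): high-rate traffic aimed at exhausting bandwidth and sockets ...",
]
device_kb = [
    "Raspberry Pi 4: quad-core ARM Cortex-A72, 4 GB RAM, Linux, Ethernet/Wi-Fi ...",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(texts):
    # Encode each entry and store the vectors for nearest-neighbor lookup
    emb = encoder.encode(texts, convert_to_numpy=True).astype("float32")
    index = faiss.IndexFlatL2(emb.shape[1])
    index.add(emb)
    return index

attack_index = build_index(attack_kb)
device_index = build_index(device_kb)

def retrieve(index, corpus, query, k=1):
    q = encoder.encode([query], convert_to_numpy=True).astype("float32")
    _, ids = index.search(q, k)
    return [corpus[i] for i in ids[0]]

# The detected attack label and device name drive retrieval at inference time
attack_context = retrieve(attack_index, attack_kb, "Password Cracking")
device_context = retrieve(device_index, device_kb, "Raspberry Pi")
```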

3) Prompt Engineering for Attack Analysis (ChatGPT-o3 and DeepSeek-R1)

  • Role-play prompts: The LLMs are cast as cybersecurity analysts in an IoT environment, given the specific attack class and the JSON-formatted traffic features.
  • What they produce: Detailed attack behavior analyses and tailored mitigations, designed to be immediately actionable and device-aware (e.g., suitable for a Raspberry Pi or other edge device).
  • The design goal: Generate insightful analyses, explain which traffic indicators matter, and propose mitigations that align with typical IoT hardware constraints.

The study extended an existing “ShieldGPT” approach to cover 13 attack types (not just DDoS) and ensured the prompts guided LLMs to consider both the attack mechanics and the particular device in play.
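
To illustrate, here is a sketch of how such a role-play prompt might be assembled. The wording, field names, and feature values are illustrative and not the study's exact template.

```python
# Sketch: assembling a role-play analysis prompt from the detected label,
# JSON-formatted traffic features, and RAG-retrieved context.
# The template wording and feature values are illustrative, not the paper's.
import json

def build_analysis_prompt(attack_label, traffic_features, attack_context, device_context):
    return f"""You are a cybersecurity analyst monitoring an IoT/IIoT network.

Detected attack class: {attack_label}

Network traffic features (JSON):
{json.dumps(traffic_features, indent=2)}

Background on this attack (retrieved):
{attack_context}

Target device specifications (retrieved):
{device_context}

Tasks:
1. Explain the attack behavior and which traffic indicators reveal it.
2. Propose concrete mitigations that fit the device's CPU, memory, and OS constraints.
"""

prompt = build_analysis_prompt(
    attack_label="Password Cracking",
    traffic_features={"dst_port": 22, "conn_rate": 48.0, "failed_auth_ratio": 0.92},
    attack_context="Password cracking involves repeated authentication attempts ...",
    device_context="Raspberry Pi 4: quad-core ARM Cortex-A72, 4 GB RAM, Linux ...",
)
```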

4) Evaluation by an Ensemble of Judge LLMs (and Humans)

  • Objective: Move beyond subjective impressions and toward objective, diverse evaluation of LLM-generated content.
  • How it works: The authors used an evaluation prompt to have eight different judge LLMs—plus human experts—score the LLM outputs across four metrics (see below). This ensemble approach helps mitigate biases that any single model might bring.
  • The judge LLMs included a mix of widely used and new models: ChatGPT-4o, Mixtral, Gemini, Meta Llama 4, TII Falcon H1, DeepSeek-V3, xAI Grok 3, Claude 4 Sonnet.
  • The scoring: Each attack scenario gets evaluated on four criteria, with scores totaling 10 points per scenario. The criteria are designed to capture both technical accuracy and practical usefulness.

Together, these four pieces create a pipeline that detects an attack, grounds the subsequent reasoning with retrieved knowledge and device context, and then assesses the quality of the LLM-derived guidance through an objective judging system.
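
As a rough sketch of how these pieces could be wired together, the function below strings the four stages into one flow. Every argument is a placeholder for a component described above (the detector, the two retrieval helpers, the analyst LLM, and the judge ensemble), not an API from the paper.

```python
# High-level sketch of the pipeline: detect -> retrieve -> analyze -> judge.
# Every argument is a placeholder for a component described above.

def handle_flow(flow_features, detector, retrieve_attack, retrieve_device,
                analyst_llm, judges, device_name="Raspberry Pi"):
    # 1) ML detection on the extracted traffic features (dict of feature -> value)
    label = detector.predict([list(flow_features.values())])[0]
    if label == "Normal":
        return None

    # 2) RAG: fetch the attack write-up and device specs that ground the prompt
    attack_ctx = retrieve_attack(label)
    device_ctx = retrieve_device(device_name)

    # 3) Role-play prompt -> LLM attack analysis and device-aware mitigations
    #    (build_analysis_prompt is the helper sketched earlier)
    prompt = build_analysis_prompt(label, flow_features, attack_ctx, device_ctx)
    report = analyst_llm(prompt)

    # 4) Judge ensemble scores the report on the 10-point rubric; average the totals
    scores = [judge(report) for judge in judges]
    return report, sum(scores) / len(scores)
```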


Datasets That Ground the Framework

Two well-known IoT-focused datasets were used to train and benchmark the detectors and to map attack types for cross-dataset evaluation.

  • Edge-IIoTset: Collected from a seven-layer IoT/IIoT testbed with a mix of devices (from Raspberry Pi edge nodes to industrial controllers and SDN components). It contains normal and malicious traffic across 14 attack types, grouped into five categories: DDoS, information gathering, MITM, injection, and malware. Features include network traffic metrics, system processes, protocol services, and security events. About 1,176 features were extracted, and 61 were selected for modeling.
  • CICIoT2023: Gathered from a large smart-home testbed with 105 IoT devices. It includes 33 attack classes across seven categories (DDoS, DoS, reconnaissance, web-based, brute force, spoofing, Mirai-like threats). It provides 47 features from packet-flow windows, covering traffic behavior, protocol interactions, and header attributes.

Crucially, the study aligned 13 attack types common to both datasets to enable fair cross-dataset evaluation. This alignment is what makes the hybrid framework practically useful: it can work across different IoT environments and data collection setups.


From Features to Behavior: What Happens Under the Hood

A few practical details help you understand how the system actually operates on real data.

  • Data preprocessing: They cleaned data by removing irrelevant columns, handling missing values and duplicates, encoding categoricals, and standardizing features. Then they split the data 80/20 into train and test sets.
  • Feature vs. pattern learning: The comparison covered both traditional feature-based ML and DL approaches. While CNNs, LSTMs, and DNNs were tried, RF consistently provided the best multi-class performance on both datasets.
  • RAG specifics: The attack knowledge base is anchored by trusted sources (NIST and related research). That content is converted to sentences, embedded, and stored in FAISS. Retrieval uses the detected attack label and device name to pull the most relevant descriptions and device specs, feeding the LLM prompts with precise, relevant context.
  • Prompt structure (example): The Password Cracking attack scenario includes a JSON of network traffic features, a short description of the attack, and device specs for the Raspberry Pi used to illustrate mitigations. The LLMs then produce a structured analysis and concrete mitigations.

This design ensures the LLMs don’t have to reinvent the wheel—they’re given a concise, trusted context and a focused mission: explain what’s happening, why it matters, and how to respond efficiently.
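
As a rough illustration of the preprocessing step listed above, here is a sketch using pandas and scikit-learn. The file path and column names (such as Attack_type and the dropped identifier columns) are illustrative assumptions, not the papers' exact schemas.

```python
# Sketch of the preprocessing steps: drop irrelevant columns, handle missing
# values and duplicates, encode categoricals, standardize, then split 80/20.
# The file path and column names are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("raw_iot_traffic.csv")

df = df.drop(columns=["src_ip", "dst_ip", "timestamp"], errors="ignore")  # identifiers add no signal
df = df.dropna().drop_duplicates()                                        # missing values and duplicates

y = df.pop("Attack_type")        # multi-class label column
X = pd.get_dummies(df)           # one-hot encode any categorical features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42                      # 80/20 split
)

scaler = StandardScaler()        # standardize features; fit on train only to avoid leakage
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```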


The Attack Scenario Prompts and the Judge

Two interesting pieces of the approach deserve a quick walkthrough:

  • Attack scenario prompts: The researchers used a structured template to guide the LLMs to describe attack behavior and propose mitigations. The scenario includes the specific attack class, JSON-formatted traffic features, and RAG-supplied content (attack description and device specs). The goal is to elicit organized, evidence-based responses rather than vague generalities.
  • The judge prompts: After obtaining LLM responses, judge LLMs evaluate them along four metrics. The scoring is designed to reflect real security needs:

    • Attack Analysis and Threat Understanding (0-3 points)
    • Mitigation Quality and Practicality (0-3 points)
    • Technical Depth and Security Awareness (0-2 points)
    • Clarity, Structure, and Justification (0-2 points)

    A perfect total is 10 points per scenario, with judges providing justification for their scores.

    Eight judge LLMs participated in the study, offering a broad view of how different models interpret and value the same content. This ensemble approach helps curb individual model biases and showcases the robustness of the evaluation framework.
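
For a concrete sense of the bookkeeping, here is a small sketch of collecting and averaging rubric scores from several judges. The judge names and score values are made up for illustration and are not results from the paper.

```python
# Sketch: aggregating rubric scores from an ensemble of judge LLMs.
# Scores follow the 3/3/2/2 rubric (max 10 per scenario); the judge names
# and values below are illustrative, not results from the study.
RUBRIC_MAX = {"attack_analysis": 3, "mitigation": 3, "technical_depth": 2, "clarity": 2}

judge_scores = {
    "judge_a": {"attack_analysis": 3, "mitigation": 2, "technical_depth": 2, "clarity": 2},
    "judge_b": {"attack_analysis": 2, "mitigation": 3, "technical_depth": 2, "clarity": 1},
}

def valid(scores):
    # Reject scores that fall outside the rubric's per-criterion range
    return all(0 <= v <= RUBRIC_MAX[k] for k, v in scores.items())

totals = {name: sum(s.values()) for name, s in judge_scores.items() if valid(s)}
ensemble_average = sum(totals.values()) / len(totals)    # out of 10 per scenario

print(totals, ensemble_average)
```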


Key Findings: What Worked and Why It Matters

Here are the headline results and what they imply for IoT security practice.

  • Detection performance (ML/DL): Random Forest (RF) was the standout detector for multi-class attack classification on both datasets. Edge-IIoTset: RF achieved F1 = 0.9253; CICIoT2023: RF achieved F1 = 0.8101. This suggests RF’s bias-variance balance and its ability to handle heterogeneous IoT data are particularly well suited to the attack landscape in these datasets.
  • Alternative detectors: XGBoost (XGB) performed well and often came in second place, while some DL models lagged in the same tasks, underscoring that for IoT multi-class detection, a strong, well-tuned tree-based method can beat heavier networks in many scenarios.
  • LLMs for attack analysis: ChatGPT-o3 outperformed DeepSeek-R1 in both attack analysis and mitigation suggestion across both datasets. The numbers from the judge ensemble show a consistent edge for ChatGPT-o3, with average scores favoring its deeper reasoning and more practical mitigations.
  • Judge ensemble outcomes: When averaged across all attack classes, ChatGPT-o3 achieved higher scores from judge LLMs and human experts compared to DeepSeek-R1. This indicates that, at least in this setup, the “brain” guiding the response tends to be more reliable, contextual, and actionable.
  • Realistic mitigations: The study highlighted defenses like multi-factor authentication (MFA), strong password policies, and rate-limiting for account lockouts in Password Cracking scenarios. While these are familiar defenses, the key value is tailoring them to the device and traffic context provided by the RAG prompts—making recommendations practical, not theoretical.

What this means for practitioners: you don’t have to settle for a black-box detector. A robust detector (RF in this case) can be paired with a context-aware reasoning system (LLMs guided by RAG) to deliver not just alerts but a full narrative about attacker behavior and a concrete, device-aware path to mitigation. This is especially valuable for edge environments where setting up complex defenses needs to be both effective and lightweight.


Real-World Implications: How This Helps Teams Today

  • Actionable guidance grounded in context: The RAG layer ensures that the LLM responses aren’t generic. They incorporate attack descriptions and device realities, so mitigations are plausible on devices with limited CPU, memory, or patch windows.
  • Cross-environment consistency: Since the approach maps multiple attack classes to a common set of 13 types across Edge-IIoTset and CICIoT2023, security teams can adapt the framework to different deployments without reengineering the entire pipeline.
  • Objective benchmarking for LLMs: By introducing structured evaluation metrics and an ensemble of judge LLMs, the work sets up a way to compare reasoning models beyond hand-wavy judgments. Teams can use similar benchmarks to assess new LLM-powered defenses as models evolve.
  • Device-aware mitigations: The device knowledge base prompts the LLMs to consider hardware constraints—vital for IoT where a one-size-fits-all solution doesn’t exist. In practice, this means more realistic configurations, code snippets, and security policies that can be deployed on edge devices.

Potential challenges and considerations:
  • Resource constraints: LLM inference, even with prompt engineering, can be heavy. The framework assumes a capable edge gateway or a nearby server to run the LLMs, with RF handling the heavy lifting of detection at the edge.
  • Data privacy: Using RAG with external knowledge and device specs means careful handling of sensitive information about devices and networks. An on-premises or tightly controlled retrieval setup is important for enterprise use.
  • Maintenance of the knowledge base: Attack descriptions and device specs should be kept current. Threat landscapes evolve, and NIST-style sources are a good anchor, but periodic updates are essential.


Practical Takeaways for Prompts, People, and Process

  • Start with strong detectors: If you’re implementing a hybrid detection-and-reasoning system, a solid, well-tuned classifier (RF here) provides a dependable foundation. It’s not always about the deepest model—it's about the right tool for the job at hand.
  • Ground LLMs in context with RAG: The value of retrieval-augmented generation is not just “more data”—it’s better alignment between what the model knows and what the system actually needs to know. For IoT security, that means linking attack descriptions and device specs to the specific incident.
  • Use role-based prompts: Treating LLMs as cybersecurity analysts with explicit roles helps steer the reasoning process, making outputs more structured and actionable.
  • Evaluate with a diverse panel: An ensemble of judge models plus human experts offers a more robust view of performance. The goal isn’t a single-number score but a consensus on practicality and depth.
  • Align with real-world constraints: Always tailor mitigation guidance to the device and network context. Generic “best practices” are less useful than step-by-step, device-aware actions (e.g., MFA considerations for a specific IoT edge device, rate-limiting thresholds tuned to device capabilities).

If you’re thinking about adopting a similar approach, here are pragmatic steps you can start with:

  • Gather a representative IoT/IIoT dataset that reflects your deployment’s traffic patterns and threat models.
  • Train a strong, lightweight detector (RF or similar) and validate across multiple classes, not just binary detection.
  • Build lightweight knowledge bases anchored in reputable sources (NIST, vendor docs) and couple them with a fast embedding-and-indexing pipeline (e.g., FAISS + MiniLM).
  • Design role-based prompts and a clear output schema for LLM analyses and mitigations.
  • Create a judge workflow with multiple LLMs (or at least a diverse set of prompts) plus human review for calibration.

Key Takeaways

  • A four-part hybrid framework merges ML-based attack detection with LLM-driven attack behavior analysis and mitigation suggestions, all grounded by Retrieval-Augmented Generation (RAG).
  • RF emerged as the top ML detector for multi-class IoT attack classification on two major datasets, underscoring its practicality for heterogeneous IoT traffic.
  • RAG, built on attack and device knowledge bases, provides LLMs with precise, device-aware context to produce relevant, actionable insights rather than generic responses.
  • ChatGPT-o3 consistently outperformed DeepSeek-R1 in the study’s attack analyses and mitigations, according to a diverse ensemble of judge LLMs and human experts.
  • The evaluation framework uses a rigorous, multi-metric approach: traditional detection metrics (Precision, Recall, F1) plus a 10-point judge-based scale across Attack Analysis, Mitigation quality, Technical depth, and Clarity.
  • Real-world impact: this approach helps security teams translate detection into concrete, device-tailored defenses, a critical capability for resource-constrained edge environments.
  • The methodology is adaptable to different IoT deployments because it aligns 13 attack types across two major datasets, enabling cross-environment benchmarking and more consistent security improvements.

If you’re exploring prompts to improve your own security workflows, consider adopting the prompt-architecture ideas here: role-playing prompts for analysts, RAG-guided context, and a structured evaluation rubric. It’s not just about catching malicious traffic; it’s about understanding it well enough to counter it in practical, real-world settings.


Frequently Asked Questions