Dynamic Malware Analysis Using AI Agent and Reinforcement Learning
Introduction
Malware analysis is a time-consuming and resource-intensive task that requires expert knowledge and careful investigation. Security analysts must decide which analytical techniques to apply, from basic file inspection to deep memory dumps, while balancing thoroughness against time constraints. Traditional automated systems either perform exhaustive analysis on every sample (slow and expensive) or use rigid rule-based heuristics (fast but inflexible).
This project explores a different approach: What if an AI agent could learn the analysis process itself? Instead of telling the system what to check, we train a reinforcement learning agent that autonomously decides which analytical actions to take, mimicking how human analysts progressively investigate suspicious files.
I developed an AI-powered dynamic malware analysis system that combines:
- CAPEv2 Sandbox: Industry-standard malware analysis sandbox for executing samples in isolated VMs
- Deep Q-Network (DQN): Reinforcement learning agent that learns optimal analysis strategies
- Progressive Analysis: The agent starts with cheap checks and deepens investigation only when needed
- Intelligent Decision-Making: Learns to balance analysis depth against computational cost
The system represents a proof-of-concept for applying reinforcement learning to cybersecurity workflows, demonstrating that AI agents can learn complex sequential decision-making processes in the malware analysis domain.
All code, models, and documentation are available in the GitHub repository: Dynamic Malware Analysis using AI Agent and Reinforcement Learning.
This article will walk you through the complete system, from CAPEv2 integration and feature engineering to the RL environment design, DQN architecture, and deployment as a command-line tool.
Project Motivation and Goals
The core idea behind this project is to apply reinforcement learning to the malware analysis workflow itself. Traditional approaches fall into two categories:
Exhaustive Analysis:
- Performs all available checks on every sample
- Comprehensive but slow (10-15 minutes per sample)
- Wastes resources on obviously benign or obviously malicious files
Rule-Based Systems:
- Uses predefined heuristics (“if injection detected, classify as malware”)
- Fast but inflexible
- Struggles with novel malware families
- Requires manual rule updates
Our RL Approach:
- Agent learns from experience which analytical actions are most informative
- Adapts strategy based on what it observes
- Can perform cheap checks first, expensive analysis only when necessary
- Learns implicitly from thousands of analysis examples
Project Goals:
- Build an end-to-end system integrating CAPEv2 sandbox with DQN agent
- Design a reinforcement learning environment that models the analysis process
- Train an agent that makes intelligent, cost-aware analysis decisions
- Deploy as a practical tool for analyzing real malware samples
- Demonstrate the viability of RL for cybersecurity workflows
System Architecture
The system consists of four integrated components working together:
Component Overview
┌─────────────────┐
│ Suspicious │
│ File Sample │
│ (.exe, .dll, │
│ .ps1, etc.) │
└────────┬────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ CAPEv2 Sandbox │
│ ┌──────────────────────────────────────┐ │
│ │ Dynamic Execution in VM │ │
│ │ - Runs sample in isolated Windows VM │ │
│ │ - Monitors process creation │ │
│ │ - Tracks API calls │ │
│ │ - Logs memory operations │ │
│ │ - Captures network traffic │ │
│ │ - Records filesystem changes │ │
│ │ - Execution time: 120-300 seconds │ │
│ └──────────────────────────────────────┘ │
└────────┬────────────────────────────────────────┘
│
Generates
│
▼
┌─────────────────────────────────────────────────┐
│ CAPE JSON Report │
│ - Process tree and API call sequences │
│ - Memory injection indicators │
│ - Dropped files and modifications │
│ - Network DNS/HTTP/TCP connections │
│ - YARA signature matches │
│ - Behavioral indicators │
│ - File size: typically 1-50 MB │
└────────┬────────────────────────────────────────┘
│
│ (Optional)
├──── MALVADA Validation ────
│ - Report quality checks
│ - Duplicate detection
│ - Data sanitization
│
▼
┌─────────────────────────────────────────────────┐
│ Feature Extraction │
│ ┌──────────────────────────────────────┐ │
│ │ Extract 35-dimensional state vector: │ │
│ │ │ │
│ │ Basic Features (6): │ │
│ │ - file_size, file_type, CAPE_code │ │
│ │ - n_processes, n_threads │ │
│ │ - proc_tree_nodes, total_api_calls │ │
│ │ │ │
│ │ Memory Features (3): │ │
│ │ - n_enhanced_events │ │
│ │ - n_injection_indicators │ │
│ │ │ │
│ │ Filesystem Features (6): │ │
│ │ - files_total, read/write/delete │ │
│ │ - n_dropped, dropped_total_size │ │
│ │ │ │
│ │ Network Features (11): │ │
│ │ - total_network_calls │ │
│ │ - dns/http/socket calls & ratios │ │
│ │ - network_signatures │ │
│ │ │ │
│ │ Memory Dump Features (7): │ │
│ │ - n_anomalies, n_encryptedbuffers │ │
│ │ - n_signatures, alerts, payloads │ │
│ │ │ │
│ │ Metadata (2): │ │
│ │ - step_id, last_action │ │
│ └──────────────────────────────────────┘ │
└────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ AI Agent (Deep Q-Network) │
│ ┌──────────────────────────────────────┐ │
│ │ Dueling DQN Architecture │ │
│ │ │ │
│ │ Input Layer: 35 features │ │
│ │ ↓ │ │
│ │ Feature Extraction: │ │
│ │ Linear(35 → 256) + ReLU │ │
│ │ Linear(256 → 256) + ReLU │ │
│ │ Dropout(0.1) │ │
│ │ ↓ │ │
│ │ Split into two streams: │ │
│ │ │ │
│ │ Value Stream: │ │
│ │ Linear(256 → 128) + ReLU │ │
│ │ Linear(128 → 1) │ │
│ │ Output: V(s) - state value │ │
│ │ │ │
│ │ Advantage Stream: │ │
│ │ Linear(256 → 128) + ReLU │ │
│ │ Linear(128 → 7) │ │
│ │ Output: A(s,a) - action advantages │ │
│ │ │ │
│ │ Combine: │ │
│ │ Q(s,a) = V(s) + [A(s,a) - mean(A)]│ │
│ │ │ │
│ │ Output: 7 Q-values (one per action) │ │
│ └──────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ Action Selection │ │
│ │ │ │
│ │ 0: CONTINUE │ │
│ │ → Re-examine current features │ │
│ │ │ │
│ │ 1: FOCUS_MEMORY │ │
│ │ → Unlock memory features │ │
│ │ → Reveals: injections, enhanced │ │
│ │ │ │
│ │ 2: FOCUS_FILESYSTEM │ │
│ │ → Unlock filesystem features │ │
│ │ → Reveals: drops, read/writes │ │
│ │ │ │
│ │ 3: FOCUS_NETWORK │ │
│ │ → Unlock network features │ │
│ │ → Reveals: DNS, HTTP, sockets │ │
│ │ │ │
│ │ 4: MEMORY_DUMP │ │
│ │ → Unlock memory dump features │ │
│ │ → Reveals: anomalies, payloads │ │
│ │ → HIGH COST action │ │
│ │ │ │
│ │ 5: TERMINATE_MALWARE │ │
│ │ → End analysis, classify MALWARE │ │
│ │ │ │
│ │ 6: TERMINATE_BENIGN │ │
│ │ → End analysis, classify BENIGN │ │
│ └──────────────────────────────────────┘ │
└────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Classification Output │
│ │
│ - Decision: MALWARE or BENIGN │
│ - Number of analysis steps taken │
│ - Sequence of actions performed │
│ - Evidence collected │
│ - JSON report saved to disk │
└─────────────────────────────────────────────────┘
1. CAPEv2 Sandbox
CAPEv2 (Configuration And Payload Extraction) is an advanced malware sandbox that executes suspicious files in isolated virtual machines and records their behavior.
Key Capabilities:
- Runs samples in Windows VMs (Windows 10)
- Monitors all system calls and API invocations
- Tracks process creation, memory operations, file modifications
- Captures network traffic (DNS queries, HTTP requests, TCP/UDP connections)
- Applies YARA rules for signature matching
- Generates comprehensive JSON reports
In This Project:
- CAPEv2 runs as a service (cuckoo.py)
- Agent submits files via the Python API
- Analysis typically takes 2-5 minutes per sample
- Reports are retrieved and fed to the AI agent
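To make the report-retrieval step concrete, here is a minimal sketch of fetching a finished CAPE report over the sandbox's REST API. The endpoint path (`/apiv2/tasks/get/report/<id>/`) is an assumption based on common CAPEv2 deployments, not the repository's exact client code; check your instance's API documentation and authentication setup before use.

```python
# Hedged sketch: fetch a completed CAPE analysis report as parsed JSON.
# The endpoint path below is an assumption drawn from typical CAPEv2
# deployments -- verify it against your instance's API docs.
import json
import urllib.request

def report_url(base_url: str, task_id: int) -> str:
    """Build the (assumed) report-retrieval endpoint for a CAPE task."""
    return f"{base_url.rstrip('/')}/apiv2/tasks/get/report/{task_id}/"

def fetch_report(base_url: str, task_id: int) -> dict:
    """Download and parse the JSON report for a completed analysis task."""
    with urllib.request.urlopen(report_url(base_url, task_id)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

The parsed dictionary returned by `fetch_report` is what the feature extractor consumes in the next stage.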
2. Feature Extraction Module
The feature extractor (ai_agent.py: CAPEFeatureExtractor class) converts CAPE’s large JSON reports into fixed-size numerical vectors that the neural network can process.
Feature Categories (35 dimensions total):
Basic Features (6):
- file_size: Size of the sample in bytes
- cape_type_code: CAPE's file type classification
- file_type_ord: Ordinal encoding (PE32=1, DLL=2, other=0)
- n_processes: Number of processes created
- n_threads_total: Total threads across all processes
- proc_tree_nodes: Size of process tree
- total_api_calls: Total API calls recorded
Memory Features (3):
- n_enhanced_events: Count of memory events
- n_injection_indicators: Count of injection-related API calls (WriteProcessMemory, VirtualAllocEx, CreateRemoteThread, etc.)
Filesystem Features (6):
- files_total, read_files, write_files, delete_files
- n_dropped: Number of files dropped by the sample
- dropped_total_size: Total size of dropped files
Network Features (11):
- total_network_calls: All network-related API calls
- dns_calls, http_calls, socket_calls: By protocol
- dns_ratio, http_ratio, socket_ratio: Proportions
- network_signatures: Count of network-related YARA matches
Memory Dump Features (7):
- n_anomalies: Behavioral anomalies detected
- n_encryptedbuffers: Encrypted memory regions
- n_signatures: Total YARA matches
- signatures_alert_count: Critical signatures
- n_payloads: Extracted payloads
- payloads_total_size: Total payload size
Metadata (2):
- step_id: Current step number in analysis
- last_action: Previous action taken
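To illustrate how a handful of the basic features might be pulled out of a parsed report, here is a small sketch. The key names (`target`, `behavior`, `processes`, `calls`) follow the usual CAPE report layout, but treat them as assumptions; the repository's `CAPEFeatureExtractor` handles the exact schema your CAPEv2 version emits.

```python
# Hedged sketch: extract a few basic features from a parsed CAPE report.
# Key names follow the typical CAPE JSON layout and are assumptions here.
def extract_basic_features(report: dict) -> dict:
    target = report.get("target", {}).get("file", {})
    processes = report.get("behavior", {}).get("processes", [])
    return {
        "file_size": target.get("size", 0),
        "n_processes": len(processes),
        "total_api_calls": sum(len(p.get("calls", [])) for p in processes),
    }
```

The real extractor produces all 35 dimensions this way, with defaults of zero for anything missing so the state vector always has a fixed shape.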
Progressive Revelation: Not all features are available immediately. The agent must take specific actions to “unlock” feature groups:
- Always available: Basic features
- Requires FOCUS_MEMORY: Memory indicators
- Requires FOCUS_FILESYSTEM: File operations
- Requires FOCUS_NETWORK: Network details
- Requires MEMORY_DUMP: Deep forensic features
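One simple way to implement this progressive revelation is to keep the full 35-dimensional vector and zero out the slices of any group the agent has not yet unlocked. The index ranges below are illustrative, the real layout is defined in `ai_agent.py`:

```python
# Hedged sketch of progressive feature revelation: locked feature groups
# are zeroed out of the state vector. The slice boundaries here are
# illustrative assumptions, not the repository's exact layout.
import numpy as np

GROUP_SLICES = {
    "memory": slice(6, 9),        # unlocked by FOCUS_MEMORY
    "filesystem": slice(9, 15),   # unlocked by FOCUS_FILESYSTEM
    "network": slice(15, 26),     # unlocked by FOCUS_NETWORK
    "memory_dump": slice(26, 33), # unlocked by MEMORY_DUMP
}

def mask_state(full_state: np.ndarray, unlocked: set) -> np.ndarray:
    """Zero out feature groups the agent has not yet paid to reveal."""
    state = full_state.copy()
    for group, sl in GROUP_SLICES.items():
        if group not in unlocked:
            state[sl] = 0.0
    return state
```

Basic features (indices 0-5) and the metadata slots (33-34) stay visible throughout; everything else costs an action to see.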
3. DQN Agent
The AI agent uses a Dueling Deep Q-Network, an advanced reinforcement learning architecture.
Model Architecture (dqn_agent.py):
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    def __init__(self, state_dim=35, action_dim=7, hidden_dim=256):
        super().__init__()
        # Shared feature extraction layers
        self.feature_layer = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
        )
        # Value stream: "How good is this state?"
        self.value_stream = nn.Sequential(
            nn.Linear(hidden_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )
        # Advantage stream: "How much better is each action?"
        self.advantage_stream = nn.Sequential(
            nn.Linear(hidden_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, state):
        features = self.feature_layer(state)
        value = self.value_stream(features)
        advantage = self.advantage_stream(features)
        # Dueling combine: Q(s,a) = V(s) + [A(s,a) - mean(A)]
        return value + advantage - advantage.mean(dim=-1, keepdim=True)
Action Space (7 actions):
from enum import IntEnum

class Action(IntEnum):
    CONTINUE = 0           # Re-examine without new analysis
    FOCUS_MEMORY = 1       # Deep memory investigation
    FOCUS_FILESYSTEM = 2   # Filesystem analysis
    FOCUS_NETWORK = 3      # Network behavior analysis
    MEMORY_DUMP = 4        # Expensive full memory dump
    TERMINATE_MALWARE = 5  # Final decision: MALWARE
    TERMINATE_BENIGN = 6   # Final decision: BENIGN
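During training, actions are chosen with an ε-greedy policy: with probability ε the agent explores a random action, otherwise it exploits the network's highest-predicted Q-value. The helper below is a sketch of that mechanic under the assumption of a network mapping a 35-dim state to 7 Q-values; it is not the repository's exact selection code.

```python
# Hedged sketch of epsilon-greedy action selection over 7 Q-values.
import random
import torch
import torch.nn as nn

def select_action(net: nn.Module, state: torch.Tensor,
                  epsilon: float, n_actions: int = 7) -> int:
    if random.random() < epsilon:
        return random.randrange(n_actions)      # explore
    with torch.no_grad():
        q_values = net(state.unsqueeze(0))      # add batch dimension
    return int(q_values.argmax(dim=1).item())   # exploit best Q-value
```

With ε annealed from 1.0 to 0.01 over training (as listed later in the training settings), early episodes are almost pure exploration and late episodes almost pure exploitation.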
Dataset and Training
The WinMET Dataset
Training data comes from WinMET (Windows Malware Execution Traces), a dataset of CAPE analysis reports.
Dataset Details:
- Source: MALVADA research project (University of Zaragoza)
- Original Size: ~10,000 CAPE reports
- Content: Real malware samples and benign software
- Availability:
Training Subset:
- Total samples: 1,307 CAPE reports
- Malware: 1,039 (79.5%)
- Benign: 268 (20.5%)
- Malware families: Banking trojans, ransomware, RATs, stealers, droppers
Reinforcement Learning Environment
The training environment models the analysis process as a Markov Decision Process (MDP).
Environment Characteristics:
- State Space: 35-dimensional continuous vector
- Action Space: 7 discrete actions
- Episode: Analyzing one sample from start to classification
- Max Steps: 20 actions per episode
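The characteristics above can be sketched as a Gym-style environment skeleton. The class and method names are illustrative, and the real environment also reveals feature groups per action and builds its reward from collected evidence; this sketch only captures the episode mechanics (reset, step, termination, step budget).

```python
# Hedged skeleton of the analysis MDP: 35-dim state, 7 actions,
# episodes capped at 20 steps. Names and logic are illustrative.
import numpy as np

class AnalysisEnv:
    MAX_STEPS = 20
    TERMINATE_MALWARE, TERMINATE_BENIGN = 5, 6

    def __init__(self, full_state: np.ndarray, label: int):
        self.full_state = full_state  # 35-dim vector from a CAPE report
        self.label = label            # 1 = malware, 0 = benign
        self.steps = 0

    def reset(self) -> np.ndarray:
        self.steps = 0
        return self._observe()

    def _observe(self) -> np.ndarray:
        # The real env masks locked feature groups; this sketch does not.
        return self.full_state.copy()

    def step(self, action: int):
        self.steps += 1
        if action in (self.TERMINATE_MALWARE, self.TERMINATE_BENIGN):
            guess = 1 if action == self.TERMINATE_MALWARE else 0
            reward = 15.0 if guess == self.label else -25.0
            return self._observe(), reward, True, {}
        # Non-terminal actions pay a small cost; MEMORY_DUMP (4) costs most.
        cost = -1.0 if action == 4 else (-0.1 if action == 0 else -0.5)
        done = self.steps >= self.MAX_STEPS
        return self._observe(), cost, done, {}
```

One episode corresponds to analyzing one sample: the agent keeps probing until it issues a terminate action or exhausts the 20-step budget.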
Reward Structure:
1
2
3
4
5
6
7
8
9
10
11
# Terminal rewards
Correct classification: +15.0
False negative (missing malware): -25.0
False positive: -25.0
# Step costs
CONTINUE: -0.1
FOCUS_MEMORY: -0.5
FOCUS_FILESYSTEM: -0.5
FOCUS_NETWORK: -0.5
MEMORY_DUMP: -1.0 (expensive)
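The reward table above can be written as a small helper. It assumes the action indices from the `Action` enum (5 = TERMINATE_MALWARE, 6 = TERMINATE_BENIGN); the repository's environment may compute this differently internally.

```python
# Hedged sketch of the reward table: terminal accuracy rewards plus
# per-step costs keyed by action index (0-4 are non-terminal actions).
STEP_COSTS = {0: -0.1, 1: -0.5, 2: -0.5, 3: -0.5, 4: -1.0}

def reward(action: int, true_label: int) -> float:
    """true_label: 1 for malware, 0 for benign."""
    if action in (5, 6):
        guess = 1 if action == 5 else 0
        return 15.0 if guess == true_label else -25.0
    return STEP_COSTS[action]
```

The asymmetry matters: a correct call earns +15.0 but a wrong one costs -25.0, so the agent learns that gathering another -0.5 clue is cheap insurance against a -25.0 misclassification.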
Training Algorithm:
- Deep Q-Learning with experience replay
- Dueling DQN architecture
- ε-greedy exploration (1.0 → 0.01)
- Batch size: 128
- Learning rate: 0.001
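The core of the training loop is the experience-replay update: sample a random minibatch of stored transitions and regress Q(s,a) toward r + γ·max Q_target(s′, a′). The sketch below shows that one update step under assumed names (`policy_net`, `target_net`); it is not the repository's exact training code.

```python
# Hedged sketch of one DQN update with experience replay and a frozen
# target network. Transitions are (state, action, reward, next_state, done).
import random
from collections import deque
import torch
import torch.nn as nn

def train_step(policy_net, target_net, optimizer, buffer: deque,
               batch_size=128, gamma=0.99):
    if len(buffer) < batch_size:
        return None  # wait until the replay buffer has enough experience
    batch = random.sample(buffer, batch_size)
    states, actions, rewards, next_states, dones = map(
        lambda x: torch.as_tensor(x, dtype=torch.float32), zip(*batch))
    # Q-values the policy network assigned to the actions actually taken
    q = policy_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        target = rewards + gamma * q_next * (1.0 - dones)
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.item())
```

Sampling randomly from the buffer breaks the correlation between consecutive steps of an episode, and the separate target network keeps the regression target stable between periodic syncs.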
System Deployment and Usage
The trained model is packaged as a command-line tool (run_analysis.py).
Installation
# Clone repository
git clone https://github.com/ammarlouah/dynamic-malware-analysis-using-ai-agent-and-rl.git
cd dynamic-malware-analysis-using-ai-agent-and-rl
# Run automated setup
chmod +x setup.sh
./setup.sh
Usage Examples
Basic Analysis:
cd AI_Agent
source venv/bin/activate
python run_analysis.py /path/to/suspicious.exe
Expected Output:
============================================================
AI AGENT ANALYSIS
============================================================
Analysis Result:
Decision: MALWARE
Steps: 4
Evidence Found:
- injection_indicators: 12
- dropped_files: 3
- alert_signatures: 8
Actions Taken:
Step 1: CONTINUE
Step 2: FOCUS_MEMORY
Step 3: FOCUS_NETWORK
Step 4: TERMINATE_MALWARE
============================================================
Batch Processing:
python run_analysis.py /path/to/folder --batch
Using Existing CAPE Report:
python run_analysis.py --report /path/to/report.json
Demonstration Video
Watch the complete system demonstration:
Research Paper
For detailed methodology and academic context, see the full research paper:
Key Takeaways
This project demonstrates:
1. RL for Sequential Decision-Making:
- Malware analysis is naturally sequential
- Reinforcement learning learns optimal strategies from experience
- Agent learns implicitly without explicit programming
2. Cost-Aware Analysis:
- Different analytical techniques have different costs
- Agent learns resource-aware decision-making
- Balances thoroughness with efficiency
3. Progressive Information Revelation:
- Models partial observability realistically
- Forces agent to reason about information gaps
- Mimics human analyst workflow
4. Integration with Security Tools:
- Practical integration of ML with CAPEv2
- End-to-end pipeline from data to deployed tool
- Reusable for similar projects
Limitations and Future Work
Current Limitations
1. Training Data Dependency:
- Performance depends on dataset quality
- Novel malware families may be challenging
- Requires periodic retraining
2. Sandbox Limitations:
- Inherits all limitations of dynamic analysis
- Malware can detect VMs and remain dormant
- Evasion techniques remain challenging
3. Binary Classification:
- Only distinguishes MALWARE vs. BENIGN
- Doesn’t identify specific malware families
Future Improvements
Short-Term:
- Multi-class classification (identify families)
- Ensemble methods with traditional ML
- Web dashboard for easier usage
Medium-Term:
- Adversarial robustness training
- Transfer learning to other platforms
- Active learning from analyst feedback
- Attention mechanisms for interpretability
Long-Term:
- Multi-agent collaborative systems
- Real-time stream analysis
- Automated remediation
- Zero-day focused anomaly detection
Conclusion
This project demonstrates that reinforcement learning can be successfully applied to malware analysis workflows. By training a DQN agent to make sequential analytical decisions, we’ve created a system that:
- Learns intelligent strategies from experience
- Balances thoroughness and efficiency through cost-awareness
- Adapts its approach based on observations
- Mimics human analyst reasoning through progressive investigation
While this is a proof-of-concept with limitations, it points toward a future where AI agents augment human security analysts, handling routine cases efficiently while flagging ambiguous samples for expert review.
The complete code, trained models, and documentation are fully open-source. Whether you’re interested in reinforcement learning, cybersecurity automation, or AI applications in security, this project provides a working example.
Resources and Links
Project Repository:
Datasets:
- WinMET Original: Zenodo - DOI:10.5281/zenodo.12647555
- Training Subset: Kaggle - WinMET Dataset
Documentation:
- CAPEv2: https://capev2.readthedocs.io/
- MALVADA: https://github.com/reverseame/MALVADA
- PyTorch: https://pytorch.org/docs/
Key Papers:
- DQN: Playing Atari with Deep Reinforcement Learning
- Dueling DQN: Dueling Network Architectures
- MALVADA: SoftwareX, Volume 30, 2025
Citation:
@misc{louah2026dynamic_malware_rl,
author = {Louah, Ammar},
title = {Dynamic Malware Analysis using AI Agent and Reinforcement Learning},
year = {2026},
publisher = {GitHub},
url = {https://github.com/ammarlouah/dynamic-malware-analysis-using-ai-agent-and-rl}
}
Connect:
- GitHub: @ammarlouah
- LinkedIn: Ammar Louah
Thank you for reading! If you found this project interesting or useful, please consider giving it a ⭐ on GitHub. Questions, issues, and contributions are welcome!
This project is licensed under GNU General Public License v3.0 (GPLv3).