Dynamic Malware Analysis Using AI Agent and Reinforcement Learning
Introduction
Malware analysis is a time-consuming and resource-intensive task that requires expert knowledge and careful investigation. Security analysts must decide which analytical techniques to apply, from basic file inspection to deep memory dumps, while balancing thoroughness against time constraints. Traditional automated systems either perform exhaustive analysis on every sample (slow and expensive) or use rigid rule-based heuristics (fast but inflexible).
This project explores a different approach: What if an AI agent could learn the analysis process itself? Instead of telling the system what to check, we train a reinforcement learning agent that autonomously decides which analytical actions to take, mimicking how human analysts progressively investigate suspicious files.
I developed an AI-powered dynamic malware analysis system that combines:
- CAPEv2 Sandbox: Industry-standard malware analysis sandbox for executing samples in isolated VMs
- Deep Q-Network (DQN): Reinforcement learning agent that learns optimal analysis strategies
- Progressive Analysis: The agent starts with cheap checks and deepens investigation only when needed
- Intelligent Decision-Making: Learns to balance analysis depth against computational cost
The system represents a proof-of-concept for applying reinforcement learning to cybersecurity workflows, demonstrating that AI agents can learn complex sequential decision-making processes in the malware analysis domain.
All code, models, and documentation are available in the GitHub repository: Dynamic Malware Analysis using AI Agent and Reinforcement Learning.
This article will walk you through the complete system, from CAPEv2 integration and feature engineering to the RL environment design, DQN architecture, and deployment as a command-line tool.
Project Motivation and Goals
The core idea behind this project is to apply reinforcement learning to the malware analysis workflow itself. Traditional approaches fall into two categories:
Exhaustive Analysis:
- Performs all available checks on every sample
- Comprehensive but slow (10-15 minutes per sample)
- Wastes resources on obviously benign or obviously malicious files
Rule-Based Systems:
- Uses predefined heuristics (“if injection detected, classify as malware”)
- Fast but inflexible
- Struggles with novel malware families
- Requires manual rule updates
Our RL Approach:
- Agent learns from experience which analytical actions are most informative
- Adapts strategy based on what it observes
- Can perform cheap checks first, expensive analysis only when necessary
- Learns implicitly from thousands of analysis examples
Project Goals:
- Build an end-to-end system integrating CAPEv2 sandbox with DQN agent
- Design a reinforcement learning environment that models the analysis process
- Train an agent that makes intelligent, cost-aware analysis decisions
- Deploy as a practical tool for analyzing real malware samples
- Demonstrate the viability of RL for cybersecurity workflows
System Architecture
The system consists of four integrated components working together:
Component Overview
┌─────────────────┐
│ Suspicious │
│ File Sample │
│ (.exe, .dll, │
│ .ps1, etc.) │
└────────┬────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ CAPEv2 Sandbox │
│ ┌──────────────────────────────────────┐ │
│ │ Dynamic Execution in VM │ │
│ │ - Runs sample in isolated Windows VM │ │
│ │ - Monitors process creation │ │
│ │ - Tracks API calls │ │
│ │ - Logs memory operations │ │
│ │ - Captures network traffic │ │
│ │ - Records filesystem changes │ │
│ │ - Execution time: 120-300 seconds │ │
│ └──────────────────────────────────────┘ │
└────────┬────────────────────────────────────────┘
│
Generates
│
▼
┌─────────────────────────────────────────────────┐
│ CAPE JSON Report │
│ - Process tree and API call sequences │
│ - Memory injection indicators │
│ - Dropped files and modifications │
│ - Network DNS/HTTP/TCP connections │
│ - YARA signature matches │
│ - Behavioral indicators │
│ - File size: typically 1-50 MB │
└────────┬────────────────────────────────────────┘
│
│ (Optional)
├──── MALVADA Validation ────
│ - Report quality checks
│ - Duplicate detection
│ - Data sanitization
│
▼
┌─────────────────────────────────────────────────┐
│ Feature Extraction │
│ ┌──────────────────────────────────────┐ │
│ │ Extract 35-dimensional state vector: │ │
│ │ │ │
│ │ Basic Features (6): │ │
│ │ - file_size, file_type, CAPE_code │ │
│ │ - n_processes, n_threads │ │
│ │ - proc_tree_nodes, total_api_calls │ │
│ │ │ │
│ │ Memory Features (3): │ │
│ │ - n_enhanced_events │ │
│ │ - n_injection_indicators │ │
│ │ │ │
│ │ Filesystem Features (6): │ │
│ │ - files_total, read/write/delete │ │
│ │ - n_dropped, dropped_total_size │ │
│ │ │ │
│ │ Network Features (11): │ │
│ │ - total_network_calls │ │
│ │ - dns/http/socket calls & ratios │ │
│ │ - network_signatures │ │
│ │ │ │
│ │ Memory Dump Features (7): │ │
│ │ - n_anomalies, n_encryptedbuffers │ │
│ │ - n_signatures, alerts, payloads │ │
│ │ │ │
│ │ Metadata (2): │ │
│ │ - step_id, last_action │ │
│ └──────────────────────────────────────┘ │
└────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ AI Agent (Deep Q-Network) │
│ ┌──────────────────────────────────────┐ │
│ │ Dueling DQN Architecture │ │
│ │ │ │
│ │ Input Layer: 35 features │ │
│ │ ↓ │ │
│ │ Feature Extraction: │ │
│ │ Linear(35 → 256) + ReLU │ │
│ │ Linear(256 → 256) + ReLU │ │
│ │ Dropout(0.1) │ │
│ │ ↓ │ │
│ │ Split into two streams: │ │
│ │ │ │
│ │ Value Stream: │ │
│ │ Linear(256 → 128) + ReLU │ │
│ │ Linear(128 → 1) │ │
│ │ Output: V(s) - state value │ │
│ │ │ │
│ │ Advantage Stream: │ │
│ │ Linear(256 → 128) + ReLU │ │
│ │ Linear(128 → 7) │ │
│ │ Output: A(s,a) - action advantages │ │
│ │ │ │
│ │ Combine: │ │
│ │ Q(s,a) = V(s) + [A(s,a) - mean(A)]│ │
│ │ │ │
│ │ Output: 7 Q-values (one per action) │ │
│ └──────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ Action Selection │ │
│ │ │ │
│ │ 0: CONTINUE │ │
│ │ → Re-examine current features │ │
│ │ │ │
│ │ 1: FOCUS_MEMORY │ │
│ │ → Unlock memory features │ │
│ │ → Reveals: injections, enhanced │ │
│ │ │ │
│ │ 2: FOCUS_FILESYSTEM │ │
│ │ → Unlock filesystem features │ │
│ │ → Reveals: drops, read/writes │ │
│ │ │ │
│ │ 3: FOCUS_NETWORK │ │
│ │ → Unlock network features │ │
│ │ → Reveals: DNS, HTTP, sockets │ │
│ │ │ │
│ │ 4: MEMORY_DUMP │ │
│ │ → Unlock memory dump features │ │
│ │ → Reveals: anomalies, payloads │ │
│ │ → HIGH COST action │ │
│ │ │ │
│ │ 5: TERMINATE_MALWARE │ │
│ │ → End analysis, classify MALWARE │ │
│ │ │ │
│ │ 6: TERMINATE_BENIGN │ │
│ │ → End analysis, classify BENIGN │ │
│ └──────────────────────────────────────┘ │
└────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Classification Output │
│ │
│ - Decision: MALWARE or BENIGN │
│ - Number of analysis steps taken │
│ - Sequence of actions performed │
│ - Evidence collected │
│ - JSON report saved to disk │
└─────────────────────────────────────────────────┘
1. CAPEv2 Sandbox
CAPEv2 (Configuration And Payload Extraction) is an advanced malware sandbox that executes suspicious files in isolated virtual machines and records their behavior.
Key Capabilities:
- Runs samples in Windows VMs (Windows 10)
- Monitors all system calls and API invocations
- Tracks process creation, memory operations, file modifications
- Captures network traffic (DNS queries, HTTP requests, TCP/UDP connections)
- Applies YARA rules for signature matching
- Generates comprehensive JSON reports
In This Project:
- CAPEv2 runs as a service (cuckoo.py)
- Agent submits files via the Python API
- Analysis typically takes 2-5 minutes per sample
- Reports are retrieved and fed to the AI agent
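To make the report-retrieval step concrete, here is a minimal sketch of fetching a finished CAPE report over the sandbox's REST API. The endpoint path (`/apiv2/tasks/get/report/<id>/`) is an assumption based on common CAPEv2 deployments, not the repository's exact client code; check your instance's API documentation and authentication setup before use.

```python
# Hedged sketch: fetch a completed CAPE analysis report as parsed JSON.
# The endpoint path below is an assumption drawn from typical CAPEv2
# deployments -- verify it against your instance's API docs.
import json
import urllib.request

def report_url(base_url: str, task_id: int) -> str:
    """Build the (assumed) report-retrieval endpoint for a CAPE task."""
    return f"{base_url.rstrip('/')}/apiv2/tasks/get/report/{task_id}/"

def fetch_report(base_url: str, task_id: int) -> dict:
    """Download and parse the JSON report for a completed analysis task."""
    with urllib.request.urlopen(report_url(base_url, task_id)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

The parsed dictionary returned by `fetch_report` is what the feature extractor consumes in the next stage.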
2. Feature Extraction Module
The feature extractor (ai_agent.py: CAPEFeatureExtractor class) converts CAPE’s large JSON reports into fixed-size numerical vectors that the neural network can process.
Feature Categories (35 dimensions total):
Basic Features (6):
- file_size: Size of the sample in bytes
- cape_type_code: CAPE's file type classification
- file_type_ord: Ordinal encoding (PE32=1, DLL=2, other=0)
- n_processes: Number of processes created
- n_threads_total: Total threads across all processes
- proc_tree_nodes: Size of process tree
- total_api_calls: Total API calls recorded
Memory Features (3):
- n_enhanced_events: Count of memory events
- n_injection_indicators: Count of injection-related API calls (WriteProcessMemory, VirtualAllocEx, CreateRemoteThread, etc.)
Filesystem Features (6):
- files_total, read_files, write_files, delete_files
- n_dropped: Number of files dropped by the sample
- dropped_total_size: Total size of dropped files
Network Features (11):
- total_network_calls: All network-related API calls
- dns_calls, http_calls, socket_calls: By protocol
- dns_ratio, http_ratio, socket_ratio: Proportions
- network_signatures: Count of network-related YARA matches
Memory Dump Features (7):
- n_anomalies: Behavioral anomalies detected
- n_encryptedbuffers: Encrypted memory regions
- n_signatures: Total YARA matches
- signatures_alert_count: Critical signatures
- n_payloads: Extracted payloads
- payloads_total_size: Total payload size
Metadata (2):
- step_id: Current step number in analysis
- last_action: Previous action taken
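To illustrate how a handful of the basic features might be pulled out of a parsed report, here is a small sketch. The key names (`target`, `behavior`, `processes`, `calls`) follow the usual CAPE report layout, but treat them as assumptions; the repository's `CAPEFeatureExtractor` handles the exact schema your CAPEv2 version emits.

```python
# Hedged sketch: extract a few basic features from a parsed CAPE report.
# Key names follow the typical CAPE JSON layout and are assumptions here.
def extract_basic_features(report: dict) -> dict:
    target = report.get("target", {}).get("file", {})
    processes = report.get("behavior", {}).get("processes", [])
    return {
        "file_size": target.get("size", 0),
        "n_processes": len(processes),
        "total_api_calls": sum(len(p.get("calls", [])) for p in processes),
    }
```

The real extractor produces all 35 dimensions this way, with defaults of zero for anything missing so the state vector always has a fixed shape.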
Progressive Revelation: Not all features are available immediately. The agent must take specific actions to “unlock” feature groups:
- Always available: Basic features
- Requires FOCUS_MEMORY: Memory indicators
- Requires FOCUS_FILESYSTEM: File operations
- Requires FOCUS_NETWORK: Network details
- Requires MEMORY_DUMP: Deep forensic features
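One simple way to implement this progressive revelation is to keep the full 35-dimensional vector and zero out the slices of any group the agent has not yet unlocked. The index ranges below are illustrative, the real layout is defined in `ai_agent.py`:

```python
# Hedged sketch of progressive feature revelation: locked feature groups
# are zeroed out of the state vector. The slice boundaries here are
# illustrative assumptions, not the repository's exact layout.
import numpy as np

GROUP_SLICES = {
    "memory": slice(6, 9),        # unlocked by FOCUS_MEMORY
    "filesystem": slice(9, 15),   # unlocked by FOCUS_FILESYSTEM
    "network": slice(15, 26),     # unlocked by FOCUS_NETWORK
    "memory_dump": slice(26, 33), # unlocked by MEMORY_DUMP
}

def mask_state(full_state: np.ndarray, unlocked: set) -> np.ndarray:
    """Zero out feature groups the agent has not yet paid to reveal."""
    state = full_state.copy()
    for group, sl in GROUP_SLICES.items():
        if group not in unlocked:
            state[sl] = 0.0
    return state
```

Basic features (indices 0-5) and the metadata slots (33-34) stay visible throughout; everything else costs an action to see.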
3. DQN Agent
The AI agent uses a Dueling Deep Q-Network, an advanced reinforcement learning architecture.
Model Architecture (dqn_agent.py):
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    def __init__(self, state_dim=35, action_dim=7, hidden_dim=256):
        super().__init__()
        # Shared feature extraction layers
        self.feature_layer = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
        )
        # Value stream: "How good is this state?"
        self.value_stream = nn.Sequential(
            nn.Linear(hidden_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )
        # Advantage stream: "How much better is each action?"
        self.advantage_stream = nn.Sequential(
            nn.Linear(hidden_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, state):
        features = self.feature_layer(state)
        value = self.value_stream(features)
        advantage = self.advantage_stream(features)
        # Dueling combine: Q(s,a) = V(s) + [A(s,a) - mean(A)]
        return value + advantage - advantage.mean(dim=-1, keepdim=True)
Action Space (7 actions):
from enum import IntEnum

class Action(IntEnum):
    CONTINUE = 0           # Re-examine without new analysis
    FOCUS_MEMORY = 1       # Deep memory investigation
    FOCUS_FILESYSTEM = 2   # Filesystem analysis
    FOCUS_NETWORK = 3      # Network behavior analysis
    MEMORY_DUMP = 4        # Expensive full memory dump
    TERMINATE_MALWARE = 5  # Final decision: MALWARE
    TERMINATE_BENIGN = 6   # Final decision: BENIGN
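During training, actions are chosen with an ε-greedy policy: with probability ε the agent explores a random action, otherwise it exploits the network's highest-predicted Q-value. The helper below is a sketch of that mechanic under the assumption of a network mapping a 35-dim state to 7 Q-values; it is not the repository's exact selection code.

```python
# Hedged sketch of epsilon-greedy action selection over 7 Q-values.
import random
import torch
import torch.nn as nn

def select_action(net: nn.Module, state: torch.Tensor,
                  epsilon: float, n_actions: int = 7) -> int:
    if random.random() < epsilon:
        return random.randrange(n_actions)      # explore
    with torch.no_grad():
        q_values = net(state.unsqueeze(0))      # add batch dimension
    return int(q_values.argmax(dim=1).item())   # exploit best Q-value
```

With ε annealed from 1.0 to 0.01 over training (as listed later in the training settings), early episodes are almost pure exploration and late episodes almost pure exploitation.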
Dataset and Training
The WinMET Dataset
Training data comes from WinMET (Windows Malware Execution Traces), a dataset of CAPE analysis reports.
Dataset Details:
- Source: MALVADA research project (University of Zaragoza)
- Original Size: ~10,000 CAPE reports
- Content: Real malware samples and benign software
- Availability:
Training Subset:
- Total samples: 1,307 CAPE reports
- Malware: 1,039 (79.5%)
- Benign: 268 (20.5%)
- Malware families: Banking trojans, ransomware, RATs, stealers, droppers
Reinforcement Learning Environment
The training environment models the analysis process as a Markov Decision Process (MDP).
Environment Characteristics:
- State Space: 35-dimensional continuous vector
- Action Space: 7 discrete actions
- Episode: Analyzing one sample from start to classification
- Max Steps: 20 actions per episode
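The characteristics above can be sketched as a Gym-style environment skeleton. The class and method names are illustrative, and the real environment also reveals feature groups per action and builds its reward from collected evidence; this sketch only captures the episode mechanics (reset, step, termination, step budget).

```python
# Hedged skeleton of the analysis MDP: 35-dim state, 7 actions,
# episodes capped at 20 steps. Names and logic are illustrative.
import numpy as np

class AnalysisEnv:
    MAX_STEPS = 20
    TERMINATE_MALWARE, TERMINATE_BENIGN = 5, 6

    def __init__(self, full_state: np.ndarray, label: int):
        self.full_state = full_state  # 35-dim vector from a CAPE report
        self.label = label            # 1 = malware, 0 = benign
        self.steps = 0

    def reset(self) -> np.ndarray:
        self.steps = 0
        return self._observe()

    def _observe(self) -> np.ndarray:
        # The real env masks locked feature groups; this sketch does not.
        return self.full_state.copy()

    def step(self, action: int):
        self.steps += 1
        if action in (self.TERMINATE_MALWARE, self.TERMINATE_BENIGN):
            guess = 1 if action == self.TERMINATE_MALWARE else 0
            reward = 15.0 if guess == self.label else -25.0
            return self._observe(), reward, True, {}
        # Non-terminal actions pay a small cost; MEMORY_DUMP (4) costs most.
        cost = -1.0 if action == 4 else (-0.1 if action == 0 else -0.5)
        done = self.steps >= self.MAX_STEPS
        return self._observe(), cost, done, {}
```

One episode corresponds to analyzing one sample: the agent keeps probing until it issues a terminate action or exhausts the 20-step budget.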
Reward Structure:
1
2
3
4
5
6
7
8
9
10
11
# Terminal rewards
Correct classification: +15.0
False negative (missing malware): -25.0
False positive: -25.0
# Step costs
CONTINUE: -0.1
FOCUS_MEMORY: -0.5
FOCUS_FILESYSTEM: -0.5
FOCUS_NETWORK: -0.5
MEMORY_DUMP: -1.0 (expensive)
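The reward table above can be written as a small helper. It assumes the action indices from the `Action` enum (5 = TERMINATE_MALWARE, 6 = TERMINATE_BENIGN); the repository's environment may compute this differently internally.

```python
# Hedged sketch of the reward table: terminal accuracy rewards plus
# per-step costs keyed by action index (0-4 are non-terminal actions).
STEP_COSTS = {0: -0.1, 1: -0.5, 2: -0.5, 3: -0.5, 4: -1.0}

def reward(action: int, true_label: int) -> float:
    """true_label: 1 for malware, 0 for benign."""
    if action in (5, 6):
        guess = 1 if action == 5 else 0
        return 15.0 if guess == true_label else -25.0
    return STEP_COSTS[action]
```

The asymmetry matters: a correct call earns +15.0 but a wrong one costs -25.0, so the agent learns that gathering another -0.5 clue is cheap insurance against a -25.0 misclassification.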
Training Algorithm:
- Deep Q-Learning with experience replay
- Dueling DQN architecture
- ε-greedy exploration (1.0 → 0.01)
- Batch size: 128
- Learning rate: 0.001
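The core of the training loop is the experience-replay update: sample a random minibatch of stored transitions and regress Q(s,a) toward r + γ·max Q_target(s′, a′). The sketch below shows that one update step under assumed names (`policy_net`, `target_net`); it is not the repository's exact training code.

```python
# Hedged sketch of one DQN update with experience replay and a frozen
# target network. Transitions are (state, action, reward, next_state, done).
import random
from collections import deque
import torch
import torch.nn as nn

def train_step(policy_net, target_net, optimizer, buffer: deque,
               batch_size=128, gamma=0.99):
    if len(buffer) < batch_size:
        return None  # wait until the replay buffer has enough experience
    batch = random.sample(buffer, batch_size)
    states, actions, rewards, next_states, dones = map(
        lambda x: torch.as_tensor(x, dtype=torch.float32), zip(*batch))
    # Q-values the policy network assigned to the actions actually taken
    q = policy_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        target = rewards + gamma * q_next * (1.0 - dones)
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.item())
```

Sampling randomly from the buffer breaks the correlation between consecutive steps of an episode, and the separate target network keeps the regression target stable between periodic syncs.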
System Deployment and Usage
The trained model is packaged as a command-line tool (run_analysis.py).
Installation
# Clone repository
git clone https://github.com/ammarlouah/dynamic-malware-analysis-using-ai-agent-and-rl.git
cd dynamic-malware-analysis-using-ai-agent-and-rl
# Run automated setup
chmod +x setup.sh
./setup.sh
Usage Examples
Basic Analysis:
cd AI_Agent
source venv/bin/activate
python run_analysis.py /path/to/suspicious.exe
Expected Output:
============================================================
AI AGENT ANALYSIS
============================================================
Analysis Result:
Decision: MALWARE
Steps: 4
Evidence Found:
- injection_indicators: 12
- dropped_files: 3
- alert_signatures: 8
Actions Taken:
Step 1: CONTINUE
Step 2: FOCUS_MEMORY
Step 3: FOCUS_NETWORK
Step 4: TERMINATE_MALWARE
============================================================
Batch Processing:
python run_analysis.py /path/to/folder --batch
Using Existing CAPE Report:
python run_analysis.py --report /path/to/report.json
Demonstration Video
Watch the complete system demonstration:
Research Paper
For detailed methodology and academic context, see the full research paper:
Key Takeaways
This project demonstrates:
1. RL for Sequential Decision-Making:
- Malware analysis is naturally sequential
- Reinforcement learning learns optimal strategies from experience
- Agent learns implicitly without explicit programming
2. Cost-Aware Analysis:
- Different analytical techniques have different costs
- Agent learns resource-aware decision-making
- Balances thoroughness with efficiency
3. Progressive Information Revelation:
- Models partial observability realistically
- Forces agent to reason about information gaps
- Mimics human analyst workflow
4. Integration with Security Tools:
- Practical integration of ML with CAPEv2
- End-to-end pipeline from data to deployed tool
- Reusable for similar projects
Limitations and Future Work
Current Limitations
1. Training Data Dependency:
- Performance depends on dataset quality
- Novel malware families may be challenging
- Requires periodic retraining
2. Sandbox Limitations:
- Inherits all limitations of dynamic analysis
- Malware can detect VMs and remain dormant
- Evasion techniques remain challenging
3. Binary Classification:
- Only distinguishes MALWARE vs. BENIGN
- Doesn’t identify specific malware families
Future Improvements
Short-Term:
- Multi-class classification (identify families)
- Ensemble methods with traditional ML
- Web dashboard for easier usage
Medium-Term:
- Adversarial robustness training
- Transfer learning to other platforms
- Active learning from analyst feedback
- Attention mechanisms for interpretability
Long-Term:
- Multi-agent collaborative systems
- Real-time stream analysis
- Automated remediation
- Zero-day focused anomaly detection
Conclusion
This project demonstrates that reinforcement learning can be successfully applied to malware analysis workflows. By training a DQN agent to make sequential analytical decisions, we’ve created a system that:
- Learns intelligent strategies from experience
- Balances thoroughness and efficiency through cost-awareness
- Adapts its approach based on observations
- Mimics human analyst reasoning through progressive investigation
While this is a proof-of-concept with limitations, it points toward a future where AI agents augment human security analysts, handling routine cases efficiently while flagging ambiguous samples for expert review.
The complete code, trained models, and documentation are fully open-source. Whether you’re interested in reinforcement learning, cybersecurity automation, or AI applications in security, this project provides a working example.
Resources and Links
Project Repository:
Datasets:
- WinMET Original: Zenodo - DOI:10.5281/zenodo.12647555
- Training Subset: Kaggle - WinMET Dataset
Documentation:
- CAPEv2: https://capev2.readthedocs.io/
- MALVADA: https://github.com/reverseame/MALVADA
- PyTorch: https://pytorch.org/docs/
Key Papers:
- DQN: Playing Atari with Deep Reinforcement Learning
- Dueling DQN: Dueling Network Architectures
- MALVADA: SoftwareX, Volume 30, 2025
Citation:
@misc{louah2026dynamic_malware_rl,
author = {Louah, Ammar},
title = {Dynamic Malware Analysis using AI Agent and Reinforcement Learning},
year = {2026},
publisher = {GitHub},
url = {https://github.com/ammarlouah/dynamic-malware-analysis-using-ai-agent-and-rl}
}
Connect:
- GitHub: @ammarlouah
- LinkedIn: Ammar Louah
Thank you for reading! If you found this project interesting or useful, please consider giving it a ⭐ on GitHub. Questions, issues, and contributions are welcome!
This project is licensed under GNU General Public License v3.0 (GPLv3).