Using a single LLM tool for malware analysis leads to unreliable results

New research from SentinelOne’s SentinelLABS looks at why many AI-powered malware analysis workflows produce unreliable results. Single-tool LLM analysis often misidentifies malware capabilities, because decompiler artifacts, parsing quirks and dead code all create noise that distorts outputs. Left unchallenged, LLMs will faithfully amplify these errors.

The report notes: “These failures are not hallucinations in the usual sense. The model is doing what it was asked to do, reasoning over the data it sees. The problem is that the data is noisy. Each reverse engineering tool brings its own parsing quirks.”

To test the limits of current AI models, SentinelLABS researchers built a multi-agent LLM system to analyse macOS malware. The system treats multiple reverse-engineering tools as independent analysts that must verify or reject each other’s findings and reach a consensus.

The findings show that accuracy improves when tools challenge each other. The system treats tools like radare2, Ghidra, Binary Ninja and IDA Pro as sceptical analysts that verify or reject prior findings. This serial consensus is crucial to reliable AI output.
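
To make the idea of serial consensus concrete, here is a minimal Python sketch. It is not SentinelLABS' implementation: the `Finding` class, the `serial_consensus` function, the `min_confirmations` threshold and the sample capability claims are all hypothetical, and each tool's output is reduced to a plain set of claims standing in for what an LLM-driven analyst might report.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    claim: str
    confirmed_by: list = field(default_factory=list)
    rejected_by: list = field(default_factory=list)


def serial_consensus(tool_observations, min_confirmations=2):
    """Run tools in order; each one reviews every finding raised so far.

    tool_observations: ordered list of (tool_name, set_of_claims) pairs.
    The claim sets are hypothetical stand-ins for real tool output.
    """
    findings = {}
    for tool, observed in tool_observations:
        # Review prior findings: confirm what this tool also sees, reject the rest.
        for finding in findings.values():
            if finding.claim in observed:
                finding.confirmed_by.append(tool)
            else:
                finding.rejected_by.append(tool)
        # Raise any claims this tool sees that no earlier tool raised.
        for claim in sorted(observed):
            if claim not in findings:
                findings[claim] = Finding(claim, confirmed_by=[tool])
    # Assumed rule: only findings with enough confirmations reach synthesis.
    return [f for f in findings.values() if len(f.confirmed_by) >= min_confirmations]


# Illustrative data only, not results from the research.
observations = [
    ("radare2", {"keychain access", "screen capture"}),
    ("Ghidra", {"keychain access"}),
    ("Binary Ninja", {"keychain access", "launch agent persistence"}),
    ("IDA Pro", {"keychain access", "launch agent persistence"}),
]
accepted = serial_consensus(observations)
print([f.claim for f in accepted])
# prints ['keychain access', 'launch agent persistence']
```

In this toy run, "screen capture" is reported by only one tool and rejected by the other three, so it never reaches the final report. That is the kind of single-tool artifact the research describes being filtered out before synthesis.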

The research also shows that architecture matters more than the model. Reliability improvements come from the pipeline design, not fine-tuning or larger LLMs. The quality of the data underpinning a model’s reasoning is more important than the model’s reasoning capability.

Phil Stokes, research engineer at SentinelOne, concludes, “The primary challenge with LLM-driven malware analysis is not so much a given model’s reasoning capability but the quality of the data the model reasons over. Decompiler artifacts, string parsing quirks, and dead code all create noise that an LLM will faithfully amplify into a report unless the system is specifically designed to catch and reject those artifacts before they reach the synthesis stage.”

You can read more on the SentinelOne blog.

Image credit: solarseven/depositphotos.com