LLM hesitancy leaves open source risks in place

LLMs are making fewer mistakes when analyzing open source code, but they're becoming more hesitant. That hesitation quietly preserves risk, creating a false sense of safety in which 'playing it safe' leaves exploitable (and often Critical or High severity) software in place.

Research from Sonatype looks at how AI agents handle open source software, based on a study of 37,000 dependency upgrade recommendations. It uncovers a structural trade-off: when models reduce hallucinations without real data, they default to inaction. Hallucination and 'do nothing' are the two failure modes of ungrounded AI.

The research shows that hallucination rates across Anthropic, Google, and OpenAI models have improved significantly as new models were released. Rates are down from roughly one in four hallucinated dependency upgrade recommendations to around 10-13 percent, with Claude Opus 4.6 producing the fewest hallucinations at six percent. However, this reduction is because models have become cautious rather than informed.

Sonatype observed a near twofold increase in the rate of 'no change' (same-version) recommendations, from around 16 percent in early 2025 to 26-31 percent by early 2026. Without access to quality data, 'do nothing' is the only safe alternative to 'make something up,' but neither failure mode is acceptable in a production dependency pipeline.
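To make the two failure modes concrete, here is a minimal sketch of how a pipeline might classify an AI upgrade recommendation against the versions actually published in a registry. The package name, version numbers, and the `classify_recommendation` helper are all hypothetical illustrations, not part of Sonatype's tooling; a real pipeline would query a live registry such as Maven Central or npm.

```python
def classify_recommendation(package, current, recommended, published_versions):
    """Classify an upgrade recommendation: hallucination, no-change, or upgrade."""
    if recommended not in published_versions.get(package, set()):
        return "hallucination"  # recommended version was never published: made up
    if recommended == current:
        return "no-change"      # model played it safe and recommended nothing
    return "upgrade"            # a real, verifiable version change

# Hypothetical snapshot of a registry's published versions
published = {
    "example-lib": {"1.0.0", "1.1.0", "2.0.0"},
}

print(classify_recommendation("example-lib", "1.0.0", "9.9.9", published))  # hallucination
print(classify_recommendation("example-lib", "1.0.0", "1.0.0", published))  # no-change
print(classify_recommendation("example-lib", "1.0.0", "2.0.0", published))  # upgrade
```

A check like this catches the first failure mode (a nonexistent version) mechanically, but the second, a same-version recommendation, is only a failure if a better version actually exists, which is exactly the kind of current, grounded data the report argues models lack.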

“Larger models may be improving at reasoning, but dependency management is not a reasoning problem alone — it is a data problem. If a model does not know your actual environment, current vulnerability data, and the policies you operate under, it is just making educated guesses,” says Brian Fox, co-founder and CTO at Sonatype. “Grounding AI in that reality is what makes its recommendations useful, credible, and safe for enterprise use.”

The study also finds that real-time intelligence matters more than model size alone. A small grounded model produced significantly fewer Critical and High severity risk flags, at up to 71 times lower cost than frontier models.

You can get the full report from the Sonatype site.

Image credit: monsit/depositphotos.com