Popular LLMs share strengths and weaknesses when it comes to creating code

Increasing pressure to build and launch applications quickly has seen a rise in the use of AI to generate code. New analysis from Sonar, looking at the quality and security of software code produced by top Large Language Models (LLMs), finds significant strengths as well as material challenges across the tested models.

The study used a proprietary analysis framework for assessing LLM-generated code, tasking the LLMs with over 4,400 Java programming assignments. The LLMs evaluated in the study include Anthropic’s Claude Sonnet 4 and 3.7, OpenAI’s GPT-4o, Meta’s Llama-3.2-vision:90b, and OpenCoder-8B.

“The rapid adoption of LLMs for writing code is a testament to their power and effectiveness,” says Tariq Shaukat, CEO of Sonar. “To really get the most from them, it is crucial to look beyond raw performance to truly understand the full mosaic of a model’s capabilities. Understanding the unique personality of each model, and where they have strengths but also are likely to make mistakes, can ensure each model is used safely and securely.”

All of the models tested show a strong ability to generate syntactically correct code and boilerplate for common frameworks and functions, which reliably speeds up the initial stages of development. For example, Claude Sonnet 4’s success rate of 95.57 percent on HumanEval demonstrates a very high capability to produce valid, executable code.

The models also possess a strong foundational understanding of common algorithms and data structures, and they can create viable solutions for well-defined problems that serve as a solid starting point for more complex features. They are also highly effective at translating code concepts and snippets from one programming language to another, which makes them a powerful tool for developers who work across different technology stacks.

There are also shared flaws, however. All of the evaluated models demonstrate significant gaps in security. Critical flaws such as hard-coded credentials and path-traversal injections were common across all models, and while the exact numbers vary, every evaluated LLM produced a high percentage of vulnerabilities with the highest severity ratings. For Llama-3.2-vision:90b, over 70 percent of its vulnerabilities are rated at the 'blocker' level of severity; for GPT-4o, it is 62.5 percent; and for Claude Sonnet 4, it is nearly 60 percent.
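
For readers unfamiliar with these flaw classes, the hypothetical Java snippet below shows what both look like in practice. It is invented purely for illustration and is not taken from the report; the class name, connection details, and directory are assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ReportService {

    // Hard-coded credentials: secrets embedded in source code end up in
    // version control and build artifacts, one of the critical flaw types
    // the study flagged.
    private static final String DB_USER = "admin";
    private static final String DB_PASSWORD = "s3cret!";

    public Connection connect() throws SQLException {
        return DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/reports", DB_USER, DB_PASSWORD);
    }

    // Path-traversal injection: a request for "../../etc/passwd" escapes the
    // intended directory because the user-supplied name is never validated.
    public byte[] readReport(String fileName) throws IOException {
        return Files.readAllBytes(Path.of("/var/reports", fileName));
    }
}
```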

In addition, all models tested show a bias toward messy code: over 90 percent of the issues found were 'code smells', indicators of poor structure, low maintainability, and future technical debt.
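
As a hedged illustration of what such a smell looks like (invented for this article, not drawn from Sonar's dataset), the Java method below runs correctly but repeats expressions and hides intent behind unexplained constants, exactly the kind of pattern static analysis flags as future technical debt:

```java
public class PriceCalculator {

    // Works, but smells: duplicated sub-expressions and 'magic numbers'
    // make the intent hard to read and risky to change.
    public double total(double unitPrice, int quantity, boolean vipCustomer) {
        if (vipCustomer) {
            return unitPrice * quantity + unitPrice * quantity * 0.05
                    - unitPrice * quantity * 0.2;
        }
        return unitPrice * quantity + unitPrice * quantity * 0.05;
    }
}
```

A cleaner version would name the tax and discount rates and compute the subtotal once. The behavior is identical either way, which is why smells can dominate issue counts without ever breaking a test.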

Better performance also comes with more risk. Claude Sonnet 4 improved its performance benchmark pass rate by 6.3 percent over Claude 3.7 Sonnet, meaning it solved problems correctly more often, but this gain came at a price: the percentage of high-severity bugs rose by 93 percent.

Interestingly, the research found a unique 'coding personality' for each LLM, based on a quantifiable analysis of three traits: verbosity, complexity, and communication and documentation.

The report highlights a need for a more nuanced understanding that supplements performance benchmark scores with a direct assessment of the code’s security, reliability, and maintainability. Businesses need to adopt a ‘trust and verify’ approach, including robust governance and analysis of all AI-generated code.
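
In practice, 'trust and verify' means treating generated code like any other untrusted contribution: review it and run it through static analysis before merging. As a minimal sketch, and assuming the hypothetical vulnerable service shown earlier, a verified fix might look like this:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class SafeReportService {

    private static final Path BASE_DIR = Path.of("/var/reports");

    // Credentials are read from the environment (or a secrets manager)
    // instead of being hard-coded in source.
    public Connection connect() throws SQLException {
        return DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/reports",
                System.getenv("REPORTS_DB_USER"),
                System.getenv("REPORTS_DB_PASSWORD"));
    }

    public byte[] readReport(String fileName) throws IOException {
        // Resolve and normalize, then reject any path that escapes the
        // intended directory, closing the traversal hole.
        Path resolved = BASE_DIR.resolve(fileName).normalize();
        if (!resolved.startsWith(BASE_DIR)) {
            throw new IllegalArgumentException("Invalid report name: " + fileName);
        }
        return Files.readAllBytes(resolved);
    }
}
```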

The full report is available from the Sonar site.

Image credit: meshcube/depositphotos.com