Amid today's rapid pace of technological development, artificial intelligence (AI) is steadily making its way into programming and becoming a new assistant for developers. Google CEO Sundar Pichai has said that 25% of the company's new code is generated by AI, and Meta CEO Mark Zuckerberg has likewise expressed a willingness to deploy AI coding models widely within the company. This trend points to AI's great potential in programming tasks.
However, despite the remarkable progress AI models have made as programming assistants, their performance has been disappointing when it comes to the critical task of fixing software bugs. A new study from Microsoft Research lays out the situation. In it, several top AI models, including Anthropic's Claude 3.7 Sonnet and OpenAI's o3-mini, generally failed to achieve high success rates on software debugging tasks drawn from the software development benchmark SWE-bench Lite.
To probe the debugging capabilities of AI models more deeply, the researchers built an agent that operates from a single prompt and can use a variety of tools, including a Python debugger. The agent was assigned 300 filtered debugging tasks, but even the most advanced models solved only about half of them. Claude 3.7 Sonnet did relatively well, with an average success rate of 48.4%, while OpenAI's o1 and o3-mini managed only 30.2% and 22.1%, respectively.
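To make the setup concrete, here is a minimal sketch of what such a single-prompt, tool-using agent loop could look like in Python. It is illustrative only: `query_model` is a hypothetical stand-in for a model API call, and the study's actual agent and tool set are not reproduced here.

```python
import subprocess

def query_model(transcript: str) -> str:
    """Hypothetical stand-in for an LLM API call. A real agent would send
    the transcript to a model and get back its next debugger command, or a
    proposed fix prefixed with "FIX:"."""
    raise NotImplementedError("wire this up to a model API")

def run_pdb_commands(script: str, commands: list[str]) -> str:
    """Run a script under the Python debugger, feed it a sequence of pdb
    commands, and return the captured output for the model to read."""
    proc = subprocess.run(
        ["python", "-m", "pdb", script],
        input="\n".join(commands + ["quit"]),
        capture_output=True,
        text=True,
        timeout=60,
    )
    return proc.stdout

def debug_agent(task_prompt: str, script: str, max_steps: int = 10) -> str:
    """Agent loop: the model inspects debugger output and picks the next
    command (e.g. "b 42", "p x", "where") until it proposes a fix."""
    transcript = task_prompt
    for _ in range(max_steps):
        action = query_model(transcript)
        if action.startswith("FIX:"):
            return action  # the model's proposed patch
        transcript += f"\n> {action}\n{run_pdb_commands(script, [action])}"
    return "no fix proposed"
```

In each round the model sees everything gathered so far and must choose either another debugger command or a fix, which is exactly the kind of sequential decision-making the study examines.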
So why do these AI models perform poorly on debugging tasks? The researchers noted that some models had trouble using debugging tools and understanding how those tools could help solve a problem. But the deeper reason is data scarcity: today's training corpora contain very little data on the "sequential decision-making process", that is, traces of humans debugging step by step. This leaves AI models with an inherent handicap when imitating human debugging behavior.
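To show what is missing, here is one possible shape for such a debugging trace. The schema is an assumption made for illustration, not the researchers' actual data format.

```python
# Illustrative only: one possible shape for a "sequential decision" debugging
# trace, the kind of data the researchers say is scarce. The field names are
# assumptions, not the study's actual schema.

from dataclasses import dataclass, field

@dataclass
class DebugStep:
    command: str      # debugger action taken, e.g. "b app.py:42" or "p user_id"
    observation: str  # what the debugger printed back

@dataclass
class DebugTrace:
    bug_report: str                # the failing test or issue description
    steps: list[DebugStep] = field(default_factory=list)
    final_patch: str = ""          # the fix that was eventually committed

# The crux: public corpora contain bug reports and final patches in abundance
# (issues, commits), but almost never the intermediate steps in between.
```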
The researchers emphasize that training or fine-tuning the models could improve their ability to debug interactively, but this requires specialized data, for example, trace data recorded as an agent interacts with a debugger to gather the information it needs before recommending a fix. Such data is essential for improving the debugging capabilities of AI models.
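As a rough illustration of how such traces could feed fine-tuning, the sketch below turns each recorded debugger step into a supervised (context, next action) pair. The JSONL prompt/completion layout is a common convention, not a format the study prescribes.

```python
# Sketch of converting recorded debugging traces into fine-tuning data.
# Traces are dicts shaped like the DebugTrace sketch above; the JSONL
# "prompt"/"completion" layout is a common convention, not the study's format.

import json

def trace_to_examples(trace: dict) -> list[dict]:
    """Turn one trace into training pairs: given everything observed so far,
    predict the human's (or agent's) next debugger command."""
    examples = []
    context = trace["bug_report"]
    for step in trace["steps"]:
        examples.append({"prompt": context, "completion": step["command"]})
        context += f"\n> {step['command']}\n{step['observation']}"
    # Final pair: the full transcript maps to the fix that was committed.
    examples.append({"prompt": context, "completion": trace["final_patch"]})
    return examples

def write_jsonl(traces: list[dict], path: str) -> None:
    """Write one JSON object per line, the usual format for fine-tuning sets."""
    with open(path, "w", encoding="utf-8") as f:
        for trace in traces:
            for example in trace_to_examples(trace):
                f.write(json.dumps(example) + "\n")
```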
In fact, AI's application to programming has never been free of problems. Many studies have shown that code-generating AI often introduces security vulnerabilities and bugs, owing to weaknesses in understanding program logic among other limitations. For example, an evaluation of Devin, a popular AI programming tool, found that it completed only 3 out of 20 programming tests.
Still, Microsoft's study offers an important window into how AI is actually faring at programming. It reminds us that despite the enormous potential of AI-assisted programming tools, developers and their managers should think twice before handing programming over to AI. After all, given its complexity and creativity, programming remains a profession that is hard to replace outright.
Notably, a growing number of tech leaders are beginning to push back on the idea that AI will replace programming jobs. Microsoft co-founder Bill Gates believes that programming as a profession is here to stay, a view shared by Replit CEO Amjad Masad, Okta CEO Todd McKinnon, and IBM CEO Arvind Krishna, among others. They agree that despite AI's remarkable progress in programming, the creativity and problem-solving skills of human developers remain indispensable.
As AI technology continues to evolve, we can expect it to play an even bigger role in programming. At the same time, we should recognize AI's limitations and make full use of human developers' strengths, letting the two together advance the craft.