A recent study finds that software engineers who use code-generating AI systems are more likely to introduce security vulnerabilities into the applications they develop. The paper, co-authored by a team of researchers affiliated with Stanford, highlights the potential pitfalls of code-generating systems as vendors like GitHub begin to market them in earnest.
“Code-generating systems are currently not a replacement for human developers,” Neil Perry, a Stanford doctoral student and co-lead author of the study, told TechCrunch in an email interview. “Developers who use them to perform tasks outside their own areas of expertise should be concerned, and those who use them to speed up tasks they are already skilled at should carefully double-check the outputs and the context in which they are used throughout the project.”
The Stanford study focused specifically on Codex, the AI code-generation system developed by San Francisco-based research lab OpenAI. (Codex powers Copilot.) The researchers recruited 47 developers — ranging from undergraduate students to industry professionals with decades of programming experience — to use Codex to complete security-related programming tasks in languages including Python, JavaScript, and C.
Codex was trained on billions of lines of public code to suggest additional lines of code and functions given the context of the existing code. The system surfaces a programming approach or solution in response to a description of what a developer wants to accomplish (e.g. “Say hello to the world”), drawing on both its knowledge base and the current context.
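For readers unfamiliar with how such assistants behave in practice, the following sketch illustrates the prompt-and-completion pattern described above. It is not output from Codex itself; the function name and suggestion are invented for illustration, since real suggestions vary with the surrounding code and model version.

```python
# Prompt written by the developer, typically as a comment or docstring:
# Say hello to the world

# A Codex-style assistant might then suggest a completion along these lines
# (illustrative only; actual suggestions depend on context and model):
def say_hello():
    print("Hello, world!")


say_hello()
```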
According to the researchers, study participants who had access to Codex were more likely to write incorrect and “insecure” (in the cybersecurity sense) solutions to programming problems compared to a control group. More worryingly, they were also more likely to say that their insecure answers were secure compared to people in the control group.
Megha Srivastava, a graduate student at Stanford and the second co-author of the study, pointed out that the results are not a complete condemnation of Codex and other code-generating systems. For one, the study participants lacked security expertise that might have enabled them to better spot code vulnerabilities. That aside, Srivastava believes code-generating systems are reliably useful for tasks that aren’t high-risk, like exploratory research code, and could, with fine-tuning, improve their coding suggestions.
“Companies that develop their own [systems], perhaps trained further on their internal source code, may be better off, as the model can be encouraged to generate outputs that are more in line with their coding and security practices,” Srivastava said.
So how could vendors like GitHub prevent the introduction of security holes by developers using their code-generating AI systems? The co-authors have a few ideas, including a mechanism to “tune” user prompts to be safer – much like a supervisor reviewing and revising code drafts. They also suggest that developers of cryptography libraries ensure that their defaults are secure, as code-generating systems tend to stick to defaults that are not always exploit-free.
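To illustrate the “insecure defaults” concern, here is a minimal sketch, assembled for this article rather than taken from the study, that contrasts a legacy encryption pattern a code generator might reproduce from older examples with a safer, authenticated alternative. It uses Python’s third-party cryptography package; the variable names are arbitrary.

```python
import os

from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = os.urandom(32)            # 256-bit key
message = b"sixteen byte msg"   # exactly one AES block, so ECB accepts it

# Risky pattern: AES in ECB mode (the default in some older libraries and
# tutorials) leaks plaintext structure and provides no integrity check.
ecb_encryptor = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
leaky_ciphertext = ecb_encryptor.update(message) + ecb_encryptor.finalize()

# Safer choice: an authenticated mode (AES-GCM) with a fresh nonce per message.
aesgcm = AESGCM(key)
nonce = os.urandom(12)
sealed = aesgcm.encrypt(nonce, message, None)  # returns ciphertext plus auth tag
```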
“AI assistant code-generation tools are a really exciting development and it’s understandable that so many people are eager to use them. These tools raise issues to consider moving forward, however… Our goal is to make a broader statement about the use of code generation models,” Perry said. “More work needs to be done to explore these issues and develop techniques to address them.”
For Perry, the introduction of security vulnerabilities isn’t the only flaw in code-generating AI systems. At least some of the code Codex was trained on is under a restrictive license; users have been able to prompt Copilot to generate code from Quake, code snippets from personal codebases, and example code from books such as “Mastering JavaScript” and “Think JavaScript”. Some legal experts have argued that Copilot could put companies and developers at risk if they unwittingly incorporate the tool’s copyrighted suggestions into their production software.
GitHub’s attempt to rectify this is a filter, first introduced to the Copilot platform in June, that checks code suggestions, along with roughly 150 characters of surrounding code, against public GitHub code and hides suggestions when there is a match or “near match”. But it is an imperfect measure. Tim Davis, a professor of computer science at Texas A&M University, found that enabling the filter still caused Copilot to emit large chunks of his copyrighted code, including all attribution and license text.
“[For these reasons,] we largely urge caution against using these tools as a replacement for teaching entry-level developers strong coding practices,” Srivastava added.