IBM CodeNet Enables AI Translation of Computer Languages

During its Think conference this week, IBM debuted Project CodeNet, an open-source dataset for benchmarking AI systems that work with code. Project CodeNet consists of 14 million code examples, which makes it about 10 times larger than the most similar dataset, which has 52,000 examples. Project CodeNet also spans 500 million lines of code in 55 programming languages, including C++, Java, Python, Go, COBOL, Pascal and Fortran, making it a Rosetta Stone for AI systems that automatically translate code from one programming language into another.

VentureBeat reports that, according to the University of Cambridge’s Judge Business School, programmers spend almost half of their time debugging, at a cost of $312 billion per year. CodeNet will allow AI systems not only to translate code automatically into other languages but also to “identify overlaps and similarities between different sets of code, and customize constraints based on a developer’s specific needs and parameters.”

Currently, programming language translation “requires expertise in both the source and target languages” and can be costly. The Commonwealth Bank of Australia spent about $750 million over five years to convert from COBOL to Java.

The CodeNet dataset also offers “metadata and annotations with a rich set of information spanning code size, memory footprint, CPU run time, and status, which helps to distinguish correct code from problematic code,” with 90 percent of the sample problems containing “a problem statement and specifications of the input and output format.”
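To make the role of that metadata concrete, here is a minimal sketch of how per-sample annotations could be used to separate correct code from problematic code. The record layout, the field names and the status strings ("Accepted", "Wrong Answer", "Runtime Error") are assumptions made for illustration, and the sample values are invented; none of it is taken from the CodeNet release.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Submission:
    # Hypothetical record layout; field names are illustrative assumptions.
    submission_id: str
    language: str
    status: str
    cpu_time_ms: int
    memory_kb: int
    code_size_bytes: int


def accepted_only(subs):
    """Keep only the samples whose status marks them as correct."""
    return [s for s in subs if s.status == "Accepted"]


def median_cpu_time_by_language(subs):
    """Group samples by language and take the median CPU time of each group."""
    by_lang = {}
    for s in subs:
        by_lang.setdefault(s.language, []).append(s.cpu_time_ms)
    return {lang: median(times) for lang, times in by_lang.items()}


# Invented sample records, for demonstration only.
samples = [
    Submission("s1", "Python", "Accepted", 120, 9000, 310),
    Submission("s2", "Python", "Wrong Answer", 80, 8800, 295),
    Submission("s3", "C++", "Accepted", 10, 3500, 540),
    Submission("s4", "C++", "Accepted", 14, 3600, 610),
    Submission("s5", "Java", "Runtime Error", 200, 40000, 720),
]

good = accepted_only(samples)
print([s.submission_id for s in good])
print(median_cpu_time_by_language(good))
```

Filtering on status before computing statistics keeps failed runs, which may have stopped early, from skewing the timing figures.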

“Given its wealth of programs written in a multitude of languages, we believe Project CodeNet can serve as a benchmark dataset for source-to-source translation and do for AI and code what the ImageNet dataset did years ago for computer vision,” said IBM fellow and IBM Research chief scientist Ruchir Puri.

Engadget reports that “CodeNet is essentially the ImageNet of computers … an expansive dataset designed to teach AI/ML systems how to translate code.” Puri noted that, “since the dataset itself contains 50 different languages, it can actually enable algorithms for many pairwise combinations.”

“Having said that, there has been work done in human language areas, like neural machine translation which, rather than doing pairwise, actually becomes more language-independent and can derive an intermediate abstraction through which it translates into many different languages,” he added.
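The difference between the two approaches Puri describes can be shown with simple counting. The sketch below assumes, purely for illustration, that a pairwise approach needs one translator per ordered source-target pair, and that an intermediate-abstraction approach needs one component into and one component out of the shared representation per language. Real systems may share far more than this, so the figures are a rough comparison rather than a description of any IBM model.

```python
def pairwise_translators(num_languages):
    """One translator for every ordered (source, target) pair."""
    return num_languages * (num_languages - 1)


def pivot_components(num_languages):
    """One component into and one out of a shared abstraction per language."""
    return 2 * num_languages


n = 50  # the number of languages Puri cites
print(pairwise_translators(n))  # 2450
print(pivot_components(n))      # 100
```

Under these assumptions the pairwise count grows with the square of the number of languages, while the intermediate-abstraction count grows linearly.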

In addition to translating programming languages, CodeNet “can be used for functions like code search and clone detection.” Because each sample comes labeled with CPU run time and memory footprint, researchers can “run regression studies and potentially develop automated code correction systems.” Users can also “run individual code samples ‘to extract metadata and verify outputs from generative AI models for correctness’.”
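As a rough illustration of what clone detection involves, the sketch below scores two code snippets by the overlap of their token sets (Jaccard similarity). This is a simple baseline chosen for brevity, not the method used by IBM or in any CodeNet experiment, and the tokenizer, function names and snippets are assumptions made for the example.

```python
import re

# Crude tokenizer: identifiers, integer literals, or any single non-space symbol.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokens(code):
    """Return the set of distinct tokens found in a code snippet."""
    return set(TOKEN.findall(code))


def jaccard(code_a, code_b):
    """Token-set overlap between 0.0 (disjoint) and 1.0 (identical sets)."""
    ta, tb = tokens(code_a), tokens(code_b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Invented snippets: the first two differ only in variable names.
original = "def add(a, b):\n    return a + b"
renamed = "def add(x, y):\n    return x + y"
unrelated = "print('hello')"

print(jaccard(original, renamed))
print(jaccard(original, unrelated))
```

The renamed copy scores higher than the unrelated snippet, although a token-overlap measure like this is easily fooled by reordering or restructuring, which is one reason large labeled datasets are useful for training stronger detectors.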

According to Puri, “more than 80 percent of these presented problems each already have more than 100 variant answers, providing a broad array of possible solutions.” IBM plans to release the CodeNet data into the public domain.