IBM’s AI research unit debuted Project CodeNet, a dataset for developing machine learning models for software programming. The name is a play on ImageNet, the influential dataset of photos that pushed the development of computer vision and deep learning. Creating “AI for code” systems has been challenging because software developers are constantly discovering new problems and exploring different solutions. IBM researchers have taken that into consideration in developing a multi-purpose dataset for Project CodeNet.
VentureBeat reports that “the dataset contains 14 million code samples with 500 million lines of code written in 55 different programming languages,” with code samples taken from “submissions to nearly 4,000 challenges posted on online coding platforms AIZU and AtCoder”; the samples include both correct and incorrect answers.
To make it particularly useful for the challenge at hand, CodeNet includes a large amount of annotation; “every one of the coding challenges included in the dataset has a textual description along with CPU time and memory limits … [and] every code submission has a dozen pieces of information, including the language, the date of submission, size, execution time, acceptance, and error types.”
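The per-submission annotations described above lend themselves to straightforward filtering and aggregation. The sketch below shows how such metadata might be queried; the column names and sample rows here are illustrative assumptions for demonstration, not CodeNet’s documented schema.

```python
import csv
import io

# Illustrative per-submission metadata in the spirit of CodeNet's
# annotations (language, status, execution time, size). The field
# names and values below are assumptions, not the actual dataset.
SAMPLE_METADATA = """\
submission_id,language,status,cpu_time_ms,code_size_bytes
s001,Python,Accepted,120,342
s002,C++,Wrong Answer,45,511
s003,Python,Accepted,98,287
s004,Java,Runtime Error,200,904
"""

def accepted_by_language(csv_text):
    """Count accepted submissions per language."""
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["status"] == "Accepted":
            counts[row["language"]] = counts.get(row["language"], 0) + 1
    return counts

print(accepted_by_language(SAMPLE_METADATA))  # {'Python': 2}
```

Because acceptance status and error types are recorded per submission, the same pattern extends to building labeled training sets, e.g., pairing accepted and rejected solutions to the same challenge.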
To curate CodeNet, IBM engineers gathered the code samples from AIZU and AtCoder; one platform provided an API, while the other required the engineers to develop a technique to scrape the data from its website and tabulate it. The two collections then had to be manually merged to form a “unified schema.” The engineers also had to “develop tools to cleanse the data” of duplicates and dead code, and build preprocessing tools to more easily train ML models on the CodeNet corpus.
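Deduplication is one of the cleansing steps mentioned above. A common approach, sketched here as a minimal example (not IBM’s actual pipeline), is to hash each sample after light normalization so that trivially reformatted copies collapse to one entry.

```python
import hashlib

def normalize(code):
    # Strip trailing whitespace and blank lines so trivially
    # reformatted copies of the same program hash identically.
    lines = [ln.rstrip() for ln in code.splitlines()]
    return "\n".join(ln for ln in lines if ln)

def dedupe(samples):
    """Keep the first occurrence of each distinct (normalized) sample."""
    seen, unique = set(), []
    for code in samples:
        digest = hashlib.sha256(normalize(code).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(code)
    return unique

samples = ["print(1)\n", "print(1)   \n\n", "print(2)\n"]
print(len(dedupe(samples)))  # 2
```

A production pipeline would likely normalize more aggressively (comments, identifier renaming), but the hash-and-filter structure is the same.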
“All these efforts are a reminder of the huge human effort needed to create efficient machine learning systems,” notes VB.
What distinguishes CodeNet from similar efforts is “the sheer size of the dataset, including the number of samples and the diversity of the languages” as well as the “rich annotations” or metadata that “make it suitable for a diverse set of tasks.”
CodeNet’s potential uses include programming language translation, which “can be handy for organizations that want to port old code to new languages and make them accessible to newer generations of programmers and maintainable with new development tools.” It could also be used to develop machine learning models for a variety of code recommendation tools, code optimization systems, or systems to “flag potential flaws in source code.”
Because CodeNet is “a rich library of textual descriptions of problems and their corresponding source code,” it could also potentially be used to “generate code from natural language descriptions.”
IBM researchers have already experimented using CodeNet for “code classification, code similarity evaluation, and code completion” with deep learning architectures including “simple multi-layer perceptrons, convolutional neural networks, graph neural networks, and transformers,” reporting that they “have been able to obtain above 90-percent accuracy in most tasks.”
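To give a concrete flavor of one of those tasks, code similarity evaluation, the toy baseline below compares two snippets by token overlap (Jaccard similarity). This is a simple illustrative baseline, not the neural approach IBM used.

```python
import re

def tokens(code):
    """Crude tokenizer: identifiers, numbers, and single punctuation marks."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))

def jaccard_similarity(a, b):
    """Token-set overlap between two code snippets, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

snippet_a = "for i in range(10): total += i"   # near-duplicates...
snippet_b = "for j in range(10): total += j"   # ...differing only in a name
snippet_c = 'printf("%d", x);'                 # unrelated C snippet

print(jaccard_similarity(snippet_a, snippet_b))  # high
print(jaccard_similarity(snippet_a, snippet_c))  # low
```

Neural models learn far richer representations than raw token sets, but a baseline like this is a common sanity check when evaluating similarity systems.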