Microsoft Develops Scalable 2D-to-3D Conversion Technique

Transforming 2D objects into 3D models is a challenge that has stymied numerous artificial intelligence labs, including those at Facebook, Nvidia and startup Threedy.ai. Now, a Microsoft Research team says it has created the first “scalable” training technique for deriving 3D models from 2D data. Moreover, their system learns to generate better shapes when trained exclusively on 2D images. The Microsoft team took advantage of fully featured industrial renderers, i.e., software that produces images from display data.

VentureBeat reports that the training technique “could be a boon for video game developers, e-commerce businesses, and animation studios that lack the means or expertise to create 3D shapes from scratch.”

The team trains “a generative model for 3D shapes such that rendering the shapes generates images matching the distribution of a 2D data set … The generator model takes in a random input vector (values representing the data set’s features) and generates a continuous voxel representation (values on a grid in 3D space) of the 3D object.” In the next step, it “feeds the voxels to a non-differentiable rendering process, which thresholds them to discrete values before they’re rendered using an off-the-shelf renderer (Pyrender, which is built on top of OpenGL).”
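The thresholding step described above can be sketched in a few lines. This is an illustrative stand-in only: the function name, grid size, and 0.5 cutoff are assumptions for the sketch, not values taken from the paper.

```python
import numpy as np

def threshold_voxels(continuous_voxels: np.ndarray, cutoff: float = 0.5) -> np.ndarray:
    """Binarize a continuous voxel grid into discrete occupancy (0 or 1).

    In the pipeline described above, the discrete grid is what gets handed
    to an off-the-shelf renderer such as Pyrender. The cutoff here is an
    illustrative assumption.
    """
    return (continuous_voxels >= cutoff).astype(np.uint8)

rng = np.random.default_rng(0)
# Stand-in for the generator's output: a 32^3 grid of continuous values.
voxels = rng.random((32, 32, 32))
occupancy = threshold_voxels(voxels)
print(occupancy.shape, occupancy.dtype)
```

Because this hard threshold is non-differentiable, gradients cannot flow through it back to the generator, which is exactly the problem the proxy neural renderer described next is meant to solve.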

VB adds that, “a novel proxy neural renderer directly renders the continuous voxel grid generated by the 3D generative model … [which is] trained to match the rendering output of the off-the-shelf renderer given a 3D mesh input.” For the generator, the team experimented with a 3D convolutional GAN architecture, in which a discriminator model attempts to distinguish shapes the generator produces from random noise from real examples in a training data set.
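The core idea of the proxy neural renderer can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the “off-the-shelf renderer” is a fixed linear projection, the proxy is a single linear layer fit by stochastic gradient descent, and the real system uses neural networks and actual rendered images. The point is only the training objective: make a differentiable model reproduce the output of a non-differentiable renderer, so gradients can reach the 3D generator.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vox, n_pix = 64, 16  # flattened voxel grid -> image pixels (toy sizes)

# Stand-in for the non-differentiable off-the-shelf renderer.
target_W = rng.standard_normal((n_pix, n_vox)) * 0.1
# Learnable proxy renderer, initialized to zero.
proxy_W = np.zeros((n_pix, n_vox))

def offtheshelf_render(v: np.ndarray) -> np.ndarray:
    return target_W @ v

def proxy_render(v: np.ndarray) -> np.ndarray:
    return proxy_W @ v

lr = 0.04
init_gap = np.abs(proxy_render(np.ones(n_vox)) - offtheshelf_render(np.ones(n_vox))).max()
for _ in range(2000):
    v = rng.random(n_vox)                      # sample a voxel grid
    err = proxy_render(v) - offtheshelf_render(v)
    proxy_W -= lr * np.outer(err, v)           # gradient of 0.5 * ||err||^2

v_test = rng.random(n_vox)
gap = np.abs(proxy_render(v_test) - offtheshelf_render(v_test)).max()
print(f"max pixel gap after training: {gap:.4f}")
```

Once trained, the proxy stands in for the real renderer during generator updates, providing the gradients the hard-thresholded rendering path cannot.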

The researchers “synthesized images from different object categories, which they rendered from different viewpoints throughout the training process … drawing on a range of synthetic data sets generated from 3D models and a real-life data set.” They stated that this “approach takes advantage of the lighting and shading cues the images provide, enabling it to extract more meaningful information per training sample and produce better results in those settings … [and is] able to produce realistic samples when trained on data sets of natural images.”
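Rendering one shape from multiple viewpoints, as the training procedure above does, can be illustrated with a toy orthographic projection. This is a deliberately simplified stand-in: a real pipeline would use a renderer like Pyrender with camera poses and lighting, whereas here a “viewpoint” is just an axis and the “render” is a binary silhouette.

```python
import numpy as np

def silhouette(voxels: np.ndarray, axis: int) -> np.ndarray:
    """Orthographic binary silhouette of a voxel grid along one axis."""
    return voxels.max(axis=axis)

# A solid 4^3 cube centered in an 8^3 grid as a stand-in 3D shape.
vox = np.zeros((8, 8, 8), dtype=np.uint8)
vox[2:6, 2:6, 2:6] = 1

# "Render" the same shape from three viewpoints (one per axis).
views = [silhouette(vox, axis=a) for a in range(3)]
for a, img in enumerate(views):
    print(f"view along axis {a}: {img.shape}, occupied pixels = {int(img.sum())}")
```

In the actual system, such per-viewpoint images are what the discriminator compares against the 2D training data, which is how supervision from flat images shapes the 3D generator.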

In the future, the Microsoft team plans to focus on incorporating “color, material, and lighting prediction into their system to extend it to work with more ‘general’ real-world data sets.” A pre-print of the article is available online.