Nvidia Cuts Video-Conferencing Bandwidth by Factor of Ten

Last month Nvidia launched Maxine, a software development kit containing technology the company claims will cut the bandwidth requirements of video-conferencing software by a factor of ten. A neural network creates a compressed version of a person’s face, which, when sent across the network, is decompressed by a second neural network. The software can also make helpful corrections to the image, such as rotating a face so it appears to look directly at the camera, or replacing it with a digital avatar. Nvidia is now waiting for software developers to productize the technology.
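To get a feel for where a factor-of-ten saving could come from, here is a back-of-the-envelope comparison of a conventional video-call bitrate against a stream that sends only compact facial data each frame. All of the numbers are illustrative assumptions for the sake of the arithmetic, not figures published by Nvidia:

```python
# Back-of-the-envelope bandwidth comparison.  Every number below is an
# assumption chosen for illustration, not a published Nvidia figure.

def conventional_kbps(bitrate_kbps: float = 1500.0) -> float:
    """A typical 720p video-call bitrate (assumed)."""
    return bitrate_kbps

def facial_data_kbps(num_points: int = 100,
                     bytes_per_point: int = 8,   # two float32 coordinates
                     fps: int = 30) -> float:
    """Bandwidth needed to send only per-frame facial data points."""
    bytes_per_second = num_points * bytes_per_point * fps
    return bytes_per_second * 8 / 1000  # convert bytes/s to kilobits/s

ratio = conventional_kbps() / facial_data_kbps()
print(f"facial-data stream: {facial_data_kbps():.0f} kbps, "
      f"~{ratio:.1f}x less than conventional video")
```

With these assumed values the compact stream needs about 192 kbps, an order-of-magnitude reduction in line with the claim, though the real savings would depend on the codec and resolution being replaced.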

Ars Technica reports that, “the software comes with an important limitation: the device receiving a video stream needs an Nvidia GPU with Tensor Core technology … [and] to support devices without an appropriate graphics card, Nvidia recommends that video frames be generated in the cloud — an approach that may or may not work well in practice.”

Ars Technica notes that, “regardless of how Maxine fares in the marketplace, the concept seems likely to be important for video streaming services in the future … [since] before too long, most computing devices will be powerful enough to generate real-time video content using neural networks.”

Maxine’s neural network is built on a technique called a conditional generative adversarial network (GAN): a neural network that “take[s] an image (or other input data) … then tr[ies] to produce a corresponding output image.” Conditional GANs have been used to “generate works of art from textual descriptions … photographs from sketches … maps from satellite images, to predict how people will look when they’re older, and a lot more.”
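The defining feature of a conditional GAN is the data flow: both the generator and the discriminator receive the conditioning input, so the discriminator judges whether an output is plausible *given* that input. The toy sketch below shows only that structure, using 1-D lists of floats in place of real images and hand-written stand-in functions in place of trained networks:

```python
import random

# Structural sketch of a conditional GAN.  The "networks" here are toy
# hand-written functions, not trained models; the point is that the
# condition flows into BOTH the generator and the discriminator.

def generator(condition, noise):
    """Produce an output conditioned on the input; here just a toy
    transformation mixing the condition with random noise."""
    return [c + 0.1 * n for c, n in zip(condition, noise)]

def discriminator(condition, image):
    """Score how plausible `image` is *given* `condition` (toy rule:
    outputs closer to the condition look more 'real')."""
    distance = sum((c - x) ** 2 for c, x in zip(condition, image))
    return 1.0 / (1.0 + distance)  # score in (0, 1]

condition = [0.2, 0.5, 0.9]                 # e.g. a sketch or key points
noise = [random.gauss(0, 1) for _ in condition]
fake = generator(condition, noise)
score = discriminator(condition, fake)
```

In training, the generator would be updated to raise this score while the discriminator learns to lower it for fakes, with both always seeing the conditioning input.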

Nvidia referred to a 2019 paper by its researchers that described a conditional GAN that does what Maxine is capable of, but with a slight tweak: “instead of taking a video as input, Maxine takes a set of key points extracted from the source video — data points specifying the location and shape of the subject’s eyes, mouth, nose, eyebrows, and other facial features.”
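A sketch of what such a per-frame payload might look like on the wire: a handful of named landmark coordinates packed into a few dozen bytes, instead of a full image. The landmark names, counts, and encoding below are hypothetical, chosen only to illustrate the size difference; Nvidia has not published Maxine’s actual wire format:

```python
import struct

# Hypothetical per-frame payload for a key-point-based codec: named
# facial landmarks as (x, y) pixel coordinates.  Names, counts, and the
# binary layout are all illustrative, not Nvidia's actual format.

keypoints = {
    "left_eye":    (312.5, 220.1),
    "right_eye":   (388.0, 218.7),
    "nose_tip":    (350.2, 270.4),
    "mouth_left":  (325.9, 330.0),
    "mouth_right": (374.1, 331.2),
}

# Compact binary encoding: two little-endian 32-bit floats per point.
payload = b"".join(struct.pack("<ff", x, y) for x, y in keypoints.values())

# 5 points * 8 bytes = 40 bytes per frame, versus ~2.7 MB for one
# uncompressed 1280x720 RGB frame.
print(len(payload), "bytes per frame")
```

A real system would track far more than five landmarks, but even a few hundred points per frame remain tiny next to pixel data, which is where the bandwidth saving comes from.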

In the paper, Nvidia researchers stated that, “a single network could be trained to generate videos of different people based on the photos provided as inputs,” meaning that it wasn’t necessary to “train a new network for each user.”

Nvidia is currently “providing developers with a number of different capabilities and letting them decide how to put them together into a usable product,” which will need a “recent Nvidia GPU on the receiving end of the video stream.” Since most video-conferencing tools support a wide variety of hardware, Nvidia suggested that “developers could run Maxine on a cloud server equipped with the necessary Nvidia hardware, then stream the rendered video to client devices.”

The uplink, however, doesn’t require an Nvidia GPU. Maxine is in the “early access” stage of development, available only to “a select group of early developers who are helping Nvidia refine Maxine’s APIs,” but it will eventually be open to “software developers generally.”