Google GPipe Library Speeds Deep Neural Network Training

Google has unveiled GPipe, an open-source library that makes training deep neural networks more efficient, released under Lingvo, a TensorFlow framework for sequence modeling. According to Google AI software engineer Yanping Huang, “in GPipe … we demonstrate the use of pipeline parallelism to scale up DNN training,” noting that larger DNN models “lead to better task performance.” Huang and his colleagues published a paper titled “Efficient Training of Giant Neural Networks Using Pipeline Parallelism.”

VentureBeat reports that Huang added, “past progress in visual recognition tasks has also shown a strong correlation between the model size and classification accuracy.” GPipe implements two AI training techniques: synchronous stochastic gradient descent, “an optimization algorithm used to update a given AI model’s parameters,” and pipeline parallelism, “a task execution system in which one step’s output is streamed as input to the next step.”
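As a rough illustration of the first technique, the sketch below (a hypothetical toy, not GPipe’s actual API) shows the core idea of synchronous SGD: every worker computes a gradient on its share of the data, the gradients are averaged, and a single update is applied to the shared parameters.

```python
# Toy synchronous SGD step (illustrative only, not GPipe code):
# all workers' gradients are averaged before one shared parameter update.

def synchronous_sgd_step(params, worker_grads, lr=0.1):
    """Average per-worker gradients, then apply one SGD update to params."""
    n_workers = len(worker_grads)
    avg_grad = [
        sum(g[i] for g in worker_grads) / n_workers
        for i in range(len(params))
    ]
    return [p - lr * g for p, g in zip(params, avg_grad)]

# Two workers report gradients for a two-parameter model.
params = [1.0, 2.0]
worker_grads = [[0.2, 0.4], [0.4, 0.8]]
print(synchronous_sgd_step(params, worker_grads))
```

Because the update is synchronous, every worker sees identical parameters after each step, which is what lets GPipe accumulate gradients across micro-batches without degrading model quality.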

Better memory allocation is responsible for “most of GPipe’s performance gains.” For example, “on second-generation Google Cloud tensor processing units (TPUs), each of which contains eight processor cores and 64GB memory (8GB per core), GPipe reduced intermediate memory usage from 6.26GB to 3.46GB, enabling 318 million parameters on a single accelerator core.”

Huang states that, “without GPipe … a single core can only train up to 82 million model parameters.” GPipe also “partitions models across different accelerators and automatically splits miniature batches (i.e., ‘mini-batches’) of training examples into smaller ‘micro-batches’.” By pipelining execution across the micro-batches, it “enables cores to operate in parallel, and furthermore accumulate gradients across the micro-batches, thereby preventing the partitions from affecting model quality.”
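The micro-batching scheme described above can be sketched in a few lines. This is a simplified stand-in (the function names and single-process loop are my own, not GPipe’s), showing how a mini-batch is split into micro-batches whose gradients are accumulated so the resulting update matches what plain mini-batch training would produce; in GPipe itself, the micro-batches additionally overlap in time across accelerator cores.

```python
# Hypothetical sketch of GPipe-style micro-batching with gradient
# accumulation. In real GPipe, each micro-batch flows through model
# partitions on different accelerators while others are in flight.

def split_into_microbatches(minibatch, num_micro):
    """Split a mini-batch into num_micro equal-sized micro-batches."""
    size = len(minibatch) // num_micro
    return [minibatch[i * size:(i + 1) * size] for i in range(num_micro)]

def accumulate_gradients(minibatch, grad_fn, num_micro=4):
    """Run grad_fn per micro-batch and average the results, so the
    final gradient equals the one from processing the whole mini-batch."""
    accumulated = None
    for mb in split_into_microbatches(minibatch, num_micro):
        g = grad_fn(mb)
        accumulated = g if accumulated is None else [
            a + b for a, b in zip(accumulated, g)
        ]
    return [a / num_micro for a in accumulated]

# Toy gradient function: the "gradient" is just the sum of the examples.
grad = accumulate_gradients([1, 2, 3, 4, 5, 6, 7, 8],
                            lambda mb: [float(sum(mb))], num_micro=4)
print(grad)
```

Accumulating before updating is why the partitioning is transparent to the optimizer: the parameters see one combined gradient per mini-batch, exactly as in unpartitioned training.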

Google conducted an experiment in which it trained AmoebaNet-B, a deep learning model with 557 million parameters, on sample images using TPUs, incorporating 1.8 billion parameters on each TPU (25 times more than is possible without GPipe). Huang reported it performed “well” on popular datasets, “pushing single-crop ImageNet accuracy to 84.3 percent, CIFAR-10 accuracy to 99 percent, and CIFAR-100 accuracy to 91.3 percent.”

The AmoebaNet-D algorithm underwent a second test showing that, “distributing the model across four times the number of second-gen TPU cores achieved a speedup of 3.5 times.” A Google test of Transformer language models with “eight billion parameters on third-generation TPUs (the newest available), each of which has 16 cores and 256GB of memory (16GB per core) … recorded a speedup of 11 times.”

“The ongoing development and success of many practical machine learning applications, such as autonomous driving and medical imaging, depend on achieving the highest accuracy possible,” said Huang. “As this often requires building larger and even more complex models, we are happy to provide GPipe to the broader research community, and hope it is a useful infrastructure for efficient training of large-scale DNNs.”

Related:
Google Introduces TensorFlow Privacy, a Machine Learning Library With ‘Strong Privacy Guarantees’, VentureBeat, 3/6/19