@Scale machine learning–distributed training concepts


@Scale machine learning–distributed training concepts

Digital Innovation & transformation

What happens if the business problem we are trying to solve for requires a large machine learning model or large amount of training data that won’t fit into the memory of a single server? What if the computation needs are impossible for a single processor to support?

A blog post by Sudi Bhattacharya, Consulting Managing Director, Cloud Engineering.

In real-life machine learning applications, the datasets used for training models can range from multiple terabyte to even petabytes. For example, if we want to analyze datasets from a social network with 100 million users and 1KB data per user, we are dealing with a 100TB dataset.

Digital Innovation & transformation

Training modern deep learning models requires large amounts of computation, often provided by GPUs. Scaling model training from one GPU to many (so-called distributed training) can enable much faster training and innovation at a lower cost.

Let’s say that you start with a relatively small training dataset that can be trained on a single processor on a single server. As you understand more and more about the business problem you are trying to solve, you realize that you need a more complex model and significantly more data to solve the problem. Consequently, the time needed to train such models on a single GPU single server configuration can grow considerably, sometimes by days or weeks. Distributed training allows you to train large models and models that require large datasets on multiple GPUs and multiple servers in a shorter amount of time and at a much lower computation cost.

Model parallelism vs Data parallelism

If we are working with an immense deep learning model (large number of features, many layers in the neural networks, large number of model parameters), the model may need to be broken down into smaller chunks and distributed across GPUs and servers so that each GPU trains part of the model. This scenario is called model parallelism. A more common scenario is where the dataset is just too large to use one GPU to train it in a reasonable amount of time. To solve the problem using distributed training, we break up the data set and each GPU trains the same model, but with a subset of the total training data, the so-called “data parallelism” scenario. In this article we will focus on data parallel distributed training.

Synchronous distributed training

Distributing training with data parallelism can be performed in two different “modes”: Synchronous or asynchronous. In synchronous training mode, all workers train over different partitions of input data in such a way that all the variables are aggregated at each training step before the next step starts. As we can immediately guess, there is an inherent performance limitation to this approach. When the parameters are being updated, the GPUs are sitting idle. Asynchronous distributed training solves this problem, but more on that later.

Basic steps of synchronous data parallel distributed training

Let’s start with a supervised learning scenario and examine the steps needed to train a neural network model:

  1. Supply data to the model so that the model can infer the output
  2. Compare the output with the actual value and calculate the “loss”, or difference between the inferred value and the actual value, using a loss function
  3. Use an optimization function (e.g., stochastic gradient descent or SGD) to minimize the loss
  4. The optimizer function generates new model parameters at each pass that are fed back into the model
  5. Model converges to a set of stable parameters through multiple iterations of the above process

Let’s take the case of a synchronous data parallel distributed and examine what needs to happen to distribute the training in such a case:

  1. Data set is broken down into partitions
  2. Entire model and a data partition are distributed onto each of the available GPUs
  3. Each GPU processes a data partition and calculates model parameters locally
  4. Model parameters are synchronized using a specific technique (typically via either parameter server or all-reduce)
  5. New parameters are used to update the local models before processing the next batch of training data 
  6. Iterate to convergence of model parameters

Compared to  single-GPU training, this type of distributed training can affectively accelerate the training speed.

Asynchronous distributed training

As we briefly discussed, in the case of synchronous distributed training, while messages are being passed between GPUs to update the parameters, no training is being performed. In asynchronous training mode, the workers are trained independent of each other over their own partition of input data and parameters are updated asynchronously. Typically, a parameter server architecture is used to update the model parameters.

A parameter server is a key-value store where the values are the parameters of the machine-learning model (e.g., a neural network); the keys index the model parameters. The parameter server updates its parameters by using local computations by the other servers, known as workers, and pushes them to the parameter server. Periodically, the parameter server broadcasts its updated parameters to all the other worker machines, so that they can use the updated parameters to continue their optimization of the loss and subsequent convergence of the parameters.

TensorFlow and different strategies of distribution

Let’s discuss how to employ various distribution strategies using TensorFlow, one of the most popular open source libraries for machine learning. “tf.distribute.Strategy” is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs. Even if you start with a model trained on a single GPU, TensorFlow allows you to distribute your existing model and training code with minimal code changes.

The starting point of any distributed strategy is to break down the model training task into multiple units of work. Each unit of work can be distributed following different distribution strategies. There are six distribution strategies available with TensorFlow 2.2. We discuss four of the more commonly encountered distribution strategies below:

  • Mirrored Strategy-Single Server Multi-GPU: The simplest way to distribute the work is to spread them over multiple GPUs in one single server. Each model is “mirrored” on all GPUs. These variables are kept in sync with each other by applying identical updates. Efficient all-reduce algorithms are used to communicate the variable updates across the devices. 
  • TPU Strategy-Single Server Multi-GPU: “TPU Strategy” allows  you to run your TensorFlow training on tensor processing units (TPUs), Google's specialized ASICs designed to accelerate machine learning workloads. In terms of distributed training architecture, TPUStrategy is the same MirroredStrategy-it implements synchronous distributed training. TPUs provide their own implementation of efficient all-reduce and other collective operations across multiple TPU cores, which are used by the TensorFlow API.
  • MultiWorkerMirroredStrategy-Multi Server Multi GPU: The next level of distribution happens over multiple servers and multiple GPUs in each server. This strategy still implements the synchronous data parallel approach by replicating all models to each of the available GPUs and uses a multi-worker all-reduce communication method to keep all variables in sync.
  • Parameter server strategy: This strategy supports parameter servers training on multiple machines. In this setup, some machines are designated as workers and some as parameter servers. Each variable of the model is placed on one parameter server, and computation is replicated across all GPUs of all the workers.

Distributed training frameworks

I will conclude the blog by mentioning a few frameworks and libraries that are used in implementing distributed training strategies:

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make it easy to take a single-GPU training script and successfully scale it to train across many GPUs in parallel.

Spark-tensorflow-distributor is an open-source native package in TensorFlow that helps users do distributed training with TensorFlow on their Spark clusters. It is built on top of “tf.distribute.Strategy”, the API that we discussed early on.

In Summary

Distributed model training is a fascinating and practical area that is developing at a rapid pace. I wanted to provide a basic introduction to the core concepts. If you would like to go deeper in your understanding, a good starting point is Andrew Gibiansky’s 2017 blog post titled “Bringing HPC Techniques to Deep Learning”. Enjoy.

Please visit our cloud services webpage and discover a full suite of service offerings and capabilities available to accompany you throughout your Cloud journey.

Insert CSS fragment. Do not delete! This box/component contains code needed on this page. This message will not be visible when page is activated.

Did you find this useful?