MXNet on the Cloud

Deep learning can require extremely powerful hardware, often for unpredictable durations of time. Moreover, MXNet can benefit from both multiple GPUs and multiple machines. Accordingly, cloud computing, as offered by AWS and others, is especially well suited to training deep learning models. Using AWS, we can rapidly fire up multiple machines with multiple GPUs each at will and maintain the resources for precisely the amount of time needed.

Set Up an AWS GPU Cluster from Scratch

In this document, we provide a step-by-step guide for setting up an AWS cluster with MXNet. We show how to:

  • Use a pre-installed EC2 GPU instance

  • Set up an EC2 GPU instance from scratch

  • Set up an EC2 GPU cluster for distributed training

Use Pre-installed EC2 GPU Instance

The Deep Learning AMIs are a series of images supported and maintained by Amazon Web Services for use on Amazon Elastic Compute Cloud (Amazon EC2) and contain the latest MXNet release.

Now you can launch MXNet directly on an EC2 GPU instance. You can also run a Jupyter notebook on the EC2 machine; there is a good tutorial on how to connect to a Jupyter notebook running on an EC2 instance.
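The essential step is forwarding the notebook port over SSH. A minimal sketch (the key file, user name, and host below are hypothetical placeholders for your own values):

# from your local machine: tunnel port 8888 to the instance
ssh -i mykey.pem -L 8888:localhost:8888 ubuntu@<ec2-public-dns>
# on the instance: start the notebook server without opening a browser
jupyter notebook --no-browser --port 8888
# then open http://localhost:8888 in your local browser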

Set Up an EC2 GPU Instance from Scratch

The Deep Learning Base AMIs provide a foundational image with NVIDIA CUDA, cuDNN, GPU drivers, oneDNN, Docker, NVIDIA-Docker, and so on, for deploying your own custom deep learning environment. On a Deep Learning Base AMI, you can follow the MXNet Build From Source instructions directly.
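As a rough sketch, a classic make-based GPU build might look like the following (build flags vary by MXNet version, and newer releases build with CMake, so consult the Build From Source page for your exact version):

git clone --recursive https://github.com/apache/incubator-mxnet ~/mxnet
cd ~/mxnet
# enable OpenCV, CUDA, and cuDNN; adjust the CUDA path to match the AMI
make -j$(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1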

Set Up an EC2 GPU Cluster for Distributed Training

A cluster consists of multiple computers. You can use one computer with MXNet installed as the root computer for submitting jobs, and then launch several slave computers to run the jobs. For example, launch multiple instances using an AMI with dependencies installed. When launching the instances, take care of two things:

  • Make all slaves’ ports accessible (same for the root) by setting type: All TCP, Source: Anywhere in Configure Security Group.

  • Use the same .pem key as the root computer to access all slave computers, and then copy the .pem file to the root computer’s ~/.ssh/id_rsa. If you do this, all slave computers can be accessed with SSH from the root, as shown in the sketch after this list.
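A minimal sketch of the key setup (key file and user name are hypothetical; the IP is one of the slave addresses used below):

# from your local machine: install the key on the root computer
scp -i mykey.pem mykey.pem ubuntu@<root-public-dns>:~/.ssh/id_rsa
# on the root computer: restrict permissions, then verify password-less SSH to a slave
chmod 600 ~/.ssh/id_rsa
ssh -o StrictHostKeyChecking=no 172.30.0.172 uname -a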

Now, run the CNN on multiple computers. Assume that we are in a working directory on the root computer, such as ~/train, and that MXNet is built at ~/mxnet.

  1. Pack the MXNet Python library into this working directory for easy synchronization:
  cp -r ~/mxnet/python/mxnet .
  cp ~/mxnet/lib/libmxnet.so mxnet/

And then copy the training program:

  cp ~/mxnet/example/image-classification/*.py .
  cp -r ~/mxnet/example/image-classification/common .
  2. Prepare a hosts file with all slaves’ private IPs. For example, cat hosts:
  172.30.0.172
  172.30.0.171
  3. Assuming that there are two computers, train the CNN using two workers:
  ~/mxnet/tools/launch.py -n 2 -H hosts --sync-dst-dir /tmp/mxnet python train_mnist.py --kv-store dist_sync
  Here -n 2 starts two workers, -H hosts points to the hosts file, and --sync-dst-dir copies the working directory to /tmp/mxnet on each slave before launching.

Note: Sometimes the jobs linger on the slave computers even though you’ve pressed Ctrl-C at the root node. To terminate them, use the following command:

cat hosts | xargs -I{} ssh -o StrictHostKeyChecking=no {} 'uname -a; pgrep python | xargs kill -9'

Note: The preceding example is very simple to train and therefore isn’t a good benchmark for distributed training. Consider using other examples.

More Options

Use Multiple Data Shards

It is common to pack a dataset into multiple files, especially when working in a distributed environment. MXNet supports direct loading from multiple data shards. Put all of the record files into a folder, and point the data path to the folder.
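One hedged way to produce such shards is the --chunks option of tools/im2rec.py, which splits the packing list so that one record file is generated per chunk (the prefix and image folder below are hypothetical; run python ~/mxnet/tools/im2rec.py --help to confirm the flags for your version):

# build an image list split into 4 chunks, then pack one .rec shard per chunk
python ~/mxnet/tools/im2rec.py --list --recursive --chunks 4 mydata ./images
python ~/mxnet/tools/im2rec.py --num-thread 4 mydata ./images
# the resulting shards (mydata_0.rec, mydata_1.rec, ...) sit in one folder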

Use YARN and SGE

Although using SSH can be simple when you don’t have a cluster scheduling framework, MXNet is designed to be portable to various platforms. We provide scripts, available in tracker, that allow MXNet to run on other cluster frameworks, including Hadoop (YARN) and SGE. We welcome contributions from the community with examples of running MXNet on your favorite distributed platform.
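As a hedged sketch, tools/launch.py exposes this through its launcher option (flag names can differ across versions; run the script with --help to confirm):

# submit the same training job through YARN instead of plain SSH
~/mxnet/tools/launch.py -n 2 --launcher yarn python train_mnist.py --kv-store dist_sync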