HyperAI

PyTorch's Killer Weapon: Optimizing EDL in GPU Clusters With AdaptDL

Yang Bai

EDL stands for Elastic Deep Learning, a project incubated by the LF AI Foundation. It is a deep neural network training framework that can dynamically adjust the degree of parallelism, supports multi-tenant cluster management, balances model training wait and completion times, and improves resource utilization.

Training deep learning models is usually time-consuming and expensive in terms of compute, storage, and other resources.

Taking the BERT model as an example, the training process on the GPU often exceeds 2,000 hours, while training the ResNet and VGG models takes at least 100 hours.

At today's cloud computing prices, the cost of training a model can run to thousands or even tens of thousands of yuan. To keep training costs under control, shared clusters of pooled computing resources have emerged. In this article we introduce AdaptDL, developed by the Petuum CASL team, which greatly optimizes EDL on GPU clusters.

Challenges of shared clusters

With shared clusters, multiple users can each submit their own model training tasks.

This not only reduces the waste caused by over-provisioning computing resources, but also lets users tap idle resources, so complex models that would otherwise take days on a single workstation can be trained in just a few hours.

However, shared clusters also have some problems of their own.

  Typical challenges faced by shared clusters include:

1. Resource allocation: When multiple tasks share a cluster, GPU resources need to be allocated carefully. For example, training a model is much faster when all of its GPUs sit on the same machine than when they are spread across multiple machines. In addition, to avoid competition for network bandwidth, different distributed training tasks should be assigned GPUs on different machines.

2. Training speed and scalability vary: Choosing the right GPU configuration for a training task requires continuously monitoring the model's training speed and scalability, both of which change over time. In particular, larger batch sizes pay off most as training approaches convergence, so it is usually best to occupy fewer GPUs early in training.

3. Training configuration: Some important training settings must be decided before we know which GPUs will be allocated, which is not always possible in a shared cluster. For example, the batch size and learning rate are often chosen based on the number of GPUs, and if the GPUs are known to be on different machines, gradient accumulation can be used to overcome network bottlenecks.

4. Fairness and availability: During peak GPU usage, some users have to wait in line for idle GPUs, while users whose tasks are already running may want additional GPUs to speed them up. How should the tension between the two be balanced and resolved?

AdaptDL simplifies and accelerates model training on local machines and shared clusters

AdaptDL solves the problems of shared clusters

To address these shortcomings of pooled computing and shared clusters, the Petuum CASL team created AdaptDL to simplify and accelerate distributed training on shared clusters.

AdaptDL is a resource-adaptive deep learning (DL) training and scheduling framework. It can monitor the performance of training tasks in real time and flexibly adjust resource allocation (such as GPUs, computing instances, etc.) during task execution.

It addresses the problems mentioned above in shared clusters and has the following advantages:

1. Improve the utilization of shared GPU clusters: AdaptDL analyzes all model training tasks and learns how different tasks perform under different GPU resource configurations. Using this knowledge, the AdaptDL scheduler can allocate GPU resources to different training tasks fairly and efficiently. As more training tasks run and their performance characteristics become better understood, AdaptDL learns to reconfigure GPUs flexibly.

2. Reduce the cost of cloud model training:AdaptDL can provision an appropriate number of GPU instances in the cloud to avoid unnecessary costs, and can also automatically scale the cluster when larger batch sizes are used in training.

3. Easily implement large-scale training: Using a larger batch size can speed up training on many GPUs, but applying it is not straightforward. If the batch size is too large for a given model, the reduced statistical efficiency can increase training time; if it is too small, the GPUs are not used effectively. AdaptDL can automatically select the appropriate batch size on shared clusters, in the cloud, and on local machines.

Models using AdaptDL take less time to train on average compared to Optimus and Tiresias

For each training task, AdaptDL can automatically adjust the batch size, learning rate, and gradient accumulation. On cloud platforms, it can also control the number of spot instances.

Experience at Petuum shows that training models on a shared cluster with AdaptDL completes 2-3 times faster on average, and costs 3 times less when using AWS spot instances.

Getting started

AdaptDL can be used in two modes.

1. Cluster scheduling:Allows running multiple tasks on a Kubernetes cluster. Using the AdaptDL Python library, the AdaptDL scheduler can be integrated into the PyTorch code to automatically select the optimal number of GPUs and training batch size.

2. Independent training:Train models with adaptive batch size and learning rate on any cluster or local multi-GPU machine. AdaptDL automatically figures out when to use a larger batch size to speed up model training.

  Training with the AdaptDL Python library:

The AdaptDL Python library simplifies PyTorch training code, making the batch size and learning rate adaptive with no additional configuration required. It can be installed with:

python3 -m pip install adaptdl

Taking PyTorch MNIST as an example, only a few lines of code need to be modified, as shown in the two steps below.

AdaptDL provides a distributed data-parallel interface similar to PyTorch's native one, making it easy to modify existing distributed training code.

Step 1:

Replace torch.utils.data.DataLoader with adaptdl.torch.AdaptiveDataLoader.

AdaptiveDataLoader automatically selects the best batch size during training based on the program's throughput and statistical efficiency. It also saves its state during checkpointing, so training can resume where it left off after a restart.

Calling train_loader.autoscale_batch_size(1024) lets AdaptDL automatically select the most effective batch size for training, with a maximum global batch size of 1024 across all training processes.
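
A minimal sketch of this step (the MNIST dataset, data path, and local batch size are placeholders chosen for illustration; the AdaptiveDataLoader and autoscale_batch_size calls are the ones described above):

import adaptdl.torch
from torchvision import datasets, transforms

# Any PyTorch Dataset works here; MNIST matches the example above.
train_dataset = datasets.MNIST("./data", train=True, download=True,
                               transform=transforms.ToTensor())

# Drop-in replacement for torch.utils.data.DataLoader.
train_loader = adaptdl.torch.AdaptiveDataLoader(
    train_dataset, batch_size=64, shuffle=True)

# Let AdaptDL pick the most efficient batch size automatically, up to a
# global batch size of 1024 summed across all training processes.
train_loader.autoscale_batch_size(1024)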

Step 2:

Wrap the model with adaptdl.torch.AdaptiveDataParallel.

During training, adaptdl.torch.AdaptiveDataParallel computes the gradient noise scale, which is used to estimate statistical efficiency. When the batch size changes, AdaptiveDataParallel automatically adjusts the learning rate according to a scaling rule.
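
For readers unfamiliar with the term, the gradient noise scale (as defined by McCandlish et al. in "An Empirical Model of Large-Batch Training", the quantity this statistical-efficiency estimate builds on) compares the variance of per-example gradients to the magnitude of the true gradient. A rough sketch in LaTeX notation, where \Sigma is the per-example gradient covariance and G the full-batch gradient:

B_{\mathrm{noise}} \approx \frac{\operatorname{tr}(\Sigma)}{\lVert G \rVert^{2}}

Intuitively, once the global batch size grows well beyond B_noise, each additional example contributes little new information, so statistical efficiency drops off; this is, roughly, the signal AdaptDL uses when deciding whether a larger batch size (and more GPUs) is worthwhile. Note this is background intuition rather than AdaptDL's exact internal formula.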

By default, AdaptiveDataParallel uses the AdaScale algorithm, which performs well across a variety of tasks.

During checkpointing, AdaptiveDataParallel automatically saves the model parameters, optimizer state, and LR scheduler state, and restores them automatically when training restarts.
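
Putting both steps together, a minimal sketch of the resulting training loop (the model, optimizer, scheduler, and epoch count are placeholders chosen for illustration; init_process_group and remaining_epochs_until are AdaptDL helpers for elastic restarts, and train_loader is the AdaptiveDataLoader from the previous sketch — this is not the full MNIST example from the AdaptDL docs):

import torch
import torch.nn.functional as F
import adaptdl.torch

# Placeholder model, optimizer, and LR scheduler.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(784, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

# Initialize the elastic process group ("nccl" would be used on GPUs).
adaptdl.torch.init_process_group("gloo")

# Wrap the model so AdaptDL can track the gradient noise scale, rescale the
# learning rate when the batch size changes, and checkpoint model, optimizer,
# and scheduler state across restarts.
model = adaptdl.torch.AdaptiveDataParallel(model, optimizer, lr_scheduler)

# remaining_epochs_until resumes from the last completed epoch after a restart.
# train_loader is the AdaptiveDataLoader from the previous sketch.
for epoch in adaptdl.torch.remaining_epochs_until(10):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(inputs), targets)
        loss.backward()
        optimizer.step()
    lr_scheduler.step()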

With the above changes, users can run the training code on a local machine or on a distributed cluster. AdaptDL selects the right batch size and learning rate for faster distributed training, and automatically performs gradient accumulation to overcome network bottlenecks.

Comparison of YOLOv3 training with an adaptive batch size versus a manually chosen one; the adaptive batch size shows a clear advantage in training time.

Without AdaptDL, choosing a batch size that is too small prolongs training because the GPUs are not fully utilized, while choosing one that is too large requires more epochs to converge, which also prolongs training. By comparison, AdaptDL automatically achieves better training performance without requiring a fixed batch size to be chosen up front.

  Cluster management with AdaptDL scheduler:

The AdaptDL scheduler automatically determines the GPU resources used by each training task, making training on shared clusters smarter.

Thanks to this elasticity, a training task expands to use additional GPUs when the cluster has spare capacity, and shrinks to use fewer GPUs when cluster utilization is high, instead of being paused.

The AdaptDL scheduler also provides other features, such as organizing the cluster to avoid network contention between different tasks and maintaining fairness between competing training tasks.

Thanks to the coordination between the scheduler and each training task, AdaptDL can keep the shared cluster efficiently utilized.

When a task can make effective use of a larger batch size, AdaptDL automatically shifts more GPUs to it to speed up training. Conversely, when only a smaller batch size can be used effectively, idle GPUs are reallocated to other tasks.

The AdaptDL scheduler can be installed on any Kubernetes instance with a single Helm command:

helm install adaptdl adaptdl-sched \
  --repo https://github.com/petuum/adaptdl/raw/helm-repo \
  --namespace adaptdl --create-namespace \
  --set docker-registry.enabled=true

After installing the AdaptDL scheduler, you can submit training tasks with the AdaptDL CLI. A training task initially runs on a single GPU and is then restarted several times with different numbers of GPUs, while AdaptDL works out the optimal number of GPUs to use. Throughout, AdaptDL always chooses the most effective batch size and adjusts the learning rate accordingly.

AdaptDL cluster tracking example

The colored bar chart shows the number of compute instances assigned to different tasks. AdaptDL can dynamically optimize the number of compute instances each task gets.

With AdaptDL, PyTorch training tasks run 2-3 times faster in shared clusters. In addition, the AdaptDL scheduler also supports AWS spot instances, which reduces costs by 3 times.

Finally, you can also use AdaptDL together with NNI to accelerate hyperparameter tuning workloads (see the AdaptDL + NNI post).

Project address:

https://github.com/petuum/adaptdl

This article is translated from the PyTorch Medium blog.