Accelerated Training with Pipeline Model Parallelism for Huge Deep Learning Models
To overcome the limitations of the naive model parallel solution seen in my previous article, pipeline model parallelism can be used: the model is divided into stages, and the stages are executed in parallel on different GPUs. This approach reduces communication overhead, since the intermediate results from one stage are fed directly into the next stage, limiting data transfer between GPUs. Pipeline parallelism can also improve memory utilization, because each GPU only needs to store the intermediate results for its portion of the model rather than for the entire model. Figure 1 illustrates the quasi-simultaneous execution and the resulting performance improvement when the input batch is split into 4 smaller micro-batches, in a training scenario distributed across 2 GPUs.
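To make the micro-batch idea concrete, here is a minimal, standalone sketch of the splitting step. The batch size of 120 and micro-batch size of 30 are illustrative values only, not the settings used in the experiments below:

import torch

# Illustrative only: a batch of 120 images split into 4 micro-batches of 30.
# Each micro-batch can then flow through the pipeline stages in a staggered
# fashion, so both GPUs can be kept busy at the same time.
x = torch.randn(120, 3, 224, 224)      # (batch, channels, height, width)
micro_batches = x.split(30, dim=0)     # tuple of 4 tensors of shape (30, 3, 224, 224)
print(len(micro_batches), micro_batches[0].shape)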
To perform Pipeline Model Parallelism with PyTorch, you will need to follow these guidelines:
1 - Split the model: this is done in the same way as previously, in Distributed Training in Large Deep Learning models with PyTorch Model Parallelism on a ResNet50 (a minimal sketch of this split is shown after the forward pass below).
2 - Forward pass: this is where pipeline parallelism is implemented; the input batch of data is divided into smaller micro-batches, and the micro-batches are processed in parallel by the two GPUs.
In the forward pass, the input batch of data (x) is split into smaller micro-batches using the split function, where the size of each micro-batch is specified by the split_size parameter. The first micro-batch is processed by the first GPU (dev0) and the output is sent to the second GPU (dev1).
For example:
def forward(self, x):
    # x contains a batch of input images as a single tensor;
    # split it into micro-batches of size self.split_size
    splits = iter(x.split(self.split_size, dim=0))
    s_next = next(splits)

    # initialisation:
    # - the first micro-batch goes through seq0 (on dev0)
    # - the output is sent to dev1
    s_prev = self.seq0(s_next).to(self.dev1)
    ret = []

    for s_next in splits:
        # A. s_prev runs on dev1
        s_prev = self.seq1(s_prev)
        ret.append(self.fc(s_prev.view(s_prev.size(0), -1)))

        # B. s_next runs on dev0, which can run concurrently with A
        s_prev = self.seq0(s_next).to(self.dev1)

    # the last micro-batch is still waiting on dev1
    s_prev = self.seq1(s_prev)
    ret.append(self.fc(s_prev.view(s_prev.size(0), -1)))

    return torch.cat(ret)
In detail, the processing of each micro-batch consists of two steps:
2.1 - The current micro-batch (s_prev) is processed by the second GPU (dev1) using the “seq1” layer and the final layer (fc).
2.2 - Meanwhile, the next micro-batch (s_next) is processed by the first GPU (dev0) using the “seq0” layer and the result is sent to the second GPU (dev1).
This process continues until all micro-batches have been processed; the per-micro-batch outputs are then concatenated and returned as the result of the forward pass.
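For completeness, here is a minimal sketch of the model split assumed above (step 1). It follows the structure of the ModelParallelResNet50 from my previous article and the PyTorch model parallel tutorial: the ResNet50 layers are grouped into two sequential stages, seq0 on the first GPU (dev0) and seq1 on the second GPU (dev1), with the final fully connected layer fc on dev1. The exact grouping of layers shown here is an assumption; what matters for the pipeline is that the two stages live on different devices:

import torch
import torch.nn as nn
from torchvision.models.resnet import ResNet, Bottleneck

num_classes = 1000

class ModelParallelResNet50(ResNet):
    def __init__(self, dev0='cuda:0', dev1='cuda:1'):
        super(ModelParallelResNet50, self).__init__(
            Bottleneck, [3, 4, 6, 3], num_classes=num_classes)
        self.dev0 = dev0
        self.dev1 = dev1

        # first stage of the pipeline, placed on dev0
        # (the split point chosen here is an assumption for illustration)
        self.seq0 = nn.Sequential(
            self.conv1, self.bn1, self.relu, self.maxpool,
            self.layer1, self.layer2
        ).to(self.dev0)

        # second stage of the pipeline, placed on dev1
        self.seq1 = nn.Sequential(
            self.layer3, self.layer4, self.avgpool
        ).to(self.dev1)

        # final classifier, also on dev1
        self.fc = self.fc.to(self.dev1)

    def forward(self, x):
        # naive model parallelism: one stage after the other, no overlap
        x = self.seq0(x.to(self.dev0)).to(self.dev1)
        x = self.seq1(x)
        return self.fc(x.view(x.size(0), -1))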
Testing it on a ResNet50 model:
The model developed for this training is:
class PipelineParallelResNet50(ModelParallelResNet50):
    def __init__(self, split_size=5, *args, **kwargs):
        super(PipelineParallelResNet50, self).__init__(*args, **kwargs)
        self.split_size = split_size

    def forward(self, x):
        splits = iter(x.split(self.split_size, dim=0))
        s_next = next(splits)

        # the first micro-batch goes through seq0 on dev0,
        # and its output is sent to dev1
        s_prev = self.seq0(s_next).to(self.dev1)
        ret = []

        for s_next in splits:
            # A. s_prev runs through seq1 and fc on dev1
            s_prev = self.seq1(s_prev)
            ret.append(self.fc(s_prev.view(s_prev.size(0), -1)))

            # B. s_next runs through seq0 on dev0, concurrently with A
            s_prev = self.seq0(s_next).to(self.dev1)

        # C. the last micro-batch still has to go through seq1 and fc on dev1
        s_prev = self.seq1(s_prev)
        ret.append(self.fc(s_prev.view(s_prev.size(0), -1)))

        return torch.cat(ret)
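The benchmark itself is not reproduced here, but the timing loop could look roughly like the following sketch. The batch size, the number of batches, the use of random synthetic data, CrossEntropyLoss, SGD, and timeit are all assumptions for illustration; they are not necessarily the exact settings behind the plot:

import timeit
import torch
import torch.nn as nn
import torch.optim as optim

num_classes = 1000
num_batches = 3
batch_size = 120

def train(model):
    model.train()
    loss_fn = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.001)

    for _ in range(num_batches):
        # synthetic data, for timing purposes only
        inputs = torch.randn(batch_size, 3, 224, 224)
        labels = torch.randint(0, num_classes, (batch_size,))

        # the inputs start on cuda:0 (dev0); the outputs come out on cuda:1 (dev1),
        # so the labels must live there as well
        outputs = model(inputs.to('cuda:0'))
        loss = loss_fn(outputs, labels.to('cuda:1'))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

elapsed = timeit.timeit(
    lambda: train(PipelineParallelResNet50(split_size=5)), number=1)
print(f"pipeline parallel training time: {elapsed:.1f}s")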
The results obtained are presented as follows:
The plot compares the different ResNet50 models in terms of training execution time, where the gain is again defined as the relative difference between the execution time of the parallel models and that of the sequential model. For the ResNet50 Parallel model, the gain is -40%, which means its performance is 40% worse than the sequential performance. For the ResNet50 Pipeline Parallel model, the gain is 20%, which means its performance is 20% better than the sequential performance on a single GPU.
It can be seen from the plot that the ResNet50 Pipeline Parallel model has the best performance compared to the other two models. On the other hand, the ResNet50 Parallel model has the worst performance. The results suggest that the ResNet50 Pipeline Parallel model is a better choice than the other two models for applications that require high performance and scalability.
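For reference, the gain figures can be computed from measured execution times as in the sketch below. The timings used here are hypothetical values chosen only to match the stated percentages, not actual measurements:

def gain(t_sequential, t_model):
    # positive gain = faster than the single-GPU sequential baseline,
    # negative gain = slower than it
    return (t_sequential - t_model) / t_sequential

# hypothetical timings, chosen only to illustrate the formula
print(gain(100.0, 140.0))  # -0.40 -> model parallel: 40% slower
print(gain(100.0, 80.0))   #  0.20 -> pipeline parallel: 20% faster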
In conclusion, pipeline parallelism can effectively reduce the training time of a model parallel DNN, and it is a better approach for achieving faster training than plain model parallelism.
Let’s connect:
Follow me on Medium: https://medium.com/@josephkettaneh