Pipeline parallelism as a bandaid on memory limitations
Training on multiple GPUs (ML Systems I)
If you’ve ever owned a device with too little memory, you know storage problems (I’m looking at you, 32GB iPhone kids). Large model training runs into roughly the same issue that teenage you did: there isn’t enough memory to save all its photos, er, parameters.
Some technical context: between the years 2014 and 2020, memory size on GPUs grew ~5x. In contrast, model parameter size grew >100x. Simply put, you cannot fit a modern model on a modern GPU — you have to spread it across many of them. How can you do this while still being able to train the model?
The solution: parallelism. Split the work of a forward pass across multiple GPUs (called workers), then sync and combine the results so that, together, they behave like a single forward pass. There are two common ways to do this:
Data parallelism: copy the same model onto many GPUs, give each GPU a different slice of the data, and sync and combine the gradients at the end so there is one shared update. This works best for models with few params and lots of data.
Model parallelism: split the model into chunks and give each chunk to a different GPU, with the same data flowing through all of them. As you can imagine, this means lots of communication between workers. It works best for models with many, many params, the GPTs of the world. (A minimal sketch of both flavors follows below.)
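To make the two flavors concrete, here is a minimal PyTorch sketch of my own (not from the paper): data parallelism replicates the whole model and splits the batch, while naive model parallelism places different layers on different GPUs and passes activations between them. The two-GPU setup and device names are assumptions for illustration.

```python
import torch
import torch.nn as nn

# --- Data parallelism (sketch): replicate the whole model on every GPU. ---
# Each replica sees a different slice of the batch; gradients are combined
# before the weight update. (DistributedDataParallel is the production tool;
# DataParallel keeps this sketch to a single process.)
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
dp_model = nn.DataParallel(model)      # replicas across all visible GPUs
out = dp_model(torch.randn(64, 512))   # the batch is split among replicas

# --- Naive model parallelism (sketch): split the layers across GPUs. ---
# Assumes two GPUs, "cuda:0" and "cuda:1"; activations hop between them.
class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Linear(512, 10).to("cuda:1")

    def forward(self, x):
        h = self.stage0(x.to("cuda:0"))     # GPU 1 idles while GPU 0 works...
        return self.stage1(h.to("cuda:1"))  # ...and GPU 0 idles during this.
```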
The fundamental problem with naive model parallelism: execution becomes effectively sequential.
Do you see why? If each worker has one layer, then workers must sit idle waiting for the previous worker to finish. In the above example, the grey GPU sits idle for 6 cycles.
So how do we fix this? We use an old classic from the computer science textbooks: good old pipelining.
Today, we’re reading “GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism”.
Introduction and Motivation
Goal: Reduce the wasted idle time in inter-layer model parallelism, as different workers wait for the output of the previous one.
Constraint: We cannot remove the dependency of the layers in a forward pass — that will always exist.
Solution: Split each mini-batch in the forward pass into many micro-batches that can be pipelined across the workers. This effectively combines data and model parallelism into one scheme.
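Here's a single-device toy sketch of the micro-batching idea (my own illustration, not GPipe's code): split the mini-batch into micro-batches, run them one at a time, and accumulate gradients so the final update matches one big batch. GPipe's trick is to stagger these micro-batches across the pipeline stages so that different workers are processing different micro-batches at the same moment.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

batch_x = torch.randn(32, 256)
batch_y = torch.randint(0, 10, (32,))
num_micro_batches = 4

opt.zero_grad()
# Split the mini-batch into micro-batches; in GPipe each of these would be
# injected into the pipeline one after another so every stage stays busy.
for micro_x, micro_y in zip(batch_x.chunk(num_micro_batches),
                            batch_y.chunk(num_micro_batches)):
    loss = loss_fn(model(micro_x), micro_y)
    # Scale so the accumulated gradients match one full-batch update.
    (loss / num_micro_batches).backward()
opt.step()  # one synchronous update per mini-batch, just like GPipe
```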
There are two big wins from this: each stage now handles a smaller chunk of work at a time (a shorter “clock period”, in pipelining terms), so idle gaps shrink, and workers stay busy for a larger fraction of each training step.
Development Details
The first thing I want to highlight is that there is still idle time. Note the “bubble” in the picture above. Given the constraints, there is no way to get rid of this, but micro-batching does help us decrease it.
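To put a rough number on that bubble: assuming K equally loaded stages, M equally sized micro-batches, and ignoring communication, the idle fraction of a GPipe-style schedule is about (K - 1) / (M + K - 1), which shrinks as you add micro-batches. A quick sketch:

```python
def bubble_fraction(num_stages: int, num_micro_batches: int) -> float:
    """Idle fraction of a GPipe-style schedule, assuming K equal-cost
    stages and M equal micro-batches (communication ignored)."""
    k, m = num_stages, num_micro_batches
    return (k - 1) / (m + k - 1)

print(bubble_fraction(4, 1))   # 0.75  -> naive model parallelism, mostly idle
print(bubble_fraction(4, 8))   # ~0.27
print(bubble_fraction(4, 32))  # ~0.09 -> more micro-batches, smaller bubble
```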
The GPipe library is also fairly nice — it does pipeline parallelism automatically! It also handles the communication between workers, such that you don’t need to worry about which micro-batch is being executed where.
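GPipe itself was released for TensorFlow (as part of Lingvo); to give a feel for what using a GPipe-style library looks like, here is roughly how the open-source PyTorch port torchgpipe wraps a model. The balance and chunks values are illustrative, the sketch assumes three visible GPUs, and the exact API may differ between versions.

```python
import torch.nn as nn
from torchgpipe import GPipe  # open-source PyTorch port of the GPipe design

# The model must be expressed as an nn.Sequential so it can be cut into stages.
model = nn.Sequential(
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 10),
)

# balance = how many layers each GPU gets; chunks = number of micro-batches.
# The wrapper places each partition on a GPU and handles the inter-stage
# communication and micro-batch scheduling for you.
model = GPipe(model, balance=[2, 2, 1], chunks=8)
```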
GPipe does another small optimization: it doesn’t store most of the activations from the forward pass, but recomputes them during the backward pass (the paper calls this re-materialization). Spending a little extra compute to avoid holding activations in memory is a classic trade-off, and it’s what lets each worker fit bigger partitions and more micro-batches.
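Today this usually goes by activation (gradient) checkpointing. In plain PyTorch you can get the same effect on a single device with torch.utils.checkpoint; a minimal sketch (not GPipe's internals):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A stack of 8 small blocks, checkpointed as 4 segments.
model = nn.Sequential(*[nn.Sequential(nn.Linear(512, 512), nn.ReLU())
                        for _ in range(8)])
x = torch.randn(16, 512, requires_grad=True)

# Only the inputs to each segment are kept during the forward pass; the
# activations inside a segment are recomputed during backward, trading a bit
# of extra compute for a much smaller activation-memory footprint.
out = checkpoint_sequential(model, 4, x)
out.sum().backward()
```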
Evaluation
A systems optimization is only as good as its performance impact! The first headline: they trained the biggest neural net at the time, 8B parameters (ha, cute). We also see accuracy steadily increasing with model size, suggesting that pipelining and micro-batching have not degraded training quality.
Here, on AmoebaNet, we see that increasing the pipeline depth increases training speedup roughly linearly as well.
Limitations and Future Work
The partitioning of the model onto the workers is completely heuristics-based. This is a suboptimal way of approaching the problem, and was improved on by Alpa a couple of years later.
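For intuition, here is a toy greedy partitioner of my own, not GPipe's actual algorithm (which balances estimated per-partition costs) and certainly not Alpa's search: it just walks the layers in order and cuts a new stage once the running cost passes an equal share.

```python
def greedy_partition(layer_costs: list, num_stages: int) -> list:
    """Toy heuristic: assign consecutive layers to a stage until that stage's
    cost reaches an equal share of the total, then start a new stage."""
    target = sum(layer_costs) / num_stages
    stages, current, current_cost = [], [], 0.0
    for i, cost in enumerate(layer_costs):
        current.append(i)
        current_cost += cost
        if current_cost >= target and len(stages) < num_stages - 1:
            stages.append(current)
            current, current_cost = [], 0.0
    stages.append(current)
    return stages

# e.g. 8 layers with uneven estimated costs split across 3 stages
print(greedy_partition([1, 1, 4, 2, 2, 1, 3, 2], 3))
```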
The evaluation suggests that the bigger models pipelining enables also improve transfer learning, demonstrated by gains when transferring from high-resource to low-resource languages. Further work could investigate whether this is indeed the case.
It was surprising to me that increasing the batch size improved pipelining performance. I suppose this makes sense: a larger batch lets each micro-batch stay appropriately sized (not too small) while still providing enough micro-batches to keep the pipeline well utilized. Further work could investigate the relationship between batch size and pipelining.
GPipe currently does not support intra-layer parallelism. That is, it assumes that any single layer of the neural net can fit on one GPU. This isn’t always true, and even when it is, splitting individual layers across devices could be leveraged for further speedup.
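For contrast, intra-layer (tensor) parallelism splits a single layer's weights across devices. A toy sketch of a column-split linear layer, with the two device names being assumptions for illustration:

```python
import torch
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    """Toy tensor parallelism: each GPU holds half of the output columns of
    one big Linear layer and computes its half of the output."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        assert out_features % 2 == 0
        self.half0 = nn.Linear(in_features, out_features // 2).to("cuda:0")
        self.half1 = nn.Linear(in_features, out_features // 2).to("cuda:1")

    def forward(self, x):
        y0 = self.half0(x.to("cuda:0"))
        y1 = self.half1(x.to("cuda:1"))
        # Gather both halves onto one device and stitch the columns together.
        return torch.cat([y0, y1.to("cuda:0")], dim=-1)
```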
In Summary
GPipe might be one of the most under-appreciated papers of 2020. It was one of the first to kick off the trend of systems optimizations that let models grow beyond a single GPU. As we’ll see in the next couple of editions of the Daily Ink, a lot of later work owes its origin to GPipe (and its equally under-appreciated predecessor, PipeDream).
You need only look at GPipe-style pipeline parallelism being a first-class citizen in PyTorch to appreciate the importance of this paper. A lot of the research we cover here is interesting and innovative, but few papers have such a strong claim to impact.
Until next time!