AI Models Are Rising to Unprecedented Complexity
Large deep neural network (DNN) models with ever-growing complexity and modeling capacity have achieved unprecedented success across domains such as natural language, vision, and audio. However, training a large DNN model can easily exceed a single GPU's capacity:
- Limited GPU memory cannot hold the model's huge number of parameters.
- Limited computing power cannot finish training in a reasonable time.
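A back-of-envelope calculation makes the memory limitation concrete. The sketch below assumes the common mixed-precision Adam recipe of roughly 16 bytes of state per parameter (fp16 weights and gradients plus fp32 master weights, momentum, and variance); the 175-billion-parameter model size is an illustrative GPT-3-scale figure, not one taken from MindPipe itself.

```python
# Rough GPU memory estimate for training a large model with
# mixed-precision Adam (assumed: fp16 weights + fp16 grads
# + fp32 master weights + fp32 momentum + fp32 variance = 16 B/param).
def training_memory_gib(num_params: float, bytes_per_param: int = 16) -> float:
    """Return the parameter + optimizer-state footprint in GiB."""
    return num_params * bytes_per_param / 2**30

# A GPT-3-scale model (175e9 parameters) needs ~2,600 GiB of state,
# far beyond the 40-80 GiB of memory on a single modern GPU.
print(f"{training_memory_gib(175e9):.0f} GiB")
```

Even ignoring activations, such a model's training state is two orders of magnitude larger than one GPU's memory, which is why it must be sharded across many devices.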
MindPipe – 4D Parallel Training System
MindPipe, the first 4D parallel training system for large DNN models, has the following objectives:
- Greatly reducing load imbalance across GPU pipeline-parallel stages. [vPipe, IEEE TPDS 2021]
- Effectively resolving contention among the communication tasks of the three classic parallel dimensions (data, tensor, and pipeline parallelism).
- Deterministically scheduling the multiple subnets trained concurrently under supernet parallelism, a novel parallel dimension proposed by MindPipe. [NASPipe, ACM ASPLOS 2022]
- Automatically deriving a near-optimal 4D GPU configuration that balances DNN convergence efficiency and GPU utilization.
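To illustrate the load-imbalance problem the first objective targets: a pipeline's throughput is limited by its most heavily loaded stage, so stage boundaries should be placed to minimize that bottleneck. The sketch below is a textbook dynamic program over contiguous layer partitions, not MindPipe's or vPipe's actual algorithm; the per-layer costs are hypothetical.

```python
# Illustrative sketch (not MindPipe/vPipe's actual algorithm):
# split a chain of layers into contiguous pipeline stages so that
# the most loaded stage (the pipeline bottleneck) is as light as possible.
from itertools import accumulate

def balanced_partition(costs, num_stages):
    """DP over prefix sums: minimize the maximum per-stage cost."""
    prefix = [0] + list(accumulate(costs))
    n = len(costs)
    # best[k][i]: minimal bottleneck splitting the first i layers into k stages
    best = [[float("inf")] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0.0
    for k in range(1, num_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                bottleneck = max(best[k - 1][j], prefix[i] - prefix[j])
                if bottleneck < best[k][i]:
                    best[k][i], cut[k][i] = bottleneck, j
    # Recover stage boundaries by walking the cut table backwards.
    stages, i = [], n
    for k in range(num_stages, 0, -1):
        j = cut[k][i]
        stages.append(list(range(j, i)))
        i = j
    return stages[::-1], best[num_stages][n]

layers = [4, 1, 1, 6, 2, 2]  # hypothetical per-layer compute costs
stages, bottleneck = balanced_partition(layers, 3)
print(stages, bottleneck)    # bottleneck stage dominates pipeline throughput
```

A naive even-count split ([4,1], [1,6], [2,2]) has a bottleneck of 7, while the balanced split above achieves 6; real systems face the harder variant where layer costs vary with memory pressure and recomputation, which is what makes automatic rebalancing valuable.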