MindPipe: High-performance and Carbon-efficient Four-dimensional Parallel Training System for Large AI Models
MindPipe, the first 4D parallel training system for large DNN models, has four objectives:
1. Greatly reducing load imbalance across GPU pipeline-parallel stages;
2. Effectively resolving contention among 3D-parallel communication tasks;
3. Deterministically scheduling the multiple subnets trained under supernet parallelism, a novel parallel dimension proposed by MindPipe; and
4. Automatically finding a near-optimal 4D configuration of GPUs that accounts for both DNN convergence efficiency and GPU utilization.
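
To make the first objective concrete, the sketch below shows one common way to quantify pipeline load imbalance and to reduce it by choosing stage boundaries. It is an illustration only and is not MindPipe's algorithm: the per-layer costs, the function names (`stage_loads`, `imbalance`, `balanced_boundaries`), and the use of a dynamic-programming contiguous partition are all assumptions introduced here.

```python
from typing import List


def stage_loads(layer_costs: List[float], boundaries: List[int]) -> List[float]:
    """Sum layer costs per stage; `boundaries` holds the start index of stages 2..k."""
    edges = [0, *boundaries, len(layer_costs)]
    return [sum(layer_costs[a:b]) for a, b in zip(edges, edges[1:])]


def imbalance(loads: List[float]) -> float:
    """Slowest stage load divided by the mean stage load (1.0 is perfectly balanced)."""
    return max(loads) / (sum(loads) / len(loads))


def balanced_boundaries(layer_costs: List[float], num_stages: int) -> List[int]:
    """Contiguous partition of layers into stages that minimizes the max stage load.

    Assumes len(layer_costs) >= num_stages so every stage gets at least one layer.
    """
    n = len(layer_costs)
    prefix = [0.0]
    for cost in layer_costs:
        prefix.append(prefix[-1] + cost)

    inf = float("inf")
    # best[j][i]: smallest achievable max-stage load for the first i layers in j stages.
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    # cut[j][i]: index where the last of those j stages starts.
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0.0
    for j in range(1, num_stages + 1):
        for i in range(j, n + 1):
            for s in range(j - 1, i):
                candidate = max(best[j - 1][s], prefix[i] - prefix[s])
                if candidate < best[j][i]:
                    best[j][i] = candidate
                    cut[j][i] = s

    # Walk the recorded cuts backwards to recover the stage start indices.
    boundaries = []
    i = n
    for j in range(num_stages, 1, -1):
        i = cut[j][i]
        boundaries.append(i)
    return boundaries[::-1]


# Hypothetical per-layer costs (arbitrary units), split into 3 pipeline stages.
costs = [1, 2, 3, 4, 5, 6]
naive = stage_loads(costs, [2, 4])  # equal layer counts per stage
tuned = stage_loads(costs, balanced_boundaries(costs, 3))
```

With these hypothetical costs, splitting by equal layer count gives stage loads of 3, 7, and 11, while the balanced partition gives 6, 9, and 6. Because a pipeline advances at the pace of its slowest stage, lowering the maximum stage load is what shrinks idle time on the other GPUs.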