TechTalk – Building Multi-dimensional Parallel Training Systems for Large AI Models
June 24, 2025 (Tuesday) 4:30-5:30pm
Large DNNs (e.g., Transformer and GPT) have achieved unprecedented success in various AI areas, including vision and natural language understanding, thanks to their growing modeling capacity. The high modeling power of a large DNN stems mainly from its increasing complexity (more neuron layers and more neuron operators in each layer) and dynamicity (frequently activating and deactivating neuron operators in each layer during training, as in Neural Architecture Search, or NAS). Dr. Cui’s talk will present his recent papers (e.g., [PipeMesh, under journal revision], [Fold3D TPDS 2023], [NASPipe ASPLOS 2022], and [vPipe TPDS 2021]), which address major limitations of existing multi-dimensional parallel training systems, including GPipe, PipeDream, and Megatron. Fold3D is now the major parallel training system for thousands of GPUs on the world-renowned MindSpore AI framework.