TechTalk – Optimizing Distributed Large Model Training in AI Clouds
December 12, 2024 (Thursday) 4:30-5:30pm
Distributed training across large numbers of devices has been widely adopted for training large deep learning models. Improving distributed training efficiency is critical for reducing the time, resource, and energy consumption of large-model training. In this talk, I will introduce recent research from my group on optimizing distributed training parallelisms for effective training acceleration and maximal resource utilization. In particular, we have designed optimized strategies and systems for operator sharding and computation/communication scheduling under SPMD parallelism (e.g., in Mixture-of-Experts model training), in both homogeneous and heterogeneous AI clusters, as well as dynamic micro-batching and pipelining to tackle sequence length variation in multi-task model training (e.g., Large Language Model training).
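To give a flavor of the dynamic micro-batching idea mentioned above, the sketch below packs variable-length sequences into micro-batches under a fixed token budget so that per-micro-batch work stays roughly balanced. This is only a minimal illustrative heuristic; the function names, the greedy packing policy, and the token-budget criterion are assumptions for exposition, not the speaker's actual strategy or system.

```python
# Illustrative sketch: token-budget-based dynamic micro-batching for
# variable-length sequences. All names and the greedy policy are assumed
# for illustration only.
from typing import List, Sequence


def dynamic_microbatches(seq_lengths: Sequence[int], token_budget: int) -> List[List[int]]:
    """Greedily pack sequence indices into micro-batches so that each
    micro-batch's padded token count (max length * batch size) stays within
    a fixed token budget."""
    # Sort indices by length so similar-length sequences share a micro-batch,
    # which reduces padding waste.
    order = sorted(range(len(seq_lengths)), key=lambda i: seq_lengths[i])
    batches: List[List[int]] = []
    current: List[int] = []
    for idx in order:
        candidate = current + [idx]
        max_len = seq_lengths[candidate[-1]]  # longest so far (sorted ascending)
        if current and max_len * len(candidate) > token_budget:
            batches.append(current)   # flush the full micro-batch
            current = [idx]
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    lengths = [128, 512, 96, 2048, 384, 1024, 256]
    for mb in dynamic_microbatches(lengths, token_budget=2048):
        print([lengths[i] for i in mb])
```

Under these assumptions, a pipeline scheduler could then assign the resulting micro-batches to stages so that stage execution times are more uniform despite sequence length variation.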