Tech Talk – dPRO: A Generic Performance Diagnosis and Optimization Toolkit for Expediting Distributed DNN Training

All members of the HKU community and the general public are welcome to join!
Speaker: Mr Hanpeng Hu, PhD Candidate, Department of Computer Science, HKU

Date: 9th August 2022 (Tuesday)

Time: 4:30pm

Mode: Mixed

About the Tech Talk

Speaker: Mr Hanpeng Hu, PhD Candidate, Department of Computer Science, Faculty of Engineering, HKU

Moderator: Mr Junwei Su, PhD Candidate, Department of Computer Science, Faculty of Engineering, HKU

Mode: Mixed (both face-to-face and online). Seats for on-site participants are limited. A confirmation email will be sent to participants who have successfully registered.

Language: English

Distributed training using multiple devices (i.e., GPU servers) has been widely adopted for learning DNN models over large datasets. However, the performance of large-scale distributed training tends to be far from linear speed-up in practice. Given the complexity of distributed systems, it is challenging to identify the root cause(s) of inefficiency and exercise effective performance optimizations when unexpected low training speed occurs. To date, there exists no software tool which diagnoses performance issues and helps expedite distributed DNN training, while the training can be run using different machine learning frameworks. This paper proposes dPRO, a toolkit that includes: (1) an efficient profiler that collects runtime traces of distributed DNN training across multiple frameworks, especially fine-grained communication traces, and constructs global data flow graphs including detailed communication operations for accurate replay; (2) an optimizer that effectively identifies performance bottlenecks and explores optimization strategies (from computation, communication and memory aspects) for training acceleration. We implement dPRO on multiple deep learning frameworks (PyTorch, TensorFlow, MXNet) and representative communication schemes (AllReduce and Parameter Server architecture). Extensive experiments show that dPRO predicts performance of distributed training in various settings with <5% errors in most cases and finds optimization strategies with up to 87.1% speed-up over the baselines.

Registration

The tech talk “dPRO: A Generic Performance Diagnosis and Optimization Toolkit for Expediting Distributed DNN Training” will be held at the Tam Wing Fan Innovation Wing Two (G/F, Run Run Shaw Building, HKU) on 9th August 2022 (Tuesday) at 4:30 pm.
Stay tuned for the registration.
Seats are limited. Zoom broadcast is available if the seating quota is full.
Registrants on the waiting list will be notified of the arrangement (seating, free-standing, or other) after the registration deadline.
Please read the Campus Access and HKU Vaccine Pass announcement (https://covid19.hku.hk/announcements/all/2022/04/13776/).

HKU members

Non-HKU members

About the speaker

Mr Hanpeng Hu

Mr Hanpeng Hu is currently pursuing a Ph.D. degree in the Department of Computer Science at The University of Hong Kong. His research interests include distributed Deep Neural Network training, machine learning diagnosis, and performance optimization. He obtained his bachelor’s degree from Huazhong University of Science and Technology in 2018.
