Tech Talk – dPRO: A Generic Performance Diagnosis and Optimization Toolkit for Expediting Distributed DNN Training

All members of the HKU community and the general public are welcome to join!
Speaker: Mr Hanpeng Hu, PhD Candidate, Department of Computer Science, HKU
Date: 9th August 2022 (Tuesday)
Time: 4:30pm
Mode: Mixed
About the Tech Talk
Speaker: Mr Hanpeng Hu, PhD Candidate, Department of Computer Science, Faculty of Engineering, HKU
Moderator: Mr Junwei Su, PhD Candidate, Department of Computer Science, Faculty of Engineering, HKU
Mode: Mixed (both face-to-face and online). Seats for on-site participants are limited. A confirmation email will be sent to participants who have successfully registered.
Language: English

Distributed training using multiple devices (i.e., GPU servers) has been widely adopted for learning DNN models over large datasets. However, the performance of large-scale distributed training tends to be far from linear speed-up in practice. Given the complexity of distributed systems, it is challenging to identify the root cause(s) of inefficiency and exercise effective performance optimizations when unexpectedly low training speed occurs. To date, there exists no software tool which diagnoses performance issues and helps expedite distributed DNN training, while the training can be run using different machine learning frameworks. This paper proposes dPRO, a toolkit that includes: (1) an efficient profiler that collects runtime traces of distributed DNN training across multiple frameworks, especially fine-grained communication traces, and constructs global data flow graphs including detailed communication operations for accurate replay; (2) an optimizer that effectively identifies performance bottlenecks and explores optimization strategies (from computation, communication and memory aspects) for training acceleration. We implement dPRO on multiple deep learning frameworks (PyTorch, TensorFlow, MXNet) and representative communication schemes (AllReduce and Parameter Server architecture). Extensive experiments show that dPRO predicts performance of distributed training in various settings with < 5% errors in most cases and finds optimization strategies with up to 87.1% speed-up over the baselines.

This figure shows the architecture of dPRO, a toolkit dedicated to performance diagnosis and optimization for distributed DNN training. The toolkit can work with different ML frameworks and communication frameworks.
The figure shows that the replayer in dPRO can accurately simulate the performance of distributed DNN training jobs with different ML workloads, ML frameworks, communication frameworks and network connections.
Based on the accurate replayer, we design an optimizer that automatically searches for effective optimizations, e.g., operator fusion, tensor fusion and tensor partition. The following figure shows that dPRO’s optimizer performs the best in most cases and achieves up to 62.95% speed-up compared with the baselines.
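To make the replay idea above more concrete, here is a minimal sketch in Python of how a data flow graph with known per-operation durations could be replayed to estimate iteration time. It is an illustration only and is not dPRO's implementation: the function name `replay`, the operation names, the device labels and all durations are hypothetical, and the scheduling rule (one operation at a time per device, processed in simple dependency order) is a simplifying assumption.

```python
from collections import defaultdict, deque


def replay(ops, deps):
    """Estimate the iteration time of a toy data flow graph.

    ops:  {name: (device, duration_ms)}, hypothetical profiled durations
    deps: {name: [predecessor names]}
    Assumption: each device runs one op at a time, and an op starts once
    all of its predecessors have finished and its device is free.
    """
    indegree = {name: len(deps.get(name, [])) for name in ops}
    successors = defaultdict(list)
    for name, preds in deps.items():
        for pred in preds:
            successors[pred].append(name)

    ready = deque(name for name, deg in indegree.items() if deg == 0)
    device_free = defaultdict(float)  # time at which each device becomes idle
    finish = {}                       # finish time of each replayed op

    while ready:
        name = ready.popleft()
        device, duration = ops[name]
        start = max([device_free[device]] + [finish[p] for p in deps.get(name, [])])
        finish[name] = start + duration
        device_free[device] = finish[name]
        for succ in successors[name]:
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)

    if len(finish) != len(ops):
        raise ValueError("dependency cycle detected")
    return max(finish.values()), finish


# Two hypothetical workers that synchronise gradients with one AllReduce.
ops = {
    "w0.forward": ("gpu0", 10), "w0.backward": ("gpu0", 20),
    "w1.forward": ("gpu1", 10), "w1.backward": ("gpu1", 25),
    "allreduce": ("network", 15),
    "w0.update": ("gpu0", 5), "w1.update": ("gpu1", 5),
}
deps = {
    "w0.backward": ["w0.forward"],
    "w1.backward": ["w1.forward"],
    "allreduce": ["w0.backward", "w1.backward"],
    "w0.update": ["allreduce"],
    "w1.update": ["allreduce"],
}

total, finish = replay(ops, deps)
print(total)  # 55.0
```

In this toy graph the slower worker's backward pass delays the AllReduce, so the estimated iteration time is set by the path through `w1.backward`. Changing a duration or merging two operations and replaying again is the kind of what-if question a replayer lets an optimizer ask without rerunning the training job.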
  • The tech talk “dPRO: A Generic Performance Diagnosis and Optimization Toolkit for Expediting Distributed DNN Training” will be held at the Tam Wing Fan Innovation Wing Two (G/F, Run Run Shaw Building, HKU) on 9th August 2022 (Tuesday) at 4:30 pm.
  • Stay tuned for the registration.
  • Seats are limited. Zoom broadcast is available if the seating quota is full. 
  • Registrants on the waiting list will be notified of the arrangement (seating, free-standing or other) after the registration deadline.
  • Please read the Campus Access and HKU Vaccine Pass (
About the speaker

Mr Hanpeng Hu

Mr Hanpeng Hu is currently pursuing a Ph.D. degree in the Department of Computer Science at The University of Hong Kong. His research interests include distributed Deep Neural Network training, machine learning diagnosis and performance optimization. He obtained his bachelor’s degree from Huazhong University of Science and Technology in 2018.
