ML Seminar: Speeding up Distributed SGD via Communication-Efficient Model Aggregation
Large-scale machine learning training, and distributed stochastic gradient descent (SGD) in particular, needs to be robust to inherent system variability such as unpredictable computation and communication delays. This work considers a distributed SGD framework in which each worker node performs local model updates and the resulting models are averaged periodically. Our goal is to analyze and improve the true speed of error convergence with respect to wall-clock time rather than the number of iterations. For centralized model averaging, we propose a strategy called AdaComm that gradually increases the model-averaging frequency in order to strike the best error-runtime trade-off. For decentralized model averaging, we propose MATCHA, which uses matching decomposition sampling of the base graph to parallelize inter-worker information exchange and reduce communication delay. Experiments on training deep neural networks show that AdaComm and MATCHA can reach the same final training loss in 3x less time than fully synchronous SGD and vanilla decentralized SGD, respectively.
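To make the periodic-averaging setup concrete, here is a minimal single-process sketch, assuming a toy 1-D least-squares objective with noisy gradients. The names `local_sgd`, `decaying_tau`, `tau0`, and `halve_every` are hypothetical, and the halving schedule is only an illustrative stand-in for "gradually increasing the averaging frequency"; it is not the AdaComm update rule from the papers.

```python
import random


def decaying_tau(round_idx, tau0=8, halve_every=10):
    """Hypothetical schedule: halve the number of local steps every
    `halve_every` rounds, so models are averaged more and more often."""
    return max(1, tau0 // (2 ** (round_idx // halve_every)))


def local_sgd(num_workers=4, rounds=30, tau_schedule=None, lr=0.05, seed=0):
    """Simulate SGD with local updates and periodic model averaging.

    Every worker minimizes 0.5 * (w - target)^2 using noisy gradients.
    In each round a worker takes `tau` local steps, then all worker
    models are replaced by their average (one communication step).
    Returns a list of (tau, averaged_model) pairs, one per round.
    """
    rng = random.Random(seed)
    target = 3.0                      # assumed optimum of the toy objective
    models = [0.0] * num_workers      # all workers start from the same model
    history = []
    for r in range(rounds):
        tau = tau_schedule(r) if tau_schedule else 1
        for k in range(num_workers):
            for _ in range(tau):
                # stochastic gradient = true gradient + Gaussian noise
                grad = (models[k] - target) + rng.gauss(0.0, 0.5)
                models[k] -= lr * grad
        avg = sum(models) / num_workers
        models = [avg] * num_workers  # synchronize all workers
        history.append((tau, avg))
    return history


if __name__ == "__main__":
    for tau, avg in local_sgd(tau_schedule=decaying_tau)[::5]:
        print(f"tau={tau}  averaged model={avg:.3f}")
```

With `tau_schedule=None` the sketch reduces to averaging after every step, which corresponds to the fully synchronous baseline in this toy setting. Communication cost is not modeled here; a wall-clock comparison would need an assumed per-round communication delay on top of this loop.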
Based on joint work with Jianyu Wang, Anit Sahu, Zhouyi Yang, and Soummya Kar. Links to papers: https://arxiv.org/abs/1808.07576, https://arxiv.org/pdf/1810.08313.pdf, https://arxiv.org/abs/1905.09435