Title: Optimization Algorithms for Distributed Machine Learning
Author: Gauri Joshi
Publisher: Springer
Year: 2023
Pages: 137
Language: English
Format: pdf (true), epub
Size: 19.9 MB
Stochastic gradient descent (SGD) is the backbone of supervised Machine Learning training today. Classical SGD was designed to be run on a single computing node, and its error convergence with respect to the number of iterations has been extensively analyzed and improved in optimization and learning theory literature. However, due to the massive training datasets and models used today, running SGD at a single node can be prohibitively slow. This calls for distributed implementations of SGD, where gradient computation and aggregation are split across multiple worker nodes. Although parallelism boosts the amount of data processed per iteration, it exposes SGD to unpredictable node slowdown and communication delays stemming from variability in the computing infrastructure. Thus, there is a critical need to make distributed SGD fast, yet robust to system variability.
In this book, we will discuss state-of-the-art algorithms in large-scale Machine Learning that improve the scalability of distributed SGD via techniques such as asynchronous aggregation, local updates, quantization, and decentralized consensus. These methods reduce the communication cost in several different ways: asynchronous aggregation allows overlap between communication and local computation; local updates reduce the communication frequency, amortizing the communication delay across several iterations; quantization and sparsification methods reduce the per-iteration communication time; and decentralized consensus offers spatial communication reduction by allowing different nodes in a network topology to train models in parallel and average them with their neighbors.
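To make the local-updates idea concrete, here is a minimal single-process sketch of local-update SGD (sometimes called local SGD) on a toy least-squares problem. It is not code from the book: the problem, the parameter names (`local_steps`, `rounds`), and all numeric values are illustrative assumptions. Each worker takes several SGD steps on its own model copy before the copies are averaged, so communication happens once per round instead of once per gradient step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: minimize ||Xw - y||^2 / (2n).
# X, w_true, and the dimensions are arbitrary illustrative choices.
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true

def grad(w, idx):
    # Mini-batch gradient of the least-squares loss on rows `idx`.
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

def local_update_sgd(num_workers=4, rounds=50, local_steps=10, lr=0.05):
    # Each worker keeps its own model copy and takes `local_steps` SGD
    # steps before the copies are averaged (one communication round).
    # Total communications: `rounds`, instead of `rounds * local_steps`
    # for fully synchronous SGD -- the communication delay is amortized.
    workers = [np.zeros(5) for _ in range(num_workers)]
    for _ in range(rounds):
        for k in range(num_workers):
            for _ in range(local_steps):
                idx = rng.integers(0, len(X), size=16)
                workers[k] -= lr * grad(workers[k], idx)
        avg = np.mean(workers, axis=0)          # one model average per round
        workers = [avg.copy() for _ in range(num_workers)]
    return workers[0]

w = local_update_sgd()
```

In a real distributed implementation the inner worker loop would run in parallel and the averaging step would be an all-reduce; the serial loop here only mimics that structure.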
For each of the distributed SGD algorithms presented here, the book also provides an analysis of its convergence. However, unlike traditional optimization literature, we do not focus only on error-versus-iterations convergence, i.e., the iteration complexity. In distributed implementations, it is important to study error-versus-wallclock-time convergence, because the wallclock time taken to complete each iteration is affected by the synchronization and communication protocol. We model computation and communication delays as random variables and determine the expected wallclock runtime per iteration of the various distributed SGD algorithms presented in this book. By pairing this runtime analysis with the error convergence analysis, one can obtain a true comparison of the convergence speeds of different algorithms. The book advocates a system-aware philosophy, cognizant of computation, synchronization, and communication delays, toward the design and analysis of distributed Machine Learning algorithms.
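The flavor of this runtime analysis can be sketched with a small simulation, under the common modeling assumption (mine here, not a claim about the book's specific models) that per-worker gradient-computation times are i.i.d. Exp(1). Fully synchronous SGD waits for the slowest of n workers each iteration, so its expected per-iteration time is the harmonic number H_n; waiting for only the fastest k of n workers cuts that time at the cost of discarding straggler gradients.

```python
import numpy as np

rng = np.random.default_rng(1)

def sync_runtime(num_workers, num_iters):
    # Fully synchronous SGD: every iteration waits for the slowest of
    # `num_workers` i.i.d. Exp(1) computation times, so the expected
    # per-iteration time is the maximum order statistic, E[max] = H_n.
    delays = rng.exponential(1.0, size=(num_iters, num_workers))
    return delays.max(axis=1).mean()

def k_sync_runtime(num_workers, k, num_iters):
    # Wait only for the fastest k of n workers per iteration; the
    # per-iteration time is the k-th smallest of the n delays.
    delays = rng.exponential(1.0, size=(num_iters, num_workers))
    return np.sort(delays, axis=1)[:, k - 1].mean()

n = 8
t_sync = sync_runtime(n, 100_000)      # empirical mean, approx H_8 = 2.718
t_k = k_sync_runtime(n, 4, 100_000)    # much smaller: ignores stragglers
H_n = sum(1.0 / i for i in range(1, n + 1))  # E[max of n Exp(1)]
```

Pairing numbers like `t_sync` and `t_k` with each algorithm's error-per-iteration rate is what yields the error-versus-wallclock-time comparison the abstract describes.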