Dec 6, 2021
Gradient compression is a widely-established remedy to tackle the communication bottleneck in distributed training of large deep neural networks (DNNs). Under the error-feedback framework, Top-k sparsification, sometimes with k as little as 0.1% of the gradient size, enables training to the same model quality as the uncompressed case for a similar iteration count. We find that, from the optimization perspective, Top-k is the communication-optimal sparsifier given a per-iteration k element budget. We argue that to further the benefits of gradient sparsification, especially for DNNs, a different perspective is necessary, one that moves from per-iteration optimality to consider optimality for the entire training. We identify that the total error, the sum of the compression errors over all iterations, encapsulates sparsification throughout training. Then, we propose a communication-complexity model that captures minimizing the total error under a communication budget for the entire training. We find that the hard-threshold sparsifier, a variant of the Top-k sparsifier with k determined by a constant hard threshold, is the optimal sparsifier for this model. Motivated by this, we provide convex and non-convex convergence analyses for the hard-threshold sparsifier with error feedback. Unlike the Top-k sparsifier, we show that hard-threshold has the same asymptotic convergence and linear speedup property as SGD in the convex case, and is not affected by data heterogeneity in the non-convex case. Our diverse experiments on various DNNs and a logistic regression model demonstrate that the hard-threshold sparsifier is more communication-efficient than Top-k.
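As a rough sketch of the two compressors discussed in the abstract (not the authors' implementation; the function names and the threshold value `lam` are illustrative assumptions), the following compares Top-k and hard-threshold sparsification under error feedback, where the untransmitted residual is added back into the next gradient:

```python
import torch

def topk_sparsify(g: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest-magnitude entries of the flattened gradient g; zero the rest."""
    out = torch.zeros_like(g)
    idx = torch.topk(g.abs(), k).indices
    out[idx] = g[idx]
    return out

def hard_threshold_sparsify(g: torch.Tensor, lam: float) -> torch.Tensor:
    """Keep entries whose magnitude exceeds the constant threshold lam."""
    return torch.where(g.abs() > lam, g, torch.zeros_like(g))

def error_feedback_step(grad, error, sparsifier, **kwargs):
    """One error-feedback step: compress the error-corrected gradient and
    carry the compression error (the part not transmitted) to the next step."""
    corrected = grad + error
    compressed = sparsifier(corrected, **kwargs)
    return compressed, corrected - compressed

# Usage sketch: the error buffer starts at zero and accumulates what was not sent.
grad = torch.randn(1000)
error = torch.zeros_like(grad)
update, error = error_feedback_step(grad, error, hard_threshold_sparsify, lam=0.5)
```

With Top-k the number of transmitted entries is fixed per iteration, whereas the hard-threshold variant lets that number vary across iterations, which is what the total-error view of the paper exploits.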
Neural Information Processing Systems (NeurIPS) is a multi-track machine learning and computational neuroscience conference that includes invited talks, demonstrations, symposia and oral and poster presentations of refereed papers. Following the conference, there are workshops which provide a less formal setting.
Presentations with a similar topic, category, or speaker
Jiachen Sun, …
Hanna Tseran, …
Yash Pote, …