Accelerated Linear Convergence of Stochastic Momentum Methods in Wasserstein Distances Momentum methods such as Polyak's heavy ball (HB) method, Nesterov's accelerated gradient (AG) as well as accelerated projected gradient (APG) method have been commonly used in machine learning practice, but their performance is quite sensitive to noise in the gradients. We study these methods under a first-order stochastic oracle model where noisy estimates of the gradients are available. For strongly convex problems, we show that the distribution of the iterates of AG converges with the accelerated O(√κlog(1/ε))O(κlog⁡(1/ε)) linear rate to a ball of radius εε centered at a unique invariant distribution in the 1-Wasserstein metric where κκ is the condition number as long as the noise variance is smaller than an explicit upper bound we can provide. Our analysis also certifies linear convergence rates as a function of the stepsize, momentum parameter and the noise variance; recovering the accelerated rates in the noiseless case and quantifying the level of noise that can be tolerated to achieve a given performance. In the special case of strongly convex quadratic objectives, we can show accelerated linear rates in the pp-Wasserstein metric for any p≥1p≥1 with improved sensitivity to noise for both AG and HB through a non-asymptotic analysis under some additional assumptions on the noise structure. Our analysis for HB and AG also leads to improved non-asymptotic convergence bounds in suboptimality for both deterministic and stochastic settings which is of independent interest. To the best of our knowledge, these are the first linear convergence results for stochastic momentum methods under the stochastic oracle model. We also extend our results to the APG method and weakly convex functions showing accelerated rates when the noise magnitude is sufficiently small. SGD without Replacement: Sharper Rates for General Smooth Convex Functions We study stochastic gradient descent {\em without replacement} (\sgdwor) for smooth convex functions. \sgdwor is widely observed to converge faster than true \sgd where each sample is drawn independently {\em with replacement}~\cite{bottou2009curiously} and hence, is more popular in practice. But it's convergence properties are not well understood as sampling without replacement leads to coupling between iterates and gradients. By using method of exchangeable pairs to bound Wasserstein distance, we provide the first non-asymptotic results for \sgdwor when applied to {\em general smooth, strongly-convex} functions. In particular, we show that \sgdwor converges at a rate of $O(1/K^2)$ while \sgd~is known to converge at $O(1/K)$ rate, where $K$ denotes the number of passes over data and is required to be {\em large enough}. Existing results for \sgdwor in this setting require additional {\em Hessian Lipschitz assumption}~\cite{gurbuzbalaban2015random,haochen2018random}. For {\em small} $K$, we show \sgdwor can achieve same convergence rate as \sgd for {\em general smooth strongly-convex} functions. Existing results in this setting require $K=1$ and hold only for generalized linear models \cite{shamir2016without}. In addition, by careful analysis of the coupling, for both large and small $K$, we obtain better dependence on problem dependent parameters like condition number. On the Complexity of Approximating Wasserstein Barycenters We study the complexity of approximating Wassertein barycenter of mm discrete measures, or histograms of size nn by contrasting two alternative approaches using entropic regularization. The first approach is based on the Iterative Bregman Projections (IBP) algorithm for which our novel analysis gives a complexity bound proportional to mn2ε2mn2ε2 to approximate the original non-regularized barycenter. Using an alternative accelerated-gradient-descent-based approach, we obtain a complexity proportional to mn2εmn2ε. As a byproduct, we show that the regularization parameter in both approaches has to be proportional to \e\e, which causes instability of both algorithms when the desired accuracy is high. To overcome this issue, we propose a novel proximal-IBP algorithm, which can be seen as a proximal gradient method, which uses IBP on each iteration to make a proximal step. We also consider the question of scalability of these algorithms using approaches from distributed optimization and show that the first algorithm can be implemented in a centralized distributed setting (master/slave), while the second one is amenable to a more general decentralized distributed setting with an arbitrary network topology. Estimate Sequences for Variance-Reduced Stochastic Composite Optimization In this paper, we propose a unified view of gradient-based algorithms for stochastic convex composite optimization by extending the concept of estimate sequence introduced by Nesterov. This point of view covers the stochastic gradient descent method, variants of the approaches SAGA, SVRG, and has several advantages: (i) we provide a generic proof of convergence for the aforementioned methods; (ii) we show that this SVRG variant is adaptive to strong convexity; (iii) we naturally obtain new algorithms with the same guarantees; (iv) we derive generic strategies to make these algorithms robust to stochastic noise, which is useful when data is corrupted by small random perturbations. Finally, we show that this viewpoint is useful to obtain new accelerated algorithms in the sense of Nesterov. A Dynamical Systems Perspective on Nesterov Acceleration This article presents a dynamic system model describing Nesterov's accelerated gradient method. In contrast to earlier work, the derivation does not rely on a vanishing step size argument. It is shown that Nesterov's accelerated gradient method follows from discretizing an ordinary differential equation with a semi-implicit Euler integration scheme. We analyze both the corresponding differential equation as well as the discretization for obtaining insights into the phenomenon of acceleration. The analysis suggests that a curvature-dependent damping term lies at the heart of acceleration. We further establish connections between the discretized and the continuous-time dynamics. Random Shuffling Beats SGD after Finite Epochs A long-standing problem in stochastic optimization is proving that RandomShuffle, the without-replacement version of SGD, converges faster than the usual with-replacement SGD. Building upon Gurbuzbalaban et el, we present the first (to our knowledge) non-asymptotic results for this problem by proving that after a reasonable number of epochs RandomShuffle converges faster than SGD. Specifically, we prove that for strongly convex, second-order smooth functions, the iterates of RandomShuffle converge to the optimal solution as O(1/T^2+n^3/T^3), where n is the number of components in the objective, and T is number of iterations. This result implies that after O(\sqrt{n}) epochs, RandomShuffle is strictly better than SGD (which converges as O(1/T)). The key step toward showing this better dependence on T is the introduction of n into the bound; and as our analysis shows, in general a dependence on n is unavoidable without further changes. To understand how RandomShuffle works in practice, we further explore two empirically useful settings: data sparsity and over-parameterization. For sparse data, RandomShuffle has the rate O(1/T^2), again strictly better than SGD. Under a setting closely related to over-parameterization, RandomShuffle is shown to converge faster than SGD after any arbitrary number of iterations. Finally, we extend the analysis of RandomShuffle to smooth non-convex and convex functions. First-Order Algorithms Converge Faster than O(1/k) on Convex Problems It has been known for many years that both gradient descent and stochastic coordinate descent achieve a global convergence rate of O(1/k)O(1/k) in the objective value, when applied to a scheme for minimizing a Lipschitz-continuously differentiable, unconstrained convex function. In this work, we improve this rate to o(1/k)o(1/k). We extend the result to proximal gradient and proximal coordinate descent on regularized problems to show similar o(1/k)o(1/k) convergence rates. The result is tight in the sense that an O(1/k1+ϵ)O(1/k1+ϵ) rate is not generally attainable for any ϵ>0ϵ>0, for any of these methods. Improved Convergence for ℓ1ℓ1 and ℓ∞ℓ∞ Regression via Iteratively Reweighted Least Squares The iteratively reweighted least squares method (IRLS) is a popular technique used in practice for solving regression problems. Various versions of this method have been proposed, but their theoretical analyses failed to capture the good practical performance. In this paper we propose a simple and natural version of IRLS for solving $\ell\inftyandand\ell1regression,whichprovablyconvergestoaregression,whichprovablyconvergestoa(1+\epsilon)−approximatesolutionin−approximatesolutioninO(m^{1/3}\log(1/\epsilon)/\epsilon + \log(m/\epsilon)/\epsilon^2)iterations,whereiterations,wheremisthenumberofrowsoftheinputmatrix.Interestingly,thisrunningtimeisindependentoftheconditioningoftheinput,andthedominanttermoftherunningtimedependsonlylinearlyinisthenumberofrowsoftheinputmatrix.Interestingly,thisrunningtimeisindependentoftheconditioningoftheinput,andthedominanttermoftherunningtimedependsonlylinearlyin\epsilon^{-1}$, despite the fact that the problem it is solving is non-smooth, and the algorithm is not using any regularization. This improves upon the more complex algorithms of Chin et al. (ITCS '12), and Christiano et al. (STOC '11) by a factor of at least 1/ϵ5/31/ϵ5/3, and yields a truly efficient natural algorithm for the slime mold dynamics (Straszak-Vishnoi, SODA '16, ITCS '16, ITCS '17). Optimal Mini-Batch and Step Sizes for SAGA Recently it has been shown that the step sizes of a family of variance reduced gradient methods, the JacSketch methods, depend on the expected smoothness constant. In particular, if this expected smoothness constant could be calculated a priori, then one could safely set much larger step sizes which would result in a much faster convergence rate. We fill in this gap, and provide simple closed form expressions for the expected smoothness constant and careful numerical experiments verifying these bounds. Using these bounds, and since the SAGA algorithm is part of this JacSketch family, we suggest a new standard practice for setting the step sizes and mini-batch size for SAGA that are competitive with a numerical grid search. Furthermore, we can now show that the total complexity of the SAGA algorithm decreases linearly in the mini-batch size up to a pre-defined value: the optimal mini-batch size. This is a rare result in the stochastic variance reduced literature, only previously shown for the Katyusha algorithm. Finally we conjecture that this is the case for many other stochastic variance reduced methods and that our bounds and analysis of the expected smoothness constant is key to extending these results. Differential Inclusions for Modeling Nonsmooth ADMM Variants: A Continuous Limit Theory Recently, there has been a great deal of research attention on understanding the convergence behavior of first-order methods. One line of this research focuses on analyzing the convergence behavior of first-order methods using tools from continuous dynamical systems such as ordinary differential equation and different inclusions. These research results shed lights on better understanding first-order methods from a non-optimization point of view. The alternating direction method of multipliers (ADMM) is a widely used first-order method for solving optimization problems with separable structure in the variable and objective function, and it is important to investigate its behavior using these new techniques from dynamical systems. Existing works along this line have been mainly focusing on problems with smooth objective functions, which excludes many important applications that are traditionally solved by ADMM variants. In this paper, we analyze some well-known and widely used ADMM variants for nonsmooth optimization problems using the tools of differential inclusions. In particular, we analyze the convergence behavior of linearized ADMM and gradient-based ADMM for nonsmooth problems. We anticipate that these results will provide new insights on understanding ADMM for solving nonsmooth problems.

Watch SlidesLive on mobile devices

© SlidesLive Inc.