Almost surely constrained convex optimization We propose a stochastic gradient framework for solving stochastic composite convex optimization problems with (possibly) infinite number of linear inclusion constraints that need to be satisfied almost surely. We use smoothing and homotopy techniques to handle constraints without the need for matrix-valued projections. We show for our stochastic gradient algorithm φ(log(k)/√k) convergence rate for general convex objectives and φ(log(k)/k) convergence rate for restricted strongly convex objectives. These rates are known to be optimal up to logarithmic factors, even without constraints. We demonstrate the performance of our algorithm with numerical experiments on basis pursuit, a hard margin support vector machines and a portfolio optimization and show that our algorithm achieves state-of-the-art practical performance. Generalized Majorization-Minimization Non-convex optimization is ubiquitous in machine learning. Majorization-Minimization (MM) is a powerful iterative procedure for optimizing non-convex functions that works by optimizing a sequence of bounds on the function. In MM, the bound at each iteration is required to touch the objective function at the optimizer of the previous bound. We show that this touching constraint is unnecessary and overly restrictive. We generalize MM by relaxing this constraint, and propose a new optimization framework, named Generalized Majorization-Minimization (G-MM), that is more flexible. For instance, G-MM can incorporate application-specific biases into the optimization procedure without changing the objective function. We derive G-MM algorithms for several latent variable models and show empirically that they consistently outperform their MM counterparts in optimizing non-convex objectives. In particular, G-MM algorithms appear to be less sensitive to initialization. On the Computation and Communication Complexity of Parallel SGD with Dynamic Batch Sizes for Stochastic Non-Convex Optimization For SGD based distributed stochastic optimization, computation complexity, measured by the convergence rate in terms of the number of stochastic gradient access, and communication complexity, measured by the number of inter-node communication rounds, are the most important two performance metrics. The classical data-parallel implementation of SGD over N workers can achieve a linear speedup of its convergence rate but incurs an inter-node communication round at each batch. We study the benefit of using dynamically increasing batch sizes in parallel SGD for stochastic non-convex optimization by charactering the attained convergence rate and the required number of communication rounds. We show that for stochastic non-convex optimization under the P-L condition, the classical data parallel SGD with exponentially increasing batch sizes can achieve the fastest known O(1/(NT)) convergence with linear speedup using only log(T) communication rounds. For general stochastic non-convex optimization, we propose a Catalyst-like algorithm that achieves the fastest known O(1/) √NT) convergence with linear speedup using only O((√NT)log(T/N)) communication rounds. Simple Stochastic Gradient Methods for Non-Smooth Non-Convex Regularized Optimization Our work focuses on stochastic gradient methods for optimizing a smooth non-convex loss function with a non-smooth non-convex regularizer. Research on this class of problem is quite limited, and until very recently no non-asymptotic convergence results have been reported. We present two simple stochastic gradient algorithms, for finite-sum and general stochastic optimization problems, which have superior convergence complexities compared to the current state of the art. We also demonstrate superior performance of our algorithms in practice for empirical risk minimization on well known datasets. Surrogate Losses for Online Learning of Stepsizes in Stochastic Non-Convex Optimization Stochastic Gradient Descent (SGD) has played a central role in machine learning. However, it requires a carefully hand-picked stepsize for fast convergence, which is notoriously tedious and time-consuming to tune. Over the last several years, a plethora of adaptive gradient-based algorithms have emerged to ameliorate this problem. In this paper, we propose new surrogate losses to cast the problem of learning the optimal stepsizes for the stochastic optimization of a non-convex smooth objective function onto an online convex optimization problem. This allows the use of no-regret online algorithms to compute optimal stepsizes on the fly. In turn, this results in a SGD algorithm with self-tuned stepsizes that guarantees convergence rates that are automatically adaptive to the level of noise. Efficient Dictionary Learning with Gradient Descent Randomly initialized first-order optimization algorithms are the method of choice for solving many high-dimensional nonconvex problems in machine learning, yet general theoretical guarantees cannot rule out convergence to critical points of poor objective value. For some highly structured nonconvex problems however, the success of gradient descent can be understood by studying the geometry of the objective. We study one such problem -- complete orthogonal dictionary learning, and provide converge guarantees for randomly initialized gradient descent to the neighborhood of a global optimum. The resulting rates scale as low order polynomials in the dimension even though the objective possesses an exponential number of saddle points. This efficient convergence can be viewed as a consequence of negative curvature normal to the stable manifolds associated with saddle points, and we provide evidence that this feature is shared by other nonconvex problems of importance as well. Plug-and-Play Methods Provably Converge with Properly Trained Denoisers Plug-and-play (PnP) is a non-convex framework that integrates modern denoising priors, such as BM3D or deep learning-based denoisers, into ADMM or other proximal algorithms. An advantage of PnP is that one can use pre-trained denoisers when there is not sufficient data for end-to-end training. Although PnP has been recently studied extensively and has exhibited great empirical results, theoretical analysis addressing even the most basic question of convergence has been insufficient. In this paper, we theoretically establish convergence of PnP-FBS and PnP-ADMM, without using diminishing stepsizes, under a certain Lipschitz condition on the denoisers. We then propose a technique, which we call real spectral normalization, to train deep learning-based denoisers that satisfy the proposed Lipschitz condition. Finally, we present experimental results that validate the theory. Riemannian adaptive stochastic gradient algorithms on matrix manifolds Adaptive stochastic gradient algorithms in the Euclidean space have attracted much attention lately. Such explorations on Riemannian manifolds, on the other hand, are relatively new, limited, and challenging. This is because of the intrinsic non-linear structure of the underlying manifold and the absence of a canonical coordinate system. In machine learning applications, however, most manifolds of interest are represented as matrices with notions of row and column subspaces. In addition, the implicit manifold-related constraints may also lie on such subspaces. For example, the Grassmann manifold is the set of column subspaces. To this end, such a rich structure should not be lost by transforming matrices to just a stack of vectors while developing optimization algorithms on manifolds. We propose novel stochastic gradient algorithms for problems on Riemannian manifolds by adapting the row and column subspaces of gradients. Our algorithms are provably convergent and they achieve the convergence rate of order O(log(T)/sqrt(T)), where T is the number of iterations. Our experiments illustrate that the proposed algorithms outperform existing Riemannian adaptive stochastic algorithms. Stochastic Optimization for DC Functions and Non-smooth Non-convex Regularizers with Non-asymptotic Convergence Difference of convex (DC) functions cover a broad family of non-convex and possibly non-smooth and non-differentiable functions, and have wide applications in machine learning and statistics. Although deterministic algorithms for DC functions have been extensively studied, stochastic optimization that is more suitable for learning with big data remains under-explored. In this paper, we propose new stochastic optimization algorithms and study their first-order convergence theories for solving a broad family of DC functions. We improve the existing algorithms and theories of stochastic optimization for DC functions from both practical and theoretical perspectives. Moreover, we extend the proposed stochastic algorithms for DC functions to solve problems with a general non-convex non-differentiable regularizer, which does not necessarily have a DC decomposition but enjoys an efficient proximal mapping. To the best of our knowledge, this is the first work that gives the first non-asymptotic convergence for solving non-convex optimization whose objective has a general non-convex non-differentiable regularizer. Alternating Minimizations Converge to Second-Order Optimal Solutions This work studies the second-order convergence for both standard alternating minimization and proximal alternating minimization. We show that under mild assumptions on the (nonconvex) objective function, both algorithms avoid strict saddles almost surely from random initialization. Together with known first-order convergence results, this implies both algorithms converge to a second-order stationary point. This solves an open problem for the second-order convergence of alternating minimization algorithms that have been widely used in practice to solve large-scale nonconvex problems due to their simple implementation, fast convergence, and superb empirical performance.