# Optimization

by

Rong Jin

Jun 12, 2019

## ICML 2019

### Distributed Learning with Sublinear Communication

In distributed statistical learning, N samples are split across m machines and a learner wishes to use minimal communication to learn as well as if the examples were on a single machine. This model has received substantial interest in machine learning due to its scalability and potential for parallel speedup. However, in the high-dimensional regime, where the number of examples is smaller than the number of features (the "dimension"), the speedup afforded by distributed learning may be overshadowed by the cost of communicating a single example. This paper investigates the following question: when is it possible to learn a d-dimensional model in the distributed setting with total communication sublinear in d? Starting with a negative result, we show that for learning the usual variants of (sparse or norm-bounded) linear models, no algorithm can obtain optimal error until communication is linear in dimension. Our main result is to show that by slightly relaxing the standard statistical assumptions for this setting we can obtain distributed algorithms that enjoy optimal error and communication logarithmic in dimension. Our upper bounds are based on a family of algorithms that combine mirror descent with randomized sparsification/quantization of iterates, and extend to the general stochastic convex optimization model.

### On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization

Recent developments in large-scale distributed machine learning applications, e.g., deep neural networks, benefit enormously from advances in distributed non-convex optimization techniques, e.g., distributed Stochastic Gradient Descent (SGD). A series of recent works study the linear speedup property of distributed SGD variants with reduced communication. The linear speedup property enables us to scale out computing capability by adding more computing nodes to the system.
The reduced communication complexity is desirable since communication overhead is often the performance bottleneck in distributed systems. Recently, momentum methods have been increasingly adopted for training machine learning models, and they often converge faster and generalize better. For example, many practitioners use distributed SGD with momentum to train deep neural networks with big data. However, it remains unclear whether any distributed momentum SGD possesses the same linear speedup property as distributed SGD and has reduced communication complexity. This paper fills the gap between practice and theory by considering a distributed communication-efficient momentum SGD method and proving its linear speedup property.

### Stochastic Gradient Push for Distributed Deep Learning

Distributed data-parallel algorithms aim to accelerate the training of deep neural networks by parallelizing the computation of large mini-batch gradient updates across multiple nodes. Approaches that synchronize nodes using exact distributed averaging (e.g., via All-Reduce) are sensitive to stragglers and communication delays. The PushSum gossip algorithm is robust to these issues, but only performs approximate distributed averaging. This paper studies Stochastic Gradient Push (SGP), which combines PushSum with stochastic gradient updates. We prove that SGP converges to a stationary point of smooth, non-convex objectives at the same sub-linear rate as SGD, that all nodes achieve consensus, and that SGP achieves a linear speedup with respect to the number of compute nodes. Furthermore, we empirically validate the performance of SGP on image classification and machine translation workloads. Our code, attached to the submission, will be made publicly available.
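To make the approximate-averaging idea concrete, here is a minimal single-process NumPy sketch of a PushSum gossip round: each node keeps a value and a weight, pushes half of each to a randomly chosen out-neighbor, and reads off the de-biased ratio. The node count, step count, and the uniformly random neighbor choice are illustrative assumptions, not the paper's exact communication topology.

```python
import numpy as np

def push_sum_average(values, steps=200, seed=0):
    """Toy PushSum gossip: each node i holds (x_i, w_i), w_i initialized to 1.

    Every round, node i keeps half of (x_i, w_i) and pushes the other half
    to one randomly chosen out-neighbor. Total mass sum(x) and sum(w) are
    conserved, and each de-biased ratio x_i / w_i converges to the global
    average of the initial values.
    """
    rng = np.random.default_rng(seed)
    n = len(values)
    x = np.asarray(values, dtype=float).copy()
    w = np.ones(n)
    for _ in range(steps):
        new_x, new_w = x / 2.0, w / 2.0          # half stays at the sender
        targets = rng.integers(0, n, size=n)      # random out-neighbor per node
        for i, j in enumerate(targets):
            new_x[j] += x[i] / 2.0
            new_w[j] += w[i] / 2.0
        x, w = new_x, new_w
    return x / w                                  # each entry ~ mean(values)

estimates = push_sum_average([1.0, 2.0, 3.0, 4.0])
print(estimates)  # every node's estimate is close to the true mean 2.5
```

In SGP, each node interleaves such gossip steps with local stochastic gradient updates on its own mini-batches, so no exact All-Reduce barrier is needed.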
### Collective Model Fusion for Multiple Black-Box Experts

Model fusion is a fundamental problem in collective machine learning (ML) where independent experts with heterogeneous learning architectures are required to combine expertise to improve predictive performance. This is particularly challenging in information-sensitive domains (e.g., medical records in health-care analytics) where experts do not have access to each other's internal architecture and local data. To address this challenge, this paper presents the first collective model fusion framework for multiple experts with heterogeneous black-box architectures. The proposed method addresses the following key issues: how black-box experts interact to understand the predictive behaviors of one another; how these understandings can be represented and shared efficiently among themselves; and how the shared understandings can be combined to generate a high-quality consensus prediction. The performance of the resulting framework is analyzed theoretically and demonstrated empirically on several datasets.

### Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex Optimization

Communication overhead is one of the key challenges that hinders the scalability of distributed optimization algorithms for training large neural networks. In recent years, there has been a great deal of research on alleviating communication cost by compressing the gradient vector or by using local updates and periodic model averaging. In this paper, we aim at developing communication-efficient distributed stochastic algorithms for non-convex optimization via effective data replication strategies. In particular, we show, both theoretically and practically, that by properly infusing redundancy into the training data and combining it with model averaging, it is possible to significantly reduce the number of communication rounds.
To be more precise, for a predetermined level of redundancy, the proposed algorithm samples mini-batches from redundant chunks of data from multiple workers when updating local solutions. As a byproduct, we also show that the proposed algorithm is robust to failures. Our empirical studies on the CIFAR10 and CIFAR100 datasets in a distributed environment complement our theoretical results.

### Trimming the ℓ1 Regularizer: Statistical Analysis, Optimization, and Applications to Deep Learning

We study high-dimensional estimators with the trimmed $\ell_1$ penalty, which leaves the h largest parameter entries penalty-free. While optimization techniques for this nonconvex penalty have been studied, the statistical properties have not yet been analyzed. We present the first statistical analyses for M-estimation, and characterize support recovery, $\ell_\infty$ and $\ell_2$ error of the trimmed $\ell_1$ estimates as a function of the trimming parameter h. Our results show different regimes based on how h compares to the true support size. Our second contribution is a new algorithm for the trimmed regularization problem, which has the same theoretical convergence rate as difference-of-convex (DC) algorithms, but in practice is faster and finds lower objective values. Empirical evaluation of $\ell_1$ trimming for sparse linear regression and graphical model estimation indicates that trimmed $\ell_1$ can outperform vanilla $\ell_1$ and non-convex alternatives. Our last contribution is to show that the trimmed penalty is beneficial beyond M-estimation, and yields promising results for two deep learning tasks: input structure recovery and network sparsification.

### Compressed Factorization: Fast and Accurate Low-Rank Factorization of Compressively-Sensed Data

What learning algorithms can be run directly on compressively-sensed data? In this work, we consider the question of accurately and efficiently computing low-rank matrix or tensor factorizations given data compressed via random projections. We examine the approach of first performing the factorization in the compressed domain, and then reconstructing the original high-dimensional factors from the recovered (compressed) factors. In both the matrix and tensor settings, we establish conditions under which this natural approach will provably recover the original factors. While it is well-known that random projections preserve a number of geometric properties of a dataset, our work can be viewed as showing that they can also preserve certain solutions of non-convex, NP-hard problems like non-negative matrix factorization. We support these theoretical results with experiments on synthetic data and demonstrate the practical applicability of compressed factorization on real-world gene expression and EEG time series datasets.

### Noisy Dual Principal Component Pursuit

Dual Principal Component Pursuit (DPCP) is a recently proposed non-convex optimization based method for learning subspaces of high relative dimension from noiseless datasets contaminated by as many outliers as the square of the number of inliers. Experimentally, DPCP has proved to be robust to noise and to outperform the popular RANSAC on 3D vision tasks such as road plane detection and relative pose estimation from three views.
This paper extends the global optimality and convergence theory of DPCP to the case of data corrupted by noise, and further demonstrates its robustness using synthetic and real data.

### Learning a Compressed Sensing Measurement Matrix via Gradient Unrolling

Linear encoding of sparse vectors is widely popular, but is commonly data-independent -- missing any possible extra (but a-priori unknown) structure beyond sparsity. In this paper we present a new method to learn linear encoders that adapt to data, while still performing well with the widely used $\ell_1$ decoder. The convex $\ell_1$ decoder prevents gradient propagation as needed in standard gradient-based training. Our method is based on the insight that unrolling the convex decoder into $T$ projected subgradient steps can address this issue. Our method can be seen as a data-driven way to learn a compressed sensing measurement matrix. We compare the empirical performance of 10 algorithms over 6 sparse datasets (3 synthetic and 3 real). Our experiments show that there is indeed additional structure beyond sparsity in the real datasets. Our method is able to discover it and exploit it to create excellent reconstructions with fewer measurements (by a factor of 1.1-3x) compared to the previous state-of-the-art methods. We illustrate an application of our method in learning label embeddings for extreme multi-label classification. Our experiments show that our method is able to match or outperform the precision scores of SLEEC, one of the state-of-the-art embedding-based approaches for extreme multi-label learning.

### Screening rules for Lasso with non-convex Sparse Regularizers

Leveraging the convexity of the Lasso problem, screening rules help accelerate solvers by discarding irrelevant variables during the optimization process.
However, several non-convex regularizers for the Lasso have been proposed in the literature because they provide better theoretical guarantees for identifying relevant variables. This work is the first to introduce a screening-rule strategy into a non-convex Lasso solver. The approach we propose is based on an iterative majorization-minimization (MM) strategy that includes a screening rule in the inner solver and a condition for propagating screened variables between MM iterations. In addition to improving the efficiency of solvers, we also provide guarantees that the inner solver is able to identify the zero components of its critical point in finite time. Our experimental analysis illustrates the significant computational gain brought by the new screening rule compared to classical coordinate-descent or proximal gradient descent methods.
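To illustrate what a screening rule does, here is a minimal NumPy sketch of one classical static safe rule for the convex Lasso (the SAFE test of El Ghaoui et al.). This is illustrative background only: it is not the non-convex, MM-embedded rule the paper proposes, and the formula is the standard convex-case test.

```python
import numpy as np

def safe_screen_lasso(X, y, lam):
    """Static SAFE screening for the convex Lasso
        min_w 0.5 * ||y - X w||_2^2 + lam * ||w||_1 .

    Returns a boolean mask; True means feature j's optimal coefficient is
    provably zero, so a solver can discard that column before iterating.
    Classical test: drop j when
        |x_j^T y| < lam - ||x_j||_2 * ||y||_2 * (lam_max - lam) / lam_max,
    where lam_max = max_j |x_j^T y| is the smallest lam giving the all-zero
    solution. (A toy sketch of the convex rule, not the paper's method.)
    """
    corr = np.abs(X.T @ y)                    # per-feature correlation |x_j^T y|
    lam_max = corr.max()
    col_norms = np.linalg.norm(X, axis=0)
    thresh = lam - col_norms * np.linalg.norm(y) * (lam_max - lam) / lam_max
    return corr < thresh

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))
y = rng.standard_normal(50)
lam_max = np.abs(X.T @ y).max()
mask = safe_screen_lasso(X, y, 0.9 * lam_max)
print(mask.sum(), "of", X.shape[1], "features screened out")
```

The closer lam is to lam_max, the more variables the test eliminates; the paper's contribution is making this kind of elimination sound inside an MM loop for non-convex penalties, where convexity can no longer be leveraged directly.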