Tight Kernel Query Complexity of Kernel Ridge Regression and Kernel k-means Clustering Kernel methods generalize machine learning algorithms that only depend on the pairwise inner products of the dataset by replacing inner products with kernel evaluations, a function that passes input points through a nonlinear feature map before taking the inner product in a higher dimensional space. In this work, we present nearly tight lower bounds on the number of kernel evaluations required to approximately solve kernel ridge regression (KRR) and kernel k-means clustering (KKMC) on n input points. For KRR, our bound for relative error approximation the argmin f the objective function is $\Omega(nd{\mathrm{eff}}^\lambda/\varepsilon)wheed{\mathrm{eff}}^\lambdaistheeffectivestatisticaldimension,tightuptoa\log(d_{\mathrm{eff}}^\lambda/\varepsilon)factor.ForKKMC,ourboundforfindingak−clusteringachievingarelativeerrorapproximationoftheobjectivefunctionis\Omega(nk/\varepsilon),tightuptoa\log(k/\varepsilon)$ factor. Our KRR result resolves a variant of an open question of El Alaoui and Mahoney, asking whether the effective statistical dimension is a lower bound on the sampling complexity or not. Furthermore, for the important input distribution case of mixtures of Gaussians, we provide algorithms that bypass the above lower bounds. Dimensionality Reduction for Tukey Regression We give the first dimensionality reduction methods for the overconstrained Tukey regression problem. The Tukey loss function $\|y\|M = \sumi M(yi)hasM(yi) \approx |yi|^pforresidualerrorsyismallerthanaprescribedthreshold\tau,butM(yi)becomesconstantforerrors|yi| > \tau.Ourresultsdependonanewstructuralresult,provenconstructively,showingthatforanyd−dimensionalsubspaceL \subset \mathbb{R}^n,thereisafixedbounded−sizesubsetofcoordinatescontaining,foreveryy\in L,allthelargecoordinates,withrespecttotheTukeylossfunction, of y$. Our methods reduce a given Tukey regression problem to a smaller weighted version, whose solution is a provably good approximate solution to the original problem. Our reductions are fast, simple, and easy to implement, and we give empirical results demonstrating their practicality, using existing heuristic solvers for the small versions. We also give exponential-time algorithms giving provably good solutions, and hardness results suggesting that a significant speedup in the worst case is unlikely. Efficient Full-Matrix Adaptive Regularization Adaptive regularization methods pre-multiply a descent direction by a preconditioning matrix. Due to the large number of parameters of machine learning problems, full-matrix preconditioning methods are prohibitively expensive. We show how to modify full-matrix adaptive regularization in order to make it practical and effective. We also provide novel theoretical analysis for adaptive regularization in non-convex optimization settings. The core of our algorithm, termed GGT, consists of efficient inverse computation of square roots of low-rank matrices. Our preliminary experiments underscore improved convergence rate of GGT across a variety of synthetic tasks and standard deep learning benchmarks. Breaking the gridlock in Mixture-of-Experts: Consistent and Efficient Algorithms Mixture-of-Experts (MoE) is a widely popular model for ensemble learning and is a basic building block of highly successful modern neural networks as well as a component in Gated Recurrent Units (GRU) and Attention networks. However, present algorithms for learning MoE, including the EM algorithm and gradient descent, are known to get stuck in local optima. From a theoretical viewpoint, finding an efficient and provably consistent algorithm to learn the parameters remains a long standing open problem for more than two decades. In this paper, we introduce the first algorithm that learns the true parameters of a MoE model for a wide class of non-linearities with global consistency guarantees. While existing algorithms jointly or iteratively estimate the expert parameters and the gating parameters in the MoE, we propose a novel algorithm that breaks the deadlock and can directly estimate the expert parameters by sensing its echo in a carefully designed cross-moment tensor between the inputs and the output. Once the experts are known, the recovery of gating parameters still requires an EM algorithm; however, we show that the EM algorithm for this simplified problem, unlike the joint EM algorithm, converges to the true parameters. We empirically validate our algorithm on both the synthetic and real data sets in a variety of settings, and show superior performance to standard baselines. Efficient Nonconvex Regularized Tensor Completion with Structure-aware Proximal Iterations Nonconvex regularizers have been successfully used in low-rank matrix learning. In this paper, we extend this to the more challenging problem of low-rank tensor completion. Based on the proximal average algorithm, we develop an efficient solver that avoids expensive tensor folding and unfolding. A special sparse plus low-rank" structure, which is essential for fast computation of individual proximal steps, is maintained throughout the iterations. We also incorporate adaptive momentum to further speed up empirical convergence. Convergence results to critical points are provided under smoothness and Kurdyka-Lojasiewicz conditions. Experimental results on a number of synthetic and real-world data sets show that the proposed algorithm is more efficient in both time and space, and is also more accurate than existing approaches. Robust Estimation of Tree Structured Gaussian Graphical Models Consider jointly Gaussian random variables whose conditional independence structure is specified by a graphical model. If we observe realizations of the variables, we can compute the covariance matrix, and it is well known that the support of the inverse covariance matrix corresponds to the edges of the graphical model. Instead, suppose we only have noisy observations. If the noise at each node is independent, we can compute the sum of the covariance matrix and an unknown diagonal. The inverse of this sum is (in general) dense. We ask: can the original independence structure be recovered? We address this question for tree structured graphical models. We prove that this problem is unidentifiable, but show that this unidentifiability is limited to a small class of candidate trees. We further present additional constraints under which the problem is identifiable. Finally, we provide an O(n^3) algorithm to find this equivalence class of trees. Spectral Approximate Inference Graphical models (GMs) have been successfully applied to various applications of machine learning. Given a GM, computing its partition function is the most essential inference task, but it is computationally intractable in general. To address the issue, iterative approximation algorithms exploring certain local structure/consistency of GM have been investigated as popular choices in practice. However, due to their local/iterative nature, they often output poor approximations or even do not converge, e.g., in low-temperature regimes (hard instances of large parameters). To overcome the limitation, we propose a novel approach utilizing the global spectral feature of GM. Our contribution is two-fold: (a) we first propose a fully polynomial-time approximation scheme (FPTAS) for approximating the partition function of GM associating with a low-rank coupling matrix; (b) for general high-rank GMs, we design a spectral mean-field scheme utilizing (a) as a subroutine, where it approximates a high-rank GM into a product of rank-1 GMs for an efficient approximation of the partition function. The proposed algorithm is more robust in its running time and accuracy than prior methods, i.e., neither suffers from the convergence issue nor depends on hard local structures. Our experiments demonstrate that it indeed outperforms baselines, in particular, significantly in the low-temperature regimes. Partially Linear Additive Gaussian Graphical Models We propose a partially linear additive Gaussian graphical model (PLA-GGM) for the estimation of associations between random variables distorted by observed confounders. Model parameters are estimatedusing an L1-regularized maximal pseudo-profile likelihood estimator (MaPPLE) for which we prove a √n-sparsistency. Importantly, our approach avoids parametric constraints on the effects of confounders on the estimated graphical model structure. Empirically, the PLA-GGM is applied to both synthetic and real-world datasets, demonstrating superior performance compared to competing methods. DAG-GNN: DAG Structure Learning with Graph Neural Networks Learning a faithful directed acyclic graph (DAG) from samples of a joint distribution is a challenging combinatorial problem, owing to the intractable search space superexponential in the number of graph nodes. A recent breakthrough formulates the problem as a continuous optimization with a structural constraint that ensures acyclicity (Zheng et al., 2018). The authors apply the approach to the linear structural equation model (SEM) and the least-squares loss function that are statistically well justified but nevertheless limited. Motivated by the widespread success of deep learning that is capable of capturing complex nonlinear mappings, in this work we propose a nonlinear generative model and apply a variant of the structural constraint to learn the DAG. At the heart of the generative model is a variational autoencoder parameterized by a novel graph neural network architecture, which we coin DAG-GNN. In addition to the richer capacity, an advantage of the proposed model is that it naturally handles discrete variables as well as vector-valued variables. We demonstrate that on synthetic data sets, the proposed method learns more accurate graphs for nonlinearly generated samples; and on benchmark data sets with discrete variables, the learned graphs are reasonably close to the global optima. Random Walks on Hypergraphs with Edge-Dependent Vertex Weights Hypergraphs are used in many machine learning methods to model higher-order relationships in data. While spectral methods for graphs are well-established, spectral theory for hypergraphs remains an active area of research. In this paper, we use random walks to develop a spectral theory for hypergraphs with edge-dependent vertex weights: hypergraphs where every vertex v has a weight γe(v) for each incident hyperedge e, describing the contribution of v to the hyperedge e. We derive a random walk-based hypergraph Laplacian, and bound the mixing time of random walks on such hypergraphs. Moreover, we give conditions under which random walks on such hypergraphs are equivalent to random walks on graphs. As a corollary, we show that current machine learning methods that rely on Laplacians derived from random walks on hypergraphs with edge-independent vertex weights do not utilize higher-order relationships in the data. Finally, we demonstrate the effectiveness of hypergraphs with edge-dependent vertex weights on ranking applications using both synthetic and real-world datasets.