A Tree-Based Method for Fast Repeated Sampling of Determinantal Point Processes It is often desirable for recommender systems and other information retrieval applications to provide diverse results, and determinantal point processes (DPPs) have become a popular way to capture the trade-off between the quality of individual results and the diversity of the overall set. However, computational concerns limit the usefulness of DPPs in practice. Sampling from a DPP is inherently expensive: if the underlying collection contains N items, then generating each DPP sample requires O(N) time following a one-time preprocessing phase. Additionally, results often need to be personalized to a user, but standard approaches to personalization invalidate the preprocessing, making personalized samples especially expensive. In this work we address both of these shortcomings. First, we propose a new algorithm for generating DPP samples in O(log N) time following a slightly more expensive preprocessing phase. We then extend the algorithm to support arbitrary query-time feature weights, allowing us to generate samples customized to individual users while still retaining logarithmic runtime. Experiments show that our approach runs over 300 times faster than traditional DPP sampling on collections of 100,000 items for samples of size 10. Nonlinear Stein Variational Gradient Descent for Learning Diversified Mixture Models Diversification has been shown to be a powerful mechanism for learning robust models in non-convex settings. A notable example is learning mixture models, in which enforcing diversity between the different mixture components allows us to prevent the model collapsing phenomenon and capture more patterns from the observed data. In this work, we present a variational approach for diversity-promoting learning, which leverages the entropy functional as a natural mechanism for enforcing diversity. We develop a simple and efficient functional gradient-based algorithm for optimizing the variational objective function, which provides a significant generalization of Stein variational gradient descent (SVGD). We test our method on various challenging real world problems, including deep embedded clustering and deep anomaly detection. Empirical results show that our method provides an effective mechanism for diversity-promoting learning, achieving substantial improvement over existing methods. Understanding and Accelerating Particle-Based Variational Inference Particle-based variational inference methods (ParVIs) have gained notable attention in the Bayesian inference area, for their flexible approximation ability and effective iteration. In this work, we explore in-depth the perspective of ParVIs as Wasserstein gradient flows and make both theoretical and pragmatic contributions. On the theory side, we unify various finite-particle approximations that existing ParVIs use, and recognize that the approximation is essentially a compulsory smoothing treatment, in either of two equivalent forms. This novel understanding reveals the assumption and relations of existing ParVIs, and also inspires new ParVIs as we will demonstrate. On the technical side, we propose an acceleration framework and a principled bandwidth selection method for general ParVIs. They are based on the developed theory and our excavation on the geometry of the Wasserstein space. Experimental results show the improved convergence by the acceleration framework and enhanced sample accuracy by the bandwidth selection method. Efficient learning of smooth probability functions from Bernoulli tests with guarantees We study the fundamental problem of learning an unknown, smooth probability function via point-wise Bernoulli tests. We provide a scalable algorithm for efficiently solving this problem with rigorous guarantees. In particular, we prove the convergence rate of our posterior update rule to the true probability function in L2-norm. Moreover, we allow the Bernoulli tests to depend on contextual features, and provide a modified inference engine with provable guarantees for this novel setting. Numerical results show that the empirical convergence rates match the theory, and illustrate the superiority of our approach in handling contextual features over the state-of-the-art. The Variational Predictive Natural Gradient Variational inference transforms posterior inference into parametric optimization thereby enabling the use of latent variable models where otherwise impractical. However, variational inference can be finicky when different variational parameters control variables that are strongly correlated under the model. Traditional natural gradients based on the variational approximation fail to correct for correlations when the approximation is not the true posterior. To address this, we construct a new natural gradient called the variational predictive natural gradient. It is constructed as an average of the Fisher information of the reparameterized predictive model distribution. Unlike traditional natural gradients for variational inference, this natural gradient accounts for the relationship between model parameters and variational parameters. We also show the variational predictive natural gradient relates to the negative Hessian of the expected log-likelihood. A simple example shows the insight. We demonstrate the empirical value of our method on a classification task, a deep generative model of images, and probabilistic matrix factorization for recommendation. Scalable Nonparametric Sampling from Multimodal Posteriors with the Posterior Bootstrap Increasingly complex datasets pose a number of challenges for Bayesian inference. Conventional posterior sampling based on Markov chain Monte Carlo can be too computationally intensive, is serial in nature and mixes poorly between posterior modes. Furthermore, all models are misspecified, which brings into question the validity of the conventional Bayesian update. We present a scalable Bayesian nonparametric learning routine that enables posterior sampling through the optimization of suitably randomized objective functions. A Dirichlet process prior on the unknown data distribution accounts for model misspecification, and admits an embarrassingly parallel posterior bootstrap algorithm that generates independent and exact samples from the nonparametric posterior distribution. Our method is particularly adept at sampling from multimodal posterior distributions via a random restart mechanism, and we demonstrate this on Gaussian mixture model and sparse logistic regression examples. An Instability in Variational Inference for Topic Models Naive mean field variational methods are the state of-the-art approach to inference in topic modeling. We show that these methods suffer from an instability that can produce misleading conclusions. Namely, for certain regimes of the model parameters, variational inference outputs a non-trivial decomposition into topics. However -for the same parameter values- the data contain no actual information about the true topic decomposition, and the output of the algorithm is uncorrelated with it. In particular, the estimated posterior mean is wrong, and estimated credible regions do not achieve the nominal coverage. We discuss how this instability is remedied by more accurate mean field approximations. Bayesian Optimization of Composite Functions We consider optimization of composite objective functions, i.e., of the form f(x)=g(h(x)), where h is a black-box derivative-free expensive-to-evaluate function with vector-valued outputs, and g is a cheap-to-evaluate function taking vector-valued inputs. While these problems can be solved with standard Bayesian optimization, we propose a novel approach that exploits the composite structure of the objective function to substantially improve sampling efficiency. Our approach models h using a multi-output Gaussian process and chooses where to sample using a natural generalization of the expected improvement acquisition function, called Expected Improvement for Composite Functions (EI-CF). Although EI-CF cannot be computed in closed form, we provide a novel stochastic gradient estimator that allows its efficient maximization. We then show that our approach is asymptotically consistent, i.e., that it recovers a globally optimal solution as sampling effort grows to infinity, generalizing previous convergence results for classical EI. Numerical experiments show our approach dramatically outperforms standard Bayesian optimization benchmarks, achieving simple regret that is smaller by several orders of magnitude. The Kernel Interaction Trick: Fast Bayesian Discovery of Pairwise Interactions in High Dimensions Discovering interaction effects on a response of interest is a fundamental problem faced in biology, medicine, economics, and many other scientific disciplines. In theory, Bayesian methods for discovering pairwise interactions enjoy many benefits -- including coherent uncertainty quantification, the ability to incorporate background knowledge, and desirable shrinkage properties. In practice, however, Bayesian methods are often computationally intractable for even moderate-dimensional problems. Our key insight is that many hierarchical models of practical interest admit a Gaussian process representation such that a posterior over all O(p^2) interactions need never be maintained explicitly, only a vector of O(p) kernel hyper-parameters. This implicit representation allows us to run MCMC over model hyper-parameters in time and memory linear in p per iteration. On datasets with a variety of covariate and parameter behaviors such as sparsity, we show that: (1) our method improves running time by orders of magnitude over naive applications of MCMC, (2) that our method offers improved Type I and Type II error relative to state-of-the-art LASSO-based approaches, and (3) that our method offers improved computational scaling in high dimensions relative to existing Bayesian and LASSO-based approaches. Quantile Stein Variational Gradient Descent for Batch Bayesian Optimization Batch Bayesian optimization has been shown to be an efficient and successful approach for black-box function optimization, especially when the evaluation of cost function is highly expensive but can be efficiently parallelized. In this paper, we introduce a novel variational framework for batch query optimization, based on the argument that the query batch should be selected to have both high diversity and good worst case performance. This motivates us to introduce a variational objective that combines a quantile-based risk measure (for worst case performance) and entropy regularization (for enforcing diversity). We derive a gradient-based particle-based algorithm for solving our quantile-based variational objective, which generalizes Stein variational gradient descent (SVGD). We evaluate our method on a number of real-world applications and show that it consistently outperforms other recent state-of-the-art batch Bayesian optimization methods. Extensive experimental results indicate that our method achieves better or comparable performance, compared to the existing methods.