DBSCAN++: Towards fast and scalable density clustering DBSCAN is a classical density-based clustering procedure with tremendous practical relevance. However, DBSCAN implicitly needs to compute the empirical density for each sample point, leading to a quadratic worst-case time complexity, which is too slow on large datasets. We propose DBSCAN++, a simple modification of DBSCAN which only requires computing the densities for a chosen subset of points. We show empirically that, compared to traditional DBSCAN, DBSCAN++ can provide not only competitive performance but also added robustness in the bandwidth hyperparameter while taking a fraction of the runtime. We also present statistical consistency guarantees showing the trade-off between computational cost and estimation rates. Surprisingly, up to a certain point, we can enjoy the same estimation rates while lowering computational cost, showing that DBSCAN++ is a sub-quadratic algorithm that attains minimax optimal rates for level-set estimation, a quality that may be of independent interest. Concrete Autoencoders: Differentiable Feature Selection and Reconstruction We introduce the concrete autoencoder, an end-to-end differentiable method for global feature selection, which efficiently extracts a sparse set of the most informative features and simultaneously learns a neural network to reconstruct the input data from the selected features. Our method is unsupervised, and is based on using a concrete selector layer as the encoder and using a standard neural network as the decoder. During the training phase, the temperature of the concrete selector layer is gradually decreased, which encourages a user-specified number of discrete features to be learned. During test time, the selected features can be used with the decoder network to reconstruct the remaining input features. We evaluate concrete autoencoders on a variety of datasets, where they significantly outperform state-of-the-art methods for feature selection and data reconstruction. In particular, on a large-scale gene expression dataset, the concrete autoencoder selects a small subset of genes whose expression levels can be use to impute the expression levels of the remaining genes. In doing so, it improves on the current widely-used expert-curated L1000 landmark genes, potentially saving experimental costs. The concrete autoencoder can be implemented by adding just a few lines of code to a standard autoencoder. Gromov-Wasserstein Learning for Graph Matching and Node Embedding A novel Gromov-Wasserstein learning framework is proposed to jointly match (align) graphs and learn embedding vectors for the associated graph nodes. Using Gromov-Wasserstein discrepancy, we measure the dissimilarity between two graphs and find their correspondence, according to the learned optimal transport. The node embeddings associated with the two graphs are learned under the guidance of the optimal transport, the distance of which not only reflects the topological structure of each graph but also yields the correspondence across the graphs. These two learning steps are mutually-beneficial, and are unified here by minimizing the Gromov-Wasserstein discrepancy with structural regularizers. This framework leads to an optimization problem that is solved by a proximal point method. We apply the proposed method to matching problems in real-world networks, and demonstrate its superior performance compared to alternative approaches. Spectral Clustering of Signed Graphs via Matrix Power Means Signed graphs can be used to encode positive (attractive) and negative (repulsive) relations between nodes. We propose to merge the information from positive and negative edges via the one-parameter family of signed power mean Laplacians defined as the matrix power mean of standard and signless Laplacians. We analyze the signed power mean Laplacian in the stochastic block model in expectation and show that it outperforms the state of the art in terms of conditions under which it recovers the ground truth clusters. Moreover, under the stochastic block model we show concentration around its expectation of eigenvalues and eigenvectors of the signed power mean Laplacian. Finally, we provide an extensive comparison to existing methods on real world datasets. Coresets for Ordered Weighted Clustering We design coresets for ordered k-median, a generalization of classical clustering problems such as k-median and k-center, that offers a more flexible data analysis, like easily combining multiple objectives (e.g., to increase fairness or for Pareto optimization). Its objective function is defined via the Ordered Weighted Averaging (OWA) paradigm of Yager (1988), where data points are weighted according to a predefined weight vector, but in order of their contribution to the objective (distance from the centers). A powerful data-reduction technique, called a coreset, is to summarize a point set X in R^d into a small (weighted) point set X', such that for every set of k potential centers, the objective value of the coreset X' approximates that of X within factor 1±ϵ1±ϵ. When there are multiple objectives (weights), the above standard coreset might have limited usefulness, and we therefore introduce the notion of a \emph{simultaneous} coreset, where the above approximation holds for all weights (in addition to all centers). Our main result is a construction of a simultaneous coreset of size Oϵ,d(k2log2|X|)Oϵ,d(k2log2⁡|X|) for ordered k-median. We believe that simultaneous coresets can be applicable more broadly in data analysis. To validate the efficacy of our coreset construction we ran experiments on a real geographical data set. We find that our algorithm produces a small coreset, which translates to a massive speedup of clustering computations, while maintaining high accuracy for a range of weights. Fair k-Center Clustering for Data Summarization In data summarization we want to choose k prototypes in order to summarize a data set. We study a setting where the data set comprises several demographic groups and we are restricted to choose k_i prototypes belonging to group i. A common approach to the problem without the fairness constraint is to optimize a centroid-based clustering objective such as k-center. A natural extension then is to incorporate the fairness constraint into the clustering objective. Existing algorithms for doing so run in time super-quadratic in the size of the data set. This is in contrast to the standard k-center objective that can be approximately optimized in linear time. In this paper, we resolve this gap by providing a simple approximation algorithm for the k-center problem under the fairness constraint with running time linear in the size of the data set and k. If the number of demographic groups is small, the approximation guarantee of our algorithm only incurs a constant-factor overhead. We demonstrate the applicability of our algorithm on both synthetic and real data sets. A Better k-means++ Algorithm via Local Search In this paper, we develop a new variant of k-means++ seeding that in expectation achieves a constant approximation guarantee. We obtain this result by a simple combination of k-means++ sampling with a local search strategy. We evaluate our algorithm empirically and show that it also improves the quality of a solution in practice. Kernel Normalized Cut: a Theoretical Revisit In this paper, we study about theoretical properties of clustering based on the kernel normalized cut. Our first contribution is to derive a nonasymptotic upper bound on the expected distortion rate of the kernel normalized cut. From this result, we show that the solution of the kernel normalized cut converges to that of the population-level weighted k-means clustering on a certain reproducing kernel Hilbert space (RKHS). Our second contribution is to discover an interesting fact that the population-level weighted k-means clustering in the RKHS is equivalent to the population-level normalized cut. Combining these results, we can see that the kernel normalized cut converges to the population-level normalized cut. The criterion of the population-level normalized cut can be considered as an indivisibility of the population distribution, and this criterion plays an important role in the theoretical analysis of spectral clustering in Schiebinger et al. (2015). We believe that our results will provide deep insights into behavior of both normalized cut and spectral clustering. Guarantees for Spectral Clustering with Fairness Constraints Given the widespread popularity of spectral clustering (SC) for partitioning graph data, we study a version of constrained SC in which we try to incorporate the fairness notion proposed by Chierichetti et al. (2017). According to this notion, a clustering is fair if every demographic group is approximately proportionally represented in each cluster. To this end, we develop variants of both normalized and unnormalized constrained SC and show that they help find fairer clusterings on both synthetic and real data. We also provide a rigorous theoretical analysis of our algorithms. While there have been efforts to incorporate various constraints into the SC framework, theoretically analyzing them is a challenging problem. We overcome this by proposing a natural variant of the stochastic block model where h groups have strong inter-group connectivity, but also exhibit a "natural" clustering structure which is fair. We prove that our algorithms can recover this fair clustering with high probability. Supervised Hierarchical Clustering with Exponential Linkage In supervised clustering, standard techniques for learning a dissimilarity function often present a mismatch between the training and clustering objectives. This mismatch leads to poor performance, which we demonstrate in the case where training maximizes prediction accuracy on all within- and across-cluster pairs and clustering is performed with agglomerative clustering with single linkage. We present a training procedure tailored specifically to single linkage clustering that results in improved performance. Since designing specialized training procedures is cumbersome, we introduce the parametric family of exponential linkage functions, which smoothly interpolates between single, average and complete linkage, and give a training procedure that jointly selects a linkage from the family and learns a dissimilarity function suited for that linkage. In experiments on four datasets, our training procedure leads to improvements of up to 6\% dendrogram purity over all pairs training and consistently matches or outperforms the next best linkage/training-procedure pair on three out of four datasets.