COMIC: Multi-view Clustering Without Parameter Selection In this paper, we study two challenges in clustering analysis, namely, how to cluster multi-view data and how to perform clustering without parameter selection on cluster size. To this end, we propose a novel objective function to project raw data into one space in which the projection embraces the geometric consistency (GC) and the cluster assignment consistency (CAC). To be specific, the GC aims to learn a connection graph from a projection space wherein the data points are connected if and only if they belong to the same cluster. The CAC aims to minimize the discrepancy of pairwise connection graphs induced from different views based on the view-consensus assumption, \textit{i.e.}, different views could produce the same cluster assignment structure as they are different portraits of the same object. Thanks to the view-consensus derived from the connection graph, our method could achieve promising performance in learning view-specific representation and eliminating the heterogeneous gaps across different views. Furthermore, with the proposed objective, it could learn almost all parameters including the cluster number from data without labor-intensive parameter selection. Extensive experimental results show the promising performance achieved by our method on five datasets comparing with nine state-of-the-art multi-view clustering approaches. The Wasserstein Transform We introduce the Wasserstein transform, a method for enhancing and denoising datasets defined on general metric spaces. The construction draws inspiration from Optimal Transportation ideas. We establish the stability of our method under data perturbation and, when the dataset is assumed to be Euclidean, we also exhibit a precise connection between the Wasserstein transform and the mean shift family of algorithms. We then use this connection to prove that mean shift also inherits stability under perturbations. We study the performance of the Wasserstein transform method on different datasets as a preprocessing step prior to clustering and classification tasks. Sequential Facility Location: Approximate Submodularity and Greedy Algorithm We develop and analyze a novel utility function and a fast optimization algorithm for subset selection in sequential data that incorporates the underlying dynamic model of data. We propose a capacitated sequential facility location function that finds a fixed number of representatives that well encode the data, where the sequence of representatives are compatible with the dynamic model of data. As maximizing this new objective function is NP-hard, we develop a fast greedy algorithm based on submodular maximization. Unlike the conventional facility location, the computation of the marginal gain in our case cannot be done by operations on each item independently. We exploiting the sequential structure of the problem and develop an efficient dynamic programming-based algorithm that exactly computes the marginal gain. We investigate conditions on the dynamic transition model of data, under which our utility function is submodular or ϵ-approximately submodualr, hence, the greedy algorithm comes with performance guarantees. By experiments on synthetic data and the real-world problem of procedure learning from instructional videos, we show that our framework not only significantly improves the computational time, but also achieves better objective function values and obtains more coherent summaries. Neural Collaborative Subspace Clustering We introduce the Neural Collaborative Subspace Clustering, a neural model that discovers clusters of data points drawn from a union of low-dimensional subspaces. In contrast to previous attempts, our model runs without the aid of spectral clustering. This makes our algorithm one of the kinds that can gracefully scale to large datasets. At its heart, our neural model benefits from a classifier which determines whether a pair of points lies on the same subspace or not. Essential to our model is the construction of two affinity matrices, one from the classifier and the other from a notion of subspace self-expressiveness, to supervise training in a collaborative scheme. We thoroughly assess and contrast the performance of our model against various state-of-the-art clustering algorithms including deep subspace-based ones. Unsupervised Deep Learning by Neighbourhood Discovery Deep convolutional neural networks (CNNs) have demonstrated remarkable success in computer vision by supervisedly learning strong visual feature representations. However, training CNNs relies heavily on the availability of exhaustive training data annotations, limiting significantly their deployment and scalability in many application scenarios. In this work, we introduce a generic unsupervised deep learning approach to training deep models without the need for any manual label supervision. Specifically, we progressively discover sample anchored/centred neighbourhoods to reason and learn the underlying class decision boundaries iteratively and accumulatively. Every single neighbourhood is specially formulated so that all the member samples can share the same unseen class labels at high probability for facilitating the extraction of class discriminative feature representations during training. Experiments on image classification show the performance advantages of the proposed method over the state-of-the-art unsupervised learning models on CIFAR10 and CIFAR100, SVHN, and ImageNet benchmarks. Autoregressive Energy Machines Neural density estimators are flexible families of parametric models which have seen widespread use in unsupervised machine learning in recent years. Maximum-likelihood training typically dictates that these models be constrained to specify an explicit density. However, this limitation can be overcome by instead using a neural network to specify an energy function, or unnormalized density, which can subsequently be normalized to obtain a valid distribution. The challenge with this approach lies in accurately estimating the normalizing constant of the high-dimensional energy function. We propose the Autoregressive Energy Machine, an energy-based model which simultaneously learns an unnormalized density and computes an importance-sampling estimate of the normalizing constant for each conditional in an autoregressive decomposition. The Autoregressive Energy Machine achieves state-of-the-art performance on a suite of density-estimation tasks. Greedy Orthogonal Pivoting Algorithm for Non-Negative Matrix Factorization Non-negative matrix factorization is a powerful tool for learning useful representations in the data and has been widely applied in many problems such as data mining and signal processing. Orthogonal NMF, which can improve the locality of decomposition, has drawn considerable interest in solving clustering problems in recent years. However, imposing simultaneous non-negative and orthogonal structure can be quite difficult, and so existing algorithms can only solve it approximately. To address this challenge, we propose an innovative procedure called Greedy Orthogonal Pivoting Algorithm (GOPA). The GOPA algorithm fully exploits the sparsity of non-negative orthogonal solutions to break the global problem into a series of local optimizations, in which an adaptive subset of coordinates are updated in a greedy, closed-form manner. The biggest advantage of GOPA is that it promotes exact orthogonality and provides solid empirical evidence that stronger orthogonality does contribute favorably to better clustering performance. On the other hand, we further design randomized and parallel version of GOPA, which can further reduce the computational cost and improve accuracy, making it suitable for large data. Noise2Self: Blind Denoising by Self-Supervision We propose a general framework for denoising high-dimensional measurements which requires no prior on the signal, no estimate of the noise, and no clean training data. The only assumption is that the noise exhibits statistical independence across different dimensions of the measurement. Moreover, our framework is not restricted to a particular denoising model. We show how it can be used to calibrate any parameterised denoising algorithm, from the single hyperparameter of a median filter to the millions of weights of a deep neural network. We demonstrate this on natural image and microscopy data, where we exploit the independence of noise between pixels, and on single-cell gene expression data, where we exploit the independence between detections of individual molecules. Finally, we prove a theoretical lower bound on the performance of an optimal denoiser. This framework generalizes recent work on training neural nets from noisy images and on cross-validation for matrix factorization. Learning Dependency Structures for Weak Supervision Models Labeling training data is a key bottleneck in the modern machine learning pipeline. Recent weak supervision approaches combine labels from multiple noisy sources by estimating their accuracies without access to ground truth; however, estimating the dependencies among these sources is a critical challenge. We focus on a robust PCA-based algorithm for learning these dependency structures, establish improved theoretical recovery rates, and outperform existing methods on various real-world tasks. Under certain conditions, we show that the amount of unlabeled data needed can scale sublinearly or even logarithmically with the number of sources m, improving over previous efforts that ignore the sparsity pattern in the dependency structure and scale linearly in m. We provide an information-theoretic lower bound on the minimum sample complexity of the weak supervision setting. Our method outperforms weak supervision approaches that assume conditionally-independent sources by up to 4.64 F1 points and previous structure learning approaches by up to 4.41 F1 points on real-world relation extraction and image classification tasks. Geometry and Symmetry in Short-and-Sparse Deconvolution We study the Short-and-Sparse (SaS) deconvolution problem of recovering a short signal a0 and a sparse signal x0 from their convolution. We propose a method based on nonconvex optimization, which under certain conditions recovers the target short and sparse signals, up to a signed shift symmetry which is intrinsic to this model. This symmetry plays a central role in shaping the optimization landscape for deconvolution. We give a regional analysis, which characterizes this landscape geometrically, on a union of subspaces. Our geometric characterization holds when the length-p0 short signal a0 has shift coherence µ, and x0 follows a random sparsity model with sparsity rate θ ∈ [c1/p0, c2/(p0\sqrt{\mu}+\sqrt{p0})] / (log^2(p0)) . Based on this geometry, we give a provable method that successfully solves SaS deconvolution with high probability.