Hierarchical Clustering in Graph Streams: Single-Pass Algorithms and Space Lower Bounds

Jul 2, 2022

About

The Hierarchical Clustering (HC) problem consists of building a hierarchy of clusters to represent a given dataset. Motivated by modern large-scale applications, we study the problem in the streaming model, in which memory is heavily limited and only a single pass, or very few passes, over the input are allowed. Specifically, we investigate whether a good hierarchical clustering can be obtained, or at least whether we can approximately estimate the value of the optimal hierarchy. To measure the quality of a hierarchy, we use the HC minimization objective introduced by Dasgupta [STOC'16]. Assuming that the input is an n-vertex weighted graph whose edges arrive in a stream, we derive the following results on space-vs-accuracy tradeoffs:

– With O(n polylog(n)) space, we develop a single-pass algorithm whose approximation ratio matches that of the currently best offline algorithm, by Charikar and Chatziafratis [SODA'17].

– When the space is more limited, namely n^(1-o(1)), we prove that no algorithm can even estimate the value of the optimal hierarchical tree to within an o(log(n)/loglog(n)) factor, even when allowed polylog(n) passes over the input and exponential time.

– In the most stringent setting of polylog(n) space, studied extensively in the literature, we rule out algorithms that can even distinguish between “highly”-vs-“poorly” clusterable graphs, namely graphs whose HC objective values differ by an n^(1/2-o(1)) factor.

– Finally, we prove that any single-pass streaming algorithm that computes an optimal HC requires storing almost the entire input, even if allowed exponential time.

Our algorithmic results rely on a general structural result proving that cut sparsifiers of the input graph preserve the cost of “balanced” hierarchical trees to within a constant factor, and can therefore be used in place of the original (dense) graph when solving HC. Our lower bounds involve a new streaming lower bound for a novel problem, “One-vs-Many-Expanders”, which may be of independent interest.
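For context, Dasgupta's objective assigns a tree T over the graph's vertices the cost cost(T) = sum over edges (u,v) of w(u,v) times the number of leaves under the lowest common ancestor of u and v, so good hierarchies separate heavy edges as low in the tree as possible. The following is a minimal illustrative sketch of computing this cost for a toy graph and a fixed hierarchy; it is not the algorithm from the talk, and the toy graph, nested-tuple tree encoding, and helper names are assumptions made for illustration.

```python
# Illustrative only: Dasgupta's HC cost on a toy weighted graph.
# The graph, the tree encoding, and all helper names are made up for this sketch.

# Toy weighted graph on 4 vertices, given as {edge: weight}.
edges = {("a", "b"): 3.0, ("b", "c"): 1.0, ("c", "d"): 2.0, ("a", "d"): 0.5}

# A hierarchy as nested tuples; leaves are vertex names.
# This tree first splits {a, b} from {c, d}.
tree = (("a", "b"), ("c", "d"))


def leaves(node):
    """Return the set of leaf labels under a node of the nested-tuple tree."""
    if isinstance(node, str):
        return {node}
    out = set()
    for child in node:
        out |= leaves(child)
    return out


def lca_leaf_count(node, u, v):
    """Number of leaves under the lowest common ancestor of u and v."""
    if isinstance(node, tuple):
        for child in node:
            child_leaves = leaves(child)
            if u in child_leaves and v in child_leaves:
                return lca_leaf_count(child, u, v)
    # No single child contains both endpoints, so this node is their LCA.
    return len(leaves(node))


def dasgupta_cost(edges, tree):
    """cost(T) = sum over edges (u, v) of w(u, v) * |leaves(T[u v v])|."""
    return sum(w * lca_leaf_count(tree, u, v) for (u, v), w in edges.items())


print(dasgupta_cost(edges, tree))  # 3*2 + 1*4 + 2*2 + 0.5*4 = 16.0
```

In this toy example the heavy edges (a,b) and (c,d) are cut only at the bottom level, while the light cross edges pay for the full leaf count at the root, which is the qualitative behavior the objective rewards.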

About COLT

The conference has been held annually since 1988 and has become the leading conference on learning theory by maintaining a highly selective submission process. It is committed to publishing high-quality articles on all theoretical aspects of machine learning and related topics.
