Algorithms and Data Science Reading Group
Welcome to the Algorithms and Data Science Reading Group at Texas A&M University.
Clustering is a fundamental problem in unsupervised machine learning, aiming to group similar data points together without predefined labels. This talk explores clustering and its various algorithmic approaches. Brief outline: I'll begin with an overview of what clustering is and its applications, followed by a quick recap of k-means clustering and its shortcomings. I'll then go over spectral clustering, a technique that uses graph theory and eigenvalue decomposition to identify clusters, and show how it handles the non-convex cluster shapes that k-means struggles with. I'll also briefly cover some recent developments in clustering from the past year, including deep clustering, clustering with transformers, and TeraHAC.
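Below is a minimal sketch contrasting the two approaches on a toy non-convex dataset; the two-moons data, scikit-learn calls, and parameter choices are illustrative assumptions, not material from the talk.

```python
# Minimal sketch: k-means vs. spectral clustering on a non-convex toy dataset.
# Assumes scikit-learn is available; the two-moons data and parameter choices
# are illustrative, not taken from the talk itself.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

# k-means prefers roughly spherical (convex) clusters, so it splits the moons badly.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Spectral clustering builds a similarity graph and clusters its Laplacian
# eigenvectors, so it can recover the two non-convex arcs.
sc_labels = SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors", n_neighbors=10, random_state=0
).fit_predict(X)

print("k-means ARI: ", adjusted_rand_score(y_true, km_labels))
print("spectral ARI:", adjusted_rand_score(y_true, sc_labels))
```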
In this talk, I will revisit LSH for the Hamming distance and introduce LSH for the Manhattan (L1) distance and the Euclidean (L2) distance.
Nearest Neighbor Search (NNS) is a fundamental problem in computational geometry and machine learning, aiming to find the closest point to a query in a high-dimensional space. This talk explores NNS and its efficient approximation through Locality Sensitive Hashing (LSH). LSH is a powerful technique that applies hash functions to group similar items together, enabling sublinear-time approximate nearest neighbor queries. I'll go over a simple formulation of LSH under the Hamming distance, and an efficient algorithm for NNS based on LSH. I'll also cover a sub-optimal formulation for the Euclidean distance if time allows. The content mainly follows the lecture notes at https://www.cs.cmu.edu/~15451-s19/lectures/lec22-nearest-neighbor.pdf.
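To make the Hamming-distance case concrete, here is a small sketch of the classic bit-sampling LSH, where each hash function samples a few random coordinates of a binary vector. The class name, table count, and bits per table are illustrative choices, not taken from the talk or the linked lecture notes.

```python
# Sketch of bit-sampling LSH for the Hamming distance on binary vectors: each
# hash function projects a vector onto a few random coordinates, so vectors at
# small Hamming distance collide with higher probability. The parameters
# (num_tables, bits_per_table) are illustrative assumptions.
import random
from collections import defaultdict

class HammingLSH:
    def __init__(self, dim, num_tables=8, bits_per_table=12, seed=0):
        rng = random.Random(seed)
        self.projections = [
            [rng.randrange(dim) for _ in range(bits_per_table)]
            for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]

    def _keys(self, vec):
        return [tuple(vec[i] for i in proj) for proj in self.projections]

    def insert(self, vec, label):
        for table, key in zip(self.tables, self._keys(vec)):
            table[key].append((vec, label))

    def query(self, vec):
        # Return the closest point among all colliding candidates.
        best_label, best_dist = None, float("inf")
        for table, key in zip(self.tables, self._keys(vec)):
            for cand, label in table[key]:
                d = sum(a != b for a, b in zip(vec, cand))
                if d < best_dist:
                    best_label, best_dist = label, d
        return best_label, best_dist

# Toy usage: index random binary vectors, then query a slightly perturbed copy.
rng = random.Random(1)
dim = 64
lsh = HammingLSH(dim)
points = [[rng.randint(0, 1) for _ in range(dim)] for _ in range(200)]
for idx, p in enumerate(points):
    lsh.insert(p, idx)
q = points[17][:]
q[3] ^= 1  # flip one bit of point 17
print(lsh.query(q))  # most likely (17, 1)
```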
We present an approximate attention mechanism named HyperAttention to address the computational challenges posed by the growing complexity of long contexts used in Large Language Models (LLMs). Recent work suggests that in the worst-case scenario, quadratic time is necessary unless the entries of the attention matrix are bounded or the matrix has low stable rank. We introduce two parameters which measure: (1) the max column norm in the normalized attention matrix, and (2) the ratio of row norms in the unnormalized attention matrix after detecting and removing large entries. We use these fine-grained parameters to capture the hardness of the problem. Despite previous lower bounds, we are able to achieve a linear time sampling algorithm even when the matrix has unbounded entries or a large stable rank, provided the above parameters are small. HyperAttention features a modular design that easily accommodates integration of other fast low-level implementations, particularly FlashAttention. Empirically, employing Locality Sensitive Hashing (LSH) to identify large entries, HyperAttention outperforms existing methods, giving significant speed improvements compared to state-of-the-art solutions like FlashAttention. We validate the empirical performance of HyperAttention on a variety of long-context datasets. For example, HyperAttention makes the inference time of ChatGLM2 50% faster on 32k context length while perplexity increases from 5.6 to 6.3. On larger context lengths, e.g., 131k, with causal masking, HyperAttention offers a 5-fold speedup on a single attention layer.
We consider a generalization of the densest subhypergraph problem where nonnegative rewards are given for including partial hyperedges in a dense subhypergraph. Prior work addressed this problem only in cases where reward functions are convex, in which case the problem is poly-time solvable. We consider a broader setting where rewards are monotonic but otherwise arbitrary. We first prove hardness results for a wide class of non-convex rewards, then design a 1/k-approximation by projecting to the nearest set of convex rewards, where k is the maximum hyperedge size. We also design another 1/k-approximation using a faster peeling algorithm, which (somewhat surprisingly) differs from the standard greedy peeling algorithm used to approximate other variants of the densest subgraph problem. Our results include an empirical analysis of our algorithm across several real-world hypergraphs.
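For background, the standard greedy peeling algorithm referenced above can be sketched as follows for the ordinary densest subgraph problem; this is the classical 1/2-approximation baseline, not the paper's modified peeling algorithm for partial hyperedge rewards, and the edge-list input format is an assumption.

```python
# Background sketch: the standard greedy peeling algorithm for the ordinary
# densest subgraph problem. Repeatedly remove a minimum-degree vertex and keep
# the densest intermediate subgraph seen. This is the classical
# 1/2-approximation, NOT the paper's modified peeling algorithm.
from collections import defaultdict

def densest_subgraph_peel(edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    current = set(adj)
    m = len(edges)

    best_density, best_set = 0.0, set(current)
    while current:
        density = m / len(current)
        if density > best_density:
            best_density, best_set = density, set(current)
        # Peel a minimum-degree vertex and drop its incident edges.
        v = min(current, key=lambda x: len(adj[x]))
        m -= len(adj[v])
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
        current.remove(v)
    return best_set, best_density

# Toy usage: a 4-clique with a pendant vertex; the clique is the densest part.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5)]
print(densest_subgraph_peel(edges))  # ({1, 2, 3, 4}, 1.5)
```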
Regularity lemmas are a cornerstone of combinatorics, with many applications in math and CS. The most famous one is Szemeredi's regularity lemma, which states that any graph can be partitioned into a "few" parts that mostly "look random". However, there is a caveat: "few" here is really a huge number, a tower of exponentials in the error parameter. A weaker notion of regularity, due to Frieze and Kannan, gives an improved bound, where the number of parts is "merely" exponential in the error parameter. Still, for many applications one would hope for an even better bound: polynomial or quasi-polynomial in the error parameter. In this talk, I will describe such a notion called "spread regularity". It is based on a combinatorial analog of a recent breakthrough of Kelley and Meka in additive combinatorics. I will also describe two CS applications of it, one in communication complexity and the other in combinatorial algorithms for matrix multiplication. Based on joint works with Amir Abboud, Nick Fischer, Zander Kelley, and Raghu Meka.
Streaming algorithms have become popular in recent research on massive data processing. This work studies streaming algorithms for kernelization of the famous NP-hard problem (weighted) Vertex Cover; it considers not only the space complexity and update time of streaming algorithms, but also quantitatively measures the quality of data compression. We first prove a space lower bound Ω(n) for streaming algorithms, which shows that a well-known technique in kernelization algorithms for Vertex Cover would not be suitable for space-limited streaming models. We also prove a lower bound Ω(k^2) on both the space complexity and the kernel size of streaming kernelization algorithms for Vertex Cover. We then develop a new streaming kernelization algorithm for Vertex Cover that runs in O(k^2) space and O(1) update time and produces a kernel of size O(k^2). By the lower bounds we prove, this algorithm is optimal in all three complexity measures. Our algorithm also improves previously known algorithms for the problem. Joint work with H. Feng, W. Yang, and S. J. John Stella.
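As a rough illustration of what a streaming kernelization for Vertex Cover can look like, here is a simplified Buss-style sketch for the unweighted, insert-only setting; it is a baseline built on assumptions and does not implement the paper's weighted case or its O(1) update-time guarantee.

```python
# Simplified Buss-style streaming kernelization sketch for unweighted Vertex
# Cover (parameter k) in the insert-only model. Any vertex whose degree among
# stored edges exceeds k must be in every size-<=k cover, so it is forced into
# H and its stored edges are dropped; a returned kernel keeps at most k^2 edges.
# This is a baseline sketch only: it does not implement the weighted case or
# the O(1) update-time data structures of the algorithm in this work.
from collections import defaultdict

def stream_vc_kernel(edge_stream, k):
    H = set()                 # vertices forced into every small vertex cover
    deg = defaultdict(int)    # degrees within the stored kernel edges
    stored = set()            # kernel edges among low-degree vertices
    for u, v in edge_stream:
        if u in H or v in H or (u, v) in stored or (v, u) in stored:
            continue          # already covered or already stored
        stored.add((u, v))
        deg[u] += 1
        deg[v] += 1
        for x in (u, v):
            if deg[x] > k:
                H.add(x)      # x has degree > k, so it must be in the cover
                for e in [e for e in stored if x in e]:
                    stored.discard(e)
                    deg[e[0]] -= 1
                    deg[e[1]] -= 1
        if len(H) > k or len(stored) > k * k:
            return None       # no vertex cover of size at most k exists
    return H, stored          # kernel: forced vertices plus O(k^2) edges

# Toy usage with k = 2: vertex 0 is forced into the cover.
stream = [(0, 1), (0, 2), (0, 3), (0, 4), (5, 6), (7, 8)]
print(stream_vc_kernel(stream, k=2))
```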
Submodularity is a property of set functions that arises in many applications such as data summarization, recommendation systems, and viral marketing in social networks. Most existing algorithms for the maximization of a submodular objective assume value oracle access. That is, the submodular function is a black box that can be queried for any set, and its value returned. However, this is not always a realistic assumption; instead, we may only be able to make noisy queries to some underlying submodular objective. In this talk, I will: (i) Discuss a few applications of the noisy submodular maximization setting, (ii) Talk about different models of noise, and finally (iii) Present and analyze algorithms that find approximate solutions in as few noisy queries as possible.
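As a simple illustration of the setting, the sketch below runs the standard greedy under a cardinality constraint while averaging repeated noisy value queries; the additive Gaussian noise model and the averaging strategy are assumptions for illustration, not necessarily one of the algorithms analyzed in the talk.

```python
# Illustrative sketch: cardinality-constrained greedy maximization when only
# noisy value queries are available. Each query returns f(S) plus i.i.d.
# Gaussian noise, and we average repeated queries to reduce the variance.
# This is a simple baseline for the setting, not necessarily an algorithm
# presented in the talk.
import random

def noisy_greedy(ground_set, noisy_f, k, repeats=10):
    def estimate(S):
        return sum(noisy_f(S) for _ in range(repeats)) / repeats

    S = set()
    for _ in range(k):
        base = estimate(S)
        best_elem, best_gain = None, float("-inf")
        for e in ground_set - S:
            gain = estimate(S | {e}) - base   # noisy marginal-gain estimate
            if gain > best_gain:
                best_elem, best_gain = e, gain
        S.add(best_elem)
    return S

# Toy example: a coverage function observed through additive Gaussian noise.
sets = {0: {1, 2, 3}, 1: {3, 4}, 2: {5, 6, 7, 8}, 3: {1, 8}}
def noisy_coverage(S, sigma=0.3):
    covered = set().union(*(sets[i] for i in S)) if S else set()
    return len(covered) + random.gauss(0, sigma)

print(noisy_greedy(set(sets), noisy_coverage, k=2))  # likely {0, 2}
```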
Correlation clustering is a classical problem at the intersection of machine learning and theoretical computer science, and it has broad applications in various areas, including document categorization, community detection, and financial network analysis. The problem is known to be NP-hard, and we can obtain a 1.437-approximation using linear programming-based techniques. For combinatorial algorithms, it is possible to achieve a 3-approximation with the simple pivot-based algorithm, and a very recent work achieves a 1.846-approximation via local search. In this talk, I will discuss the correlation clustering problem through the lens of the sparse-dense decomposition technique. In particular, I will discuss sublinear algorithms that achieve an $O(1)$ approximation in $O(n \log^2 n)$ time in the adjacency list model and $O(n \log n)$ space in the single-pass streaming model. Furthermore, I will discuss extensions of our algorithm to the truly streaming setting, where we aim to estimate the cost of correlation clustering with polylog(n) space, and to the adversarial dynamic setting, where we aim to maintain a correlation clustering solution with polylog(n) update time against an adaptive adversary. In addition to the asymptotically optimal approximation guarantees, these algorithms are simple to implement, converge with high efficiency, and offer very competitive empirical performance. The talk is based on two published papers (AW [ITCS’22]; ASW [NeurIPS’23]) and an ongoing work.
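For reference, the simple pivot-based 3-approximation mentioned above (the randomized Pivot algorithm of Ailon, Charikar, and Newman) can be sketched in a few lines; the input format and toy instance are assumptions, and this is only background for the sublinear and streaming algorithms the talk focuses on.

```python
# Background: the randomized Pivot algorithm of Ailon, Charikar, and Newman,
# the "simple pivot-based algorithm" giving a 3-approximation in expectation
# for correlation clustering on a complete +/- graph. Input is the set of "+"
# (similar) pairs; all other pairs are treated as "-". The sublinear and
# streaming algorithms in the talk are not implemented here.
import random

def pivot_correlation_clustering(vertices, plus_edges, seed=0):
    rng = random.Random(seed)
    plus_neighbors = {v: set() for v in vertices}
    for u, v in plus_edges:
        plus_neighbors[u].add(v)
        plus_neighbors[v].add(u)

    unclustered = set(vertices)
    clusters = []
    order = list(vertices)
    rng.shuffle(order)
    for pivot in order:             # pick pivots in uniformly random order
        if pivot not in unclustered:
            continue
        cluster = {pivot} | (plus_neighbors[pivot] & unclustered)
        clusters.append(cluster)
        unclustered -= cluster
    return clusters

# Toy instance: two triangles of "+" edges joined by a single "+" edge.
V = range(6)
plus = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(pivot_correlation_clustering(V, plus))
```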
In this work, we give the first known adaptive attack against linear sketches for the well-studied L_0-estimation problem over turnstile, integer streams. For any linear streaming algorithm that uses a sketching matrix A ∈ R^{r × n}, this attack makes ~O(r^8) queries and succeeds with high constant probability in breaking the sketch. Additionally, we give an adaptive attack against linear sketches for the L_0-estimation problem over finite fields F_p, which requires a smaller number of ~O(r^3) queries. Our results provide an exponential improvement over the previous number of queries known to break an L_0-estimation sketch. Joint work with Elena Gribelyuk, Honghao Lin, David P. Woodruff, and Huacheng Yu.
We consider the optimization problem of matroid-constrained maximization of a monotone submodular set function f: 2^N → R^+ (SMMC) in a distributed setting. We introduce the algorithm Low-Adaptive Greedy (LAGreedy), which is the first low-adaptivity algorithm with an almost linear number of queries to f for SMMC. We prove that LAGreedy achieves an approximation ratio arbitrarily close to 1/2 in O(log n log k) adaptive rounds and O(n log k) queries to f. In addition, we propose the algorithm Accelerated Continuous Threshold (AccThres) to construct a fractional solution for the multilinear extension F of the function f, guaranteeing a near-optimal 1 - 1/e - O(ε) approximation in O(log n log k) adaptive rounds and O(n log k) queries to f. We provide an extensive experimental evaluation on real centralized and distributed instances of SMMC in order to demonstrate the effectiveness, efficiency, and parallelizability of our proposed algorithms.
We present streaming algorithms for Graph k-Matching in the insert-only and dynamic models. Our algorithms, with space complexity matching the best upper bounds, have optimal or near-optimal update time, significantly improving previous results. More specifically, for the insert-only streaming model, we present a one-pass algorithm with optimal space complexity O(k^2) and optimal update time O(1), that computes a maximum weighted k-matching of a weighted graph. For the dynamic streaming model, we present a one-pass algorithm that computes a maximum weighted k-matching in O(W k^2 · polylog(n)) space and O(polylog(n)) update time, where W is the number of distinct edge weights. The update time of our algorithm improves the previous upper bound of O(k^2 polylog(n)).
We give the first algorithm that achieves a near-optimal (1+ε)-approximation to (k,z)-clustering in the sliding window model, thus significantly improving on works by Braverman et al. (SODA 2016), Borassi et al. (NeurIPS 2021), and Epasto et al. (SODA 2022). Along the way, we develop a data structure for clustering called an online coreset, which outputs a coreset not only for the end of a stream, but also for all prefixes of the stream. We show a separation from the problem of constructing an offline coreset, i.e., constructing online coresets is strictly harder.
For constrained, not necessarily monotone submodular maximization, guiding the measured continuous greedy algorithm with a local search algorithm currently obtains the state-of-the-art approximation factor of 0.401 (Buchbinder and Feldman, 2023). These algorithms rely upon the multilinear extension and the Lovasz extension of a submodular set function. However, the state-of-the-art approximation factor of combinatorial algorithms has remained 1/e ≈ 0.367 (Buchbinder et al., 2014). In this work, we develop combinatorial analogues of the guided measured continuous greedy algorithm and obtain an approximation ratio of 0.385 in O(kn) queries to the submodular set function for a size constraint, and 0.305 for a general matroid constraint. Further, we derandomize these algorithms, maintaining the same ratio and asymptotic time complexity. Finally, we develop a deterministic, nearly linear time algorithm with ratio 0.377.
In this talk, we consider the combinatorial optimization problem Submodular Cover (SCP), which asks for a minimum-cardinality subset of a finite universe U such that the value of a submodular function f is above an input threshold τ. SCP is NP-hard, and therefore the focus is on developing approximation algorithms to efficiently find an approximate solution. SCP can be viewed as the dual of the more widely studied problem of submodular maximization subject to a cardinality constraint. I will survey some existing approximation algorithms for SCP, and then describe some of my own papers on SCP, including settings where queries to f are inexact, where f is not assumed to be monotone, and a number of other variants of SCP.
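For orientation, here is a sketch of the classical greedy algorithm for SCP, which repeatedly adds the element with the largest marginal gain until the threshold is met; the toy set-cover instance is an illustrative assumption, and the inexact-query and non-monotone variants from the talk require different algorithms.

```python
# Sketch of the classical greedy algorithm for Submodular Cover: repeatedly add
# the element with the largest marginal gain until f(S) >= tau. For monotone,
# integer-valued f this is the standard logarithmic-factor approximation; the
# inexact-query and non-monotone variants discussed in the talk need different
# algorithms and are not implemented here.
def greedy_submodular_cover(universe, f, tau):
    S = set()
    while f(S) < tau:
        best_elem, best_gain = None, 0
        for e in universe - S:
            gain = f(S | {e}) - f(S)
            if gain > best_gain:
                best_elem, best_gain = e, gain
        if best_elem is None:
            return None  # no positive gain left, so tau is unreachable
        S.add(best_elem)
    return S

# Toy example: set cover viewed as a submodular cover instance.
sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1, 6}}
coverage = lambda S: len(set().union(*(sets[x] for x in S))) if S else 0
print(greedy_submodular_cover(set(sets), coverage, tau=6))  # e.g. {'a', 'c'}
```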
In this talk I'll present a proof that a simple greedy algorithm for finding a maximal independent set in a graph is also a randomized 2-approximation algorithm for weighted vertex cover. The unweighted version of the algorithm has existed for decades as a maximal independent set algorithm, but was not previously known to approximate vertex cover. This result leads to several new insights and simplified algorithms for different graph problems as corollaries. This includes a simple O(log n)-round parallel algorithm for vertex cover, simplified approximation algorithms for certain edge deletion problems, connections to a problem called correlation clustering, and insights for list heuristic algorithms for vertex cover. This will be a more detailed version of a talk recently presented at the Symposium on Simplicity in Algorithms (SOSA '24) (https://epubs.siam.org/doi/pdf/10.1137/1.9781611977936.32)
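A minimal sketch of the unweighted flavor of this algorithm is below: run the textbook greedy maximal-independent-set procedure on a uniformly random vertex order and output the complement, which is always a vertex cover. The weighted rule and the 2-approximation analysis from the paper are not reproduced here, and the toy graph is an assumption.

```python
# Sketch (unweighted flavor only): greedy maximal independent set over a
# uniformly random vertex order; the complement of a maximal independent set
# is always a vertex cover. The weighted greedy rule and the 2-approximation
# analysis from the paper are not reproduced here.
import random

def random_greedy_mis_vertex_cover(vertices, edges, seed=0):
    rng = random.Random(seed)
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    order = list(vertices)
    rng.shuffle(order)
    independent = set()
    for v in order:
        if not (adj[v] & independent):   # no neighbor already chosen
            independent.add(v)
    return set(vertices) - independent   # complement is a vertex cover

# Toy usage: a star joined to a triangle.
V = range(6)
E = [(0, 1), (0, 2), (0, 3), (3, 4), (4, 5), (3, 5)]
print(random_greedy_mis_vertex_cover(V, E))
```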
We introduce an alternative model for the design and analysis of strategyproof mechanisms that is motivated by the recent surge of work in "learning-augmented algorithms".
Aiming to complement the traditional approach in computer science, which analyzes the performance of algorithms based on worst-case instances, this line of work has focused
on the design and analysis of algorithms that are enhanced with machine-learned predictions regarding the optimal solution. The algorithms can use the predictions as a guide
to inform their decisions, and the goal is to achieve much stronger performance guarantees when these predictions are accurate (consistency), while also maintaining
near-optimal worst-case guarantees, even if these predictions are very inaccurate (robustness). So far, these results have been limited to algorithms.
We initiate the design and analysis of strategyproof mechanisms that are augmented with predictions regarding the private information of the participating agents.
To exhibit the important benefits of this approach, we revisit the canonical problem of strategic facility location. We propose new strategyproof mechanisms that leverage
predictions to guarantee an optimal trade-off between consistency and robustness guarantees. Furthermore, we also prove parameterized approximation results as a function of
the prediction error, showing that our mechanisms perform well even when the predictions are not fully accurate.
Joint work with Priyank Agrawal, Vasilis Gkatzelis, Tingting Ou, and Xizhi Tan
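Purely as an illustration of the consistency/robustness trade-off (and not the mechanisms from the talk), the sketch below contrasts two extreme strategyproof mechanisms for single-facility location on a line under a maximum-cost objective, which is one common setting and an assumption here: the median mechanism ignores the prediction, while a naive mechanism follows the prediction blindly. The mechanisms in the talk achieve provable trade-offs between these two extremes.

```python
# Illustration only (not the mechanisms from the talk): single-facility
# location on a line under the maximum-cost objective. The median of the
# reports is strategyproof and 2-approximate in the worst case (robust, but
# unable to exploit a good prediction), while simply outputting the prediction
# is trivially strategyproof and optimal when the prediction is perfect
# (consistent) but has no worst-case guarantee.
import statistics

def max_cost(facility, locations):
    return max(abs(facility - x) for x in locations)

def median_mechanism(reports, prediction=None):
    # Ignores the prediction entirely: robust but not consistent.
    return statistics.median_low(reports)

def follow_prediction_mechanism(reports, prediction):
    # Ignores the reports entirely: consistent but not robust.
    return prediction

reports = [0.0, 0.2, 0.9, 1.0]
opt = (min(reports) + max(reports)) / 2   # optimum for the max-cost objective
for mech in (median_mechanism, follow_prediction_mechanism):
    loc = mech(reports, prediction=opt)   # here the prediction happens to be perfect
    print(mech.__name__, round(max_cost(loc, reports), 3),
          "vs OPT", round(max_cost(opt, reports), 3))
```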
Given a collection of m sets from a universe U, the Maximum Set Coverage problem consists of finding k sets whose union has largest cardinality.
This problem is NP-Hard, but the solution can be approximated by a polynomial time algorithm up to a factor 1-1/e.
However, this algorithm does not scale well with the input size.
In a streaming context, practical high-quality solutions are found, but with space complexity that scales linearly with respect to the size of the universe n = |U|.
However, one randomized streaming algorithm has been shown to produce a 1-1/e-ε approximation of the optimal solution with a space complexity that scales
only poly-logarithmically with respect to m and n. In order to achieve such a low space complexity, the authors used two techniques in their multi-pass approach:
F0-sketching, which allows the number of distinct elements in a set to be estimated with great accuracy using less space than the set itself; and
subsampling, which consists of solving the problem only on a subspace of the universe and is implemented using γ-independent hash functions.
This article focuses on the sublinear-space algorithm and highlights the time cost of these two techniques, especially subsampling. We present optimizations that
significantly reduce the time complexity of the algorithm. Firstly, we give some optimizations that do not alter the space complexity, number of passes and approximation
quality of the original algorithm. In particular, we reanalyze the error bounds to show that the original independence factor of Ω(ε^{-2} k log m) can be fine-tuned to Ω(k log m);
we also show how F0-sketching can be removed. Secondly, we derive a new lower bound for the probability of producing a 1-1/e-ε approximation using only pairwise independence:
1 - 4/(c k log m), compared to 1 - 2e/m^{ck/6} with Ω(k log m)-independence. Although the theoretical guarantees are weaker, suggesting the approximation quality would suffer on large streams,
our algorithms perform well in practice. Finally, our experimental results show that even a pairwise-independent hash-function sampler does not produce
worse solutions than the original algorithm, while running faster by several orders of magnitude.
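The sketch below illustrates only the subsampling ingredient: a pairwise-independent hash function of the form h(x) = (a·x + b) mod p filters the universe before a standard greedy pass. The prime, sampling rate, and toy instance are assumptions; the actual multi-pass streaming algorithm and its F0-sketches are not reproduced.

```python
# Toy sketch of the subsampling idea: hash universe elements with a
# pairwise-independent function h(x) = (a*x + b) mod p and keep only elements
# whose hash falls below a threshold, then run the standard greedy on the
# subsampled sets. The prime, sample rate, and instance are illustrative
# assumptions; this is not the paper's streaming algorithm.
import random

P = 2_147_483_647  # a Mersenne prime larger than the (integer) universe

def pairwise_hash(seed):
    rng = random.Random(seed)
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda x: (a * x + b) % P

def subsampled_greedy_max_coverage(sets, k, sample_rate=0.5, seed=0):
    h = pairwise_hash(seed)
    keep = lambda x: h(x) < sample_rate * P
    small = {i: {x for x in s if keep(x)} for i, s in sets.items()}

    chosen, covered = [], set()
    for _ in range(k):
        best = max(small, key=lambda i: len(small[i] - covered))
        chosen.append(best)
        covered |= small[best]
    return chosen

sets = {0: {1, 2, 3, 4}, 1: {4, 5}, 2: {6, 7, 8}, 3: {1, 8, 9}}
print(subsampled_greedy_max_coverage(sets, k=2))
```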
Graph clustering is a fundamental task in network analysis where the goal is to detect sets of nodes that are well-connected to each other but sparsely connected to the rest of the graph. We present faster approximation algorithms for an NP-hard parameterized clustering framework called LambdaCC, which is governed by a tunable resolution parameter and generalizes many other clustering objectives such as modularity, sparsest cut, and cluster deletion. Previous LambdaCC algorithms are either heuristics with no approximation guarantees, or computationally expensive approximation algorithms. We provide fast new approximation algorithms that can be made purely combinatorial. These rely on a new parameterized edge labeling problem we introduce, which generalizes previous edge labeling problems based on the principle of strong triadic closure and is of independent interest in social network analysis. Our methods are orders of magnitude more scalable than previous approximation algorithms, and our lower bounds allow us to obtain a posteriori approximation guarantees for previous heuristics that have no approximation guarantees of their own.
In graph clustering and hypergraph clustering, the goal is to partition a set of nodes into well-connected clusters in such a way that few edges (or hyperedges) are "cut", i.e., separated between clusters. Although graph models are more common, hypergraphs can directly encode multiway relationships like group social interactions, multiway biological interactions, or chemical reactions involving multiple substances. As a result, hypergraph clustering is more natural for many applications. However, this generalized problem is more nuanced since there are many different ways to partition a single hyperedge among clusters. Different ways of penalizing cut hyperedges may be more or less useful depending on the application, and this also leads to significant differences in computational complexity results when solving hypergraph clustering problems. This talk will introduce a generalized notion of the minimum hypergraph cut problem, along with efficient algorithms for certain variants and NP-hardness results for others. I'll then show how new algorithmic techniques for hypergraph cut problems can be incorporated into more sophisticated pipelines for hypergraph clustering. This leads to improved results in several different types of downstream data analysis tasks, such as finding related retail products in large e-commerce datasets, detecting groups of students from the same classroom based on group interaction data, and identifying related posts in large online Q&A forums.
Although k-means clustering is a well-studied problem, there exist strong theoretical limits on worst-case inputs. To overcome this barrier, we consider a scenario where “advice”
is provided to help perform clustering. We consider the k-means problem augmented with a predictor that, given any point, returns its cluster label in an approximately optimal
clustering up to some error. We present an algorithm whose performance improves along with the accuracy of the predictor, even though naively following the accurate predictor can
still lead to a high clustering cost. Thus, if the predictor is sufficiently accurate, we can retrieve a close-to-optimal clustering in nearly optimal runtime, breaking known
computational barriers for algorithms that do not have access to such advice.
Based on joint work with Jon Ergun, Zhili Feng, Sandeep Silwal, and David P. Woodruff
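As a naive illustration of the setting (not the algorithm from the paper), the sketch below seeds k-means centers from the predictor's labels and then runs a few standard Lloyd iterations; the blob data and the 10%-error predictor are illustrative assumptions.

```python
# Naive illustration of clustering with a predictor: average the points the
# predictor assigns to each label to get initial centers, then refine with a
# few standard Lloyd (k-means) iterations. This is only a simple baseline to
# convey the setting; the talk's algorithm uses the predictions more carefully
# to obtain provable guarantees.
import numpy as np

def kmeans_with_predictor(X, predictor, k, refine_steps=5):
    labels = np.array([predictor(x) for x in X])
    centers = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else X[0]
        for j in range(k)
    ])
    for _ in range(refine_steps):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        centers = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
    return centers

# Toy usage: two well-separated blobs and a predictor that errs 10% of the time.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])
def noisy_predictor(x):
    label = int(x[0] > 2.5)
    return label if rng.random() < 0.9 else 1 - label
print(kmeans_with_predictor(X, noisy_predictor, k=2))
```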