Algorithms and Data Science Reading Group
Welcome to the Algorithms and Data Science Reading Group at Texas A&M University.
We consider the optimization problem of matroid-constrained maximization of a monotone submodular set function f: 2^N → R^+ (SMMC) in a distributed setting. We introduce the algorithm Low-Adaptive Greedy (LAGreedy), the first low-adaptivity algorithm for SMMC with an almost linear number of queries to f. We prove that LAGreedy achieves an approximation ratio arbitrarily close to 1/2 in O(log n log k) adaptive rounds and O(n log k) queries to f. In addition, we propose the algorithm Accelerated Continuous Threshold (AccThres), which constructs a fractional solution for the multilinear extension F of the function f, guaranteeing a near-optimal 1 - 1/e - O(ε) approximation in O(log n log k) adaptive rounds and O(n log k) queries to f. We provide an extensive experimental evaluation on real centralized and distributed instances of SMMC to demonstrate the effectiveness, efficiency, and parallelizability of our proposed algorithms.
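To make the query model concrete, here is a minimal Python sketch of a sequential threshold-greedy baseline for the special case of a cardinality constraint; it is not LAGreedy itself (whose contribution is achieving this kind of thresholding in O(log n log k) adaptive rounds under a matroid constraint), and the oracle f and the toy coverage function below are illustrative assumptions.

    # Sequential threshold-greedy baseline for monotone submodular maximization
    # under a cardinality constraint k (a sketch; NOT LAGreedy). f is assumed to
    # be a monotone submodular value oracle that maps a frozenset to a float.
    def threshold_greedy(f, ground_set, k, eps=0.1):
        S = set()
        d = max(f(frozenset([e])) for e in ground_set)  # best singleton value
        tau = d
        while tau > (eps / len(ground_set)) * d and len(S) < k:
            for e in ground_set:
                if e in S or len(S) >= k:
                    continue
                if f(frozenset(S | {e})) - f(frozenset(S)) >= tau:
                    S.add(e)
            tau *= 1 - eps  # geometrically decrease the acceptance threshold
        return S

    # Toy usage with a coverage function (monotone and submodular).
    sets = {0: {1, 2}, 1: {2, 3}, 2: {4}, 3: {1, 4, 5}}
    f = lambda S: float(len(set().union(*(sets[i] for i in S))))
    print(threshold_greedy(f, list(sets), k=2))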
We present streaming algorithms for Graph k-Matching in the insert-only and dynamic models. Our algorithms, whose space complexity matches the best known upper bounds, have optimal or near-optimal update time, significantly improving on previous results. More specifically, for the insert-only streaming model, we present a one-pass algorithm with optimal space complexity O(k^2) and optimal update time O(1) that computes a maximum weighted k-matching of a weighted graph. For the dynamic streaming model, we present a one-pass algorithm that computes a maximum weighted k-matching in O(W k^2 · polylog(n)) space and O(polylog(n)) update time, where W is the number of distinct edge weights. The update time of our algorithm improves on the previous upper bound of O(k^2 · polylog(n)).
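For reference, the objective these streaming algorithms compute, a maximum weight k-matching, is a set of k pairwise vertex-disjoint edges of maximum total weight. A brute-force, non-streaming sketch that only pins down this objective (the edge-list format and names are illustrative) might look as follows.

    # Brute-force reference for maximum weight k-matching (NOT the streaming
    # algorithms from the talk): try every k-subset of edges and keep the
    # heaviest vertex-disjoint one. Exponential in k; shown only to fix the objective.
    from itertools import combinations

    def max_weight_k_matching(edges, k):
        # edges: list of (u, v, weight) tuples
        best, best_weight = None, float("-inf")
        for cand in combinations(edges, k):
            endpoints = [x for (u, v, _) in cand for x in (u, v)]
            if len(endpoints) == len(set(endpoints)):  # pairwise vertex-disjoint
                total = sum(w for (_, _, w) in cand)
                if total > best_weight:
                    best, best_weight = cand, total
        return best, best_weight

    edges = [(1, 2, 5.0), (2, 3, 4.0), (3, 4, 3.0), (1, 4, 2.0)]
    print(max_weight_k_matching(edges, k=2))  # ((1, 2, 5.0), (3, 4, 3.0)), 8.0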
We give the first algorithm that achieves a near-optimal (1+ε)-approximation to (k,z)-clustering in the sliding window model, significantly improving on works by Braverman et al. (SODA 2016), Borassi et al. (NeurIPS 2021), and Epasto et al. (SODA 2022). Along the way, we develop a data structure for clustering called an online coreset, which outputs a coreset not only for the end of a stream, but also for all prefixes of the stream. We show a separation from the problem of constructing an offline coreset, i.e., constructing online coresets is strictly harder.
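To illustrate the contract of an online coreset (a summary that must remain a valid coreset after every prefix of the stream, not just at the end), here is a deliberately trivial, space-inefficient Python sketch; the class name and interface are my own assumptions, and the paper's construction is, of course, far more space-efficient.

    # Trivial illustration of the "online coreset" contract: after every insert,
    # coreset() must be a valid coreset of the prefix seen so far. Storing every
    # point is exact (hence trivially a coreset) but defeats the purpose of
    # saving space; it only demonstrates the interface.
    class TrivialOnlineCoreset:
        def __init__(self):
            self.points = []          # (point, weight) pairs

        def insert(self, point):
            self.points.append((point, 1.0))

        def coreset(self):
            # Exactly represents the (k, z)-clustering cost of the current prefix.
            return list(self.points)

    stream = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
    summary = TrivialOnlineCoreset()
    for p in stream:
        summary.insert(p)
        print(len(summary.coreset()))   # a coreset is available after every prefix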
For constrained, not necessarily monotone submodular maximization, guiding the measured continuous greedy algorithm with a local search algorithm currently obtains the state-of-the-art approximation factor of 0.401 (Buchbinder and Feldman, 2023). These algorithms rely upon the multilinear extension and the Lovász extension of a submodular set function. However, the state-of-the-art approximation factor of combinatorial algorithms has remained 1/e ≈ 0.367 (Buchbinder et al., 2014). In this work, we develop combinatorial analogues of the guided measured continuous greedy algorithm and obtain an approximation ratio of 0.385 in O(kn) queries to the submodular set function for a size constraint, and a ratio of 0.305 for a general matroid constraint. Further, we derandomize these algorithms, maintaining the same ratios and asymptotic time complexity. Finally, we develop a deterministic, nearly linear time algorithm with ratio 0.377.
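For context, the 1/e-ratio combinatorial baseline cited above for a size constraint is the random greedy algorithm of Buchbinder et al. (2014); a minimal sketch, with an assumed set-valued oracle f, is below (the new 0.385 and 0.305 algorithms of this work are not reproduced here).

    # Random greedy (Buchbinder et al., 2014) for possibly non-monotone submodular
    # maximization under a size constraint k: in each round, take the k elements
    # of largest marginal gain and add one of them uniformly at random. Picking
    # an element of non-positive gain is treated as a no-op here, mimicking the
    # dummy elements used in the original analysis. f is a value oracle on frozensets.
    import random

    def random_greedy(f, ground_set, k):
        S = set()
        for _ in range(k):
            gains = {e: f(frozenset(S | {e})) - f(frozenset(S))
                     for e in ground_set if e not in S}
            if not gains:
                break
            top = sorted(gains, key=gains.get, reverse=True)[:k]
            e = random.choice(top)
            if gains[e] > 0:
                S.add(e)
        return S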
In this talk, we consider the combinatorial optimization problem Submodular Cover (SCP), which is to find a minimum cardinality subset of a finite universe U such that the value of a submodular function f is above an input threshold τ. SCP is NP-hard, and therefore the focus is on developing approximation algorithms to efficiently find an approximate solution. SCP can be viewed as the dual of the more widely studied problem of submodular maximization with a cardinality constraint. I will survey some existing approximation algorithms for SCP, and then describe some of my own papers on SCP, including SCP where queries to f are inexact, SCP where f is not assumed to be monotone, and a number of other variants.
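As a concrete reference point for the survey, the classical greedy algorithm for SCP, which repeatedly adds the element of largest marginal gain until the threshold τ is reached, can be sketched as follows; f is assumed to be monotone here, and the variants discussed in the talk (inexact queries, non-monotone f) require different techniques.

    # Classical greedy for Submodular Cover: grow S with the element of largest
    # marginal gain until f(S) >= tau. For integer-valued monotone submodular f,
    # this is the standard logarithmic-factor approximation; the talk's variants
    # (inexact oracles, non-monotone f) need more care than this sketch.
    def greedy_submodular_cover(f, universe, tau):
        S = set()
        while f(frozenset(S)) < tau:
            remaining = [e for e in universe if e not in S]
            if not remaining:
                return None   # tau is unreachable
            best = max(remaining,
                       key=lambda e: f(frozenset(S | {e})) - f(frozenset(S)))
            S.add(best)
        return S

    # Toy usage with a coverage function.
    sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}}
    f = lambda S: float(len(set().union(*(sets[i] for i in S))))
    print(greedy_submodular_cover(f, list(sets), tau=5))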
In this talk I'll present a proof that a simple greedy algorithm for finding a maximal independent set in a graph is also a randomized 2-approximation algorithm for weighted vertex cover. The unweighted version of the algorithm has existed for decades as a maximal independent set algorithm, but was not previously known to approximate vertex cover. This result leads to several new insights and simplified algorithms for different graph problems as corollaries. This includes a simple O(log n)-round parallel algorithm for vertex cover, simplified approximation algorithms for certain edge deletion problems, connections to a problem called correlation clustering, and insights for list heuristic algorithms for vertex cover. This will be a more detailed version of a talk recently presented at the Symposium on Simplicity in Algorithms (SOSA '24) (https://epubs.siam.org/doi/pdf/10.1137/1.9781611977936.32)
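To fix ideas, the unweighted random-order greedy maximal independent set procedure is sketched below; the complement of any maximal independent set is a vertex cover, and the talk concerns the randomized approximation guarantee such a procedure achieves for weighted vertex cover, whose exact ordering and weighting details are not reproduced here.

    # Random-order greedy maximal independent set (MIS): scan the vertices in a
    # uniformly random order and add a vertex if none of its neighbors has been
    # added. The complement of the resulting MIS is always a vertex cover; the
    # talk's contribution is the 2-approximation analysis for the weighted case,
    # which this unweighted sketch does not attempt to reproduce.
    import random

    def greedy_mis_and_cover(adj):
        # adj: dict mapping each vertex to the set of its neighbors
        order = list(adj)
        random.shuffle(order)
        independent = set()
        for v in order:
            if not (adj[v] & independent):
                independent.add(v)
        cover = set(adj) - independent
        return independent, cover

    adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
    print(greedy_mis_and_cover(adj))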
We introduce an alternative model for the design and analysis of strategyproof mechanisms that is motivated by the recent surge of work in "learning-augmented algorithms".
Aiming to complement the traditional approach in computer science, which analyzes the performance of algorithms based on worst-case instances, this line of work has focused
on the design and analysis of algorithms that are enhanced with machine-learned predictions regarding the optimal solution. The algorithms can use the predictions as a guide
to inform their decisions, and the goal is to achieve much stronger performance guarantees when these predictions are accurate (consistency), while also maintaining
near-optimal worst-case guarantees, even if these predictions are very inaccurate (robustness). So far, these results have been limited to algorithms.
We initiate the design and analysis of strategyproof mechanisms that are augmented with predictions regarding the private information of the participating agents.
To exhibit the important benefits of this approach, we revisit the canonical problem of strategic facility location. We propose new strategyproof mechanisms that leverage
predictions to guarantee an optimal trade-off between consistency and robustness guarantees. Furthermore, we also prove parameterized approximation results as a function of
the prediction error, showing that our mechanisms perform well even when the predictions are not fully accurate.
Joint work with Priyank Agrawal, Vasilis Gkatzelis, Tingting Ou, and Xizhi Tan.
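As a point of comparison without predictions, the canonical strategyproof mechanism for locating a single facility on a line under the sum-of-distances objective places the facility at a median of the reported locations; a minimal sketch is below. The prediction-augmented mechanisms from this work, and their consistency/robustness trade-offs, are not reproduced here.

    # Baseline without predictions: the median mechanism for single-facility
    # location on a line. Reporting truthfully is a dominant strategy, and a
    # median minimizes the sum of agents' distances; the talk's mechanisms add
    # a prediction of the agents' private locations on top of this starting point.
    def median_mechanism(reports):
        xs = sorted(reports)
        return xs[(len(xs) - 1) // 2]   # a (left) median; any median is optimal

    def social_cost(facility, locations):
        return sum(abs(facility - x) for x in locations)

    reports = [0.0, 0.2, 1.0, 4.0, 10.0]
    y = median_mechanism(reports)
    print(y, social_cost(y, reports))   # facility at 1.0 with total cost 13.8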
Given a collection of m sets from a universe U, the Maximum Set Coverage problem consists of finding k sets whose union has the largest cardinality.
This problem is NP-hard, but the solution can be approximated by a polynomial-time algorithm up to a factor of 1-1/e.
However, this algorithm does not scale well with the input size.
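For reference, the 1-1/e factor above is achieved by the classical greedy algorithm, which repeatedly picks the set covering the most yet-uncovered elements; a minimal, non-streaming sketch is below.

    # Classical greedy for Maximum Set Coverage: repeatedly pick the set that
    # covers the most yet-uncovered elements. This achieves the 1 - 1/e factor,
    # but each of the k rounds scans all m sets, which is what fails to scale.
    def greedy_max_coverage(sets, k):
        covered, chosen = set(), []
        for _ in range(k):
            best = max(range(len(sets)), key=lambda i: len(sets[i] - covered))
            if not sets[best] - covered:
                break                       # nothing left to gain
            chosen.append(best)
            covered |= sets[best]
        return chosen, covered

    sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 7}]
    print(greedy_max_coverage(sets, k=2))   # ([2, 0], {1, 2, 3, 4, 5, 6, 7})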
In a streaming context, practical high-quality solutions are found, but with space complexity that scales linearly with respect to the size of the universe n = |U|.
However, one randomized streaming algorithm has been shown to produce a 1-1/e-ε approximation of the optimal solution with a space complexity that scales
only poly-logarithmically with respect to m and n. In order to achieve such a low space complexity, the authors used two techniques in their multi-pass approach:
F0-sketching, which allows one to estimate, with great accuracy, the number of distinct elements in a set using less space than the set itself.
Subsampling, which consists of solving the problem only on a subspace of the universe; it is implemented using γ-independent hash functions.
This article focuses on the sublinear-space algorithm and highlights the time cost of these two techniques, especially subsampling. We present optimizations that
significantly reduce the time complexity of the algorithm. Firstly, we give some optimizations that do not alter the space complexity, number of passes and approximation
quality of the original algorithm. In particular, we reanalyze the error bounds to show that the original independence factor of Ω(ε^{-2} k log m) can be fine-tuned to Ω(k log m);
we also show how F0-sketching can be removed. Secondly, we derive a new lower bound on the probability of producing a 1-1/e-ε approximation using only pairwise independence:
1 - 4/(c k log m), compared to 1 - 2e/m^{ck/6} with Ω(k log m)-independence. Although the theoretical guarantees are weaker, suggesting that the approximation quality could
suffer on large streams, our algorithms perform well in practice. Finally, our experimental results show that even a pairwise-independent hash-function sampler does not produce
worse solutions than the original algorithm, while running several orders of magnitude faster.
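To make the subsampling step concrete, a standard (approximately) pairwise-independent family has the form h(x) = ((a·x + b) mod p) mod 2^j for a fixed prime p exceeding the universe size and random a, b; keeping only the elements that hash to 0 subsamples the universe at rate roughly 2^{-j}. A minimal sketch, with illustrative parameter names not taken from the article, is below.

    # Subsampling the universe with a hash family h(x) = ((a*x + b) mod p) mod 2**j:
    # an element is kept only if it hashes to 0, i.e. with probability roughly
    # 2**-j. Parameter names are illustrative and not those used in the article.
    import random

    def make_hash(p, j):
        a, b = random.randrange(p), random.randrange(p)
        return lambda x: ((a * x + b) % p) % (2 ** j)

    def subsample_instance(sets, p, j):
        h = make_hash(p, j)
        return [{x for x in s if h(x) == 0} for s in sets]

    # Universe elements are assumed to be integers smaller than the prime p.
    sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 7}]
    print(subsample_instance(sets, p=2**31 - 1, j=1))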
In graph clustering and hypergraph clustering, the goal is to partition a set of nodes into well-connected clusters in such a way that few edges (or hyperedges) are "cut", i.e., separated between clusters. Although graph models are more common, hypergraphs can directly encode multiway relationships like group social interactions, multiway biological interactions, or chemical reactions involving multiple substances. As a result, hypergraph clustering is more natural for many applications. However, this generalized problem is more nuanced since there are many different ways to partition a single hyperedge among clusters. Different ways of penalizing cut hyperedges may be more or less useful depending on the application, and this also leads to significant differences in computational complexity results when solving hypergraph clustering problems. This talk will introduce a generalized notion of the minimum hypergraph cut problem, along with efficient algorithms for certain variants and NP-hardness results for others. I'll then show how new algorithmic techniques for hypergraph cut problems can be incorporated into more sophisticated pipelines for hypergraph clustering. This leads to improved results in several different types of downstream data analysis tasks, such as finding related retail products in large e-commerce datasets, detecting groups of students from the same classroom based on group interaction data, and identifying related posts in large online Q&A forums.
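To illustrate the modeling point, the simplest penalty for a cut hyperedge is "all-or-nothing": a hyperedge pays 1 whenever its nodes do not all land in the same cluster, regardless of how they are split. A minimal sketch evaluating this objective for a given partition is below; the generalized splitting functions discussed in the talk refine exactly this per-hyperedge penalty.

    # All-or-nothing hypergraph cut value of a given partition: a hyperedge is
    # "cut" (penalty 1) if its nodes are not all assigned to the same cluster.
    # The generalized cut functions in the talk replace this 0/1 penalty with a
    # splitting function that depends on how the hyperedge is partitioned.
    def all_or_nothing_cut(hyperedges, labels):
        # hyperedges: list of node sets; labels: dict mapping node -> cluster id
        return sum(1 for e in hyperedges if len({labels[v] for v in e}) > 1)

    hyperedges = [{1, 2, 3}, {3, 4, 5}, {5, 6}]
    labels = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
    print(all_or_nothing_cut(hyperedges, labels))   # only {3, 4, 5} is cut -> 1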
Although k-means clustering is a well-studied problem, there exist strong theoretical limits on worst-case inputs. To overcome this barrier, we consider a scenario where “advice”
is provided to help perform clustering. We consider the k-means problem augmented with a predictor that, given any point, returns its cluster label in an approximately optimal
clustering up to some error. We present an algorithm whose performance improves along with the accuracy of the predictor, even though naively following the accurate predictor can
still lead to a high clustering cost. Thus if the predictor is sufficiently accurate, we can retrieve a close to optimal clustering with nearly optimal runtime, breaking known
computational barriers for algorithms that do not have access to such advice.
Based on joint work with Jon Ergun, Zhili Feng, Sandeep Silwal, and David P. Woodruff.
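For illustration, the naive baseline mentioned above, which follows the predictor directly by grouping points by predicted label and using each group's mean as a center, can be sketched as follows; as noted in the abstract, this alone can still incur a high clustering cost when labels are noisy, and the paper's algorithm uses a more error-robust aggregation that is not reproduced here.

    # Naive predictor-following baseline for k-means with advice: group the points
    # by predicted cluster label and use each group's coordinate-wise mean as a
    # center. As noted above, this can still be costly when the predictor errs;
    # the paper's algorithm aggregates the labels in a more error-robust way.
    from collections import defaultdict

    def centers_from_predictor(points, predictor):
        groups = defaultdict(list)
        for p in points:
            groups[predictor(p)].append(p)
        centers = []
        for pts in groups.values():
            dim = len(pts[0])
            centers.append(tuple(sum(p[d] for p in pts) / len(pts) for d in range(dim)))
        return centers   # one center per predicted label (at most k for a k-label predictor)

    points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.8)]
    predictor = lambda p: 0 if p[0] < 2 else 1   # stand-in for the learned predictor
    print(centers_from_predictor(points, predictor))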