Samson Zhou

		Samson Zhou

Hi, I'm an Assistant Professor in the Department of Computer Science & Engineering at Texas A&M University. My research lies at the intersections of theoretical computer science, data science, and machine learning. In particular, I am currently interested in numerical linear algebra, streaming algorithms, and differential privacy. Here are some of my favorite links.

Profiles: [dblp] [Scholar]
Feel free to reach me at:
samsonzhou AT gmail DOT com

Academic Positions and Education

Texas A&M University, Assistant Professor (August 2023-Present)
Simons Institute for the Theory of Computing, Sublinear Algorithms, Visting Scientist (Summer 2024)
UC Berkeley and Rice University, Postdoctoral Researcher, hosted by Jelani Nelson and Vladimir Braverman (September 2022-July 2023)
Carnegie Mellon University, Postdoctoral Researcher, hosted by David P. Woodruff (September 2019-August 2022)
Indiana University, Postdoctoral Researcher, hosted by Grigory Yaroslavtsev (August 2018-2019)
Purdue University, Postdoctoral Researcher, hosted by Jeremiah Blocki (Summer 2018)

Purdue University, Doctor of Philosophy, Computer Science, advised by Greg Frederickson and Elena Grigorescu (2018)
Massachusetts Institute of Technology, Master of Engineering, Computer Science (2011)
Massachusetts Institute of Technology, Bachelor of Science, Computer Science (2011)
Massachusetts Institute of Technology, Bachelor of Science, Mathematics (2010)

Teaching

Fall 2025: CSCE 411: Design and Analysis of Algorithms
Spring 2025: CSCE 411: Design and Analysis of Algorithms
Fall 2024: CSCE 411: Design and Analysis of Algorithms
Spring 2024: CSCE 658: Randomized Algorithms
Fall 2023: CSCE 689: Special Topics in Modern Algorithms for Data Science
Spring 2018: CS 584: Theory of Computation/Complexity Theory (Teaching Assistant)
Spring 2016: CS 381: Introduction to the Analysis of Algorithms (Teaching Assistant)
Fall 2015: CS 580: Algorithm Design and Analysis (Teaching Assistant, ACM Graduate TA Award)

Research Group

Postdoc: Chen Wang, co-hosted with Vladimir Braverman
Masters: Shenghao Xie

Service

I have the privilege of serving on the program committees of SOSA 2026. Please consider sending your papers!
April 14-16, 2025: Co-organizer of the Workshop on Algorithms for Large Data (Online), i.e., WALDO 2025
Fall 2023-Present: Problem-Solving Session Instructor at the TAMU Math Circle
Fall 2023-Present: Co-organizer of the Algorithms and Data Science Reading Group at Texas A&M University
August 19-21, 2024: Co-organizer of the TTIC Workshop: Learning-Augmented Algorithms
November 6-9, 2023: Co-organizer of the FOCS 2023 Workshop: Exploring the Frontiers of Adaptive Robustness
August 23-25, 2021: Co-organizer of the Workshop on Algorithms for Large Data (Online)
Fall 2018: Organizer of Convex Optimization Reading Group at Indiana University
Fall 2016-Fall 2017: Organizer of Theoretical CS Reading Group at Purdue University

Program committee/area chair: SOSA 2026, NeurIPS 2025, APPROX 2025, RANDOM 2025, ICML 2025, NeurIPS 2024, ESA 2024, ISAAC 2024, COLT 2024, COLT 2023, COLT 2022, RANDOM 2022
Reviewer:
- 2025: AISTATS, ESA, FOCS, ICALP, ICLR, IPCO, ISIT, ITCS, PODS, SODA, STOC
- 2024: AISTATS, COLT, ESA, FOCS, ICALP, ICDT, ICLR, ICML, INFOCOM, ISAAC, ISIT, ITCS, NeurIPS, PODS, RANDOM, SoCG, SODA, STACS, STOC
- 2023: COLT, ESA, FOCS, ICALP, ICML, ITCS, NeurIPS, PODS, SIGMETRICS, SODA, STACS, STOC
- 2022: AAAI, AISTATS, COLT, ICALP, ICLR, ICML, ITCS, NeurIPS, PODS, RANDOM, SODA, SPAA, STOC
- 2021: AAAI, CPM, ESA, ICDT, ISAAC, NeurIPS, PODS, SODA, STOC
- 2020: ESA, FOCS, IPCO, ISIT, ITCS, NeurIPS, PODS, RANDOM, SODA, STOC
- 2019: AISTATS, CSR, ESA, FC, FOCS, ICANN, ICML, ISAAC, ISIT, ITCS, NeurIPS, RANDOM, SODA, SPAA, STOC
- 2018: APPROX, ESA, ICALP, ISIT, LATIN, SODA, STOC
- 2017: CSR, FOCS, SPAA
- 2016: RANDOM
- Journals: Algorithmica, ACM Transactions on Algorithms, IEEE Transactions on Information Theory, SIAM Journal on Computing, SIAM Journal on Matrix Analysis and Applications, Information Processing Letters, Journal of Computer and System Sciences

Publications

On Fine-Grained Distinct Element Estimation
Ilias Diakonikolas, Daniel M. Kane, Jasper C.H. Lee, Thanasis Pittas, David P. Woodruff, Samson Zhou
ICML 2025 [abstract]

We study the problem of distributed distinct element estimation, where $\alpha$ servers each receive a subset of a universe $[n]$ and aim to compute a $(1+\varepsilon)$-approximation to the number of distinct elements using minimal communication. While prior work establishes a worst-case bound of $\Theta\left(\alpha\log n+\frac{\alpha}{\varepsilon^2}\right)$ bits, these results rely on assumptions that may not hold in practice. We introduce a new parameterization based on the number $C = \frac{\beta}{\varepsilon^2}$ of pairwise collisions, i.e., instances where the same element appears on multiple servers, and design a protocol that uses only $O\left(\alpha\log n\log\log n+\frac{\sqrt{\beta}}{\varepsilon^2} \log n\right)$ bits, breaking previous lower bounds when $C$ is small. We further improve our algorithm under assumptions on the number of distinct elements or collisions and provide matching lower bounds in all regimes, establishing $C$ as a tight complexity measure for the problem. Finally, we consider streaming algorithms for distinct element estimation parameterized by the number of items with frequency larger than $1$. Overall, our results offer insight into why statistical problems with known hardness results can be efficiently solved in practice.

Relative Error Fair Clustering in the Weak-Strong Oracle Model
Vladimir Braverman, Prathamesh Dharangutte, Shaofeng H.-C. Jiang, Hoai-An Nguyen, Chen Wang, Yubo Zhang, Samson Zhou
ICML 2025 [abstract]

We study fair clustering problems in a setting where distance information is obtained from two sources: a strong oracle providing exact distances, but at a high cost, and a weak oracle providing potentially inaccurate distance estimates at a low cost. The goal is to produce a near-optimal \emph{fair} clustering on $n$ input points with a minimum number of strong oracle queries. This models the increasingly common trade-off between accurate but expensive similarity measures (e.g., large-scale embeddings) and cheaper but inaccurate alternatives. The study of fair clustering in the model is motivated by the important quest of achieving fairness with the presence of inaccurate information. We achieve the first $(1+\varepsilon)$-coresets for fair $k$-median clustering using $\text{poly}\left(\frac{k}{\varepsilon}\cdot\log n\right)$ queries to the strong oracle. Furthermore, our results imply coresets for the standard setting (without fairness constraints), and we could in fact obtain $(1+\varepsilon)$-coresets for $(k,z)$-clustering for general $z=O(1)$ with a similar number of strong oracle queries. By contrast, previous results achieved a constant-factor $(>10)$ approximation for the standard $k$-clustering problems, and no previous work considered the fair $k$-median clustering problem.

Learning-Augmented Hierarchical Clustering
Vladimir Braverman, Jon C. Ergun, Chen Wang, Samson Zhou
ICML 2025 [abstract] [pdf]

Hierarchical clustering (HC) is an important data analysis technique in which the goal is to recursively partition a dataset into a tree-like structure while grouping together similar data points at each level of granularity. Unfortunately, for many of the proposed HC objectives, there exist strong barriers to approximation algorithms with the hardness of approximation. We consider the problem of hierarchical clustering given auxiliary information from natural oracles in the learning-augmented framework. Our main results are algorithms that given learning-augmented oracles, compute efficient approximate HC trees for the celebrated Dasgupta's and Moseley-Wang objectives that overcome known hardness barriers.

Perfect Sampling in Turnstile Streams Beyond Small Moments
David P. Woodruff, Shenghao Xie, Samson Zhou
PODS 2025 [abstract] [pdf]

Given a vector $x \in \mathbb{R}^n$ induced by a turnstile stream $S$, a function $G: \mathbb{R}_+ \to \mathbb{R}_+$, a perfect $G$-sampler outputs an index $i$ with probability $\frac{G(x_i)}{\sum_{j\in[n]} G(x_j)}+\frac{1}{\text{poly}(n)}$. Jayaram and Woodruff (FOCS 2018) introduced a perfect $L_p$-sampler, where $G(z)=z^p$, for $p\in(0,2]$. In this paper, we solve this problem for $p>2$ by a sampling-and-rejection method. Our algorithm runs in $n^{1-2/p} \cdot \text{polylog}(n)$ bits of space, which is tight up to polylogarithmic factors in $n$. Our algorithm also provides a $(1+\varepsilon)$-approximation to the sampled item $x_i$ with high probability using an additional $\varepsilon^{-2} n^{1-2/p} \cdot \text{polylog}(n)$ bits of space.

Interestingly, we show our techniques can be generalized to perfect polynomial samplers on turnstile streams, which is a class of functions that is not scale-invariant, in contrast to the existing perfect $L_p$ samplers. We also achieve perfect samplers for the logarithmic function $G(z)=\log(1+z)$ and the cap function $G(z)=\min(T,z^p)$. Finally, we give an application of our results to the problem of norm/moment estimation for a subset $\mathcal{Q}$ of coordinates of a vector, revealed only after the data stream is processed, e.g., when the set $\mathcal{Q}$ represents a range query, or the set $n\setminus\mathcal{Q}$ represents a collection of entities who wish for their information to be expunged from the dataset.

On Approximability of ℓ₂² Min-Sum Clustering
Karthik C.S., Euiwoong Lee, Yuval Rabani, Chris Schwiegelshohn, Samson Zhou
SoCG 2025 [abstract] [pdf]

The $\ell_2^2$ min-sum $k$-clustering problem is to partition an input set into clusters $C_1,\ldots,C_k$ to minimize $\sum_{i=1}^k\sum_{p,q\in C_i}\|p-q\|_2^2$. The objective is a density-based clustering and can be more effective than the traditional centroid-based clustering like $k$-median and $k$-means in capturing complex structures in data that may not be linearly separable, such as when the clusters have irregular, non-convex shapes or are overlapping. Although $\ell_2^2$ min-sum $k$-clustering is NP-hard, it is not known whether it is NP-hard to approximate $\ell_2^2$ min-sum $k$-clustering beyond a certain factor.

In this paper, we give the first hardness-of-approximation result for the $\ell_2^2$ min-sum $k$-clustering problem. We show that it is NP-hard to approximate the objective to a factor better than $1.056$ and moreover, assuming a balanced variant of the Johnson Coverage Hypothesis, it is NP-hard to approximate the objective to a factor better than $1.327$.

We then complement our hardness result by giving the first $(1+\varepsilon)$-coreset construction for $\ell_2^2$ min-sum $k$-clustering. Our coreset uses $\mathcal{O}\left(k^{\varepsilon^{-4}}\right)$ space and can be leveraged to achieve a polynomial-time approximation scheme with runtime $nd\cdot f(k,\varepsilon^{-1})$, where $d$ is the underlying dimension of the input dataset and $f$ is a fixed function. Our algorithm is faster than the previous PTAS and uses runtime linear in the size $n$ of the input.

Finally, we consider a learning-augmented setting, where the algorithm has access to an oracle that outputs a label $i\in[k]$ for input point, thereby implicitly partitioning the input dataset into $k$ clusters that induce an approximately optimal solution, up to some amount of adversarial error $\alpha\in\left[0,\frac{1}{2}\right)$. We give a polynomial-time algorithm that outputs a $\frac{1+\gamma\alpha}{(1-\alpha)^2}$-approximation to $\ell_2^2$ min-sum $k$-clustering, for a fixed constant $\gamma>0$. Therefore, our algorithm improves smoothly with the performance of the oracle and can be used to achieve approximation guarantees better than the NP-hard barriers for sufficiently accurate oracles.

Lifting Linear Sketches: Optimal Bounds and Adversarial Robustness
Elena Gribelyuk, Honghao Lin, David P. Woodruff, Huacheng Yu, Samson Zhou
STOC 2025 [abstract] [pdf]

We introduce a novel technique for ``lifting'' dimension lower bounds for linear sketches in the continuous setting to dimension lower bounds for linear sketches with polynomially-bounded integer entries when the input is a polynomially-bounded integer vector. Using this technique, we obtain the first optimal sketching lower bounds for discrete inputs in a data stream, for classical problems such as approximating the frequency moments, estimating the operator norm, and compressed sensing. Additionally, we lift the adaptive attack of Hardt and Woodruff (STOC, 2013) for breaking any real-valued linear sketch via a sequence of real-valued queries, and show how to obtain an attack on any integer-valued linear sketch using integer-valued queries. This shows that there is no linear sketch in a data stream with insertions and deletions that is adversarially robust for approximating any $L_p$ norm of the input. This resolves a central open question for adversarially robust streaming algorithms. To do so, we introduce a new pre-processing technique of independent interest which, given an integer-valued linear sketch, increases the dimension of the sketch by only a constant factor in order to make the orthogonal lattice to its row span smooth, enabling us to leverage results in lattice theory on discrete Gaussian distributions and reason that efficient discrete sketches imply efficient continuous sketches. Our work resolves open questions from the Banff '14 and '17 workshops on Communication Complexity and Applications, as well as the STOC '21 and FOCS '23 workshops on adaptivity and robustness.

Learning-Augmented Search Data Structures
Chunkai Fu, Brandon G. Nguyen, Jung Hoon Seo, Ryan Zesch, Samson Zhou
ICLR 2025 [abstract] [pdf]

We study the integration of machine learning advice to improve upon traditional data structure designed for efficient search queries. Although there has been recent effort in improving the performance of binary search trees using machine learning advice, e.g., Lin et. al. (ICML 2022), the resulting constructions nevertheless suffer from inherent weaknesses of binary search trees, such as complexity of maintaining balance across multiple updates and the inability to handle partially-ordered or high-dimensional datasets. For these reasons, we focus on skip lists and KD trees in this work. Given access to a possibly erroneous oracle that outputs estimated fractional frequencies for search queries on a set of items, we construct skip lists and KD trees that provably provides the optimal expected search time, within nearly a factor of two. In fact, our learning-augmented skip lists and KD trees are still optimal up to a constant factor, even if the oracle is only accurate within a constant factor. We also demonstrate robustness by showing that our data structures achieves an expected search time that is within a constant factor of an oblivious skip list/KD tree construction even when the predictions are arbitrarily incorrect. Finally, we empirically show that our learning-augmented search data structures outperforms their corresponding traditional analogs on both synthetic and real-world datasets.

Fair Submodular Cover
Wenjing Chen, Shuo Xing, Samson Zhou, Victoria G. Crawford
ICLR 2025 [abstract] [pdf]

Machine learning algorithms are becoming increasing prevalent in the modern world, and as a result there has been significant recent study into algorithmic fairness in order to minimize the possibility of unintentional bias or discrimination in these algorithms. Submodular optimization problems also arise in many machine learning applications, including those such as data summarization and clustering where fairness is an important concern. In this paper, we initiate the study of the Fair Submodular Cover Problem (FSC). Given a ground set $U$, a monotone submodular function $f:2^U\to\mathbb{R}_{\ge 0}$, and a threshold $\tau$, the goal of FSC is to find a balanced subset of $U$ with minimum cardinality such that $f(S)\ge\tau$. We first introduce discrete algorithms for FSC that achieve a bicriteria approximation ratio of $(\frac{1}{\varepsilon}, 1-O(\varepsilon))$. We then present a continuous algorithm that achieves a $(\ln\frac{1}{\varepsilon}, 1-O(\varepsilon))$-bicriteria approximation ratio, which matches the best approximation guarantee of submodular cover without a fairness constraint. Finally, we complement our theoretical results with a number of empirical evaluations that demonstrate the efficiency of our algorithms on instances of maximum coverage.

On the Price of Differential Privacy for Hierarchical Clustering
Chenyuan Deng, Jie Gao, Jalaj Upadhyay, Chen Wang, Samson Zhou
ICLR 2025 [abstract] [pdf]

Hierarchical clustering is a fundamental unsupervised machine learning task with the aim of organizing data by a hierarchy of clusters. Unfortunately, many applications of hierarchical clustering involve sensitive user information, but recent results show that any privacy-preserving algorithm under the edge-level differential privacy necessarily suffers a large error. Thus we consider a weight privacy model where each edge of the input graph is at least unit weight. We present a novel algorithm in the weight privacy model that shows significantly better error than known impossibility results in the edge-level DP setting. In particular, our algorithm achieves $O(\log^{1.5}n/\eps)$ multiplicative error for $\eps$-DP and runs in polynomial time, where $n$ is the size of the input graph. We complement our algorithm by showing if we are not given the unit-weight constraint, the lower bound for weight-level DP hierarchical clustering is essentially the same as the edge-level DP, i.e. $\Omega(n^2/\varepsilon)$ additive error. As a result, we also obtain a new lower bound of $\tilde{\Omega}(1/\eps)$ additive error for balanced sparsest cuts in the weight-level DP model, which may be of independent interest.

Fair Clustering in the Sliding Window Model
Vincent Cohen-Addad, Shaofeng H.-C. Jiang, Qiaoyuan Yang, Yubo Zhang, Samson Zhou
ICLR 2025 (selected for spotlight presentation) [abstract] [pdf]

We study streaming algorithms for proportionally fair clustering (a notion originally suggested by Chierichetti et al. (2017) in the sliding window model. We show that although there exist efficient streaming algorithms exist in the insertion-only model, surprisingly no algorithm can achieve finite ratio without violating the fairness constraint in sliding window. Hence, the problem of fair clustering is a rare separation between the insertion-only streaming model and the sliding window model. On the other hand, we show that if the fairness constraint \textcolor{blue}{is relaxed} by a multiplicative $\varepsilon$ factor, there exists a $(1 + \varepsilon)$-approximate sliding window algorithm that uses $\text{poly}(k\varepsilon^{-1}\log n)$ space. This achieves essentially the best parameters (up to degree in the polynomial) provided the aforementioned lower bound. We also implement a number of empirical evaluations on real datasets to complement our theoretical results.

Adversarially Robust Dense-Sparse Tradeoffs via Heavy-Hitters
David P. Woodruff, Samson Zhou
NeurIPS 2024 [abstract] [pdf]

In the adversarial streaming model, the input is a sequence of adaptive updates that defines an underlying dataset and the goal is to approximate, collect, or compute some statistic while using space sublinear in the size of the dataset. In 2022, Ben-Eliezer, Eden, and Onak showed a dense-sparse trade-off technique that elegantly combined sparse recovery with known techniques using differential privacy and sketch switching to achieve adversarially robust algorithms for $L_p$ estimation and other algorithms on turnstile streams. However, there has been no progress since, either in terms of achievability or impossibility. In this work, we first give improved algorithms for adversarially robust $L_p$-heavy hitters, utilizing deterministic turnstile heavy-hitter algorithms with better tradeoffs. We then utilize our heavy-hitter algorithm to reduce the problem to estimating the frequency moment of the tail vector. We give a new algorithm for this problem in the classical streaming setting, which achieves additive error and uses space independent in the size of the tail. We then leverage these ingredients to give an improved algorithm for adversarially robust $L_p$ estimation on turnstile streams. We believe that our results serve as an important conceptual breakthrough, demonstrating that there is no inherent barrier at the previous state-of-the-art.

On Socially Fair Low-Rank Approximation and Column Subset Selection
Zhao Song, Ali Vakilian, David P. Woodruff, Samson Zhou
NeurIPS 2024 [abstract] [pdf]

Low-rank approximation and column subset selection are two fundamental and related problems that are applied across a wealth of machine learning applications. In this paper, we study the question of socially fair low-rank approximation and socially fair column subset selection, where the goal is to minimize the loss over all sub-populations of the data. We show that surprisingly, even constant-factor approximation to fair low-rank approximation requires exponential time under certain standard complexity hypotheses. On the positive side, we give an algorithm for fair low-rank approximation that, for a constant number of groups and constant-factor accuracy, runs in $2^{\text{poly}(k)}$ time rather than the na\"{i}ve $n^{\text{poly}(k)}$, which is a substantial improvement when the dataset has a large number $n$ of observations. We then show that there exist bicriteria approximation algorithms for fair low-rank approximation and fair column subset selection that run in polynomial time.

A Strong Separation for Adversarially Robust ℓ₀ Estimation for Linear Sketches
Elena Gribelyuk, Honghao Lin, David P. Woodruff, Huacheng Yu, Samson Zhou
FOCS 2024 [abstract] [pdf]

The majority of streaming problems are defined and analyzed in a static setting, where the data stream is fixed in advance to be a fixed sequence of insertions and deletions. However, many real-world applications require a more flexible model, where an adaptive adversary may select future stream elements after observing the previous outputs of the algorithm. Over the last few years, there has been increased interest in proving lower bounds for natural problems in the adaptive streaming model. In this work, we give the first known adaptive attack against linear sketches for the well-studied $\ell_0$-estimation problem over turnstile, integer streams. For any linear streaming algorithm $\mathcal{A}$ which uses sketching matrix $\mathbf{A}\in \mathbb{Z}^{r \times n}$, this attack makes $\tilde{\mathcal{O}}(r^8)$ queries and succeeds with high constant probability in breaking the sketch. Additionally, we give an adaptive attack against linear sketches for the $\ell_0$-estimation problem over finite fields $\mathbb{F}_p$, which requires a smaller number of $\tilde{\mathcal{O}}(r^3)$ queries. Our results provide an exponential improvement over the previous number of queries known to break an $\ell_0$-estimation sketch.

Private Vector Mean Estimation in the Shuffle Model: Optimal Rates Require Many Messages
Hilal Asi, Vitaly Feldman, Jelani Nelson, Huy L. Nguyễn, Kunal Talwar, Samson Zhou
ICML 2024 [abstract] [pdf]

We study the problem of private vector mean estimation in the shuffle model of privacy where $n$ users each have a unit vector $v^{(i)} \in\mathbb{R}^d$. We propose a new multi-message protocol that achieves the optimal error using $\tilde{\mathcal{O}}\left(\min(n\varepsilon^2,d)\right)$ messages per user. Moreover, we show that any (unbiased) protocol that achieves optimal error requires each user to send $\Omega(\min(n\varepsilon^2,d)/\log(n))$ messages, demonstrating the optimality of our message complexity up to logarithmic factors.

Additionally, we study the single-message setting and design a protocol that achieves mean squared error $\mathcal{O}(dn^{d/(d+2)}\varepsilon^{-4/(d+2)})$. Moreover, we show that \emph{any} single-message protocol must incur mean squared error $\Omega(dn^{d/(d+2)})$, showing that our protocol is optimal in the standard setting where $\varepsilon = \Theta(1)$. Finally, we study robustness to malicious users and show that malicious users can incur large additive error with a single shuffler.

Streaming Algorithms with Few State Changes
Rajesh Jayaram, David P. Woodruff, Samson Zhou
PODS 2024 [abstract] [pdf]

In this paper, we study streaming algorithms that minimize the number of changes made to their internal state (i.e., memory contents). While the design of streaming algorithms typically focuses on minimizing space and update time, these metrics fail to capture the asymmetric costs, inherent in modern hardware and database systems, of reading versus writing to memory. In fact, most streaming algorithms write to their memory on \emph{every update}, which is undesirable when writing is significantly more expensive than reading. This raises the question of whether streaming algorithms with small space \emph{and} number of memory writes are possible.

We first demonstrate that, for the fundamental $F_p$ moment estimation problem with $p\ge 1$, any streaming algorithm that achieves a constant factor approximation must make $\Omega(n^{1-1/p})$ internal state changes, regardless of how much space it uses. Perhaps surprisingly, we show that this lower bound can be matched by an algorithm which also has near-optimal space complexity. Specifically, we give a $(1+\varepsilon)$-approximation algorithm for $F_p$ moment estimation that use a near-optimal $\widetilde{\mathcal{O}}_\varepsilon(n^{1-1/p})$ number of state changes, while simultaneously achieving near-optimal space, i.e., for $p\in[1,2)$, our algorithm uses $\text{poly}\left(\log n,\frac{1}{\varepsilon}\right)$ bits of space for, while for $p>2$, the algorithm uses $\widetilde{\mathcal{O}}_\varepsilon(n^{1-1/p})$ space. We similarly design streaming algorithms that are simultaneously near-optimal in both space complexity and the number of state changes for the heavy-hitters problem, sparse support recovery, and entropy estimation. Our results demonstrate that an optimal number of state changes can be achieved without sacrificing space complexity.

Near-Optimal k-Clustering in the Sliding Window Model
David P. Woodruff, Peilin Zhong, Samson Zhou
NeurIPS 2023 [abstract] [poster] [pdf]

Clustering is an important technique for identifying structural information in large-scale data analysis, where the underlying dataset may be too large to store. In many applications, recent data can provide more accurate information and thus older data past a certain time is expired. The sliding window model captures these desired properties and thus there has been substantial interest in clustering in the sliding window model.

In this paper, we give the first algorithm that achieves near-optimal $(1+\varepsilon)$-approximation to $(k,z)$-clustering in the sliding window model. Our algorithm uses $\frac{k}{\min(\varepsilon^4,\varepsilon^{2+z})}\,\text{polylog}\frac{n\Delta}{\varepsilon}$ words of space when the points are from $[\Delta]^d$, thus significantly improving on works by Braverman et. al. (SODA 2016), Borassi et. al. (NeurIPS 2021), and Epasto et. al. (SODA 2022).

Along the way, we develop a data structure for clustering called an online coreset, which outputs a coreset not only for the end of a stream, but also for all prefixes of the stream. Our online coreset samples $\frac{k}{\min(\varepsilon^4,\varepsilon^{2+z})}\,\text{polylog}\frac{n\Delta}{\varepsilon}$ points from the stream. We then show that any online coreset requires $\Omega\left(\frac{k}{\varepsilon^2}\log n\right)$ samples, which shows a separation between the problem of constructing an offline coreset, i.e., constructing online coresets is strictly harder. Our results also extend to general metrics on $[\Delta]^d$ and are near-optimal in light of a $\Omega\left(\frac{k}{\varepsilon^{2+z}}\right)$ lower bound for the size of an offline coreset.

Streaming Algorithms for Learning with Experts: Deterministic Versus Robust
David P. Woodruff, Fred Zhang, Samson Zhou
NeurIPS 2023 [abstract] [pdf]

In the online learning with experts problem, an algorithm must make a prediction about an outcome on each of $T$ days (or times), given a set of $n$ experts who make predictions on each day (or time). The algorithm is given feedback on the outcomes of each day, including the cost of its prediction and the cost of the expert predictions, and the goal is to make a prediction with the minimum cost, specifically compared to the best expert in the set. Recent work by Srinivas, Woodruff, Xu, and Zhou (STOC 2022) introduced the study of the online learning with experts problem under memory constraints.

However, often the predictions made by experts or algorithms at some time influence future outcomes, so that the input is adaptively chosen. Whereas deterministic algorithms would be robust to adaptive inputs, existing algorithms all crucially use randomization to sample a small number of experts.

In this paper, we study deterministic and robust algorithms for the experts problem. We first show a space lower bound of $\widetilde{\Omega}\left(\frac{nM}{RT}\right)$ for any deterministic algorithm that achieves regret $R$ when the best expert makes $M$ mistakes. Our result shows that the natural deterministic algorithm, which iterates through pools of experts until each expert in the pool has erred, is optimal up to polylogarithmic factors. On the positive side, we give a randomized algorithm that is robust to adaptive inputs that uses $\widetilde{O}\left(\frac{n}{R\sqrt{T}}\right)$ space for $M=O\left(\frac{R^2 T}{\log^2 n}\right)$, thereby showing a smooth space-regret trade-off.

Streaming Euclidean k-median and k-means with o(log n) Space
Vincent Cohen-Addad, David P. Woodruff, Samson Zhou
FOCS 2023 [abstract] [video] [slides] [pdf]

We consider the classic Euclidean $k$-median and $k$-means objective on data streams, where the goal is to provide a $(1+\varepsilon)$-approximation to the optimal $k$-median or $k$-means solution, while using as little memory as possible. Over the last 20 years, clustering in data streams has received a tremendous amount of attention and has been the test-bed for a large variety of new techniques, including coresets, the merge-and-reduce framework, bicriteria approximation, sensitivity sampling, and so on. Despite this intense effort to obtain smaller sketches for these problems, all known techniques require storing at least $\Omega(\log(n\Delta))$ words of memory, where $n$ is the size of the input and $\Delta$ is the aspect ratio. A natural question is if one can beat this logarithmic dependence on $n$ and $\Delta$. In this paper, we break this barrier by first giving an insertion-only streaming algorithm that achieves a $(1+\varepsilon)$-approximation to the more general $(k,z)$-clustering problem, using $\tilde{\mathcal{O}}\left(\frac{dk}{\varepsilon^2}\right)\cdot(2^{z\log z})\cdot\min\left(\frac{1}{\varepsilon^z},k\right)\cdot\text{poly}(\log\log(n\Delta))$ words of memory. Our techniques can also be used to achieve two-pass algorithms for $k$-median and $k$-means clustering on dynamic streams using $\tilde{\mathcal{O}}\left(\frac{1}{\varepsilon^2}\right)\cdot\text{poly}(d,k,\log\log(n\Delta))$ words of memory.

How to Make Your Approximation Algorithm Private: A Black-Box Differentially-Private Transformation for Tunable Approximation Algorithms of Functions with Low Sensitivity
Jeremiah Blocki, Elena Grigorescu, Tamalika Mukherjee, Samson Zhou
RANDOM 2023 [abstract] [pdf]

We develop a framework for efficiently transforming certain approximation algorithms into differentially-private variants, in a black-box manner. Specifically, our results focus on algorithms $A$ that output an approximation to a function $f$ of the form $(1-\alpha)f(x)-\kappa\leq A(x) \leq (1+\alpha)f(x)+\kappa$, where $\alpha\in [0,1)$ is a parameter that can be``tuned" to small-enough values while incurring only a polynomial blowup in the running time/space. We show that such algorithms can be made differentially private without sacrificing accuracy, as long as the function $f$ has small ``global sensitivity''. We achieve these results by applying the ``smooth sensitivity'' framework developed by Nissim, Raskhodnikova, and Smith (STOC 2007). Our framework naturally applies to transform non-private FPRAS (resp. FPTAS) algorithms into $(\varepsilon,\delta)$-differentially private (resp. $\varepsilon$-differentially private) approximation algorithms. We apply our framework in the context of sublinear-time and sublinear-space algorithms, while preserving the nature of the algorithm in meaningful ranges of the parameters. Our results include the first (to the best of our knowledge) $(\eps,\delta)$-edge differentially-private sublinear-time algorithm for estimating the number of triangles, the number of connected components, and the weight of a minimum spanning tree of a graph, as well as a more efficient algorithm (while sacrificing pure DP in contrast to previous results) for estimating the average degree of a graph. In the area of streaming algorithms, our results include $(\eps,\delta)$-DP algorithms for estimating $L_p$-norms, distinct elements, and weighted minimum spanning tree for both insertion-only and turnstile streams. Our transformation also provides a private version of the smooth histogram framework, which is commonly used for converting streaming algorithms into sliding window variants, and achieves a multiplicative approximation to many problems, such as estimating $L_p$-norms, distinct elements, and the length of the longest increasing subsequence.

Private Data Stream Analysis for Universal Symmetric Norm Estimation
Vladimir Braverman, Joel Manning, Zhiwei Steven Wu, Samson Zhou
RANDOM 2023 [abstract] [slides] [pdf]
FORC 2022 (non-archival track)

We study how to release summary statistics on a data stream subject to the constraint of differential privacy. In particular, we focus on releasing the family of \emph{symmetric norms, which are invariant under sign-flips and coordinate-wise permutations on an input data stream and include $L_p$ norms, $k$-support norms, top-$k$ norms, and the box norm as special cases}. Although it may be possible to design and analyze a separate mechanism for each symmetric norm, we propose a general parametrizable framework that differentially privately releases a number of sufficient statistics from which the approximation of all symmetric norms can be simultaneously computed. Our framework partitions the coordinates of the underlying frequency vector into different levels based on their magnitude and releases approximate frequencies for the ``heavy'' coordinates in important levels and releases approximate level sizes for the ``light'' coordinates in important levels. Surprisingly, our mechanism allows for the release of an \emph{arbitrary} number of symmetric norm approximations without any overhead or additional loss in privacy. Moreover, our mechanism permits $(1+\alpha)$-approximation to each of the symmetric norms and can be implemented using sublinear space in the streaming model for many regimes of the accuracy and privacy parameters.

Fast (1+ε)-Approximation Algorithms for Binary Matrix Factorization
Ameya Velingker, Maximilian Vötsch, David P. Woodruff, Samson Zhou
ICML 2023 [abstract] [pdf]

We introduce efficient $(1+\varepsilon)$-approximation algorithms for the binary matrix factorization (BMF) problem, where the inputs are a matrix $\mathbf{A}\in\{0,1\}^{n\times d}$, a rank parameter $k>0$, and an accuracy parameter $\varepsilon>0$, and the goal is to approximate $\mathbf{A}$ by outputting factors $\mathbf{U}\in\{0,1\}^{n\times k}$ and $\mathbf{V}\in\{0,1\}^{k\times d}$ that minimize the Frobenius loss $\|\mathbf{A}-\mathbf{U}\mathbf{V}\|_F$. Currently, the state-of-the-art for this problem is the approximation algorithm of Kumar \etal~[ICML 2019], which achieves a $C$-approximation for some constant $C\ge 576$. We give the first $(1+\varepsilon)$-approximation algorithm using runtime singly exponential in $k$, which is typically a small integer. Our techniques generalize to other common variants of the BMF problem, admitting bicriteria $(1+\varepsilon)$-approximation algorithms for $L_p$ loss functions, as well as the setting where matrix operations are performed in $\mathbb{F}_2$. Our approach can be implemented in standard big data models, such as the streaming model or the distributed model.

Provable Data Subset Selection For Efficient Neural Network Training
Murad Tukan, Samson Zhou, Alaa Maalouf, Daniela Rus, Vladimir Braverman, Dan Feldman
ICML 2023 [abstract] [pdf]

Radial basis function neural networks (\emph{RBFNN}) are {well-known} for their capability to approximate any continuous function on a closed bounded set with arbitrary precision given enough hidden neurons. In this paper, we introduce the first algorithm to construct coresets for \emph{RBFNNs}, i.e., small weighted subsets that approximate the loss of the input data on any radial basis function network and thus approximate any function defined by an \emph{RBFNN} on the larger input data. In particular, we construct coresets for radial basis and Laplacian loss functions. We then use our coresets to obtain a provable data subset selection algorithm for training deep neural networks. Since our coresets approximate every function, they also approximate the gradient of each weight in a neural network, which is a particular function on the input. We then perform empirical evaluations on function approximation and dataset subset selection on popular network architectures and data sets, demonstrating the efficacy and accuracy of our coreset construction.

Differentially Private Aggregation via Imperfect Shuffling
Badih Ghazi, Ravi Kumar, Pasin Manurangsi, Jelani Nelson, Samson Zhou
ITC 2023 [abstract] [pdf]
FORC 2023 (non-archival track)

In this paper, we introduce the imperfect shuffle differential privacy model, where messages sent from users are shuffled in an \emph{almost} uniform manner before being observed by a curator for private aggregation. We then consider the private summation problem. We show that the standard split-and-mix protocol by Ishai et. al. [FOCS 2006] can be adapted to achieve near-optimal utility bounds in the imperfect shuffle model. Specifically, we show that surprisingly, there is no additional error overhead necessary in the imperfect shuffle model, not even the small additive overhead in the privacy amplification framework.

Selective Experience Replay Compression using Coresets for Lifelong Deep Reinforcement Learning in Medical Imaging
Guangyao Zheng, Samson Zhou, Vladimir Braverman, Michael A. Jacobs, Vishwa S. Parekh
MIDL 2023 [abstract] [pdf]

Selective experience replay is a popular strategy for integrating lifelong learning with deep reinforcement learning. Selective experience replay aims to recount selected experiences from previous tasks to avoid catastrophic forgetting. Furthermore, selective experience replay based techniques are model agnostic and allow experiences to be shared across different models. However, storing experiences from all previous tasks make lifelong learning using selective experience replay computationally very expensive and impractical as the number of tasks increase. To that end, we propose a reward distribution-preserving coreset compression technique for compressing experience replay buffers stored for selective experience replay. We evaluated the coreset lifelong deep reinforcement learning technique on the brain tumor segmentation (BRATS) dataset for the task of ventricle localization and on the whole-body MRI for localization of left knee cap, left kidney, right trochanter, left lung, and spleen. The coreset lifelong learning models trained on a sequence of 10 different brain MR imaging environments demonstrated excellent performance localizing the ventricle with a mean pixel error distance of 12.93, 13.46, 17.75, and 18.55 for the compression ratios of 10x, 20x, 30x, and 40x, respectively. In comparison, the conventional lifelong learning model localized the ventricle with a mean pixel distance of 10.87. Similarly, the coreset lifelong learning models trained on whole-body MRI demonstrated no significant difference (p=0.28) between the 10x compressed coreset lifelong learning models and conventional lifelong learning models for all the landmarks. The mean pixel distance for the 10x compressed models across all the landmarks was 25.30, compared to 19.24 for the conventional lifelong learning models. Our results demonstrate that the potential of the coreset-based ERB compression method for compressing experiences without a significant drop in performance.

On Differential Privacy and Adaptive Data Analysis with Bounded Space
Itai Dinur, Uri Stemmer, David P. Woodruff, Samson Zhou
EUROCRYPT 2023 [abstract] [slides] [pdf]

We study the space complexity of the two related fields of {\em differential privacy} and {\em adaptive data analysis}. Specifically,
1. Under standard cryptographic assumptions, we show that there exists a problem $P$ that requires exponentially more space to be solved efficiently with differential privacy, compared to the space needed without privacy. To the best of our knowledge, this is the first separation between the space complexity of private and non-private algorithms.
2. The line of work on adaptive data analysis focuses on understanding the number of {\em samples} needed for answering a sequence of adaptive queries. We revisit previous lower bounds at a foundational level, and show that they are a consequence of a space bottleneck rather than a sampling bottleneck. To obtain our results, we define and construct an encryption scheme with multiple keys that is built to withstand a limited amount of key leakage in a very particular way.

Sub-quadratic Algorithms for Kernel Matrices via Kernel Density Estimation
Ainesh Bakshi, Praneeth Kacham, Piotr Indyk, Sandeep Silwal, Samson Zhou
ICLR 2023 (selected for spotlight presentation) [abstract] [pdf]

Kernel matrices, as well as weighted graphs represented by them, are ubiquitous objects in machine learning, statistics and other related fields. The main drawback of using kernel methods (learning and inference using kernel matrices) is efficiency -- given $n$ input points, most kernel-based algorithms need to materialize the full $n \times n$ kernel matrix before performing any subsequent computation, thus incurring $\Omega(n^2)$ runtime. Breaking this quadratic barrier for various problems has therefore, been a subject of extensive research efforts. We break the quadratic barrier and obtain \emph{subquadratic} time algorithms for several fundamental linear-algebraic and graph processing primitives, including approximating the top eigenvalue and eigenvector, spectral sparsification, solving linear systems, local clustering, low-rank approximation, arboricity estimation and counting weighted triangles. We build on the recently developed Kernel Density Estimation framework, which (after preprocessing in time subquadratic in $n$) can return estimates of row/column sums of the kernel matrix. In particular, we develop efficient reductions from \emph{weighted vertex} and \emph{weighted edge sampling} on kernel graphs, \emph{simulating random walks} on kernel graphs, and \emph{importance sampling} on matrices to Kernel Density Estimation and show that we can generate samples from these distributions in \emph{sublinear} (in the support of the distribution) time. Our reductions are the central ingredient in each of our applications and we believe they may be of independent interest. We empirically demonstrate the efficacy of our algorithms on low-rank approximation (LRA) and spectral sparsification, where we observe a $\textbf{9x}$ decrease in the number of kernel evaluations over baselines for LRA and a $\textbf{41x}$ reduction in the graph size for spectral sparsification.

Differentially Private L₂-Heavy Hitters in the Sliding Window Model
Jeremiah Blocki, Seunghoon Lee, Tamalika Mukherjee, Samson Zhou
ICLR 2023 (selected for spotlight presentation) [abstract] [pdf]

The data management of large companies often prioritize more recent data, as a source of higher accuracy prediction than outdated data. For example, the Facebook data policy retains user search histories for $6$ months while the Google data retention policy states that browser information may be stored for up to $9$ months. These policies are captured by the sliding window model, in which only the most recent $W$ statistics form the underlying dataset. In this paper, we consider the problem of privately releasing the $L_2$-heavy hitters in the sliding window model, which include $L_p$-heavy hitters for $p\le 2$ and in some sense are the strongest possible guarantees that can be achieved using polylogarithmic space, but cannot be handled by existing techniques due to the sub-additivity of the $L_2$ norm. Moreover, existing non-private sliding window algorithms use the smooth histogram framework, which has high sensitivity. To overcome these barriers, we introduce the first differentially private algorithm for $L_2$-heavy hitters in the sliding window model by initiating a number of $L_2$-heavy hitter algorithms across the stream with significantly lower threshold. Similarly, we augment the algorithms with an approximate frequency tracking algorithm with significantly higher accuracy. We then use smooth sensitivity and statistical distance arguments to show that we can add noise proportional to an estimation of the $L_2$ norm. To the best of our knowledge, our techniques are the first to privately release statistics that are related to a sub-additive function in the sliding window model, and may be of independent interest to future differentially private algorithmic design in the sliding window model.

Robust Algorithms on Adaptive Inputs from Bounded Adversaries
Yeshwanth Cherapanamjeri, Sandeep Silwal, David P. Woodruff, Fred Zhang, Qiuyi Zhang, Samson Zhou
ICLR 2023 [abstract] [pdf]

We study dynamic algorithms robust to adaptive input generated from sources with bounded capabilities, such as sparsity or limited interaction. For example, we consider robust linear algebraic algorithms when the updates to the input are sparse but given by an adversary with access to a query oracle. We also study robust algorithms in the standard centralized setting, where an adversary queries an algorithm in an adaptive manner, but the number of interactions between the adversary and the algorithm is bounded. We first recall a unified framework of [HKM+20,BKM+22,ACS+23] for answering $Q$ adaptive queries that incurs $\widetilde{\mathcal{O}}(\sqrt{Q})$ overhead in space, which is roughly a quadratic improvement over the na\"{i}ve implementation, and only incurs a logarithmic overhead in query time. Although the general framework has diverse applications in machine learning and data science, such as adaptive distance estimation, kernel density estimation, linear regression, range queries, and point queries and serves as a preliminary benchmark, we demonstrate even better algorithmic improvements for (1) reducing the pre-processing time for adaptive distance estimation and (2) permitting an unlimited number of adaptive queries for kernel density estimation. Finally, we complement our theoretical results with additional empirical evaluations.

Near-Linear Sample Complexity for L_p Polynomial Regression
Raphael A. Meyer, Cameron Musco, Christopher Musco, David P. Woodruff, Samson Zhou
SODA 2023 [abstract] [slides] [pdf]

We study $L_p$ polynomial regression. Given query access to a function $f:[-1,1] \rightarrow \R$, the goal is to find a degree $d$ polynomial $\widehat{q}$ such that, for a given parameter $\varepsilon > 0$, \begin{align}\label{first} \|\widehat{q}-f\|_p\le (1+\epsilon) \cdot\min_{q:\deg(q)\le d}\|q-f\|_p. \end{align} Here $\norm{\cdot}_p$ is the $L_p$ norm, $\norm{g}_p = (\int_{-1}^1 |g(t)|^p dt)^{1/p}$. We show that querying $f$ at points randomly drawn from the Chebyshev measure on $[-1,1]$ is a near-optimal strategy for polynomial regression in all $L_p$ norms. In particular, to output $\hat q$ satisfying \eqref{first}, it suffices to sample $O(\frac{dp^4\polylog d}{\poly\,\varepsilon})$ points from $[-1,1]$ with probabilities proportional to this measure. While polynomial regression is well understood for $L_2$ and $L_\infty$, no prior work explicitly studies polynomial regression for other values of $p$ without strong assumptions like having bounded noise. Naively generalizing techniques from prior works would have a higher runtime than our algorithm. Further, they would either only give results for constant $\varepsilon$, or require a suboptimal $\Omega(d^2)$ sample complexity for $p > 2$. One of our main technical contributions is to provide explicit bounds on the \textit{$L_p$ Lewis weight function} of an infinite linear operator underlying the polynomial regression problem. Using tools from the orthogonal polynomial literature, we show that this function is closely related to the Chebyshev density. Our approach advances prior work, which studies explicit bounds on the $L_2$ leverage scores of infinite linear operators. A second contribution is to prove tighter bounds on $L_p$ Lewis weight sampling for the polynomial operator than hold for general linear operators or matrices.

Optimal Algorithms for Linear Algebra in the Current Matrix Multiplication Time
Yeshwanth Cherapanamjeri, Sandeep Silwal, David P. Woodruff, Samson Zhou
SODA 2023 [abstract] [pdf]

We study fundamental problems in linear algebra, such as finding a maximal linearly independent subset of rows or columns (a basis), solving linear regression, or computing a subspace embedding. For these problems, we consider input matrices $\mathbf{A}\in\mathbb{R}^{n\times d}$ with $n > d$. The input can be read in $\text{nnz}(\mathbf{A})$ time, which denotes the number of nonzero entries of $\mathbf{A}$. In this paper, we show that beyond the time required to read the input matrix, these fundamental linear algebra problems can be solved in $d^{\omega}$ time, i.e., where $\omega \approx 2.37$ is the current matrix-multiplication exponent. To do so, we introduce a constant-factor subspace embedding with the optimal $m=\mathcal{O}(d)$ number of rows, and which can be applied in time $\mathcal{O}\left(\frac{\text{nnz}(\mathbf{A})}{\alpha}\right) + d^{2 + \alpha}\text{poly}(\log d)$ for any trade-off parameter $\alpha>0$, tightening a recent result by Chepurko et. al. [SODA 2022] that achieves an $\exp(\text{poly}(\log\log n))$ distortion with $m=d\cdot\text{poly}(\log\log d)$ rows in $\mathcal{O}\left(\frac{\text{nnz}(\mathbf{A})}{\alpha}+d^{2+\alpha+o(1)}\right)$ time. Our subspace embedding uses a recently shown property of {\it stacked} Subsampled Randomized Hadamard Transforms (SRHT), which actually increase the input dimension, to ``spread'' the mass of an input vector among a large number of coordinates, followed by random sampling. To control the effects of random sampling, we use fast semidefinite programming to reweight the rows. We then use our constant-factor subspace embedding to give the first optimal runtime algorithms for finding a maximal linearly independent subset of columns, regression, and leverage score sampling. To do so, we also introduce a novel subroutine that iteratively grows a set of independent rows, which may be of independent interest.

Learning-Augmented Algorithms for Online Linear and Semidefinite Programming
Elena Grigorescu, Young-San Lin, Sandeep Silwal, Maoyuan Song, Samson Zhou
NeurIPS 2022 [abstract] [pdf]

Semidefinite programming (SDP) is a unifying framework that generalizes both linear programming and quadratically-constrained quadratic programming, while also yielding efficient solvers, both in theory and in practice. However, there exist known impossibility results for approximating the optimal solution when constraints for covering SDPs arrive in an online fashion. In this paper, we study online covering linear and semidefinite programs in which the algorithm is augmented with advice from a possibly erroneous predictor. We show that if the predictor is accurate, we can efficiently bypass these impossibility results and achieve a constant-factor approximation to the optimal solution, i.e., consistency. On the other hand, if the predictor is inaccurate, under some technical conditions, we achieve results that match both the classical optimal upper bounds and the tight lower bounds up to constant factors, i.e., robustness. More broadly, we introduce a framework that extends both (1) the online set cover problem augmented with machine-learning predictors, studied by Bamas, Maggiore, and Svensson (NeurIPS 2020), and (2) the online covering SDP problem, initiated by Elad, Kale, and Noar (ICALP 2016). Specifically, we obtain general online learning-augmented algorithms for covering linear programs with fractional advice and constraints, and initiate the study of learning-augmented algorithms for covering SDP problems. Our techniques are based on the primal-dual framework of Buchbinder and Naor (Mathematics of Operations Research, 34, 2009) and can be further adjusted to handle constraints where the variables lie in a bounded region, i.e., box constraints.

Adaptive Sketches for Robust Regression with Importance Sampling
Sepideh Mahabadi, David P. Woodruff, Samson Zhou
RANDOM 2022 [abstract] [pdf]

We introduce data structures for solving robust regression through stochastic gradient descent (SGD) by sampling gradients with probability proportional to their norm, i.e., importance sampling. Although SGD is widely used for large scale machine learning, it is well-known for possibly experiencing slow convergence rates due to the high variance from uniform sampling. On the other hand, importance sampling can significantly decrease the variance but is usually difficult to implement because computing the sampling probabilities requires additional passes over the data, in which case standard gradient descent (GD) could be used instead. In this paper, we introduce an algorithm that approximately samples $T$ gradients of dimension $d$ from nearly the optimal importance sampling distribution for a robust regression problem over $n$ rows. Thus our algorithm effectively runs $T$ steps of SGD with importance sampling while using sublinear space and just making a single pass over the data. Our techniques also extend to performing importance sampling for second-order optimization.

Hardness and Algorithms for Robust and Sparse Optimization
Eric Price, Sandeep Silwal, Samson Zhou
ICML 2022 [abstract] [pdf]

We explore algorithms and limitations for sparse optimization problems such as sparse linear regression and robust linear regression. The goal of the sparse linear regression problem is to identify a small number of key features, while the goal of the robust linear regression problem is to identify a small number of erroneous measurements. Specifically, the sparse linear regression problem seeks a $k$-sparse vector $x\in\mathbb{R}^d$ to minimize $\|Ax-b\|_2$, given an input matrix $A\in\mathbb{R}^{n\times d}$ and a target vector $b\in\mathbb{R}^n$, while the robust linear regression problem seeks a set $S$ that ignores at most $k$ rows and a vector $x$ to minimize $\|(Ax-b)_S\|_2$. We first show bicriteria, NP-hardness of approximation for robust regression building on the work of [OWZ15], which implies a similar result for sparse regression. We further show fine-grained hardness of robust regression through a reduction from the minimum-weight $k$-clique conjecture. On the positive side, we give an algorithm for robust regression that achieves arbitrarily accurate additive error and uses runtime that closely matches the lower bound from the fine-grained hardness result, as well as an algorithm for sparse regression with similar runtime. Both our upper and lower bounds rely on a general reduction from robust linear regression to sparse regression that we introduce. Our algorithms, inspired by the 3SUM problem, use approximate nearest neighbor data structures and may be of independent interest for solving sparse optimization problems. For instance, we demonstrate that our techniques can also be used for the well-studied sparse PCA problem.

The White-Box Adversarial Data Stream Model
Miklós Ajtai, Vladimir Braverman, T.S. Jayram, Sandeep Silwal, Alec Sun, David P. Woodruff, Samson Zhou
PODS 2022 [abstract] [pdf]

There has been a large body of recent literature studying streaming algorithms on adaptive inputs, where an input stream is chosen adaptively by a black-box adversary who observes the outputs of the algorithm at each time. Many of these algorithms use techniques that crucially use the fact that the adversary can only see the output of the algorithm by using either differential privacy to hide the internal randomness of the algorithm or switching between parallel instances of the algorithm to information theoretically hide the internal randomness of the algorithm. These techniques fail when the entire internal states of the algorithm at each point in time are also revealed to the adversary. We introduce and consider the problem of streaming algorithms in the white-box adversarial model, where the stream is chosen adaptively by an adversary who observes the entire internal state of the algorithm at each time. We first show that there exists a randomized algorithm for the $\eps$-$L_1$-heavy hitters problem that uses space $\O{\frac{1}{\eps}\left(\log n+\log\frac{1}{\eps}\right)+\log\log m}$, which outperforms the well-known deterministic Misra-Gries algorithm. We also show that if the white-box adversary is computationally bounded, then there exist efficient algorithms for the $(\varphi,\eps)$-$L_1$-heavy hitters problem and the vertex neighborhood identity problem. We also show a general technique for proving lower bounds for (possibly randomized) algorithms robust to white-box algorithms through two-player (possibly deterministic) communication problems. In particular, our results show that for all $p\ge 0$, there exists a constant $C_p>1$ such that any $C_p$-approximation algorithm for $F_p$ moment estimation requires $\Omega(n)$ space. Similarly, there exists a constant $C>1$ such that any $C$-approximation algorithm for matrix rank requires $\Omega(n)$ space. We also prove a lower bound of $\Omega(\log n)$ for the fundamental problem of deterministic approximate counting in a stream of $0$s and $1$s, which holds even if we we know how many total stream updates we have seen so far at each point in the stream. Such a lower bound on approximate counting with additional information was unknown, and may be of independent interest. In our context, it shows a separation between multiplayer deterministic maximum communication and the white box space complexity of a streaming algorithm.

Memory Bounds for the Experts Problem
Vaidehi Srinivas, David P. Woodruff, Ziyu Xu, Samson Zhou
STOC 2022 [abstract] [slides] [video] [pdf]

While online learning with experts has been studied in many settings and there is a large body of work in understanding optimal algorithms for this problem, there is much less work on understanding the memory required to solve this problem in natural streaming models. This is especially important when the number of experts, as well as the number of days the experts make predictions on, are large. The goal is for a streaming algorithm that processes the predictions of each expert on a day to make a prediction with the minimum cost for that day. After having made a prediction, the streaming algorithm sees the actual outcome on that day, updates its state, and then moves on to the next day. We initialize the study of the online learning with experts problem in the streaming model. We first give a lower bound for the discrete prediction with experts problem where each prediction is either correct or incorrect for that day. Our lower bound shows that to achieve the optimal regret from online learning, any algorithm must essentially use space linear in the number of experts. Our lower bound also gives a smooth regret versus memory tradeoff. We then give an upper bound for the more general online learning problem in the random-order model, that is tight with our lower-bound up to low order terms. Our upper bound shows that there are natural tradeoffs between the regret of the streaming algorithm, the memory required, and the total number of days. Finally, for adversarially ordered streams, we give upper bounds for the prediction with experts problems that use sublinear space and achieve sublinear regret in a number of natural parameter regimes. We hope that these results and techniques can inspire multiplicative weights algorithms for a wide range of other problems in the memory-constrained setting.

Learning-Augmented k-means Clustering
Jon C. Ergun, Zhili Feng, Sandeep Silwal, David P. Woodruff, Samson Zhou
ICLR 2022 (selected for spotlight presentation) [abstract] [slides] [pdf]

$k$-means clustering is a well-studied problem due to its wide applicability. Unfortunately, there exist strong theoretical limits on the performance of any algorithm for the $k$-means problem on worst-case inputs. To overcome this barrier, we consider a scenario where ``advice'' is provided to help perform clustering. Specifically, we consider the $k$-means problem augmented with a predictor that, given any point, returns its cluster label in an approximately optimal clustering up to some, possibly adversarial, error. We present an algorithm whose performance improves along with the accuracy of the predictor, even though na\"{i}vely following the accurate predictor can still lead to a high clustering cost. Thus if the predictor is sufficiently accurate, we can retrieve a close to optimal clustering with nearly optimal runtime, breaking known computational barriers for algorithms that do not have access to such advice. We evaluate our algorithms on real datasets and show significant improvements in the quality of clustering.

Fast Regression for Structured Inputs
Raphael A. Meyer, Cameron Musco, Christopher Musco, David P. Woodruff, Samson Zhou
ICLR 2022 [abstract] [pdf]

$L_p$ Regression on Structured Inputs is an important problem in data analysis and machine learning where we find a vector $\mathbf{x}\in\mathbb R^{d}$ that minimizes $\|\mathbf{A}\mathbf{x}-\mathbf{b}\|_p$ for a \textit{structured} matrix $\mathbf{A}\in\mathbb R^{n \times d}$ and response vector $\mathbf{b}\in\mathbb R^{n}$. Unfortunately, for many common classes of matrices, sampling-based algorithms for approximately solving $L_p$ regression require runtime that is exponential in $p$, e.g., $d^{\mathcal{O}(p)}$, which is prohibitively expensive. We show that for a large class of structured inputs, such as combinations of low-rank matrices, sparse matrices, and Vandermonde matrices, $L_p$ regression can be approximately solved using runtime that is polynomial in $p$. For example, we show that $L_p$ regression on Vandermonde matrices can be approximately solved using time $\mathcal{O}(T(\mathbf{A})\log n+(dp)^\omega\cdot\text{polylog}\,n)$, where $T(\mathbf{A})$ is the time to multiply $\mathbf{A}\mathbf{x}$ for an arbitrary vector $\mathbf{x}\in\mathbb{R}^d$, and $\omega$ is the exponent of matrix multiplication. The polynomial dependence on $p$ also crucially allows our algorithms to extend naturally to sublinear time algorithms for $L_\infty$ regression. Of independent interest, we develop a new algorithm for solving $L_p$ regression for arbitrary matrices, which is significantly faster in practice for every $p\ge4$.

New Coresets for Projective Clustering and Applications
Murad Tukan, Xuan Wu, Samson Zhou, Vladimir Braverman, Dan Feldman
AISTATS 2022 [abstract] [pdf]

$(j,k)$-projective clustering is the natural generalization of the family of $k$-clustering and $j$-subspace clustering problems. Given a set of points $P$ in $\mathbb{R}^d$, the goal is to find $k$ flats of dimension $j$, i.e., affine subspaces, that best fit $P$ under a given distance measure. In this paper, we propose the first algorithm that returns an $L_\infty$ coreset of size polynomial in $d$. Moreover, we give the first strong coreset construction for general $M$-estimator regression. Specifically, we show that our construction provides efficient coreset constructions for Cauchy, Welsch, Huber, Geman-McClure, Tukey, $L_1-L_2$, and Fair regression, as well as general concave and power-bounded loss functions. Finally, we provide experimental results based on real-world datasets, showing the efficacy of our approach.

A Fast, Provably Accurate Approximation Algorithm for Sparse Principal Component Analysis Reveals Human Genetic Variation Across the World
Agniva Chowdhury, Aritra Bose, Samson Zhou, David P. Woodruff, Petros Drineas
RECOMB 2022 [abstract] [pdf]

Principal component analysis (PCA) is a widely used dimensionality reduction technique in machine learning and multivariate statistics. To improve the interpretability of PCA, various approaches to obtain sparse principal direction loadings have been proposed, which are termed Sparse Principal Component Analysis (SPCA). In this paper, we present \texttt{ThreSPCA}, a provably accurate algorithm based on thresholding the Singular Value Decomposition for the SPCA problem, without imposing any restrictive assumptions on the input covariance matrix. Our thresholding algorithm is conceptually simple; much faster than current state-of-the-art; and performs well in practice. When applied to genotype data from the 1000 Genomes Project, \texttt{ThreSPCA} is faster than previous benchmarks, at least as accurate, and leads to a set of interpretable biomarkers, revealing genetic diversity across the world.

Noisy Boolean Hidden Matching with Applications
Michael Kapralov, Amulya Musipatla, Jakab Tardos, David P. Woodruff, Samson Zhou
ITCS 2022 [abstract] [video] [pdf]

The Boolean Hidden Matching (BHM) problem, introduced in a seminal paper of Gavinsky et. al. [STOC'07], has played an important role in the streaming lower bounds for graph problems such as triangle and subgraph counting, maximum matching, MAX-CUT, Schatten $p$-norm approximation, maximum acyclic subgraph, testing bipartiteness, $k$-connectivity, and cycle-freeness. The one-way communication complexity of the Boolean Hidden Matching problem on a universe of size $n$ is $\Theta(\sqrt{n})$, resulting in $\Omega(\sqrt{n})$ lower bounds for constant factor approximations to several of the aforementioned graph problems. The related (and, in fact, more general) Boolean Hidden Hypermatching (BHH) problem introduced by Verbin and Yu [SODA'11] provides an approach to proving higher lower bounds of $\Omega(n^{1-1/t})$ for integer $t\geq 2$. Reductions based on Boolean Hidden Hypermatching generate distributions on graphs with connected components of diameter about $t$, and basically show that long range exploration is hard in the streaming model of computation with adversarial arrivals. In this paper we introduce a natural variant of the BHM problem, called noisy BHM (and its natural noisy BHH variant), that we use to obtain higher than $\Omega(\sqrt{n})$ lower bounds for approximating several of the aforementioned problems in graph streams when the input graphs consist only of components of diameter bounded by a fixed constant. We also use the noisy BHM problem to show that the problem of classifying whether an underlying graph is isomorphic to a complete binary tree in insertion-only streams requires $\Omega(n)$ space, which seems challenging to show using BHM or BHH alone.

Truly Perfect Samplers for Data Streams and Sliding Windows
Rajesh Jayaram, David P. Woodruff, Samson Zhou
PODS 2022 [abstract] [pdf]

In the $G$-sampling problem, the goal is to output an index $i$ of a vector $f \in\mathbb{R}^n$, such that for all coordinates $j \in [n]$, \[\textbf{Pr}[i=j] = (1 \pm \eps) \frac{G(f_j)}{\sum_{k\in[n]} G(f_k)} + \gamma,\] where $G:\mathbb{R} \to \mathbb{R}_{\geq 0}$ is some non-negative function. If $\eps = 0$ and $\gamma = 1/\poly(n)$, the sampler is called \textit{perfect}. In the data stream model, $f$ is defined implicitly by a sequence of updates to its coordinates, and the goal is to design such a sampler in small space. Jayaram and Woodruff (FOCS 2018) gave the first perfect $L_p$ samplers in turnstile streams, where $G(x)=|x|^p$, using $\text{polylog}(n)$ space for $p\in(0,2]$. However, to date all known sampling algorithms are not \textit{truly perfect}, since their output distribution is only point-wise $\gamma = 1/\poly(n)$ close to the true distribution. This small error can be significant when samplers are run many times on successive portions of a stream, and leak potentially sensitive information about the data stream. In this work, we initiate the study of \textit{truly perfect} samplers, with $\eps = \gamma = 0$, and comprehensively investigate their complexity in the data stream and sliding window models. We begin by showing that sublinear space truly perfect sampling is impossible in the turnstile model, by proving a lower bound of $\Omega\left(\min\left\{n,\log \frac{1}{\gamma}\right\}\right)$ for any $G$-sampler with point-wise error $\gamma$ from the true distribution. We then give a general time-efficient sublinear-space framework for developing truly perfect samplers in the insertion-only streaming and sliding window models. As specific applications, our framework addresses $L_p$ sampling for all $p>0$, e.g., $\tO{n^{1-1/p}}$ space for $p\ge 1$, concave functions, and a large number of measure functions, including the $L_1-L_2$, Fair, Huber, and Tukey estimators. The update time of our truly perfect $L_p$-samplers is $\O{1}$, which is an exponential improvement over the running time of previous perfect $L_p$-samplers.

Adversarial Robustness of Streaming Algorithms through Importance Sampling
Vladimir Braverman, Avinatan Hassidim, Yossi Matias, Mariano Schain, Sandeep Silwal, Samson Zhou
NeurIPS 2021 [abstract] [poster] [pdf]
AdvML @ ICML 2021 (selected for oral presentation)
Silver Best Paper Award at AdvML @ ICML 2021

Robustness against adversarial attacks has recently been at the forefront of algorithmic design for machine learning tasks. In the adversarial streaming model, an adversary gives an algorithm a sequence of adaptively chosen updates $u_1,\ldots,u_n$ as a data stream. The goal of the algorithm is to compute or approximate some predetermined function for every prefix of the adversarial stream, but the adversary may generate future updates based on previous outputs of the algorithm. In particular, the adversary may gradually learn the random bits internally used by an algorithm to manipulate dependencies in the input. This is especially problematic as many important problems in the streaming model require randomized algorithms, as they are known to not admit any deterministic algorithms that use sublinear space. In this paper, we introduce adversarially robust streaming algorithms for central machine learning and algorithmic tasks, such as regression and clustering, as well as their more general counterparts, subspace embedding, low-rank approximation, and coreset construction. For regression and other numerical linear algebra related tasks, we consider the row arrival streaming model. Our results are based on a simple, but powerful, observation that sampling based algorithms give rise to adversarial robustness which is in contrast to sketching based algorithms, which are very prevalent in the streaming literature but suffer from adversarial attacks. In addition, we show that the well-known merge and reduce paradigm in streaming is adversarially robust. Since the merge and reduce paradigm defines coreset constructions, we thus obtain robust algorithms for $k$-means, $k$-median, $k$-center, Bregman clustering, projective clustering, principal component analysis (PCA) and non-negative matrix factorization. To the best of our knowledge, these are the first adversarially robust methods for these problems. Finally, we empirically confirm the robustness of our algorithms on various adversarial attacks and demonstrate that by contrast, common existing algorithms are not robust.

Dimensionality Reduction for Wasserstein Barycenter
Zachary Izzo, Sandeep Silwal, Samson Zhou
NeurIPS 2021 [abstract] [pdf]

The Wasserstein barycenter is a geometric construct which captures the notion of centrality among probability distributions, and which has found many applications in machine learning. However, most algorithms for finding even an approximate barycenter suffer an exponential dependence on the dimension $d$ of the underlying space of the distributions. In order to cope with this ``curse of dimensionality,'' we study dimensionality reduction techniques for the Wasserstein barycenter problem. When the barycenter is restricted to support of size $n$, we show that randomized dimensionality reduction can be used to map the problem to a space of dimension $O(\log n)$ independent of both $d$ and $k$, and that \emph{any} solution found in the reduced dimension will have its cost preserved up to arbitrary small error in the original space. We provide matching upper and lower bounds on the size of the reduced dimension, showing that our methods are optimal up to constant factors. We also provide a coreset construction for the Wasserstein barycenter problem that significantly decreases the number of input distributions. The coresets can be used in conjunction with random projections and thus further improve computation time. Lastly, our experimental results validate the speedup provided by dimensionality reduction while maintaining solution quality.

Efficient Coreset Constructions via Sensitivity Sampling
Vladimir Braverman, Dan Feldman, Harry Lang, Adiel Statman, Samson Zhou
ACML 2021 [abstract] [pdf]

A coreset for a set of points is a small subset of weighted points that approximately preserves important properties of the original set. Specifically, if $P$ is a set of points, $Q$ is a set of queries, and $f:P\times Q\to\mathbb{R}$ is a cost function, then a set $S\subseteq P$ with weights $w:P\to[0,\infty)$ is an $\epsilon$-coreset for some parameter $\epsilon>0$ if $\sum_{s\in S}w(s)f(s,q)$ is a $(1+\epsilon)$ multiplicative approximation to $\sum_{p\in P}f(p,q)$ for all $q\in Q$. Coresets are used to solve fundamental problems in machine learning under various big data models of computation. Many of the suggested coresets in the recent decade used, or could have used a general framework for constructing coresets whose size depends quadratically on what is known as total sensitivity $t$. In this paper we improve this bound from $O(t^2)$ to $O(t\log t)$. Thus our results imply more space efficient solutions to a number of problems, including projective clustering, $k$-line clustering, and subspace approximation. The main technical result is a generic reduction to the sample complexity of learning a class of functions with bounded VC dimension. We show that obtaining an $(\nu,\alpha)$-sample for this class of functions with appropriate parameters $\nu$ and $\alpha$ suffices to achieve space efficient $\epsilon$-coresets. Our result implies more efficient coreset constructions for a number of interesting problems in machine learning; we show applications to $k$-median/$k$-means, $k$-line clustering, $j$-subspace approximation, and the integer $(j,k)$-projective clustering problem.

Tight Bounds for Adversarially Robust Streams and Sliding Windows via Difference Estimators
David P. Woodruff, Samson Zhou
FOCS 2021 [abstract] [slides] [video] [pdf]

We introduce \emph{difference estimators} for data stream computation, which provide approximations to $F(v)-F(u)$ for frequency vectors $v\succeq u$ and a given function $F$. We show how to use such estimators to carefully trade error for memory in an iterative manner. The function $F$ is generally non-linear, and we give the first difference estimators for the frequency moments $F_p$ for $p\in[0,2]$, as well as for integers $p>2$. Using these, we resolve a number of central open questions in adversarial robust streaming and sliding window models: \begin{enumerate} \item For adversarially robust streams, we obtain a $(1+\epsilon)$-approximation to $F_p$ using $\tilde{\mathcal{O}}\left(\frac{\log n}{\epsilon^2}\right)$ bits of space for $p\in[0,2]$, and using $\tilde{\mathcal{O}}\left(\frac{1}{\epsilon^2}n^{1-2/p}\right)$ bits of space for integers $p>2$. We also obtain an adversarially robust algorithm for the $L_2$-heavy hitters problem using $\mathcal{O}\left(\frac{\log n}{\epsilon^2}\right)$ bits of space. Our bounds are optimal up to $\poly(\log\log n + \log(1/\epsilon))$ factors, and improve the $\frac{1}{\epsilon^3}$ dependence of Ben-Eliezer \emph{et al.} (PODS 2020, best paper award) and the $\frac{1}{\epsilon^{2.5}}$ dependence of Hassidim \emph{et al.} (NeurIPS 2020, oral presentation). \item For sliding windows, we obtain a $(1+\epsilon)$-approximation to $F_p$ using $\tilde{\mathcal{O}}\left(\frac{\log^2 n}{\epsilon^2}\right)$ bits of space for $p\in(0,2]$, resolving a longstanding question of Braverman and Ostrovsky (FOCS 2007). For example, for $p = 2$ we improve the dependence on $\epsilon$ from $\frac{1}{\epsilon^4}$ to an optimal $\frac{1}{\epsilon^2}$. \end{enumerate} For both models, our dependence on $\epsilon$ shows, up to $\log\frac{1}{\epsilon}$ factors, that there is no overhead over the standard insertion-only data stream model for any of these problems.

Symmetric Norm Estimation and Regression on Sliding Windows
Vladimir Braverman, Viska Wei, Samson Zhou
COCOON 2021 [abstract] [pdf]

The sliding window model generalizes the standard streaming model and often performs better in applications where recent data is more important or more accurate than data that arrived prior to a certain time. We study the problem of approximating symmetric norms (a norm on $\mathbb{R}^n$ that is invariant under sign-flips and coordinate-wise permutations) in the sliding window model, where only the $W$ most recent updates define the underlying frequency vector. Whereas standard norm estimation algorithms for sliding windows rely on the smooth histogram framework of Braverman and Ostrovsky (FOCS 2007), analyzing the \emph{smoothness} of general symmetric norms seems to be a challenging obstacle. Instead, we observe that the symmetric norm streaming algorithm of Braverman \emph{et al.} (STOC 2017) can be reduced to identifying and approximating the frequency of heavy-hitters in a number of substreams. We introduce a heavy-hitter algorithm that gives a $(1+\epsilon)$-approximation to each of the reported frequencies in the sliding window model, thus obtaining the first algorithm for general symmetric norm estimation in the sliding window model. Our algorithm is a universal sketch that simultaneously approximates all symmetric norms in a parametrizable class and also improves upon the smooth histogram framework for estimating $L_p$ norms, for a range of large $p$. Finally, we consider the problem of overconstrained linear regression problem in the case that loss function that is an Orlicz norm, a symmetric norm that can be interpreted as a scale-invariant version of $M$-estimators. We give the first algorithms that produce $(1+\eps)$-approximate solutions to the linear regression problem for loss functions that are Orlicz norms in both the streaming and sliding window models.

On the Security of Proofs of Sequential Work in a Post-Quantum World
Jeremiah Blocki, Seunghoon Lee, Samson Zhou
ITC 2021 [abstract] [Seunghoon's slides] [Seunghoon's video] [pdf]

A proof of sequential work allows a prover to convince a resource bounded verifier that the prover invested a substantial amount of sequential time to perform some underlying computation. Proofs of sequential work have many applications including time-stamping, blockchain design, and universally verifiable CPU benchmarks. Mahmoody, Moran and Vadhan (ITCS 2013) gave the first construction of proofs of sequential work in the random oracle model though the construction relied on expensive depth-robust graphs. In a recent breakthrough, Cohen and Pietrzak (EUROCRYPT 2018) gave a more efficient construction that does not require depth-robust graphs. In each of these constructions, the prover commits to a labeling of a directed acyclic graph $G$ with $N$ nodes and the verifier audits the prover by checking that a small subset of labels are locally consistent, e.g., $L_v = H(L_{v_1},\ldots,L_{v_\delta})$, where $v_1,\ldots,v_\delta$ denote the parents of node $v$. Provided that the graph $G$ has certain structural properties (e.g., depth-robustness), the prover must produce a long $\H$-sequence to pass the audit with non-negligible probability. An $\H$-sequence $x_0,x_1\ldots x_T$ has the property that $H(x_i)$ is a substring of $x_{i+1}$ for each $i$, i.e., we can find strings $a_i,b_i$ such that $x_{i+1} = a_i \cdot H(x_i) \cdot b_i$. In the parallel random oracle model, it is straightforward to argue that any attacker running in sequential time $T-1$ will fail to produce an $\H$-sequence of length $T$ except with negligible probability -- even if the attacker submits large batches of random oracle queries in each round. In this paper, we introduce the parallel quantum random oracle model and prove that any quantum attacker running in sequential time $T-1$ will fail to produce an $\H$-sequence except with negligible probability -- even if the attacker submits a large batch of quantum queries in each round. The proof is substantially more challenging and highlights the power of Zhandry's recent compressed oracle technique (CRYPTO 2019).

Separations for Estimating Large Frequency Moments on Data Streams
David P. Woodruff, Samson Zhou
ICALP 2021 [abstract] [slides] [video] [pdf]

We study the classical problem of moment estimation of an underlying vector whose $n$ coordinates are implicitly defined through a series of updates in a data stream. We show that if the updates to the vector arrive in the random-order insertion-only model, then there exist space efficient algorithms with improved dependencies on the approximation parameter $\varepsilon$. In particular, for any real $p > 2$, we first obtain an algorithm for $F_p$ moment estimation using $\tilde{\mathcal{O}}\left(\frac{1}{\varepsilon^{4/p}}\cdot n^{1-2/p}\right)$ bits of memory. Our techniques also give algorithms for $F_p$ moment estimation with $p>2$ on arbitrary order insertion-only and turnstile streams, using $\tilde{\mathcal{O}}\left(\frac{1}{\varepsilon^{4/p}}\cdot n^{1-2/p}\right)$ bits of space and two passes, which is the first optimal multi-pass $F_p$ estimation algorithm up to $\log n$ factors. Finally, we give an improved lower bound of $\Omega\left(\frac{1}{\varepsilon^2}\cdot n^{1-2/p}\right)$ for one-pass insertion-only streams, thus separating the complexity of this problem both between random and non-random orders, as well as one-pass and multi-pass streams.

Learning a Latent Simplex in Input Sparsity Time
Ainesh Bakshi, Chiranjib Bhattacharyya, Ravi Kannan, David P. Woodruff, Samson Zhou
ICLR 2021 (selected for spotlight presentation) [abstract] [slides] [pdf]

We consider the problem of learning a latent $k$-vertex simplex $K\in\mathbb{R}^d$, given access to $\AA\in\mathbb{R}^{d\times n}$, which can be viewed as a data matrix with $n$ points that are obtained by randomly perturbing latent points in the simplex $K$ (potentially beyond $K$). A large class of latent variable models, such as adversarial clustering, mixed membership stochastic block models, and topic models can be cast as learning a latent simplex. Bhattacharyya and Kannan (SODA 2020) give an algorithm for learning such a latent simplex in time roughly $O(k\cdot\nnz(\AA))$, where $\nnz(\AA)$ is the number of non-zeros in $\AA$. We show that the dependence on $k$ in the running time is unnecessary given a natural assumption about the mass of the top $k$ singular values of $\AA$, which holds in many of these applications. Further, we show this assumption is necessary, as otherwise an algorithm for learning a latent simplex would imply an algorithmic breakthrough for spectral low rank approximation. At a high level, Bhattacharyya and Kannan provide an adaptive algorithm that makes $k$ matrix-vector product queries to $\AA$ and each query is a function of all queries preceding it. Since each matrix-vector product requires $\nnz(\AA)$ time, their overall running time appears unavoidable. Instead, we obtain a low-rank approximation to $\AA$ in input-sparsity time and show that the column space thus obtained has small $\sin\Theta$ (angular) distance to the right top-$k$ singular space of $\AA$. Our algorithm then selects $k$ points in the low-rank subspace with the largest inner product (in absolute value) with $k$ carefully chosen random vectors. By working in the low-rank subspace, we avoid reading the entire matrix in each iteration and thus circumvent the $\Theta(k\cdot\nnz(\AA))$ running time.

Sensitivity Analysis of the Maximum Matching Problem
Yuichi Yoshida, Samson Zhou
ITCS 2021 [abstract] [slides] [video] [pdf]

We consider the \emph{sensitivity} of algorithms for the maximum matching problem against edge and vertex modifications. When an algorithm $A$ for the maximum matching problem is deterministic, the sensitivity of $A$ on $G$ is defined as $\max_{e \in E(G)}|A(G) \triangle A(G - e)|$, where $G-e$ is the graph obtained from $G$ by removing an edge $e \in E(G)$ and $\triangle$ denotes the symmetric difference. When $A$ is randomized, the sensitivity is defined as $\max_{e \in E(G)}d_{\mathrm{EM}}(A(G),A(G-e))$, where $d_{\mathrm{EM}}(\cdot,\cdot)$ denotes the earth mover's distance between two distributions. Thus the sensitivity measures the difference between the output of an algorithm after the input is slightly perturbed. Algorithms with low sensitivity, or \emph{stable} algorithms are desirable because they are robust to edge failure or attack. In this work, we show a randomized $(1-\epsilon)$-approximation algorithm with \emph{worst-case} sensitivity $O_{\epsilon}(1)$, which substantially improves upon the $(1-\epsilon)$-approximation algorithm of Varma and Yoshida (arXiv 2020) that obtains \emph{average} sensitivity $n^{O(1/(1+\epsilon^2))}$ sensitivity algorithm, and show a deterministic $1/2$-approximation algorithm with sensitivity $\exp(O(\log^*n))$ for bounded-degree graphs. We then show that any deterministic constant-factor approximation algorithm must have sensitivity $\Omega(\log^* n)$. Our results imply that randomized algorithms are strictly more powerful than deterministic ones in that the former can achieve sensitivity independent of $n$ whereas the latter cannot. We also show analogous results for vertex sensitivity, where we remove a vertex instead of an edge. As an application of our results, we give an algorithm for the online maximum matching with $O_{\epsilon}(n)$ total replacements in the vertex-arrival model. By comparison, Bernstein~et~al.~(J. ACM 2019) gave an online algorithm that always outputs the maximum matching, but only for bipartite graphs and with $O(n\log n)$ total replacements. Finally, we introduce the notion of normalized weighted sensitivity, a natural generalization of sensitivity that accounts for the weights of deleted edges. For a graph with weight function $w$, the normalized weighted sensitivity is defined to be the sum of the weighted edges in the symmetric difference of the algorithm normalized by the altered edge, i.e., $\max_{e \in E(G)}\frac{1}{w(e)}w\left(A(G) \triangle A(G - e)\right)$. Hence the normalized weighted sensitivity measures the weighted difference between the output of an algorithm after the input is slightly perturbed, normalized by the weight of the perturbation. We show that if all edges in a graph have polynomially bounded weight, then given a trade-off parameter $\alpha>2$, there exists an algorithm that outputs a $\frac{1}{4\alpha}$-approximation to the maximum weighted matching in $O(m\log_{\alpha} n)$ time, with normalized weighted sensitivity $O(1)$.

Near Optimal Linear Algebra in the Online and Sliding Window Models
Vladimir Braverman, Petros Drineas, Cameron Musco, Christopher Musco, Jalaj Upadhyay, David P. Woodruff, Samson Zhou
FOCS 2020 [abstract] [slides] [video] [pdf]

We initiate the study of numerical linear algebra in the sliding window model, where only the most recent $W$ updates in a stream form the underlying data set. Although many existing algorithms in the sliding window model use or borrow elements from the smooth histogram framework (Braverman and Ostrovsky, FOCS 2007), we show that many interesting linear-algebraic problems, including spectral and vector induced matrix norms, generalized regression, and low-rank approximation, are not amenable to this approach in the row-arrival model. To overcome this challenge, we first introduce a unified row-sampling based framework that gives \emph{randomized} algorithms for spectral approximation, low-rank approximation/projection-cost preservation, and $\ell_1$-subspace embeddings in the sliding window model, which often use nearly optimal space and achieve nearly input sparsity runtime. Our algorithms are based on ``reverse online'' versions of offline sampling distributions such as (ridge) leverage scores, $\ell_1$ sensitivities, and Lewis weights to quantify both the importance and the recency of a row; our structural results on these distributions may be of independent interest for future algorithmic design. Although our techniques initially address numerical linear algebra in the sliding window model, our row-sampling framework rather surprisingly implies connections to the well-studied online model; our structural results also give the first sample optimal (up to lower order terms) online algorithm for low-rank approximation/projection-cost preservation. Using this powerful primitive, we give online algorithms for column/row subset selection and principal component analysis that resolves the main open question of Bhaskara~\etal\,(FOCS 2019). We also give the first online algorithm for $\ell_1$-subspace embeddings. We further formalize the connection between the online model and the sliding window model by introducing an \emph{additional} unified framework for \emph{deterministic} algorithms using a merge and reduce paradigm and the concept of online coresets, which we define as a weighted subset of rows of the input matrix that can be used to compute a good approximation to some given function on all of its prefixes. Our sampling based algorithms in the row-arrival online model yield online coresets, giving deterministic algorithms for spectral approximation, low-rank approximation/projection-cost preservation, and $\ell_1$-subspace embeddings in the sliding window model that use nearly optimal space.

On Locally Decodable Codes in Resource Bounded Channels
Jeremiah Blocki, Shubhang Kulkarni, Samson Zhou
ITC 2020 [abstract] [Shubhang's video@ITC] [pdf]

Constructions of locally decodable codes (\LDCs) have one of two undesirable properties: low rate or high locality (polynomial in the length of the message). In settings where the encoder/decoder have already exchanged cryptographic keys and the channel is a probabilistic polynomial time (\PPT) algorithm, it is possible to circumvent these barriers and design \LDCs\ with constant rate and small locality. However, the assumption that the encoder/decoder have exchanged cryptographic keys is often prohibitive. We thus consider the problem of designing explicit and efficient \LDCs\ in settings where the channel is {\em slightly} more constrained than the encoder/decoder with respect to some resource e.g., space or (sequential) time. Given an explicit function $f$ that the channel cannot compute, we show how the encoder can transmit a random secret key to the local decoder using $f(\cdot)$ and a random oracle $\oracleH$. This allows bootstrap from the private key \LDC\ construction of Ostrovsky, Pandey and Sahai (ICALP, 2007), thereby answering an open question posed by Guruswami and Smith (FOCS 2010) of whether such bootstrapping techniques may apply to \LDCs\ in weaker channel models than just \PPT\ algorithms. Specifically, in the random oracle model we show how to construct explicit constant rate \LDCs\ with optimal locality of $\polylog$ in the security parameter against various resource constrained channels.

Non-Adaptive Adaptive Sampling on Turnstile Streams
Sepideh Mahabadi, Ilya Razenshteyn, David P. Woodruff, Samson Zhou
STOC 2020 [abstract] [Sepideh's slides] [Sepideh's video@TCS+] [pdf]

Adaptive sampling is a useful algorithmic tool for data summarization problems in the classical centralized setting, where the entire dataset is available to the single processor performing the computation. Adaptive sampling repeatedly selects rows of an underlying matrix $\A\in\mathbb{R}^{n\times d}$, where $n\gg d$, with probabilities proportional to their distances to the subspace of the previously selected rows. Intuitively, adaptive sampling seems to be limited to trivial multi-pass algorithms in the streaming model of computation due to its inherently sequential nature of assigning sampling probabilities to each row only after the previous iteration is completed. Surprisingly, we show this is not the case by giving the first one-pass algorithms for adaptive sampling on turnstile streams and using space $\poly(d,k,\log n)$, where $k$ is the number of adaptive sampling rounds to be performed. Our adaptive sampling procedure has a number of applications to various data summarization problems that either improve state-of-the-art or have only been previously studied in the more relaxed row-arrival model. We give the first relative-error algorithm for column subset selection on turnstile streams. We show our adaptive sampling algorithm also gives the first relative-error algorithm for subspace approximation on turnstile streams that returns $k$ noisy rows of $\A$. The quality of the output can be improved to a $(1+\eps)$-approximation at the tradeoff of a bicriteria algorithm that outputs a larger number of rows. We then give the first algorithm for projective clustering on turnstile streams that uses space sublinear in $n$. In fact, we use space $\poly\left(d,k,s,\frac{1}{\eps},\log n\right)$ to output a $(1+\eps)$-approximation, where $s$ is the number of $k$-dimensional subspaces. Our adaptive sampling primitive also provides the first algorithm for volume maximization on turnstile streams. We complement our volume maximization algorithmic results with lower bounds that are tight up to lower order terms, even for multi-pass algorithms. By a similar construction, we also obtain lower bounds for volume maximization in the row-arrival model, which we match with competitive upper bounds.

"Bring Your Own Greedy"+Max: Near-Optimal 1/2-Approximations for Submodular Knapsack
Grigory Yaroslavtsev, Samson Zhou, Dmitrii Avdiukhin
AISTATS 2020 [abstract] [pdf]

The problem of selecting a small-size representative summary of a large dataset is a cornerstone of machine learning, optimization and data science. Motivated by applications to recommendation systems and other scenarios with query-limited access to vast amounts of data, we propose a new rigorous algorithmic framework for a standard formulation of this problem as a submodular maximization subject to a linear (knapsack) constraint. Our framework is based on augmenting all partial Greedy solutions with the best additional item. It can be instantiated with negligible overhead in any model of computation, which allows the classic Greedy algorithm and its variants to be implemented. We give such instantiations in the offline (Greedy+Max), multi-pass streaming (Sieve+Max) and distributed (Distributed+Max) settings. Our algorithms give (1/2-epsilon)-approximation with most other key parameters of interest being near-optimal. Our analysis is based on a new set of first-order linear differential inequalities and their robust approximate versions. Experiments on typical datasets (movie recommendations, influence maximization) confirm scalability and high quality of solutions obtained via our framework. Instance-specific approximations are typically in the 0.6-0.7 range and frequently beat even the (1-1/epsilon) \approx 0.63 worst-case barrier for polynomial-time algorithms.

Data-Independent Neural Pruning via Coresets
Ben Mussay, Margarita Osadchy, Vladimir Braverman, Samson Zhou, Dan Feldman
IEEE Transactions on Neural Networks and Learning Systems 2021 [journal]
ICLR 2020 [abstract] [conference]

Previous work showed empirically that large neural networks can be significantly reduced in size while preserving their accuracy. Model compression became a central research topic, as it is crucial for deployment of neural networks on devices with limited computational and memory resources. The majority of the compression methods are based on heuristics and offer no worst-case guarantees on the trade-off between the compression rate and the approximation error for an arbitrarily new sample. We propose the first efficient, data-independent neural pruning algorithm with a provable trade-off between its compression rate and the approximation error for any future test sample. Our method is based on the coreset framework, which finds a small weighted subset of points that provably approximates the original inputs. Specifically, we approximate the output of a layer of neurons by a coreset of neurons in the previous layer and discard the rest. We apply this framework in a layer-by-layer fashion from the top to the bottom. Unlike previous works, our coreset is data independent, meaning that it provably guarantees the accuracy of the function for any input $x\in \mathbb{R}^d$, including an adversarial one. We demonstrate the effectiveness of our method on popular network architectures. In particular, our coresets yield 90\% compression of the LeNet-300-100 architecture on MNIST while improving classification accuracy.

Approximating Cumulative Pebbling Cost is Unique Games Hard
Jeremiah Blocki, Seunghoon Lee, Samson Zhou
ITCS 2020 [abstract] [Seunghoon's poster] [Seunghoon's slides] [website] [pdf]

The cumulative pebbling complexity of a directed acyclic graph G is defined as cc(G) = \min_P \sum_i |P_i|, where the minimum is taken over all legal (parallel) black pebblings of G and |P_i| denotes the number of pebbles on the graph during round i. Intuitively, cc(G) captures the amortized Space-Time complexity of pebbling m copies of G in parallel. The cumulative pebbling complexity of a graph G is of particular interest in the field of cryptography as cc(G) is tightly related to the amortized Area-Time complexity of the data-independent memory hard function (iMHF) f_{G,H} (Alwen and Serbinenko, STOC 2015) defined using a constant indegree directed acyclic graph (DAG) G and a random oracle H. A secure iMHF should have amortized Space-Time complexity as high as possible e.g., to deter brute-force password attacker who wants to find x such that f_{G,H}(x) = h. Thus, to analyze the (in)security of a candidate iMHF f_{G,H}, it is crucial to estimate the value cc(G) but currently, upper and lower bounds for leading iMHF candidates differ by several orders of magnitude. Blocki and Zhou recently showed that is NP-Hard to compute cc(G), but their techniques do not even rule out an efficient (1+epsilon)-approximation algorithm for any constant epsilon>0. We show that for any constant c > 0, it is Unique Games hard to approximate cc(G) to within a factor of c. Along the way, we show the hardness of approximation of the DAG Vertex Deletion problem on DAGs of constant indegree. Namely, we show that for any k,epsilon>0 given a DAG G with N nodes and constant indegree, the it is Unique Games hard to distinguish between the case that (e_1, d_1)-reducible with e_1=N^{1/(1+2epsilon)}/k and d_1=k N^{2epsilon/(1+2epsilon)} and the case that G is (e_2, d_2)-depth-robust with e_2 = (1-epsilon)k e_1 and d_2= 0.9 N^{(1+epsilon)/(1+2epsilon)}, which may be of independent interest. Our result generalizes a result of Svensson who proved an analogous result for DAGs with indegree O(N).

Computationally Data-Independent Memory Hard Functions
Mohammad Hassan Ameri, Jeremiah Blocki, Samson Zhou
ITCS 2020 [abstract] [slides] [video] [pdf]

Memory hard functions (MHFs) are an important cryptographic primitive that are used to design egalitarian proofs of work and in the construction of moderately expensive key-derivation functions resistant to brute-force attacks. Broadly speaking, MHFs can be divided into two categories: data-dependent memory hard functions (dMHFs) and data-independent memory hard functions (iMHFs). iMHFs are resistant to certain side-channel attacks as the memory access pattern induced by the honest evaluation algorithm is independent of the potentially sensitive input e.g., password. While dMHFs are potentially vulnerable to side-channel attacks (the induced memory access pattern might leak useful information to a brute-force attacker), they can achieve higher cumulative memory complexity (CMC) in comparison than an iMHF. In particular, any iMHF that can be evaluated in $N$ steps on a sequential machine has CMC {\em at most} $\O{\frac{N^2\log\log N}{\log N}}$. By contrast, the dMHF scrypt achieves maximal CMC $\Omega(N^2)$ --- though the CMC of scrypt would be reduced to just $\O{N}$ after a side-channel attack. In this paper, we introduce the notion of computationally data-independent memory hard functions (ciMHFs). Intuitively, we require that memory access pattern induced by the (randomized) ciMHF evaluation algorithm appears to be independent from the standpoint of a computationally bounded eavesdropping attacker --- even if the attacker selects the initial input. We then ask whether it is possible to circumvent known upper bound for iMHFs and build a ciMHF with CMC $\Omega(N^2)$. Surprisingly, we answer the question in the affirmative when the ciMHF evaluation algorithm is executed on a two-tiered memory architecture (RAM/Cache). We introduce the notion of a $k$-restricted dynamic graph to quantify the continuum between unrestricted dMHFs $(k=n)$ and iMHFs ($k=1$). For any $\eps > 0$ we show how to construct a $k$-restricted dynamic graph with $k=\Omega(N^{1-\eps})$ that provably achieves maximum cumulative pebbling cost $\Omega(N^2)$. We can use $k$-restricted dynamic graphs to build a ciMHF provided that cache is large enough to hold $k$ hash outputs and the dynamic graph satisfies a certain property that we call ``amenable to shuffling.'' In particular, we prove that the induced memory access pattern is indistinguishable to a polynomial time attacker who can monitor the locations of read/write requests to RAM, but not cache. We also show that when $k=o\left(N^{1/\log\log N}\right)$, then any $k$-restricted graph with constant indegree has cumulative pebbling cost $o(N^2)$. Our results almost completely characterize the spectrum of $k$-restricted dynamic graphs.

Memory-Efficient Performance Monitoring on Programmable Switches with Lean Algorithms
Zaoxing Liu, Samson Zhou, Ori Rottenstreich, Vladimir Braverman, Jennifer Rexford
APoCS 2020 [abstract] [pdf]

Network performance problems are notoriously difficult to diagnose. Prior profiling systems collect performance statistics by keeping information about each network flow, but maintaining per-flow state is not scalable on resource-constrained NIC and switch hardware. Instead, we propose sketch-based performance monitoring using memory that is sublinear in the number of flows. Existing sketches estimate metrics based on flow sizes. In contrast, performance monitoring typically requires combining information across pairs of packets, such as matching a data packet with its acknowledgment to compute a round-trip time. We define a new class of \emph{lean} algorithms that use memory sublinear in both the size of input data and the number of flows. We then introduce lean algorithms for a set of important statistics, such as identifying flows with high latency, loss, out-of-order, or retransmitted packets. We implement prototypes of our lean algorithms on a commodity programmable switch using the P4 language. Our experiments show that lean algorithms detect $\sim$82\% of top 100 problematic flows among real-world packet traces using just 40KB memory.

Fast Fourier Sparsity Testing
Grigory Yaroslavtsev, Samson Zhou
SOSA 2020 [abstract] [slides] [pdf]

A function $f : \F_2^n \to \R$ is \emph{$s$-sparse} if it has at most $s$ non-zero Fourier coefficients. Motivated by applications to fast sparse Fourier transforms over $\F_2^n$, we study efficient algorithms for the problem of approximating the $\ell_2$-distance from a given function to the closest $s$-sparse function. While previous works (e.g., Gopalan \emph{et al.} SICOMP 2011) study the problem of distinguishing $s$-sparse functions from those that are far from $s$-sparse under Hamming distance, to the best of our knowledge no prior work has explicitly focused on the more general problem of distance estimation in the $\ell_2$ setting, which is particularly well-motivated for noisy Fourier spectra. Given the focus on efficiency, our main result is an algorithm that solves this problem with query complexity $\O{s}$ for constant accuracy and error parameters, which is only quadratically worse than applicable lower bounds.

Approximate F₂-Sketching of Valuation Functions
Grigory Yaroslavtsev, Samson Zhou
RANDOM 2019 [abstract] [slides] [pdf]

We study the problem of constructing a linear sketch of minimum dimension that allows approximation of a given real-valued function $f \colon \ftwo^n \rightarrow \mathbb R$ with small expected squared error. We develop a general theory of linear sketching for such functions through which we analyze their dimension for most commonly studied types of valuation functions: additive, budget-additive, coverage, $\alpha$-Lipschitz submodular and matroid rank functions. This gives a characterization of how many bits of information have to be stored about the input $x$ so that one can compute $f$ under additive updates to its coordinates. Our results are tight in most cases and we also give extensions to the distributional version of the problem where the input $x \in \ftwo^n$ is generated uniformly at random. Using known connections with dynamic streaming algorithms, both upper and lower bounds on dimension obtained in our work extend to the space complexity of algorithms evaluating $f(x)$ under long sequences of additive updates to the input $x$ presented as a stream. Similar results hold for simultaneous communication in a distributed setting.

Improved Methods on Time-Decay Streams
Vladimir Braverman, Harry Lang, Enayat Ullah, Samson Zhou
APPROX 2019 [abstract] [pdf]

In the time-decay model for data streams, elements of an underlying data set arrive sequentially with the recently arrived elements being more important. A common approach for handling large data sets is to maintain a \emph{coreset}, a succinct summary of the processed data that allows approximate recovery of a predetermined query. We provide a general framework that takes any offline-coreset and gives a time-decay coreset for polynomial time decay functions. We also consider the exponential time decay model for $k$-median clustering, where we provide a constant factor approximation algorithm that utilizes the online facility location algorithm. Our algorithm stores $O(k\log(h\Delta)+h)$ points where $h$ is the half-life of the decay function and $\Delta$ is the aspect ratio of the dataset. Our techniques extend to $k$-means clustering and $M$-estimators as well.

Data-Independent Memory Hard Functions: New Attacks and Stronger Constructions
Jeremiah Blocki, Ben Harsha, Siteng Kang, Seunghoon Lee, Lu Xing, Samson Zhou
CRYPTO 2019 [abstract] [Jeremiah's slides] [pdf]

Data-Independent Memory-hard functions (iMHFs) are a key cryptographic primitive underlying the design of moderately expensive password hashing algorithms and egalitarian proofs of work that are resistant to side-channel attacks. Several goals for MHFs have been proposed including bandwidth hardness, space-time (ST) complexity, amortized area-time (aAT) complexity and sustained space complexity. An iMHF can be specified using a directed acyclic graph (DAG) $G$ with $N=2^n$ nodes and low indegree, and the cost (aAT, ST etc...) to evaluate the iMHF can be analyzed using pebbling games. In particular, given a parameter $N$ (e.g., maximum acceptable running time) we would like to design the DAG $G$ to have maximum possible pebbling cost i.e., to ensure that the iMHF is as expensive as possible for an attacker to compute. Recently, Alwen et al.~\cite{CCS:AlwBloHar17} gave a randomized DAG construction called DRSample and proved that the aAT cost to pebble the graph was $\Omega\left( N^2/\log N\right)$. In an asymptotic sense the DRSample outperformed all prior constructions including Argon2i, the winner of the password hashing competition, which can be pebbled with aAT cost at most $\bigO\left(N^{1.767}\right)$. In this work we first prove a matching {\em upper bound} on the pebbling cost of DRSample by analyzing the greedy pebbling attack of Boneh et al.~\cite{AC:BonCorSch16}. This sequential attack on DRSample is simple, easy to implement and has good concrete performance. In fact, our results show that, for practical values of $N\leq 2^{24}$, Argon2i provides {\em stronger} resistance to known pebbling attacks than DRSample reversing a finding of Alwen et al.~\cite{CCS:AlwBloHar17}. We then develop a new iMHF candidate by extending DRSample with the bit-reversal graph, and show that the iMHF resists {\em all known attacks} in practice and has {\em optimal} asymptotic performance under every MHF metric. In particular, we prove that (1) {\em any} (nearly) sequential pebbling attack (including the greedy pebbling attack) has aAT cost $\Omega\left( N^2\right)$, (2) {\em any} parallel attacker has aAT cost at least $\Omega\left(N^2/\log N\right)$ and {\em at least} $\Omega\left(N^2 \log \log N/\log N\right)$ unless one can find new depth-reducing attacks against DRSample which significantly improve upon the state of the art, (3) the graph has high bandwidth-complexity, and (4) any pebbling {\em either} has aAT cost $\omega(N^2)$ or {\em requires} at least $\Omega(N)$ steps with $\Omega(N/\log N)$ pebbles on the DAG. This makes our construction the first practical iMHF with strong guarantees on the sustained space-complexity. We also observe that the Argon2i round function can (trivially) be evaluated in parallel, which would allow an attacker to reduce aAT costs by (nearly) an order of magnitude, and we develop an {\em inherently} sequential version of the Argon2i round function that prevents this attack. We implement our new iMHF candidate (with and without the sequential round function) and show that evaluation speed is nearly identical to Argon2i. Finally, we provide a pebbling reduction which proves that in the parallel random oracle model (PROM) the cost of evaluating an iMHF like Argon2i or DRSample+BRG is given by the pebbling cost of the underlying DAG.

Adversarially Robust Submodular Maximization under Knapsack Constraints
Dmitrii Avdiukhin, Slobodan Mitrović, Grigory Yaroslavtsev, Samson Zhou
KDD 2019 (selected for oral presentation) [abstract] [poster] [slides] [video] [pdf]

We propose the first adversarially robust algorithm for monotone submodular maximization under single and multiple knapsack constraints with scalable implementations in distributed and streaming settings. For a single knapsack constraint, our algorithm outputs a robust summary of almost optimal (up to polylogarithmic factors) size, from which a constant-factor approximation to the optimal solution can be constructed. For multiple knapsack constraints, our approximation is within a constant-factor of the best known non-robust solution. We evaluate the performance of our algorithms by comparison to natural robustifications of existing non-robust algorithms under two objectives: 1) dominating set for large social network graphs from Facebook and Twitter collected by the Stanford Network Analysis Project (SNAP), 2) movie recommendations on a dataset from MovieLens. Experimental results show that our algorithms give the best objective for a majority of the inputs and show strong performance even compared to offline algorithms that are given the set of removals in advance.

Structural Results on Matching Estimation with Applications to Streaming
Marc Bury, Elena Grigorescu, Andrew McGregor, Morteza Monemizadeh, Chris Schwiegelshohn, Sofya Vorotnikova, Samson Zhou
Algorithmica 2019 [abstract] [pdf]

We study the problem of estimating the size of a matching when the graph is revealed in a streaming fashion. Our results are multifold: \begin{enumerate} \item We give a tight structural result relating the size of a maximum matching to the {\em arboricity} of a graph, which has been one of the most studied graph parameters for matching algorithms in data streams. \item We further show that the weight of a maximum weighted matching can be efficiently estimated by augmenting any routine for estimating the size of an unweighted matching. Namely, given an algorithm for computing a $\lambda$-approximation in the unweighted case, we obtain a $2(1+\varepsilon)\cdot \lambda$ approximation for the weighted case, while only incurring a multiplicative logarithmic factor in the space bounds. The algorithm is implementable in any streaming model, including {\em dynamic} streams. \item We also investigate algebraic aspects of computing matchings in data streams, by proposing new algorithms and lower bounds based on analyzing the rank of the {\em Tutte-matrix} of the graph. In particular, we present an algorithm determining whether there exists a matching of size $k$ using $k^2\text{polylog } n $ space, where $n$ is the number of nodes in the graph. We also show a lower bound of $\Omega(n^{1-\varepsilon})$ space for small approximation factors to the rank of a matrix in {\em insertion-only} streams. \end{enumerate}

Bandwidth-Hard Functions: Reductions and Lower Bounds
Jeremiah Blocki, Ling Ren, Samson Zhou
CCS 2018 [abstract] [Jeremiah's slides] [Jeremiah's video@CCS] [pdf]

Memory Hard Functions (MHFs) have been proposed as an answer to the growing inequality between the computational speed of general purpose CPUs and Application Specific Integrated Circuits (ASICs). MHFs have seen widespread applications including password hashing, key stretching and proofs of work. Several metrics have been proposed to quantify the `memory hardness' of a function. Cumulative memory complexity (CMC) \cite{STOC:AlwSer15} (or amortized Area $\times$ Time complexity \cite{CCS:AlwBloHar17}) attempts to quantify the amortized cost to acquire/build the hardware to evaluate the function --- amortized by the number of instances of the function that can be evaluated of this hardware. By contrast, bandwidth hardness \cite{TCC:RenDev17} attempts to quantify the amortized energy costs of evaluating this function on hardware --- which in turn is largely dominated by the number of cache misses. Ideally, a good MHF would be both bandwidth hard and have high cumulative memory complexity. While the cumulative memory complexity of leading MHF candidates is well understood, little is known about the bandwidth hardness of many of the most prominent MHF candidates. Our contributions are as follows: First, we provide the first reduction proving that, in the parallel random oracle model, the bandwidth hardness of a Data-Independent Memory Hard Function (iMHF) is described by the red-blue pebbling cost of the directed acyclic graph (DAG) associated with that iMHF. Second, we show that the goals of designing an MHF with high CMC/bandwidth hardness are well aligned. In particular, we prove that {\em any} function with high CMC also has relatively high bandwidth costs. This result leads to the first {\em unconditional} lower bound on the bandwidth cost of scrypt. Third, we analyze the bandwidth hardness of several prominent iMHF candidates such as Argon2i \cite{BiryukovDK15}, winner of the password hashing competition, aATSample and DRSample \cite{CCS:AlwBloHar17} --- the first practical iMHF with asymptotically optimal CMC. More specifically, we show that Argon2i is maximally bandwidth hard as long as the cache-size $m$ is at most $m \in\O{n^{2/3-\epsilon}}$ where $n$ is the total number of data-labels produced during computation. We also show that aATSample and DRSample are maximally bandwidth hard as long as the cache-size is $m \in\O{n^{1-\epsilon}}$. Finally, we show that the problem of finding a red-blue pebbling with minimum bandwidth cost is NP-hard.

Nearly Optimal Distinct Elements and Heavy Hitters on Sliding Windows
Vladimir Braverman, Elena Grigorescu, Harry Lang, David P. Woodruff, Samson Zhou
APPROX 2018 [abstract] [poster] [slides] [pdf]

We study the \emph{distinct elements} and \emph{$\ell_p$-heavy hitters} problems in the \emph{sliding window} model, where only the most recent $n$ elements in the data stream form the underlying set. We first introduce the \emph{\histogram}, a simple twist on the exponential (Datar \etal, SODA 2002) and smooth histograms (Braverman and Ostrovsky, FOCS 2007) that may be of independent interest. We then show that the \histogram{} along with a careful combination of existing techniques to track either the identity or frequency of a few specific items suffices to obtain algorithms for both distinct elements and $\ell_p$-heavy hitters that are nearly optimal in both $n$ and $\eps$. Applying our new \histogram{} framework, we provide an algorithm that outputs a $(1+\eps)$-approximation to the number of distinct elements in the sliding window model and uses $\O{\frac{1}{\eps^2}\log n\log\frac{1}{\eps}\log\log n+\frac{1}{\eps}\log^2 n}$ bits of space. For $\ell_p$-heavy hitters, we provide an algorithm using space $\O{\frac{1}{\eps^p}\log^2 n\left(\log\log n+\log\frac{1}{\eps}\right)}$ for $p\in(0,2]$, improving upon the best-known algorithm for $\ell_2$-heavy hitters (Braverman \etal, COCOON 2014), which has space complexity $\O{\frac{1}{\eps^4}\log^3 n}$. We also show complementing nearly optimal lower bounds of $\Omega\left(\frac{1}{\eps}\log^2 n+\frac{1}{\eps^2}\log n\right)$ for distinct elements and $\Omega\left(\frac{1}{\eps^p}\log^2 n\right)$ for $\ell_p$-heavy hitters, both tight up to $\O{\log\log n}$ and $\O{\log\frac{1}{\eps}}$ factors.

Relaxed Locally Correctable Codes in Computationally Bounded Channels
Jeremiah Blocki, Venkata Gandikota, Elena Grigorescu, Samson Zhou
IEEE Transactions on Information Theory 2021
ISIT 2019 [abstract] [pdf]
ICALP 2018 (brief announcement)

Error-correcting codes that admit {\em local} decoding and correcting algorithms have been the focus of much recent research due to their numerous theoretical and practical applications. The goal is to obtain the best possible tradeoffs between the number of queries the algorithm can make to its oracle (the {\em locality} of the task), the amount of redundancy in the encoding (the {\em rate} of the code), and the amount of error it withstands. In the standard adverstimes channel model the current tradeoffs are dramatic, allowing either small query complexity and superpolynomial blocklength, or small blocklength but high query complexity. However, in the realistic, computationally bounded channel model, constructions of locally decodable codes (\LDCs) suddenly exhibit small locality and small blocklength, for constant error rate. The first such constructions are due to Ostrovsky, Pandey and Sahai (ICALP 2007) who built private \LDCs under the assumption that one-way functions exist, and in the setting where the sender and receiver share a private key. We study variants of locally decodable and locally correctable codes in computationally bounded but adverstimes channels, under the much weaker assumption that collision-resistant hash functions exist, and with no public-key or private-key cryptographic setup. Specifically, we provide constructions of {\em relaxed locally correctable codes} (\RLCCs) and {\em relaxed locally decodable codes} (\RLDCs) over binary alphabets, with constant rate and poly-logarithmic locality, that compare favorably with existing schemes built under much stronger cryptographic assumptions, and with classical \RLCCs in the computationally unbounded Hamming channel. Our constructions crucially employ {\em collision resistant hash functions} and {\em local expander graphs}, extending ideas from recent cryptographic constructions of memory-hard functions.

Periodicity in Data Streams with Wildcards
Funda Ergün, Elena Grigorescu, Erfan Sadeqi Azer, Samson Zhou
CSR 2018 (invited to special issue of Theory of Computing) [abstract] [pdf]

We investigate the problem of detecting periodic trends within a string $S$ of length $n$, arriving in the streaming model, containing at most $k$ wildcard characters, where $k=o(n)$. We say $S$ has wildcard-period $p$ if there exists an assignment to each of the wildcard characters so that in the resulting stream the length $n-p$ prefix equals the length $n-p$ suffix. We present a two-pass streaming algorithm that computes wildcard-periods of $S$ using $\O{k^3\,\polylog\,n}$ bits of space, while we also show that this problem cannot be solved in sublinear space in one pass. In addition, we present complementing lower bounds, while showing a new communication complexity on the sparse index problem.

On the Computational Complexity of Minimal Cumulative Cost Graph Pebbling
Jeremiah Blocki, Samson Zhou
FC 2018 [abstract] [slides] [pdf]

We consider the computational complexity of finding a legal black pebbling of a DAG $G=(V,E)$ with minimum cumulative cost. A black pebbling is a sequence $P_0,\ldots, P_t \subseteq V$ of sets of nodes which must satisfy the following properties: $P_0 = \emptyset$ (we start off with no pebbles on $G$), $\sinks(G) \subseteq \bigcup_{j \leq t} P_j$ (every sink node was pebbled at some point) and $\parents\big(P_{i+1}\backslash P_i\big) \subseteq P_i$ (we can only place a new pebble on a node $v$ if all of $v$'s parents had a pebble during the last round). The cumulative cost of a pebbling $P_0,P_1,\ldots, P_t \subseteq V$ is $\cc(P) = \left| P_1\right| + \ldots + \left| P_t\right|$. The cumulative pebbling cost is an especially important security metric for data-independent memory hard functions, an important primitive for password hashing. Thus, an efficient (approximation) algorithm would be an invaluable tool for the cryptanalysis of password hash functions as it would provide an automated tool to establish tight bounds on the amortized space-time cost of computing the function. We show that such a tool is unlikely to exist in the most general case. In particular, we prove the following results. \begin{itemize} \item It is $\NPhard$ to find a pebbling minimizing cumulative cost. \item The natural linear program relaxation for the problem has integrality gap $\tilde{O}(n)$, where $n$ is the number of nodes in $G$. We conjecture that the problem is hard to approximate. \item We show that a related problem, find the minimum size subset $S\subseteq V$ such that $\depth(G-S) \leq d$, is also $\NPhard$. In fact, under the Unique Games Conjecture there is no $(2-\epsilon)$-approximation algorithm. \end{itemize}

On the Economics of Offline Password Cracking
Jeremiah Blocki, Ben Harsha, Samson Zhou
S&P 2018 [abstract] [Jeremiah's slides] [Jeremiah's video@S&P] [pdf]

We develop an economic model of an offline password cracker which allows us to make quantitative predictions about the fraction of accounts that a rational password attacker would crack in the event of an authentication server breach. We apply our economic model to analyze recent massive password breaches at Yahoo!, Dropbox, LastPass and AshleyMadison. All four organizations were using key-stretching to protect user passwords. In fact, LastPass' use of PBKDF2-SHA256 with $10^5$ hash iterations exceeds 2017 NIST minimum recommendation by an order of magnitude. Nevertheless, our analysis paints a bleak picture: the adopted key-stretching levels provide insufficient protection for user passwords. In particular, we present strong evidence that most user passwords follow a Zipf's law distribution, and characterize the behavior of a rational attacker when user passwords are selected from a Zipf's law distribution. We show that there is a finite threshold which depends on the Zipf's law parameters that characterizes the behavior of a rational attacker --- if the value of a cracked password (normalized by the cost of computing the password hash function) exceeds this threshold then the adversary's optimal strategy is {\em always} to continue attacking until each user password has been cracked. In all cases (Yahoo!, Dropbox, LastPass and AshleyMadison) we find that the value of a cracked password almost certainly exceeds this threshold meaning that a rational attacker would crack all passwords that are selected from the Zipf's law distribution (i.e., most user passwords). This prediction holds even if we incorporate an aggressive model of diminishing returns for the attacker (e.g., the total value of $500$ million cracked passwords is less than $100$ times the total value of $5$ million passwords). On a positive note our analysis demonstrates that memory hard functions (MHFs) such as SCRYPT or Argon2i can significantly reduce the damage of an offline attack. In particular, we find that because MHFs substantially increase guessing costs a rational attacker will give up well before he cracks most user passwords and this prediction holds even if the attacker does not encounter diminishing returns for additional cracked passwords. Based on our analysis we advocate that password hashing standards should be updated to require the use of memory hard functions for password hashing and disallow the use of non-memory hard functions such as BCRYPT or PBKDF2.

Streaming for Aibohphobes: Longest Palindrome with Mismatches
Elena Grigorescu, Erfan Sadeqi Azer, Samson Zhou
FSTTCS 2017 [abstract] [slides] [pdf]

A palindrome is a string that reads the same as its reverse, such as ``aibohphobia'' (fear of palindromes). Given an integer $d>0$, a {\em $d$-near-palindrome} is a string of Hamming distance at most $d$ from its reverse. We study the natural problem of identifying a longest $d$-near-palindrome in data streams. The problem is relevant to the analysis of DNA databases, and to the task of repairing recursive structures in documents such as XML and JSON. We present an algorithm that returns a $d$-near-palindrome whose length is within a multiplicative $(1+\eps)$-factor of the longest $d$-near-palindrome. Our algorithm also returns the set of mismatched indices of the $d$-near-palindrome, using $\bigO{\frac{d\log^7 n}{\eps\log(1+\eps)}}$ bits of space, and $\bigO{\frac{d\log^6 n}{\eps\log(1+\eps)}}$ update time per arriving symbol. We show that $\Omega(d\log n)$ space is necessary for estimating the length of longest $d$-near-palindromes with high probability. We further obtain an additive-error approximation algorithm and a comparable lower bound, as well as an {\em exact} two-pass algorithm that solves the longest $d$-near-palindrome problem using $\bigO{d^2\sqrt{n}\log^6 n}$ bits of space.

On the Depth-Robustness and Cumulative Pebbling Cost of Argon2i
Jeremiah Blocki, Samson Zhou
TCC 2017 [abstract] [Jeremiah's slides] [pdf]

Argon2i is a data-independent memory hard function that won the password hashing competition. The password hashing algorithm has already been incorporated into several open source crypto libraries such as libsodium. In this paper we analyze the cumulative memory cost of computing Argon2i. On the positive side we provide a lower bound for Argon2i. On the negative side we exhibit an improved attack against Argon2i which demonstrates that our lower bound is nearly tight. In particular, we show that \begin{enumerate} \item An Argon2i DAG is $\left(e,O\left(n^3/e^3\right)\right))$-reducible. \item The cumulative pebbling cost for Argon2i is at most $O\left(n^{1.768}\right)$. This improves upon the previous best upper bound of $O\left(n^{1.8}\right)$ \cite{AB17}. \item Argon2i DAG is $\left(e,\tilde{\Omega}\left(n^3/e^3\right)\right))$-depth robust. By contrast, analysis of \cite{ABP17} only established that Argon2i was $\left(e,\tilde{\Omega}\left(n^3/e^2\right)\right))$-depth robust. \item The cumulative pebbling complexity of Argon2i is at least $\tilde{\Omega}\left( n^{1.75}\right)$. This improves on the previous best bound of $\Omega\left( n^{1.66}\right)$ \cite{ABP17} and demonstrates that Argon2i has higher cumulative memory cost than competing proposals such as Catena or Balloon Hashing. \end{enumerate} We also show that Argon2i has high {\em fractional} depth-robustness which strongly suggests that data-dependent modes of Argon2 are resistant to space-time tradeoff attacks.

Longest Alignment with Edits in Data Streams
Elena Grigorescu, Erfan Sadeqi Azer, Samson Zhou
Allerton 2017 [abstract] [pdf]

Analyzing patterns in streamed data generated by network traffic, sensor networks, or satellite feeds is a challenge for systems in which the available storage is limited. In addition, real data is noisy, which makes designing data stream algorithms even more challenging. Motivated by such challenges, we study algorithms for detecting the similarity of two data streams that can be read in sync. Two strings $S, T\in \Sigma^n$ form a $d$-near-alignment if the distance between them in some given metric is at most $d$. We study the problem of identifying a longest substring of $S$ and $T$ that forms a {\em $d$-near-alignment} under the {\em edit} distance, in the {\em simultaneous streaming model}. In this model, symbols of strings $S$ and $T$ are streamed at the same time, and the amount of available processing space is sublinear in the length of the strings. We give several algorithms, including an exact one-pass algorithm that uses $\O{d^2+d\log n}$ bits of space. We couple these results with comparable lower bounds.

Streaming Periodicity with Mismatches
Funda Ergün, Elena Grigorescu, Erfan Sadeqi Azer, Samson Zhou
RANDOM 2017 [abstract] [slides] [pdf]

We study the problem of finding all k-periods of a length-n string S, presented as a data stream. S is said to have k-period p if its prefix of length n-p differs from its suffix of length n-p in at most k locations. The study of periodic patterns in sequences is fundamental to string algorithms, time series data mining, and computational biology. Since real data is rarely perfect, exact pattern finding in streamed data can be unrealistic; consequently, one needs to design algorithms that can withstand errors in the patterns. It is often the case that such tasks become much more difficult to analyze than their no-error analogues. This turns out to be the case in the study of near periodicity here. While our algorithms are similar to the ones in the exact version previously studied, our analysis requires a new structural understanding of k-periodicity. We give a one-pass streaming algorithm that computes the k-periods of a string S using poly(k, log n) bits of space, for k-periods of length at most n/2. We also present a two-pass streaming algorithm that computes k-periods of S using poly(k, log n) bits of space, regardless of period length. We complement these results with comparable lower bounds.

Nearly Optimal Sparse Group Testing
Venkata Gandikota, Elena Grigorescu, Sidharth Jaggi, Samson Zhou
IEEE Transactions on Information Theory 2019 [journal]
Allerton 2016 [abstract] [slides] [conference]

Group testing is the process of pooling arbitrary subsets from a set of $n$ items so as to identify, with a minimal number of disjunctive tests, a ``small'' subset of $d$ defective items. In ``classical'' non-adaptive group testing, it is known that when $d = o(n^{1-\delta})$ for any $\delta>0$, $\theta(d\log(n))$ tests are both information-theoretically necessary, and sufficient to guarantee recovery with high probability. Group testing schemes in the literature meeting this bound require most items to be tested $\Omega(\log(n))$ times, and most tests to incorporate $\Omega(n/d)$ items.
Motivated by physical considerations, we study group testing models in which the testing procedure is constrained to be ``sparse''. Specifically, we consider (separately) scenarios in which (a) items are finitely divisible and hence may participate in at most $\gamma$ tests; and (b) tests are size-constrained to pool no more than $\rho$ items per test. For both scenarios we provide information-theoretic lower bounds on the number of tests required to guarantee high probability recovery. In particular, one of our main results shows that $\gamma$-finite divisibility of items forces {\it any} group testing algorithm with probability of recovery error at most $\epsilon$ to perform at least $\Omega(\gamma d(n/d)^{(1-2\epsilon)/((1+2\epsilon)\gamma)})$ tests. Analogously, for $\rho$-sized constrained tests, we show an information-theoretic lower bound of $\Omega(n\log(n/d)/(\rho\log(n/\rho d)))$. In both scenarios we provide both randomized constructions (under both $\epsilon$-error and zero-error reconstruction guarantees) and explicit constructions of computationally efficient group-testing algorithms (under $\epsilon$-error reconstruction guarantees) that require a number of tests that are optimal up to constant factors in some regimes of $n, d, \gamma \text{ and } \rho$. We also investigate the effect of unreliability/noise in test outcomes.

Preprints

Fast, Space-Optimal Streaming Algorithms for Clustering and Subspace Embeddings
Vincent Cohen-Addad, Liudeng Wang, David P. Woodruff, Samson Zhou
[abstract] [pdf]

We show that both clustering and subspace embeddings can be performed in the streaming model with the same asymptotic efficiency as in the central/offline setting. For $(k, z)$-clustering in the streaming model, we achieve a number of words of memory which is independent of the number $n$ of input points and the aspect ratio $\Delta$, yielding an optimal bound of $\tilde{\mathcal{O}}\left(\frac{dk}{\min(\varepsilon^4,\varepsilon^{z+2})}\right)$ words for accuracy parameter $\varepsilon$ on $d$-dimensional points. Additionally, we obtain amortized update time of $d\,\log(k)\cdot\text{polylog}(\log(n\Delta))$, which is an exponential improvement over the previous $d\,\text{poly}(k,\log(n\Delta))$. Our method also gives the fastest runtime for $(k,z)$-clustering even in the offline setting. For subspace embeddings in the streaming model, we achieve $\mathcal{O}(d)$ update time and space-optimal constructions, using $\tilde{\mathcal{O}}\left(\frac{d^2}{\varepsilon^2}\right)$ words for $p\le 2$ and $\tilde{\mathcal{O}}\left(\frac{d^{p/2+1}}{\varepsilon^2}\right)$ words for $p>2$, showing that streaming algorithms can match offline algorithms in both space and time complexity.