Samson Zhou 


Hi, I'm a Postdoctoral Fellow at Carnegie Mellon University, hosted by David Woodruff.
My current research interests are sublinear algorithms, machine learning, and password hashing.
Previously, I was a postdoc at Indiana University hosted by Grigory Yaroslavtsev, a postdoc at
Purdue University hosted by Jeremiah Blocki, and a graduate student
at Purdue University advised by Greg Frederickson and Elena Grigorescu.
Here are some of my favorite links.
Feel free to reach me at:
samsonzhou AT gmail DOT com
The cumulative pebbling complexity of a directed acyclic graph G is defined as cc(G) = \min_P \sum_i |P_i|, where the minimum is taken over all legal (parallel) black pebblings of G and |P_i| denotes the number of pebbles on the graph during round i. Intuitively, cc(G) captures the amortized Space-Time complexity of pebbling m copies of G in parallel. The cumulative pebbling complexity of a graph G is of particular interest in the field of cryptography as cc(G) is tightly related to the amortized Area-Time complexity of the data-independent memory hard function (iMHF) f_{G,H} (Alwen and Serbinenko, STOC 2015) defined using a constant indegree directed acyclic graph (DAG) G and a random oracle H. A secure iMHF should have amortized Space-Time complexity as high as possible, e.g., to deter a brute-force password attacker who wants to find x such that f_{G,H}(x) = h. Thus, to analyze the (in)security of a candidate iMHF f_{G,H}, it is crucial to estimate the value cc(G), but currently the upper and lower bounds for leading iMHF candidates differ by several orders of magnitude. Blocki and Zhou recently showed that it is NP-Hard to compute cc(G), but their techniques do not even rule out an efficient (1+epsilon)-approximation algorithm for any constant epsilon>0. We show that for any constant c > 0, it is Unique Games hard to approximate cc(G) to within a factor of c. Along the way, we show the hardness of approximation of the DAG Vertex Deletion problem on DAGs of constant indegree. Namely, we show that for any k,epsilon>0, given a DAG G with N nodes and constant indegree, it is Unique Games hard to distinguish between the case that G is (e_1, d_1)-reducible with e_1=N^{1/(1+2epsilon)}/k and d_1=k N^{2epsilon/(1+2epsilon)} and the case that G is (e_2, d_2)-depth-robust with e_2 = (1-epsilon)k e_1 and d_2= 0.9 N^{(1+epsilon)/(1+2epsilon)}, which may be of independent interest. Our result generalizes a result of Svensson who proved an analogous result for DAGs with indegree O(N).
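To make the reducibility notion above concrete, here is a minimal sketch that checks whether a given node set S witnesses that a DAG is (e, d)-reducible. It assumes one common convention (the abstract does not spell it out): depth counts the nodes on a longest directed path, and S is a witness when |S| <= e and depth(G - S) <= d. The names `depth_after_removal` and `is_reducibility_witness` are hypothetical, and the sketch only verifies a candidate S; it does not search for one, which is the hard problem.

```python
from typing import Dict, Iterable, List, Set


def depth_after_removal(parents: Dict[int, List[int]], removed: Iterable[int]) -> int:
    """Number of nodes on a longest directed path in G - S.

    Assumption of this sketch: nodes are integers labelled in topological
    order, so every parent has a smaller label than its child.
    """
    removed_set: Set[int] = set(removed)
    longest: Dict[int, int] = {}
    for v in sorted(parents):
        if v in removed_set:
            continue
        best = 0
        for u in parents[v]:
            if u not in removed_set:
                best = max(best, longest[u])
        longest[v] = best + 1  # longest path ending at v, counted in nodes
    return max(longest.values(), default=0)


def is_reducibility_witness(parents: Dict[int, List[int]],
                            removed: Iterable[int], e: int, d: int) -> bool:
    """True if the removed set has size at most e and leaves depth at most d."""
    removed_set = set(removed)
    return len(removed_set) <= e and depth_after_removal(parents, removed_set) <= d


# Hypothetical toy input: the path 0 -> 1 -> ... -> 7.
path = {v: ([v - 1] if v > 0 else []) for v in range(8)}
```

On this toy path, removing nodes 2 and 5 leaves three segments of two nodes each, so {2, 5} witnesses (2, 2)-reducibility under the stated convention.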
Memory hard functions (MHFs) are an important cryptographic primitive that are used to design egalitarian proofs of work and in the construction of moderately expensive key-derivation functions resistant to brute-force attacks. Broadly speaking, MHFs can be divided into two categories: data-dependent memory hard functions (dMHFs) and data-independent memory hard functions (iMHFs). iMHFs are resistant to certain side-channel attacks as the memory access pattern induced by the honest evaluation algorithm is independent of the potentially sensitive input, e.g., a password. While dMHFs are potentially vulnerable to side-channel attacks (the induced memory access pattern might leak useful information to a brute-force attacker), they can achieve higher cumulative memory complexity (CMC) than an iMHF. In particular, any iMHF that can be evaluated in $N$ steps on a sequential machine has CMC {\em at most} $\O{\frac{N^2\log\log N}{\log N}}$. By contrast, the dMHF scrypt achieves maximal CMC $\Omega(N^2)$, though the CMC of scrypt would be reduced to just $\O{N}$ after a side-channel attack. In this paper, we introduce the notion of computationally data-independent memory hard functions (ciMHFs). Intuitively, we require that the memory access pattern induced by the (randomized) ciMHF evaluation algorithm appears to be independent from the standpoint of a computationally bounded eavesdropping attacker, even if the attacker selects the initial input. We then ask whether it is possible to circumvent the known upper bound for iMHFs and build a ciMHF with CMC $\Omega(N^2)$. Surprisingly, we answer the question in the affirmative when the ciMHF evaluation algorithm is executed on a two-tiered memory architecture (RAM/Cache). We introduce the notion of a $k$-restricted dynamic graph to quantify the continuum between unrestricted dMHFs $(k=n)$ and iMHFs ($k=1$).
For any $\eps > 0$ we show how to construct a $k$-restricted dynamic graph with $k=\Omega(N^{1-\eps})$ that provably achieves maximum cumulative pebbling cost $\Omega(N^2)$. We can use $k$-restricted dynamic graphs to build a ciMHF provided that cache is large enough to hold $k$ hash outputs and the dynamic graph satisfies a certain property that we call ``amenable to shuffling.'' In particular, we prove that the induced memory access pattern is indistinguishable to a polynomial time attacker who can monitor the locations of read/write requests to RAM, but not cache. We also show that when $k=o\left(N^{1/\log\log N}\right)$, then any $k$-restricted graph with constant indegree has cumulative pebbling cost $o(N^2)$. Our results almost completely characterize the spectrum of $k$-restricted dynamic graphs.
Network performance problems are notoriously difficult to diagnose. Prior profiling systems collect performance statistics by keeping information about each network flow, but maintaining per-flow state is not scalable on resource-constrained NIC and switch hardware. Instead, we propose sketch-based performance monitoring using memory that is sublinear in the number of flows. Existing sketches estimate metrics based on flow sizes. In contrast, performance monitoring typically requires combining information across pairs of packets, such as matching a data packet with its acknowledgment to compute a round-trip time. We define a new class of \emph{lean} algorithms that use memory sublinear in both the size of input data and the number of flows. We then introduce lean algorithms for a set of important statistics, such as identifying flows with high latency, loss, out-of-order, or retransmitted packets. We implement prototypes of our lean algorithms on a commodity programmable switch using the P4 language. Our experiments show that lean algorithms detect $\sim$82\% of top 100 problematic flows among real-world packet traces using just 40KB memory.
A function $f : \F_2^n \to \R$ is \emph{$s$-sparse} if it has at most $s$ nonzero Fourier coefficients. Motivated by applications to fast sparse Fourier transforms over $\F_2^n$, we study efficient algorithms for the problem of approximating the $\ell_2$-distance from a given function to the closest $s$-sparse function. While previous works (e.g., Gopalan \emph{et al.} SICOMP 2011) study the problem of distinguishing $s$-sparse functions from those that are far from $s$-sparse under Hamming distance, to the best of our knowledge no prior work has explicitly focused on the more general problem of distance estimation in the $\ell_2$ setting, which is particularly well-motivated for noisy Fourier spectra. Given the focus on efficiency, our main result is an algorithm that solves this problem with query complexity $\O{s}$ for constant accuracy and error parameters, which is only quadratically worse than applicable lower bounds.
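The quantity being estimated can be computed exactly, if inefficiently, by brute force. The sketch below is not the query-efficient algorithm from the paper: it reads all $2^n$ values, computes every Fourier coefficient with a fast Walsh-Hadamard transform, and uses Parseval's identity, under which the squared $\ell_2$-distance to the closest $s$-sparse function is the sum of the squared coefficients outside the $s$ largest in magnitude. The normalization (coefficients scaled by $2^{-n}$, norm taken as an average over inputs) and the function names are assumptions made for illustration.

```python
from typing import List


def walsh_hadamard(values: List[float]) -> List[float]:
    """Fourier coefficients of f over F_2^n, normalised by 2^-n.

    values[x] is f(x), with x read as an n-bit integer.
    """
    coeffs = list(values)
    size = len(coeffs)
    assert size > 0 and size & (size - 1) == 0, "length must be a power of two"
    h = 1
    while h < size:
        for start in range(0, size, 2 * h):
            for i in range(start, start + h):
                a, b = coeffs[i], coeffs[i + h]
                coeffs[i], coeffs[i + h] = a + b, a - b
        h *= 2
    return [c / size for c in coeffs]


def sq_distance_to_s_sparse(values: List[float], s: int) -> float:
    """Squared l2 distance (averaged over inputs) to the closest s-sparse function."""
    energies = sorted((c * c for c in walsh_hadamard(values)), reverse=True)
    return sum(energies[s:])


# Hypothetical example on n = 2 bits: f(x) = (-1)^{x_1} + 0.5 * (-1)^{x_2},
# with x_1 stored in the low bit of the index.
f_values = [1.5, -0.5, 0.5, -1.5]
```

Here the only nonzero coefficients are 1 and 0.5, so `sq_distance_to_s_sparse(f_values, 1)` is 0.25 and the function is exactly 2-sparse.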
We study the problem of constructing a linear sketch of minimum dimension that allows approximation of a given real-valued function $f \colon \ftwo^n \rightarrow \mathbb R$ with small expected squared error. We develop a general theory of linear sketching for such functions through which we analyze their dimension for most commonly studied types of valuation functions: additive, budget-additive, coverage, $\alpha$-Lipschitz submodular and matroid rank functions. This gives a characterization of how many bits of information have to be stored about the input $x$ so that one can compute $f$ under additive updates to its coordinates. Our results are tight in most cases and we also give extensions to the distributional version of the problem where the input $x \in \ftwo^n$ is generated uniformly at random. Using known connections with dynamic streaming algorithms, both upper and lower bounds on dimension obtained in our work extend to the space complexity of algorithms evaluating $f(x)$ under long sequences of additive updates to the input $x$ presented as a stream. Similar results hold for simultaneous communication in a distributed setting.
In the time-decay model for data streams, elements of an underlying data set arrive sequentially with the recently arrived elements being more important. A common approach for handling large data sets is to maintain a \emph{coreset}, a succinct summary of the processed data that allows approximate recovery of a predetermined query. We provide a general framework that takes any offline coreset and gives a time-decay coreset for polynomial time decay functions. We also consider the exponential time decay model for $k$-median clustering, where we provide a constant factor approximation algorithm that utilizes the online facility location algorithm. Our algorithm stores $O(k\log(h\Delta)+h)$ points where $h$ is the half-life of the decay function and $\Delta$ is the aspect ratio of the dataset. Our techniques extend to $k$-means clustering and $M$-estimators as well.
Data-Independent Memory-hard functions (iMHFs) are a key cryptographic primitive underlying the design of moderately expensive password hashing algorithms and egalitarian proofs of work that are resistant to side-channel attacks. Several goals for MHFs have been proposed including bandwidth hardness, space-time (ST) complexity, amortized area-time (aAT) complexity and sustained space complexity. An iMHF can be specified using a directed acyclic graph (DAG) $G$ with $N=2^n$ nodes and low indegree, and the cost (aAT, ST, etc.) to evaluate the iMHF can be analyzed using pebbling games. In particular, given a parameter $N$ (e.g., maximum acceptable running time) we would like to design the DAG $G$ to have maximum possible pebbling cost, i.e., to ensure that the iMHF is as expensive as possible for an attacker to compute. Recently, Alwen et al.~\cite{CCS:AlwBloHar17} gave a randomized DAG construction called DRSample and proved that the aAT cost to pebble the graph was $\Omega\left( N^2/\log N\right)$. In an asymptotic sense DRSample outperformed all prior constructions including Argon2i, the winner of the password hashing competition, which can be pebbled with aAT cost at most $\bigO\left(N^{1.767}\right)$. In this work we first prove a matching {\em upper bound} on the pebbling cost of DRSample by analyzing the greedy pebbling attack of Boneh et al.~\cite{AC:BonCorSch16}. This sequential attack on DRSample is simple, easy to implement and has good concrete performance. In fact, our results show that, for practical values of $N\leq 2^{24}$, Argon2i provides {\em stronger} resistance to known pebbling attacks than DRSample, reversing a finding of Alwen et al.~\cite{CCS:AlwBloHar17}. We then develop a new iMHF candidate by extending DRSample with the bit-reversal graph, and show that the iMHF resists {\em all known attacks} in practice and has {\em optimal} asymptotic performance under every MHF metric.
In particular, we prove that (1) {\em any} (nearly) sequential pebbling attack (including the greedy pebbling attack) has aAT cost $\Omega\left( N^2\right)$, (2) {\em any} parallel attacker has aAT cost at least $\Omega\left(N^2/\log N\right)$ and {\em at least} $\Omega\left(N^2 \log \log N/\log N\right)$ unless one can find new depth-reducing attacks against DRSample which significantly improve upon the state of the art, (3) the graph has high bandwidth-complexity, and (4) any pebbling {\em either} has aAT cost $\omega(N^2)$ or {\em requires} at least $\Omega(N)$ steps with $\Omega(N/\log N)$ pebbles on the DAG. This makes our construction the first practical iMHF with strong guarantees on the sustained space-complexity. We also observe that the Argon2i round function can (trivially) be evaluated in parallel, which would allow an attacker to reduce aAT costs by (nearly) an order of magnitude, and we develop an {\em inherently} sequential version of the Argon2i round function that prevents this attack. We implement our new iMHF candidate (with and without the sequential round function) and show that evaluation speed is nearly identical to Argon2i. Finally, we provide a pebbling reduction which proves that in the parallel random oracle model (PROM) the cost of evaluating an iMHF like Argon2i or DRSample+BRG is given by the pebbling cost of the underlying DAG.
We propose the first adversarially robust algorithm for monotone submodular maximization under single and multiple knapsack constraints with scalable implementations in distributed and streaming settings. For a single knapsack constraint, our algorithm outputs a robust summary of almost optimal (up to polylogarithmic factors) size, from which a constant-factor approximation to the optimal solution can be constructed. For multiple knapsack constraints, our approximation is within a constant factor of the best known non-robust solution. We evaluate the performance of our algorithms by comparison to natural robustifications of existing non-robust algorithms under two objectives: 1) dominating set for large social network graphs from Facebook and Twitter collected by the Stanford Network Analysis Project (SNAP), 2) movie recommendations on a dataset from MovieLens. Experimental results show that our algorithms give the best objective for a majority of the inputs and show strong performance even compared to offline algorithms that are given the set of removals in advance.
We study the problem of estimating the size of a matching when the graph is revealed in a streaming fashion. Our results are multifold: \begin{enumerate} \item We give a tight structural result relating the size of a maximum matching to the {\em arboricity} of a graph, which has been one of the most studied graph parameters for matching algorithms in data streams. \item We further show that the weight of a maximum weighted matching can be efficiently estimated by augmenting any routine for estimating the size of an unweighted matching. Namely, given an algorithm for computing a $\lambda$-approximation in the unweighted case, we obtain a $2(1+\varepsilon)\cdot \lambda$-approximation for the weighted case, while only incurring a multiplicative logarithmic factor in the space bounds. The algorithm is implementable in any streaming model, including {\em dynamic} streams. \item We also investigate algebraic aspects of computing matchings in data streams, by proposing new algorithms and lower bounds based on analyzing the rank of the {\em Tutte-matrix} of the graph. In particular, we present an algorithm determining whether there exists a matching of size $k$ using $k^2\text{polylog } n $ space, where $n$ is the number of nodes in the graph. We also show a lower bound of $\Omega(n^{1-\varepsilon})$ space for small approximation factors to the rank of a matrix in {\em insertion-only} streams. \end{enumerate}
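The algebraic connection behind the third item is the classical fact that the rank of the Tutte matrix, with independent indeterminates, equals twice the size of a maximum matching. The following offline sketch is not the streaming algorithm from the paper: it substitutes random field elements for the indeterminates and computes the rank by Gaussian elimination modulo a prime, which gives the right answer with high probability. The function name `tutte_rank`, the choice of the prime $2^{61}-1$, and the fixed seed are assumptions made for illustration.

```python
import random
from typing import List, Tuple


def tutte_rank(n: int, edges: List[Tuple[int, int]],
               prime: int = (1 << 61) - 1, seed: int = 0) -> int:
    """Rank over GF(prime) of a randomly instantiated Tutte matrix.

    Assumes a simple undirected graph on vertices 0..n-1 (no self-loops).
    """
    rng = random.Random(seed)
    m = [[0] * n for _ in range(n)]
    for u, v in edges:
        a, b = min(u, v), max(u, v)
        x = rng.randrange(1, prime)   # random stand-in for the indeterminate x_{ab}
        m[a][b] = x
        m[b][a] = (-x) % prime        # skew-symmetric entry
    rank = 0
    for col in range(n):
        pivot = next((r for r in range(rank, n) if m[r][col]), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        inv = pow(m[rank][col], prime - 2, prime)  # inverse via Fermat's little theorem
        for r in range(rank + 1, n):
            if m[r][col]:
                factor = m[r][col] * inv % prime
                m[r] = [(a - factor * b) % prime for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical toy graphs.
path4 = [(0, 1), (1, 2), (2, 3)]   # maximum matching of size 2
star4 = [(0, 1), (0, 2), (0, 3)]   # maximum matching of size 1
```

Halving the returned rank gives the matching size on these toy graphs: 2 for the path and 1 for the star.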
Memory Hard Functions (MHFs) have been proposed as an answer to the growing inequality between the computational speed of general purpose CPUs and Application Specific Integrated Circuits (ASICs). MHFs have seen widespread applications including password hashing, key stretching and proofs of work. Several metrics have been proposed to quantify the `memory hardness' of a function. Cumulative memory complexity (CMC) \cite{STOC:AlwSer15} (or amortized Area $\times$ Time complexity \cite{CCS:AlwBloHar17}) attempts to quantify the amortized cost to acquire/build the hardware to evaluate the function, amortized by the number of instances of the function that can be evaluated on this hardware. By contrast, bandwidth hardness \cite{TCC:RenDev17} attempts to quantify the amortized energy costs of evaluating this function on hardware, which in turn is largely dominated by the number of cache misses. Ideally, a good MHF would be both bandwidth hard and have high cumulative memory complexity. While the cumulative memory complexity of leading MHF candidates is well understood, little is known about the bandwidth hardness of many of the most prominent MHF candidates. Our contributions are as follows: First, we provide the first reduction proving that, in the parallel random oracle model, the bandwidth hardness of a Data-Independent Memory Hard Function (iMHF) is described by the red-blue pebbling cost of the directed acyclic graph (DAG) associated with that iMHF. Second, we show that the goals of designing an MHF with high CMC/bandwidth hardness are well aligned. In particular, we prove that {\em any} function with high CMC also has relatively high bandwidth costs. This result leads to the first {\em unconditional} lower bound on the bandwidth cost of scrypt.
Third, we analyze the bandwidth hardness of several prominent iMHF candidates such as Argon2i \cite{BiryukovDK15}, winner of the password hashing competition, aATSample and DRSample \cite{CCS:AlwBloHar17}, the first practical iMHF with asymptotically optimal CMC. More specifically, we show that Argon2i is maximally bandwidth hard as long as the cache-size $m$ is at most $m \in\O{n^{2/3-\epsilon}}$ where $n$ is the total number of data-labels produced during computation. We also show that aATSample and DRSample are maximally bandwidth hard as long as the cache-size is $m \in\O{n^{1-\epsilon}}$. Finally, we show that the problem of finding a red-blue pebbling with minimum bandwidth cost is NP-hard.
We study the \emph{distinct elements} and \emph{$\ell_p$-heavy hitters} problems in the \emph{sliding window} model, where only the most recent $n$ elements in the data stream form the underlying set. We first introduce the \emph{\histogram}, a simple twist on the exponential (Datar \etal, SODA 2002) and smooth histograms (Braverman and Ostrovsky, FOCS 2007) that may be of independent interest. We then show that the \histogram{} along with a careful combination of existing techniques to track either the identity or frequency of a few specific items suffices to obtain algorithms for both distinct elements and $\ell_p$-heavy hitters that are nearly optimal in both $n$ and $\eps$. Applying our new \histogram{} framework, we provide an algorithm that outputs a $(1+\eps)$-approximation to the number of distinct elements in the sliding window model and uses $\O{\frac{1}{\eps^2}\log n\log\frac{1}{\eps}\log\log n+\frac{1}{\eps}\log^2 n}$ bits of space. For $\ell_p$-heavy hitters, we provide an algorithm using space $\O{\frac{1}{\eps^p}\log^2 n\left(\log\log n+\log\frac{1}{\eps}\right)}$ for $0
Error-correcting codes that admit {\em local} decoding and correcting algorithms have been the focus of much recent research due to their numerous theoretical and practical applications. The goal is to obtain the best possible tradeoffs between the number of queries the algorithm can make to its oracle (the {\em locality} of the task), the amount of redundancy in the encoding (the {\em rate} of the code), and the amount of error it withstands. In the standard adversarial channel model the current tradeoffs are dramatic, allowing either small query complexity and superpolynomial blocklength, or small blocklength but high query complexity. However, in the realistic, computationally bounded channel model, constructions of locally decodable codes (\LDCs) suddenly exhibit small locality and small blocklength, for constant error rate. The first such constructions are due to Ostrovsky, Pandey and Sahai (ICALP 2007) who built private \LDCs under the assumption that one-way functions exist, and in the setting where the sender and receiver share a private key. We study variants of locally decodable and locally correctable codes in computationally bounded but adversarial channels, under the much weaker assumption that collision-resistant hash functions exist, and with no public-key or private-key cryptographic setup. Specifically, we provide constructions of {\em relaxed locally correctable codes} (\RLCCs) and {\em relaxed locally decodable codes} (\RLDCs) over binary alphabets, with constant rate and polylogarithmic locality, that compare favorably with existing schemes built under much stronger cryptographic assumptions, and with classical \RLCCs in the computationally unbounded Hamming channel. Our constructions crucially employ {\em collision-resistant hash functions} and {\em local expander graphs}, extending ideas from recent cryptographic constructions of memory-hard functions.
We investigate the problem of detecting periodic trends within a string $S$ of length $n$, arriving in the streaming model, containing at most $k$ wildcard characters, where $k=o(n)$. We say $S$ has wildcard-period $p$ if there exists an assignment to each of the wildcard characters so that in the resulting stream the length $n-p$ prefix equals the length $n-p$ suffix. We present a two-pass streaming algorithm that computes wildcard-periods of $S$ using $\O{k^3\,\polylog\,n}$ bits of space, while we also show that this problem cannot be solved in sublinear space in one pass. In addition, we present complementing lower bounds, while showing a new communication complexity on the sparse index problem.
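The definition of a wildcard-period can be checked directly when the whole string is available. The sketch below is an offline, linear-space check and not the two-pass streaming algorithm: the prefix of length $n-p$ equals the suffix of length $n-p$ exactly when $S[i]=S[i+p]$ for all valid $i$, so a single assignment of the wildcards exists if and only if the non-wildcard characters within each residue class modulo $p$ all agree. The function name and the choice of `?` as the wildcard symbol are assumptions made for illustration.

```python
def has_wildcard_period(s: str, p: int, wildcard: str = "?") -> bool:
    """Check whether p is a wildcard-period of s (offline, not streaming)."""
    n = len(s)
    if not 0 < p <= n:
        raise ValueError("period must satisfy 1 <= p <= len(s)")
    for residue in range(p):
        seen = None  # the character forced on this residue class, if any
        for i in range(residue, n, p):
            c = s[i]
            if c == wildcard:
                continue
            if seen is None:
                seen = c
            elif c != seen:
                return False  # two fixed characters in one class disagree
    return True
```

For example, `has_wildcard_period("ab?ab?ab", 3)` holds because the two wildcards can both be assigned the same character, while `has_wildcard_period("a?b", 1)` fails because no assignment makes `a` and `b` equal.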
We consider the computational complexity of finding a legal black pebbling of a DAG $G=(V,E)$ with minimum cumulative cost. A black pebbling is a sequence $P_0,\ldots, P_t \subseteq V$ of sets of nodes which must satisfy the following properties: $P_0 = \emptyset$ (we start off with no pebbles on $G$), $\sinks(G) \subseteq \bigcup_{j \leq t} P_j$ (every sink node was pebbled at some point) and $\parents\big(P_{i+1}\backslash P_i\big) \subseteq P_i$ (we can only place a new pebble on a node $v$ if all of $v$'s parents had a pebble during the last round). The cumulative cost of a pebbling $P_0,P_1,\ldots, P_t \subseteq V$ is $\cc(P) = \left| P_1\right| + \ldots + \left| P_t\right|$. The cumulative pebbling cost is an especially important security metric for data-independent memory hard functions, an important primitive for password hashing. Thus, an efficient (approximation) algorithm would be an invaluable tool for the cryptanalysis of password hash functions as it would provide an automated tool to establish tight bounds on the amortized space-time cost of computing the function. We show that such a tool is unlikely to exist in the most general case. In particular, we prove the following results. \begin{itemize} \item It is $\NPhard$ to find a pebbling minimizing cumulative cost. \item The natural linear program relaxation for the problem has integrality gap $\tilde{O}(n)$, where $n$ is the number of nodes in $G$. We conjecture that the problem is hard to approximate. \item We show that a related problem, finding the minimum size subset $S\subseteq V$ such that $\depth(G-S) \leq d$, is also $\NPhard$. In fact, under the Unique Games Conjecture there is no $(2-\epsilon)$-approximation algorithm. \end{itemize}
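The three pebbling conditions and the cumulative cost translate directly into a checker. The sketch below only verifies a given pebbling and sums its cost; it makes no attempt to find a minimum-cost pebbling, which is the problem shown to be hard. The representation (a dict of parent lists, a list of sets for $P_0,\ldots,P_t$) and the function names are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Sequence, Set


def is_legal_pebbling(parents: Dict[int, List[int]], sinks: Iterable[int],
                      pebbling: Sequence[Set[int]]) -> bool:
    """Check the three conditions of a legal (parallel) black pebbling."""
    if not pebbling or pebbling[0]:
        return False                      # P_0 must be the empty set
    for prev, cur in zip(pebbling, pebbling[1:]):
        for v in cur - prev:              # newly placed pebbles in this round
            if not set(parents[v]) <= prev:
                return False              # some parent lacked a pebble last round
    pebbled = set().union(*pebbling)
    return set(sinks) <= pebbled          # every sink was pebbled at some point


def cumulative_cost(pebbling: Sequence[Set[int]]) -> int:
    """cc(P) = |P_1| + ... + |P_t|."""
    return sum(len(p) for p in pebbling[1:])


# Hypothetical toy DAG: the path 0 -> 1 -> 2 with sink 2.
toy_parents = {0: [], 1: [0], 2: [1]}
frugal = [set(), {0}, {1}, {2}]            # drops each pebble once it is no longer needed
hoarding = [set(), {0}, {0, 1}, {0, 1, 2}]  # never removes a pebble
```

Both toy pebblings are legal, but the first has cumulative cost 3 and the second has cost 6, which is the kind of gap the cumulative cost is meant to capture.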
We develop an economic model of an offline password cracker which allows us to make quantitative predictions about the fraction of accounts that a rational password attacker would crack in the event of an authentication server breach. We apply our economic model to analyze recent massive password breaches at Yahoo!, Dropbox, LastPass and AshleyMadison. All four organizations were using key-stretching to protect user passwords. In fact, LastPass' use of PBKDF2-SHA256 with $10^5$ hash iterations exceeds the 2017 NIST minimum recommendation by an order of magnitude. Nevertheless, our analysis paints a bleak picture: the adopted key-stretching levels provide insufficient protection for user passwords. In particular, we present strong evidence that most user passwords follow a Zipf's law distribution, and characterize the behavior of a rational attacker when user passwords are selected from a Zipf's law distribution. We show that there is a finite threshold which depends on the Zipf's law parameters that characterizes the behavior of a rational attacker: if the value of a cracked password (normalized by the cost of computing the password hash function) exceeds this threshold then the adversary's optimal strategy is {\em always} to continue attacking until each user password has been cracked. In all cases (Yahoo!, Dropbox, LastPass and AshleyMadison) we find that the value of a cracked password almost certainly exceeds this threshold, meaning that a rational attacker would crack all passwords that are selected from the Zipf's law distribution (i.e., most user passwords). This prediction holds even if we incorporate an aggressive model of diminishing returns for the attacker (e.g., the total value of $500$ million cracked passwords is less than $100$ times the total value of $5$ million passwords). On a positive note our analysis demonstrates that memory hard functions (MHFs) such as SCRYPT or Argon2i can significantly reduce the damage of an offline attack.
In particular, we find that because MHFs substantially increase guessing costs a rational attacker will give up well before he cracks most user passwords and this prediction holds even if the attacker does not encounter diminishing returns for additional cracked passwords. Based on our analysis we advocate that password hashing standards should be updated to require the use of memory hard functions for password hashing and disallow the use of non-memory hard functions such as BCRYPT or PBKDF2.
A palindrome is a string that reads the same as its reverse, such as ``aibohphobia'' (fear of palindromes). Given an integer $d>0$, a {\em $d$-near-palindrome} is a string of Hamming distance at most $d$ from its reverse. We study the natural problem of identifying a longest $d$-near-palindrome in data streams. The problem is relevant to the analysis of DNA databases, and to the task of repairing recursive structures in documents such as XML and JSON. We present an algorithm that returns a $d$-near-palindrome whose length is within a multiplicative $(1+\eps)$-factor of the longest $d$-near-palindrome. Our algorithm also returns the set of mismatched indices of the $d$-near-palindrome, using $\bigO{\frac{d\log^7 n}{\eps\log(1+\eps)}}$ bits of space, and $\bigO{\frac{d\log^6 n}{\eps\log(1+\eps)}}$ update time per arriving symbol. We show that $\Omega(d\log n)$ space is necessary for estimating the length of longest $d$-near-palindromes with high probability. We further obtain an additive-error approximation algorithm and a comparable lower bound, as well as an {\em exact} two-pass algorithm that solves the longest $d$-near-palindrome problem using $\bigO{d^2\sqrt{n}\log^6 n}$ bits of space.
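The definition above can be illustrated with a brute-force baseline. The sketch below is not the streaming algorithm: it stores the whole string and tries every substring, so it uses linear space and cubic time. It is only meant to pin down what a $d$-near-palindrome is. The function names are assumptions made for illustration.

```python
def hamming_to_reverse(s: str) -> int:
    """Hamming distance between s and its reverse."""
    return sum(a != b for a, b in zip(s, reversed(s)))


def longest_d_near_palindrome(s: str, d: int) -> str:
    """Leftmost longest substring within Hamming distance d of its reverse."""
    n = len(s)
    for length in range(n, 0, -1):            # try the longest candidates first
        for start in range(n - length + 1):
            candidate = s[start:start + length]
            if hamming_to_reverse(candidate) <= d:
                return candidate
    return ""
```

Note that each mismatched pair of positions contributes 2 to the distance, so `hamming_to_reverse("abcda")` is 2 and the string is a 2-near-palindrome but not a 1-near-palindrome.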
Argon2i is a data-independent memory hard function that won the password hashing competition. The password hashing algorithm has already been incorporated into several open source crypto libraries such as libsodium. In this paper we analyze the cumulative memory cost of computing Argon2i. On the positive side we provide a lower bound for Argon2i. On the negative side we exhibit an improved attack against Argon2i which demonstrates that our lower bound is nearly tight. In particular, we show that \begin{enumerate} \item An Argon2i DAG is $\left(e,O\left(n^3/e^3\right)\right)$-reducible. \item The cumulative pebbling cost for Argon2i is at most $O\left(n^{1.768}\right)$. This improves upon the previous best upper bound of $O\left(n^{1.8}\right)$ \cite{AB17}. \item An Argon2i DAG is $\left(e,\tilde{\Omega}\left(n^3/e^3\right)\right)$-depth robust. By contrast, analysis of \cite{ABP17} only established that Argon2i was $\left(e,\tilde{\Omega}\left(n^3/e^2\right)\right)$-depth robust. \item The cumulative pebbling complexity of Argon2i is at least $\tilde{\Omega}\left( n^{1.75}\right)$. This improves on the previous best bound of $\Omega\left( n^{1.66}\right)$ \cite{ABP17} and demonstrates that Argon2i has higher cumulative memory cost than competing proposals such as Catena or Balloon Hashing. \end{enumerate} We also show that Argon2i has high {\em fractional} depth-robustness which strongly suggests that data-dependent modes of Argon2 are resistant to space-time tradeoff attacks.
Analyzing patterns in streamed data generated by network traffic, sensor networks, or satellite feeds is a challenge for systems in which the available storage is limited. In addition, real data is noisy, which makes designing data stream algorithms even more challenging. Motivated by such challenges, we study algorithms for detecting the similarity of two data streams that can be read in sync. Two strings $S, T\in \Sigma^n$ form a $d$-near-alignment if the distance between them in some given metric is at most $d$. We study the problem of identifying a longest substring of $S$ and $T$ that forms a {\em $d$-near-alignment} under the {\em edit} distance, in the {\em simultaneous streaming model}. In this model, symbols of strings $S$ and $T$ are streamed at the same time, and the amount of available processing space is sublinear in the length of the strings. We give several algorithms, including an exact one-pass algorithm that uses $\O{d^2+d\log n}$ bits of space. We couple these results with comparable lower bounds.
We study the problem of finding all k-periods of a length-n string S, presented as a data stream. S is said to have k-period p if its prefix of length n-p differs from its suffix of length n-p in at most k locations. The study of periodic patterns in sequences is fundamental to string algorithms, time series data mining, and computational biology. Since real data is rarely perfect, exact pattern finding in streamed data can be unrealistic; consequently, one needs to design algorithms that can withstand errors in the patterns. It is often the case that such tasks become much more difficult to analyze than their no-error analogues. This turns out to be the case in the study of near periodicity here. While our algorithms are similar to the ones in the exact version previously studied, our analysis requires a new structural understanding of k-periodicity. We give a one-pass streaming algorithm that computes the k-periods of a string S using poly(k, log n) bits of space, for k-periods of length at most n/2. We also present a two-pass streaming algorithm that computes k-periods of S using poly(k, log n) bits of space, regardless of period length. We complement these results with comparable lower bounds.
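As a reference point for the definition, the k-periods of a string can be listed by direct comparison when the whole string fits in memory. This quadratic-time, linear-space sketch is a baseline only and has nothing to do with the poly(k, log n) space streaming algorithms; the function name `k_periods` is an assumption made for illustration.

```python
from typing import List


def k_periods(s: str, k: int) -> List[int]:
    """All p in [1, n-1] such that the length n-p prefix and suffix differ in at most k places."""
    n = len(s)
    periods = []
    for p in range(1, n):
        # Position i of the prefix is aligned with position i + p of the string.
        mismatches = sum(s[i] != s[i + p] for i in range(n - p))
        if mismatches <= k:
            periods.append(p)
    return periods
```

For instance, `k_periods("abababab", 0)` gives `[2, 4, 6]`, and corrupting one symbol to get `"abababcb"` removes every exact period while `k_periods("abababcb", 1)` gives `[2, 4, 6, 7]`.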
Group testing is the process of pooling arbitrary subsets from a set of $n$ items so as to identify,
with a minimal number of disjunctive tests, a ``small'' subset of $d$ defective items.
In ``classical'' non-adaptive group testing, it is known that when $d = o(n^{1-\delta})$ for any $\delta>0$,
$\Theta(d\log(n))$ tests are both information-theoretically necessary, and sufficient to guarantee recovery
with high probability. Group testing schemes in the literature meeting this bound require most items to be
tested $\Omega(\log(n))$ times, and most tests to incorporate $\Omega(n/d)$ items.
Motivated by physical considerations, we study group testing models in which the testing procedure is
constrained to be ``sparse''. Specifically, we consider (separately) scenarios in which (a) items are
finitely divisible and hence may participate in at most $\gamma$ tests; and (b) tests are size-constrained
to pool no more than $\rho$ items per test. For both scenarios we provide information-theoretic lower bounds
on the number of tests required to guarantee high probability recovery. In particular, one of our main
results shows that $\gamma$-finite divisibility of items forces {\it any} group testing algorithm with
probability of recovery error at most $\epsilon$ to perform at least
$\Omega(\gamma d(n/d)^{(1-2\epsilon)/((1+2\epsilon)\gamma)})$ tests.
Analogously, for $\rho$-sized constrained tests, we show an information-theoretic lower bound of
$\Omega(n\log(n/d)/(\rho\log(n/\rho d)))$. In both scenarios we provide both randomized constructions
(under both $\epsilon$-error and zero-error reconstruction guarantees) and explicit constructions of computationally
efficient group-testing algorithms (under $\epsilon$-error reconstruction guarantees) that require a number of tests
that are optimal up to constant factors in some regimes of $n, d, \gamma \text{ and } \rho$. We also investigate the
effect of unreliability/noise in test outcomes.
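As an illustration of the sparse setting in scenario (a), the following Python sketch builds a random design in which every item joins exactly $\gamma$ tests, runs noiseless disjunctive tests, and decodes with the standard rule that any item seen in a negative test is non-defective. The function names and the random design are assumptions made for this example; they are not the constructions or decoders analyzed in the paper.

```python
import random


def gamma_sparse_design(n, num_tests, gamma, rng):
    """Place each of the n items into exactly gamma distinct tests (assumed design)."""
    pools = [set() for _ in range(num_tests)]
    for item in range(n):
        for t in rng.sample(range(num_tests), gamma):
            pools[t].add(item)
    return pools


def run_tests(pools, defectives):
    """Disjunctive outcome: a test is positive iff it pools at least one defective."""
    return [any(item in defectives for item in pool) for pool in pools]


def decode_by_elimination(n, pools, outcomes):
    """Clear every item that appears in a negative test; report the rest.

    With noiseless tests this never misses a true defective, but it may
    report extra items when there are too few tests.
    """
    cleared = set()
    for pool, positive in zip(pools, outcomes):
        if not positive:
            cleared.update(pool)
    return set(range(n)) - cleared


rng = random.Random(0)
pools = gamma_sparse_design(n=50, num_tests=20, gamma=3, rng=rng)
outcomes = run_tests(pools, defectives={4, 17})
candidates = decode_by_elimination(50, pools, outcomes)
```

Here `candidates` always contains items 4 and 17; how many extra items survive depends on the number of tests relative to the lower bounds stated above.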
The problem of selecting a small-size representative summary of a large dataset is a cornerstone of machine learning, optimization and data science. Motivated by applications to recommendation systems and other scenarios with query-limited access to vast amounts of data, we propose a new rigorous algorithmic framework for a standard formulation of this problem as a submodular maximization subject to a linear (knapsack) constraint. Our framework is based on augmenting all partial Greedy solutions with the best additional item. It can be instantiated with negligible overhead in any model of computation, which allows the classic Greedy algorithm and its variants to be implemented. We give such instantiations in the offline (Greedy+Max), multi-pass streaming (Sieve+Max) and distributed (Distributed+Max) settings. Our algorithms give (1/2-epsilon)-approximation with most other key parameters of interest being near-optimal. Our analysis is based on a new set of first-order linear differential inequalities and their robust approximate versions. Experiments on typical datasets (movie recommendations, influence maximization) confirm scalability and high quality of solutions obtained via our framework. Instance-specific approximations are typically in the 0.6-0.7 range and frequently beat even the (1-1/e) \approx 0.63 worst-case barrier for polynomial-time algorithms.
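The idea of augmenting every partial Greedy solution with the best additional item can be sketched offline in Python as follows. This is a reading of the one-sentence description above, assuming a density-based Greedy rule, strictly positive costs, and a monotone submodular `f`; the name `greedy_plus_max` and all details such as tie-breaking are assumptions, not the paper's exact pseudocode.

```python
def greedy_plus_max(items, cost, f, budget):
    """Sketch of Greedy with single-item augmentation under a knapsack constraint.

    items: list of hashable ids; cost: dict id -> positive cost;
    f: function from a set of ids to a number; budget: knapsack capacity.
    """
    greedy, spent = set(), 0
    best, best_val = set(), f(set())
    while True:
        feasible = [e for e in items
                    if e not in greedy and spent + cost[e] <= budget]
        if not feasible:
            break
        # Augmentation: best single item that fits on top of the partial solution.
        for e in feasible:
            val = f(greedy | {e})
            if val > best_val:
                best, best_val = greedy | {e}, val
        # Greedy step: item with the largest marginal gain per unit cost.
        base = f(greedy)
        nxt = max(feasible, key=lambda e: (f(greedy | {e}) - base) / cost[e])
        if f(greedy | {nxt}) - base <= 0:
            break
        greedy.add(nxt)
        spent += cost[nxt]
    return best, best_val


# Toy coverage instance where density Greedy alone gets stuck on the cheap item.
sets = {"small": {0}, "big": {1, 2, 3}}
costs = {"small": 1, "big": 4}
coverage = lambda S: len(set().union(*(sets[e] for e in S)))
solution, value = greedy_plus_max(list(sets), costs, coverage, budget=4)
```

On this toy instance plain density Greedy picks "small" and can no longer afford "big", ending with value 1, while the augmentation step applied to the empty partial solution finds "big" with value 3.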
Constructions of locally decodable codes (LDCs) have one of two undesirable properties: low rate or high locality (polynomial in the length of the message). In settings where the encoder/decoder have already exchanged cryptographic keys and the channel is a probabilistic polynomial time (PPT) algorithm, it is possible to circumvent these barriers and design LDCs with constant rate and small locality. However, the assumption that the encoder/decoder have exchanged cryptographic keys is often prohibitive. We thus consider the problem of designing explicit and efficient LDCs in settings where the channel is {\em slightly} more constrained than the encoder/decoder with respect to some resource, e.g., space or (sequential) time. Given an explicit function $f$ that the channel cannot compute, we show how the encoder can transmit a random secret key to the local decoder using $f(\cdot)$ and a random oracle $H$. This allows us to bootstrap from the private key LDC construction of Ostrovsky, Pandey and Sahai (ICALP 2007), thereby answering an open question posed by Guruswami and Smith (FOCS 2010) of whether such bootstrapping techniques may apply to LDCs in weaker channel models than just PPT algorithms. Specifically, in the random oracle model we show how to construct explicit constant rate LDCs with optimal locality of $\mathrm{polylog}$ in the security parameter against various resource constrained channels.
Model compression provides a means to efficiently deploy deep neural networks (DNNs) on devices that have limited computation resources and tight power budgets, such as mobile and IoT (Internet of Things) devices. Consequently, model compression is one of the most critical topics in modern deep learning. Typically, the state-of-the-art model compression methods suffer from a big limitation: they are only based on heuristics rather than theoretical foundation and thus offer no worst-case guarantees. To bridge this gap, Baykal et al. [2018a] suggested using a coreset, a small weighted subset of the data that provably approximates the original data set, to sparsify the parameters of a trained fully-connected neural network by sampling a number of neural network parameters based on the importance of the data. However, the sampling procedure is data-dependent and can only be performed after an expensive training phase. We propose the use of data-independent coresets to perform provable model compression without the need for training. We first prove that there exists a coreset whose size is independent of the input size of the data for any neuron whose activation function is from a family of functions that includes variants of ReLU, sigmoid and others. We then provide a compression-based algorithm that constructs these coresets and explicitly applies neuron pruning for the underlying model. We demonstrate the effectiveness of our methods with experimental evaluations for both synthetic and real-world benchmark network compression. In particular, our framework provides up to 90% compression on the LeNet-300-100 architecture on MNIST and actually improves the accuracy.
We initiate the study of numerical linear algebra in the sliding window model, where only the most recent $W$ updates in the data stream form the underlying set. Although most existing work in the sliding window model uses the {\em smooth histogram} framework, most interesting linear-algebraic problems are not smooth; we show that the spectral norm, vector induced matrix norms, generalized regression, and low-rank approximation are not amenable to the smooth histogram framework. To overcome this challenge, we first give a deterministic algorithm that achieves spectral approximation in the sliding window model that can be viewed as a generalization of smooth histograms, using the Loewner ordering of positive semidefinite matrices. We then give algorithms for both spectral approximation and low-rank approximation that are space-optimal up to polylogarithmic factors. Our algorithms are based on a new notion of ``reverse online'' leverage scores that account for both how unique and how recent a row is. We show that by sampling rows based on their reverse online leverage scores and repeatedly downsampling each time a new row arrives, we can both oversample rows with respect to their true leverage scores, and also bound the total number of rows stored. The downsampling procedure can be streamlined so that both our spectral approximation algorithm and our low-rank approximation algorithm run in input sparsity runtime, up to lower order factors. We show that our techniques have a number of applications to linear-algebraic problems in other settings. Specifically, we show that our analysis immediately implies an algorithm for low-rank approximation in the online setting that is space-optimal up to logarithmic factors, as well as nearly input sparsity time. We then show our deterministic spectral approximation algorithm can be used to handle $\ell_1$ spectral approximation in the sliding window model under a certain assumption on the bit complexity of the entries.
Finally, we show that our downsampling framework can be applied to the problem of approximate matrix multiplication and provide upper and lower bounds that are tight up to $\log\log W$ factors.
We present linear-time algorithms for partitioning a path or a tree with weights on the vertices by removing $k$ edges to maximize the minimum-weight component. We also use the same framework to partition a path with weights on the vertices, removing $k$ edges to minimize the maximum-weight component. The algorithms use the parametric search paradigm, testing candidate values until an optimum is found while simultaneously reducing the running time needed for each test. For path-partitioning, the algorithm employs a synthetic weighting scheme that results in a constant fraction reduction in running time after each test. For tree-partitioning, our dual-pronged strategy makes progress no matter what the layout of our tree is.
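The "test a candidate value" step for the max-min path problem can be illustrated with a short Python sketch. It assumes non-negative integer weights and $0 \le k < n$, and it uses plain binary search over candidate values, which takes $O(n \log W)$ time for total weight $W$; it is not the linear-time parametric search of the paper, and the function names are hypothetical.

```python
def can_split(weights, k, target):
    """Feasibility test: can removing k edges leave every component with weight >= target?

    Scan the path left to right and cut as soon as the running weight
    reaches the target; any leftover tail is merged into the last component.
    """
    parts, acc = 0, 0
    for w in weights:
        acc += w
        if acc >= target:
            parts += 1
            acc = 0
    return parts >= k + 1


def max_min_path_partition(weights, k):
    """Largest target value for which the feasibility test succeeds."""
    lo, hi = 0, sum(weights)  # target 0 is feasible whenever k < len(weights)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if can_split(weights, k, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```

For the path with weights [1, 2, 3, 4, 5] and k = 1, the best cut separates {1, 2, 3} from {4, 5}, so the minimum component weight is 6.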
We consider the problem of estimating the weight of a maximum weighted matching of a weighted graph $G(V,E)$ whose edges are revealed in a streaming fashion. Extending the framework from Crouch and Stubbs (APPROX 2014), we develop a reduction from the maximum weighted matching problem to the maximum cardinality matching problem that only doubles the approximation factor of a streaming algorithm developed for the maximum cardinality matching problem. Our results hold for the insertion-only and the dynamic (i.e., insertion and deletion) edge-arrival streaming models. The previous best-known reduction is due to Bury and Schwiegelshohn (ESA 2015) who develop an algorithm whose approximation guarantee scales by a polynomial factor. As an application, we obtain improved estimators for weighted planar graphs and, more generally, for weighted bounded-arboricity graphs, by feeding into our reduction the recent estimators due to Esfandiari et al. (SODA 2015) and to Chitnis et al. (SODA 2016). In particular, we obtain a $(48+\epsilon)$-approximation estimator for the weight of a maximum weighted matching in planar graphs.