Streaming algorithm

In computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes (typically just one). These algorithms have limited memory available to them (much less than the input size) and also limited processing time per item.

These constraints may mean that an algorithm produces an approximate answer based on a summary or "sketch" of the data stream in memory.

History

Though streaming algorithms had already been studied by Munro and Paterson^[1] as early as 1980, as well as Flajolet and Martin in 1982/83,^[2] the field of streaming algorithms was first formalized and popularized in a 1996 paper by Noga Alon, Yossi Matias, and Mario Szegedy.^[3] For this paper, the authors later won the Gödel Prize in 2005 "for their foundational contribution to streaming algorithms." There has since been a large body of work centered around data streaming algorithms that spans a diverse spectrum of computer science fields such as theory, databases, networking, and natural language processing.

Semi-streaming algorithms were introduced in 2005 as an extension of streaming algorithms that allows for a constant or logarithmic number of passes over the dataset.

Models

In the data stream model, some or all of the input data that are to be operated on are not available for random access from disk or memory, but rather arrive as one or more continuous data streams.

Streams can be denoted as an ordered sequence of points (or "updates") that must be accessed in order and can be read only once or a small number of times.

Much of the streaming literature is concerned with computing statistics on frequency distributions that are too large to be stored. For this class of problems, there is a vector $\mathbf {a} =(a_{1},\dots ,a_{n})$ (initialized to the zero vector $\mathbf {0}$) that has updates presented to it in a stream. The goal of these algorithms is to compute functions of $\mathbf {a}$ using considerably less space than it would take to represent $\mathbf {a}$ precisely. There are two common models for updating such streams, called the "cash register" and "turnstile" models.^[4]

In the cash register model, each update is of the form $\langle i,c\rangle$, so that $a_{i}$ is incremented by some positive integer $c$. A notable special case is when $c=1$ (only unit insertions are permitted).

In the turnstile model, each update is of the form $\langle i,c\rangle$, so that $a_{i}$ is incremented by some (possibly negative) integer $c$. In the "strict turnstile" model, no $a_{i}$ at any time may be less than zero.
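The update semantics of the two models can be illustrated with a minimal Python sketch. It stores the whole vector explicitly, which a real streaming algorithm would avoid, and is meant only to show which updates each model permits. The function name `apply_updates` and the `model` strings are hypothetical names chosen for this illustration.

```python
from collections import defaultdict


def apply_updates(updates, model="turnstile"):
    """Apply a stream of (i, c) updates to an initially zero frequency vector."""
    a = defaultdict(int)  # sparse stand-in for the vector a = (a_1, ..., a_n)
    for i, c in updates:
        # Cash register model: only positive increments are allowed.
        if model == "cash_register" and c <= 0:
            raise ValueError("cash register updates must be positive")
        a[i] += c
        # Strict turnstile model: no entry may ever drop below zero.
        if model == "strict_turnstile" and a[i] < 0:
            raise ValueError("strict turnstile forbids negative entries")
    return dict(a)
```

For example, `apply_updates([(1, 2), (3, 1), (1, -1)])` leaves $a_{1}=1$ and $a_{3}=1$, while the same stream is rejected under `model="cash_register"` because of the negative update.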

Several papers also consider the "sliding window" model. In this model, the function of interest is computed over a fixed-size window in the stream. As the stream progresses, items from the end of the window are removed from consideration while new items from the stream take their place.
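As a rough illustration of the sliding window model, the following Python sketch maintains the sum of the most recent items. It keeps the entire window in memory, so it shows the semantics of the model rather than a space-efficient algorithm; the class name `SlidingWindowSum` is a hypothetical name for this example.

```python
from collections import deque


class SlidingWindowSum:
    """Sum over the last `window_size` items of a stream (window stored explicitly)."""

    def __init__(self, window_size):
        self.window = deque()
        self.window_size = window_size
        self.total = 0

    def add(self, item):
        # The new item enters the window ...
        self.window.append(item)
        self.total += item
        # ... and the oldest item leaves once the window is full.
        if len(self.window) > self.window_size:
            self.total -= self.window.popleft()
        return self.total
```

With a window of size 3, feeding the items 1, 2, 3, 4, 5 yields the running results 1, 3, 6, 9, 12.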

Besides the above frequency-based problems, some other types of problems have also been studied. Many graph problems are solved in the setting where the adjacency matrix or the adjacency list of the graph is streamed in some unknown order. There are also some problems that are very dependent on the order of the stream (i.e., asymmetric functions), such as counting the number of inversions in a stream and finding the longest increasing subsequence.

Evaluation

The performance of an algorithm that operates on data streams is measured by three basic factors:

The number of passes the algorithm must make over the stream.
The available memory.
The running time of the algorithm.

These algorithms have many similarities with online algorithms since they both require decisions to be made before all data are available, but they are not identical. Data stream algorithms only have limited memory available but they may be able to defer action until a group of points arrive, while online algorithms are required to take action as soon as each point arrives.

If the algorithm is an approximation algorithm, then the accuracy of the answer is another key factor. The accuracy is often stated as an $(\epsilon ,\delta )$ approximation, meaning that the algorithm achieves an error of less than $\epsilon$ with probability $1-\delta$.

Applications

Streaming algorithms have several applications in networking such as monitoring network links for elephant flows, counting the number of distinct flows, estimating the distribution of flow sizes, and so on.^[5] They also have applications in databases, such as estimating the size of a join.

Some streaming problems

Frequency moments

The $k$th frequency moment of a set of frequencies $\mathbf {a}$ is defined as $F_{k}(\mathbf {a} )=\sum _{i=1}^{n}a_{i}^{k}$.

The first moment $F_{1}$ is simply the sum of the frequencies (i.e., the total count). The second moment $F_{2}$ is useful for computing statistical properties of the data, such as the Gini coefficient of variation. $F_{\infty }$ is defined as the frequency of the most frequent item(s).
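For reference, the moments can be computed exactly when the full frequency table fits in memory. The Python sketch below does this with a `Counter`; it is a baseline for checking streaming estimates, not a streaming algorithm, and the function names are hypothetical.

```python
from collections import Counter


def frequency_moment(stream, k):
    """Exact F_k: the sum of the k-th powers of the item frequencies."""
    counts = Counter(stream)
    # For k = 0 every distinct item contributes 1, so F_0 is the distinct count.
    return sum(c ** k for c in counts.values())


def f_infinity(stream):
    """Exact F_infinity: the frequency of the most frequent item."""
    return max(Counter(stream).values())
```

For the stream a, b, a, c, a, b the frequencies are 3, 2 and 1, so $F_{0}=3$, $F_{1}=6$, $F_{2}=14$ and $F_{\infty }=3$.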

The seminal paper of Alon, Matias, and Szegedy dealt with the problem of estimating the frequency moments.

Calculating Frequency Moments

A direct approach to finding the frequency moments requires maintaining a register $m_{i}$ for every distinct element $a_{i}\in \{1,2,3,4,\dots ,N\}$, which requires at least $\Omega (N)$ memory.^[3] Under the space limitations of the streaming model, an algorithm that uses much less memory is needed. This can be achieved by computing approximations instead of exact values: an algorithm computes an $(\varepsilon ,\delta )$-approximation of $F_{k}$ if it outputs a value $F'_{k}$ such that $\Pr[|F'_{k}-F_{k}|\leq \varepsilon F_{k}]\geq 1-\delta$,^[6] where $\varepsilon$ is the approximation parameter and $\delta$ is the confidence parameter.^[7]

Calculating F₀ (Distinct Elements in a Data Stream)

FM-Sketch Algorithm

Flajolet et al.^[2] introduced a probabilistic method of counting that was inspired by Robert Morris's paper "Counting large numbers of events in small registers". Morris observed that, if the requirement of exact accuracy is dropped, a counter $n$ can be replaced by a counter $\log n$, which can be stored in $\log \log n$ bits.^[8] Flajolet et al.^[2] improved on this method by using a hash function $h$ that is assumed to distribute the elements uniformly over the hash space (binary strings of length $L$).

$h:[m]\rightarrow [0,2^{L}-1]$

Let $\mathrm {bit} (y,k)$ denote the $k$th bit in the binary representation of $y$:

$y=\sum _{k\geq 0}\mathrm {bit} (y,k)\cdot 2^{k}$

Let $\rho (y)$ denote the position of the least significant 1-bit in the binary representation of $y$, with a suitable convention for $\rho (0)$:

$\rho (y)={\begin{cases}\min\{k:\mathrm {bit} (y,k)=1\}&{\text{if }}y>0\\L&{\text{if }}y=0\end{cases}}$
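The two bit-level helpers can be written in a few lines of Python. This is a minimal sketch in which the default hash length `L=32` is an assumption made for illustration; the position of the least significant 1-bit is obtained by isolating that bit with `y & -y`.

```python
def bit(y, k):
    """Return the k-th bit (0 = least significant) of the non-negative integer y."""
    return (y >> k) & 1


def rho(y, L=32):
    """Position of the least significant 1-bit of y, with rho(0) = L by convention."""
    if y == 0:
        return L
    # y & -y keeps only the lowest set bit; its bit length minus one is its position.
    return (y & -y).bit_length() - 1
```

For example, 12 is 1100 in binary, so `rho(12)` is 2, and `rho(0)` falls back to `L`.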

Let $A$ be a data stream of length $M$ whose cardinality needs to be determined. Let BITMAP[0...L − 1] be the hash space where the values $\rho (\mathrm {hash} (x))$ are recorded. The algorithm below then determines the approximate cardinality of $A$.

Procedure FM-Sketch:

    for i in 0 to L − 1 do
        BITMAP[i] := 0
    end for
    for x in A do
        Index := ρ(hash(x))
        if BITMAP[Index] = 0 then
            BITMAP[Index] := 1
        end if
    end for
    B := position of the leftmost 0 bit of BITMAP[]
    return 2^B

If there are $N$ distinct elements in the data stream, then:

For $i\gg \log(N)$, BITMAP[i] is almost certainly 0
For $i\ll \log(N)$, BITMAP[i] is almost certainly 1
For $i\approx \log(N)$, BITMAP[i] lies in a fringe of 0s and 1s
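The FM-Sketch procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the hash length `L = 32`, the SHA-256-based `hash32` helper standing in for the uniform hash function, and the guard for the $\rho (0)=L$ case are all choices made for this example. A single bitmap yields only a coarse, power-of-two estimate.

```python
import hashlib

L = 32  # assumed hash length in bits


def hash32(x):
    """Hypothetical uniform hash: the first 32 bits of SHA-256 of str(x)."""
    digest = hashlib.sha256(str(x).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def rho(y):
    """Position of the least significant 1-bit of y, with rho(0) = L."""
    return L if y == 0 else (y & -y).bit_length() - 1


def fm_sketch(stream):
    """Single-bitmap estimate of the number of distinct items, returned as 2^B."""
    bitmap = [0] * L
    for x in stream:
        index = rho(hash32(x))
        if index < L:  # rho(0) == L would fall outside the bitmap
            bitmap[index] = 1
    # B is the position of the leftmost (lowest-index) 0 bit; L if all bits are set.
    b = bitmap.index(0) if 0 in bitmap else L
    return 2 ** b
```

Because the bitmap only records which positions were hit, repeating items in the stream does not change the result.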

K-Minimum Value Algorithm

The previous algorithm describes the first attempt to approximate F₀ in the data stream by Flajolet and Martin. Their algorithm picks a random hash function which they assume to uniformly distribute the hash values in hash space.

Bar-Yossef et al.^[7] introduced the k-minimum value (KMV) algorithm for determining the number of distinct elements in a data stream. They use a similar hash function $h$, which can be normalized to $[0,1]$ as $h:[m]\rightarrow [0,1]$, but they fix a limit $t$ on the number of values kept from the hash space. The value of $t$ is assumed to be of the order $O\left({\dfrac {1}{\varepsilon ^{2}}}\right)$ (i.e., a smaller approximation parameter $\varepsilon$ requires a larger $t$). The KMV algorithm keeps only the $t$ smallest hash values. After all $m$ values of the stream have arrived, $\upsilon =\mathrm {Max} (\mathrm {KMV} )$, the largest of the $t$ retained hash values, is used to calculate $F'_{0}={\dfrac {t}{\upsilon }}$. That is, in a close-to-uniform hash space, they expect at least $t$ elements to be less than $O\left({\dfrac {t}{F_{0}}}\right)$.

Procedure K-Minimum Value:

    Initialize KMV with the first t hash values
    for a in a_1 to a_m do
        if h(a) < Max(KMV) then
            Remove Max(KMV) from KMV set
            Insert h(a) to KMV
        end if
    end for
    return t/Max(KMV)
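A minimal Python sketch of the K-Minimum Value procedure is given below. The names `unit_hash` and `kmv_estimate` are hypothetical, the SHA-256-based hash is an assumed stand-in for the normalized uniform hash function, and skipping hash values that are already stored is an added detail so that repeated items do not occupy several slots. The $t$ smallest hash values are held in a max-heap (stored negated, since `heapq` is a min-heap).

```python
import hashlib
import heapq


def unit_hash(x):
    """Hypothetical hash normalized to (0, 1], built from 64 bits of SHA-256."""
    digest = hashlib.sha256(str(x).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2.0 ** 64


def kmv_estimate(stream, t):
    """Estimate the number of distinct items from the t smallest hash values."""
    heap = []        # negated hash values, so heap[0] is minus the current maximum
    members = set()  # hash values currently kept, used to skip duplicates
    for a in stream:
        h = unit_hash(a)
        if h in members:
            continue
        if len(heap) < t:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # Replace the current maximum with the smaller hash value.
            evicted = -heapq.heapreplace(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < t:
        return float(len(heap))  # fewer than t distinct hashes seen: count is exact
    return t / -heap[0]
```

With $t=256$, the estimate for a stream of 10,000 distinct items is typically within several percent of the true count, and feeding the same items again leaves the estimate unchanged.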

Complexity analysis of KMV

The KMV algorithm can be implemented in $O\left(\left({\dfrac {1}{\varepsilon ^{2}}}\right)\cdot \log(m)\right)$ bits of memory: each hash value requires $O(\log(m))$ bits, and there are $O\left({\dfrac {1}{\varepsilon ^{2}}}\right)$ hash values. The access time can be reduced by storing the $t$ hash values in a binary tree, which reduces the time complexity to $O\left(\log \left({\dfrac {1}{\varepsilon }}\right)\cdot \log(m)\right)$.

Calculating $F_{k}$

Alon et al.^[3] estimate $F_{k}$ by defining random variables that can be computed within the given space and time. The expected value of these random variables gives the approximate value of $F_{k}$.

Assume that the length $m$ of the sequence is known in advance.

Construct a random variable X as follows:

Select $a_{p}$, a random member of the sequence $A$ with index $p$, where $a_{p}=l\in \{1,2,3,\ldots ,n\}$.
Let $r=|\{q:q\geq p,a_{q}=l\}|$ be the number of occurrences of $l$ among the members of the sequence $A$ from $a_{p}$ onward.
Set the random variable $X=m(r^{k}-(r-1)^{k})$.

Let $S_{1}$ be of the order $O(n^{1-1/k}/\lambda ^{2})$ and $S_{2}$ be of the order $O(\log(1/\varepsilon ))$. The algorithm takes $S_{2}$ random variables $Y_{1},Y_{2},\ldots ,Y_{S_{2}}$ and outputs their median $Y$, where $Y_{i}$ is the average of $X_{ij}$ for $1\leq j\leq S_{1}$.

Now calculate the expectation $E(X)$ of the random variable, where $m_{i}$ denotes the number of occurrences of $i$ in the sequence:

${\begin{array}{lll}E(X)&=&\sum _{i=1}^{n}\sum _{j=1}^{m_{i}}(j^{k}-(j-1)^{k})\\&=&{\frac {m}{m}}[(1^{k}+(2^{k}-1^{k})+\ldots +(m_{1}^{k}-(m_{1}-1)^{k}))\\&&\;+\;(1^{k}+(2^{k}-1^{k})+\ldots +(m_{2}^{k}-(m_{2}-1)^{k}))+\ldots \\&&\;+\;(1^{k}+(2^{k}-1^{k})+\ldots +(m_{n}^{k}-(m_{n}-1)^{k}))]\\&=&\sum _{i=1}^{n}m_{i}^{k}=F_{k}\end{array}}$
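The construction of $X$ and the median-of-averages step can be sketched in Python as below. The sketch assumes the stream length `m` is known in advance, as stated above. The function name, the `seed` parameter and the straightforward loop over all $S_{1}\cdot S_{2}$ variables per item are choices made for illustration; the group sizes `s1` and `s2` are passed in directly rather than derived from $\lambda$ and $\varepsilon$.

```python
import random
from statistics import median


def ams_fk_estimate(stream, m, k, s1, s2, seed=0):
    """One-pass estimate of F_k: median of s2 averages of s1 variables X each."""
    rng = random.Random(seed)
    n_vars = s1 * s2
    positions = [rng.randrange(m) for _ in range(n_vars)]  # random index p for each X
    tracked = [None] * n_vars  # the value l = a_p, set once position p is reached
    counts = [0] * n_vars      # r: occurrences of l from position p onward
    for q, item in enumerate(stream):
        for v in range(n_vars):
            if positions[v] == q:
                tracked[v], counts[v] = item, 1
            elif positions[v] < q and tracked[v] == item:
                counts[v] += 1
    # X = m * (r^k - (r - 1)^k) for each sampled position.
    xs = [m * (r ** k - (r - 1) ** k) for r in counts]
    means = [sum(xs[g * s1:(g + 1) * s1]) / s1 for g in range(s2)]
    return median(means)
```

Two cases make the behaviour easy to check by hand: for $k=1$ every $X$ equals $m=F_{1}$, and for a stream of all-distinct items every $r$ is 1, so every $X$ equals $m=F_{k}$.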

Complexity of $F_{k}$

From the algorithm for calculating $F_{k}$ discussed above, we can see that each random variable $X$ stores the values of $a_{p}$ and $r$. So, to compute $X$ we need to maintain only $\log(n)$ bits for storing $a_{p}$ and $\log(m)$ bits for storing $r$. The total number of random variables $X$ is $S_{1}\cdot S_{2}$.

Hence the total space complexity of the algorithm is of the order of $O\left({\dfrac {k\log \left({\dfrac {1}{\varepsilon }}\right)}{\lambda ^{2}}}n^{1-{\dfrac {1}{k}}}\left(\log n+\log m\right)\right)$.

Simpler approach to calculate $F_{2}$

The previous algorithm calculates $F_{2}$ using on the order of $O({\sqrt {n}}(\log m+\log n))$ memory bits. Alon et al.^[3] simplified this algorithm using four-wise independent random variables with values mapped to $\{-1,1\}$.

This further reduces the space needed to calculate $F_{2}$ to $O\left({\dfrac {\log \left({\dfrac {1}{\varepsilon }}\right)}{\lambda ^{2}}}\left(\log n+\log m\right)\right)$.
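The idea can be sketched in Python as follows: each counter accumulates $Z=\sum _{i}s(i)\,a_{i}$ for a random sign function $s$, and $Z^{2}$ serves as an estimate of $F_{2}$. In this sketch the sign functions come from random degree-3 polynomials over a prime field, with the lowest bit of the result used as the sign, which is only approximately unbiased. That construction, the prime `P`, and all names are assumptions made for illustration. Items are assumed to be non-negative integers smaller than `P`, and updates are given as $(i,c)$ pairs.

```python
import random
from statistics import median

P = 2 ** 61 - 1  # a Mersenne prime used as the field size (assumed choice)


def make_sign_hash(rng):
    """Random degree-3 polynomial over Z_P, mapped to a sign in {-1, +1}."""
    coeffs = [rng.randrange(P) for _ in range(4)]

    def sign(i):
        acc = 0
        for c in coeffs:  # Horner evaluation of the polynomial at i
            acc = (acc * i + c) % P
        return 1 if acc & 1 else -1

    return sign


def f2_estimate(updates, s1, s2, seed=0):
    """Estimate F_2 as the median of s2 averages of s1 squared signed sums."""
    rng = random.Random(seed)
    signs = [make_sign_hash(rng) for _ in range(s1 * s2)]
    z = [0] * (s1 * s2)
    for i, c in updates:
        for v, sign in enumerate(signs):
            z[v] += sign(i) * c  # each counter is a signed running sum
    squares = [value * value for value in z]
    means = [sum(squares[g * s1:(g + 1) * s1]) / s1 for g in range(s2)]
    return median(means)
```

Each counter stores a single number regardless of how many distinct items appear. For a stream containing one item seven times, every counter holds $\pm 7$, so the estimate is exactly 49.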

Heavy hitters

Find the most frequent (popular) elements in a data stream. Some notable algorithms are:

Boyer–Moore majority vote algorithm
Karp-Papadimitriou-Shenker algorithm
Count-Min sketch
Sticky sampling
Lossy counting
Sample and Hold
Multi-stage Bloom filters
Count-sketch
Sketch-guided sampling
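As an example of the first entry in the list, the Boyer–Moore majority vote algorithm can be sketched in Python with a single candidate and a single counter. The function name is hypothetical. The returned candidate is guaranteed to be the majority item only if some item occurs in more than half of the stream; otherwise a second pass is needed to verify it.

```python
def majority_candidate(stream):
    """Return the only possible majority item of the stream (None if it is empty)."""
    candidate, count = None, 0
    for item in stream:
        if count == 0:
            # No active candidate: adopt the current item.
            candidate, count = item, 1
        elif item == candidate:
            count += 1
        else:
            # A different item cancels one vote for the candidate.
            count -= 1
    return candidate
```

For the stream 1, 2, 1, 3, 1, 1 the item 1 occurs four times out of six and is returned as the candidate.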

Event detection

Detecting events in data streams is often done using a heavy hitters algorithm as listed above: the most frequent items and their frequencies are determined using one of these algorithms, then the largest increase over the previous time point is reported as a trend. This approach can be refined by using exponentially weighted moving averages and variance for normalization.^[9]
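The normalization step can be illustrated with a generic exponentially weighted moving average and variance in Python. This is a sketch of the standard incremental update, not the method of the cited paper; the class name, the smoothing factor `alpha` and the small `bias` term that avoids division by zero are assumptions made for this example.

```python
class EwmaStats:
    """Exponentially weighted moving mean and variance of a numeric stream."""

    def __init__(self, alpha):
        self.alpha = alpha  # weight given to the newest observation
        self.mean = None
        self.var = 0.0

    def update(self, x):
        if self.mean is None:
            self.mean = float(x)  # the first observation initializes the mean
            return
        diff = x - self.mean
        incr = self.alpha * diff
        self.mean += incr
        self.var = (1.0 - self.alpha) * (self.var + diff * incr)

    def zscore(self, x, bias=1e-9):
        """Deviation of x from the moving mean in moving standard deviations."""
        return (x - self.mean) / (self.var + bias) ** 0.5
```

A frequency that lies many moving standard deviations above its moving mean would then be a candidate trend, whereas a raw increase of the same size on a noisy item would score lower.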

Counting distinct elements

Counting the number of distinct elements in a stream (sometimes called the $F_{0}$ moment) is another problem that has been well studied. The first algorithm for it was proposed by Flajolet and Martin. In 2010, D. Kane, J. Nelson and D. Woodruff found an asymptotically optimal algorithm for this problem.^[10] It uses $O(\varepsilon ^{-2}+\log d)$ space, with $O(1)$ worst-case update and reporting times, as well as universal hash functions and an $r$-wise independent hash family where $r=\Omega (\log(1/\varepsilon )/\log \log(1/\varepsilon ))$.

Entropy

The (empirical) entropy of a set of frequencies $\mathbf {a}$ is defined as $H(\mathbf {a} )=-\sum _{i=1}^{n}{\frac {a_{i}}{m}}\log {\frac {a_{i}}{m}}$, where $m=\sum _{i=1}^{n}a_{i}$.
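For comparison with streaming estimates, the empirical entropy can be computed exactly from the full frequency table. The Python sketch below uses the conventional negative sign, so that the result is non-negative, and assumes base-2 logarithms, which the definition does not fix; the function name is hypothetical.

```python
import math
from collections import Counter


def empirical_entropy(stream):
    """Exact empirical entropy (in bits) of the item frequencies in the stream."""
    counts = Counter(stream)
    m = sum(counts.values())
    # Each distinct item contributes -(a_i / m) * log2(a_i / m).
    return -sum((c / m) * math.log2(c / m) for c in counts.values())
```

A stream with two equally frequent items has entropy 1 bit, and a stream consisting of a single repeated item has entropy 0.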

Estimation of this quantity in a stream has been done by:

McGregor et al.
Do Ba et al.
Lall et al.
Chakrabarti et al.

Online learning

Main article: Online machine learning

Learn a model (e.g. a classifier) by a single pass over a training set.

Lower bounds

Lower bounds have been computed for many of the data streaming problems that have been studied. By far, the most common technique for computing these lower bounds has been using communication complexity.

Notes

1. Munro & Paterson (1980)
2. Flajolet & Martin (1985)
3. Alon, Matias & Szegedy (1996)
4. Gilbert et al. (2001)
5. Xu (2007)
6. Indyk, Piotr; Woodruff, David (2005-01-01). "Optimal Approximations of the Frequency Moments of Data Streams". Proceedings of the Thirty-seventh Annual ACM Symposium on Theory of Computing. STOC '05. New York, NY, USA: ACM: 202–208. doi:10.1145/1060590.1060621. ISBN 1-58113-960-8.
7. Bar-Yossef, Ziv; Jayram, T. S.; Kumar, Ravi; Sivakumar, D.; Trevisan, Luca (2002-09-13). Rolim, José D. P.; Vadhan, Salil, eds. Counting Distinct Elements in a Data Stream. Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 1–10. ISBN 978-3-540-44147-2.
8. Flajolet, Philippe (1985-03-01). "Approximate counting: A detailed analysis". BIT Numerical Mathematics. 25 (1): 113–134. doi:10.1007/BF01934993. ISSN 0006-3835.
9. Schubert, E.; Weiler, M.; Kriegel, H. P. (2014). SigniTrend: scalable detection of emerging topics in textual streams by hashed significance thresholds. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '14. pp. 871–880. doi:10.1145/2623330.2623740. ISBN 9781450329569.
10. Kane, Nelson & Woodruff (2010)

References

Alon, Noga; Matias, Yossi; Szegedy, Mario (1999), "The space complexity of approximating the frequency moments", Journal of Computer and System Sciences, 58 (1): 137–147, doi:10.1006/jcss.1997.1545, ISSN 0022-0000. First published as Alon, Noga; Matias, Yossi; Szegedy, Mario (1996), "The space complexity of approximating the frequency moments", Proceedings of the 28th ACM Symposium on Theory of Computing (STOC 1996), pp. 20–29, doi:10.1145/237814.237823, ISBN 0-89791-785-5.
Babcock, Brian; Babu, Shivnath; Datar, Mayur; Motwani, Rajeev; Widom, Jennifer (2002), "Models and issues in data stream systems", Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 2002) (PDF), pp. 1–16, doi:10.1145/543613.543615.
Gilbert, A. C.; Kotidis, Y.; Muthukrishnan, S.; Strauss, M. J. (2001), "Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries" (PDF), Proceedings of the International Conference on Very Large Data Bases: 79–88.
Kane, Daniel M.; Nelson, Jelani; Woodruff, David P. (2010), An optimal algorithm for the distinct elements problem, PODS '10, New York, NY, USA: ACM, pp. 41–52, doi:10.1145/1807085.1807094, ISBN 978-1-4503-0033-9.
Karp, R. M.; Papadimitriou, C. H.; Shenker, S. (2003), "A simple algorithm for finding frequent elements in streams and bags", ACM Transactions on Database Systems, 28 (1): 51–55, doi:10.1145/762471.762473.
Lall, Ashwin; Sekar, Vyas; Ogihara, Mitsunori; Xu, Jun; Zhang, Hui (2006), "Data streaming algorithms for estimating entropy of network traffic", Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (ACM SIGMETRICS 2006) (PDF), doi:10.1145/1140277.1140295.
Xu, Jun (Jim) (2007), A Tutorial on Network Data Streaming (PDF).
Heath, D.; Kasif, S.; Kosaraju, R.; Salzberg, S.; Sullivan, G. (1991), "Learning Nested Concepts With Limited Storage", Proceedings of the 12th International Joint Conference on Artificial Intelligence (IJCAI '91), Volume 2, San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., pp. 777–782.

External links

Princeton Lecture Notes
Streaming Algorithms for Geometric Problems, by Piotr Indyk, MIT
Dagstuhl Workshop on Sublinear Algorithms
IIT Kanpur Workshop on Data Streaming
List of open problems in streaming (compiled by Andrew McGregor) from discussion at the IITK Workshop on Algorithms for Data Streams, 2006.
StreamIt - programming language and compilation infrastructure by MIT CSAIL
IBM Spade - Stream Processing Application Declarative Engine
IBM InfoSphere Streams

Tutorials and surveys

Data Stream Algorithms and Applications by S. Muthu Muthukrishnan
Stanford STREAM project survey
Network Applications of Bloom filters, by Broder and Mitzenmacher
Xu's SIGMETRICS 2007 tutorial
Lecture notes from Data Streams course at Barbados in 2009, by Andrew McGregor and S. Muthu Muthukrishnan

This article is issued from Wikipedia - version of the 11/28/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.