CAI: Cerca i Anàlisi d’Informació (Search and Information Analysis), Degree in Data Science and Engineering (Grau en Ciència i Enginyeria de Dades), UPC
- 7. Streaming
January 5, 2020
Slides by Marta Arias, José Luis Balcázar, Ramon Ferrer-i-Cancho, Ricard Gavaldà, Department of Computer Science, UPC
1 / 73
2 / 73
◮ Telcos - phone calls ◮ Satellite, radar, sensor data ◮ Computer systems and networks
◮ Search logs, access logs ◮ RSS feeds, social network activity ◮ Websites, clickstreams, query streams ◮ E-commerce, credit card sales ◮ . . .
3 / 73
◮ Is this “customer” a robot? ◮ Does this customer want to buy? ◮ Is the customer lost? Is s/he finding what s/he wants? ◮ What products should we recommend to this user? ◮ What ads should we show to this user? ◮ Should we get more machines from the cloud to handle the load?
4 / 73
◮ What are the top queries right now? ◮ Which terms are gaining popularity now? ◮ What ads should we show for this query and user?
5 / 73
◮ Each call about 1000 bytes per switch ◮ I.e., about 1 TB/month; must keep for billing ◮ Is this call fraudulent? ◮ Why do we get so many call drops in area X? ◮ Should we reroute differently tomorrow? ◮ Is this customer thinking of leaving us? ◮ How to cross-sell / up-sell this customer?
6 / 73
◮ Detect abusive users ◮ Detect anomalous traffic patterns ◮ . . . DDOS attacks, intrusions, etc.
7 / 73
◮ Social networks: Planet-scale streams ◮ Smart cities. Smart vehicles ◮ Internet of Things ◮ (more phones connected to devices than used by humans) ◮ Open data; governmental and scientific ◮ We generate far more data than we can store
8 / 73
◮ Data arrives as sequence of items ◮ At high speed ◮ Forever ◮ Can’t store them all ◮ Can’t go back; or too slow ◮ Evolving, non-stationary reality
9 / 73
10 / 73
◮ Approximate answers are often OK ◮ Specifically, in learning and mining contexts ◮ Often computable with surprisingly low memory, one pass
11 / 73
◮ Algorithms use a source of independent random bits ◮ So different runs give different outputs ◮ But “most runs” are “approximately correct”
12 / 73
◮ For true value A and output Â:
◮ |Â − A| ≤ ε (absolute approximation)
◮ |Â − A| ≤ ε · A (relative approximation)
13 / 73
14 / 73
◮ Keeping a uniform sample ◮ Counting total elements ◮ Counting distinct elements ◮ Counting frequent elements - heavy hitters ◮ Counting in a sliding window
15 / 73
16 / 73
17 / 73
◮ Add the first k stream elements to S ◮ Choose to keep t-th item with probability k/t ◮ If chosen, replace one element from S at random
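The three steps above can be sketched in a few lines of Python (a minimal version of Vitter's Algorithm R; the function name is ours):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k elements from a stream of unknown length."""
    sample = []
    for t, item in enumerate(stream, start=1):
        if t <= k:
            sample.append(item)               # always keep the first k elements
        elif rng.random() < k / t:            # keep the t-th element with prob. k/t
            sample[rng.randrange(k)] = item   # it replaces a random element of S
    return sample
```

At every time t, each of the first t elements is in the sample with probability exactly k/t, which is what "uniform sample" means here.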
18 / 73
19 / 73
21 / 73
22 / 73
23 / 73
24 / 73
From High Performance Python, M. Gorelick & I. Oswald. O’Reilly 2014 25 / 73
26 / 73
◮ Run r parallel, independent copies of the algorithm
◮ On Query, average their estimates
◮ E[Query] ≃ t, σ ≃ t/√(2r)
◮ Space r log log t
◮ Time per item multiplied by r
27 / 73
◮ Places t in the series 1, b, b², . . . , bⁱ, . . . (“resolution” b)
◮ E[b^c] ≃ t, σ ≃ t · √((b − 1)/2)
◮ Space log log t − log log b
◮ For b = 1.08, 3 extra bits, σ ≃ 0.2 · t
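A minimal sketch of the base-b counter (our naming; the estimate uses E[b^c] = 1 + t(b − 1), which follows by induction on t):

```python
import random

def morris_count(t, b=1.08, rng=random):
    """Count t events keeping only the exponent c: O(log log t) bits of state."""
    c = 0
    for _ in range(t):
        if rng.random() < b ** (-c):     # increment c with probability b^(-c)
            c += 1
    return (b ** c - 1) / (b - 1)        # unbiased estimate: E[b^c] = 1 + t(b - 1)
```

A single run has standard deviation about 0.2·t for b = 1.08; averaging r independent copies divides it by √r, as the previous slide describes.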
28 / 73
29 / 73
◮ I’m a web searcher. How many different queries did I get?
◮ I’m a router. How many distinct (sourceIP, destinationIP) pairs have I seen?
◮ item space: potentially 2^128 in IPv6
◮ I’m a text message service. How many distinct messages have I forwarded?
◮ item space: essentially infinite
◮ I’m a streaming classifier builder. How many distinct values does this attribute take?
30 / 73
◮ Item space I, cardinality n, identified with the range [n]
◮ f_{i,t} = number of occurrences of i ∈ I among the first t stream elements
◮ d_t = number of i’s for which f_{i,t} > 0
◮ We often omit the subindex t
31 / 73
◮ Bloom filters: O(d) bits ◮ Cohen’s filter: O(log d) bits ◮ HyperLogLog: O(log log d) bits
32 / 73
33 / 73
◮ let b be the position of the leftmost 1 bit of f(x) ◮ if (b > p) p ← b
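The single-register update above can be sketched as follows (a hypothetical illustration: blake2b stands in for the random hash f, and all names are ours):

```python
import hashlib

NBITS = 32

def f(x):
    """Stand-in for the random hash f: a 32-bit digest of x (blake2b here)."""
    digest = hashlib.blake2b(str(x).encode(), digest_size=4).digest()
    return int.from_bytes(digest, "big")

def leftmost_one(v):
    """1-based position, from the left, of the leftmost 1 bit of a 32-bit value."""
    return NBITS - v.bit_length() + 1 if v else NBITS

def probabilistic_count(stream):
    p = 0
    for x in stream:
        b = leftmost_one(f(x))
        if b > p:
            p = b
    return 2 ** p   # within a constant factor of d, but with high variance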
34 / 73
◮ Problem 1: runtime multiplied by r
◮ Problem 2: independent runs require r independent hash functions
◮ And we don’t know how to generate several independent truly random hash functions cheaply
35 / 73
◮ Divide stream into r = O(ε⁻²) substreams
◮ Use first bits of f(i) to decide the substream for i
◮ Track p separately for each substream
◮ Same f can be used for all copies
◮ One sketch update per item
◮ Space O(ε⁻² log log(# distinct))
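A sketch of stochastic averaging in the LogLog style (all names are ours; blake2b stands in for the shared hash f, and 0.39701 is the asymptotic LogLog bias-correction constant):

```python
import hashlib

LOG2M = 6                  # m = 64 substreams, i.e. r = 64 parallel registers
M = 1 << LOG2M
REST = 64 - LOG2M          # bits left after choosing the substream

def h64(x):
    """One shared 64-bit hash (blake2b as a stand-in for a random function)."""
    return int.from_bytes(
        hashlib.blake2b(str(x).encode(), digest_size=8).digest(), "big")

def distinct_estimate(stream):
    p = [0] * M
    for x in stream:
        v = h64(x)
        j = v >> REST                    # first LOG2M bits pick the substream
        rest = v & ((1 << REST) - 1)     # remaining bits feed that copy
        b = REST - rest.bit_length() + 1 if rest else REST
        if b > p[j]:
            p[j] = b
    # LogLog-style combination: geometric mean of 2**p[j], scaled by m
    return 0.39701 * M * 2 ** (sum(p) / M)
```

Note that one hash evaluation updates exactly one register, so the per-item cost does not grow with the number of copies.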
36 / 73
◮ Original [Flajolet-Martin 85]: geometric average of the registers
◮ SuperLogLog [Durand+03]: remove the top 30% of registers, then average
◮ HyperLogLog [Flajolet+07]: harmonic average of the registers
37 / 73
◮ Graph statistics (HyperANF for neighborhood profiles)
◮ Computing temporal trajectories in patient databases
◮ . . .
◮ In general: when you want to store cardinalities of sets & compute set unions cheaply
38 / 73
39 / 73
◮ Heavy hitters: Find all elements with frequency > θt ◮ Top-k: Find the k most frequent elements
40 / 73
◮ Keep a sample S (reservoir sampling) of size k ◮ Find the heavy hitters in the sample ◮ Claim those are also the heavy hitters in the stream
41 / 73
42 / 73
43 / 73
◮ At all times t, Σ_x count_t[x] = t.
◮ (1) is then clear.
◮ (2) and (3) are proved by simultaneous induction on t.
44 / 73
◮ We omit discussion of efficient implementation
◮ Appropriate for very skewed distributions
◮ Very frequent elements get large counters; infrequent elements share the small ones
◮ → good approximation of the frequent elements’ frequencies
◮ The paper contains a space analysis for power-law (Zipf) distributions
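The SpaceSaving update rule can be sketched as follows (function name ours); with m counters, every counter overestimates the true frequency by at most t/m:

```python
def space_saving(stream, m):
    """SpaceSaving with m counters: count[x] overestimates f_x by at most t/m."""
    count = {}
    for x in stream:
        if x in count:
            count[x] += 1
        elif len(count) < m:
            count[x] = 1
        else:
            # evict the element with the minimum counter;
            # the newcomer inherits that counter plus one
            victim = min(count, key=count.get)
            count[x] = count.pop(victim) + 1
    return count
```

The key invariant is that the counters always sum to t, the number of items seen, which is what the proof on the previous slide rests on.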
45 / 73
◮ Provides an approximation f′_x of f_x, for every x
◮ Can be used (less directly) to find θ-heavy hitters
◮ Uses memory O(1/θ)
◮ It is randomized: uses hash functions instead of per-element counters
◮ Supports additions and deletions
◮ Can be used as the basis for several other queries
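A minimal CM-sketch, assuming salted blake2b as the d row hash functions (class and method names are ours, not from the paper):

```python
import hashlib

class CountMin:
    """Count-Min sketch: d independent rows of w counters each."""

    def __init__(self, w, d):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _cells(self, x):
        for row in range(self.d):
            h = hashlib.blake2b(str(x).encode(), digest_size=8,
                                salt=str(row).encode()).digest()
            yield row, int.from_bytes(h, "big") % self.w

    def add(self, x, c=1):
        for row, col in self._cells(x):   # c may be negative: deletions work too
            self.table[row][col] += c

    def query(self, x):
        # every cell overcounts f_x; the minimum is the least-contaminated estimate
        return min(self.table[row][col] for row, col in self._cells(x))
```

Each cell's count is f_x plus whatever collided into it, so f′_x = query(x) never underestimates f_x; taking the minimum over d rows makes a large overestimate unlikely.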
46 / 73
◮ Only last n items matter ◮ Clear way to bound memory ◮ Natural in applications: emphasizes most recent data ◮ Data that is too old does not affect our decisions
◮ Study network packets in the last day ◮ Detect top-10 queries in search engine in last month ◮ Analyze phone calls in last hours
47 / 73
◮ Want to maintain mean, variance, histograms, frequency counts
◮ SQL on streams. Extension of relational algebra ◮ Want quick answers to queries at all times
48 / 73
◮ Keep the window explicitly
◮ At each time t, add the new bit b at the head and remove the oldest bit b′ from the tail
◮ Add b to, and subtract b′ from, the count
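The explicit-window scheme above, sketched with a deque (names ours); it is exact but stores all n bits:

```python
from collections import deque

def make_window_counter(n):
    """Exact count of 1s among the last n bits, storing the whole window."""
    window = deque()
    count = 0
    def update(b):
        nonlocal count
        window.append(b)                # new bit b enters at the head
        count += b
        if len(window) > n:
            count -= window.popleft()   # oldest bit b' leaves the tail
        return count
    return update
```

The O(n) bits of memory this needs is exactly what the exponential histograms of the following slides reduce to O((log n)/ε), at the price of an approximate answer.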
49 / 73
◮ n = 10^6; ε = 0.1 → 200 counters, 4000 bits
50 / 73
◮ Each bit has a timestamp: the time at which it arrived
◮ At time t, bits with timestamp ≤ t − n are expired
◮ We have up to k buckets of each capacity 1, 2, 4, 8, . . .
◮ Each bucket of capacity 2^i represents 2^i 1s in a subwindow
◮ But the only information it stores about them is the timestamp of the most recent of those 1s
◮ Larger buckets forget more and more information
51 / 73
◮ When all the bits in a bucket are guaranteed to be expired, the bucket is dropped
◮ All bits in all buckets except the last are non-expired
◮ An unknown number of bits in the last (oldest) bucket may be expired
◮ If we have T bits in total among all buckets, and L bits in the last bucket, the true count lies between T − L + 1 and T
52 / 73
53 / 73
◮ If b is a 0, ignore it. Otherwise, if it’s a 1:
◮ Add a bucket with 1 bit and current timestamp t to the front
◮ for i = 0, 1, . . . : if there are now k + 1 buckets of capacity 2^i, merge the two oldest of them into one bucket of capacity 2^{i+1}
◮ If the timestamp of the oldest bucket is ≤ t − n, drop it (it is fully expired)
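The bucket maintenance above can be sketched as follows (names ours; buckets are kept newest-first as (timestamp, capacity) pairs):

```python
def dgim_update(buckets, t, bit, n, k=2):
    """One update step: drop expired buckets, insert the new 1, cascade merges."""
    while buckets and buckets[-1][0] <= t - n:   # drop fully expired buckets
        buckets.pop()
    if bit == 0:
        return
    buckets.insert(0, (t, 1))                    # new capacity-1 bucket at the front
    cap = 1
    while True:                                  # cascade merges up the capacities
        idx = [i for i, (_, c) in enumerate(buckets) if c == cap]
        if len(idx) <= k:
            break
        i, j = idx[-2], idx[-1]                  # the two oldest buckets of this size
        buckets[i] = (buckets[i][0], 2 * cap)    # merged bucket keeps newer timestamp
        del buckets[j]
        cap *= 2

def dgim_estimate(buckets):
    """Total capacity minus half of the oldest (partially expired) bucket."""
    if not buckets:
        return 0
    return sum(c for _, c in buckets) - buckets[-1][1] // 2
```

Merging two capacity-2^i buckets loses no count (their 2^{i+1} ones are still represented); only the older timestamp is forgotten, which is what bounds the error analyzed on the next slide.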
54 / 73
◮ Let 2^C be the capacity of the largest bucket
◮ In the worst case, there is only 1 bucket of this capacity
◮ For each smaller capacity, there are at least k − 1 buckets
◮ So they contain at least (k − 1) · (2^{C−1} + · · · + 1) = (k − 1)(2^C − 1) 1s
◮ Because the query returns total − 2^C/2 as the number of 1s, the absolute error is at most 2^C/2 = 2^{C−1}
◮ The relative error is then at most 2^{C−1} / ((k − 1)(2^C − 1)) ≃ 1/(2(k − 1))
◮ This is an ε-approximation if we take ε ≃ 1/(2k)
55 / 73
◮ Largest bucket needed: k · Σ_{i=0}^{C} 2^i ≃ n → C ≃ log(n/k)
◮ Total number of buckets: k · (C + 1) ≃ k log(n/k)
◮ Each bucket contains a timestamp only (its capacity is implicit from its position)
◮ Timestamps are in t − n . . . t: recycle timestamps mod n
◮ Memory is O(k log(n/k)) = O((log n)/ε) integers
◮ Multiply this by O(log n) to get bits: each integer needs O(log n) bits
56 / 73
◮ Variance ◮ Distinct elements (using Flajolet-Martin) ◮ Max, min ◮ Histograms ◮ Hash tables ◮ Frequency moments
57 / 73
◮ Many sources generating streams concurrently
◮ No synchrony assumption
◮ Want to compute global statistics
◮ Streams can send short summaries to a central site
59 / 73
◮ given two sketches S1 and S2 generated by the algorithm on two streams,
◮ one can compute a sketch S that answers queries correctly for the union (concatenation) of the two streams
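For instance, two Count-Min tables built with identical dimensions and hash functions merge by cellwise addition (a hypothetical helper; the name is ours):

```python
def merge_cm(t1, t2):
    """Merge two Count-Min tables built with the same w, d, and hash functions.

    Cellwise addition yields exactly the table a single sketch would have
    produced after reading the concatenation of the two streams."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(t1, t2)]
```

This works because every update only ever adds into cells determined by the (shared) hash functions, so addition commutes with sketching.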
60 / 73
◮ Morris’ counter ◮ Bloom filters, Cohen, Flajolet-Martin, HyperLogLog ◮ SpaceSaving ◮ CM-sketch ◮ Exponential Histograms (though merging them is order-dependent)
61 / 73
◮ Survey by Liberty and Nelson: http://www.cs.yale.edu/homes/
◮ J. Ullman and A. Rajaraman, Mining of Massive Datasets, Chapter 3 -
◮ A very general bibliography by K. Tufte: http:
◮ Book by G. Cormode, M. Garofalakis, P
◮ Survey by G. Cormode:
◮ Chapter 4 of https://mitpress.mit.edu/books/
62 / 73
◮ The original Morris77 paper:
◮ An analysis of Morris’ counter (math intensive): http://algo.
◮ The application of Morris’ counters to counting n-grams, by Van
63 / 73
◮ G. Lugosi: http://www.econ.upf.edu/~lugosi/anu.pdf ◮ A. Sinclair: http:
◮ C. Shalizi list of references (much beyond the scope of this
64 / 73
◮ Good general survey of distinct element counting up to 2008: Ahmed
◮ Also general discussion on distinct element counting:
◮ Presentation including some sketches I didn’t mention: http://www.
◮ Bloom filter. K.Y. Whang, B. Vander-Zanden, H.M. Taylor, A Linear-time
◮ Cohen’s log(n) solution: Edith Cohen, Size-Estimation Framework with
65 / 73
◮ The Flajolet-Martin probabilistic counter. Philippe Flajolet, G. Nigel
◮ SuperLogLog counter (and insight on FM probabilistic counter) Durand,
◮ The HyperLogLog paper: Flajolet, P
◮ Flajolet’s contributions beautifully explained:
66 / 73
◮ http://en.wikipedia.org/wiki/HyperLogLog ◮ http://research.neustar.biz/2012/10/25/
◮ A live demo of hyperloglog at the web above:
◮ http://www.slideshare.net/sunnyujjawal/
◮ http://stackoverflow.com/questions/12327004/
◮ Important optimizations that I’d like to try:
67 / 73
◮ J. Vitter. Random Sampling with a reservoir. ACM Trans. on
◮ Good survey of heavy hitter algorithms. Radu Berinde, Graham
◮ Also very good survey: Graham Cormode, Marios Hadjieleftheriou.
◮ Richard M. Karp, Scott Shenker, Christos H. Papadimitriou. A Simple
◮ The Space-Saving sketch paper. Ahmed Metwally, Divyakant Agrawal,
◮ M. Charikar, K. Chen and M. Farach-Colton. "Finding Frequent Items in
68 / 73
◮ The CM-Sketch paper. Graham Cormode and S.
◮ On Frugal Streaming, a neat sketch for estimating quantiles that
◮ http://en.wikipedia.org/wiki/Count-min_sketch ◮ https://sites.google.com/site/countminsketch/ ◮ https://tech.shareaholic.com/2012/12/03/
69 / 73
◮ Mayur Datar, Aristides Gionis, Piotr Indyk, Rajeev Motwani: Maintaining
◮ Mayur Datar, Rajeev Motwani: The Sliding-Window Computation Model
◮ Discussions on mergeability are a bit all over. This is sort of an
70 / 73
◮ Noga Alon, Yossi Matias, Mario Szegedy: The Space Complexity of Approximating the Frequency Moments. J. Computer and System Sciences 58(1): 137-147 (1999). Conference version (STOC) 1996
◮ Paolo Boldi, Marco Rosa, Sebastiano Vigna: HyperANF: Approximating the Neighbourhood Function of Very Large Graphs on a Budget. WWW, 2011
◮ An application of the above to computing the diameter of the Facebook graph: Lars Backstrom, Paolo Boldi, Marco Rosa, Johan Ugander, Sebastiano Vigna: Four Degrees of Separation. ACM Web Science 2012
◮ A survey on streaming graph algorithms: http://people.cs.umass.edu/~mcgregor/papers/13-graphsurvey.pdf
◮ Computing SVD on streams, this will be important in streaming ML: Mina Ghashami, Edo Liberty, Jeff M. Phillips, David P. Woodruff: Frequent Directions: Simple and Deterministic Matrix Sketching. http://arxiv.org/abs/1501.01711
◮ This will also be important in streaming ML: Christos Boutsidis, Dan Garber, Zohar Karnin, Edo Liberty: Online Principal Component Analysis, SODA 2015. http://www.cs.yale.edu/homes/el327/papers/opca.pdf
71 / 73
◮ The MassDAL Code Bank. http:
◮ StreamLib: https://github.com/addthis/stream-lib. Check
◮ Hokusai: https://github.com/dgryski/hokusai. I have not
◮ Webgraph. Analysis of large graphs, contains the HyperANF and
72 / 73
◮ c++: https://github.com/hideo55/cpp-HyperLogLog/blob/
◮ Java: https://github.com/addthis/stream-lib/tree/
◮ Python: https://pypi.python.org/pypi/hyperloglog/0.0.8 ◮ Ruby: https://rubygems.org/gems/hyperloglog ◮ Perl: http:
◮ JavaScript: http://cnpmjs.org/package/hyperloglog ◮ node.js: https://www.npmjs.org/package/streamcount ◮ https://github.com/eclesh/hyperloglog/blob/master/
73 / 73