Scalable Machine Learning, Lecture 3: Data Streams. Alex Smola, Yahoo! Research and ANU.

  1. Scalable Machine Learning 3. Data Streams Alex Smola Yahoo! Research and ANU http://alex.smola.org/teaching/berkeley2012 Stat 260 SP 12

  2. 3. Data Streams Building realtime *Analytics at home

  3. Data Streams Data & Applications • Moments • Flajolet-Martin counter • Alon-Matias-Szegedy sketch • Heavy hitter detection • Lossy counting • Space saving • Semiring statistics • Bloom filter • CountMin sketch • Realtime analytics • Fault tolerance and scalability • Interpolating sketches

  4. 3.1 Streams

  5. Data Streams • Cannot replay data • Limited memory / computation / realtime analytics • Time series: observe instances (x_t, t); stock symbols, acceleration data, video, server logs, surveillance • Cash register: observe weighted instances x_i, always positive increments; query stream, user activity, network traffic, revenue, clicks • Turnstile: increments and decrements (possibly requiring nonnegativity); caching, windowed statistics
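To make the cash-register and turnstile models concrete, here is a minimal Python sketch of the two update types against an exact dictionary of counts (the function names and the strictness flag are illustrative, not from the slides):

```python
from collections import defaultdict

# Illustrative sketch of the two counter-style update models on an exact
# dictionary of counts (only viable while the number of distinct items is small).
counts = defaultdict(float)

def cash_register_update(x, w=1.0):
    """Cash-register model: only nonnegative increments (clicks, revenue, traffic)."""
    assert w >= 0, "cash register updates must be nonnegative"
    counts[x] += w

def turnstile_update(x, w, strict=True):
    """Turnstile model: increments and decrements; the strict variant forbids negative totals."""
    counts[x] += w
    if strict and counts[x] < 0:
        raise ValueError("strict turnstile model violated")
```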

  6. Website Analytics (NIPS) • Continuous stream of users (tracked with a cookie) • Many sites signed up for the analytics service • Find hot links / frequent users / click probability, right now

  7. Query Stream • Item stream • Find heavy hitters • Detect trends early (e.g. Osama bin Laden killed) • Frequent combinations (cf. frequent items) • Source distribution • In real time

  8. Network traffic analysis • TCP/IP packets • On switch with limited memory footprint • Realtime analytics • Busiest connections • Trends • Protocol-level data • Distributed information gathering

  9. Financial Time Series • real time prediction • missing data • metadata (news, quarterly reports, financial background) • time-stamped data stream • multiple sources • different time resolution

  10. News • Realtime news stream • Multiple sources (Reuters, AP, CNN, ...) • Same story from multiple sources • Stories are related

  11. 3.2 Moments

  12. Warmup • Stream of m items x_i • Want to compute statistics of what we’ve seen • Small cardinality n • Trivial to compute aggregate counts (dictionary lookup) • Memory is O(n) • Computation is O(log n) for storage & lookup • Large cardinality n • Exact storage of counts impossible • Exact test for previous occurrence impossible • Need approximate (dynamic) data structure
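For reference, the trivial exact approach from the small-cardinality case, as a short Python sketch (this is the dictionary-lookup baseline whose O(n) memory breaks down for large item spaces):

```python
from collections import Counter

# Exact aggregate counts via a dictionary: fine while the cardinality n is
# small, but memory grows with the number of distinct keys, which is exactly
# what becomes impossible for large item spaces.
def exact_counts(stream):
    counts = Counter()
    for x in stream:
        counts[x] += 1
    return counts

print(exact_counts(["a", "b", "a", "c", "a"]))  # Counter({'a': 3, 'b': 1, 'c': 1})
```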

  14. Finding the missing item • Sequence of instances [1..N] • One of them is missing • Identify it • Algorithm: compute the sum s := Σ_{i=1}^{N} i; for each observed item decrement s via s ← s − x_i; at the end s is the missing item • We only need the least significant log N bits
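A minimal Python sketch of the sum trick above (plain Python integers are used for clarity; the slide's point is that only the low-order ~log N bits of s actually need to be stored):

```python
# One-pass sum trick: start from sum(1..N), subtract every item seen, and the
# remainder is the missing element. Only a single number of ~log N bits is kept.
def find_missing(stream, N):
    s = N * (N + 1) // 2          # sum of 1..N
    for x in stream:
        s -= x
    return s

print(find_missing([1, 2, 4, 5], N=5))  # -> 3
```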

  17. Finding the missing items • Sequence of instances [1..N] • Up to k of them are missing • Identify them • Algorithm: for p up to k compute the power sums s_p := Σ_{i=1}^{N} i^p; for each observed item decrement all s_p via s_p ← s_p − x_i^p; identify the missing items by solving the resulting polynomial system • We only need the least significant log N bits
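A sketch of the special case k = 2, assuming the items are 1..N: the two residual power sums s_1 = a + b and s_2 = a^2 + b^2 determine the missing items a and b via a quadratic, which is the polynomial system referred to above (helper names are mine):

```python
import math

# k = 2 case: maintain residual power sums s1 = a + b and s2 = a^2 + b^2 for
# the two missing items a and b, then solve t^2 - s1*t + (s1^2 - s2)/2 = 0,
# whose roots are exactly a and b.
def find_two_missing(stream, N):
    s1 = N * (N + 1) // 2                   # sum of 1..N
    s2 = N * (N + 1) * (2 * N + 1) // 6     # sum of squares of 1..N
    for x in stream:
        s1 -= x
        s2 -= x * x
    prod = (s1 * s1 - s2) // 2              # a * b
    disc = math.isqrt(s1 * s1 - 4 * prod)   # |a - b|
    return (s1 - disc) // 2, (s1 + disc) // 2

print(find_two_missing([2, 4, 5], N=5))  # -> (1, 3)
```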

  19. Estimating F_k

  20. Moments • Characterize the skewness of the distribution • Sequence of instances • Instantaneous estimates F_p := Σ_{x ∈ X} n_x^p • Special cases • F_0 is the number of distinct items • F_1 is the number of items (trivial to estimate) • F_2 describes the ‘variance’ (used e.g. for database query plans)
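A small worked example on an illustrative stream (not from the slides):

```latex
% Illustrative stream: a, b, a, c, a, b  with counts  n_a = 3, n_b = 2, n_c = 1
\begin{align*}
F_0 &= \sum_x n_x^0 = 3               && \text{(number of distinct items)} \\
F_1 &= \sum_x n_x^1 = 3 + 2 + 1 = 6   && \text{(number of items)} \\
F_2 &= \sum_x n_x^2 = 9 + 4 + 1 = 14  && \text{(`variance'-like moment)}
\end{align*}
```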

  21. Flajolet-Martin counter • Assume perfect hash functions (simplifies the proof) • Design a hash with Pr(h(x) = j) = 2^{-j} [figure: three example log n-bit strings with hash values 0, 2, 4] • Position of the rightmost 0 (LSB is position 1) • CDF for the maximum over n items: F(j) = (1 − 2^{-j})^n (the CDF of the maximum of n random variables is F^n)

  22. Flajolet-Martin counter [figure: same bit-string example as above] • Intuitively expect that max_{x ∈ X} h(x) ≈ log |X| • Repetitions of the same element do not matter • Need O(log log |X|) bits to store the counter • High-probability bound: Pr(|max_{x ∈ X} h(x) − log |X|| > log c) ≤ 2/c
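A minimal Python sketch of a single Flajolet-Martin counter (the hash choice, the function names, and the omission of the usual bias-correction constant and of averaging over several hash functions are simplifications of mine):

```python
import hashlib

# FM-style distinct-count sketch: hash each item to a uniform bit string and
# record j = position of the rightmost 0 bit (LSB = position 1), which has
# Pr(j = k) = 2^{-k}. The maximum j over the stream is roughly log2 of the
# number of distinct items and needs only O(log log |X|) bits of state.
def rightmost_zero_position(x):
    h = int.from_bytes(hashlib.sha1(repr(x).encode()).digest()[:8], "big")
    j = 1
    while h & 1:          # skip over trailing 1 bits
        h >>= 1
        j += 1
    return j

def fm_estimate(stream):
    max_j = 0
    for x in stream:
        max_j = max(max_j, rightmost_zero_position(x))
    return 2 ** max_j     # crude estimate; a single counter has high variance

print(fm_estimate(range(10000)))  # typically within a small factor of 10000
```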

  23. Proof (for a version with 2-wise independent hash functions see Alon, Matias and Szegedy) • Upper bound is trivial: |X| · 2^{-j} ≤ 1/c ⇔ 2^j ≥ c |X|, so with probability at most 1/c the upper bound is exceeded (using the union bound) • Lower bound: the probability of not exceeding j is bounded by (1 − 2^{-j})^{|X|} ≤ exp(−|X| · 2^{-j}) ≤ e^{-c} whenever 2^j ≤ |X|/c, i.e. with probability at least 1 − e^{-c} the maximum j satisfies 2^j ≥ |X|/c

  24. Variations on the FM counter • Lossy counting • Increment counter j to c with probability p^{-c} for p < 0.5 • Yields an estimate of the log-count (normalization!) • FM instead of bits inside a Bloom filter ... more later • log n rather than log log n array • Set a bit according to the hash [figure: bit array with used and wasted positions] • Count consecutive 1s instead of the largest bit and fill gaps • The log log bounds are tight (see the AMS lower bound)
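The probabilistic-increment bullet appears to describe Morris-style approximate counting; here is a hedged sketch with increment probability 2^{-c} (my choice of base, not necessarily the p used on the slide):

```python
import random

# Morris-style approximate counter: store only c, which behaves like log2 of
# the true count, by incrementing with probability 2^{-c}. The value 2^c - 1
# is an unbiased estimate of the number of events seen.
def morris_count(n_events):
    c = 0
    for _ in range(n_events):
        if random.random() < 2.0 ** (-c):
            c += 1
    return 2 ** c - 1

print(morris_count(100000))  # typically within a small factor of 100000
```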

  25. Computing F_2 • Strategy • Design a random variable with E[X_ij] = F_2 • Take the average over subsets: X̄_i := (1/a) Σ_{j=1}^{a} X_ij • The estimate is the median: X := med[X̄_1, ..., X̄_b] • Random variable: X_ij := (Σ_{x ∈ stream} σ(x, i, j))^2 • σ is a Rademacher hash with equiprobable values {±1} • In expectation all cross terms cancel out, yielding F_2
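A minimal Python sketch of this F_2 estimator, assuming a simple hash-based ±1 sign function in place of a 4-wise independent Rademacher hash (the names and the parameters a, b are illustrative):

```python
import hashlib
import statistics

# AMS-style F_2 estimator: for each (i, j) keep a running signed sum
# Z_ij = sum_x sigma(x, i, j); then X_ij = Z_ij^2 has expectation F_2.
# Combine by averaging over j and taking the median over i, as in the
# average-median theorem.
def sign(x, i, j):
    h = hashlib.sha1(f"{i}|{j}|{x}".encode()).digest()
    return 1 if h[0] & 1 else -1

def ams_f2(stream, a=16, b=7):
    Z = [[0] * a for _ in range(b)]
    for x in stream:
        for i in range(b):
            for j in range(a):
                Z[i][j] += sign(x, i, j)
    means = [sum(z * z for z in row) / a for row in Z]
    return statistics.median(means)

stream = ["a"] * 30 + ["b"] * 20 + ["c"] * 10   # true F_2 = 900 + 400 + 100 = 1400
print(ams_f2(stream))
```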

  26. Average-Median Theorem • Random variables X_ij with mean μ and variance σ^2 • Mean estimate X̄_i := (1/a) Σ_{j=1}^{a} X_ij and overall estimate X := med[X̄_1, ..., X̄_b] • The probability of deviation is bounded by Pr(|X − μ| ≥ ε) ≤ δ for a = 8σ^2 ε^{-2} and b = −(8/3) log δ • Note: Alon, Matias & Szegedy claim b = −2 log δ but the Chernoff bounds don’t work out AFAIK

  27. Proof • Bounding the mean: pick a = 8σ^2 ε^{-2} and apply the Chebyshev bound to see that Pr(|X̄_i − μ| > ε) ≤ 1/8 • Bounding the median • Ensure that for at least half of the X̄_i the deviation is small • The failure probability for each X̄_i is at most 1/8 • Chernoff bound (Mitzenmacher & Upfal, Theorem 4.4): Pr{x ≥ (1 + δ)μ} ≤ e^{−μδ^2/3} • Plug in deviation parameter 3 and μ = b/8; with b = −(8/3) log δ the probability that half of the X̄_i deviate is at most exp(−3b/8) = δ

  28. Computing F_2 • Mean: E[X_ij] = E[(Σ_{x ∈ stream} σ(x,i,j))^2] = E[(Σ_{x ∈ X} n_x σ(x,i,j))^2] = Σ_{x ∈ X} n_x^2 = F_2 • Variance: E[X_ij^2] = E[(Σ_{x ∈ stream} σ(x,i,j))^4] = 3 Σ_{x,x' ∈ X} n_x^2 n_{x'}^2 − 2 Σ_{x ∈ X} n_x^4, hence Var[X_ij] = E[X_ij^2] − (E[X_ij])^2 = 2 Σ_{x,x' ∈ X} n_x^2 n_{x'}^2 − 2 Σ_{x ∈ X} n_x^4 ≤ 2 F_2^2 • Plugging into the Average-Median theorem shows that the algorithm uses O(ε^{-2} log(1/δ) log(|X| n)) bits
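Spelling out the cross-term cancellation used in the mean computation (pairwise independence of the signs already suffices for this step):

```latex
% Cross terms vanish because E[sigma(x,i,j) sigma(x',i,j)] = [x = x'] for a
% Rademacher sign hash, so
\begin{align*}
E\!\left[\Big(\sum_{x \in X} n_x\, \sigma(x,i,j)\Big)^{2}\right]
  = \sum_{x, x' \in X} n_x n_{x'}\, E[\sigma(x,i,j)\, \sigma(x',i,j)]
  = \sum_{x \in X} n_x^{2} = F_2 .
\end{align*}
```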

  29. Computing F_k in general • Random variable with expectation F_k • Pick a uniformly random element in the sequence • Start counting instances of it until the end [figure: example stream “a s r a n d o m a s c a n b e”] • Use the count r_ij in X_ij = m (r_ij^k − (r_ij − 1)^k) • Apply the Average-Median theorem
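A minimal Python sketch of this F_k estimator; the stream is materialized as a list for simplicity, whereas a genuine one-pass version would pick the random position by reservoir sampling (the names and the repetition counts a, b are illustrative):

```python
import random
import statistics
from collections import Counter

# AMS-style F_k estimator: pick a uniformly random position in the stream,
# count how often that element reappears from there to the end (r), and use
# X = m * (r^k - (r-1)^k), which has expectation F_k. Independent repetitions
# are combined by a median of means, as in the average-median theorem.
def fk_single_estimate(stream, k):
    m = len(stream)
    pos = random.randrange(m)
    x = stream[pos]
    r = sum(1 for y in stream[pos:] if y == x)
    return m * (r ** k - (r - 1) ** k)

def fk_estimate(stream, k, a=50, b=7):
    means = [sum(fk_single_estimate(stream, k) for _ in range(a)) / a
             for _ in range(b)]
    return statistics.median(means)

stream = list("asrandomascanbe")                 # the slide's example stream
exact = sum(c ** k for c, k in ((n, 3) for n in Counter(stream).values()))
print(fk_estimate(stream, k=3), "vs exact F_3 =", exact)
```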

  30. More F_k • Mean via telescoping sum: E[X_ij] = [1^k + (2^k − 1^k) + ... + (n_1^k − (n_1 − 1)^k)] + ... + [1^k + ... + (n_{|X|}^k − (n_{|X|} − 1)^k)] = Σ_{x ∈ X} n_x^k = F_k (no better than brute force for large k) • Variance by brute-force algebra: Var[X_ij] ≤ E[X_ij^2] ≤ k |X|^{1 − 1/k} F_k^2 • We need at most O(k |X|^{1 − 1/k} ε^{-2} log(1/δ) (log m + log |X|)) bits to estimate F_k; the rate is tight

  32. Uniform sampling
