Big-Data Algorithms: Overview


  1. Big-Data Algorithms: Overview. Reference: http://www.sketchingbigdata.org/fall17/lec/lec1.pdf

  2. What’s the problem here?
     - So far, linear (i.e., linear-cost) algorithms have been the “gold standard”.
     - What if linear algorithms aren’t good enough? Example: searching the web for pages of interest.

  3. Topics of Interest
     - Sketching: Compression of a data set that allows queries.
       - Compression $C(x)$ of some data set $x$ that allows us to query $f(x)$.
       - May want to compute $f(x, y)$ from $C(x)$ and $C(y)$.
       - May want composable compression: if $x = x_1 x_2 \dots x_n$, would like to compute $C(x_1 x_2 \dots x_n x_{n+1}) = C(x \circ x_{n+1})$, where $\circ$ denotes concatenation, using just $C(x)$ and $x_{n+1}$.
     - Streaming: May not be able to store a huge dataset. Need to process a stream of data, coming in one chunk at a time, on the fly, and must answer queries with sublinear memory.
     - Dimensionality reduction: For example, spam filtering with the bag-of-words model: let $d$ be a dictionary of words, and represent an email by a vector $v$ where $v_i$ is the number of times $d_i$ appears in the message; then $\dim v = |d|$. (A small sketch follows this list.)
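As a concrete illustration of the bag-of-words representation, here is a minimal Python sketch; the toy dictionary and example message are invented for the example, not taken from the lecture.

```python
# Bag-of-words: represent a message as a vector v with
# v[i] = number of times dictionary word d[i] appears.
from collections import Counter

dictionary = ["cheap", "meeting", "viagra", "deadline", "free"]  # toy d

def bag_of_words(message):
    counts = Counter(message.lower().split())
    return [counts[word] for word in dictionary]

v = bag_of_words("Free cheap pills, free shipping")
print(v)                    # [1, 0, 0, 0, 2]
print(len(v))               # dim v = |d| = 5
```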

  4. Large-scale matrix computation, such as least squares regression: Suppose we want to learn $f : \mathbb{R}^n \to \mathbb{R}$, where $f = \langle b, \cdot \rangle$ for some $b \in \mathbb{R}^n$, with
     $$\langle u, v \rangle = \sum_{i=1}^{n} u_i v_i \quad \forall u, v \in \mathbb{R}^n.$$
     Collect data $\{(x_i \in \mathbb{R}^n,\, y_i \in \mathbb{R}) : 1 \le i \le m\}$. Want to compute the $b$ minimizing
     $$\|Xb - y\|_2 = \Big( \sum_{i=1}^{m} \big( y_i - \langle b, x_i \rangle \big)^2 \Big)^{1/2},$$
     where $X \in \mathbb{R}^{m \times n}$ has rows $x_1^T, \dots, x_m^T$ and $\|\cdot\|_2 = \sqrt{\langle \cdot, \cdot \rangle}$ is the $\ell_2$-norm. Also, principal component analysis, given by the singular value decomposition of a matrix: which features are most important?
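A minimal numpy sketch of this regression setup; the synthetic data, noise level, and dimensions are illustrative assumptions, and `np.linalg.lstsq` performs the minimization directly.

```python
# Least squares: find b minimizing ||Xb - y||_2 (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 5                        # m data points, n features
b_true = rng.normal(size=n)           # unknown b defining f = <b, .>
X = rng.normal(size=(m, n))           # rows are the data points x_i^T
y = X @ b_true + 0.01 * rng.normal(size=m)  # noisy observations

b_hat, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(np.linalg.norm(b_hat - b_true))  # small: b recovered well
```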

  5. Approximate Counting. Problem: Monitor a sequence of events, and allow an approximate count of the number of events so far at any time.
     Create a data structure maintaining a single integer $n$ (initialized to zero) and supporting the operations
     - init(): set $n \leftarrow 0$.
     - update(): increment $n$.
     - query(): output (an estimate of) $n$.
     Why approximation? If we want the exact value, we can store $n$ via a counter, a sequence of $\lceil \log n \rceil$ bits (“log” means “$\log_2$”). We can’t do better: if we use $f(n)$ bits to store $n$, then there are $2^{f(n)}$ configurations, and to store the exact value of every integer up to $n$ we must have
     $$2^{f(n)} \ge n \implies f(n) \ge \log n \implies f(n) \ge \lceil \log n \rceil,$$
     since $f(n) \in \mathbb{Z}$. (A numeric illustration follows.)
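A quick numeric illustration of this lower bound, computed with Python's standard math module: the exact counter's $\lceil \log_2 n \rceil$ bits grow logarithmically in $n$.

```python
# Bits needed by an exact counter: ceil(log2 n), logarithmic in n.
import math

for n in [10, 1000, 10**6, 10**9]:
    print(f"n = {n:>10}: ceil(log2 n) = {math.ceil(math.log2(n))} bits")
```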

  6. If we want a sublinear-space algorithm, we need an estimate $\tilde{n}$ of $n$. Want to know that for some $\varepsilon, \delta \in (0, 1)$ we have
     $$P\big( |\tilde{n} - n| > \varepsilon n \big) < \delta.$$
     Equivalently: $P\big( |\tilde{n} - n| \le \varepsilon n \big) \ge 1 - \delta$.

  7. Morris’ algorithm: Uses an integer counter $X$, with data structure operations
     - init(): sets $X \leftarrow 0$
     - update(): increments $X$ with probability $2^{-X}$
     - query(): outputs $\tilde{n} = 2^X - 1$
     Intuitively, $X$ attempts to store a value approximately $\log n$. (A sketch of the counter follows.) How good is this? Not so great; we’ll see that
     $$P\big( |\tilde{n} - n| > \varepsilon n \big) < \frac{1}{2\varepsilon^2}.$$
     Since $\varepsilon < 1$, the right-hand side exceeds $\frac{1}{2}$, so the estimator may be badly wrong more than half the time!
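A minimal Python sketch of Morris’ counter as specified above; the class name and the driver loop at the bottom are illustrative.

```python
# Morris' approximate counter: stores only X, roughly log2(n),
# so about O(log log n) bits, and estimates n as 2^X - 1.
import random

class MorrisCounter:
    def __init__(self):                      # init(): X <- 0
        self.X = 0

    def update(self):                        # increment X w.p. 2^{-X}
        if random.random() < 2.0 ** (-self.X):
            self.X += 1

    def query(self):                         # estimate of n
        return 2.0 ** self.X - 1

c = MorrisCounter()
for _ in range(100_000):
    c.update()
print(c.query())   # roughly 100000, but with high variance
```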

  8. Improvement Morris+: Create $s$ independent copies of Morris, and average their outputs. Calling these estimators $\tilde{n}_1, \dots, \tilde{n}_s$, the output is
     $$\tilde{n} = \frac{1}{s} \sum_{i=1}^{s} \tilde{n}_i.$$
     Then
     $$P\big( |\tilde{n} - n| > \varepsilon n \big) < \frac{1}{2 s \varepsilon^2},$$
     so $P\big( |\tilde{n} - n| > \varepsilon n \big) < \delta$ for $s > \frac{1}{2 \varepsilon^2 \delta} = \Theta(1/\delta)$. Better! (A sketch follows.)
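A minimal sketch of Morris+, reusing the `MorrisCounter` class from the previous sketch; the parameter value in the driver is illustrative.

```python
# Morris+: average s independent Morris counters to cut the variance,
# driving the failure probability below delta for s > 1/(2 eps^2 delta).
# Assumes the MorrisCounter class sketched above is in scope.

class MorrisPlus:
    def __init__(self, s):
        self.copies = [MorrisCounter() for _ in range(s)]

    def update(self):
        for c in self.copies:
            c.update()

    def query(self):
        return sum(c.query() for c in self.copies) / len(self.copies)

mp = MorrisPlus(s=50)
for _ in range(100_000):
    mp.update()
print(mp.query())  # much closer to 100000 than a single copy
```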

  9. Improvement Morris++: Reduces the dependence of the number of copies on the failure probability from $\Theta(1/\delta)$ to $\Theta(\log(1/\delta))$. Run $t$ instances of Morris+, each with failure probability $\frac{1}{3}$, so $s = \Theta(1/\varepsilon^2)$ for each instance. Now output the median estimate of these $t$ Morris+ instances. Calling this output $\tilde{n}$, it turns out that
     $$P\big( |\tilde{n} - n| > \varepsilon n \big) < \delta \quad \text{for } t = \Theta(\log(1/\delta)).$$
     (A sketch follows.)
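A minimal sketch of Morris++, reusing `MorrisPlus` from above; the choices $t = 10$ and $s = 30$ are illustrative, not tuned constants.

```python
# Morris++: the median of t independent Morris+ estimates; a Chernoff
# bound shows t = Theta(log(1/delta)) suffices. Assumes MorrisPlus
# from the previous sketch is in scope.
from statistics import median

class MorrisPlusPlus:
    def __init__(self, t, s):
        self.instances = [MorrisPlus(s) for _ in range(t)]

    def update(self):
        for inst in self.instances:
            inst.update()

    def query(self):
        return median(inst.query() for inst in self.instances)

mpp = MorrisPlusPlus(t=10, s=30)
for _ in range(100_000):
    mpp.update()
print(mpp.query())
```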

  10. Probability Review. Let $X$ be a random variable taking values in $S \subseteq \mathbb{R}$. The expected value of $X$ is
      $$E X = \sum_{j \in S} j \cdot P(X = j).$$
      The variance of $X$ is $\mathrm{Var}[X] = E\big[ (X - E X)^2 \big]$. Linearity of expected value: let $X$ and $Y$ be random variables; then $E(aX + bY) = a \, E X + b \, E Y$ for all $a, b \in \mathbb{R}$. Markov’s inequality: if $X$ is a nonnegative random variable, then
      $$P(X > \lambda) < \frac{E X}{\lambda} \quad \forall \lambda > 0.$$

  11. Chebyshev’s inequality: Let $X$ be a random variable. Then
      $$P\big( |X - E X| > \lambda \big) < \frac{E\big[ (X - E X)^2 \big]}{\lambda^2} = \frac{\mathrm{Var}[X]}{\lambda^2} \quad \forall \lambda > 0.$$
      More generally, if $p \ge 1$, then
      $$P\big( |X - E X| > \lambda \big) < \frac{E\,|X - E X|^p}{\lambda^p} \quad \forall \lambda > 0.$$
      Chernoff’s inequality: Suppose $X_1, X_2, \dots, X_n$ are independent random variables with $X_i \in [0, 1]$. Let $X = \sum_{i=1}^{n} X_i$ and $\mu = E X$. Then
      $$P\big( |X - E X| > \varepsilon \, E X \big) \le 2 \cdot e^{-\varepsilon^2 \mu / 3} \quad \forall \varepsilon \in (0, 1).$$

  12. Analysis of Morris’ algorithm. Let $X_n$ denote the counter $X$ after $n$ updates. Claim: $E\, 2^{X_n} = n + 1$ for all $n \in \mathbb{N}_0$. Proof of claim: By induction, the base case $n = 0$ being $E\, 2^{X_0} = E\, 2^0 = 1 = n + 1$.

  13. Induction step: Suppose that $E\, 2^{X_n} = n + 1$ for some $n \in \mathbb{N}_0$. Then
      $$\begin{aligned}
      E\, 2^{X_{n+1}} &= \sum_{j=0}^{\infty} P(X_n = j) \cdot E\big( 2^{X_{n+1}} \mid X_n = j \big) \\
      &= \sum_{j=0}^{\infty} P(X_n = j) \cdot \Big( \Big( 1 - \frac{1}{2^j} \Big) 2^j + \frac{1}{2^j} \cdot 2^{j+1} \Big) \\
      &= \sum_{j=0}^{\infty} P(X_n = j)\, 2^j + \sum_{j=0}^{\infty} P(X_n = j) \\
      &= E\, 2^{X_n} + 1 = (n + 1) + 1,
      \end{aligned}$$
      as required. (An empirical check of the claim follows.)
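A quick Monte Carlo check of the claim $E\, 2^{X_n} = n + 1$, reusing the `MorrisCounter` sketch from earlier; the value of $n$ and the number of trials are illustrative.

```python
# Empirically estimate E[2^{X_n}] and compare it to n + 1.
# Assumes the MorrisCounter class sketched above is in scope.
n, trials = 500, 20_000
total = 0.0
for _ in range(trials):
    c = MorrisCounter()
    for _ in range(n):
        c.update()
    total += 2.0 ** c.X
print(total / trials)  # should be close to n + 1 = 501
```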
