B669 Sublinear Algorithms for Big Data (Qin Zhang), Part 1: Sublinear in Space

  1. B669 Sublinear Algorithms for Big Data. Qin Zhang

  2. Part 1: Sublinear in Space

  3. The model and challenge
     The data stream model (Alon, Matias and Szegedy 1996).
     [Figure: items $a_1, a_2, \ldots, a_n$ stream past a CPU with a small RAM.]
     Why hard? Cannot store everything.
     Applications: Internet routers, stock data, ad auctions, flight logs on tape, etc.
     (Next 4 slides courtesy of Jeff Phillips.)

  4. Network routers
     [Figure: packets from the Internet pass through a router with limited space.]
     • Data per day: at least 1 terabyte
     • A packet takes 8 nanoseconds to pass through the router
     • A few million packets per second
     What statistics can we keep on the data? For example, we want to detect anomalies for security.

  5. Telephone switch
     [Figure: text messages pass through a switch with limited space.]
     Cell phones connect through switches.
     • Each message: 1000 bytes
     • 500 million calls / day
     • 1 terabyte per month
     Can we search for characteristics of dropped calls?

  6. Ad auction
     [Figure: a server with limited space; labels: page view, keyword search, ad served, ad click, delivery model.]
     Serving ads on the web: Google, Yahoo!, Microsoft.
     • Yahoo.com viewed 77 trillion times
     • 2 million / hour
     • Each page serves ads; which ones?
     How to update the ad delivery model?

  7. Flight logs on tape
     [Figure: a tape feeding a CPU that outputs statistics.]
     All airplane logs over Washington, DC.
     • About 500-1000 flights per day
     • 50 years, about 9 million flights in total
     • Each flight has a trajectory, passenger count, and control dialog
     Stored on tape. Can only make 1 (or $O(1)$) passes!
     What statistics can be gathered?

  10. (Last lecture) Maintain a sample for sliding windows
      Task: Find a uniform sample from the last $w$ items.
      Algorithm:
      – For each $x_i$, pick a random value $v_i \in (0, 1)$.
      – In a window $\langle x_{j-w+1}, \ldots, x_j \rangle$, return the value $x_i$ with the smallest $v_i$.
      – To do this, maintain the set of all $x_i$ in the sliding window whose $v_i$ is smaller than every later value.
      Correctness: Obvious, since each item in the window is equally likely to hold the smallest $v_i$.
      Space (expected): the $k$-th most recent item is kept with probability $1/k$, so the expected size is $\frac{1}{w} + \frac{1}{w-1} + \cdots + \frac{1}{1} = O(\log w)$.
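
The sliding-window algorithm above can be sketched in Python as follows. This is a minimal illustration, not code from the lecture: the class name `SlidingWindowSampler`, its methods, and the optional `rng` parameter are all hypothetical. The deque holds exactly the items whose random value is smaller than every later value, so its leftmost entry is the current sample.

```python
import random
from collections import deque


class SlidingWindowSampler:
    """Uniform sample from the last w items using random priorities."""

    def __init__(self, w, rng=None):
        self.w = w
        self.rng = rng or random.Random()
        self.t = 0  # number of items seen so far
        # Candidates are (index, item, priority). Priorities increase from
        # left to right, so the leftmost candidate has the smallest one.
        self.candidates = deque()

    def add(self, x):
        v = self.rng.random()
        # A candidate with priority >= v can never be the minimum again:
        # the new item is later and smaller, so it outlives the candidate.
        while self.candidates and self.candidates[-1][2] >= v:
            self.candidates.pop()
        self.candidates.append((self.t, x, v))
        self.t += 1
        # Drop candidates that have slid out of the last w positions.
        while self.candidates[0][0] <= self.t - 1 - self.w:
            self.candidates.popleft()

    def sample(self):
        """Item with the smallest priority among the last w items."""
        return self.candidates[0][1]


if __name__ == "__main__":
    sampler = SlidingWindowSampler(w=5)
    for item in range(20):
        sampler.add(item)
    print(sampler.sample())  # some item from 15..19
```

The candidate list has expected length $O(\log w)$, matching the space bound on the slide.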

  11. §1.0 An overview of problems

  14. Statistics
      Denote the stream by $A = a_1, a_2, \ldots, a_m$, where $m$ is the length of the stream, which is unknown at the beginning. Let $[n]$ be the item universe. Let $f_i$ be the frequency of item $i$ in the stream. On seeing $a_j = (i, \Delta)$, update $f_i \leftarrow f_i + \Delta$ (special case: $\Delta \in \{1, -1\}$, corresponding to insertions/deletions).
      Entropy: the empirical entropy of the data set:
      $H(A) = \sum_{i \in [n]} \frac{f_i}{m} \log \frac{m}{f_i}$.
      App: Very useful in "change" (e.g., anomalous events) detection.
      Frequency moments: $F_p = \sum_i f_i^p$.
      • $F_0$: number of distinct items.
      • $F_1$: total number of items.
      • $F_2$: size of the self-join.
      General $F_p$ ($p > 1$): good measures of the skewness of the data.
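
To make the definitions concrete, here is a small exact (not sublinear) reference computation in Python. It stores all frequencies, which is precisely what a streaming algorithm cannot afford, so it only serves as a baseline to check approximations against. The function names are hypothetical, and the logarithm base 2 is an assumption; the slide does not fix the base.

```python
import math
from collections import Counter


def frequencies(stream):
    """Apply updates (i, delta) and return the non-zero frequencies f_i."""
    f = Counter()
    for i, delta in stream:
        f[i] += delta
    return {i: c for i, c in f.items() if c != 0}


def entropy(f):
    """Empirical entropy sum_i (f_i/m) log2(m/f_i), assuming all f_i > 0."""
    m = sum(f.values())
    return sum((c / m) * math.log2(m / c) for c in f.values())


def moment(f, p):
    """Frequency moment F_p = sum_i f_i^p; F_0 counts distinct items."""
    if p == 0:
        return len(f)
    return sum(c ** p for c in f.values())


if __name__ == "__main__":
    # Insertions of a, b, c followed by one deletion of c.
    stream = [("a", 1), ("b", 1), ("a", 1), ("c", 1),
              ("a", 1), ("b", 1), ("c", 1), ("c", -1)]
    f = frequencies(stream)  # a: 3, b: 2, c: 1
    print(moment(f, 0), moment(f, 1), moment(f, 2))  # 3 6 14
    print(entropy(f))
```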

  16. Statistics (cont.)
      Heavy hitters: the set of items whose frequency is at least a threshold.
      App: popular IP destinations, ...
      [Figure: frequencies of items 1 to 8 in a stream with $|A| = m$; items reaching the threshold $0.01m$ are marked as included.]
      Quantile: The $\phi$-quantile of $A$ is some $x$ such that at most $\phi m$ items of $A$ are smaller than $x$ and at most $(1 - \phi) m$ items of $A$ are greater than $x$.
      All-quantile: a data structure from which the $\phi$-quantile for any $0 \le \phi \le 1$ can be extracted.
      App: distribution of package sizes, ...
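
The two definitions can be checked on small inputs with the exact Python sketch below. It sorts and counts the whole input, so it illustrates the definitions rather than any streaming algorithm. The function names and the parameter `phi` (used both as the heavy-hitter threshold fraction and as the quantile parameter) are illustrative assumptions.

```python
from collections import Counter


def heavy_hitters(items, phi):
    """Items whose frequency is at least phi * m, where m = len(items)."""
    m = len(items)
    counts = Counter(items)
    return {x for x, c in counts.items() if c >= phi * m}


def quantile(items, phi):
    """Return one x that satisfies the phi-quantile definition."""
    s = sorted(items)
    k = min(len(s) - 1, int(phi * len(s)))
    return s[k]


def is_quantile(items, x, phi):
    """Check the definition: few items below x and few items above x."""
    m = len(items)
    smaller = sum(1 for y in items if y < x)
    greater = sum(1 for y in items if y > x)
    return smaller <= phi * m and greater <= (1 - phi) * m


if __name__ == "__main__":
    data = [1, 1, 1, 2, 2, 3, 4, 5, 6, 7]
    print(heavy_hitters(data, 0.2))  # items occurring at least twice
    median = quantile(data, 0.5)
    print(median, is_quantile(data, median, 0.5))
```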

  17. Statistics (cont.)
      $L_p$ sampling: Let $x \in \mathbb{R}^n$ be a non-zero vector. For $p > 0$, the $L_p$ distribution corresponding to $x$ is the distribution on $[n]$ that takes $i$ with probability
      $\frac{|x_i|^p}{\|x\|_p^p}$, with $\|x\|_p = \left( \sum_{i \in [n]} |x_i|^p \right)^{1/p}$.
      In particular, for $p = 0$, $L_0$ sampling selects an element uniformly at random from the non-zero coordinates of $x$.
      App: an extremely useful tool for constructing graph sketches, finding duplicates, etc.
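
As an illustration of the target distribution only, the Python sketch below samples from the $L_p$ distribution of an explicitly stored vector. A real streaming $L_p$ sampler must work from a small sketch of $x$ instead; that is not attempted here. The function names are hypothetical.

```python
import random


def lp_distribution(x, p):
    """Probabilities of the L_p distribution of a non-zero vector x."""
    if p == 0:
        # L_0: uniform over the non-zero coordinates.
        weights = [1.0 if xi != 0 else 0.0 for xi in x]
    else:
        weights = [abs(xi) ** p for xi in x]
    total = sum(weights)  # equals ||x||_p^p when p > 0
    return [w / total for w in weights]


def lp_sample(x, p, rng=random):
    """Draw one index i with probability |x_i|^p / ||x||_p^p."""
    probs = lp_distribution(x, p)
    return rng.choices(range(len(x)), weights=probs, k=1)[0]


if __name__ == "__main__":
    x = [3, 0, -4]
    print(lp_distribution(x, 2))  # index 2 is the most likely
    print(lp_distribution(x, 0))  # uniform over indices 0 and 2
    print(lp_sample(x, 1))
```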

  20. Graphs
      Denote the stream by $A = a_1, a_2, \ldots, a_m$, where $a_i = ((u_i, v_i), \text{insert/delete})$ and $(u_i, v_i)$ is an edge.
      Connectivity: Test whether a graph is connected.
      Matching: Estimate the size of the maximum matching of a graph.
      Diameter: Compute the diameter of a graph (that is, the maximum distance between two nodes).
      Triangle counting: Compute the number of triangles in a graph.
      App: Useful for finding communities in a social network (the fraction of $v$'s neighbors who are neighbors themselves).
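
The Python sketch below replays an insert/delete edge stream into an explicit graph and then answers two of the queries exactly. It uses space linear in the number of edges, so it is a baseline rather than a streaming algorithm. The string tags `"insert"` and `"delete"` and all function names are assumptions made for this example.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_graph(stream):
    """Apply edge updates ((u, v), op) and return adjacency sets."""
    adj = defaultdict(set)
    for (u, v), op in stream:
        if op == "insert":
            adj[u].add(v)
            adj[v].add(u)
        else:  # "delete"
            adj[u].discard(v)
            adj[v].discard(u)
    return adj


def is_connected(adj, vertices):
    """Breadth-first search from one vertex; True if all are reached."""
    vertices = list(vertices)
    if not vertices:
        return True
    seen = {vertices[0]}
    queue = deque([vertices[0]])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return all(v in seen for v in vertices)


def count_triangles(adj):
    """Count neighbor pairs that are adjacent; each triangle is seen 3 times."""
    count = 0
    for u in adj:
        for v, w in combinations(adj[u], 2):
            if w in adj[v]:
                count += 1
    return count // 3


if __name__ == "__main__":
    stream = [((1, 2), "insert"), ((2, 3), "insert"), ((1, 3), "insert"),
              ((3, 4), "insert"), ((4, 5), "insert"), ((4, 5), "delete")]
    g = build_graph(stream)
    print(is_connected(g, [1, 2, 3, 4]), count_triangles(g))
```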

  22. Graphs (cont.)
      Spanner: Given a graph $G = (V, E)$, we say that a subgraph $H = (V, E')$ is an $\alpha$-spanner for $G$ if
      $\forall u, v \in V: \; d_G(u, v) \le d_H(u, v) \le \alpha \cdot d_G(u, v)$.
      That is, a subgraph that (approximately) maintains pairwise distances.
      Graph sparsification: Given a graph $G = (V, E)$, denote the minimum cut of $G$ by $\lambda(G)$, and by $\lambda_A(G)$ the capacity of the cut $(A, V \setminus A)$. We say that a weighted subgraph $H = (V, E', w)$ is an $\epsilon$-sparsification for $G$ if
      $\forall A \subset V: \; (1 - \epsilon) \lambda_A(G) \le \lambda_A(H) \le (1 + \epsilon) \lambda_A(G)$.
      App: Synopses for massive graphs. A graph synopsis is a subgraph of much smaller size that keeps properties of the original graph.
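
The spanner condition can be verified directly on small unweighted graphs. The Python sketch below assumes $E' \subseteq E$, so the lower bound $d_G \le d_H$ holds automatically and only the upper bound is tested. It checks a given subgraph; it does not construct a spanner. All function names are hypothetical.

```python
from collections import deque


def to_adj(vertices, edges):
    """Adjacency sets of an undirected graph."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def bfs_distances(adj, source):
    """Hop distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def is_spanner(vertices, edges_g, edges_h, alpha):
    """Check d_H(u, v) <= alpha * d_G(u, v) for all connected pairs of G."""
    g = to_adj(vertices, edges_g)
    h = to_adj(vertices, edges_h)
    for u in vertices:
        dg = bfs_distances(g, u)
        dh = bfs_distances(h, u)
        for v, d in dg.items():
            # A pair that H disconnects has infinite distance in H.
            if dh.get(v, float("inf")) > alpha * d:
                return False
    return True


if __name__ == "__main__":
    cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
    path = [(0, 1), (1, 2), (2, 3)]  # the cycle without edge (3, 0)
    print(is_spanner(range(4), cycle, path, 3))  # True
    print(is_spanner(range(4), cycle, path, 2))  # False
```

Dropping edge (3, 0) stretches the distance between 0 and 3 from 1 to 3, which is why the path is a 3-spanner of the 4-cycle but not a 2-spanner.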

  25. Geometry
      Denote the stream by $A = a_1, a_2, \ldots, a_m$, where $a_i = (\text{location}, \text{ins/del})$.
      Earth-mover distance: Given two multisets $A, B$ of the same size in the grid $[\Delta]^2$, the earth-mover distance is defined as the minimum cost of a perfect matching between the points in $A$ and $B$:
      $\mathrm{EMD}(A, B) = \min_{\pi : A \to B \text{ a bijection}} \sum_{a \in A} \| a - \pi(a) \|$.
      App: a good measure of the similarity of two images.
      Clustering ($k$-center): Cluster a set of points $X = (x_1, x_2, \ldots, x_m)$ into clusters $c_1, c_2, \ldots, c_k$ with representatives $r_1 \in c_1, r_2 \in c_2, \ldots, r_k \in c_k$ to minimize
      $\max_i \min_j d(x_i, r_j)$.
      App: (see wiki page)
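
Both objectives can be evaluated by brute force on tiny inputs, as in the Python sketch below. It assumes the Euclidean norm for $\|\cdot\|$ and for $d$, which the slide does not specify, and it enumerates all bijections, so it is only usable for a handful of points. The function names are hypothetical.

```python
import math
from itertools import permutations


def emd(a, b):
    """Minimum total distance over all bijections from a to b."""
    if len(a) != len(b):
        raise ValueError("EMD is defined here for equal-size multisets")
    return min(
        sum(math.dist(p, q) for p, q in zip(a, perm))
        for perm in permutations(b)
    )


def k_center_cost(points, reps):
    """Largest distance from any point to its nearest representative."""
    return max(min(math.dist(x, r) for r in reps) for x in points)


if __name__ == "__main__":
    a = [(0, 0), (1, 0)]
    b = [(0, 1), (1, 1)]
    print(emd(a, b))  # each point moves straight up by 1
    print(k_center_cost([(0, 0), (2, 0), (10, 0)], [(0, 0), (10, 0)]))
```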

  28. Strings
      Denote the stream by $A = a_1, a_2, \ldots, a_m$, where $a_i = (i, \text{ins/del})$.
      Distance to sortedness: $\mathrm{LIS}(A)$ = the length of the longest increasing subsequence of the sequence $A$. $\mathrm{DistSort}(A)$ = the minimum number of elements that must be deleted from $A$ to get a sorted sequence = $|A| - \mathrm{LIS}(A)$.
      App: a good measure of network latency.
      Edit distance: Given two strings $A$ and $B$, the minimum number of insertions, deletions, and substitutions needed to convert $A$ to $B$.
      App: a standard measure of the similarity of two strings/documents.
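
For reference, the three quantities can be computed exactly offline with the standard textbook methods sketched below in Python: patience sorting with binary search for the LIS, and a two-row dynamic program for edit distance. These are not sublinear-space algorithms. "Increasing" is taken to mean strictly increasing, which is an assumption, and the function names are hypothetical.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence."""
    tails = []  # tails[k] = smallest tail of an increasing run of length k+1
    for x in seq:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)


def dist_sort(seq):
    """Minimum number of deletions that leave a sorted sequence."""
    return len(seq) - lis_length(seq)


def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(lis_length([3, 1, 2, 5, 4, 6]))      # 4
    print(dist_sort([3, 1, 2, 5, 4, 6]))       # 2
    print(edit_distance("kitten", "sitting"))  # 3
```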
