B669 Sublinear Algorithms for Big Data


  1. B669 Sublinear Algorithms for Big Data. Qin Zhang

  2. An overview of problems

  3. Statistics. Denote the stream by A = a_1, a_2, . . . , a_m, where m is the length of the stream, which is unknown at the beginning. Let [n] be the item universe, and let f_i be the frequency of item i in the stream. On seeing a_i = (i, ∆), update f_i ← f_i + ∆ (special case: ∆ ∈ {1, −1}, corresponding to insertions/deletions).

  4. Statistics. Denote the stream by A = a_1, a_2, . . . , a_m, where m is the length of the stream, which is unknown at the beginning. Let [n] be the item universe, and let f_i be the frequency of item i in the stream. On seeing a_i = (i, ∆), update f_i ← f_i + ∆ (special case: ∆ ∈ {1, −1}, corresponding to insertions/deletions). Entropy: the empirical entropy of the data set is H(A) = Σ_{i ∈ [n]} (f_i/m) log(m/f_i). App: very useful in "change" (e.g., anomalous events) detection.
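
  For reference, a minimal offline computation of the empirical entropy from exact counts (Python); it stores all frequencies, so it is a baseline rather than a small-space streaming algorithm:

      import math
      from collections import Counter

      def empirical_entropy(stream):
          # H(A) = sum over i of (f_i/m) * log(m/f_i), computed from exact counts
          counts = Counter(stream)              # f_i for each item i
          m = sum(counts.values())              # stream length
          return sum((f / m) * math.log(m / f) for f in counts.values())

      print(empirical_entropy(["a", "a", "b", "c"]))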

  5. Statistics. Denote the stream by A = a_1, a_2, . . . , a_m, where m is the length of the stream, which is unknown at the beginning. Let [n] be the item universe, and let f_i be the frequency of item i in the stream. On seeing a_i = (i, ∆), update f_i ← f_i + ∆ (special case: ∆ ∈ {1, −1}, corresponding to insertions/deletions). Entropy: the empirical entropy of the data set is H(A) = Σ_{i ∈ [n]} (f_i/m) log(m/f_i). App: very useful in "change" (e.g., anomalous events) detection. Frequency moments: F_p = Σ_{i ∈ [n]} f_i^p. • F_0: the number of distinct items. • F_1: the total number of items. • F_2: the size of the self-join. In general, F_p (p > 1) is a good measurement of the skewness of the data.
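
  A minimal exact computation of F_p from the full frequency table (a baseline for checking small-space estimators, not itself sublinear):

      from collections import Counter

      def frequency_moment(stream, p):
          # F_p = sum over i of f_i^p; F_0 is the number of distinct items
          counts = Counter(stream)
          if p == 0:
              return len(counts)
          return sum(f ** p for f in counts.values())

      stream = [1, 2, 2, 3, 3, 3]
      print(frequency_moment(stream, 0), frequency_moment(stream, 1), frequency_moment(stream, 2))  # 3 6 14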

  6. Statistics (cont.) Heavy-hitter: a set of items whose frequency is at least a given threshold. App: popular IP destinations, . . . [Figure: frequency histogram over items 1–8 of a stream with |A| = m; items with frequency ≥ 0.01m are included as heavy hitters.]
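
  One classical small-space heavy-hitter algorithm for insert-only streams is Misra-Gries; a short sketch (the parameter k and the names here are illustrative, and this is not necessarily the variant covered in the course):

      def misra_gries(stream, k):
          # keeps at most k-1 counters; any item with true frequency > m/k survives,
          # and each surviving counter underestimates the true frequency by at most m/k
          counters = {}
          for x in stream:
              if x in counters:
                  counters[x] += 1
              elif len(counters) < k - 1:
                  counters[x] = 1
              else:
                  for key in list(counters):        # decrement everything, drop zeros
                      counters[key] -= 1
                      if counters[key] == 0:
                          del counters[key]
          return counters

      print(misra_gries([1, 1, 1, 2, 3, 1, 2], k=3))  # {1: 3, 2: 1}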

  7. Statistics (cont.) Heavy-hitter: a set of items whose frequency is at least a given threshold. App: popular IP destinations, . . . Quantile: the φ-quantile of A is some x such that there are at most φm items of A that are smaller than x and at most (1 − φ)m items of A that are greater than x. All-quantile: a data structure from which all φ-quantiles, for any 0 ≤ φ ≤ 1, can be extracted. App: distribution of packet sizes, . . .
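
  A direct offline check of the φ-quantile definition above (it scans all items, so it only illustrates the definition, not a streaming data structure; the function name is made up):

      def phi_quantile(items, phi):
          # returns an x with at most phi*m items smaller than x
          # and at most (1 - phi)*m items greater than x
          m = len(items)
          for x in sorted(items):
              if (sum(y < x for y in items) <= phi * m and
                      sum(y > x for y in items) <= (1 - phi) * m):
                  return x

      print(phi_quantile([5, 1, 4, 2, 3], 0.5))  # 3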

  8. Statistics (cont.) L_p sampling: Let x ∈ R^n be a non-zero vector. For p > 0, the L_p distribution corresponding to x is the distribution on [n] that takes i with probability |x_i|^p / ‖x‖_p^p, where ‖x‖_p = (Σ_{i ∈ [n]} |x_i|^p)^{1/p}. In particular, for p = 0, L_0 sampling selects an element uniformly at random from the non-zero coordinates of x. App: an extremely useful tool for constructing graph sketches, finding duplicates, etc.
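
  A minimal offline sampler that draws from the L_p distribution of a fully known vector x; the streaming problem is to do this from updates in small space, so this only pins down the target distribution:

      import random

      def lp_sample(x, p):
          # p > 0: return index i with probability |x_i|^p / ||x||_p^p
          # p = 0: return a uniformly random non-zero coordinate
          if p == 0:
              support = [i for i, v in enumerate(x) if v != 0]
              return random.choice(support)
          weights = [abs(v) ** p for v in x]
          return random.choices(range(len(x)), weights=weights, k=1)[0]

      print(lp_sample([0.0, 3.0, -4.0], p=2))   # returns 2 with probability 16/25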

  9. Graphs. Denote the stream by A = a_1, a_2, . . . , a_m, where a_i = ((u_i, v_i), insert/delete) and (u_i, v_i) is an edge.

  10. Graphs. Denote the stream by A = a_1, a_2, . . . , a_m, where a_i = ((u_i, v_i), insert/delete) and (u_i, v_i) is an edge. Connectivity: Test if a graph is connected. Matching: Estimate the size of the maximum matching of a graph. Diameter: Compute the diameter of a graph (that is, the maximum distance between two nodes).
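
  For insert-only edge streams, connectivity can be maintained in one pass with O(n) words of space using a union-find over the vertices; a minimal sketch (vertices assumed to be labeled 0..n-1):

      def is_connected(n, edge_stream):
          # union-find with path halving; one pass over the edges, O(n) space
          parent = list(range(n))

          def find(u):
              while parent[u] != u:
                  parent[u] = parent[parent[u]]
                  u = parent[u]
              return u

          components = n
          for u, v in edge_stream:
              ru, rv = find(u), find(v)
              if ru != rv:
                  parent[ru] = rv
                  components -= 1
          return components == 1

      print(is_connected(4, [(0, 1), (1, 2), (2, 3)]))  # True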

  11. Graphs. Denote the stream by A = a_1, a_2, . . . , a_m, where a_i = ((u_i, v_i), insert/delete) and (u_i, v_i) is an edge. Connectivity: Test if a graph is connected. Matching: Estimate the size of the maximum matching of a graph. Diameter: Compute the diameter of a graph (that is, the maximum distance between two nodes). Triangle counting: Compute the number of triangles of a graph. App: Useful for finding communities in a social network (e.g., the fraction of v's neighbors who are neighbors themselves).
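
  An exact triangle counter over the full edge list, useful as a baseline when testing streaming estimators (it stores the whole graph, so it is not a small-space algorithm):

      from collections import defaultdict

      def count_triangles(edges):
          adj = defaultdict(set)
          for u, v in edges:
              adj[u].add(v)
              adj[v].add(u)
          # count each triangle {u, v, w} exactly once, with u < v < w
          return sum(1 for u in adj for v in adj[u] if u < v
                       for w in adj[u] & adj[v] if v < w)

      print(count_triangles([(1, 2), (2, 3), (1, 3), (3, 4)]))  # 1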

  12. Graphs (cont.) Spanner: Given a graph G = (V, E), we say that a subgraph H = (V, E′) is an α-spanner for G if ∀ u, v ∈ V, d_G(u, v) ≤ d_H(u, v) ≤ α · d_G(u, v). That is, a subgraph that (approximately) maintains pairwise distances.
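
  A simple offline way to build an α-spanner of an unweighted graph is the classical greedy construction: keep an edge only if its endpoints are not already within distance α in the subgraph kept so far. A sketch under that assumption (depth-bounded BFS; names are illustrative):

      from collections import deque, defaultdict

      def greedy_spanner(edges, alpha):
          adj = defaultdict(set)
          kept = []

          def within(s, t, limit):
              # BFS from s up to depth `limit`, looking for t
              if s == t:
                  return True
              seen, frontier = {s}, deque([(s, 0)])
              while frontier:
                  u, d = frontier.popleft()
                  if d == limit:
                      continue
                  for w in adj[u]:
                      if w == t:
                          return True
                      if w not in seen:
                          seen.add(w)
                          frontier.append((w, d + 1))
              return False

          for u, v in edges:
              if not within(u, v, alpha):    # keep only edges that shorten some distance
                  kept.append((u, v))
                  adj[u].add(v)
                  adj[v].add(u)
          return kept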

  13. Graphs (cont.) Spanner: Given a graph G = (V, E), we say that a subgraph H = (V, E′) is an α-spanner for G if ∀ u, v ∈ V, d_G(u, v) ≤ d_H(u, v) ≤ α · d_G(u, v). That is, a subgraph that (approximately) maintains pairwise distances. Graph sparsification: Given a graph G = (V, E), denote the minimum cut of G by λ(G), and by λ_A(G) the capacity of the cut (A, V \ A). We say that a weighted subgraph H = (V, E′, w) is an ε-sparsification for G if ∀ A ⊂ V, (1 − ε) λ_A(G) ≤ λ_A(H) ≤ (1 + ε) λ_A(G). App: synopses for massive graphs. A graph synopsis is a subgraph of much smaller size that keeps properties of the original graph.

  14. Geometry. Denote the stream by A = a_1, a_2, . . . , a_m, where a_i = (location, ins/del).

  15. Geometry. Denote the stream by A = a_1, a_2, . . . , a_m, where a_i = (location, ins/del). Earth-mover distance: Given two multisets A, B in the grid [∆]^2 of the same size, the earth-mover distance is defined as the minimum cost of a perfect matching between points in A and B: EMD(A, B) = min_{π: A → B a bijection} Σ_{a ∈ A} ‖a − π(a)‖. App: a good measurement of the similarity of two images.
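
  A brute-force EMD that tries every bijection π is only feasible for tiny point sets, but it pins down the quantity being approximated (Euclidean ground distance assumed here):

      from itertools import permutations
      import math

      def emd(A, B):
          # minimum-cost perfect matching between equal-size point lists A and B
          return min(sum(math.dist(a, b) for a, b in zip(A, pi))
                     for pi in permutations(B))

      print(emd([(0, 0), (1, 0)], [(0, 1), (1, 1)]))  # 2.0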

  16. Geometry. Denote the stream by A = a_1, a_2, . . . , a_m, where a_i = (location, ins/del). Earth-mover distance: Given two multisets A, B in the grid [∆]^2 of the same size, the earth-mover distance is defined as the minimum cost of a perfect matching between points in A and B: EMD(A, B) = min_{π: A → B a bijection} Σ_{a ∈ A} ‖a − π(a)‖. App: a good measurement of the similarity of two images. Clustering (k-center): Cluster a set of points X = (x_1, x_2, . . . , x_m) into clusters c_1, c_2, . . . , c_k with representatives r_1 ∈ c_1, r_2 ∈ c_2, . . . , r_k ∈ c_k so as to minimize max_i min_j d(x_i, r_j). App: (see wiki page)
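
  For k-center, Gonzalez's greedy farthest-point rule gives a 2-approximation offline; a short sketch (Euclidean distance assumed, points given as tuples):

      import math

      def k_center(points, k):
          # repeatedly pick the point farthest from the representatives chosen so far
          reps = [points[0]]
          while len(reps) < k:
              reps.append(max(points,
                              key=lambda x: min(math.dist(x, r) for r in reps)))
          return reps

      print(k_center([(0, 0), (1, 0), (10, 0), (11, 0)], k=2))  # [(0, 0), (11, 0)]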

  17. Strings. Denote the stream by A = a_1, a_2, . . . , a_m, where a_i = (i, ins/del).

  18. Strings. Denote the stream by A = a_1, a_2, . . . , a_m, where a_i = (i, ins/del). Distance to sortedness: LIS(A) = the length of the longest increasing subsequence of the sequence A. DistSort(A) = the minimum number of elements that need to be deleted from A to get a sorted sequence = |A| − LIS(A). App: a good measurement of network latency.
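
  LIS (and hence DistSort) can be computed offline in O(m log m) time with patience sorting; a minimal version for strictly increasing subsequences:

      import bisect

      def dist_to_sorted(seq):
          # tails[k] = smallest possible tail of an increasing subsequence of length k+1
          tails = []
          for x in seq:
              k = bisect.bisect_left(tails, x)
              if k == len(tails):
                  tails.append(x)
              else:
                  tails[k] = x
          return len(seq) - len(tails)      # DistSort(A) = |A| - LIS(A)

      print(dist_to_sorted([1, 5, 2, 3, 4]))  # 1 (delete the 5)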

  19. Strings. Denote the stream by A = a_1, a_2, . . . , a_m, where a_i = (i, ins/del). Distance to sortedness: LIS(A) = the length of the longest increasing subsequence of the sequence A. DistSort(A) = the minimum number of elements that need to be deleted from A to get a sorted sequence = |A| − LIS(A). App: a good measurement of network latency. Edit distance: Given two strings A and B, the minimum number of insertions/deletions/substitutions needed to convert A to B. App: a standard measurement of the similarity of two strings/documents.
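
  The textbook dynamic program computes edit distance exactly in O(|A| · |B|) time and O(|B|) space; a compact version:

      def edit_distance(A, B):
          prev = list(range(len(B) + 1))            # distances from "" to prefixes of B
          for i, a in enumerate(A, 1):
              cur = [i]
              for j, b in enumerate(B, 1):
                  cur.append(min(prev[j] + 1,               # delete a
                                 cur[j - 1] + 1,            # insert b
                                 prev[j - 1] + (a != b)))   # substitute (free if equal)
              prev = cur
          return prev[-1]

      print(edit_distance("kitten", "sitting"))  # 3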

  20. Numerical linear algebra. Denote the stream by A = a_1, a_2, . . . , a_n, where a_k = (i, j, ∆) denotes the update M[i, j] ← M[i, j] + ∆, and M[i, j] is the cell in the i-th row, j-th column of the matrix M.

  21. Numerical linear algebra. Denote the stream by A = a_1, a_2, . . . , a_n, where a_k = (i, j, ∆) denotes the update M[i, j] ← M[i, j] + ∆, and M[i, j] is the cell in the i-th row, j-th column of the matrix M. Regression: Given an n × d matrix M and an n × 1 vector b, one seeks x* = argmin_x ‖Mx − b‖_p, for a p ∈ [1, ∞).
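
  For p = 2 this is ordinary least squares, which numpy solves exactly; a tiny example (the matrix and vector below are made up purely for illustration):

      import numpy as np

      M = np.array([[1.0, 0.0],
                    [1.0, 1.0],
                    [1.0, 2.0]])                      # n = 3, d = 2
      b = np.array([1.0, 2.0, 2.9])
      x_star, *_ = np.linalg.lstsq(M, b, rcond=None)  # argmin_x ||Mx - b||_2
      print(x_star, np.linalg.norm(M @ x_star - b))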

  22. Numerical linear algebra. Denote the stream by A = a_1, a_2, . . . , a_n, where a_k = (i, j, ∆) denotes the update M[i, j] ← M[i, j] + ∆, and M[i, j] is the cell in the i-th row, j-th column of the matrix M. Regression: Given an n × d matrix M and an n × 1 vector b, one seeks x* = argmin_x ‖Mx − b‖_p, for a p ∈ [1, ∞). Low-rank approximation: Given an n × m matrix M, find an n × k matrix L and an m × k matrix W, each with orthonormal columns, and a k × k diagonal matrix D (k < min{n, m}) such that ‖M − L D W^T‖_F is minimized, where ‖·‖_F is the Frobenius norm. App: a fundamental problem in many areas, including machine learning, recommendation systems, natural language processing, etc.
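
  By the Eckart-Young theorem, the best rank-k approximation in Frobenius norm comes from the truncated SVD; a small numpy sketch:

      import numpy as np

      def low_rank_approx(M, k):
          # truncated SVD: M ≈ L D W^T with L (n x k), W (m x k) orthonormal columns, D diagonal
          U, s, Vt = np.linalg.svd(M, full_matrices=False)
          return U[:, :k], np.diag(s[:k]), Vt[:k, :].T

      M = np.random.rand(5, 4)
      L, D, W = low_rank_approx(M, 2)
      print(np.linalg.norm(M - L @ D @ W.T, "fro"))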

  23. Sliding windows. Sometimes we are only interested in recent items in the stream. Time-based sliding window: the w most recent time steps. Sequence-based sliding window: the w most recent items. [Figure: a CPU with a small RAM processing only the windowed suffix of the stream.]
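
  As a concrete instance, keeping an exact running sum over a sequence-based window stores the whole window (Θ(w) space), which is what motivates approximate sliding-window summaries; a minimal sketch:

      from collections import deque

      class SequenceWindowSum:
          def __init__(self, w):
              self.buf = deque(maxlen=w)      # the w most recent items
              self.total = 0

          def add(self, x):
              if len(self.buf) == self.buf.maxlen:
                  self.total -= self.buf[0]   # oldest item falls out of the window
              self.buf.append(x)
              self.total += x
              return self.total

      win = SequenceWindowSum(w=3)
      print([win.add(x) for x in [1, 2, 3, 4, 5]])  # [1, 3, 6, 9, 12]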

  24. Lower bounds. What is impossible? Or, what is the limit of the space usage needed to solve a problem? Usually shown by reductions from communication complexity. (Not covered in this course.)
