B669 Sublinear Algorithms for Big Data, Qin Zhang



SLIDE 1

B669 Sublinear Algorithms for Big Data

Qin Zhang

SLIDE 2

Part 1: Sublinear in Space

SLIDE 3

The model and challenge

The data stream model (Alon, Matias, and Szegedy 1996): items a1, a2, . . . , an stream past a CPU that has only a small RAM.

Why hard? Cannot store everything. Applications: Internet router, stock data, ad auction, flight logs on tape, etc.

(Next 4 slides courtesy of Jeff Phillips.)


SLIDE 4

Internet Router

  • Data per day: at least 1 Terabyte
  • A packet takes 8 nanoseconds to pass through the router
  • A few million packets per second

What statistics can we keep on the data? For example, we want to detect anomalies for security.

(Diagram: packets stream through a network router with limited space.)

SLIDE 5

Cell phones connect through switches:

  • Each message: 1000 Bytes
  • 500 million calls / day
  • 1 Terabyte per month

Search for characteristics for dropped calls?

Telephone Switch

(Diagram: txt and msg traffic flows through a switch with limited space.)

SLIDE 6

Serving ads on the web: Google, Yahoo!, Microsoft.

  • Yahoo.com viewed 77 trillion times
  • 2 million / hour
  • Each page serves ads; which ones?

How to update ad delivery model?

Ad Auction

(Diagram: page views, ad clicks, and keyword searches reach a server that updates the ad delivery model in limited space.)

SLIDE 7

All airplane logs over Washington, DC

  • About 500 - 1000 flights per day
  • 50 years: about 9 million flights in total
  • Each flight has trajectory, passenger count, control dialogue

Stored on tape. Can only make 1 (or O(1)) pass! What statistics can be gathered?

Flight Logs on Tape

(Diagram: the tape streams past a CPU that maintains statistics.)


SLIDE 10

(Last lecture) Maintaining a sample over sliding windows

Task: Maintain a uniform sample of the last w items.

Algorithm:
  – For each item xi, pick an independent random value vi ∈ (0, 1).
  – In the window ⟨x_{j−w+1}, . . . , x_j⟩, return the item xi with the smallest vi.
  – To do this, maintain the set of items xi in the window whose vi is minimal among all subsequent values.

Space (expected): 1/w + 1/(w − 1) + . . . + 1/1 ≈ ln w = O(log w).
Correctness: each of the w items in the window is equally likely to carry the smallest vi.

SLIDE 11

§1.0 An overview of problems


SLIDE 14

Statistics

Denote the stream by A = a1, a2, . . . , am, where m is the length of the stream, unknown at the beginning. Let [n] be the item universe, and let fi be the frequency of item i in the stream. On seeing ai = (i, ∆), update fi ← fi + ∆ (special case: ∆ ∈ {−1, +1}, corresponding to insertions/deletions).

Entropy: the empirical entropy of the data set is H(A) = Σ_{i∈[n]} (fi/m) · log(m/fi).

App: very useful in “change” (e.g., anomalous event) detection.

Frequency moments: Fp = Σ_i fi^p.

  • F0: number of distinct items.
  • F1: total number of items.
  • F2: size of self-join.

General Fp (p > 1) is a good measure of the skewness of the data.
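These definitions are easy to state as exact linear-space code; the streaming challenge is achieving them in sublinear space. A reference sketch (names are mine), using the base-2 log for the entropy:

```python
import math
from collections import Counter

def stream_stats(updates):
    """Exact (linear-space) reference for the quantities defined above.

    `updates` is a sequence of (i, delta) pairs; returns the empirical
    entropy H(A) and the frequency moments F0, F1, F2.
    """
    f = Counter()
    for i, delta in updates:
        f[i] += delta
    f = +f                      # drop items whose net frequency is zero
    m = sum(f.values())         # F1: total number of items
    H = sum(fi / m * math.log2(m / fi) for fi in f.values())
    F0 = len(f)                 # number of distinct items
    F2 = sum(fi * fi for fi in f.values())  # size of self-join
    return H, F0, m, F2
```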


SLIDE 16

Statistics (cont.)

Heavy hitters: the set of items whose frequency is at least a given threshold.

App: popular IP destinations, . . .

(Figure: frequency histogram of items 1-8; items with frequency ≥ 0.01m are included, where m = |A|.)
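The slides only define the problem; as an illustration, here is the classic Misra-Gries summary (not from the slides), a deterministic small-space sketch: with k − 1 counters, every item whose true frequency exceeds m/k survives, and each estimate undercounts by less than m/k.

```python
def misra_gries(stream, k):
    """Misra-Gries heavy-hitter summary with at most k-1 counters."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # Decrement all counters; drop the ones that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```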

Quantile:

The φ-quantile of A is some x such that at most φm items of A are smaller than x and at most (1 − φ)m items of A are greater than x. All-quantiles: a data structure from which the φ-quantile for any 0 ≤ φ ≤ 1 can be extracted.

App: distribution of packet sizes, . . .
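A brute-force check of the definition, plus an exact (linear-space) all-quantiles structure, a sorted copy indexed by rank; streaming algorithms instead maintain a small summary with an additive εm rank error. The helper names are mine.

```python
def is_phi_quantile(A, x, phi):
    """Check the definition above: x is a phi-quantile of A iff at most
    phi*m items of A are smaller than x and at most (1-phi)*m are greater."""
    m = len(A)
    return (sum(1 for a in A if a < x) <= phi * m
            and sum(1 for a in A if a > x) <= (1 - phi) * m)

def all_quantiles(A):
    """Exact all-quantiles structure: a sorted copy of A; the phi-quantile
    is read off at rank floor(phi * m), clamped to the last index."""
    S = sorted(A)
    return lambda phi: S[min(int(phi * len(S)), len(S) - 1)]
```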

SLIDE 17

Statistics (cont.)

Lp sampling: Let x ∈ R^n be a non-zero vector. For p > 0, the Lp distribution corresponding to x is the distribution on [n] that takes i with probability |xi|^p / ‖x‖p^p, where ‖x‖p = (Σ_{i∈[n]} |xi|^p)^{1/p}. In particular, for p = 0, L0 sampling selects an element uniformly at random from the non-zero coordinates of x.

App: an extremely useful tool for constructing graph sketches, finding duplicates, etc.
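With full access to x the Lp distribution is trivial to sample from; the streaming problem is doing the same from a small sketch of x. An exact reference sketch (the name `lp_sample` is mine):

```python
import random

def lp_sample(x, p):
    """Draw an index from the Lp distribution of the vector x, exactly.

    For p > 0, index i is returned with probability |x_i|^p / ||x||_p^p.
    For p == 0, a uniformly random non-zero coordinate is returned.
    """
    if p == 0:
        support = [i for i, xi in enumerate(x) if xi != 0]
        return random.choice(support)
    weights = [abs(xi) ** p for xi in x]
    return random.choices(range(len(x)), weights=weights)[0]
```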


SLIDE 20

Graphs

Denote the stream by A = a1, a2, . . . , am, where ai = ((ui, vi), insert/delete), where (ui, vi) is an edge.

Connectivity: Test whether the graph is connected.
Matching: Estimate the size of the maximum matching of the graph.
Diameter: Compute the diameter of the graph (the maximum distance between two nodes).
Triangle counting: Compute the number of triangles in the graph.

App: useful for finding communities in a social network (e.g., the fraction of v’s neighbors that are themselves neighbors).
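For contrast with the dynamic (insert/delete) setting, the insert-only case of connectivity has a simple one-pass solution; a sketch using union-find, which stores O(n) words, sublinear in the stream length m (handling deletions is what requires the L0-sampling-based graph sketches):

```python
def connected(n, edges):
    """One-pass connectivity test for an insert-only edge stream
    over vertices 0..n-1, via union-find with path halving."""
    parent = list(range(n))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    components = n
    for u, v in edges:                     # process each edge once
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1
    return components == 1
```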


SLIDE 22

Graphs (cont.)

Spanner: Given a graph G = (V, E), we say that a subgraph H = (V, E′) is an α-spanner for G if for all u, v ∈ V,
  dG(u, v) ≤ dH(u, v) ≤ α · dG(u, v).
A spanner (approximately) maintains pairwise distances.

Graph sparsification: Given a graph G = (V, E), denote the minimum cut of G by λ(G), and by λA(G) the capacity of the cut (A, V \ A). We say that a weighted subgraph H = (V, E′, w) is an ε-sparsification for G if for all A ⊂ V,
  (1 − ε) · λA(G) ≤ λA(H) ≤ (1 + ε) · λA(G).

App: synopses for massive graphs. A graph synopsis is a subgraph of much smaller size that keeps properties of the original graph.

SLIDE 25

Geometry

Denote the stream by A = a1, a2, . . . , am, where ai = (location, ins/del).

Clustering (k-center): Cluster a set of points X = (x1, x2, . . . , xm) into clusters c1, c2, . . . , ck with representatives r1 ∈ c1, r2 ∈ c2, . . . , rk ∈ ck so as to minimize
  max_i min_j d(xi, rj).

App: (see wiki page)
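Not from the slides, but the standard offline baseline for the k-center objective above is Gonzalez's farthest-first traversal, a classic 2-approximation; streaming k-center algorithms maintain this kind of solution under arrivals.

```python
def greedy_k_center(points, k, d):
    """Farthest-first traversal (Gonzalez): 2-approximation of
    the k-center objective max_i min_j d(x_i, r_j)."""
    reps = [points[0]]                       # arbitrary first center
    dist = [d(p, reps[0]) for p in points]   # distance to nearest center
    while len(reps) < k:
        far = max(range(len(points)), key=lambda i: dist[i])
        reps.append(points[far])             # open the farthest point
        dist = [min(dist[i], d(points[i], points[far]))
                for i in range(len(points))]
    return reps, max(dist)                   # centers and objective value
```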

Earth-mover distance: Given two multisets A, B in the grid [∆]^2 of the same size, the earth-mover distance is defined as the minimum cost of a perfect matching between points in A and B:
  EMD(A, B) = min_{π: A → B a bijection} Σ_{a∈A} ‖a − π(a)‖.

App: a good measure of the similarity of two images.
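The EMD definition can be evaluated brute-force by trying every bijection, exponential time, so this is only a reference for tiny point sets (the streaming problem is approximating it in small space):

```python
from itertools import permutations

def emd(A, B):
    """Brute-force earth-mover distance: minimum Euclidean matching
    cost over all bijections pi: A -> B, per the definition above."""
    assert len(A) == len(B)
    def cost(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    return min(sum(cost(a, b) for a, b in zip(A, perm))
               for perm in permutations(B))
```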


SLIDE 28

Strings

Denote the stream by A = a1, a2, . . . , am, where ai = (i, ins/del).

Distance to sortedness:
LIS(A) = length of the longest increasing subsequence of the sequence A.
DistSort(A) = minimum number of elements that must be deleted from A to obtain a sorted sequence = |A| − LIS(A).

App: a good measurement of network latency.
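Offline, DistSort reduces to the classic O(n log n) patience-sorting computation of the LIS (a reference sketch; the streaming versions approximate this in small space):

```python
from bisect import bisect_left

def dist_sort(A):
    """DistSort(A) = |A| - LIS(A), via patience sorting."""
    tails = []   # tails[k] = smallest tail of an increasing subsequence of length k+1
    for x in A:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)   # x extends the longest subsequence
        else:
            tails[k] = x      # x gives a smaller tail for length k+1
    return len(A) - len(tails)  # deletions needed to sort A
```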

Edit distance: Given two strings A and B, the minimum number of insertions/deletions/substitutions needed to convert A into B.

App: a standard measure of the similarity of two strings/documents.
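The definition above is computed offline by the classic O(|A|·|B|) dynamic program (a reference sketch, kept to two rolling rows):

```python
def edit_distance(A, B):
    """Minimum insertions/deletions/substitutions turning A into B."""
    prev = list(range(len(B) + 1))          # distance from A[:0] to B[:j]
    for i, a in enumerate(A, 1):
        cur = [i]                           # distance from A[:i] to B[:0]
        for j, b in enumerate(B, 1):
            cur.append(min(prev[j] + 1,             # delete a
                           cur[j - 1] + 1,          # insert b
                           prev[j - 1] + (a != b))) # substitute / match
        prev = cur
    return prev[len(B)]
```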


SLIDE 31

Numerical linear algebra

Denote the stream by A = a1, a2, . . . , an, where ak = (i, j, ∆) denotes the update M[i, j] ← M[i, j] + ∆, where M[i, j] is the cell in the i-th row, j-th column of matrix M.

Regression: Given an n × d matrix M and an n × 1 vector b, one seeks x∗ = argmin_x ‖Mx − b‖p, for some p ∈ [1, ∞).

Low-rank approximation: Given an n × m matrix M, find an orthonormal n × k matrix L, an orthonormal m × k matrix W, and a diagonal k × k matrix D (k < min{n, m}) such that ‖M − L·D·W^T‖F is minimized, where ‖·‖F is the Frobenius norm.

App: fundamental problems in many areas, including machine learning, recommendation systems, natural language processing, etc.
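For p = 2 the regression problem above has a closed form via the normal equations M^T·M·x = M^T·b. A pure-Python reference sketch (assumes M has full column rank; the function name is mine):

```python
def least_squares(M, b):
    """x* = argmin_x ||Mx - b||_2 via the normal equations,
    solved by Gaussian elimination with partial pivoting."""
    n, d = len(M), len(M[0])
    # Build the d x d system A x = c with A = M^T M, c = M^T b.
    A = [[sum(M[k][i] * M[k][j] for k in range(n)) for j in range(d)]
         for i in range(d)]
    c = [sum(M[k][i] * b[k] for k in range(n)) for i in range(d)]
    # Forward elimination.
    for i in range(d):
        p = max(range(i, d), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        c[i], c[p] = c[p], c[i]
        for r in range(i + 1, d):
            f = A[r][i] / A[i][i]
            for j in range(i, d):
                A[r][j] -= f * A[i][j]
            c[r] -= f * c[i]
    # Back substitution.
    x = [0.0] * d
    for i in reversed(range(d)):
        x[i] = (c[i] - sum(A[i][j] * x[j] for j in range(i + 1, d))) / A[i][i]
    return x
```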

SLIDE 32

Sliding windows

Sometimes we are only interested in recent items in the stream.

Time-based sliding window: the w most recent time steps.
Sequence-based sliding window: the w most recent items.

(Diagram: a CPU with RAM sees only the current window of the stream.)

SLIDE 33

Lower bounds

What is impossible? That is, what is the limit on the space needed to solve a problem?

Usually shown by reductions from communication complexity. (Left for a future lecture.)

SLIDE 34

Thank you!