B669 Sublinear Algorithms for Big Data
Qin Zhang

Part 1: Sublinear in Space
The model and challenge

The data stream model (Alon, Matias and Szegedy 1996)

[Figure: items a_1, a_2, ..., a_n stream past a CPU with a small RAM]

Why hard? Cannot store everything.
Applications: Internet router, stock data, ad auction, flight logs on tape, etc.
[Figure: stream 9, 7, 6, 3, 3, 9, ... flowing past a CPU with a small RAM]

Which items are most frequent? Approximation allowed.
More about the streaming model

Denote the stream by A = a_1, ..., a_m, where m = n^{O(1)} is the length of the stream, which is unknown at the beginning. Let [n] be the item universe. Let x_j be the frequency of item j in the stream. Each a_i = (j, ∆) denotes the update x_j ← x_j + ∆.

We call an algorithm insertion-only if it only works for ∆ = 1.

We can represent the stream as a vector x = (x_1, ..., x_n); when a_i = (j, ∆) arrives, x_j ← x_j + ∆.
– For insertion-only streams, m = ‖x‖_1.
The MAJORITY problem

MAJORITY: if ∃j : x_j > m/2, then output j; otherwise the output can be arbitrary.
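The slide leaves the algorithm for discussion; a classic one-pass solution (the Boyer-Moore majority vote, added here for illustration — Misra-Gries on the next slide generalizes it) can be sketched as:

```python
def majority_candidate(stream):
    """Boyer-Moore majority vote: one pass, O(1) words of memory.
    If some j has frequency > m/2, it is returned; otherwise the
    output is arbitrary (a second pass would verify the candidate)."""
    candidate, count = None, 0
    for e in stream:
        if count == 0:
            candidate, count = e, 1
        elif e == candidate:
            count += 1
        else:
            count -= 1
    return candidate
```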
Heavy hitters and point queries

L_p heavy hitter set: HH^p_φ(x) = {i : |x_i| ≥ φ‖x‖_p}

L_p Heavy Hitter Problem: Given φ, φ′ (often φ′ = φ − ε), return a set S such that HH^p_φ(x) ⊆ S ⊆ HH^p_{φ′}(x).

L_p Point Query Problem: Given ε, after reading the whole stream and given an index i, report x̃_i = x_i ± ε‖x‖_p.
The Misra-Gries algorithm

The algorithm (Misra-Gries ’82)

Maintain a set A of (item, counter) pairs. When an item e arrives:
(a) if e ∈ A then set (e, x̃_e) ← (e, x̃_e + 1);
(b) else if |A| < 1/ε, add (e, 1) to A;
(c) else, for each e′ ∈ A, set (e′, x̃_{e′}) ← (e′, x̃_{e′} − 1), and if x̃_{e′} − 1 = 0, then remove (e′, 0) from A.

Analysis (on board)

Theorem. Misra-Gries uses O(1/ε · log n) bits, and for any j, produces an estimate x̃_j satisfying x_j − εm ≤ x̃_j ≤ x_j.
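A runnable Python sketch of the steps above (dictionary-based, illustrative):

```python
def misra_gries(stream, eps):
    """Misra-Gries '82: keep at most 1/eps counters. Every estimate
    satisfies x_j - eps*m <= est_j <= x_j (absent items estimate 0)."""
    k = int(1 / eps)
    counters = {}
    for e in stream:
        if e in counters:
            counters[e] += 1                 # case (a): known item
        elif len(counters) < k:
            counters[e] = 1                  # case (b): room for a new item
        else:
            for key in list(counters):       # case (c): decrement everyone
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```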
Space-saving: an algorithm for insertion only

Algorithm Space-saving [Metwally et al. ’05]

Maintain an array of (e, x̃_e) tuples, kept sorted by x̃_e, and let MIN = min{x̃_e : e is in the array} — since the array is sorted, MIN is just the estimated frequency of the last item. When a new item e arrives, we have two cases:
– if e is in the array, increment x̃_e by 1 and reinsert (e, x̃_e) into the array;
– else, insert (e, MIN + 1); if the length of the array is then larger than 1/ε, we delete the last tuple.
At the query of e, report x̃_e if e is in the array, otherwise report MIN.

(Analysis on board)

Theorem. Space-saving uses O(1/ε · log n) bits, and for any j, produces an estimate x̃_j satisfying x_j ≤ x̃_j ≤ x_j + εm.
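A minimal Python sketch of this update rule (dictionary-based rather than the sorted array on the slide; evicting the minimum and inheriting its count is equivalent to inserting (e, MIN + 1) and deleting the last tuple):

```python
def space_saving(stream, eps):
    """Space-saving [Metwally et al. '05]: at most 1/eps counters.
    Every estimate satisfies x_j <= est_j <= x_j + eps*m."""
    k = int(1 / eps)
    counters = {}
    for e in stream:
        if e in counters:
            counters[e] += 1
        elif len(counters) < k:
            counters[e] = 1
        else:
            # evict the minimum item; the newcomer inherits MIN and adds 1
            e_min = min(counters, key=counters.get)
            counters[e] = counters.pop(e_min) + 1
    return counters
```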
[Figure: stream 9, 7, 6, 3, 3, 9, ... flowing past a CPU with a small RAM]

How many distinct elements? Approximation needed.
Universal hash function

A family H ⊆ {h : X → Y} is said to be 2-universal if the following property holds, with h ∈_R H picked uniformly at random: ∀x ≠ x′ ∈ X, ∀y, y′ ∈ Y,

    Pr_h[h(x) = y ∧ h(x′) = y′] = 1/|Y|²
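A standard construction (Carter-Wegman style; the final mod-w reduction makes it approximately 2-universal, which suffices here) can be sketched as:

```python
import random

def make_hash(w, seed=0, p=2_147_483_647):
    """h(x) = ((a*x + b) mod p) mod w, with a prime p > |X| and
    a, b chosen at random: a standard (approximately) 2-universal family."""
    rng = random.Random(seed)
    a = rng.randrange(1, p)
    b = rng.randrange(p)
    return lambda x: ((a * x + b) % p) % w
```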
The Flajolet-Martin algorithm

The algorithm (Flajolet and Martin ’83)

Choose a 2-universal hash function h : [n] → [n]. Let zeros(h(e)) be the number of trailing zeros in the binary representation of h(e).

Maintain z ← 0. For each item e in the stream, set z ← max{z, zeros(h(e))}. At the end, output 2^{z + 1/2}.

Analysis (on board)

Theorem. The number of distinct elements can be O(1)-approximated with probability 2/3 using O(log n) bits.
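An illustrative Python version of this sketch (the hash family and the convention that h(e) = 0 counts as log n trailing zeros are implementation choices, not from the slide):

```python
import random

def flajolet_martin(stream, n, seed=0):
    """Flajolet-Martin style sketch: track the largest number of
    trailing zeros z of h(e) over the stream, output 2^(z + 1/2)."""
    rng = random.Random(seed)
    p = 2_147_483_647  # prime modulus for the 2-universal hash
    a, b = rng.randrange(1, p), rng.randrange(p)
    cap = n.bit_length()

    def zeros(v):
        # trailing zeros in v's binary representation (0 counts as cap)
        return cap if v == 0 else (v & -v).bit_length() - 1

    z = 0
    for e in stream:
        z = max(z, zeros(((a * e + b) % p) % n))
    return 2 ** (z + 0.5)
```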
Probability amplification
Can we boost the success probability to 1 − δ? The idea is to run k = Θ(log(1/δ)) copies of this algorithm in parallel, using mutually independent random hash functions, and output the median of the k answers.
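The median trick in code (run_copy is a hypothetical function mapping a seed to the output of one independent run of the base estimator):

```python
import statistics

def median_boost(run_copy, k):
    """Median-of-k amplification: run k = Theta(log(1/delta)) independent
    copies of a 2/3-success estimator and return the median. A Chernoff
    bound shows the median fails with probability exp(-Omega(k))."""
    return statistics.median(run_copy(seed) for seed in range(k))
```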
An improved algorithm

The algorithm (Bar-Yossef et al. ’02)

Idea: two-level hashing. Choose a 2-universal hash function h : [n] → [n] and a secondary 2-universal hash function g : [n] → [(log n/ε)^{O(1)}]. Maintain a threshold z ← 0 and a buffer B ← ∅. When an item e with zeros(h(e)) ≥ z arrives:
(a) set B ← B ∪ {(g(e), zeros(h(e)))};
(b) if |B| > c/ε² then set z ← z + 1 and remove all (α, β) in B with β < z.
At the end, output |B| · 2^z.

Analysis (on board)

Theorem. The number of distinct elements can be (1 + ε)-approximated with probability 2/3 using O(log n + 1/ε² · (log(1/ε) + log log n)) bits.
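A simplified Python sketch of this algorithm (it stores item ids directly instead of the space-saving secondary hash g, and the constant c = 24 is an arbitrary illustrative choice):

```python
import random

def bjkst(stream, n, eps, seed=0):
    """BJKST-style distinct-element estimator: keep only items whose
    hash has at least z trailing zeros; output |B| * 2^z at the end."""
    rng = random.Random(seed)
    p = 2_147_483_647
    a, b = rng.randrange(1, p), rng.randrange(p)
    cap = n.bit_length()

    def zeros(v):
        return cap if v == 0 else (v & -v).bit_length() - 1

    z, B = 0, {}
    limit = int(24 / eps ** 2)  # c/eps^2 for an illustrative constant c
    for e in stream:
        ze = zeros(((a * e + b) % p) % n)
        if ze >= z:
            B[e] = ze
            while len(B) > limit:   # raise the bar and prune the buffer
                z += 1
                B = {k: v for k, v in B.items() if v >= z}
    return len(B) * 2 ** z
```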
We have seen Misra-Gries, Space-saving, Flajolet-Martin and its improvement. Nice algorithms, but they only work for insertion-only sequences... Can we handle deletions?

A popular way is to use linear sketches.
Linear sketch

Random linear projection M : R^n → R^k, with k ≪ n, that preserves properties of any v ∈ R^n with high probability: store the sketch Mv and compute the answer from it.

Simple and useful. Perfect for streaming and distributed computations. Works for insertion+deletion sequences (that is, ∆ can be either positive or negative).
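The defining property takes only a few lines to see (a hypothetical ±1 projection is used; any fixed random M behaves the same way): a stream update (j, ∆) just adds ∆ times column j of M to the sketch, so deletions and merging come for free.

```python
import random

def make_projection(k, n, seed=0):
    """A fixed random +-1 matrix M (illustrative choice of sketch matrix)."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]

def apply_update(M, sketch, j, delta):
    """Update (j, delta): Mx changes by delta * (column j of M), so
    negative delta (a deletion) is handled identically."""
    for t in range(len(M)):
        sketch[t] += delta * M[t][j]
```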
Distinct elements using linear sketches

Search version ⇒ Decision version. Let D be the # distinct elements: try T = 1, (1 + ε), (1 + ε)², ...
Now, the decision problem

The algorithm: for a guess T, pick a random set S ⊆ [n] such that for each i we have Pr[i ∈ S] = 1/T, and maintain Sum_S(x) = Σ_{i∈S} x_i.

Note: this is a linear sketch.

Lemma. Let P = Pr[Sum_S(x) = 0]. If T is large enough, and ε is small enough, then P ≈ (1 − 1/T)^D ≈ e^{−D/T}, so the value of P separates the case D ≥ (1 + ε)T from the case D ≤ (1 − ε)T.

Proof (on board)
Amplify the success probability

Repeat to amplify the success probability: keep k = C log(1/δ)/ε² independent copies Sum_{S_j}(x), j = 1, ..., k, for a constant C > 0, and estimate P by the fraction of copies with Sum_{S_j}(x) = 0.

Lemma. If the constant C is large enough, then this algorithm reports a correct answer with probability 1 − δ.

Proof (on board)

Theorem. The number of distinct elements can be (1 ± ε)-approximated with probability 1 − δ using O(log² n · log(1/δ)/ε³) bits.
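An illustrative Python version of this subsampled-sum sketch (names and parameters are assumptions; a real implementation would define S via hash functions rather than explicitly stored sets, but the linearity — and hence deletion support — is the same):

```python
import random

class DistinctDecision:
    """Decision version for one guess T: keep k independent subsampled
    sums Sum_S(x). Each sum is linear in x, so negative Delta works."""
    def __init__(self, n, T, k, seed=0):
        rng = random.Random(seed)
        # S_j: each i in [n] joins independently with probability 1/T
        self.sets = [{i for i in range(n) if rng.random() < 1.0 / T}
                     for _ in range(k)]
        self.sums = [0] * k

    def update(self, j, delta):
        for t, S in enumerate(self.sets):
            if j in S:
                self.sums[t] += delta

    def zero_fraction(self):
        # empirical estimate of P = Pr[Sum_S(x) = 0], compared with 1/e
        return sum(s == 0 for s in self.sums) / len(self.sums)
```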
L1-point-query

Algorithm Count-Min [Cormode and Muthu ’05]

Pick d = O(log(1/δ)) hash functions h_t : {1, ..., n} → {1, ..., w} (w = 2/ε) from a 2-universal family. Maintain counters {Z^t_1, ..., Z^t_w} such that

    Z^t_j = Σ_{i : h_t(i) = j} x_i

At the query of i, report x̃_i = min_t Z^t_{h_t(i)}.

Analysis (on board)

Theorem. We can solve L1-point-query with approximation ε and failure probability δ by storing O(1/ε · log(1/δ) · log n) bits.
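A compact Python sketch of Count-Min (hash construction is an illustrative choice):

```python
import math, random

class CountMin:
    """Count-Min sketch: d rows of w counters. For nonnegative streams,
    min_t Z^t[h_t(i)] satisfies x_i <= est <= x_i + eps*||x||_1 w.p. 1-delta."""
    def __init__(self, eps, delta, seed=0):
        self.w = max(2, int(2 / eps))
        self.d = max(1, math.ceil(math.log(1 / delta)))
        rng = random.Random(seed)
        self.p = 2_147_483_647  # prime for the 2-universal hashes
        self.ab = [(rng.randrange(1, self.p), rng.randrange(self.p))
                   for _ in range(self.d)]
        self.Z = [[0] * self.w for _ in range(self.d)]

    def _h(self, t, i):
        a, b = self.ab[t]
        return ((a * i + b) % self.p) % self.w

    def update(self, i, delta=1):
        for t in range(self.d):
            self.Z[t][self._h(t, i)] += delta

    def query(self, i):
        # each row overestimates (for nonnegative x); take the minimum
        return min(self.Z[t][self._h(t, i)] for t in range(self.d))
```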
L2-point-query

Algorithm Count-Sketch [Charikar et al. ’02]

Pick d = O(log(1/δ)) hash functions h_t : {1, ..., n} → {1, ..., w} (w = 3/ε²) from a 2-universal family, and d sign functions g_t : {1, ..., n} → {−1, 1} from a 2-universal family. Maintain counters {Z^t_1, ..., Z^t_w} such that

    Z^t_j = Σ_{i : h_t(i) = j} g_t(i) · x_i

At the query of i, report x̃_i = median{g_1(i) · Z^1_{h_1(i)}, ..., g_d(i) · Z^d_{h_d(i)}}.

Analysis (on board)

Theorem. We can solve L2 point query, with approximation ε, and failure probability δ by storing O(1/ε² · log(1/δ) · log n) bits.
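A Python sketch of Count-Sketch (the sign function is derived from a second hash family; details are illustrative):

```python
import math, random, statistics

class CountSketch:
    """Count-Sketch: like Count-Min but with random signs g_t and a
    median, giving error eps*||x||_2 instead of eps*||x||_1."""
    def __init__(self, eps, delta, seed=0):
        self.w = max(2, int(3 / eps ** 2))
        self.d = max(1, math.ceil(math.log(1 / delta)))
        rng = random.Random(seed)
        self.p = 2_147_483_647
        self.hp = [(rng.randrange(1, self.p), rng.randrange(self.p))
                   for _ in range(self.d)]
        self.gp = [(rng.randrange(1, self.p), rng.randrange(self.p))
                   for _ in range(self.d)]
        self.Z = [[0] * self.w for _ in range(self.d)]

    def _h(self, t, i):
        a, b = self.hp[t]
        return ((a * i + b) % self.p) % self.w

    def _g(self, t, i):
        a, b = self.gp[t]
        return 1 if ((a * i + b) % self.p) % 2 == 0 else -1

    def update(self, i, delta=1):
        for t in range(self.d):
            self.Z[t][self._h(t, i)] += self._g(t, i) * delta

    def query(self, i):
        return statistics.median(self._g(t, i) * self.Z[t][self._h(t, i)]
                                 for t in range(self.d))
```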
L2-point-query (an alternative approach)

The algorithm [Gilbert, Kotidis, Muthukrishnan and Strauss ’01]

Maintain Rx, where R is an O(1/ε² · log(1/δ)) × n matrix, which can be constructed, e.g., by taking each cell to be N(0, 1). At the query of i, report

    x̃_i = (1 − ‖Rx/s − Re_i‖²₂ / 2) · s,

where s is an estimate of ‖x‖₂ (e.g., obtained from ‖Rx‖₂).

Theorem (Johnson-Lindenstrauss Lemma). ∀x, we have (1 − ε)‖x‖₂ ≤ ‖Rx‖₂ ≤ (1 + ε)‖x‖₂ w.p. 1 − δ.

Theorem. We can solve L2 point query, with approximation ε, and failure probability δ by storing O(1/ε² · log(1/δ) · log n) bits.
Algorithm for L0 sampling

Goal: sample an element from the support of x ∈ R^n.

Algorithm (the output can be thought of as Mx for a fixed matrix M):
Maintain F̃_0, a (1 ± 0.1)-approximation to F_0. For each level j, pick a hash function h_j that keeps each t with probability 2^{−j} (i.e., t survives if h_j(t) = 0), and maintain:
– D_j = (1 ± 0.1) · |{t | h_j(t) = 0}|
– S_j = Σ_{t : h_j(t) = 0} (x_t · t)
– C_j = Σ_{t : h_j(t) = 0} x_t

Lemma. At level j = 2 + ⌈log F̃_0⌉, there is a unique element in the stream that maps to 0 with constant probability.

Uniqueness is verified if D_j = 1 ± 0.1. If unique, then S_j/C_j gives the identity of the element and C_j is the count.

Theorem. We can solve L1 point query, with approximation ε, and failure probability δ by storing O(1/ε · log(1/δ)) numbers.

Analysis: (on the board)
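One level of this sketch, simplified for illustration (D_j is tracked exactly via a small dictionary here, instead of the approximate counter on the slide; the hash family is an implementation choice):

```python
import random

class L0SamplerLevel:
    """One level j of the L0-sampling sketch: items t with h_j(t) == 0
    survive, i.e., sampling rate 2^(-j). If exactly one support element
    survives, S/C recovers its identity and C its count."""
    def __init__(self, n, j, seed=0):
        rng = random.Random(seed)
        p = 2_147_483_647  # prime modulus for the hash
        self.a, self.b, self.p = rng.randrange(1, p), rng.randrange(p), p
        self.m = 2 ** j
        self.D = 0   # support size of the sampled substream (exact here)
        self.S = 0   # sum of t * x_t over sampled t
        self.C = 0   # sum of x_t over sampled t
        self.freq = {}  # demo-only bookkeeping to keep D exact

    def _sampled(self, t):
        return ((self.a * t + self.b) % self.p) % self.m == 0

    def update(self, t, delta):
        if not self._sampled(t):
            return
        old = self.freq.get(t, 0)
        new = old + delta
        self.freq[t] = new
        self.D += (old == 0) - (new == 0)
        self.S += t * delta
        self.C += delta

    def recover(self):
        if self.D == 1 and self.C != 0:
            return self.S // self.C, self.C  # (identity, count)
        return None
```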
Frequency moments and norms

Frequency moments: F_p = Σ_i |x_i|^p, where x_i is the frequency of item i.

A very good measurement of the skewness of the dataset.

Norms: L_p = F_p^{1/p}
L2 estimation

The sketch for L2: a linear sketch Rx = [Z_1, ..., Z_k], where each entry of R is drawn from N(0, 1).

Alternatively, Z_i ∼ ‖x‖₂ · G_i, where G_i is drawn from N(0, 1).

The estimator: Y = median{|Z_1|, ..., |Z_k|} / median{|G|}, where G ∼ N(0, 1). (Here M is the median of a random variable R if Pr[|R| ≤ M] = 1/2.)

Sounds like magic? The intuition behind it: for “nice”-looking distributions (e.g., the Gaussian), the median of the samples, for a large enough number of samples, should converge to the median of the distribution.
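A seeded Python sketch of this estimator (the constant 0.6745 is the median of |G| for G ∼ N(0, 1), i.e., the z with Φ(z) = 3/4):

```python
import random
import statistics

MEDIAN_ABS_GAUSS = 0.6745  # median of |G|, G ~ N(0, 1)

def l2_sketch(x, k, seed=0):
    """Z = Rx with i.i.d. N(0,1) entries of R; by 2-stability each
    Z_i is distributed as ||x||_2 * G_i."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0, 1) * xi for xi in x) for _ in range(k)]

def l2_estimate(z):
    # median{|Z_i|} converges to ||x||_2 * median{|G|}; rescale
    return statistics.median(abs(v) for v in z) / MEDIAN_ABS_GAUSS
```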
The proof

Closeness in probability: Let U_1, ..., U_k be i.i.d. real random variables chosen from any distribution having continuous c.d.f. F and median M. Defining U = median{U_1, ..., U_k}, there is an absolute constant C > 0 such that Pr[F(U) ∈ (1/2 − ε, 1/2 + ε)] ≥ 1 − e^{−Ckε²}.

Closeness in value: Let F be the c.d.f. of a random variable |G|, G drawn from N(0, 1). There exists an absolute constant C′ > 0 such that if for some z ≥ 0 we have F(z) ∈ (1/2 − ε, 1/2 + ε), then z = M ± C′ε.

Theorem. Y = ‖x‖₂ (M ± C′ε)/M = ‖x‖₂ (1 ± C″ε), w.h.p.
Generalization

Key property of the Gaussian distribution: if U_1, ..., U_n and U are i.i.d. drawn from the Gaussian distribution, then x_1U_1 + ... + x_nU_n ∼ ‖x‖_p · U for p = 2.

Such distributions are called “p-stable” [Indyk ’06]. Good news: p-stable distributions exist for any p ∈ (0, 2].

For p = 1, we get the Cauchy distribution, with density function f(x) = 1/[π(1 + x²)].
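The p = 1 case in code (an illustrative Cauchy sketch; samples are drawn via the inverse c.d.f., and no rescaling constant is needed because median{|Cauchy|} = tan(π/4) = 1):

```python
import math
import random
import statistics

def l1_sketch(x, k, seed=0):
    """Indyk-style p-stable sketch for p = 1: R has i.i.d. standard
    Cauchy entries, so each Z_i is distributed as ||x||_1 * Cauchy."""
    rng = random.Random(seed)
    cauchy = lambda: math.tan(math.pi * (rng.random() - 0.5))
    return [sum(cauchy() * xi for xi in x) for _ in range(k)]

def l1_estimate(z):
    # median{|Z_i|} converges to ||x||_1 * median{|Cauchy|} = ||x||_1
    return statistics.median(abs(v) for v in z)
```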
Attribution

Some of the contents are borrowed from Amit Chakrabarti’s course http://www.cs.dartmouth.edu/~ac/Teach/CS49-Fall11/, Piotr Indyk’s course http://stellar.mit.edu/S/course/6/fa07/6.895/, and Andrew McGregor’s course http://people.cs.umass.edu/~mcgregor/courses/CS711S12/index.html