Algorithms for Big Data (IV)
Chihao Zhang, Shanghai Jiao Tong University
Oct. 11, 2019

Review of the Last Lecture

Last time, we introduced the AMS algorithm for counting distinct elements in the streaming model.

We are given a sequence of numbers $\langle a_1, \dots, a_m \rangle$ where each $a_i \in [n]$.

It defines a frequency vector $f = (f_1, \dots, f_n)$ where $f_i = |\{ k \in [m] : a_k = i \}|$.

We want to compute the number $d = |\{ i \in [n] : f_i > 0 \}|$.

Algorithm: AMS Algorithm for Counting Distinct Elements

Init:
    A random hash function $h : [n] \to [n]$ from a 2-universal family
    $Z \leftarrow 0$
On Input $y$:
    if $\mathrm{zeros}(h(y)) > Z$ then
        $Z \leftarrow \mathrm{zeros}(h(y))$
    end if
Output: $\hat{d} = 2^{Z + 1/2}$
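The AMS pseudocode can be sketched in Python roughly as follows. This is a minimal illustration, not the lecture's own code: the hash family `((a*x + b) mod P) mod n`, the prime `P`, the helper names `make_hash` and `AMS`, and the convention `zeros(0) = 0` are all assumptions made for the example.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, assumed to exceed the universe size n


def make_hash(n, rng):
    """Draw h(x) = ((a*x + b) mod P) mod n (assumed stand-in for a 2-universal family)."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % n


def zeros(v):
    """Number of trailing zero bits of v; zeros(0) = 0 is a simplifying convention."""
    if v == 0:
        return 0
    return (v & -v).bit_length() - 1


class AMS:
    def __init__(self, n, seed=None):
        self.h = make_hash(n, random.Random(seed))
        self.z = 0  # largest number of trailing zeros seen so far

    def update(self, y):
        self.z = max(self.z, zeros(self.h(y)))

    def estimate(self):
        return 2 ** (self.z + 0.5)
```

Because the state depends only on the set of hashed values, feeding the same element repeatedly never changes `z`, which is what makes the sketch count distinct elements rather than stream length.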

Using $O(\log \frac{1}{\delta} \log n)$ bits of memory, we can obtain
$$\Pr\left[ \frac{d}{3} \le \hat{d} \le 3d \right] \ge 1 - \delta.$$

We also introduced the BJKST algorithm, a refinement of the AMS algorithm.

We will show today that the BJKST algorithm can produce $\hat{d}$ which is a $1 \pm \varepsilon$ approximation of $d$ for any $\varepsilon > 0$.

The BJKST Algorithm

The following refinement is due to Bar-Yossef, Jayram, Kumar, Sivakumar and Trevisan.

Algorithm: BJKST Algorithm for Counting Distinct Elements

Init:
    Random hash functions $h : [n] \to [n]$, $g : [n] \to [b \varepsilon^{-4} \log^2 n]$, both from 2-universal families
    $Z \leftarrow 0$, $B \leftarrow \emptyset$
On Input $y$:
    if $\mathrm{zeros}(h(y)) \ge Z$ then
        $B \leftarrow B \cup \{ (g(y), \mathrm{zeros}(h(y))) \}$
        while $|B| \ge c / \varepsilon^2$ do
            $Z \leftarrow Z + 1$
            Remove all $(\alpha, \beta)$ with $\beta < Z$ from $B$
        end while
    end if
Output: $\hat{d} = |B| \cdot 2^Z$
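A minimal Python sketch of the BJKST update rule, under stated assumptions: the hash family and prime `P` are illustrative choices, the default `c=24` is an arbitrary placeholder rather than a value from the lecture, and `g` defaults to the identity map (pass your own hashed `g` to mimic the range $b \varepsilon^{-4} \log^2 n$).

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, assumed to exceed the universe size n


def make_hash(m, rng):
    """Draw x -> ((a*x + b) mod P) mod m (assumed stand-in for a 2-universal family)."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m


def zeros(v):
    """Trailing zero bits of v; zeros(0) = 0 is a simplifying convention."""
    return (v & -v).bit_length() - 1 if v else 0


class BJKST:
    def __init__(self, n, eps, c=24, g=None, seed=None):
        rng = random.Random(seed)
        self.h = make_hash(n, rng)
        self.g = g if g is not None else (lambda y: y)  # identity by default
        self.cap = c / eps ** 2  # the bucket must stay strictly below this size
        self.z = 0
        self.bucket = set()  # pairs (g(y), zeros(h(y)))

    def update(self, y):
        t = zeros(self.h(y))
        if t >= self.z:
            self.bucket.add((self.g(y), t))
            while len(self.bucket) >= self.cap:
                self.z += 1
                # Drop every pair whose level fell below the new threshold.
                self.bucket = {(a, b) for (a, b) in self.bucket if b >= self.z}

    def estimate(self):
        return len(self.bucket) * 2 ** self.z
```

With a very small `eps` the cap is never reached, `z` stays at 0, and the estimate equals the exact number of distinct elements; shrinking the cap trades that accuracy for memory.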

The algorithm maintains a bucket $B$, which stores those $y$ whose $\mathrm{zeros}(h(y))$ is at least the current $Z$.

We set a cap $L = \frac{c}{\varepsilon^2}$ for the size of $B$:
▶ if $L = \infty$, $B$ stores all entries, and the algorithm is exact;
▶ if $L = 2$, the algorithm is equivalent to AMS.

Therefore, the size of $B$ is a trade-off between the memory consumption and the accuracy of the algorithm.

Analysis

To analyze the algorithm, we first assume that $g$ is simply the identity function from $[n]$ to $[n]$, namely $g(y) = y$ for all $y \in [n]$.

We need to store the whole $B$, whose size is $O(\varepsilon^{-2})$.

Similar to AMS, for every $k \in [n]$, $X_{k,r}$ is the indicator that $h(k)$ has at least $r$ trailing zeros.

Define $Y_r = \sum_{k \in [n] : f_k > 0} X_{k,r}$ as the number of $h(a_i)$ with at least $r$ trailing zeros.

We already know from the last lecture that $E[Y_r] = \frac{d}{2^r}$ and $\mathrm{Var}[Y_r] \le \frac{d}{2^r}$.

If $Z = t$ at the end of the algorithm, then $Y_t = |B|$ and $\hat{d} = Y_t 2^t$.

We use $A$ to denote the bad event that $\left| Y_t 2^t - d \right| \ge \varepsilon d$, or equivalently
$$\left| Y_t - \frac{d}{2^t} \right| \ge \frac{\varepsilon d}{2^t}.$$

We will bound the probability of $A$ using the following argument:
▶ if $t$ is small, then $E[Y_t] = \frac{d}{2^t}$ is large, so we can apply concentration inequalities;
▶ the value $t$ is unlikely to be very large.

We let $s$ be the threshold for the small/large values mentioned above.

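$$
\begin{aligned}
\Pr[A] &= \sum_{r=1}^{\log n} \Pr\left[ \left| Y_r - \frac{d}{2^r} \right| \ge \frac{\varepsilon d}{2^r} \wedge t = r \right] \\
&\le \sum_{r=1}^{s-1} \Pr\left[ \left| Y_r - \frac{d}{2^r} \right| \ge \frac{\varepsilon d}{2^r} \right] + \sum_{r=s}^{\log n} \Pr[t = r] \\
&= \sum_{r=1}^{s-1} \Pr\left[ \left| Y_r - E[Y_r] \right| \ge \frac{\varepsilon d}{2^r} \right] + \Pr\left[ Y_{s-1} \ge \frac{c}{\varepsilon^2} \right] \\
&\le \sum_{r=1}^{s-1} \frac{2^r}{\varepsilon^2 d} + \frac{\varepsilon^2 d}{c\, 2^{s-1}} \le \frac{2^s}{\varepsilon^2 d} + \frac{\varepsilon^2 d}{c\, 2^{s-1}}.
\end{aligned}
$$

So if we choose $s$ such that $\frac{d}{2^s} = \Theta(\varepsilon^{-2})$, $\Pr[A]$ can be bounded by any constant (depending on $c$).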
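The last inequality in the bound on $\Pr[A]$ appears to combine Chebyshev's inequality (for each $r < s$) with Markov's inequality (for the tail term). A short sketch of both steps, using only $E[Y_r] = d/2^r$ and $\mathrm{Var}[Y_r] \le d/2^r$ from the analysis:

```latex
% Chebyshev, applied to each r < s:
\Pr\left[ |Y_r - E[Y_r]| \ge \frac{\varepsilon d}{2^r} \right]
  \le \frac{\mathrm{Var}[Y_r]}{(\varepsilon d / 2^r)^2}
  \le \frac{d}{2^r} \cdot \frac{2^{2r}}{\varepsilon^2 d^2}
  = \frac{2^r}{\varepsilon^2 d}.

% Markov, applied to the nonnegative variable Y_{s-1}:
\Pr\left[ Y_{s-1} \ge \frac{c}{\varepsilon^2} \right]
  \le \frac{E[Y_{s-1}]}{c / \varepsilon^2}
  = \frac{\varepsilon^2 d}{c \, 2^{s-1}}.

% Geometric sum used in the final step:
\sum_{r=1}^{s-1} 2^r = 2^s - 2 \le 2^s.
```

The tail term arises because $t \ge s$ can only happen if the bucket overflowed while $Z = s - 1$, that is, if at least $c/\varepsilon^2$ distinct elements had $s - 1$ or more trailing zeros.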

Space Complexity

We need to store
▶ the function $h$: $O(\log n)$;
▶ the function $g$: $O(\log n)$;
▶ the bucket $B$: $O\left( \frac{c}{\varepsilon^2} \cdot \log \mathrm{ran}(g) \right) = O\left( \frac{c}{\varepsilon^2} \log n \right)$.

The bottleneck is to store $B$.

Instead of using the identity function $g$, we can tolerate collisions (with at most constant probability).

This helps to reduce the memory needed (Exercise).

Frequency Estimation

Consider a stream of numbers $\langle a_1, \dots, a_m \rangle$ and its frequency vector $f = (f_1, \dots, f_n)$.

Another fundamental problem is to estimate $f_a$ for each query $a \in [n]$.

It is closely related to the Frequency problem, which asks for the set $\{ j : f_j > m/k \}$.

We now describe a deterministic algorithm for Frequency-Estimation.

Misra-Gries Algorithm

Algorithm: Misra-Gries Algorithm for Frequency-Estimation

Init: A table $A$
On Input $y$:
    if $y \in \mathrm{keys}(A)$ then
        $A[y] \leftarrow A[y] + 1$
    else if $|\mathrm{keys}(A)| \le k - 1$ then
        $A[y] \leftarrow 1$
    else
        for all $\ell \in \mathrm{keys}(A)$ do
            $A[\ell] \leftarrow A[\ell] - 1$
            if $A[\ell] = 0$ then
                Remove $\ell$ from $A$
            end if
        end for
    end if

Algorithm: Misra-Gries (cont'd)

Output: On query $j$,
    if $j \in \mathrm{keys}(A)$ then
        $\hat{f}_j = A[j]$
    else
        $\hat{f}_j = 0$
    end if
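The Misra-Gries update and query rules translate almost line by line into Python. The following is a minimal sketch using a plain `dict` as the table; the class and method names are illustrative, not from the lecture.

```python
class MisraGries:
    def __init__(self, k):
        self.k = k
        self.table = {}  # key -> counter

    def update(self, y):
        if y in self.table:
            self.table[y] += 1
        elif len(self.table) <= self.k - 1:
            self.table[y] = 1
        else:
            # Table is full: decrement every counter and drop those that hit zero.
            for key in list(self.table):
                self.table[key] -= 1
                if self.table[key] == 0:
                    del self.table[key]

    def query(self, j):
        return self.table.get(j, 0)


stream = [1, 1, 1, 2, 3, 1, 4, 1]
mg = MisraGries(k=2)
for y in stream:
    mg.update(y)
print(mg.query(1))  # prints 4, while the true frequency of 1 is 5
```

Iterating over `list(self.table)` rather than the dict itself avoids mutating the table while it is being traversed.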

Analysis

The algorithm uses $O(k(\log m + \log n))$ bits of memory.

It is not hard to see that for each $j \in [n]$, the output $\hat{f}_j$ satisfies
$$f_j - \frac{m}{k} \le \hat{f}_j \le f_j.$$

If $f_j > m/k$, then $j$ is in the table $A$. The converse does not hold!

In Misra-Gries, we compute a table $A$.

The table $A$ stores information about the stream, so we can extract frequencies from it.

However, Misra-Gries suffers from the following main drawbacks:
▶ given two tables $A_1$ and $A_2$ with respect to $\sigma_1$ and $\sigma_2$ respectively, we don't know how to obtain the table for $\sigma_1 \circ \sigma_2$ (algorithms with this property are called sketches);
▶ it does not extend to the turnstile model.

In the turnstile model, each entry of the stream is a pair $(a_j, \Delta_j)$. Upon receiving $(a_j, \Delta_j)$, we update $f_{a_j}$ to $f_{a_j} + \Delta_j$.

Count Sketch

Algorithm: Count Sketch

Init:
    An array $C[j]$ for $j \in [k]$ where $k = \frac{3}{\varepsilon^2}$.
    A random hash function $h : [n] \to [k]$ from a 2-universal family.
    A random hash function $g : [n] \to \{-1, 1\}$ from a 2-universal family.
On Input $(y, \Delta)$:
    $C[h(y)] \leftarrow C[h(y)] + \Delta \cdot g(y)$
Output: On query $a$:
    Output $\hat{f}_a = g(a) \cdot C[h(a)]$.
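A single-row Count Sketch can be sketched in Python as below. The assumptions here are not from the lecture: counters start at zero, $k$ is rounded up to an integer, and both hash functions come from the illustrative family `((a*x + b) mod P)` reduced mod `k` (buckets) or mod 2 (signs), which only approximates a 2-universal family.

```python
import math
import random

P = (1 << 61) - 1  # a Mersenne prime, assumed to exceed the universe size n


class CountSketch:
    def __init__(self, eps, seed=None):
        rng = random.Random(seed)
        self.k = math.ceil(3 / eps ** 2)
        self.c = [0] * self.k  # counters, assumed to start at zero
        self.ah, self.bh = rng.randrange(1, P), rng.randrange(P)
        self.ag, self.bg = rng.randrange(1, P), rng.randrange(P)

    def h(self, x):
        """Bucket hash into {0, ..., k-1}."""
        return ((self.ah * x + self.bh) % P) % self.k

    def g(self, x):
        """Sign hash into {-1, +1}."""
        return 1 if ((self.ag * x + self.bg) % P) % 2 == 0 else -1

    def update(self, y, change=1):
        """Process a turnstile update (y, change); change may be negative."""
        self.c[self.h(y)] += change * self.g(y)

    def query(self, a):
        return self.g(a) * self.c[self.h(a)]
```

Since every update is a signed addition into one counter, the sketch handles negative updates directly, and two sketches built with the same hash functions can be combined by adding their arrays entrywise.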

Analysis

Let $X = \hat{f}_a$ be the output on the query $a$.

For every $j \in [n]$, let $Y_j$ be the indicator of $h(j) = h(a)$.

We have
$$X = g(a) \cdot \sum_{j=1}^{n} f_j \cdot g(j) \cdot Y_j.$$

$$E[X] = E\left[ g(a) \cdot g(a) \cdot f_a \cdot Y_a + \sum_{j \in [n] \setminus \{a\}} g(a) \cdot f_j \cdot g(j) \cdot Y_j \right] = f_a.$$

Let $Z \triangleq \sum_{j \in [n] \setminus \{a\}} f_j \cdot g(a) \cdot g(j) \cdot Y_j$, then $X = f_a + Z$ and $\mathrm{Var}[X] = \mathrm{Var}[Z]$.

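$$
\begin{aligned}
E[Z^2] &= E\left[ \left( \sum_{j \in [n] \setminus \{a\}} f_j \cdot g(a) \cdot g(j) \cdot Y_j \right)^2 \right] \\
&= E\left[ \sum_{j \in [n] \setminus \{a\}} f_j^2 \cdot Y_j^2 + \sum_{j, j' \in [n] \setminus \{a\} : j \ne j'} f_j \cdot f_{j'} \cdot g(j) \cdot g(j') \cdot Y_j \cdot Y_{j'} \right] \\
&= E\left[ \sum_{j \in [n] \setminus \{a\}} f_j^2 \cdot Y_j^2 \right] = \sum_{j \in [n] \setminus \{a\}} f_j^2 \cdot E\left[ Y_j^2 \right].
\end{aligned}
$$

The cross terms vanish because $g$ is drawn independently of $h$ and $E[g(j) \cdot g(j')] = 0$ for $j \ne j'$.

Note that for every $j \ne a$,
$$E\left[ Y_j^2 \right] = E\left[ Y_j \right] = \Pr[h(j) = h(a)] = \frac{1}{k}.$$

Therefore
$$E[Z^2] = \frac{\sum_{j \in [n] \setminus \{a\}} f_j^2}{k} \le \frac{\|f\|_2^2}{k}.$$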

$$\mathrm{Var}[X] = \mathrm{Var}[Z] = E[Z^2] - (E[Z])^2 \le \frac{\|f\|_2^2}{k}.$$

By Chebyshev,
$$\Pr\left[ \left| \hat{f}_a - f_a \right| \ge \varepsilon \|f\|_2 \right] \le \frac{1}{k \varepsilon^2} = \frac{1}{3}.$$

We can then use the Median trick to boost the algorithm so that
▶ $\Pr\left[ \left| \hat{f}_a - f_a \right| \ge \varepsilon \|f\|_2 \right] \le \delta$;
▶ it costs $O\left( \frac{1}{\varepsilon^2} \log \frac{1}{\delta} (\log m + \log n) \right)$ bits of memory.

Compare the performance (in terms of accuracy and space consumption) of Misra-Gries and Count Sketch (Exercise).
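The Median trick can be sketched as running several independent Count Sketch rows and answering each query with the median of the per-row estimates. In the sketch below, the hash family, the prime `P`, zero-initialised counters, and in particular the constant 18 in the number of rows are hypothetical placeholders; the lecture only states that $O(\log \frac{1}{\delta})$ copies suffice.

```python
import math
import random
import statistics

P = (1 << 61) - 1  # a Mersenne prime, assumed to exceed the universe size n


def make_row(k, rng):
    """One Count Sketch row: (counters, bucket-hash params, sign-hash params)."""
    bucket_params = (rng.randrange(1, P), rng.randrange(P))
    sign_params = (rng.randrange(1, P), rng.randrange(P))
    return [0] * k, bucket_params, sign_params


class MedianCountSketch:
    def __init__(self, eps, delta, seed=None):
        rng = random.Random(seed)
        self.k = math.ceil(3 / eps ** 2)
        # Theta(log(1/delta)) rows; the constant 18 is an illustrative guess.
        t = max(1, math.ceil(18 * math.log(1 / delta)))
        if t % 2 == 0:
            t += 1  # an odd count keeps the median an actual row estimate
        self.rows = [make_row(self.k, rng) for _ in range(t)]

    def _cell(self, row, x):
        _, (a, b), (s, u) = row
        bucket = ((a * x + b) % P) % self.k
        sign = 1 if ((s * x + u) % P) % 2 == 0 else -1
        return bucket, sign

    def update(self, y, change=1):
        for row in self.rows:
            bucket, sign = self._cell(row, y)
            row[0][bucket] += change * sign

    def query(self, a):
        estimates = []
        for row in self.rows:
            bucket, sign = self._cell(row, a)
            estimates.append(sign * row[0][bucket])
        return statistics.median(estimates)
```

Each row errs by more than $\varepsilon \|f\|_2$ with probability at most $1/3$, so the median is wrong only if at least half of the rows err at once, which becomes exponentially unlikely as the number of rows grows.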
