Algorithms for Big Data (IV)

Chihao Zhang

Shanghai Jiao Tong University

Oct. 11, 2019

Algorithms for Big Data (IV) 1/19


Review of the Last Lecture

Last time, we introduced the AMS algorithm for counting distinct elements in the streaming model. We are given a sequence of numbers ⟨a1, . . . , am⟩ where each ai ∈ [n]. It defines a frequency vector f = (f1, . . . , fn) where fi = |{ k ∈ [m] : ak = i }|. We want to compute the number d = |{ i ∈ [n] : fi > 0 }|.
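These definitions translate directly into code. The following offline sketch (function names are illustrative) computes the frequency vector f and the distinct count d for a small stream:

```python
from collections import Counter

def frequency_vector(stream, n):
    # f_i = |{ k in [m] : a_k = i }| for i = 1, ..., n
    counts = Counter(stream)
    return [counts.get(i, 0) for i in range(1, n + 1)]

def distinct_count(stream):
    # d = |{ i in [n] : f_i > 0 }|
    return len(set(stream))

stream = [1, 3, 3, 5, 1, 3]          # m = 6 elements over [n] with n = 5
print(frequency_vector(stream, 5))   # [2, 0, 3, 0, 1]
print(distinct_count(stream))        # 3
```

Of course, storing the whole vector takes Ω(n) space; the point of the streaming algorithms below is to avoid exactly this.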

Algorithm: AMS Algorithm for Counting Distinct Elements
Init: a random hash function h : [n] → [n] from a 2-universal family; Z ← 0.
On input y:
    if zeros(h(y)) > Z then
        Z ← zeros(h(y))
    end if
Output: d̂ = 2^(Z + 1/2).
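A runnable sketch of the algorithm above, under stated assumptions: the 2-universal family is the usual ((a·x + b) mod p) mod n construction with a Mersenne prime p, and trailing_zeros plays the role of zeros(·); both helpers are implementation choices of this sketch, not prescribed by the slide.

```python
import random

def trailing_zeros(x, cap=64):
    # zeros(x): number of trailing zero bits of x (cap is used for x = 0)
    if x == 0:
        return cap
    return (x & -x).bit_length() - 1

def make_hash(n, rng, p=(1 << 61) - 1):
    # h(x) = ((a*x + b) mod p) mod n, a standard 2-universal family
    a, b = rng.randrange(1, p), rng.randrange(0, p)
    return lambda x: ((a * x + b) % p) % n

def ams_distinct(stream, n, seed=None):
    rng = random.Random(seed)
    h = make_hash(n, rng)
    z = 0
    for y in stream:
        z = max(z, trailing_zeros(h(y)))
    return 2 ** (z + 0.5)   # output d-hat = 2^(Z + 1/2)
```

A single run only achieves the constant-probability factor-3 guarantee of the next slide; boosting the success probability requires the median trick.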


Using O(log(1/δ) · log n) bits of memory, we can obtain

Pr[d/3 ≤ d̂ ≤ 3d] ≥ 1 − δ.

We also introduced the BJKST algorithm, a refinement of the AMS algorithm. We will show today that the BJKST algorithm can produce d̂ which is a (1 ± ε)-approximation of d for any ε > 0.

The BJKST Algorithm

The following refinement is due to Bar-Yossef, Jayram, Kumar, Sivakumar and Trevisan.

Algorithm: BJKST Algorithm for Counting Distinct Elements
Init: random hash functions h : [n] → [n] and g : [n] → [b·ε⁻⁴ log² n], both from 2-universal families; Z ← 0, B ← ∅.
On input y:
    if zeros(h(y)) ≥ Z then
        B ← B ∪ { (g(y), zeros(h(y))) }
        while |B| ≥ c/ε² do
            Z ← Z + 1
            Remove all (α, β) with β < Z from B
        end while
    end if
Output: d̂ = |B| · 2^Z.
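The BJKST refinement can be sketched as follows. For clarity this version takes g to be the identity (as the analysis on the next slides assumes); eps and the constant c are illustrative choices of this sketch.

```python
import random

def trailing_zeros(x, cap=64):
    # zeros(x): number of trailing zero bits of x (cap is used for x = 0)
    if x == 0:
        return cap
    return (x & -x).bit_length() - 1

def bjkst_distinct(stream, n, eps=0.25, c=576, seed=None):
    rng = random.Random(seed)
    p = (1 << 61) - 1
    a, b = rng.randrange(1, p), rng.randrange(0, p)
    h = lambda x: ((a * x + b) % p) % n   # 2-universal hash [n] -> [n]
    z = 0
    bucket = {}                           # g = identity: maps y -> zeros(h(y))
    cap = int(c / eps ** 2)
    for y in stream:
        if trailing_zeros(h(y)) >= z:
            bucket[y] = trailing_zeros(h(y))
            while len(bucket) >= cap:     # keep |B| below the cap c/eps^2
                z += 1
                bucket = {k: v for k, v in bucket.items() if v >= z}
    return len(bucket) * 2 ** z           # d-hat = |B| * 2^Z

# While d stays below the cap, Z stays 0 and the estimate is exact:
print(bjkst_distinct([i % 50 + 1 for i in range(1000)], 50))   # 50
```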


The algorithm maintains a bucket B, which stores those y whose zeros(h(y)) is at least the current Z. We set a cap L = c/ε² for the size of B:

▶ if L = ∞, B stores all entries, and the algorithm is exact;
▶ if L = 2, the algorithm is equivalent to AMS.

Therefore, the size of B is a trade-off between the memory consumption and the accuracy of the algorithm.


Analysis

To analyze the algorithm, we first assume that g is simply the identity function from [n] to [n], namely g(y) = y for all y ∈ [n]. We then need to store the whole B, whose size is O(ε⁻²). Similar to AMS, for every k ∈ [n], Xk,r is the indicator that h(k) has at least r trailing zeros. Define

Yr = ∑_{k∈[n]: fk>0} Xk,r

as the number of h(ai) with at least r trailing zeros. We already know from the last lecture that E[Yr] = d/2^r and Var[Yr] ≤ d/2^r.


If Z = t at the end of the algorithm, then Yt = |B| and d̂ = Yt · 2^t. We use A to denote the bad event that |Yt · 2^t − d| ≥ εd, or equivalently |Yt − d/2^t| ≥ εd/2^t. We will bound the probability of A using the following argument:

▶ if t is small, then E[Yt] = d/2^t is large, so we can apply concentration inequalities;
▶ the value t is unlikely to be very large.

We let s be the threshold for small/large values mentioned above.

Pr[A] = ∑_{r=1}^{log n} Pr[ |Yr − d/2^r| ≥ εd/2^r ∧ t = r ]
      ≤ ∑_{r=1}^{s−1} Pr[ |Yr − d/2^r| ≥ εd/2^r ] + ∑_{r=s}^{log n} Pr[t = r]
      = ∑_{r=1}^{s−1} Pr[ |Yr − E[Yr]| ≥ εd/2^r ] + Pr[ Y_{s−1} ≥ c/ε² ]
      ≤ ∑_{r=1}^{s−1} 2^r/(ε²d) + ε²d/(c · 2^{s−1})
      ≤ 2^s/(ε²d) + ε²d/(c · 2^{s−1}).

So if we choose s such that d/2^s = Θ(ε⁻²), Pr[A] can be bounded by any constant (depending on c).
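As a quick numerical sanity check (with illustrative values of ε, c, and d, not taken from the slide), choosing s with d/2^s = Θ(ε⁻²) does make the final bound a small constant:

```python
import math

eps, c, d = 0.1, 576, 10 ** 6
# choose s so that d / 2^s is about 12 / eps^2, i.e. 2^s is about eps^2 * d / 12
s = round(math.log2(eps ** 2 * d / 12))
bound = 2 ** s / (eps ** 2 * d) + eps ** 2 * d / (c * 2 ** (s - 1))
print(s, bound)   # the bound is a constant well below 1
```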


Space Complexity

We need to store

▶ the function h: O(log n);
▶ the function g: O(log n);
▶ the bucket B: O((c/ε²) · log ran(g)) = O((c/ε²) · log n).

The bottleneck is to store B. Instead of using the identity function g, we can tolerate collisions (with at most constant probability). This helps to reduce the memory needed (Exercise).



Frequency Estimation

Consider a stream of numbers ⟨a1, . . . , am⟩ and its frequency vector f = (f1, . . . , fn). Another fundamental problem is to estimate fa for each query a ∈ [n]. It is closely related to the Frequent problem, which asks for the set { j : fj > m/k }. We now describe a deterministic algorithm for Frequency-Estimation.



Misra-Gries

Algorithm: Misra-Gries Algorithm for Frequency-Estimation
Init: an empty table A.
On input y:
    if y ∈ keys(A) then
        A[y] ← A[y] + 1
    else if |keys(A)| ≤ k − 1 then
        A[y] ← 1
    else
        for all ℓ ∈ keys(A) do
            A[ℓ] ← A[ℓ] − 1
            if A[ℓ] = 0 then
                Remove ℓ from A
            end if
        end for
    end if

Algorithm: Misra-Gries (cont'd)
Output: on query j,
    if j ∈ keys(A) then
        f̂j = A[j]
    else
        f̂j = 0
    end if
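The two parts of the algorithm above translate directly into code; a minimal sketch (misra_gries and estimate are illustrative names):

```python
def misra_gries(stream, k):
    # Maintain a table A with at most k keys
    A = {}
    for y in stream:
        if y in A:
            A[y] += 1
        elif len(A) <= k - 1:
            A[y] = 1
        else:
            for key in list(A):        # decrement every counter
                A[key] -= 1
                if A[key] == 0:
                    del A[key]
    return A

def estimate(A, j):
    # f-hat_j = A[j] if j is a key of A, and 0 otherwise
    return A.get(j, 0)

A = misra_gries([1, 1, 1, 1, 1, 2, 2, 2, 3], k=2)
print(estimate(A, 1), estimate(A, 2), estimate(A, 3))   # 4 2 0
```

Note how the rare item 3 triggers one round of decrements and then disappears, while the frequent items survive with slightly reduced counters.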


Analysis

The algorithm uses O(k(log m + log n)) bits of memory. It is not hard to see that for each j ∈ [n], the output f̂j satisfies fj − m/k ≤ f̂j ≤ fj. If fj > m/k, then j is in the table A. The converse is not correct!
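The guarantee fj − m/k ≤ f̂j ≤ fj is deterministic, so it can be checked exhaustively on a random stream (the routine from the previous snippet is repeated here so this one is self-contained; the parameters are illustrative):

```python
import random

def misra_gries(stream, k):
    A = {}
    for y in stream:
        if y in A:
            A[y] += 1
        elif len(A) <= k - 1:
            A[y] = 1
        else:
            for key in list(A):
                A[key] -= 1
                if A[key] == 0:
                    del A[key]
    return A

random.seed(0)
n, m, k = 20, 1000, 10
stream = [random.randrange(1, n + 1) for _ in range(m)]
A = misra_gries(stream, k)
for j in range(1, n + 1):
    fj = stream.count(j)
    assert fj - m / k <= A.get(j, 0) <= fj   # holds for every j
print("guarantee holds for all", n, "items")
```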

In Misra-Gries, we compute a table A. The table A stores information about the stream, so we can extract frequencies from it. However, Misra-Gries suffers from the following main drawbacks:

▶ given two tables A1 and A2 with respect to σ1 and σ2 respectively, we don't know how to obtain the table for σ1 ◦ σ2 (algorithms with this property are called sketches);
▶ it does not extend to the turnstile model.

In the turnstile model, each entry of the stream is a pair (aj, ∆j). Upon receiving (aj, ∆j), we update faj to faj + ∆j.


Count Sketch

Algorithm: Count Sketch
Init: an array C[j] for j ∈ [k], where k = 3/ε²; a random hash function h : [n] → [k] from a 2-universal family; a random hash function g : [n] → {−1, 1} from a 2-universal family.
On input (y, ∆):
    C[h(y)] ← C[h(y)] + ∆ · g(y)
Output: on query a, output f̂a = g(a) · C[h(a)].
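A single-row sketch of Count Sketch in the turnstile model. The ±1 hash g is derived here from a second 2-universal function by taking a parity, which is an implementation choice of this sketch rather than part of the slide:

```python
import random

def count_sketch(n, eps=0.1, seed=None):
    rng = random.Random(seed)
    k = int(3 / eps ** 2) + 1          # k = 3/eps^2 counters (rounded up)
    p = (1 << 61) - 1
    a1, b1 = rng.randrange(1, p), rng.randrange(0, p)
    a2, b2 = rng.randrange(1, p), rng.randrange(0, p)
    h = lambda x: ((a1 * x + b1) % p) % k
    g = lambda x: 1 if ((a2 * x + b2) % p) % 2 == 0 else -1
    C = [0] * k

    def update(y, delta=1):            # turnstile update (y, Δ)
        C[h(y)] += delta * g(y)

    def query(a):                      # f-hat_a = g(a) * C[h(a)]
        return g(a) * C[h(a)]

    return update, query

update, query = count_sketch(n=100, seed=0)
update(7, 5)
update(7, -2)                          # turnstile deletions are fine
print(query(7))                        # 3 (exact here: no other item was inserted)
```

Since g(a)² = 1, a query on the only inserted item recovers its frequency exactly; collisions with other items are what the variance analysis below controls.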


Analysis

Let X = f̂a be the output on the query a. For every j ∈ [n], let Yj be the indicator of h(j) = h(a). Then

X = g(a) · ∑_{j=1}^{n} fj · g(j) · Yj.

We have

E[X] = E[ g(a) · g(a) · fa · Ya + ∑_{j∈[n]\{a}} g(a) · fj · g(j) · Yj ] = fa.

Let Z ≜ ∑_{j∈[n]\{a}} fj · g(a) · g(j) · Yj; then X = fa + Z and Var[X] = Var[Z].

E[Z²] = E[ ( ∑_{j∈[n]\{a}} fj · g(a) · g(j) · Yj )² ]
      = E[ ∑_{j∈[n]\{a}} fj² · Yj² + ∑_{j≠j′∈[n]\{a}} fj · fj′ · g(j) · g(j′) · Yj · Yj′ ]
      = E[ ∑_{j∈[n]\{a}} fj² · Yj² ]
      = ∑_{j∈[n]\{a}} fj² · E[Yj²],

where g(a)² = 1 and the cross terms vanish since E[g(j) · g(j′)] = 0 for j ≠ j′. Note that for every j ≠ a,

E[Yj²] = E[Yj] = Pr[h(j) = h(a)] = 1/k.

Therefore

E[Z²] = ∑_{j∈[n]\{a}} fj²/k ≤ ∥f∥₂²/k.

Var[X] = Var[Z] = E[Z²] − (E[Z])² ≤ ∥f∥₂²/k.

By Chebyshev,

Pr[ |f̂a − fa| ≥ ε∥f∥₂ ] ≤ 1/(kε²) = 1/3.

We can then use the median trick to boost the algorithm so that

▶ Pr[ |f̂a − fa| ≥ ε∥f∥₂ ] ≤ δ;
▶ it costs O((1/ε²) · log(1/δ) · (log m + log n)) bits of memory.

Compare the performance (in terms of accuracy and space consumption) of Misra-Gries and Count Sketch (Exercise).

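The median trick above can be sketched by running t independent rows of the Count Sketch and answering each query with the median of the t estimates; t = 11 and the parity-based ±1 hash are illustrative choices of this sketch.

```python
import random
import statistics

def one_row(n, k, rng):
    # One independent row: h : [n] -> [k], g : [n] -> {-1, +1}, counters C
    p = (1 << 61) - 1
    a1, b1 = rng.randrange(1, p), rng.randrange(0, p)
    a2, b2 = rng.randrange(1, p), rng.randrange(0, p)
    h = lambda x: ((a1 * x + b1) % p) % k
    g = lambda x: 1 if ((a2 * x + b2) % p) % 2 == 0 else -1
    return h, g, [0] * k

def boosted_count_sketch(n, eps=0.2, t=11, seed=0):
    # t = O(log(1/delta)) rows; each query returns the median estimate
    rng = random.Random(seed)
    k = int(3 / eps ** 2) + 1
    rows = [one_row(n, k, rng) for _ in range(t)]

    def update(y, delta=1):
        for h, g, C in rows:
            C[h(y)] += delta * g(y)

    def query(a):
        return statistics.median(g(a) * C[h(a)] for h, g, C in rows)

    return update, query

update, query = boosted_count_sketch(n=100)
update(7, 5)
print(query(7))   # 5: every row agrees when no other item was inserted
```

Each row fails with probability at most 1/3, so the median of t rows fails with probability exponentially small in t, by a Chernoff bound.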