Algorithms for Big Data (II) Chihao Zhang Shanghai Jiao Tong - PowerPoint PPT Presentation

Algorithms for Big Data (II) Chihao Zhang Shanghai Jiao Tong University Sept. 27, 2019

Review of Last Lecture
Last time, we met the streaming model. We studied Morris’ algorithm for counting the number of elements in a data stream. We used the averaging trick and the median trick to boost the quality of Morris’ algorithm. Today we will take a closer look at the mathematical tools needed in the course.

Markov’s Inequality
For every nonnegative random variable $X$ and every $a > 0$, it holds that
\[ \Pr[X \ge a] \le \frac{\mathbb{E}[X]}{a}. \]
Proof. Let $\mathbf{1}_{X \ge a}$ be the indicator random variable that equals $1$ if $X \ge a$ and $0$ otherwise. Then $X \ge a \cdot \mathbf{1}_{X \ge a}$. Taking the expectation on both sides, we obtain
\[ \mathbb{E}[X] \ge a \cdot \mathbb{E}[\mathbf{1}_{X \ge a}] = a \cdot \Pr[X \ge a]. \qquad \square \]
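As a quick sanity check of Markov's inequality, the sketch below computes both sides exactly for a small distribution. The helper `markov_check` and the fair-die example are illustrative assumptions, not part of the lecture.

```python
from fractions import Fraction


def markov_check(pmf, a):
    """Return (Pr[X >= a], E[X] / a) for a nonnegative finite distribution.

    pmf maps each value x >= 0 to its probability; a must be positive.
    """
    tail = sum(p for x, p in pmf.items() if x >= a)
    mean = sum(x * p for x, p in pmf.items())
    return tail, mean / a


# Hypothetical example: a fair six-sided die.
die = {x: Fraction(1, 6) for x in range(1, 7)}
tail, bound = markov_check(die, 5)
print(tail, bound)  # 1/3 7/10
```

The exact tail (1/3) sits well below the bound (7/10), which is typical: Markov's inequality uses only the mean, so it is rarely tight.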

Chebyshev’s Inequality
For every random variable $X$ and every $a > 0$, it holds that
\[ \Pr\big[\,|X - \mathbb{E}[X]| \ge a\,\big] \le \frac{\mathrm{Var}[X]}{a^2}. \]
Proof. Applying Markov’s inequality to the nonnegative random variable $(X - \mathbb{E}[X])^2$, we have
\[ \Pr\big[\,|X - \mathbb{E}[X]| \ge a\,\big] = \Pr\big[(X - \mathbb{E}[X])^2 \ge a^2\big] \le \frac{\mathbb{E}\big[(X - \mathbb{E}[X])^2\big]}{a^2} = \frac{\mathrm{Var}[X]}{a^2}. \qquad \square \]

Chernoff’s Bound
Chernoff bound. Let $X_1, \dots, X_n$ be independent Bernoulli trials with $\mathbb{E}[X_i] = p_i$ for every $i = 1, \dots, n$. Let $X = \sum_{i=1}^n X_i$. Then for every $0 < \varepsilon < 1$, it holds that
\[ \Pr\big[\,|X - \mathbb{E}[X]| > \varepsilon \cdot \mathbb{E}[X]\,\big] \le 2 \exp\left(-\frac{\varepsilon^2\, \mathbb{E}[X]}{3}\right). \]
The main tool to prove the Chernoff bound is the moment generating function $\mathbb{E}[e^{tX}]$ of a random variable $X$. It holds that
\[ \mathbb{E}\big[e^{tX}\big] = \mathbb{E}\Big[e^{t \sum_{i=1}^n X_i}\Big] = \prod_{i=1}^n \mathbb{E}\big[e^{tX_i}\big] = \prod_{i=1}^n \big((1 - p_i) + p_i e^t\big) = \prod_{i=1}^n \big(1 - (1 - e^t) p_i\big) \le \prod_{i=1}^n e^{-(1 - e^t) p_i} = e^{-(1 - e^t)\, \mathbb{E}[X]}. \]

Proof of Chernoff Bound
For every $t > 0$, we have
\[ \Pr\big[X \ge (1 + \varepsilon)\mathbb{E}[X]\big] = \Pr\big[e^{tX} \ge e^{t(1+\varepsilon)\mathbb{E}[X]}\big] \le \frac{\mathbb{E}\big[e^{tX}\big]}{e^{t(1+\varepsilon)\mathbb{E}[X]}} \le \frac{e^{-(1 - e^t)\mathbb{E}[X]}}{e^{t(1+\varepsilon)\mathbb{E}[X]}}. \]
To find an optimal $t$, we calculate the derivative of the above and obtain, for $t = \log(1 + \varepsilon)$,
\[ \Pr\big[X \ge (1 + \varepsilon)\mathbb{E}[X]\big] \le \left(\frac{e^{\varepsilon}}{(1 + \varepsilon)^{1+\varepsilon}}\right)^{\mathbb{E}[X]} \le e^{-\varepsilon^2 \mathbb{E}[X]/3}. \]
We can similarly prove that
\[ \Pr\big[X \le (1 - \varepsilon)\mathbb{E}[X]\big] \le e^{-\varepsilon^2 \mathbb{E}[X]/2}. \]
Combining the bounds for both lower and upper tails, we finish the proof.
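To get a feel for how the Chernoff bound compares with the truth, the following sketch evaluates the exact two-sided tail of a binomial distribution (the special case where all $p_i$ are equal) against $2\exp(-\varepsilon^2 \mathbb{E}[X]/3)$. The function names and the parameters $n = 200$, $p = 0.5$ are assumptions chosen only for illustration.

```python
import math


def binomial_two_sided_tail(n, p, eps):
    """Exact Pr[|X - np| > eps * np] for X ~ Binomial(n, p)."""
    mu = n * p
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if abs(k - mu) > eps * mu
    )


def chernoff_bound(mu, eps):
    """The bound 2 * exp(-eps^2 * mu / 3) from the Chernoff bound statement."""
    return 2 * math.exp(-eps * eps * mu / 3)


n, p = 200, 0.5
for eps in (0.1, 0.2, 0.4):
    exact = binomial_two_sided_tail(n, p, eps)
    bound = chernoff_bound(n * p, eps)
    print(f"eps={eps}: exact={exact:.3e}, bound={bound:.3e}")
```

For small $\varepsilon$ the bound can exceed 1 and is then vacuous; its value lies in the exponential decay as $\varepsilon^2 \mathbb{E}[X]$ grows.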

Balls-into-Bins
Balls-into-Bins is a simple yet important probabilistic model. It models an important object, hash functions.
Suppose we throw $m$ balls into $n$ bins uniformly and independently. What is the (expected) max load of the bins?
When $m = n$, the answer is $\Theta\left(\frac{\log n}{\log \log n}\right)$.
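The growth of the max load for $m = n$ can be observed with a small simulation. This is a minimal sketch: the helper `max_load`, the seed, and the number of trials are assumptions, and the last printed column is only the $\log n / \log\log n$ reference value without the hidden constant.

```python
import math
import random
from collections import Counter


def max_load(m, n, rng):
    """Throw m >= 1 balls into n bins uniformly and independently.

    Returns the number of balls in the fullest bin.
    """
    loads = Counter(rng.randrange(n) for _ in range(m))
    return max(loads.values())


rng = random.Random(0)  # fixed seed, only to make the run repeatable
trials = 10
for n in (10**3, 10**4, 10**5):
    avg = sum(max_load(n, n, rng) for _ in range(trials)) / trials
    print(n, avg, math.log(n) / math.log(math.log(n)))
```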

Independence
A set of random variables $X_1, \dots, X_n$ are mutually independent if for every index set $I \subseteq [n]$ and values $\{x_i\}_{i \in I}$,
\[ \Pr\Big[\bigwedge_{i \in I} X_i = x_i\Big] = \prod_{i \in I} \Pr[X_i = x_i]. \]

$k$-wise Independence
A weaker notion of independence is $k$-wise independence.
A set of random variables $X_1, \dots, X_n$ are $k$-wise independent if for every index set $I \subseteq [n]$ with $|I| \le k$ and values $\{x_i\}_{i \in I}$,
\[ \Pr\Big[\bigwedge_{i \in I} X_i = x_i\Big] = \prod_{i \in I} \Pr[X_i = x_i]. \]
We call $X_1, \dots, X_n$ pairwise independent if they are $2$-wise independent.

Examples
Suppose we have $n$ independent, uniformly random bits $X_1, \dots, X_n \in \{0, 1\}$.
For every $I \subseteq [n]$, define $Y_I = \big(\sum_{j \in I} X_j\big) \bmod 2$.
The random bits $\{Y_I\}_{I \subseteq [n]}$ are pairwise independent.
But they are not mutually independent!
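The parity construction can be checked exhaustively for a small $n$. The sketch below enumerates all $2^n$ bit vectors for $n = 3$ (an arbitrary choice), restricts to nonempty index sets, and uses 0-based indices; the helper names are assumptions for illustration.

```python
from fractions import Fraction
from itertools import combinations, product

n = 3
# All nonempty index sets I, with 0-based indices.
subsets = [s for r in range(1, n + 1) for s in combinations(range(n), r)]
outcomes = list(product((0, 1), repeat=n))  # all 2^n equally likely bit vectors


def parity(bits, subset):
    """Y_I = (sum of X_j for j in I) mod 2."""
    return sum(bits[j] for j in subset) % 2


def joint_prob(constraints):
    """Exact probability that Y_I = y for every (I, y) in constraints."""
    hits = sum(all(parity(x, I) == y for I, y in constraints) for x in outcomes)
    return Fraction(hits, len(outcomes))


# Pairwise independence: every pair of distinct sets factorizes.
pairwise_ok = all(
    joint_prob([(I, a), (J, b)]) == joint_prob([(I, a)]) * joint_prob([(J, b)])
    for I, J in combinations(subsets, 2)
    for a, b in product((0, 1), repeat=2)
)

# Not mutually independent: Y_{0}, Y_{1}, Y_{0,1} can never all equal 1,
# because Y_{0,1} = Y_{0} XOR Y_{1}.
triple = joint_prob([((0,), 1), ((1,), 1), ((0, 1), 1)])
print(pairwise_ok, triple)  # True 0
```

Mutual independence would require the triple probability to be $1/8$, so the value 0 witnesses the failure.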

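Property of Pairwise Independence
Theorem. For pairwise independent $X_1, \dots, X_n$, we have
\[ \mathrm{Var}[X_1 + \dots + X_n] = \mathrm{Var}[X_1] + \dots + \mathrm{Var}[X_n]. \]
Proof. Pairwise independence gives $\mathbb{E}[X_i X_j] = \mathbb{E}[X_i]\,\mathbb{E}[X_j]$ for all $i \ne j$, so the cross terms cancel:
\begin{align*}
\mathrm{Var}[X_1 + \dots + X_n] &= \mathbb{E}\big[(X_1 + \dots + X_n)^2\big] - \big(\mathbb{E}[X_1 + \dots + X_n]\big)^2 \\
&= \mathbb{E}\Big[\sum_{i=1}^n X_i^2 + 2 \sum_{1 \le i < j \le n} X_i X_j\Big] - \Big(\sum_{i=1}^n \mathbb{E}[X_i]^2 + 2 \sum_{1 \le i < j \le n} \mathbb{E}[X_i]\,\mathbb{E}[X_j]\Big) \\
&= \sum_{i=1}^n \big(\mathbb{E}[X_i^2] - \mathbb{E}[X_i]^2\big) = \sum_{i=1}^n \mathrm{Var}[X_i]. \qquad \square
\end{align*}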
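The variance identity can be confirmed exactly on the parity bits $Y_I$, which are pairwise but not mutually independent. The sketch below again assumes $n = 3$, nonempty index sets, and 0-based indices; all helper names are illustrative.

```python
from fractions import Fraction
from itertools import combinations, product

n = 3
subsets = [s for r in range(1, n + 1) for s in combinations(range(n), r)]
outcomes = list(product((0, 1), repeat=n))  # uniformly random bit vectors


def expectation(f):
    """Exact expectation of f(x) over uniformly random bit vectors x."""
    return sum(Fraction(f(x)) for x in outcomes) / len(outcomes)


def variance(f):
    """Var[f] = E[f^2] - E[f]^2, computed exactly."""
    return expectation(lambda x: f(x) ** 2) - expectation(f) ** 2


def y(subset):
    """The parity bit Y_I as a function of the bit vector x."""
    return lambda x: sum(x[j] for j in subset) % 2


def total(x):
    """Sum of all parity bits Y_I over nonempty I."""
    return sum(y(I)(x) for I in subsets)


var_of_sum = variance(total)
sum_of_vars = sum(variance(y(I)) for I in subsets)
print(var_of_sum, sum_of_vars)  # 7/4 7/4
```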

Hash Functions
In Balls-into-Bins, we distribute balls uniformly and independently. This can be implemented using hash functions.
Hash functions are important data structures that have been widely used in computer science.
We will construct hash functions with theoretical guarantees.

Universal Hash Function Families
Let $\mathcal{H}$ be a family of functions from $[m]$ to $[n]$ where $m \ge n$.
We call $\mathcal{H}$ $k$-universal if for every distinct $x_1, \dots, x_k \in [m]$, we have
\[ \Pr_{h \in \mathcal{H}}\big[h(x_1) = h(x_2) = \dots = h(x_k)\big] \le \frac{1}{n^{k-1}}. \]
We call $\mathcal{H}$ strongly $k$-universal if for every distinct $x_1, \dots, x_k \in [m]$ and every $y_1, \dots, y_k \in [n]$, we have
\[ \Pr_{h \in \mathcal{H}}\Big[\bigwedge_{i=1}^k h(x_i) = y_i\Big] = \frac{1}{n^k}. \]

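Balls-into-Bins with 2-Universal Hash Family
Let $X_{ij}$ be the indicator of the event that the $i$-th ball and the $j$-th ball fall into the same bin. Let $X = \sum_{1 \le i < j \le m} X_{ij}$ be the total number of collisions. Since the family is $2$-universal, $\mathbb{E}[X_{ij}] \le 1/n$, and therefore
\[ \mathbb{E}[X] = \sum_{1 \le i < j \le m} \mathbb{E}[X_{ij}] \le \binom{m}{2} \frac{1}{n} < \frac{m^2}{2n}. \]
Assume the max load is $Y$, which causes at least $\binom{Y}{2}$ collisions, so $\binom{Y}{2} \le X$. Then, by Markov’s inequality,
\[ \Pr\left[\binom{Y}{2} \ge \frac{m^2}{n}\right] \le \Pr\left[X \ge \frac{m^2}{n}\right] \le \frac{1}{2}. \]
Since $\binom{Y}{2} = \frac{Y(Y-1)}{2} \ge \frac{(Y-1)^2}{2}$, the event $Y - 1 \ge m\sqrt{2/n}$ implies $\binom{Y}{2} \ge m^2/n$. Therefore,
\[ \Pr\left[Y - 1 \ge m\sqrt{2/n}\right] \le \frac{1}{2}. \]
When $m = n$, the max load is less than $1 + \sqrt{2n}$ with probability at least $1/2$.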

Construction of 2-Universal Family
Now we explicitly construct a universal family of hash functions from $[m]$ to $[n]$.
Let $p \ge m$ be a prime and let
\[ h_{a,b}(x) = \big((ax + b) \bmod p\big) \bmod n. \]
The family is
\[ \mathcal{H} = \{ h_{a,b} : 1 \le a \le p - 1,\ 0 \le b \le p - 1 \}. \]
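The construction translates almost directly into code. The following is a minimal sketch, assuming keys are represented as $\{0, \dots, m-1\}$ and hash values as $\{0, \dots, n-1\}$; `is_prime`, `next_prime`, and `sample_hash` are hypothetical helper names, and trial division is used only because it is simple.

```python
import random


def is_prime(q):
    """Trial-division primality test; adequate for small illustrative inputs."""
    if q < 2:
        return False
    i = 2
    while i * i <= q:
        if q % i == 0:
            return False
        i += 1
    return True


def next_prime(m):
    """Smallest prime p with p >= m."""
    p = max(m, 2)
    while not is_prime(p):
        p += 1
    return p


def sample_hash(m, n, rng):
    """Draw h_{a,b} uniformly from the family H for keys in {0, ..., m - 1}."""
    p = next_prime(m)
    a = rng.randrange(1, p)  # 1 <= a <= p - 1
    b = rng.randrange(0, p)  # 0 <= b <= p - 1
    return lambda x: ((a * x + b) % p) % n


h = sample_hash(1000, 16, random.Random(0))
print([h(x) for x in range(5)])
```

Each function in the family is described by the pair $(a, b)$ alone, so storing one costs $O(\log p)$ bits regardless of $m$.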

Proof
We show that $\mathcal{H}$ constructed above is indeed $2$-universal.
We compute the colliding probability
\[ \Pr_{h_{a,b} \in \mathcal{H}}\big[h_{a,b}(x) = h_{a,b}(y)\big] \quad \text{for } x \ne y. \]
First, if $x \ne y$, then $ax + b \not\equiv ay + b \pmod p$.
Moreover, $(a, b) \mapsto (ax + b, ay + b)$ is a bijection from $\{1, \dots, p-1\} \times \{0, \dots, p-1\}$ to $\{(u, v) : 0 \le u, v \le p - 1,\ u \ne v\}$.
This is because the system
\[ \begin{cases} ax + b \equiv u \pmod p \\ ay + b \equiv v \pmod p \end{cases} \]
has a unique solution
\[ \begin{cases} a \equiv \dfrac{v - u}{y - x} \pmod p \\ b \equiv u - ax \pmod p. \end{cases} \]

Proof (cont’d)
Therefore,
\[ \Pr_{h_{a,b} \in \mathcal{H}}\big[h_{a,b}(x) = h_{a,b}(y)\big] = \Pr_{(u,v) \in \mathbb{F}_p^2 :\, u \ne v}\big[u \equiv v \pmod n\big]. \]
The number of pairs $(u, v)$ with $u \ne v$ is $p(p-1)$.
For each $u$, the number of values of $v \ne u$ with $u \equiv v \pmod n$ is at most $\lceil p/n \rceil - 1$.
The probability is therefore at most
\[ \frac{p\,(\lceil p/n \rceil - 1)}{p\,(p-1)} \le \frac{1}{n}. \]
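The collision bound can be verified exhaustively for small parameters by enumerating every $(a, b)$. The sketch below assumes keys $\{0, \dots, m-1\}$ with $m \le p$, and the values $p = 11$, $n = 4$ are arbitrary; `max_collision_probability` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import combinations


def max_collision_probability(p, n, m):
    """Largest exact Pr over (a, b) that h_{a,b}(x) == h_{a,b}(y),
    taken over all pairs of distinct keys x, y in {0, ..., m - 1} (m <= p)."""
    family = [(a, b) for a in range(1, p) for b in range(p)]
    worst = Fraction(0)
    for x, y in combinations(range(m), 2):
        hits = sum(
            ((a * x + b) % p) % n == ((a * y + b) % p) % n for a, b in family
        )
        worst = max(worst, Fraction(hits, len(family)))
    return worst


p, n = 11, 4
worst = max_collision_probability(p, n, m=p)
print(worst, worst <= Fraction(1, n))  # 2/11 True
```

Because of the bijection in the proof, every pair of distinct keys has the same collision probability, here $20/110 = 2/11$, which is below $1/n = 1/4$.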
