
Algorithms for Big Data (I)

Chihao Zhang

Shanghai Jiao Tong University

Sept. 20, 2019

Algorithms for Big Data (I) 1/19

Course Information

▶ Instructor: Chihao Zhang
▶ Course Homepage: http://chihaozhang.com/teaching/BDA2019fall
▶ Time: Every Friday, 12:55 - 15:40
▶ Office hour: Every Monday, 18:00 - 20:00
▶ Grading: Homework 60%, Final Exam 40%
▶ Prerequisites: Algorithms, Basic Probability Theory

Algorithms for Big Data (I) 2/19

Algorithms

We learnt many efficient algorithms before…

▶ Dijkstra's algorithm, Floyd's algorithm, the Blossom algorithm…
▶ These algorithms cost polynomial time.

What if the input is too large to store? Throw some of it away: sublinear space algorithms.

Algorithms for Big Data (I) 3/19

A programmer for routers

A router has limited memory, but needs to process large data… The router can monitor the ids of devices connecting to it, for example the stream

23, 38, 45, 28, 11, 10, 36, 17, 2, 23, 40, 23, 18, 24, 31, 3, 48, 25, 43, 14, 21, 17, 46

▶ How many numbers?
▶ How many distinct numbers?
▶ What is the most frequent number?

Algorithms for Big Data (I) 4/19

Streaming Model

The input is a sequence σ = ⟨a_1, a_2, …, a_m⟩ where each a_i ∈ [n]. One can process the input stream using at most s bits of memory. We say the algorithm is sublinear if s = o(min{m, n}). We can ask:

▶ How many numbers are there (what is m)?
▶ How many distinct numbers?
▶ What is the median of σ?
▶ What is the most frequent number?
▶ …

Algorithms for Big Data (I) 5/19

How many numbers?

We can maintain a counter k: whenever one reads a number a_i, let k = k + 1.

How many bits of memory are needed? ⌈log_2 m⌉. Can this be improved to o(log m)? Impossible (why?). It becomes possible if we allow approximation: for every ε > 0, compute a number m̂ such that 1 − ε ≤ m̂/m ≤ 1 + ε.

Algorithms for Big Data (I) 6/19
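The exact counter described above is tiny, but writing it out makes the memory accounting concrete. A minimal Python sketch (illustration only; the function name is my own choice, not from the slides):

```python
def count_exact(stream):
    """Exact one-pass counter: k <- k + 1 for every element read.

    The only state kept is the integer k, which takes about log2(m)
    bits for a stream of length m -- a single pass, but not sublinear
    space once we insist on exactness.
    """
    k = 0
    for _ in stream:
        k += 1
    return k
```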

Morris' algorithm

Algorithm: Morris' Algorithm for Counting Elements
Init: a variable X ← 0.
On input y: increase X with probability 2^(−X).
Output: output m̂ = 2^X − 1.

▶ This is a randomized algorithm.
▶ Therefore we look at the expectation of its output.

Algorithms for Big Data (I) 7/19
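Morris' algorithm is short enough to run directly. The following Python sketch implements the pseudocode above (the function name and test stream are my own choices, not from the slides):

```python
import random

def morris_count(stream, rng=None):
    """Morris' approximate counter.

    Keeps only the variable X and increments it with probability
    2^-X, so the state fits in O(log log m) bits; the returned
    estimate is m_hat = 2^X - 1.
    """
    rng = rng or random.Random()
    x = 0
    for _ in stream:
        if rng.random() < 2.0 ** (-x):
            x += 1
    return 2 ** x - 1
```

A single run is noisy, but averaged over many independent runs the output is close to the true stream length m, which matches the unbiasedness E[m̂] = m proved next.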

Analysis

The output m̂ is a random variable; we prove that its expectation E[m̂] = m by induction on m.

Since X = 1 when m = 1, we have E[m̂] = 1. Assume the claim holds for smaller m, and let X_i denote the value of X after processing the i-th input.

Algorithms for Big Data (I) 8/19

Analysis (cont'd)

E[m̂] = E[2^(X_m)] − 1
      = ∑_{i=0}^{m} Pr[X_m = i] · 2^i − 1
      = ∑_{i=0}^{m} ( Pr[X_{m−1} = i] · (1 − 2^{−i}) + Pr[X_{m−1} = i−1] · 2^{1−i} ) · 2^i − 1
      = ∑_{i=0}^{m−1} Pr[X_{m−1} = i] · (2^i + 1) − 1
      = E[2^(X_{m−1})] = m    (induction hypothesis)

Algorithms for Big Data (I) 9/19

It is now clear that Morris' algorithm is an unbiased estimator for m. It uses approximately O(log log m) bits of memory.

However, for a practical randomized algorithm, we further require its output to concentrate around its expectation. That is, we want to establish a concentration inequality of the form

Pr[|m̂ − m| > εm] ≤ δ,   for ε, δ > 0.

For fixed ε, the smaller δ is, the better the algorithm will be.

Algorithms for Big Data (I) 10/19

Concentration

We need some probabilistic tools to establish the concentration inequality.

Markov's inequality

For every nonnegative random variable X and every a > 0, it holds that Pr[X ≥ a] ≤ E[X]/a.

Chebyshev's inequality

For every random variable X and every a > 0, it holds that Pr[|X − E[X]| ≥ a] ≤ Var[X]/a².

Algorithms for Big Data (I) 11/19

Concentration (cont'd)

In order to apply Chebyshev's inequality, we have to compute the variance of m̂.

Lemma: E[(2^(X_m))²] = (3/2)m² + (3/2)m + 1.

We can prove the claim using an induction argument similar to our proof for the expectation. Therefore,

Var[m̂] = E[m̂²] − E[m̂]² = E[(2^(X_m) − 1)²] − m² ≤ m²/2.

Algorithms for Big Data (I) 12/19

Applying Chebyshev's inequality, we obtain for every ε > 0,

Pr[|m̂ − m| ≥ εm] ≤ 1/(2ε²).

Can we improve the concentration? Two common tricks work here.

Algorithms for Big Data (I) 13/19

Averaging trick

Chebyshev's inequality tells us that we can improve the concentration by reducing the variance. Note that the variance satisfies

▶ Var[a · X] = a² · Var[X];
▶ Var[X + Y] = Var[X] + Var[Y] for independent X and Y.

We can independently run Morris' algorithm t times in parallel, and let the outputs be m̂_1, …, m̂_t. The final output is m̂* := (1/t) ∑_{i=1}^{t} m̂_i. Applying Chebyshev's inequality to m̂*:

Pr[|m̂* − m| ≥ εm] ≤ 1/(2tε²).

Algorithms for Big Data (I) 14/19
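The averaging trick only needs t independent copies of the counter; running them "in parallel" over the same stream can be simulated sequentially, since only the distribution of the estimate matters. A Python sketch (function names and parameters are illustrative):

```python
import random

def morris_x(stream_len, rng):
    # One Morris counter run over a stream of the given length.
    x = 0
    for _ in range(stream_len):
        if rng.random() < 2.0 ** (-x):
            x += 1
    return x

def morris_averaged(stream_len, t, rng):
    """Averaging trick: mean of t independent Morris estimates.

    Averaging divides the variance by t, so Chebyshev gives
    Pr[|m* - m| >= eps*m] <= 1/(2*t*eps^2).
    """
    return sum(2 ** morris_x(stream_len, rng) - 1 for _ in range(t)) / t
```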

For t ≥ 1/(2ε²δ), we have

Pr[|m̂* − m| ≥ εm] ≤ δ.

Our algorithm uses O(log log n / (ε²δ)) bits of memory: a trade-off between the quality of the randomized algorithm and its memory consumption.

Algorithms for Big Data (I) 15/19

The Median trick

We choose t = 3/(2ε²) in the previous algorithm. Independently run the algorithm s times in parallel, and let the outputs be m̂*_1, m̂*_2, …, m̂*_s. It holds that for every i = 1, …, s,

Pr[|m̂*_i − m| ≥ εm] ≤ 1/3.

Output the median m̂** of m̂*_1, …, m̂*_s.

Algorithms for Big Data (I) 16/19
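The median trick composes directly with the averaged estimator. A self-contained Python sketch (function names are illustrative; the choice t = 3/(2ε²) is the one from the slide):

```python
import random
import statistics

def _morris_x(stream_len, rng):
    # One Morris counter run over a stream of the given length.
    x = 0
    for _ in range(stream_len):
        if rng.random() < 2.0 ** (-x):
            x += 1
    return x

def _morris_avg(stream_len, t, rng):
    # Averaging trick: with t = 3/(2*eps^2), each copy is bad w.p. <= 1/3.
    return sum(2 ** _morris_x(stream_len, rng) - 1 for _ in range(t)) / t

def morris_median(stream_len, t, s, rng):
    """Median trick: the median of s averaged estimates is bad only
    when at least half the copies are bad, which the Chernoff bound
    makes exponentially unlikely in s."""
    return statistics.median(_morris_avg(stream_len, t, rng) for _ in range(s))
```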

Chernoff bound

Let X_1, …, X_n be independent random variables with X_i ∈ [0, 1] for every i = 1, …, n. Let X = ∑_{i=1}^{n} X_i. Then for every 0 < ε < 1, it holds that

Pr[|X − E[X]| > ε · E[X]] ≤ 2 exp(−ε² · E[X] / 3).

Algorithms for Big Data (I) 17/19

Analysis of the median trick

For every i = 1, …, s, let Y_i be the indicator of the (good) event |m̂*_i − m| < ε · m. Then Y := ∑_{i=1}^{s} Y_i satisfies E[Y] ≥ (2/3)s.

If the median m̂** is bad (namely |m̂** − m| ≥ ε · m), then at least half of the m̂*_i's are bad. Equivalently, Y ≤ (1/2)s.

By the Chernoff bound,

Pr[|Y − E[Y]| ≥ (1/6)s] ≤ 2 exp(−s/108).

Algorithms for Big Data (I) 18/19

Therefore, for t = O(1/ε²) and s = O(log(1/δ)), we have

Pr[|m̂** − m| ≥ εm] < δ.

We use O((1/ε²) · log(1/δ) · log log n) bits of memory.

Algorithms for Big Data (I) 19/19