Sketching and Streaming for Distributions

Piotr Indyk (Massachusetts Institute of Technology)
Andrew McGregor (University of California, San Diego)

Main Material:
• Stable distributions, pseudo-random generators, embeddings, and data stream computation. Piotr Indyk (FOCS 2000)
• Sketching information divergences. Sudipto Guha, Piotr Indyk, Andrew McGregor (COLT 2007)
• Declaring independence via the sketching of sketches. Piotr Indyk, Andrew McGregor (SODA 2008)

The Problem

• List of m red values and m green values in [n]: 3,5,3,7,5,4,8,5,3,7,5,4,8,6,3,2,6,4,7,3,4, ...
• Define distributions (p_1, ..., p_n) and (q_1, ..., q_n)
• How “different” are p and q?
  • Variational: $\sum_i |p_i - q_i|$
  • Kullback-Leibler: $\sum_i p_i \log(p_i / q_i)$
  • Hellinger: $\sum_i (\sqrt{p_i} - \sqrt{q_i})^2$
  • Euclidean: $\sum_i (p_i - q_i)^2$

• More generally, how “different” are p and q under an f-divergence or a Bregman divergence?
  • $D_f(p, q) = \sum_i p_i \, f(q_i / p_i)$
  • $B_F(p, q) = \sum_i \left[ F(p_i) - F(q_i) - (p_i - q_i) F'(q_i) \right]$
  where f and F are convex and f(1) = 0.
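The four named measures can be recovered from the two general families. The following is a minimal sketch, not taken from the talk: the function names and the example vectors `p` and `q` are illustrative assumptions, and `f_divergence` assumes every p_i is strictly positive so that q_i / p_i is defined.

```python
import math

def variational(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def kullback_leibler(p, q):
    # Assumes p_i > 0 and q_i > 0 for all i.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def hellinger(p, q):
    return sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

def euclidean(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def f_divergence(p, q, f):
    # D_f(p, q) = sum_i p_i * f(q_i / p_i); assumes every p_i > 0.
    if any(pi <= 0 for pi in p):
        raise ValueError("this sketch requires p_i > 0 for all i")
    return sum(pi * f(qi / pi) for pi, qi in zip(p, q))

def bregman(p, q, F, dF):
    # B_F(p, q) = sum_i [F(p_i) - F(q_i) - (p_i - q_i) * F'(q_i)]
    return sum(F(pi) - F(qi) - (pi - qi) * dF(qi) for pi, qi in zip(p, q))

# Hypothetical example distributions over [3].
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

# f(u) = |1 - u| gives Variational, f(u) = -log u gives Kullback-Leibler,
# f(u) = (1 - sqrt(u))^2 gives Hellinger; F(x) = x^2 gives Euclidean.
as_f_variational = f_divergence(p, q, lambda u: abs(1 - u))
as_f_kl = f_divergence(p, q, lambda u: -math.log(u))
as_f_hellinger = f_divergence(p, q, lambda u: (1 - math.sqrt(u)) ** 2)
as_bregman_euclidean = bregman(p, q, lambda x: x * x, lambda x: 2 * x)
```

Each choice of f above is convex with f(1) = 0, matching the condition stated on the slide.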

The Catch...

• What if m and n are huge and you can’t store the list?
• Applications: monitoring internet traffic, I/O-efficient external memory, processing huge log files, database query planning, sensor networks, ...
• Data Stream Model:
  • No control over the order of the stream
  • Limited working memory, e.g., polylog(n, m) space
  • Limited time to process each element
• Previous work: quantiles, frequency moments, histograms, clustering, entropy, graph problems, ... See, e.g., Muthukrishnan, “Data Streams: Algorithms and Applications”.

Today’s Talk

• Sketching L_p distances (0 < p ≤ 2):
  • (1+ε)-approximation with probability 1−δ in $\tilde{O}(\varepsilon^{-2} \ln \delta^{-1})$ space
  • Stable distributions and pseudo-random generators
  • “Stable distributions, pseudo-random generators, embeddings & data stream computation” (Indyk, FOCS 2000)
• Impossibility of Extending to Other Divergences:
  • Can we sketch other divergences such as Hellinger?
  • Lower bounds via communication complexity
  • “Sketching information divergences” (Guha, Indyk, McGregor, COLT 2007)
• Using sketches to test independence:
  • Testing independence between data streams
  • “Declaring independence via the sketching of sketches” (Indyk, McGregor, SODA 2008)

1. Sketching L_p distances: p-stable distributions, pseudo-random generators
2. The Unsketchables: information divergences, communication complexity
3. Sketching Sketches: identifying correlations in data streams

Stable Distributions

• A p-stable distribution μ has the following property: if X, Y, Z ∼ μ are independent and a, b ∈ ℝ, then
  $aX + bY \sim (|a|^p + |b|^p)^{1/p} \, Z$
• Examples:
  • Normal(0,1) is 2-stable, with density $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$
  • Cauchy is 1-stable, with density $\frac{1}{\pi} \cdot \frac{1}{1 + x^2}$
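The stability property can be checked empirically by comparing medians of absolute values, which exist even for the heavy-tailed Cauchy case. This is a rough numerical illustration under assumed parameters (the sample size `N`, the seed, and the coefficients `a`, `b` are arbitrary choices), not a proof.

```python
import math
import random
import statistics

def cauchy(rng):
    # Standard Cauchy via the inverse CDF: tan(pi * (U - 1/2)), U uniform on [0, 1).
    return math.tan(math.pi * (rng.random() - 0.5))

def median_abs(samples):
    return statistics.median(abs(s) for s in samples)

rng = random.Random(0)   # fixed seed so the run is repeatable
N = 100_000              # number of samples (arbitrary)
a, b = 3.0, -4.0         # arbitrary real coefficients

# p = 1 (Cauchy): aX + bY should be distributed as (|a| + |b|) Z,
# and median(|Z|) = tan(pi / 4) = 1 for a standard Cauchy Z.
cauchy_mix = [a * cauchy(rng) + b * cauchy(rng) for _ in range(N)]
cauchy_ratio = median_abs(cauchy_mix) / (abs(a) + abs(b))

# p = 2 (Normal): aX + bY should be distributed as sqrt(a^2 + b^2) Z,
# and median(|Z|) is the 0.75 quantile of Normal(0,1).
normal_mix = [a * rng.gauss(0, 1) + b * rng.gauss(0, 1) for _ in range(N)]
median_abs_normal = statistics.NormalDist().inv_cdf(0.75)
normal_ratio = median_abs(normal_mix) / (math.hypot(a, b) * median_abs_normal)

print(cauchy_ratio, normal_ratio)   # both should be close to 1
```

The same median-of-absolute-values statistic is what the estimator on the following slides relies on.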

Approximating L_1 and L_2

• Let μ be a p-stable distribution (0 < p ≤ 2)
• Ideal Algorithm:
    For i = 1 to k:
      Let x be a length-n vector with independent entries x_j ~ μ
      Compute t_i = |x·(p−q)|
    Return median(t_1, t_2, ..., t_k) / median(|μ|)
  Easy to compute x·(p−q): for stream 3,5,3,7,5, ... compute x_3 − x_5 + x_3 − x_7 − x_5 − ... and scale. The sign of each term depends on which of the two lists the value belongs to.
• Lemma: Returns (1±ε) L_p(p−q) with probability 1−δ, if k = $\tilde{O}(\varepsilon^{-2} \ln \delta^{-1})$.
• Proof:
  • Each t_i ~ L_p(p−q)·|μ| by the p-stability property.
  • Apply Chernoff bounds.
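The ideal algorithm for p = 1 can be sketched as follows, using Cauchy entries so that median(|μ|) = 1. This is an illustrative sketch under assumptions not in the slides: the `(color, value)` stream encoding, the short example lists, the choice k = 2000, and the convention that p comes from the red list and q from the green list. It stores all k random vectors explicitly, which is the Ω(nk) cost that the next slide addresses.

```python
import math
import random
import statistics
from collections import Counter

def cauchy(rng):
    # Standard Cauchy (1-stable) via the inverse CDF.
    return math.tan(math.pi * (rng.random() - 0.5))

def l1_estimate(stream, n, m, k, seed=0):
    """Estimate L_1(p - q) from a stream of (color, value) pairs, value in 1..n."""
    rng = random.Random(seed)
    # k random vectors of length n, stored explicitly (the "ideal" algorithm).
    x = [[cauchy(rng) for _ in range(n)] for _ in range(k)]
    t = [0.0] * k
    for color, value in stream:
        sign = 1.0 if color == "red" else -1.0   # assumed: red adds, green subtracts
        for i in range(k):
            t[i] += sign * x[i][value - 1]
    # Scale counts by 1/m to get x.(p - q); median(|Cauchy|) = 1, so no further division.
    return statistics.median(abs(ti) / m for ti in t)

# Hypothetical short lists of m = 8 values each over [n] = [8].
red = [3, 5, 3, 7, 5, 4, 8, 5]
green = [3, 7, 5, 4, 8, 6, 3, 2]
n, m = 8, len(red)
stream = [pair for r, g in zip(red, green) for pair in (("red", r), ("green", g))]

estimate = l1_estimate(stream, n, m, k=2000)

# Exact value for comparison, computed with full storage.
cr, cg = Counter(red), Counter(green)
true_l1 = sum(abs(cr[j] - cg[j]) for j in range(1, n + 1)) / m
print(estimate, true_l1)
```

Each stream element touches all k counters, so the per-element update time grows with k, and the estimate concentrates around the exact value as k grows.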

Sketches and Space

• Sketch/Embedding into Small Dimension:
  • Let x_1, x_2, ..., x_k be length-n vectors with each entry (x_i)_j ~ μ
  • Let C(y) = (x_1·y, ..., x_k·y)
  • Approximate L_1(p−q) from C(p) and C(q)
  • CAUTION: Not an embedding into a normed space.
• Can we also construct the sketch in small space?
  • Storing all x_i requires Ω(nk) space.
  • Generate x_i with Nisan’s pseudo-random generator.
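The two ideas on this slide, linearity of C and regenerating the x_i instead of storing them, can be illustrated together. Loud assumption: the code below regenerates each entry from a seeded general-purpose PRNG keyed on (seed, i, j). That is only a stand-in to show the interface. It is not Nisan's generator and carries none of its space or correctness guarantees. The names `entry`, `sketch`, and `estimate_l1`, and the example vectors, are hypothetical.

```python
import math
import random
import statistics

def entry(seed, i, j):
    """Regenerate (x_i)_j on demand rather than storing it.

    Stand-in only: a seeded general-purpose PRNG, NOT Nisan's generator.
    """
    u = random.Random(f"{seed}:{i}:{j}").random()
    return math.tan(math.pi * (u - 0.5))   # standard Cauchy (1-stable)

def sketch(y, k, seed):
    """C(y) = (x_1 . y, ..., x_k . y) for a length-n vector y."""
    return [sum(entry(seed, i, j) * yj for j, yj in enumerate(y)) for i in range(k)]

def estimate_l1(cp, cq):
    # C is linear, so C(p) - C(q) = C(p - q); median(|Cauchy|) = 1.
    return statistics.median(abs(a - b) for a, b in zip(cp, cq))

# Hypothetical distributions over [8].
p = [0, 0, 2 / 8, 1 / 8, 3 / 8, 0, 1 / 8, 1 / 8]
q = [0, 1 / 8, 2 / 8, 1 / 8, 1 / 8, 1 / 8, 1 / 8, 1 / 8]
k, seed = 1000, 42

cp = sketch(p, k, seed)
cq = sketch(q, k, seed)
c_diff = sketch([a - b for a, b in zip(p, q)], k, seed)

estimate = estimate_l1(cp, cq)
true_l1 = sum(abs(a - b) for a, b in zip(p, q))
print(estimate, true_l1)
```

Because both parties derive the same entries from the shared seed, C(p) and C(q) can be computed separately and combined afterwards. The estimate is a median of absolute values, which is why the map is a sketch rather than an embedding into a normed space.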
