SLIDE 1/30

Fast and Near-Optimal Algorithms for Approximating Distributions by Histograms

Jayadev Acharya¹  Ilias Diakonikolas²  Chinmay Hegde¹  Jerry Li¹  Ludwig Schmidt¹

¹MIT  ²University of Edinburgh

June 1, 2015

SLIDE 2/30
Introduction
Motivating Example

You want a representation of f(i), the fraction of the world's population whose annual salary is i dollars.

[Plot: f(i) against salary, on a log-scale axis from 10² to 10¹⁰ dollars.]

Problem: Don't want to store ∼ 10¹⁰ elements.

Problem: Too many people in the world, so we can't get all the data.

SLIDE 3/30
Introduction
Motivating Example (cont.)

Problem: Don't want to store ∼ 10¹⁰ elements.

Solution: Store a concise representation as a k-histogram.

Definition. A k-histogram is a k-piecewise flat function h : {1, …, n} → ℝ.

[Plot: an example k-histogram on {1, …, 20}.]

Instead of storing f, efficiently find and store a k-histogram h (for some small value of k) so that ‖f − h‖₂² is small, where ‖g‖₂² = Σ_{i=1}^n g(i)² is the Sum-Squared Error or V-optimal error [IP95].

SLIDE 4/30
Introduction
Motivating Example (cont.)

Problem: Too many people in the world, so we can't get all the data.

Solution: Uniformly at random query people for their yearly income, and output a good k-histogram approximation to f using these answers.

I.e., draw independent samples from f, and use them to recover a good k-histogram approximation to f.

SLIDE 5/30
Introduction
Formal Problem Statement

Given k ∈ ℕ, ε > 0, and independent samples from an unknown distribution f supported on [n] = {1, …, n}, recover an (α·k)-histogram h so that w.h.p.

    ‖f − h‖₂² ≤ β · OPT_k(f) + ε ,

where for any function q : [n] → ℝ, OPT_k(q) = inf_{h : k-histogram} ‖q − h‖₂². (The ideal goal is α = β = 1.)

Sample Complexity: How many samples does our algorithm need?

Time Complexity: How fast does our algorithm run?

Gold Standard: An algorithm which achieves α = β = 1, takes an information-theoretically optimal number of samples, and runs in time linear in the number of samples taken.

SLIDE 6/30
Introduction
Previous Work

When given complete access to f:

    Algorithm                            Runtime                    α    β
    Basic DP [JKM+98]                    O(kn²)                     1    1
    Greedy Dual [JKM+98]                 O(n log OPT_k(f))          3    3
    Smart DPs [TGIK02, GGI+02, GKS06]    O(n + (k³ log² n)/δ²)      1    1 + δ
    This Work                            O(n)                       5    2

When given sample access to f:

[ILR12]: Õ((k²/ε²) log n) samples, Õ((k⁵/ε⁴) log² n) time, α = O(log 1/ε), β = 1.

This work: O(1/ε) samples, O(1/ε) time, α = 5, β = 2.

Our algorithm is sample optimal (up to constants) and runs in linear time.

SLIDES 7–8/30
Outline of Rest of Talk

1. An O(1/ε) Sample Upper Bound
2. The Greedy Merging Algorithm
3. Analysis
4. Experimental Evaluation
5. Conclusions

SLIDE 9/30
An O(1/ε) Sample Upper Bound

Let f̂_m denote the empirical distribution after drawing m samples X₁, …, X_m from f:

    f̂_m(i) = #{j : X_j = i} / m .

Key lemma:

Lemma. If m = O(1/ε), then ‖f − f̂_m‖₂² ≤ ε with probability 99/100.
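As a concrete illustration, here is a minimal Python sketch of forming f̂_m and measuring squared ℓ₂ distance (the function names and the use of numpy are my own, not from the talk):

    import numpy as np

    def empirical_distribution(samples, n):
        """Empirical distribution over [n] = {1, ..., n}: f_hat(i) = #{j : X_j = i} / m."""
        m = len(samples)
        return np.bincount(np.asarray(samples) - 1, minlength=n) / m

    def squared_l2(p, q):
        """Sum-squared (V-optimal) error ||p - q||_2^2."""
        return float(np.sum((p - q) ** 2))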

SLIDE 10/30
An O(1/ε) Sample Upper Bound (cont.)

Lemma. If m = O(1/ε), then ‖f − f̂_m‖₂² ≤ ε with probability 99/100.

Proof. We will show E[‖f − f̂_m‖₂²] ≤ ε.

    E[‖f − f̂_m‖₂²] = E[ Σ_{i=1}^n (f(i) − f̂_m(i))² ]
                   = Σ_{i=1}^n E[(f(i) − f̂_m(i))²]
                   = Σ_{i=1}^n Var(f̂_m(i)) .

SLIDE 11/30
An O(1/ε) Sample Upper Bound (cont.)

Proof (cont.)

    E[‖f − f̂_m‖₂²] = Σ_{i=1}^n Var(f̂_m(i)) .

But f̂_m(i) ∼ (1/m)·Bin(m, f(i)), and Var(Bin(n, p)) = np(1 − p). So

    E[‖f − f̂_m‖₂²] = Σ_{i=1}^n (1/m²) · m · f(i)(1 − f(i)) ≤ (1/m) Σ_{i=1}^n f(i) = 1/m ≤ ε .

(The high-probability statement then follows from Markov's inequality, after adjusting the constant in m = O(1/ε).)
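A quick numerical sanity check of this bound (my own illustration, not from the talk; it reuses empirical_distribution and squared_l2 from the sketch above):

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, trials = 1000, 200, 500

    f = rng.random(n)
    f /= f.sum()                                   # an arbitrary distribution on [n]

    errs = [squared_l2(f, empirical_distribution(rng.choice(n, size=m, p=f) + 1, n))
            for _ in range(trials)]

    # The proof gives E[||f - f_hat||_2^2] = sum_i f(i)(1 - f(i))/m <= 1/m.
    print(np.mean(errs), float((f * (1 - f)).sum() / m), 1 / m)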

SLIDE 12/30
An O(1/ε) Sample Upper Bound (cont.)

Lemma. If m = O(1/ε), then ‖f − f̂_m‖₂² ≤ ε with probability 99/100.

Corollary. Let m = O(1/ε), and let h be such that ‖h − f̂_m‖₂² ≤ β · OPT_k(f̂_m). Then w.h.p.

    ‖h − f‖₂² ≤ β · OPT_k(f) + ε .

This reduces to a completely deterministic problem!

SLIDE 13/30
Outline (next: The Greedy Merging Algorithm)

SLIDE 14/30
The Greedy Merging Algorithm
Main Result

Main Algorithmic Result. An algorithm which, given k ∈ ℕ and q : [n] → ℝ supported on m elements, runs in time O(m) and outputs a 5k-histogram h so that

    ‖h − q‖₂² ≤ 2 · OPT_k(q) .

Corollary. An algorithm for learning histogram approximations that takes O(1/ε) samples, runs in time O(1/ε), and achieves α = 5 and β = 2.

Proof. Draw m = O(1/ε) samples and form the empirical distribution f̂_m. Run the above algorithm on f̂_m.

SLIDE 15/30
The Greedy Merging Algorithm
Flattening

Definition. Let q : [n] → ℝ, and let I ⊆ [n] be an interval. Let q_I be the constant function on I which is identically (1/|I|) Σ_{i∈I} q(i). We call this the flattening of q over I. Let

    flat-err_q(I) = Σ_{i∈I} (q(i) − q_I(i))² .

[Plot: a function on an interval and its flattening.]

Lemma. For any flat function φ on I, flat-err_q(I) ≤ Σ_{i∈I} (q(i) − φ(i))².
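Since flat-err drives the algorithm below, here is a sketch of computing it in O(1) per interval via prefix sums of q and q², using the standard identity Σ_{i∈I} (q(i) − q_I(i))² = Σ_{i∈I} q(i)² − (Σ_{i∈I} q(i))² / |I| (the class name and the 0-indexed, inclusive interval endpoints are my own choices, not from the talk):

    import numpy as np

    class FlatErr:
        """flat-err_q over intervals in O(1) each, after O(n) preprocessing."""
        def __init__(self, q):
            q = np.asarray(q, dtype=float)
            self.s1 = np.concatenate(([0.0], np.cumsum(q)))       # prefix sums of q
            self.s2 = np.concatenate(([0.0], np.cumsum(q ** 2)))  # prefix sums of q^2

        def flat_err(self, a, b):
            """flat-err_q([a, b]) for a 0-indexed, inclusive interval [a, b]."""
            s = self.s1[b + 1] - self.s1[a]                       # sum of q over [a, b]
            ss = self.s2[b + 1] - self.s2[a]                      # sum of q^2 over [a, b]
            return ss - s * s / (b - a + 1)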

SLIDE 16/30
The Greedy Merging Algorithm
Partitions

Definition. A partition of [n] is a set of disjoint intervals I₁, …, I_r so that ∪ᵢ Iᵢ = [n].

Any k-histogram induces a partition of [n] (cut at its jumps). Conversely, any partition induces a unique k-histogram (for our purposes): flatten q over each interval, as in the sketch below.

[Plots: a k-histogram and its induced partition; a partition and its induced histogram.]
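A small sketch of that induced histogram (the function name and indexing conventions are mine):

    import numpy as np

    def histogram_from_partition(q, partition):
        """Flatten q over each interval of a partition of [0, n).

        partition: disjoint (a, b) pairs, 0-indexed and inclusive, covering [0, n).
        """
        q = np.asarray(q, dtype=float)
        h = np.empty_like(q)
        for a, b in partition:
            h[a:b + 1] = q[a:b + 1].mean()
        return h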

SLIDE 17/30
The Greedy Merging Algorithm
Algorithm Description

q is m-sparse ⇒ q is an O(m)-histogram. Let I be the partition of [n] that the jumps of q induce.

While |I| ≥ 5k:
  - Let I = {I₁, …, I_r} where the I_j are in order.
  - Form J₁ = (I₁ ∪ I₂), J₂ = (I₃ ∪ I₄), …, J_{r/2} = (I_{r−1} ∪ I_r).
  - For ℓ = 1, …, r/2, compute e_ℓ = flat-err_q(J_ℓ).
  - Let L ⊆ {1, …, r/2} be the set of 2k indices with largest e_ℓ.
  - Form I′ by:
      - For ℓ ∈ L, include I_{2ℓ−1} and I_{2ℓ}.
      - For ℓ ∉ L, include J_ℓ.
  - Set I ← I′.

Output the flattening of q over the intervals in I.
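Putting the loop above into code, here is my own sketch under one reading of the slides: 0-indexed intervals, FlatErr from the earlier sketch for O(1) flattening errors, an odd leftover interval passed through unmerged, and the loop condition taken as |I| > 5k (rather than ≥ 5k) so that the integer pair counts always make progress.

    def greedy_merge(q, k):
        """Greedy merging: a partition of [0, n) with at most 5k intervals."""
        fe = FlatErr(q)
        n = len(q)
        # Start from the partition induced by the jumps of q.
        intervals, a = [], 0
        for i in range(1, n):
            if q[i] != q[i - 1]:
                intervals.append((a, i - 1))
                a = i
        intervals.append((a, n - 1))

        while len(intervals) > 5 * k:
            # Pair up consecutive intervals: J_l = I_{2l-1} U I_{2l}.
            pairs = [(intervals[2 * l], intervals[2 * l + 1])
                     for l in range(len(intervals) // 2)]
            errs = [fe.flat_err(left[0], right[1]) for left, right in pairs]
            # The 2k pairs with the largest flattening error stay un-merged.
            keep = set(sorted(range(len(pairs)), key=errs.__getitem__)[-2 * k:])
            merged = []
            for l, (left, right) in enumerate(pairs):
                if l in keep:
                    merged.extend([left, right])
                else:
                    merged.append((left[0], right[1]))
            if len(intervals) % 2 == 1:            # odd leftover passes through
                merged.append(intervals[-1])
            intervals = merged
        return intervals

Combined with the earlier sketches, h = histogram_from_partition(q, greedy_merge(q, k)) would then be the output histogram with at most 5k pieces.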

SLIDE 18/30
The Greedy Merging Algorithm
Example (k = 2)

[Plots: the input, and the partition maintained by the algorithm at iteration 0, during iteration i (pairing and merging), and at iteration i + 1.]

SLIDE 19/30
Outline (next: Analysis)

SLIDE 20/30
Analysis
Runtime Analysis

Theorem. The greedy merging algorithm runs in time O(m).

Proof Sketch.
Each iteration can be performed in time proportional to the number of intervals left in the partition in that iteration.
Let s_j be the number of intervals after the j-th iteration of the algorithm. Then

    s_{j+1} = (s_j − 4k)/2 + 4k = (s_j + 4k)/2 ≤ (9/10)·s_j ,   as long as s_j ≥ 5k.

Thus the interval counts shrink geometrically, and the runtime is dominated by the runtime of the first iteration, which is O(m).
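A quick numeric illustration of this recurrence (my own, not from the talk): starting from s₀ = 10,000 intervals with k = 100, the counts collapse geometrically toward the 5k threshold.

    # s_{j+1} = (s_j - 4k)/2 + 4k = (s_j + 4k)/2, while s_j >= 5k.
    k, s = 100, 10_000
    while s >= 5 * k:
        s = (s + 4 * k) // 2
        print(s)
    # Prints 5200, 2800, 1600, 1000, 700, 550, 475: geometric until ~5k = 500.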

SLIDE 21/30
Analysis
Error Analysis

Theorem. Let h be the output of our algorithm. Then ‖h − q‖₂² ≤ 2 · OPT_k(q).

Proof. Let h* be an optimal k-histogram, i.e., ‖h* − q‖₂² = OPT_k(q).
Let I = {I₁, …, I_{5k}} be the set of intervals we produce. Partition I:
  - Let F be the set of intervals in I on which h* has no jumps.
  - Let J be the set of intervals in I on which h* has jumps.

    ‖h − q‖₂² = Σ_{I∈F} flat-err_q(I) + Σ_{I∈J} flat-err_q(I) .

We will bound each term separately.

SLIDE 22/30
Analysis
Error Analysis (cont.)

Proof (cont.) Error on F:
Fix I ∈ F. Since h* is flat on I, flat-err_q(I) ≤ Σ_{i∈I} (q(i) − h*(I))².
Thus the squared error we have on F is at most the squared error of h* on F, i.e.,

    Σ_{I∈F} flat-err_q(I) ≤ Σ_{I∈F} Σ_{i∈I} (q(i) − h*(I))² ≤ OPT_k(q) .

Notice this is true for any set of intervals on which h* is flat.

SLIDE 23/30
Analysis
Error Analysis (cont.)

Proof (cont.) Error on J:
Fix I ∈ J; WLOG assume it was merged in some iteration.
Let J₁, …, J_{2k} be the 2k candidate intervals which were not merged in that iteration.
For ℓ = 1, …, 2k, we know flat-err_q(I) ≤ flat-err_q(J_ℓ).
h* has at most k jumps ⇒ it has no jumps in at least k of the J₁, …, J_{2k}; WLOG assume J₁, …, J_k have no jumps of h*. Hence

    Σ_{ℓ=1}^{k} flat-err_q(J_ℓ) ≤ OPT_k(q)   ⇒   flat-err_q(I) ≤ (1/k) · OPT_k(q) .

So, since |J| ≤ k (each of h*'s at most k jumps lands in at most one interval),

    Σ_{I∈J} flat-err_q(I) ≤ |J| · (1/k) · OPT_k(q) ≤ OPT_k(q) .

Combining the two bounds gives ‖h − q‖₂² ≤ 2 · OPT_k(q). ∎

SLIDE 24/30
Outline (next: Experimental Evaluation)

SLIDE 25/30
Experimental Evaluation
Error Rate

[Plots: mean ℓ₂-error vs. number of samples (up to 10,000), with OPT₁₀ or OPT₅₀ marked, on three datasets: 'hist' (noisy histogram, OPT₁₀), 'poly' (noisy polynomial, OPT₁₀), and 'dow' (Dow Jones index, OPT₅₀), comparing exactdp, merging, and merging2.]

exactdp: Exact DP algorithm for k-histograms.
merging: Merging with parameters set to produce a (2k + 1)-histogram.
merging2: Merging with parameters set to produce a (k + 1)-histogram.

SLIDE 26/30
Experimental Evaluation

                           exactdp   merging   merging2   fastmerging   fastmerging2     dual
    hist  Error (ℓ₂)          16.1      16.4       16.6          17.0           21.5     25.8
          Error (relative)    1.00      1.02       1.03          1.06           1.34     1.60
          Time (ms)         55.391     0.038      0.037         0.020          0.014    0.108
          Time (relative)    3,910       2.7        2.6           1.4            1.0      7.6
    poly  Error (ℓ₂)         105.1      85.9      111.6          85.6          111.7    124.0
          Error (relative)    1.00      0.82       1.06          0.81           1.06     1.18
          Time (ms)        858.064     0.112      0.112         0.048          0.041    0.446
          Time (relative)   20,924       2.7        2.7           1.2            1.0     10.9
    dow   Error (ℓ₂)         904.0     733.1    1,046.1         727.5        1,079.1  1,838.1
          Error (relative)    1.00      0.81       1.16          0.80           1.19     2.03
          Time (ms)      73576.921     0.510      0.478         0.205          0.173    1.849
          Time (relative)  425,540       3.0        2.8           1.2            1.0     10.7

SLIDE 27/30
Outline (next: Conclusions)

SLIDE 28/30
Conclusions
Extensions of the Algorithm

What if you don't know what k to pick? We give a hierarchical version of our algorithm which produces good histogram approximations for all values of k ≥ 1.

What if you want more powerful representations? We give a natural generalization of our algorithm which produces good piecewise polynomial approximations.

What about different norms? [ADLS15] develops a similar (but more complicated) algorithm for approximation in ℓ₁ (or total variation distance).

SLIDE 29/30
Conclusions

We give the first sample-optimal, linear-time algorithm for learning histogram approximations in ℓ₂² error.

We reduce to giving a linear-time algorithm for histogram approximation of sparse distributions, and solve that via a simple, novel greedy merging algorithm.

Open problems: Streaming variants? Parallel variants? Is our error analysis tight? Can we get better α, β?

SLIDE 30/30
Conclusions

Thank you!