Estimating Information Flow in Deep Neural Networks

slide-1
SLIDE 1

Estimating Information Flow in Deep Neural Networks

Ziv Goldfeld, Ewout van den Berg, Kristjan Greenewald, Igor Melnyk, Nam Nguyen, Brian Kingsbury and Yury Polyanskiy

MIT, IBM Research, MIT-IBM Watson AI Lab

International Conference on Machine Learning June 12th, 2019

slide-9
SLIDE 9

Deep Learning - What’s Under the Hood?

Lacking Theory: Macroscopic understanding of Deep Learning

  • What drives the evolution of internal representations?
  • What are properties of learned representations?
  • How do fully trained networks process information?

Attempts to Understand Effectiveness of DL:

◮ Structure of loss landscape [Saxe et al.’14, Choromanska et al.’15, Kawaguchi’16, Keskar et al.’17]
◮ Wavelets and sparse coding [Bruna-Mallat’13, Giryes et al.’16, Papyan et al.’16]
◮ Adversarial examples [Szegedy et al.’14, Nguyen et al.’17, Liu et al.’16, Cisse et al.’16]
◮ Information Bottleneck Theory [Tishby-Zaslavsky’15, Shwartz-Tishby’17, Saxe et al.’18, Gabrié et al.’18]

⋆ Goal: Mathematically analyze IB theory & test ‘Compression’

slide-13
SLIDE 13

Setup and Preliminaries

(Deterministic) Feedforward DNN: Each layer $T_\ell = f_\ell(T_{\ell-1})$

Joint Distribution: $P_{X,Y} \implies P_{X,Y} \cdot P_{T_1,\dots,T_L|X}$

Information Plane: Evolution of $\big(I(X;T_\ell),\, I(Y;T_\ell)\big)$ during training

  • $I(A;B) = D_{\mathrm{KL}}(P_{A,B} \,\|\, P_A \otimes P_B)$
  • Discrete case: $I(A;B) = \sum_{a,b} P_{A,B}(a,b) \log \frac{P_{A,B}(a,b)}{P_A(a)\,P_B(b)}$

[Figure: feedforward DNN diagram with label, feature/image (input layer), hidden layers, and output layer; example classes: Cat, Dog]
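The discrete mutual information formula on this slide can be evaluated directly from a joint probability table. The following is a minimal sketch, not code from the talk; the function name `mutual_information` and the two toy joint distributions are illustrative assumptions.

```python
import math

def mutual_information(joint):
    """I(A;B) in nats from a joint pmf given as a nested list joint[a][b]."""
    p_a = [sum(row) for row in joint]            # marginal of A
    p_b = [sum(col) for col in zip(*joint)]      # marginal of B
    mi = 0.0
    for a, row in enumerate(joint):
        for b, p_ab in enumerate(row):
            if p_ab > 0.0:                       # 0 * log(0) is taken as 0
                mi += p_ab * math.log(p_ab / (p_a[a] * p_b[b]))
    return mi

# Toy examples (assumed for illustration): independent bits vs. identical bits.
independent = [[0.25, 0.25], [0.25, 0.25]]
identical = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_information(independent))           # no shared information
print(mutual_information(identical), math.log(2))
```

Independent variables share no information, while two identical fair bits share exactly one bit, i.e. ln(2) nats.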

slide-17
SLIDE 17

Setup and Preliminaries

(Deterministic) Feedforward DNN: Each layer $T_\ell = f_\ell(T_{\ell-1})$

IB Theory Claim: Training comprises 2 phases [Shwartz-Tishby’17]

  1. Fitting: $I(Y;T_\ell)$ and $I(X;T_\ell)$ rise (short)
  2. Compression: $I(X;T_\ell)$ slowly drops (long)

[Figure: feedforward DNN diagram with label, feature/image (input layer), hidden layers, and output layer; example classes: Cat, Dog]

slide-22
SLIDE 22

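Vacuous Mutual Information & Mis-Estimation

Proposition (Informal)

  • Det. DNNs with strictly monotone nonlinearities (e.g., tanh or sigmoid)
    ⟹ $I(X;T_\ell)$ is independent of the DNN parameters
  • $I(X;T_\ell)$ is a.s. infinite (continuous $X$) or constant $H(X)$ (discrete $X$)

[Figure: the DNN maps the feature space $\mathcal{X}$, with $X \sim \mathrm{Unif}(\mathcal{X})$ and $|\mathcal{X}| = 3000$, to the internal representation space $\mathcal{T}_\ell$, with $T_\ell = \tilde f_\ell(X)$, $T_\ell \sim \mathrm{Unif}(\mathcal{T}_\ell)$ and $|\mathcal{T}_\ell| = |\mathcal{X}| = 3000$]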
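The discrete case of the proposition can be checked on a toy example. The sketch below is an illustration, not the authors' experiment: it assumes a single tanh neuron and 3000 equally spaced scalar inputs (chosen only to mirror $|\mathcal{X}| = 3000$), and computes $I(X;T) = H(T)$ for several arbitrary parameter choices. The parameters are kept moderate so that floating-point saturation of tanh does not merge distinct outputs.

```python
import math
from collections import Counter

def entropy(values):
    """Empirical Shannon entropy (nats) of a list of hashable values."""
    n = len(values)
    return -sum(c / n * math.log(c / n) for c in Counter(values).values())

# 3000 distinct, equally likely inputs (assumed toy feature space).
xs = [-1.0 + 2.0 * i / 2999 for i in range(3000)]

def mi_deterministic_neuron(w, b):
    # T is a deterministic function of X, so H(T|X) = 0 and I(X;T) = H(T).
    ts = [math.tanh(w * x + b) for x in xs]
    return entropy(ts)

for w, b in [(0.5, 0.0), (1.0, -0.3), (2.0, 0.7)]:
    print(w, b, mi_deterministic_neuron(w, b), math.log(3000))
```

Because tanh is strictly monotone, distinct inputs stay distinct, so every parameter setting yields the same value $H(X) = \ln 3000$.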

slide-27
SLIDE 27

Vacuous Mutual Information & Mis-Estimation

Proposition (Informal)

  • Det. DNNs with strictly monotone nonlinearities (e.g., tanh or sigmoid)
    ⟹ $I(X;T_\ell)$ is independent of the DNN parameters
  • $I(X;T_\ell)$ is a.s. infinite (continuous $X$) or constant $H(X)$ (discrete $X$)

Past Works: Use binning-based proxy of $I(X;T_\ell)$ (aka quantization)

  1. For non-negligible bin size, $I\big(X;\mathrm{Bin}(T_\ell)\big) \neq I(X;T_\ell)$
  2. $I\big(X;\mathrm{Bin}(T_\ell)\big)$ is highly sensitive to the user-defined bin size

⊛ Real Problem: Mismatch between $I(X;T_\ell)$ measurement and model

[Figure: binned MI (nats) vs. epoch (log scale, $10^0$ to $10^4$) for Layers 1 to 5, with one panel per bin size: 0.0001, 0.001, 0.01, 0.1]
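The sensitivity to bin size is easy to reproduce on toy data. The sketch below is not the experiment behind the figure: it assumes a hypothetical one-neuron "layer" (3000 equally likely inputs passed through tanh) and a simple floor-based quantizer, and reports the binned entropy for the four bin sizes named on the slide.

```python
import math
from collections import Counter

def binned_entropy(values, bin_size):
    """Entropy (nats) of values after quantizing to bins of width bin_size."""
    bins = [math.floor(v / bin_size) for v in values]
    n = len(bins)
    return -sum(c / n * math.log(c / n) for c in Counter(bins).values())

# Hypothetical deterministic representation: tanh of 3000 equally likely inputs.
xs = [-2.0 + 4.0 * i / 2999 for i in range(3000)]
ts = [math.tanh(x) for x in xs]

results = {b: binned_entropy(ts, b) for b in (0.0001, 0.001, 0.01, 0.1)}
for b, h in results.items():
    print(f"bin size = {b}: H(Bin(T)) = {h:.3f} nats")
```

The representation itself never changes here, yet the reported value moves by several nats as the bin size varies, while the true $I(X;T)$ stays at $\ln 3000$.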

slide-32
SLIDE 32

Auxiliary Framework - Noisy Deep Neural Networks

Modification: Inject (small) Gaussian noise to neurons’ output

Formally: $T_\ell = S_\ell + Z_\ell$, where $S_\ell \triangleq f_\ell(T_{\ell-1})$ and $Z_\ell \sim \mathcal{N}(0, \sigma^2 I_d)$

  ⟹ $X \mapsto T_\ell$ is a parametrized channel (by DNN’s parameters)
  ⟹ $I(X;T_\ell)$ is a function of parameters!

⊛ Challenge: How to accurately track $I(X;T_\ell)$?

[Figure: noisy DNN chain $X \to f_1 \to S_1 \to (+Z_1) \to T_1 \to f_2 \to S_2 \to (+Z_2) \to T_2 \to \cdots$]
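The noisy-layer recursion $T_\ell = f_\ell(T_{\ell-1}) + Z_\ell$ can be sketched in a few lines. This is an illustrative sketch only: the tanh activation, the 2-3-1 architecture, the weights, and the function names `noisy_layer` and `noisy_forward` are all assumptions, not details taken from the talk.

```python
import math
import random

def noisy_layer(t_prev, weights, biases, sigma, rng):
    """One noisy layer: T = S + Z with S = tanh(W t_prev + b), Z ~ N(0, sigma^2 I)."""
    s = [math.tanh(sum(w * t for w, t in zip(row, t_prev)) + b)
         for row, b in zip(weights, biases)]
    return [s_k + rng.gauss(0.0, sigma) for s_k in s]

def noisy_forward(x, layers, sigma, rng):
    """Propagate x through all layers, returning the list [T_1, ..., T_L]."""
    reps, t = [], x
    for weights, biases in layers:
        t = noisy_layer(t, weights, biases, sigma, rng)
        reps.append(t)
    return reps

rng = random.Random(0)
# Hypothetical 2-3-1 network with arbitrary illustrative weights.
layers = [
    ([[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8]], [0.0, 0.1, -0.2]),
    ([[1.0, -1.0, 0.5]], [0.05]),
]
print(noisy_forward([1.0, -1.0], layers, sigma=0.1, rng=rng))
```

Setting `sigma=0.0` recovers the deterministic network, which makes it easy to compare the two models on the same weights.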

slide-38
SLIDE 38

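High-Dim. & Nonparametric Functional Estimation

Distill $I(X;T_\ell)$ Estimation into Noisy Differential Entropy Estimation: Estimate $h(P * \mathcal{N}_\sigma)$ from $n$ i.i.d. samples $S^n \triangleq (S_i)_{i=1}^n$ of $P \in \mathcal{F}_d$ (nonparametric class) and knowledge of $\mathcal{N}_\sigma$ (Gaussian measure $\mathcal{N}(0, \sigma^2 I_d)$).

Theorem (ZG-Greenewald-Polyanskiy-Weed’19): Sample complexity of any accurate estimator (additive gap $\eta$) is $\Omega\!\left(\frac{2^d}{\eta d}\right)$.

Structured Estimator⋆: $\hat h(S^n, \sigma) \triangleq h(\hat P_n * \mathcal{N}_\sigma)$, where $\hat P_n = \frac{1}{n} \sum_{i=1}^n \delta_{S_i}$

⋆ Efficient and parallelizable

Theorem (ZG-Greenewald-Polyanskiy-Weed’19): For $\mathcal{F}^{(\mathrm{SG})}_{d,K} \triangleq \{P : P \text{ is } K\text{-subgaussian in } \mathbb{R}^d\}$, $d \ge 1$ and $\sigma > 0$, we have

$$\sup_{P \in \mathcal{F}^{(\mathrm{SG})}_{d,K}} \mathbb{E}_{S^n} \left| h(P * \mathcal{N}_\sigma) - \hat h(S^n, \sigma) \right| \le c^d_{\sigma,K} \cdot n^{-1/2}$$

Optimality: $\hat h(S^n, \sigma)$ attains sharp dependence on both $n$ and $d$!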
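The structured estimator is the differential entropy of an $n$-component Gaussian mixture centred at the samples. The slides do not say how that entropy is evaluated, so the sketch below assumes plain Monte Carlo integration: draw points from the mixture and average the negative log-density. The function names and the sanity-check setup are illustrative assumptions.

```python
import math
import random

def log_mixture_density(t, centers, sigma):
    """log of the density of (1/n) sum_i N(s_i, sigma^2 I_d) at the point t."""
    d, n = len(t), len(centers)
    exponents = [-sum((tk - sk) ** 2 for tk, sk in zip(t, s)) / (2 * sigma ** 2)
                 for s in centers]
    m = max(exponents)                               # log-sum-exp for stability
    lse = m + math.log(sum(math.exp(e - m) for e in exponents))
    return lse - math.log(n) - 0.5 * d * math.log(2 * math.pi * sigma ** 2)

def plug_in_entropy(samples, sigma, num_mc, rng):
    """Monte Carlo approximation of h(P_hat_n * N_sigma), in nats."""
    total = 0.0
    for _ in range(num_mc):
        s = rng.choice(samples)                      # draw a mixture component
        t = [sk + rng.gauss(0.0, sigma) for sk in s] # add Gaussian noise
        total -= log_mixture_density(t, samples, sigma)
    return total / num_mc

rng = random.Random(0)
# Sanity check: one sample point gives a single Gaussian with known entropy.
estimate = plug_in_entropy([[0.0, 0.0]], sigma=0.5, num_mc=20000, rng=rng)
exact = 0.5 * 2 * math.log(2 * math.pi * math.e * 0.5 ** 2)
print(estimate, exact)
```

Each Monte Carlo draw is independent of the others, which is one way to read the slide's remark that the estimator is parallelizable.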

slide-42
SLIDE 42

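I(X; Tℓ) Dynamics - Illustrative Minimal Example

Single Neuron Classification:

  • Input: $X \sim \mathrm{Unif}\{\pm 1, \pm 3\}$, with $\mathcal{X}_{y=-1} \triangleq \{-3, -1, 1\}$ and $\mathcal{X}_{y=1} \triangleq \{3\}$
  • Model: $X \to \tanh(wX + b) = S_{w,b} \to (+Z), \ Z \sim \mathcal{N}(0, \sigma^2) \to T$

[Figure: plot of $S_{1,0}$ against the input $X$]

⊛ Center & sharpen transition (⟺ increase $w$ and keep $b = -2w$)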

slide-44
SLIDE 44

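I(X; Tℓ) Dynamics - Illustrative Minimal Example

Single Neuron Classification:

  • Input: $X \sim \mathrm{Unif}\{\pm 1, \pm 3\}$, with $\mathcal{X}_{y=-1} \triangleq \{-3, -1, 1\}$ and $\mathcal{X}_{y=1} \triangleq \{3\}$
  • Model: $X \to \tanh(wX + b) = S_{w,b} \to (+Z), \ Z \sim \mathcal{N}(0, \sigma^2) \to T$

[Figure: plots of $S_{1,0}$ and $S_{5,-10}$ against the input $X$]

✓ Correct classification performance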
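The two parameter settings named in the figure, $S_{1,0}$ and $S_{5,-10}$, can be compared numerically. In the sketch below the sign-based decision rule is an assumption made for illustration; the slides only state that the sharpened, centred neuron classifies correctly.

```python
import math

INPUTS = [-3, -1, 1, 3]
LABELS = [-1, -1, -1, 1]     # X_{y=-1} = {-3, -1, 1}, X_{y=1} = {3}

def symbols(w, b):
    """Noise-free neuron outputs S_{w,b} = tanh(w x + b) for each input x."""
    return [math.tanh(w * x + b) for x in INPUTS]

def sign_predictions(w, b):
    """Hypothetical decision rule: predict the sign of S_{w,b}."""
    return [1 if s > 0 else -1 for s in symbols(w, b)]

for w, b in [(1, 0), (5, -10)]:
    print((w, b), [round(s, 4) for s in symbols(w, b)],
          sign_predictions(w, b) == LABELS)
```

With $b = -2w$ the pre-activation is $w(x - 2)$, so the tanh transition sits at $x = 2$, between the two classes, and grows sharper as $w$ increases.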

slide-50
SLIDE 50

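I(X; Tℓ) Dynamics - Illustrative Minimal Example

Single Neuron Classification:

  • Input: $X \sim \mathrm{Unif}\{\pm 1, \pm 3\}$, with $\mathcal{X}_{y=-1} \triangleq \{-3, -1, 1\}$ and $\mathcal{X}_{y=1} \triangleq \{3\}$
  • Model: $X \to \tanh(wX + b) = S_{w,b} \to (+Z), \ Z \sim \mathcal{N}(0, \sigma^2) \to T$

Mutual Information: $I(X;T) = I(S_{w,b};\, S_{w,b} + Z)$

  ⟹ $I(X;T)$ is # bits (nats) transmittable over AWGN with symbols
     $\mathcal{S}_{w,b} \triangleq \{\tanh(-3w+b),\ \tanh(-w+b),\ \tanh(w+b),\ \tanh(3w+b)\} \to \{\pm 1\}$

[Figure: mutual information (nats, axis ticks 0.5 to 1.5) vs. epoch (log scale, $10^0$ to $10^6$), with reference levels ln(2), ln(3), ln(4)]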
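For this one-dimensional example, $I(X;T) = h(T) - h(Z)$ can be computed by numerical integration, since $T$ has a four-component Gaussian mixture density. The sketch below is an illustration under assumed values of $w$, $b$ and $\sigma$ and an assumed integration grid; it is not the computation behind the figure.

```python
import math

INPUTS = [-3, -1, 1, 3]           # X ~ Unif{±1, ±3}

def mutual_information(w, b, sigma, grid_per_sigma=20):
    """I(X;T) in nats for T = tanh(wX + b) + Z, Z ~ N(0, sigma^2)."""
    symbols = [math.tanh(w * x + b) for x in INPUTS]
    step = sigma / grid_per_sigma
    lo, hi = -1.0 - 8.0 * sigma, 1.0 + 8.0 * sigma   # tanh outputs lie in (-1, 1)
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    h_t = 0.0
    for k in range(int((hi - lo) / step) + 1):
        t = lo + k * step
        # Density of T: equal-weight Gaussian mixture centred at the symbols.
        p = sum(norm * math.exp(-((t - s) ** 2) / (2.0 * sigma ** 2))
                for s in symbols) / len(symbols)
        if p > 0.0:
            h_t -= p * math.log(p) * step            # Riemann sum for h(T)
    h_z = 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)  # h(T|X) = h(Z)
    return h_t - h_z

# Sharpen the transition while keeping it centred (b = -2w), as on the slide.
for w in (1, 2, 5):
    print(w, round(mutual_information(w, -2 * w, sigma=0.1), 3))
```

When all four symbols are far apart relative to $\sigma$, the result approaches $\ln 4$. When three symbols merge near $-1$ and one sits near $+1$, it approaches the entropy of the cluster probabilities $(3/4, 1/4)$, roughly 0.562 nats, which is below the $\ln 2$ reference level.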

slide-55
SLIDE 55

Clustering of Representations - Larger Networks

Noisy version of DNN from [Shwartz-Tishby’17]:

  • Binary Classification: 12-bit input & 12–10–7–5–4–3–2 tanh MLP
  • Verified in multiple additional experiments

  ⟹ Compression of $I(X;T_\ell)$ driven by clustering of representations

slide-62
SLIDE 62

Circling Back to Deterministic DNNs

$I(X;T_\ell)$ is constant/infinite ⟹ Doesn’t measure clustering

Reexamine Measurements: Computed $I\big(X;\mathrm{Bin}(T_\ell)\big) = H\big(\mathrm{Bin}(T_\ell)\big)$

  • $H\big(\mathrm{Bin}(T_\ell)\big)$ measures clustering (maximized by uniform distribution)
  • Test: $I(X;T_\ell)$ and $H\big(\mathrm{Bin}(T_\ell)\big)$ highly correlated in noisy DNNs⋆

  ⟹ Past works not measuring MI but clustering (via binned-MI)!

⋆ When bin size chosen ∝ noise std.

By-Product Result: Refute ‘compression (tight clustering) improves generalization’ claim [Come see us at poster #96 for details]
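The identity between the binned mutual information and the binned entropy follows in one step from the network being deterministic. The derivation below is a worked restatement for clarity; the bin-count symbol $m$ is introduced here for illustration.

```latex
\begin{align*}
I\big(X;\mathrm{Bin}(T_\ell)\big)
  &= H\big(\mathrm{Bin}(T_\ell)\big) - H\big(\mathrm{Bin}(T_\ell)\,\big|\,X\big) \\
  &= H\big(\mathrm{Bin}(T_\ell)\big)
  && \text{since } \mathrm{Bin}(T_\ell) \text{ is a deterministic function of } X, \\
H\big(\mathrm{Bin}(T_\ell)\big) &\le \log m
  && \text{with equality iff the } m \text{ occupied bins are equally likely.}
\end{align*}
```

Tightly clustered representations occupy few bins unevenly and give a small entropy, while spread-out representations approach the uniform upper bound.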

slide-71
SLIDE 71

Summary

Reexamined Information Bottleneck Compression:

◮ I(X; T) fluctuations in det. DNNs are theoretically impossible
◮ Yet, past works presented (binned) I(X; T) dynamics during training

Noisy DNN Framework: Studying IT quantities over DNNs

◮ Optimal estimator (in n and d) for accurate MI estimation
◮ Clustering of learned representations is the source of compression

Clarify Past Observations of Compression: in fact show clustering

◮ Compression/clustering and generalization are not necessarily related

Thank you!

slide-76
SLIDE 76

Clustering of Representations - Larger Networks

Noisy version of DNN from [Shwartz-Tishby’17]:

  • Binary Classification: 12-bit input & 12–10–7–5–4–3–2 tanh MLP

⊛ Weight orthonormality regularization [Cisse et al.’17]

slide-87
SLIDE 87

Mutual Information Estimation in Noisy DNNs

Noisy DNN: $T_\ell = S_\ell + Z_\ell$, where $S_\ell \triangleq f_\ell(T_{\ell-1})$ and $Z_\ell \sim \mathcal{N}(0, \sigma^2 I_d)$

Mutual Information: $I(X;T_\ell) = h(T_\ell) - \int dP_X(x)\, h(T_\ell \mid X = x)$

Structure: $S_\ell \perp Z_\ell \implies T_\ell = S_\ell + Z_\ell \sim P * \mathcal{N}_\sigma$

⊛ Know the distribution $\mathcal{N}_\sigma$ of $Z_\ell$ (noise injected by design)
⊛ Extremely complicated $P$ ⟹ Treat as unknown
⊛ Easily get i.i.d. samples from $P$ via DNN forward pass

[Figure: noisy DNN chain $X \to f_1 \to S_1 \to (+Z_1) \to T_1 \to f_2 \to S_2 \to (+Z_2) \to T_2 \to \cdots$]
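For the first layer the conditional term in the decomposition is available in closed form, which shows why only noisy entropies need to be estimated. The worked case below is added for illustration and takes $T_0 = X$, so that $S_1 = f_1(X)$; the conditional-law notation $P_{S_\ell \mid X = x}$ is introduced here.

```latex
\begin{align*}
T_1 \mid \{X = x\} &\sim \mathcal{N}\big(f_1(x),\, \sigma^2 I_d\big)
  \quad\Longrightarrow\quad
  h(T_1 \mid X = x) = \tfrac{d}{2} \log\big(2\pi e \sigma^2\big), \\
I(X; T_1) &= h(T_1) - \tfrac{d}{2} \log\big(2\pi e \sigma^2\big), \\
T_\ell \mid \{X = x\} &\sim P_{S_\ell \mid X = x} * \mathcal{N}_\sigma
  \qquad \text{for } \ell \ge 2 .
\end{align*}
```

In every case the quantity to estimate is the differential entropy of an unknown distribution convolved with a known Gaussian.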

slide-92
SLIDE 92

Structured Estimator (with Implementation in Mind)

Differential Entropy Estimation under Gaussian Convolutions: Estimate $h(P * \mathcal{N}_\sigma)$ via $n$ i.i.d. samples $S^n \triangleq (S_i)_{i=1}^n$ from unknown $P \in \mathcal{F}_d$ (nonparametric class) and knowledge of $\mathcal{N}_\sigma$ (noise distribution).

  • Nonparametric Class: Specified by DNN architecture ($d$ = $T_\ell$ ‘width’)
  • Goal: Simple & parallelizable for efficient implementation
  • Estimator: $\hat h(S^n, \sigma) \triangleq h(\hat P_{S^n} * \mathcal{N}_\sigma)$, where $\hat P_{S^n} \triangleq \frac{1}{n} \sum_{i=1}^n \delta_{S_i}$
  • Plug-in: $\hat h$ is the plug-in est. for the functional $\mathsf{T}_\sigma(P) \triangleq h(P * \mathcal{N}_\sigma)$

slide-93
SLIDE 93

Structured Estimator - Convergence Rate

Theorem (ZG-Greenewald-Weed-Polyanskiy’19) For any σ > 0, d ≥ 1, we have sup

P ∈F(SG)

d,K

E

  • h(P ∗ Nσ) − h( ˆ

PSn ∗ Nσ)

  • ≤ Cσ,d,K · n− 1

2

where Cσ,d,K = Oσ,K(cd) for a constant c.

11/11

slide-94
SLIDE 94

Structured Estimator - Convergence Rate

Theorem (ZG-Greenewald-Weed-Polyanskiy’19) For any σ > 0, d ≥ 1, we have sup

P ∈F(SG)

d,K

E

  • h(P ∗ Nσ) − h( ˆ

PSn ∗ Nσ)

  • ≤ Cσ,d,K · n− 1

2

where Cσ,d,K = Oσ,K(cd) for a constant c. Comments:

11/11

slide-95
SLIDE 95

Structured Estimator - Convergence Rate

Theorem (ZG-Greenewald-Weed-Polyanskiy’19) For any σ > 0, d ≥ 1, we have sup

P ∈F(SG)

d,K

E

  • h(P ∗ Nσ) − h( ˆ

PSn ∗ Nσ)

  • ≤ Cσ,d,K · n− 1

2

where Cσ,d,K = Oσ,K(cd) for a constant c. Comments: Explicit Expression: Enables concrete error bounds in simulations

11/11

slide-96
SLIDE 96

Structured Estimator - Convergence Rate

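Theorem (ZG-Greenewald-Weed-Polyanskiy’19): For any $\sigma > 0$, $d \geq 1$, we have

$$\sup_{P \in \mathcal{F}^{(\mathrm{SG})}_{d,K}} \mathbb{E}\left| h(P \ast \mathcal{N}_\sigma) - h(\hat{P}_{S^n} \ast \mathcal{N}_\sigma) \right| \leq C_{\sigma,d,K} \cdot n^{-\frac{1}{2}},$$

where $C_{\sigma,d,K} = O_{\sigma,K}(c^d)$ for a constant $c$.

Comments:

Explicit Expression: Enables concrete error bounds in simulations

Minimax Rate Optimal: Attains parametric estimation rate $O\left(n^{-\frac{1}{2}}\right)$

11/11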

slide-97
SLIDE 97

Structured Estimator - Convergence Rate

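Theorem (ZG-Greenewald-Weed-Polyanskiy’19): For any $\sigma > 0$, $d \geq 1$, we have

$$\sup_{P \in \mathcal{F}^{(\mathrm{SG})}_{d,K}} \mathbb{E}\left| h(P \ast \mathcal{N}_\sigma) - h(\hat{P}_{S^n} \ast \mathcal{N}_\sigma) \right| \leq C_{\sigma,d,K} \cdot n^{-\frac{1}{2}},$$

where $C_{\sigma,d,K} = O_{\sigma,K}(c^d)$ for a constant $c$.

Comments:

Explicit Expression: Enables concrete error bounds in simulations

Minimax Rate Optimal: Attains parametric estimation rate $O\left(n^{-\frac{1}{2}}\right)$

Proof (initial step): Based on [Polyanskiy-Wu’16],

$$\left| h(P \ast \mathcal{N}_\sigma) - h(\hat{P}_{S^n} \ast \mathcal{N}_\sigma) \right| \lesssim W_1(P \ast \mathcal{N}_\sigma, \hat{P}_{S^n} \ast \mathcal{N}_\sigma)$$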

11/11

slide-98
SLIDE 98

Structured Estimator - Convergence Rate

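Theorem (ZG-Greenewald-Weed-Polyanskiy’19): For any $\sigma > 0$, $d \geq 1$, we have

$$\sup_{P \in \mathcal{F}^{(\mathrm{SG})}_{d,K}} \mathbb{E}\left| h(P \ast \mathcal{N}_\sigma) - h(\hat{P}_{S^n} \ast \mathcal{N}_\sigma) \right| \leq C_{\sigma,d,K} \cdot n^{-\frac{1}{2}},$$

where $C_{\sigma,d,K} = O_{\sigma,K}(c^d)$ for a constant $c$.

Comments:

Explicit Expression: Enables concrete error bounds in simulations

Minimax Rate Optimal: Attains parametric estimation rate $O\left(n^{-\frac{1}{2}}\right)$

Proof (initial step): Based on [Polyanskiy-Wu’16],

$$\left| h(P \ast \mathcal{N}_\sigma) - h(\hat{P}_{S^n} \ast \mathcal{N}_\sigma) \right| \lesssim W_1(P \ast \mathcal{N}_\sigma, \hat{P}_{S^n} \ast \mathcal{N}_\sigma)$$

$\implies$ Analyze empirical 1-Wasserstein distance under Gaussian convolutions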

11/11
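As a worked consequence of the theorem on this slide, the bound can be inverted to see how many samples suffice for a given accuracy. Here $\eta > 0$ is a target error level introduced for illustration; the theorem itself does not use this symbol.

```latex
C_{\sigma,d,K}\, n^{-\frac{1}{2}} \le \eta
\quad\Longleftrightarrow\quad
n \ge \frac{C_{\sigma,d,K}^{2}}{\eta^{2}},
\qquad\text{and since } C_{\sigma,d,K} = O_{\sigma,K}(c^{d}),\quad
n = O_{\sigma,K}\!\left(\frac{c^{2d}}{\eta^{2}}\right) \text{ samples suffice.}
```

The rate in $n$ is parametric, but the number of samples guaranteed to suffice grows exponentially with the dimension $d$.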

slide-99
SLIDE 99

Empirical W1 & The Magic of Gaussian Convolution

p-Wasserstein Distance: For two distributions $P$ and $Q$ on $\mathbb{R}^d$ and $p \geq 1$,

$$W_p(P, Q) \triangleq \inf \left( \mathbb{E}\|X - Y\|^p \right)^{1/p},$$

where the infimum is over all couplings $(X, Y)$ of $P$ and $Q$.

11/11

slide-100
SLIDE 100

Empirical W1 & The Magic of Gaussian Convolution

p-Wasserstein Distance: For two distributions $P$ and $Q$ on $\mathbb{R}^d$ and $p \geq 1$,

$$W_p(P, Q) \triangleq \inf \left( \mathbb{E}\|X - Y\|^p \right)^{1/p},$$

where the infimum is over all couplings $(X, Y)$ of $P$ and $Q$.

Empirical 1-Wasserstein Distance:

11/11

slide-101
SLIDE 101

Empirical W1 & The Magic of Gaussian Convolution

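p-Wasserstein Distance: For two distributions $P$ and $Q$ on $\mathbb{R}^d$ and $p \geq 1$,

$$W_p(P, Q) \triangleq \inf \left( \mathbb{E}\|X - Y\|^p \right)^{1/p},$$

where the infimum is over all couplings $(X, Y)$ of $P$ and $Q$.

Empirical 1-Wasserstein Distance:

Distribution $P$ on $\mathbb{R}^d$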

11/11

slide-102
SLIDE 102

Empirical W1 & The Magic of Gaussian Convolution

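p-Wasserstein Distance: For two distributions $P$ and $Q$ on $\mathbb{R}^d$ and $p \geq 1$,

$$W_p(P, Q) \triangleq \inf \left( \mathbb{E}\|X - Y\|^p \right)^{1/p},$$

where the infimum is over all couplings $(X, Y)$ of $P$ and $Q$.

Empirical 1-Wasserstein Distance:

Distribution $P$ on $\mathbb{R}^d$ $\implies$ i.i.d. samples $(S_i)_{i=1}^n$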

11/11

slide-103
SLIDE 103

Empirical W1 & The Magic of Gaussian Convolution

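p-Wasserstein Distance: For two distributions $P$ and $Q$ on $\mathbb{R}^d$ and $p \geq 1$,

$$W_p(P, Q) \triangleq \inf \left( \mathbb{E}\|X - Y\|^p \right)^{1/p},$$

where the infimum is over all couplings $(X, Y)$ of $P$ and $Q$.

Empirical 1-Wasserstein Distance:

Distribution $P$ on $\mathbb{R}^d$ $\implies$ i.i.d. samples $(S_i)_{i=1}^n$

Empirical distribution $\hat{P}_{S^n} \triangleq \frac{1}{n} \sum_{i=1}^n \delta_{S_i}$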

11/11

slide-104
SLIDE 104

Empirical W1 & The Magic of Gaussian Convolution

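p-Wasserstein Distance: For two distributions $P$ and $Q$ on $\mathbb{R}^d$ and $p \geq 1$,

$$W_p(P, Q) \triangleq \inf \left( \mathbb{E}\|X - Y\|^p \right)^{1/p},$$

where the infimum is over all couplings $(X, Y)$ of $P$ and $Q$.

Empirical 1-Wasserstein Distance:

Distribution $P$ on $\mathbb{R}^d$ $\implies$ i.i.d. samples $(S_i)_{i=1}^n$

Empirical distribution $\hat{P}_{S^n} \triangleq \frac{1}{n} \sum_{i=1}^n \delta_{S_i}$

$\implies$ Dependence on $(n, d)$ of $\mathbb{E} W_1(P, \hat{P}_{S^n})$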

11/11

slide-105
SLIDE 105

Empirical W1 & The Magic of Gaussian Convolution

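p-Wasserstein Distance: For two distributions $P$ and $Q$ on $\mathbb{R}^d$ and $p \geq 1$,

$$W_p(P, Q) \triangleq \inf \left( \mathbb{E}\|X - Y\|^p \right)^{1/p},$$

where the infimum is over all couplings $(X, Y)$ of $P$ and $Q$.

Empirical 1-Wasserstein Distance:

Distribution $P$ on $\mathbb{R}^d$ $\implies$ i.i.d. samples $(S_i)_{i=1}^n$

Empirical distribution $\hat{P}_{S^n} \triangleq \frac{1}{n} \sum_{i=1}^n \delta_{S_i}$

$\implies$ Dependence on $(n, d)$ of $\mathbb{E} W_1(P, \hat{P}_{S^n}) \asymp n^{-\frac{1}{d}}$

11/11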

slide-107
SLIDE 107

Empirical W1 & The Magic of Gaussian Convolution

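p-Wasserstein Distance: For two distributions $P$ and $Q$ on $\mathbb{R}^d$ and $p \geq 1$,

$$W_p(P, Q) \triangleq \inf \left( \mathbb{E}\|X - Y\|^p \right)^{1/p},$$

where the infimum is over all couplings $(X, Y)$ of $P$ and $Q$.

Empirical 1-Wasserstein Distance:

Distribution $P$ on $\mathbb{R}^d$ $\implies$ i.i.d. samples $(S_i)_{i=1}^n$

Empirical distribution $\hat{P}_{S^n} \triangleq \frac{1}{n} \sum_{i=1}^n \delta_{S_i}$

$\implies$ Dependence on $(n, d)$ of $\mathbb{E} W_1(P, \hat{P}_{S^n}) \asymp n^{-\frac{1}{d}}$

Theorem (ZG-Greenewald-Weed-Polyanskiy’19): For any $d$, we have

$$\mathbb{E} W_1\left(P \ast \mathcal{N}_\sigma, \hat{P}_{S^n} \ast \mathcal{N}_\sigma\right) \leq O_{\sigma,d}\left(n^{-\frac{1}{2}}\right)$$

11/11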

slide-108
SLIDE 108

Empirical W1 & The Magic of Gaussian Convolution

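p-Wasserstein Distance: For two distributions $P$ and $Q$ on $\mathbb{R}^d$ and $p \geq 1$,

$$W_p(P, Q) \triangleq \inf \left( \mathbb{E}\|X - Y\|^p \right)^{1/p},$$

where the infimum is over all couplings $(X, Y)$ of $P$ and $Q$.

Empirical 1-Wasserstein Distance:

Distribution $P$ on $\mathbb{R}^d$ $\implies$ i.i.d. samples $(S_i)_{i=1}^n$

Empirical distribution $\hat{P}_{S^n} \triangleq \frac{1}{n} \sum_{i=1}^n \delta_{S_i}$

$\implies$ Dependence on $(n, d)$ of $\mathbb{E} W_1(P, \hat{P}_{S^n}) \asymp n^{-\frac{1}{d}}$

Theorem (ZG-Greenewald-Weed-Polyanskiy’19): For any $d$, we have

$$\mathbb{E} W_1\left(P \ast \mathcal{N}_\sigma, \hat{P}_{S^n} \ast \mathcal{N}_\sigma\right) \leq O_{\sigma,d}\left(n^{-\frac{1}{2}}\right) = O_\sigma\left(c^d n^{-\frac{1}{2}}\right)$$

11/11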
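To make the $W_p$ definition on this slide concrete, here is a minimal sketch for the special case of two empirical distributions on the real line with the same number of points. In that case the infimum over couplings is attained by matching sorted samples, so no optimization is needed. The restriction to one dimension and equal sample sizes is an assumption of this sketch, not of the slide, and the function name `wasserstein_1d` is illustrative.

```python
import numpy as np

def wasserstein_1d(x, y, p=1):
    """W_p between two equal-size empirical distributions on the real line.

    In one dimension the optimal coupling pairs the i-th smallest point of x
    with the i-th smallest point of y, for any p >= 1.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("expected two 1-D samples of equal length")
    # (E|X - Y|^p)^(1/p) under the sorted (monotone) coupling.
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))
```

For example, `wasserstein_1d([0.0, 1.0], [2.0, 1.0])` returns `1.0`, since every point has to move one unit. In higher dimensions no such closed form exists and an optimal transport solver would be needed.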

slide-109
SLIDE 109

Is Exponentiality in Dimension Necessary?

11/11

slide-110
SLIDE 110

Is Exponentiality in Dimension Necessary?

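Theorem (ZG-Greenewald-Polyanskiy-Weed’19): For any $\sigma > 0$, sufficiently large $d$ and sufficiently small $\eta > 0$, we have

$$n^\star(\eta, \sigma, \mathcal{F}_d) = \Omega\left( \frac{2^{\gamma(\sigma) d}}{\eta d} \right),$$

where $\gamma(\sigma) > 0$ is monotonically decreasing in $\sigma$.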

11/11

slide-111
SLIDE 111

Is Exponentiality in Dimension Necessary?

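Theorem (ZG-Greenewald-Polyanskiy-Weed’19): For any $\sigma > 0$, sufficiently large $d$ and sufficiently small $\eta > 0$, we have

$$n^\star(\eta, \sigma, \mathcal{F}_d) = \Omega\left( \frac{2^{\gamma(\sigma) d}}{\eta d} \right),$$

where $\gamma(\sigma) > 0$ is monotonically decreasing in $\sigma$.

$\implies$ The $O\left( \frac{c^d}{\sqrt{n}} \right)$ rate attained by the plug-in estimator is sharp in $n$ and $d$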

11/11

slide-112
SLIDE 112

Is Exponentiality in Dimension Necessary?

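Theorem (ZG-Greenewald-Polyanskiy-Weed’19): For any $\sigma > 0$, sufficiently large $d$ and sufficiently small $\eta > 0$, we have

$$n^\star(\eta, \sigma, \mathcal{F}_d) = \Omega\left( \frac{2^{\gamma(\sigma) d}}{\eta d} \right),$$

where $\gamma(\sigma) > 0$ is monotonically decreasing in $\sigma$.

$\implies$ The $O\left( \frac{c^d}{\sqrt{n}} \right)$ rate attained by the plug-in estimator is sharp in $n$ and $d$

Proof (main ideas):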

11/11

slide-113
SLIDE 113

Is Exponentiality in Dimension Necessary?

Theorem (ZG-Greenewald-Polyanskiy-Weed’19): For any $\sigma > 0$, sufficiently large $d$ and sufficiently small $\eta > 0$, we have

$$n^\star(\eta, \sigma, \mathcal{F}_d) = \Omega\left( \frac{2^{\gamma(\sigma) d}}{\eta d} \right),$$

where $\gamma(\sigma) > 0$ is monotonically decreasing in $\sigma$.

$\implies$ The $O\left( \frac{c^d}{\sqrt{n}} \right)$ rate attained by the plug-in estimator is sharp in $n$ and $d$

Proof (main ideas):

Relate $h(P \ast \mathcal{N}_\sigma)$ to Shannon entropy $H(Q)$

$\mathrm{supp}(Q)$ = peak-constrained AWGN capacity-achieving codebook $\mathcal{C}_d$

11/11

slide-114
SLIDE 114

Is Exponentiality in Dimension Necessary?

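Theorem (ZG-Greenewald-Polyanskiy-Weed’19): For any $\sigma > 0$, sufficiently large $d$ and sufficiently small $\eta > 0$, we have

$$n^\star(\eta, \sigma, \mathcal{F}_d) = \Omega\left( \frac{2^{\gamma(\sigma) d}}{\eta d} \right),$$

where $\gamma(\sigma) > 0$ is monotonically decreasing in $\sigma$.

$\implies$ The $O\left( \frac{c^d}{\sqrt{n}} \right)$ rate attained by the plug-in estimator is sharp in $n$ and $d$

Proof (main ideas):

Relate $h(P \ast \mathcal{N}_\sigma)$ to Shannon entropy $H(Q)$

$\mathrm{supp}(Q)$ = peak-constrained AWGN capacity-achieving codebook $\mathcal{C}_d$

$H(Q)$ estimation sample complexity $\Omega\left( \frac{|\mathcal{C}_d|}{\eta \log |\mathcal{C}_d|} \right)$ [Valiant-Valiant’10]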

11/11
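The link between the last proof step and the form of the lower bound can be sketched in one line. The slides do not state the codebook size, so the following assumes, for illustration only, that $|\mathcal{C}_d| = 2^{\gamma(\sigma) d}$ and that logarithms are taken base 2.

```latex
\frac{|\mathcal{C}_d|}{\eta \log |\mathcal{C}_d|}
= \frac{2^{\gamma(\sigma) d}}{\eta \, \gamma(\sigma) \, d}
\quad\Longrightarrow\quad
n^\star(\eta, \sigma, \mathcal{F}_d)
= \Omega\!\left( \frac{2^{\gamma(\sigma) d}}{\eta d} \right)
\quad \text{for fixed } \sigma .
```

Under this assumption the factor $1/\gamma(\sigma)$ depends only on $\sigma$ and is absorbed into the $\Omega(\cdot)$, which recovers the expression in the theorem.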