Estimating Information Flow in Deep Neural Networks
Ziv Goldfeld, Ewout van den Berg, Kristjan Greenewald, Igor Melnyk, Nam Nguyen, Brian Kingsbury and Yury Polyanskiy
MIT, IBM Research, MIT-IBM Watson AI Lab
International Conference on Machine
Deep Learning - What’s Under the Hood?
Lacking Theory: Macroscopic understanding of Deep Learning
Attempts to Understand Effectiveness of DL:
◮ Structure of loss landscape [Saxe et al.’14, Choromanska et al.’15, Kawaguchi’16, Keskar et al.’17]
◮ Wavelets and sparse coding [Bruna-Mallat’13, Giryes et al.’16, Papyan et al.’16]
◮ Adversarial examples [Szegedy et al.’14, Nguyen et al.’17, Liu et al.’16, Cisse et al.’16]
◮ Information Bottleneck Theory [Tishby-Zaslavsky’15, Shwartz-Tishby’17, Saxe et al.’18, Gabrié et al.’18]
⋆ Goal: Mathematically analyze IB theory & test ‘Compression’
What drives the evolution of internal representations?
What are properties of learned representations?
How do fully trained networks process information?
Setup and Preliminaries
(Deterministic) Feedforward DNN: Each layer $T_\ell = f_\ell(T_{\ell-1})$
Joint Distribution: $P_{X,Y} \implies P_{X,Y} \cdot P_{T_1,\dots,T_L \mid X}$
Information Plane: Evolution of $\big(I(X;T_\ell), I(Y;T_\ell)\big)$ during training
$I(A;B) = D_{\mathrm{KL}}(P_{A,B} \,\|\, P_A \otimes P_B) \overset{\text{discrete}}{=} \sum_{a,b} P_{A,B}(a,b) \log \frac{P_{A,B}(a,b)}{P_A(a) P_B(b)}$
[Figure: feedforward DNN with the label (Cat/Dog), the feature/image as input layer, the hidden layers, and the output layer]
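The discrete form of the mutual information above can be evaluated directly from a joint probability table. The following is a minimal sketch, assuming the joint pmf is given as a 2-D NumPy array; the function name `mutual_information` and the two toy distributions are illustrative and not part of the talk.

```python
import numpy as np

def mutual_information(p_ab):
    """I(A;B) in nats for a joint pmf stored as a 2-D array (rows: a, columns: b)."""
    p_ab = np.asarray(p_ab, dtype=float)
    p_a = p_ab.sum(axis=1, keepdims=True)   # marginal of A, shape (|A|, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)   # marginal of B, shape (1, |B|)
    mask = p_ab > 0                          # terms with zero mass contribute nothing
    ratio = p_ab[mask] / (p_a @ p_b)[mask]   # P_AB / (P_A * P_B)
    return float(np.sum(p_ab[mask] * np.log(ratio)))

# Illustrative inputs: an independent pair and a pair with A = B.
independent = np.outer([0.5, 0.5], [0.25, 0.75])
identical = np.diag([0.5, 0.5])
mi_independent = mutual_information(independent)
mi_identical = mutual_information(identical)
```

For the independent pair the sum vanishes term by term, while for the identical pair it reduces to the entropy of a fair coin, ln 2 nats.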
Setup and Preliminaries
(Deterministic) Feedforward DNN: Each layer $T_\ell = f_\ell(T_{\ell-1})$
IB Theory Claim: Training comprises 2 phases [Shwartz-Tishby’17]
1. Fitting: $I(Y;T_\ell)$ & $I(X;T_\ell)$ rise (short)
2. Compression: $I(X;T_\ell)$ slowly drops (long)
Vacuous Mutual Information & Mis-Estimation
Proposition (Informal): Det. DNNs with strictly monotone nonlinearities (e.g., tanh or sigmoid) $\implies$ $I(X;T_\ell)$ is independent of the DNN parameters
$I(X;T_\ell)$ is a.s. infinite (continuous $X$) or constant $H(X)$ (discrete $X$)
[Figure: the DNN maps the feature space $\mathcal{X}$, with $X \sim \mathrm{Unif}(\mathcal{X})$ and $|\mathcal{X}| = 3000$, to the internal representation space $\mathcal{T}_\ell$, with $T_\ell = \tilde{f}_\ell(X)$, $T_\ell \sim \mathrm{Unif}(\mathcal{T}_\ell)$ and $|\mathcal{T}_\ell| = |\mathcal{X}| = 3000$]
Past Works: Use binning-based proxy of $I(X;T_\ell)$ (aka quantization)
1. For non-negligible bin size, $I\big(X;\mathrm{Bin}(T_\ell)\big) \neq I(X;T_\ell)$
2. $I\big(X;\mathrm{Bin}(T_\ell)\big)$ is highly sensitive to user-defined bin size:
[Figure: binned MI (nats) vs. epoch ($10^0$ to $10^4$) for Layers 1 to 5, one panel per bin size = 0.0001, 0.001, 0.01, 0.1]
⊛ Real Problem: Mismatch between $I(X;T_\ell)$ measurement and model
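The bin-size sensitivity can be reproduced on a toy deterministic map. The sketch below assumes a single tanh unit with an arbitrary weight of 2.0 applied to 3000 distinct scalar inputs (the input grid, the weight and the helper `binned_entropy` are illustrative choices, not taken from the talk). Because the map is injective, the true $I(X;T)$ equals $H(X) = \ln 3000$ whatever the weight is, while the binned proxy, which for a deterministic map equals the entropy of the bin index, changes with the bin size.

```python
import numpy as np

def binned_entropy(t, bin_size):
    """Shannon entropy (nats) of the bin index of each value in t."""
    bins = np.floor(t / bin_size).astype(np.int64)
    _, counts = np.unique(bins, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

# Assumed toy setting: 3000 distinct inputs, X uniform over them.
x = np.linspace(-3.0, 3.0, 3000)
t = np.tanh(2.0 * x)                          # deterministic, strictly monotone map

# Injective map: T takes as many distinct values as X, so I(X;T) = H(X).
true_mi = float(np.log(np.unique(t).size))

# Binned proxy for the bin sizes shown in the figure.
bin_sizes = [1e-4, 1e-3, 1e-2, 1e-1]
binned = [binned_entropy(t, size) for size in bin_sizes]
```

Printing `binned` next to `true_mi` shows the proxy staying below the true value and shrinking as the bins get coarser.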
Auxiliary Framework - Noisy Deep Neural Networks
Modification: Inject (small) Gaussian noise to neurons’ output
Formally: $T_\ell = S_\ell + Z_\ell$, where $S_\ell \triangleq f_\ell(T_{\ell-1})$ and $Z_\ell \sim \mathcal{N}(0, \sigma^2 I_d)$
$\implies$ $X \mapsto T_\ell$ is a parametrized channel (by DNN’s parameters)
$\implies$ $I(X;T_\ell)$ is a function of parameters!
⊛ Challenge: How to accurately track $I(X;T_\ell)$?
[Diagram: $X \to f_1 \to S_1 \to (+Z_1) \to T_1 \to f_2 \to S_2 \to (+Z_2) \to T_2 \to \cdots$]
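A single noisy layer can be sketched as follows. This assumes $f_\ell$ is an affine map followed by tanh, which matches the tanh networks used later in the talk but is an assumption here; the function name `noisy_layer`, the layer sizes and the noise level 0.1 are hypothetical.

```python
import numpy as np

def noisy_layer(t_prev, weight, bias, sigma, rng):
    """Return (S, T) with S = tanh(t_prev @ weight + bias) and T = S + Z."""
    s = np.tanh(t_prev @ weight + bias)        # S_l = f_l(T_{l-1}), assumed affine + tanh
    z = rng.normal(0.0, sigma, size=s.shape)   # Z_l ~ N(0, sigma^2 I_d), drawn fresh each pass
    return s, s + z

# Illustrative forward pass: 12-dimensional inputs, 5 hidden units.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(10000, 12))
weight = rng.normal(size=(12, 5))
bias = np.zeros(5)
s, t = noisy_layer(x, weight, bias, sigma=0.1, rng=rng)
```

Returning both `s` and `t` is a design choice for this sketch: the noiseless samples `s` are what an entropy estimator for $P * \mathcal{N}_\sigma$ would consume, while `t` is what the next layer sees.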
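High-Dim. & Nonparametric Functional Estimation
Distill $I(X;T_\ell)$ Estimation into Noisy Differential Entropy Estimation: Estimate $h(P * \mathcal{N}_\sigma)$ from $n$ i.i.d. samples $S^n \triangleq (S_i)_{i=1}^n$ of $P \in \mathcal{F}_d$ (nonparametric class) and knowledge of $\mathcal{N}_\sigma$ (Gaussian measure $\mathcal{N}(0, \sigma^2 I_d)$).
Theorem (ZG-Greenewald-Polyanskiy-Weed’19): Sample complexity of any accurate estimator (additive gap $\eta$) is $\Omega\left(\frac{2^d}{\eta d}\right)$
Structured Estimator⋆: $\hat{h}(S^n, \sigma) \triangleq h(\hat{P}_n * \mathcal{N}_\sigma)$, where $\hat{P}_n = \frac{1}{n}\sum_{i=1}^n \delta_{S_i}$
⋆ Efficient and parallelizable
Theorem (ZG-Greenewald-Polyanskiy-Weed’19): For $\mathcal{F}^{(\mathrm{SG})}_{d,K} \triangleq \{P : P \text{ is } K\text{-subgaussian in } \mathbb{R}^d\}$, $d \ge 1$ and $\sigma > 0$, we have
$\sup_{P \in \mathcal{F}^{(\mathrm{SG})}_{d,K}} \mathbb{E}_{S^n} \left| h(P * \mathcal{N}_\sigma) - \hat{h}(S^n, \sigma) \right| \le c^d_{\sigma,K} \cdot n^{-1/2}$
Optimality: $\hat{h}(S^n, \sigma)$ attains sharp dependence on both $n$ and $d$!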
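Since $\hat{P}_n * \mathcal{N}_\sigma$ is a Gaussian mixture with one component per sample, its differential entropy can be approximated numerically. The sketch below uses plain Monte Carlo integration, which is one possible way to evaluate the plug-in quantity and is an assumption of this example rather than a statement of how the authors computed it; `log_mixture_density` and `plugin_entropy` are hypothetical helper names.

```python
import numpy as np

def log_mixture_density(points, centers, sigma):
    """Log density of (1/n) * sum_i N(centers[i], sigma^2 I_d) at each row of points."""
    d = centers.shape[1]
    sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    log_kernel = -sq_dist / (2.0 * sigma**2) - 0.5 * d * np.log(2.0 * np.pi * sigma**2)
    peak = log_kernel.max(axis=1, keepdims=True)   # stabilised log-mean-exp over components
    return (peak + np.log(np.exp(log_kernel - peak).mean(axis=1, keepdims=True))).ravel()

def plugin_entropy(samples, sigma, n_mc, rng):
    """Monte Carlo estimate of h(P_hat_n * N_sigma) in nats."""
    idx = rng.integers(0, samples.shape[0], size=n_mc)   # pick mixture components
    draws = samples[idx] + rng.normal(0.0, sigma, size=(n_mc, samples.shape[1]))
    return float(-log_mixture_density(draws, samples, sigma).mean())

# Sanity check on a case with a closed form: P = N(0, 1), sigma = 1, d = 1,
# so that P * N_sigma = N(0, 2) and h = 0.5 * log(2 * pi * e * 2).
rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 1))
estimate = plugin_entropy(samples, sigma=1.0, n_mc=2000, rng=rng)
reference = 0.5 * np.log(2.0 * np.pi * np.e * 2.0)
```

The pairwise distance array has `n_mc * n * d` entries, so for large sample sizes the evaluation would need to be chunked; each Monte Carlo draw is independent of the others, which is what makes the computation easy to parallelize.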
$I(X;T_\ell)$ Dynamics - Illustrative Minimal Example
Single Neuron Classification: $X \to \tanh(wX + b) = S_{w,b} \to (+Z,\ Z \sim \mathcal{N}(0, \sigma^2)) \to T$
Input: $X \sim \mathrm{Unif}\{\pm 1, \pm 3\}$, $\mathcal{X}_{y=-1} \triangleq \{-3, -1, 1\}$, $\mathcal{X}_{y=1} \triangleq \{3\}$
[Figure: the neuron outputs $S_{1,0}$ and $S_{5,-10}$ at the four inputs]
⊛ Center & sharpen transition ($\iff$ increase $w$ and keep $b = -2w$)
✓ Correct classification performance
Mutual Information: $I(X;T) = I(S_{w,b}; S_{w,b} + Z)$
$\implies$ $I(X;T)$ is # bits (nats) transmittable over AWGN with symbols $\mathcal{S}_{w,b} \triangleq \{\tanh(-3w+b), \tanh(-w+b), \tanh(w+b), \tanh(3w+b)\} \to \{\pm 1\}$
[Figure: mutual information vs. epoch ($10^0$ to $10^6$), with reference levels $\ln(2)$, $\ln(3)$, $\ln(4)$]
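For this one-dimensional example the mutual information can be computed by direct numerical integration, using $I(X;T) = h(T) - h(Z)$ with $T$ distributed as an equal-weight mixture of four Gaussians centered at the symbols. The sketch below assumes a noise level of $\sigma = 0.01$ and a simple Riemann sum on a fixed grid; both choices, and the function name `single_neuron_mi`, are illustrative and not taken from the talk.

```python
import numpy as np

def single_neuron_mi(w, b, sigma, grid_size=20001):
    """I(X;T) in nats for T = tanh(w*X + b) + Z, X uniform on {-3, -1, 1, 3}."""
    inputs = np.array([-3.0, -1.0, 1.0, 3.0])
    symbols = np.tanh(w * inputs + b)                       # the AWGN constellation
    t = np.linspace(-1.0 - 8.0 * sigma, 1.0 + 8.0 * sigma, grid_size)
    kernels = np.exp(-((t[:, None] - symbols[None, :]) ** 2) / (2.0 * sigma**2))
    density = kernels.mean(axis=1) / np.sqrt(2.0 * np.pi * sigma**2)
    # -p log p, with the log argument clipped so that p = 0 contributes 0.
    integrand = -density * np.log(np.maximum(density, 1e-300))
    h_t = float(integrand.sum() * (t[1] - t[0]))            # h(T) by Riemann sum
    h_z = 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)       # h(Z) in closed form
    return h_t - h_z

# The two parameter settings named on the slide, with an assumed sigma.
mi_start = single_neuron_mi(w=1.0, b=0.0, sigma=0.01)
mi_sharp = single_neuron_mi(w=5.0, b=-10.0, sigma=0.01)
```

With noise this small, the four symbols at $(w, b) = (1, 0)$ are well separated, so the value should sit close to $\ln 4$. At $(5, -10)$ the three symbols of class $y = -1$ nearly coincide, so the mutual information drops toward the entropy of a $(3/4, 1/4)$ binary variable: compression through clustering.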
Clustering of Representations - Larger Networks
Noisy version of DNN from [Shwartz-Tishby’17]:
Binary Classification: 12-bit input & 12–10–7–5–4–3–2 tanh MLP
Verified in multiple additional experiments
$\implies$ Compression of $I(X;T_\ell)$ driven by clustering of representations
Circling Back to Deterministic DNNs
$I(X;T_\ell)$ is constant/infinite $\implies$ Doesn’t measure clustering
Reexamine Measurements: Computed $I\big(X;\mathrm{Bin}(T_\ell)\big) = H\big(\mathrm{Bin}(T_\ell)\big)$
$H\big(\mathrm{Bin}(T_\ell)\big)$ measures clustering (maximized by uniform distribution)
Test: $I(X;T_\ell)$ and $H\big(\mathrm{Bin}(T_\ell)\big)$ highly correlated in noisy DNNs⋆
⋆ When bin size chosen ∝ noise std.
$\implies$ Past works not measuring MI but clustering (via binned-MI)!
By-Product Result: Refute ‘compression (tight clustering) improves generalization’ claim
[Come see us at poster #96 for details]
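The identity between the binned mutual information and the binned entropy follows because, in a deterministic network, $\mathrm{Bin}(T_\ell)$ is a function of $X$. A small numerical check is sketched below, assuming 64 equiprobable scalar inputs, a single tanh map and a bin size of 0.25; all of these are illustrative choices.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a pmf given as an array; zero entries are skipped."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Assumed toy setting: X uniform on 64 points, T a deterministic function of X.
x = np.arange(64)
t = np.tanh(0.1 * (x - 32))
bins = np.floor(t / 0.25).astype(int)              # Bin(T)

# Joint pmf of (X, Bin(T)): each x puts all of its mass on exactly one bin.
labels, bin_index = np.unique(bins, return_inverse=True)
joint = np.zeros((x.size, labels.size))
joint[np.arange(x.size), bin_index] = 1.0 / x.size

h_x = entropy(joint.sum(axis=1))
h_bin = entropy(joint.sum(axis=0))
h_joint = entropy(joint.ravel())
binned_mi = h_x + h_bin - h_joint                  # I(X; Bin(T))
```

Here `h_joint` equals `h_x`, so `binned_mi` collapses to `h_bin`: the measured quantity depends only on how the representations spread over the bins.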
Summary
Reexamined Information Bottleneck Compression:
◮ $I(X;T)$ fluctuations in det. DNNs are theoretically impossible
◮ Yet, past works presented (binned) $I(X;T)$ dynamics during training
Noisy DNN Framework: Studying IT quantities over DNNs
◮ Optimal estimator (in $n$ and $d$) for accurate MI estimation
◮ Clustering of learned representations is the source of compression
Clarify Past Observations of Compression: in fact show clustering
◮ Compression/clustering and generalization are not necessarily related
Thank you!
Clustering of Representations - Larger Networks
Noisy version of DNN from [Shwartz-Tishby’17]:
Binary Classification: 12-bit input & 12–10–7–5–4–3–2 tanh MLP
⊛ weight orthonormality regularization [Cisse et al.’17]
Mutual Information Estimation in Noisy DNNs
Noisy DNN: $T_\ell = S_\ell + Z_\ell$, where $S_\ell \triangleq f_\ell(T_{\ell-1})$ and $Z_\ell \sim \mathcal{N}(0, \sigma^2 I_d)$
[Diagram: $X \to f_1 \to S_1 \to (+Z_1) \to T_1 \to f_2 \to S_2 \to (+Z_2) \to T_2 \to \cdots$]
Mutual Information: $I(X;T_\ell) = h(T_\ell) - \int \mathrm{d}P_X(x)\, h(T_\ell \mid X = x)$
Structure: $S_\ell \perp Z_\ell \implies T_\ell = S_\ell + Z_\ell \sim P * \mathcal{N}_\sigma$
⊛ Know the distribution $\mathcal{N}_\sigma$ of $Z_\ell$ (noise injected by design)
⊛ Extremely complicated $P$ $\implies$ Treat as unknown
⊛ Easily get i.i.d. samples from $P$ via DNN forward pass
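Both terms of the decomposition are entropies of Gaussian-smoothed distributions, because the injected noise $Z_\ell$ is independent of $X$ and of $S_\ell$. One natural way to turn this into an estimate, sketched here under that observation and not necessarily the exact procedure used in the talk, is to apply an entropy estimator $\hat{h}$ to each term. The symbols $x_1, \dots, x_m$ (inputs from the dataset), $S^n$ (unconditional samples of $S_\ell$) and $S^n(x_j)$ (samples of $S_\ell$ given $X = x_j$) are notation introduced for this illustration.

```latex
\begin{align*}
I(X;T_\ell) &= h(T_\ell) - \int h(T_\ell \mid X = x)\,\mathrm{d}P_X(x), \\
T_\ell \sim P_{S_\ell} * \mathcal{N}_\sigma,
  &\qquad (T_\ell \mid X = x) \sim P_{S_\ell \mid X = x} * \mathcal{N}_\sigma, \\
\hat{I}(X;T_\ell) &\triangleq \hat{h}(S^n, \sigma)
  - \frac{1}{m}\sum_{j=1}^{m} \hat{h}\big(S^n(x_j), \sigma\big).
\end{align*}
```

For the first layer, $S_1 = f_1(x)$ is deterministic given $X = x$, so the conditional term needs no estimation: $h(T_1 \mid X = x) = \frac{d}{2}\log(2\pi e \sigma^2)$.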
Structured Estimator (with Implementation in Mind)
Differential Entropy Estimation under Gaussian Convolutions: Estimate $h(P * \mathcal{N}_\sigma)$ via $n$ i.i.d. samples $S^n \triangleq (S_i)_{i=1}^n$ from unknown $P \in \mathcal{F}_d$ (nonparametric class) and knowledge of $\mathcal{N}_\sigma$ (noise distribution).
Nonparametric Class: Specified by DNN architecture ($d$ = $T_\ell$ ‘width’)
Goal: Simple & parallelizable for efficient implementation
Estimator: $\hat{h}(S^n, \sigma) \triangleq h(\hat{P}_{S^n} * \mathcal{N}_\sigma)$, where $\hat{P}_{S^n} \triangleq \frac{1}{n}\sum_{i=1}^n \delta_{S_i}$
Plug-in: $\hat{h}$ is plug-in est. for the functional $\mathsf{T}_\sigma(P) \triangleq h(P * \mathcal{N}_\sigma)$
Structured Estimator - Convergence Rate
Theorem (ZG-Greenewald-Weed-Polyanskiy’19): For any $\sigma > 0$, $d \ge 1$, we have
$\sup_{P \in \mathcal{F}^{(\mathrm{SG})}_{d,K}} \mathbb{E}\left| h(P * \mathcal{N}_\sigma) - h(\hat{P}_{S^n} * \mathcal{N}_\sigma) \right| \le C_{\sigma,d,K} \cdot n^{-1/2}$
where $C_{\sigma,d,K} = O_{\sigma,K}(c^d)$ for a constant $c$.
Comments:
Explicit Expression: Enables concrete error bounds in simulations
Minimax Rate Optimal: Attains parametric estimation rate $O(n^{-1/2})$
Proof (initial step): Based on [Polyanskiy-Wu’16], $\left| h(P * \mathcal{N}_\sigma) - h(\hat{P}_{S^n} * \mathcal{N}_\sigma) \right| \lesssim W_1(P * \mathcal{N}_\sigma, \hat{P}_{S^n} * \mathcal{N}_\sigma)$
$\implies$ Analyze empirical 1-Wasserstein distance under Gaussian convolutions
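Empirical $W_1$ & The Magic of Gaussian Convolution
p-Wasserstein Distance: For two distributions $P$ and $Q$ on $\mathbb{R}^d$ and $p \ge 1$, $W_p(P, Q) \triangleq \inf \left(\mathbb{E}\|X - Y\|^p\right)^{1/p}$, infimum over all couplings of $P$ and $Q$
Empirical 1-Wasserstein Distance: Distribution $P$ on $\mathbb{R}^d$ $\implies$ i.i.d. Samples $(S_i)_{i=1}^n$
Empirical distribution $\hat{P}_{S^n} \triangleq \frac{1}{n}\sum_{i=1}^n \delta_{S_i}$
$\implies$ Dependence on $(n, d)$ of $\mathbb{E} W_1(P, \hat{P}_{S^n}) \asymp n^{-1/d}$
Theorem (ZG-Greenewald-Weed-Polyanskiy’19): For any $d$, we have $\mathbb{E} W_1(P * \mathcal{N}_\sigma, \hat{P}_{S^n} * \mathcal{N}_\sigma) \le O_{\sigma,d}(n^{-1/2}) = O_\sigma(c^d n^{-1/2})$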
Is Exponentiality in Dimension Necessary?
Theorem (ZG-Greenewald-Polyanskiy-Weed’19): For any $\sigma > 0$, sufficiently large $d$ and sufficiently small $\eta > 0$, we have $n^\star(\eta, \sigma, \mathcal{F}_d) = \Omega\left(\frac{2^{\gamma(\sigma) d}}{\eta d}\right)$, where $\gamma(\sigma) > 0$ is monotonically decreasing in $\sigma$.
$\implies$ $O\left(\frac{c^d}{\sqrt{n}}\right)$ rate attained by the plugin estimator is sharp in $n$ and $d$
Proof (main ideas):
Relate $h(P * \mathcal{N}_\sigma)$ to Shannon entropy $H(Q)$, where $\mathrm{supp}(Q)$ = peak-constrained AWGN capacity achieving codebook $\mathcal{C}_d$
$H(Q)$ estimation sample complexity $\Omega\left(\frac{|\mathcal{C}_d|}{\eta \log |\mathcal{C}_d|}\right)$ [Valiant-Valiant’10]