Estimating Information Flow in Deep Neural Networks
Ziv Goldfeld

Vacuous Mutual Information & Mis-Estimation
Proposition (Informal): Det. DNNs with strictly monotone nonlinearities (e.g., tanh or sigmoid) $\implies$ $I(X;T_\ell)$ is independent of the DNN parameters
$\implies$ $I(X;T_\ell)$ is a.s. infinite (continuous $X$) or constant $H(X)$ (discrete $X$)
Past Works: Use binning-based proxy of $I(X;T_\ell)$ (aka quantization)
1. For non-negligible bin size, $I\big(X;\mathrm{Bin}(T_\ell)\big) \neq I(X;T_\ell)$
2. $I\big(X;\mathrm{Bin}(T_\ell)\big)$ is highly sensitive to user-defined bin size
[Figure: MI (nats) vs. epoch ($10^0$ to $10^4$) for Layers 1-5; four panels with bin size = 0.0001, 0.001, 0.01, 0.1]
Real Problem: Mismatch between $I(X;T_\ell)$ measurement and model
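
The bin-size sensitivity can be reproduced in a few lines. The sketch below is an illustration under stated assumptions, not the experiment behind the figure: the inputs `x`, the weights `w`, the layer width and the helper `binned_entropy` are all hypothetical. It relies on the fact that, for a deterministic layer evaluated on a finite dataset, the binned MI equals the entropy of the binned activations, so only that entropy is computed.

```python
import numpy as np

def binned_entropy(t, bin_size):
    """Entropy (nats) of the empirical distribution of t after quantizing
    every coordinate to a grid of width bin_size."""
    cells = np.floor(np.asarray(t) / bin_size).astype(np.int64)
    # Each distinct row of `cells` is one occupied bin.
    _, counts = np.unique(cells, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
x = rng.standard_normal((4096, 3))   # hypothetical inputs (assumption)
w = rng.standard_normal((3, 3))      # hypothetical fixed weights (assumption)
t = np.tanh(x @ w)                   # deterministic tanh layer

# Same network, same data: only the user-chosen bin size changes.
for bin_size in (1e-4, 1e-3, 1e-2, 1e-1):
    print(bin_size, binned_entropy(t, bin_size))
```

The printed value can never exceed the log of the number of samples, and it shrinks as the bins get coarser, even though the network itself is unchanged.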

Auxiliary Framework - Noisy Deep Neural Networks
Modification: Inject (small) Gaussian noise to neurons' output
Formally: $T_\ell = S_\ell + Z_\ell$, where $S_\ell \triangleq f_\ell(T_{\ell-1})$ and $Z_\ell \sim \mathcal{N}(0, \sigma^2 I_d)$
[Diagram: $X \to f_1 \to S_1 \to (+Z_1) \to T_1 \to f_2 \to S_2 \to (+Z_2) \to T_2 \to \cdots$]
$\implies$ $X \mapsto T_\ell$ is a parametrized channel (by DNN's parameters)
$\implies$ $I(X;T_\ell)$ is a function of parameters!
Challenge: How to accurately track $I(X;T_\ell)$?
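
A minimal sketch of the noisy forward pass defined above, assuming fully connected tanh layers and taking $T_0 = X$. The function name `noisy_forward`, the layer sizes and the random weights are illustrative assumptions, not the authors' code.

```python
import numpy as np

def noisy_forward(x, weights, biases, sigma, rng):
    """Return the lists (S_1..S_L) and (T_1..T_L) for T_l = S_l + Z_l,
    with S_l = tanh(T_{l-1} W_l + b_l) and Z_l ~ N(0, sigma^2 I)."""
    s_list, t_list = [], []
    t = x                                             # T_0 = X (assumption)
    for w, b in zip(weights, biases):
        s = np.tanh(t @ w + b)                        # S_l = f_l(T_{l-1})
        t = s + sigma * rng.standard_normal(s.shape)  # T_l = S_l + Z_l
        s_list.append(s)
        t_list.append(t)
    return s_list, t_list

rng = np.random.default_rng(0)
sizes = [12, 10, 7]                                   # hypothetical widths
weights = [rng.standard_normal((sizes[i], sizes[i + 1])) for i in range(2)]
biases = [np.zeros(sizes[i + 1]) for i in range(2)]
x = rng.integers(0, 2, size=(2000, 12)).astype(float)
s_layers, t_layers = noisy_forward(x, weights, biases, sigma=0.1, rng=rng)
```

Because the noise is added before the next layer is applied, it propagates through the network, which is what makes $X \mapsto T_\ell$ a genuine noisy channel rather than a deterministic map.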

High-Dim. & Nonparametric Functional Estimation
Distill $I(X;T_\ell)$ Estimation into Noisy Differential Entropy Estimation: Estimate $h(P * \mathcal{N}_\sigma)$ from $n$ i.i.d. samples $S^n \triangleq (S_i)_{i=1}^n$ of $P \in \mathcal{F}_d$ (nonparametric class) and knowledge of $\mathcal{N}_\sigma$ (Gaussian measure $\mathcal{N}(0, \sigma^2 I_d)$).
Theorem (ZG-Greenewald-Polyanskiy-Weed'19): Sample complexity of any accurate estimator (additive gap $\eta$) is $\Omega\!\left(\frac{2^d}{\eta d}\right)$
Structured Estimator⋆: $\hat{h}(S^n, \sigma) \triangleq h(\hat{P}_n * \mathcal{N}_\sigma)$, where $\hat{P}_n = \frac{1}{n} \sum_{i=1}^n \delta_{S_i}$
⋆ Efficient and parallelizable
Theorem (ZG-Greenewald-Polyanskiy-Weed'19): For $\mathcal{F}^{(\mathrm{SG})}_{d,K} \triangleq \{P : P \text{ is } K\text{-subgaussian in } \mathbb{R}^d\}$, $d \geq 1$ and $\sigma > 0$, we have
$\sup_{P \in \mathcal{F}^{(\mathrm{SG})}_{d,K}} \mathbb{E}_{S^n} \left| h(P * \mathcal{N}_\sigma) - \hat{h}(S^n, \sigma) \right| \leq c^d_{\sigma,K} \cdot n^{-\frac{1}{2}}$
Optimality: $\hat{h}(S^n, \sigma)$ attains sharp dependence on both $n$ and $d$!
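
Spelling out the structured estimator may help: convolving the empirical measure with the Gaussian gives an $n$-component Gaussian mixture centered at the samples, and the estimate is the differential entropy of that mixture. The symbols $g$ and $\varphi_\sigma$ below are notation introduced here for illustration.

```latex
\[
g(t) \triangleq \frac{1}{n}\sum_{i=1}^{n} \varphi_\sigma(t - S_i),
\qquad
\varphi_\sigma(t) = (2\pi\sigma^2)^{-d/2}
  \exp\!\Big(-\frac{\|t\|^2}{2\sigma^2}\Big),
\]
\[
\hat{h}(S^n,\sigma) = h(\hat{P}_n * \mathcal{N}_\sigma)
  = -\int_{\mathbb{R}^d} g(t)\,\ln g(t)\,\mathrm{d}t .
\]
```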

$I(X;T_\ell)$ Dynamics - Illustrative Minimal Example
Single Neuron Classification: $X \to \tanh(wX + b) \to S_{w,b} \to (+Z) \to T$, with $Z \sim \mathcal{N}(0, \sigma^2)$
Input: $X \sim \mathrm{Unif}\{\pm 1, \pm 3\}$, $\mathcal{X}_{y=-1} \triangleq \{-3, -1, 1\}$, $\mathcal{X}_{y=1} \triangleq \{3\}$
[Figure: $S_{1,0}$]
Center & sharpen transition ($\iff$ increase $w$ and keep $b = -2w$)
[Figure: $S_{1,0}$ and $S_{5,-10}$]
✓ Correct classification performance
Mutual Information: $I(X;T) = I(S_{w,b}; S_{w,b} + Z)$
$\implies$ $I(X;T)$ is # bits (nats) transmittable over AWGN with symbols
$\mathcal{S}_{w,b} \triangleq \{\tanh(-3w+b), \tanh(-w+b), \tanh(w+b), \tanh(3w+b)\} \to \{\pm 1\}$
[Figure: Mutual information vs. epoch ($10^0$ to $10^6$), with reference levels $\ln(2)$, $\ln(3)$, $\ln(4)$]
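
For this one-dimensional example the mutual information can be evaluated numerically. The sketch below uses $I(X;T) = h(T) - h(Z)$, which holds because $T$ given $X = x$ is Gaussian with variance $\sigma^2$, and integrates the four-component mixture density on a grid. The function name `single_neuron_mi`, the grid size and the value $\sigma = 0.1$ are assumptions made for illustration.

```python
import numpy as np

def single_neuron_mi(w, b, sigma, grid_points=200001):
    """I(X;T) in nats for T = tanh(w X + b) + Z, X ~ Unif{-3,-1,1,3},
    Z ~ N(0, sigma^2), via I = h(T) - h(Z) and numerical integration."""
    symbols = np.tanh(w * np.array([-3.0, -1.0, 1.0, 3.0]) + b)
    # tanh outputs lie in [-1, 1]; pad by 10 sigma to capture the tails.
    t = np.linspace(-1.0 - 10.0 * sigma, 1.0 + 10.0 * sigma, grid_points)
    dt = t[1] - t[0]
    # Density of T: equal-weight Gaussian mixture centered at the symbols.
    z = (t[:, None] - symbols[None, :]) / sigma
    dens = np.exp(-0.5 * z**2).mean(axis=1) / (sigma * np.sqrt(2.0 * np.pi))
    # Integrand -p ln p, with the convention 0 ln 0 = 0.
    safe = np.where(dens > 0, dens, 1.0)
    h_t = float((-dens * np.log(safe)).sum() * dt)
    h_z = 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)
    return h_t - h_z

mi_start = single_neuron_mi(w=1.0, b=0.0, sigma=0.1)    # S_{1,0}
mi_sharp = single_neuron_mi(w=5.0, b=-10.0, sigma=0.1)  # S_{5,-10}
print(mi_start, mi_sharp)
```

With $b = -2w$ and large $w$, three of the four symbols collapse near $-1$, so the information drops toward the entropy of a $(3/4, 1/4)$ split: the channel-coding view of compression through clustering.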

Clustering of Representations - Larger Networks
Noisy version of DNN from [Shwartz-Tishby'17]:
Binary Classification: 12-bit input & 12-10-7-5-4-3-2 tanh MLP
Verified in multiple additional experiments
$\implies$ Compression of $I(X;T_\ell)$ driven by clustering of representations

Circling Back to Deterministic DNNs
$I(X;T_\ell)$ is constant/infinite $\implies$ Doesn't measure clustering
Reexamine Measurements: Computed $I\big(X;\mathrm{Bin}(T_\ell)\big) = H\big(\mathrm{Bin}(T_\ell)\big)$
$H\big(\mathrm{Bin}(T_\ell)\big)$ measures clustering (maximized by uniform distribution)
Test: $I(X;T_\ell)$ and $H\big(\mathrm{Bin}(T_\ell)\big)$ highly correlated in noisy DNNs⋆
⋆ When bin size chosen $\propto$ noise std.
$\implies$ Past works not measuring MI but clustering (via binned-MI)!
By-Product Result: Refute 'compression (tight clustering) improves generalization' claim
[Come see us at poster #96 for details]
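
The identity between the binned MI and the binned entropy follows in one step from the network being deterministic: $\mathrm{Bin}(T_\ell)$ is then a function of $X$, so its conditional entropy given $X$ vanishes.

```latex
\[
I\big(X;\mathrm{Bin}(T_\ell)\big)
  = H\big(\mathrm{Bin}(T_\ell)\big) - H\big(\mathrm{Bin}(T_\ell)\,\big|\,X\big)
  = H\big(\mathrm{Bin}(T_\ell)\big),
\qquad\text{since } H\big(\mathrm{Bin}(T_\ell)\,\big|\,X\big) = 0 .
\]
```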

Summary
Reexamined Information Bottleneck Compression:
◮ $I(X;T)$ fluctuations in det. DNNs are theoretically impossible
◮ Yet, past works presented (binned) $I(X;T)$ dynamics during training
Noisy DNN Framework: Studying IT quantities over DNNs
◮ Optimal estimator (in $n$ and $d$) for accurate MI estimation
◮ Clustering of learned representations is the source of compression
Clarify Past Observations of Compression: in fact show clustering
◮ Compression/clustering and generalization are not necessarily related
Thank you!

Clustering of Representations - Larger Networks
Noisy version of DNN from [Shwartz-Tishby'17]:
Binary Classification: 12-bit input & 12-10-7-5-4-3-2 tanh MLP
Weight orthonormality regularization [Cisse et al. '17]
Verified in multiple additional experiments
$\implies$ Compression of $I(X;T_\ell)$ driven by clustering of representations

Mutual Information Estimation in Noisy DNNs
Noisy DNN: $T_\ell = S_\ell + Z_\ell$, where $S_\ell \triangleq f_\ell(T_{\ell-1})$ and $Z_\ell \sim \mathcal{N}(0, \sigma^2 I_d)$
[Diagram: $X \to f_1 \to S_1 \to (+Z_1) \to T_1 \to f_2 \to S_2 \to (+Z_2) \to T_2 \to \cdots$]
Mutual Information: $I(X;T_\ell) = h(T_\ell) - \int \mathrm{d}P_X(x)\, h(T_\ell \mid X = x)$
Structure: $S_\ell \perp Z_\ell$ $\implies$ $T_\ell = S_\ell + Z_\ell \sim P * \mathcal{N}_\sigma$
Know the distribution $\mathcal{N}_\sigma$ of $Z_\ell$ (noise injected by design)
Extremely complicated $P$ $\implies$ Treat as unknown
Easily get i.i.d. samples from $P$ via DNN forward pass
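
As a worked instance of the decomposition above, consider the first layer and assume $T_0 = X$, so that $S_1 = f_1(X)$ (the diagram suggests this, but it is an assumption here). Given $X = x$, $T_1 = f_1(x) + Z_1$ is Gaussian with covariance $\sigma^2 I_d$, so the conditional term is available in closed form and only $h(T_1)$ needs to be estimated:

```latex
\[
I(X;T_1)
  = h(T_1) - \int \mathrm{d}P_X(x)\, h\big(f_1(x) + Z_1\big)
  = h(T_1) - \frac{d}{2}\ln\!\big(2\pi e\,\sigma^2\big).
\]
```

For deeper layers the conditional law of $S_\ell$ given $X = x$ is no longer a point mass, but $T_\ell$ given $X = x$ still has the form of an unknown distribution convolved with $\mathcal{N}_\sigma$, so each conditional entropy is again a noisy differential entropy of the same kind.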

Structured Estimator (with Implementation in Mind)
Differential Entropy Estimation under Gaussian Convolutions: Estimate $h(P * \mathcal{N}_\sigma)$ via $n$ i.i.d. samples $S^n \triangleq (S_i)_{i=1}^n$ from unknown $P \in \mathcal{F}_d$ (nonparametric class) and knowledge of $\mathcal{N}_\sigma$ (noise distribution).
Nonparametric Class: Specified by DNN architecture ($d$ = $T_\ell$ 'width')
Goal: Simple & parallelizable for efficient implementation
Estimator: $\hat{h}(S^n, \sigma) \triangleq h(\hat{P}_{S^n} * \mathcal{N}_\sigma)$, where $\hat{P}_{S^n} \triangleq \frac{1}{n} \sum_{i=1}^n \delta_{S_i}$
Plug-in: $\hat{h}$ is plug-in est. for the functional $\mathsf{T}_\sigma(P) \triangleq h(P * \mathcal{N}_\sigma)$
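
One way to evaluate the plug-in estimate in practice is Monte Carlo integration: the convolved empirical measure is an equal-weight Gaussian mixture, its entropy is minus the expected log-density, and that expectation can be approximated by sampling from the mixture. The sketch below makes this concrete. The slides do not specify the numerical method, so the Monte Carlo step, the function names and the default `n_mc` are assumptions.

```python
import numpy as np

def log_mixture_density(t, centers, sigma):
    """ln g(t) for g = (1/n) sum_i N(centers[i], sigma^2 I_d); t is (m, d)."""
    n, d = centers.shape
    # Squared distances between every query point and every center: (m, n).
    sq = ((t[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    a = -sq / (2.0 * sigma**2)
    a_max = a.max(axis=1, keepdims=True)           # stabilised log-sum-exp
    lse = a_max[:, 0] + np.log(np.exp(a - a_max).sum(axis=1))
    return lse - np.log(n) - 0.5 * d * np.log(2.0 * np.pi * sigma**2)

def entropy_estimate(samples, sigma, n_mc=2000, rng=None):
    """Monte Carlo approximation of h(P_hat * N_sigma) in nats."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = samples.shape
    idx = rng.integers(0, n, size=n_mc)            # pick mixture components
    t = samples[idx] + sigma * rng.standard_normal((n_mc, d))
    return float(-log_mixture_density(t, samples, sigma).mean())

rng = np.random.default_rng(0)
s = np.tanh(rng.standard_normal((500, 3)))         # hypothetical samples of S
print(entropy_estimate(s, sigma=0.1, rng=rng))
```

Every Monte Carlo point is processed independently, so the distance computation can be batched or split across devices, in line with the parallelizability goal stated above.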

Structured Estimator - Convergence Rate
Theorem (ZG-Greenewald-Weed-Polyanskiy'19): For any $\sigma > 0$, $d \geq 1$, we have
$\sup_{P \in \mathcal{F}^{(\mathrm{SG})}_{d,K}} \mathbb{E} \left| h(P * \mathcal{N}_\sigma) - h(\hat{P}_{S^n} * \mathcal{N}_\sigma) \right| \leq C_{\sigma,d,K} \cdot n^{-\frac{1}{2}}$
where $C_{\sigma,d,K} = O_{\sigma,K}(c^d)$ for a constant $c$.
Comments:
Explicit Expression: Enables concrete error bounds in simulations
Minimax Rate Optimal: Attains parametric estimation rate $O\big(n^{-\frac{1}{2}}\big)$
Proof (initial step): Based on [Polyanskiy-Wu'16]
$\left| h(P * \mathcal{N}_\sigma) - h(\hat{P}_{S^n} * \mathcal{N}_\sigma) \right| \lesssim W_1(P * \mathcal{N}_\sigma, \hat{P}_{S^n} * \mathcal{N}_\sigma)$
$\implies$ Analyze empirical 1-Wasserstein distance under Gaussian convolutions

Empirical $W_1$ & The Magic of Gaussian Convolution
$p$-Wasserstein Distance: For two distributions $P$ and $Q$ on $\mathbb{R}^d$ and $p \geq 1$,
$W_p(P, Q) \triangleq \inf \big( \mathbb{E} \|X - Y\|^p \big)^{1/p}$, infimum over all couplings of $P$ and $Q$
Empirical 1-Wasserstein Distance:


Estimating Information Flow in Deep Neural Networks
Ziv Goldfeld, Ewout van den Berg, Kristjan Greenewald, Igor Melnyk, Nam Nguyen, Brian Kingsbury and Yury Polyanskiy
MIT, IBM Research, MIT-IBM Watson AI Lab
International Conference on Machine
