Estimating Information Flow in Deep Neural Networks
Ziv Goldfeld, Ewout van den Berg, Kristjan Greenewald, Igor Melnyk, Nam Nguyen, Brian Kingsbury and Yury Polyanskiy
MIT, IBM Research, MIT-IBM Watson AI Lab
International Conference on Machine Learning


Vacuous Mutual Information & Mis-Estimation

Proposition (informal): In deterministic DNNs with strictly monotone nonlinearities (e.g., tanh or sigmoid), I(X; T_ℓ) is independent of the DNN parameters: it is a.s. infinite (continuous X) or constant, equal to H(X) (discrete X).

Past works: use a binning-based proxy of I(X; T_ℓ) (a.k.a. quantization)
1. For non-negligible bin size, I(X; Bin(T_ℓ)) ≠ I(X; T_ℓ)
2. I(X; Bin(T_ℓ)) is highly sensitive to the user-defined bin size:
[Plot: binned MI (nats) of Layers 1–5 vs. training epoch, for bin sizes 0.0001, 0.001, 0.01 and 0.1]

⊛ Real problem: mismatch between the I(X; T_ℓ) measurement and the model
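For concreteness, here is a minimal sketch (not the code from any of the cited works) of the binning-based proxy, assuming scalar activations t and discrete inputs x; the function name, the toy data, and the bin sizes are illustrative choices only.

```python
import numpy as np

def binned_mutual_information(x, t, bin_size):
    """Proxy I(X; Bin(T)) for discrete inputs x and scalar activations t.

    Quantizes t into cells of width `bin_size` and computes the discrete
    mutual information (in nats) of the resulting joint histogram.
    """
    t_cells = np.floor(np.asarray(t) / bin_size).astype(int)
    xs, x_idx = np.unique(x, return_inverse=True)
    ts, t_idx = np.unique(t_cells, return_inverse=True)
    joint = np.zeros((len(xs), len(ts)))
    np.add.at(joint, (x_idx, t_idx), 1.0)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    pt = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ pt)[nz])))

# The proxy is highly sensitive to the user-defined bin size, as the plot shows:
rng = np.random.default_rng(0)
x = rng.integers(0, 4, size=10_000)                 # toy discrete input
t = np.tanh(x + 0.01 * rng.standard_normal(10_000)) # toy hidden activation
for b in (1e-4, 1e-3, 1e-2, 1e-1):
    print(b, binned_mutual_information(x, t, b))
```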

Auxiliary Framework - Noisy Deep Neural Networks

Modification: inject (small) Gaussian noise into each neuron's output.
Formally: T_ℓ = S_ℓ + Z_ℓ, where S_ℓ := f_ℓ(T_{ℓ−1}) and Z_ℓ ∼ N(0, σ²I_d)
[Diagram: X → f_1 → S_1 → (+ Z_1) → T_1 → f_2 → S_2 → (+ Z_2) → T_2 → ···]
⇒ X ↦ T_ℓ is a channel, parametrized by the DNN's parameters
⇒ I(X; T_ℓ) is a function of the parameters!
⊛ Challenge: how to accurately track I(X; T_ℓ)?
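A minimal NumPy sketch of the noise-injection step, assuming a fully connected tanh layer; the layer widths, random weights, and σ below are placeholders rather than the presentation's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_layer(t_prev, W, b, sigma):
    """One noisy layer: S = tanh(W t_prev + b), T = S + Z with Z ~ N(0, sigma^2 I)."""
    s = np.tanh(t_prev @ W + b)                   # deterministic part S_l = f_l(T_{l-1})
    z = sigma * rng.standard_normal(s.shape)      # isotropic Gaussian noise Z_l
    return s + z                                  # noisy representation T_l

# Hypothetical 12-10-7 slice of a tanh MLP with small injected noise:
x = rng.choice([-1.0, 1.0], size=(256, 12))       # batch of 12-bit inputs
W1, b1 = 0.1 * rng.standard_normal((12, 10)), np.zeros(10)
W2, b2 = 0.1 * rng.standard_normal((10, 7)), np.zeros(7)
t1 = noisy_layer(x, W1, b1, sigma=0.1)
t2 = noisy_layer(t1, W2, b2, sigma=0.1)
```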

High-Dimensional & Nonparametric Functional Estimation

Distill I(X; T_ℓ) estimation into noisy differential entropy estimation: estimate h(P ∗ N_σ) from n i.i.d. samples S^n := (S_i)_{i=1}^n of P ∈ F_d (a nonparametric class), given knowledge of N_σ (the Gaussian measure N(0, σ²I_d)).

Theorem (ZG-Greenewald-Polyanskiy-Weed'19): The sample complexity of any accurate estimator (additive gap η) is Ω(2^d / (ηd)).

Structured estimator*: ĥ(S^n, σ) := h(P̂_n ∗ N_σ), where P̂_n = (1/n) ∑_{i=1}^n δ_{S_i}
  * efficient and parallelizable

Theorem (ZG-Greenewald-Polyanskiy-Weed'19): For F^(SG)_{d,K} := {P : P is K-subgaussian in R^d}, d ≥ 1 and σ > 0, we have
  sup_{P ∈ F^(SG)_{d,K}} E_{S^n} |h(P ∗ N_σ) − ĥ(S^n, σ)| ≤ c^d_{σ,K} · n^{−1/2}

Optimality: ĥ(S^n, σ) attains the sharp dependence on both n and d!
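The structured estimator ĥ(S^n, σ) = h(P̂_n ∗ N_σ) is the differential entropy of an equal-weight Gaussian mixture centered at the samples, which has no closed form; one common way to evaluate it is Monte Carlo integration. Below is a hedged sketch of that evaluation (the function name `plugin_entropy` and the Monte Carlo budget are my own choices, not the paper's implementation, and the all-pairs distance computation only scales to modest n).

```python
import numpy as np
from scipy.special import logsumexp

def plugin_entropy(samples, sigma, n_mc=10_000, rng=None):
    """Monte Carlo value of h(P_hat_n * N_sigma): the differential entropy (nats)
    of an equal-weight Gaussian mixture with means `samples` and covariance
    sigma^2 I."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.atleast_2d(samples)                    # (n, d) sample means S_1..S_n
    n, d = s.shape
    # Draw Y ~ P_hat_n * N_sigma: pick a sample uniformly, then add Gaussian noise.
    idx = rng.integers(0, n, size=n_mc)
    y = s[idx] + sigma * rng.standard_normal((n_mc, d))
    # log mixture density: log q(y) = logsumexp_i log N(y; S_i, sigma^2 I) - log n
    sq = ((y[:, None, :] - s[None, :, :]) ** 2).sum(-1)          # (n_mc, n)
    log_gauss = -sq / (2 * sigma**2) - 0.5 * d * np.log(2 * np.pi * sigma**2)
    log_q = logsumexp(log_gauss, axis=1) - np.log(n)
    return float(-log_q.mean())                   # h = -E[log q(Y)]
```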

I(X; T_ℓ) Dynamics - Illustrative Minimal Example

Single-neuron classification: S_{w,b} := tanh(wX + b), T = S_{w,b} + Z with Z ∼ N(0, σ²)
Input: X ∼ Unif{±1, ±3}, with classes X_{y=−1} := {−3, −1, 1} and X_{y=1} := {3}
[Diagram: X → tanh(w · + b) → S_{w,b} → (+ Z) → T]
⊛ Training centers & sharpens the tanh transition (⇔ increases w while keeping b = −2w), e.g., from S_{1,0} to S_{5,−10}
[Plots: the symbol constellations for S_{1,0} and the sharpened S_{5,−10}]
✓ Correct classification performance

Mutual information: I(X; T) = I(S_{w,b}; S_{w,b} + Z)
⇒ I(X; T) is the number of nats transmittable over an AWGN channel with symbols S_{w,b} ∈ {tanh(−3w + b), tanh(−w + b), tanh(w + b), tanh(3w + b)}, which collapse toward {±1} as training sharpens the transition
[Plot: mutual information (nats) vs. training epoch, with reference levels ln 2, ln 3 and ln 4]
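Because T | X = x is Gaussian here, I(X; T) = h(T) − h(Z), where h(T) is the entropy of a four-component Gaussian mixture. The sketch below evaluates this numerically by Monte Carlo for the two (w, b) pairs mentioned on the slide; σ and the sample budget are illustrative choices, not the presentation's.

```python
import numpy as np
from scipy.special import logsumexp

def single_neuron_mi(w, b, sigma, n_mc=200_000, rng=None):
    """Monte Carlo I(X; T) for T = tanh(wX + b) + Z, X ~ Unif{-3,-1,1,3},
    Z ~ N(0, sigma^2).  Uses I(X; T) = h(T) - h(Z)."""
    rng = np.random.default_rng(0) if rng is None else rng
    symbols = np.tanh(w * np.array([-3.0, -1.0, 1.0, 3.0]) + b)   # S_{w,b}
    x_idx = rng.integers(0, 4, size=n_mc)
    t = symbols[x_idx] + sigma * rng.standard_normal(n_mc)
    # h(T): entropy of the 4-component Gaussian mixture, by Monte Carlo.
    log_comp = -(t[:, None] - symbols[None, :])**2 / (2 * sigma**2) \
               - 0.5 * np.log(2 * np.pi * sigma**2)
    log_q = logsumexp(log_comp, axis=1) - np.log(4)
    h_t = -log_q.mean()
    h_z = 0.5 * np.log(2 * np.pi * np.e * sigma**2)               # h(T | X) = h(Z)
    return float(h_t - h_z)

# Example: the two weight/bias pairs from the slide, with an illustrative sigma.
print(single_neuron_mi(w=1.0, b=0.0, sigma=0.2))
print(single_neuron_mi(w=5.0, b=-10.0, sigma=0.2))
```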

Clustering of Representations - Larger Networks

Noisy version of the DNN from [Shwartz-Tishby'17]:
Binary classification: 12-bit input, 12–10–7–5–4–3–2 tanh MLP
Verified in multiple additional experiments
⇒ Compression of I(X; T_ℓ) is driven by clustering of the representations

Circling Back to Deterministic DNNs

I(X; T_ℓ) is constant/infinite ⇒ it doesn't measure clustering
Reexamine the measurements: past works computed I(X; Bin(T_ℓ)) = H(Bin(T_ℓ))
H(Bin(T_ℓ)) measures clustering (it is maximized by a uniform spread over bins)
Test: I(X; T_ℓ) and H(Bin(T_ℓ)) are highly correlated in noisy DNNs*
  * when the bin size is chosen proportional to the noise standard deviation
⇒ Past works were not measuring MI, but clustering (via binned MI)!
By-product result: refutes the claim that 'compression (tight clustering) improves generalization' [come see us at poster #96 for details]
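As a rough illustration of the clustering measure, here is a small sketch of H(Bin(T_ℓ)) computed from a batch of layer activations, with the bin size tied to the noise standard deviation as the slide suggests; the placeholder activations are synthetic, not data from the experiments.

```python
import numpy as np

def binned_entropy(t, bin_size):
    """H(Bin(T)): discrete entropy (nats) of activations t quantized into
    hypercubes of side `bin_size`.  Low values indicate tightly clustered
    representations; the maximum is attained by a uniform spread over bins."""
    cells = np.floor(np.atleast_2d(t) / bin_size).astype(int)
    _, counts = np.unique(cells, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

# Heuristic from the slide: choose the bin size proportional to the noise std.
sigma = 0.1
rng = np.random.default_rng(0)
t_layer = np.tanh(rng.standard_normal((5000, 7)))   # placeholder layer activations
print(binned_entropy(t_layer, bin_size=sigma))
```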

Summary

Reexamined Information Bottleneck compression:
◮ I(X; T) fluctuations in deterministic DNNs are theoretically impossible
◮ Yet past works presented (binned) I(X; T) dynamics during training
Noisy DNN framework: studying information-theoretic quantities over DNNs
◮ An estimator with optimal dependence on n and d for accurate MI estimation
◮ Clustering of the learned representations is the source of compression
Clarified past observations of compression: they in fact show clustering
◮ Compression/clustering and generalization are not necessarily related

Thank you!

Clustering of Representations - Larger Networks

Noisy version of the DNN from [Shwartz-Tishby'17]:
Binary classification: 12-bit input, 12–10–7–5–4–3–2 tanh MLP
⊛ With weight orthonormality regularization [Cisse et al. '17]
Verified in multiple additional experiments
⇒ Compression of I(X; T_ℓ) is driven by clustering of the representations

Mutual Information Estimation in Noisy DNNs

Noisy DNN: T_ℓ = S_ℓ + Z_ℓ, where S_ℓ := f_ℓ(T_{ℓ−1}) and Z_ℓ ∼ N(0, σ²I_d)
[Diagram: X → f_1 → S_1 → (+ Z_1) → T_1 → f_2 → S_2 → (+ Z_2) → T_2 → ···]
Mutual information: I(X; T_ℓ) = h(T_ℓ) − ∫ dP_X(x) h(T_ℓ | X = x)
Structure: S_ℓ ⊥ Z_ℓ ⇒ T_ℓ = S_ℓ + Z_ℓ ∼ P ∗ N_σ
⊛ We know the distribution N_σ of Z_ℓ (the noise is injected by design)
⊛ P is extremely complicated ⇒ treat it as unknown
⊛ We can easily get i.i.d. samples from P via DNN forward passes
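Putting these pieces together, I(X; T_ℓ) can be estimated by applying a noisy-entropy estimator to the pooled samples of S_ℓ (for h(T_ℓ)) and to the samples grouped by input value (for each h(T_ℓ | X = x)). The sketch below assumes samples are organized this way and that an `entropy_fn` such as the plug-in sketch shown earlier is supplied; it is an illustrative decomposition, not the paper's exact code.

```python
import numpy as np

def estimate_mi(s_by_input, sigma, entropy_fn):
    """I(X; T) = h(T) - sum_x P_X(x) h(T | X = x) for a noisy layer T = S + Z.

    `s_by_input` maps each input x to an array of S-samples collected from
    repeated forward passes on that x; `entropy_fn(samples, sigma)` is a noisy
    differential entropy estimator (e.g. the plug-in sketch above)."""
    groups = [np.atleast_2d(s) for s in s_by_input.values()]
    n_total = sum(len(g) for g in groups)
    # Marginal term h(T): pool all samples (weights = empirical input frequencies).
    h_t = entropy_fn(np.concatenate(groups, axis=0), sigma)
    # Conditional term: weighted average of per-input noisy entropies.
    h_t_given_x = sum(len(g) / n_total * entropy_fn(g, sigma) for g in groups)
    return h_t - h_t_given_x
```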

Structured Estimator (with Implementation in Mind)

Differential entropy estimation under Gaussian convolutions: estimate h(P ∗ N_σ) from n i.i.d. samples S^n := (S_i)_{i=1}^n of an unknown P ∈ F_d (a nonparametric class), given knowledge of N_σ (the noise distribution).
Nonparametric class: specified by the DNN architecture (d = 'width' of T_ℓ)
Goal: simple and parallelizable, for efficient implementation
Estimator: ĥ(S^n, σ) := h(P̂_{S^n} ∗ N_σ), where P̂_{S^n} := (1/n) ∑_{i=1}^n δ_{S_i}
Plug-in: ĥ is the plug-in estimator for the functional T_σ(P) := h(P ∗ N_σ)

Structured Estimator - Convergence Rate

Theorem (ZG-Greenewald-Weed-Polyanskiy'19): For any σ > 0 and d ≥ 1, we have
  sup_{P ∈ F^(SG)_{d,K}} E |h(P ∗ N_σ) − h(P̂_{S^n} ∗ N_σ)| ≤ C_{σ,d,K} · n^{−1/2},
where C_{σ,d,K} = O_{σ,K}(c^d) for a constant c.

Comments:
Explicit expression: enables concrete error bounds in simulations
Minimax rate optimal: attains the parametric estimation rate O(n^{−1/2})
Proof (initial step), based on [Polyanskiy-Wu'16]:
  |h(P ∗ N_σ) − h(P̂_{S^n} ∗ N_σ)| ≲ W_1(P ∗ N_σ, P̂_{S^n} ∗ N_σ)
⇒ Analyze the empirical 1-Wasserstein distance under Gaussian convolutions

Empirical W_1 & The Magic of Gaussian Convolution

p-Wasserstein distance: for two distributions P and Q on R^d and p ≥ 1,
  W_p(P, Q) := inf (E ‖X − Y‖^p)^{1/p},
where the infimum is over all couplings of P and Q.
Empirical 1-Wasserstein distance:
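As a one-dimensional illustration of the quantity being analyzed, the sketch below approximates W_1 between a distribution and its empirical measure, before and after convolving both with Gaussian noise, using SciPy's 1-D Wasserstein routine; the two-atom P and all constants are placeholders.

```python
import numpy as np
from scipy.stats import wasserstein_distance  # exact W1 between 1-D sample sets

rng = np.random.default_rng(0)
sigma, n, m = 0.5, 500, 200_000

# Placeholder P: two atoms at +/-1.  A large reference sample stands in for P.
p_points = rng.choice([-1.0, 1.0], size=n)     # the n "data" samples S^n
ref = rng.choice([-1.0, 1.0], size=m)

# W1(P, P_hat_n): raw empirical measure vs. the (approximated) population.
w1_raw = wasserstein_distance(ref, p_points)

# W1(P * N_sigma, P_hat_n * N_sigma): convolve both sides with Gaussian noise.
smoothed_ref = ref + sigma * rng.standard_normal(m)
smoothed_emp = p_points[rng.integers(0, n, size=m)] + sigma * rng.standard_normal(m)
w1_smoothed = wasserstein_distance(smoothed_ref, smoothed_emp)

print(w1_raw, w1_smoothed)
```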
