Relative Fisher Information and Natural Gradient for Learning Large Modular Models



  1. Relative Fisher Information and Natural Gradient for Learning Large Modular Models. Ke Sun¹, Frank Nielsen²,³. ¹King Abdullah University of Science & Technology (KAUST), ²École Polytechnique, ³Sony CSL. ICML 2017.

  2. Fisher Information Metric (FIM)
  Consider a statistical model $p(x \mid \Theta)$ of order $D$. The FIM (Hotelling 1929; Rao 1945) $I(\Theta) = (I_{ij})$ is the $D \times D$ positive semi-definite matrix defined by
  $I_{ij} = E_p\!\left[\frac{\partial \ell}{\partial \Theta_i}\,\frac{\partial \ell}{\partial \Theta_j}\right], \qquad (1)$
  where $\ell(\Theta) = \log p(x \mid \Theta)$ denotes the log-likelihood.
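As a concrete illustration of Eq. (1) (not part of the slides), here is a minimal Python sketch that estimates the FIM of a univariate Gaussian with $\Theta = (\mu, \sigma)$ by averaging outer products of the score over samples drawn from the model; for this model the exact FIM is $\mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$.

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    """Score vector d/dTheta log N(x | mu, sigma^2) for Theta = (mu, sigma)."""
    d_mu = (x - mu) / sigma**2
    d_sigma = (x - mu)**2 / sigma**3 - 1.0 / sigma
    return np.array([d_mu, d_sigma])

def empirical_fim(mu, sigma, n_samples=100_000, seed=0):
    """Monte-Carlo estimate of Eq. (1): E_p[(dl/dTheta_i)(dl/dTheta_j)]."""
    rng = np.random.default_rng(seed)
    xs = rng.normal(mu, sigma, size=n_samples)
    scores = np.stack([gaussian_score(x, mu, sigma) for x in xs])  # (n, 2)
    return scores.T @ scores / n_samples                           # (2, 2) PSD matrix

print(empirical_fim(0.0, 2.0))   # roughly diag(1/sigma^2, 2/sigma^2) = diag(0.25, 0.5)
```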

  3. Equivalent Expressions
  $I_{ij} = E_p\!\left[\frac{\partial \ell}{\partial \Theta_i}\,\frac{\partial \ell}{\partial \Theta_j}\right] = -E_p\!\left[\frac{\partial^2 \ell}{\partial \Theta_i\, \partial \Theta_j}\right] = 4 \int \frac{\partial \sqrt{p(x \mid \Theta)}}{\partial \Theta_i}\,\frac{\partial \sqrt{p(x \mid \Theta)}}{\partial \Theta_j}\, \mathrm{d}x.$
  Observed FIM (Efron & Hinkley, 1978): with respect to $X^n = \{x_i\}_{i=1}^{n}$,
  $\hat{I} = -\nabla^2 \ell(\Theta \mid X^n) = -\sum_{i=1}^{n} \frac{\partial^2 \log p(x_i \mid \Theta)}{\partial \Theta\, \partial \Theta^{\intercal}}.$
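As a companion sketch (again my own example, not from the slides), the observed FIM $-\nabla^2 \ell(\Theta \mid X^n)$ can be approximated for the same Gaussian model by central finite differences; the helper names and step size below are illustrative.

```python
import numpy as np

def log_lik(theta, xs):
    """Total log-likelihood l(Theta | X^n) of a Gaussian, Theta = (mu, sigma)."""
    mu, sigma = theta
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (xs - mu)**2 / (2 * sigma**2))

def observed_fim(theta, xs, eps=1e-4):
    """Observed FIM: minus the Hessian of the log-likelihood, by central differences."""
    d = len(theta)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            t = np.array(theta, dtype=float)
            def f(di, dj):
                t2 = t.copy(); t2[i] += di; t2[j] += dj
                return log_lik(t2, xs)
            H[i, j] = (f(eps, eps) - f(eps, -eps) - f(-eps, eps) + f(-eps, -eps)) / (4 * eps**2)
    return -H

xs = np.random.default_rng(0).normal(0.0, 2.0, size=5000)
print(observed_fim([0.0, 2.0], xs) / len(xs))   # roughly the expected FIM diag(0.25, 0.5)
```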

  4. FIM and Statistical Learning
  ◮ Any parametric learning takes place inside a corresponding parameter manifold $M_\Theta$. [Figure: the manifold $M_\Theta$ with a learning curve; at a point $\theta$, the tangent space $T_\theta M_\Theta$ carries a local inner product $g(\theta)$.]
  ◮ The FIM gives an invariant Riemannian metric $g(\Theta) = I(\Theta)$ for any loss function based on a standard f-divergence (KL, cross-entropy, ...).
  S. Amari. Information Geometry and Its Applications. 2016.

  5. Invariance
  The FIM matrix itself is not invariant: it depends on the parameterization,
  $g_\Theta(\Theta) = J^{\intercal} g_\Lambda(\Lambda)\, J,$
  where $J$ is the Jacobian matrix $J_{ij} = \partial \Lambda_i / \partial \Theta_j$. However, its measurements such as $\langle \delta\Theta, \delta\Theta \rangle_{g(\Theta)}$ are invariant:
  $\langle \delta\Theta, \delta\Theta \rangle_{g(\Theta)} = \delta\Theta^{\intercal} g_\Theta(\Theta)\, \delta\Theta = \delta\Theta^{\intercal} J^{\intercal} g_\Lambda(\Lambda)\, J\, \delta\Theta = \delta\Lambda^{\intercal} g_\Lambda(\Lambda)\, \delta\Lambda = \langle \delta\Lambda, \delta\Lambda \rangle_{g(\Lambda)}.$
  Regardless of the choice of coordinate system, it is essentially the same metric!
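A small numerical check of this invariance (my own example, not from the slides): take the Gaussian FIM in the $(\mu, \sigma)$ coordinates, pull it back to the reparameterization $(\mu, \log\sigma)$ via the Jacobian, and confirm that the squared length of the same displacement agrees in both coordinate systems.

```python
import numpy as np

# FIM of N(mu, sigma^2) in the (mu, sigma) coordinates Lambda:
mu, sigma = 0.0, 2.0
g_lambda = np.diag([1.0 / sigma**2, 2.0 / sigma**2])

# Reparameterize to Theta = (mu, log sigma). Jacobian J_ij = dLambda_i / dTheta_j:
J = np.diag([1.0, sigma])            # d(mu)/d(mu) = 1, d(sigma)/d(log sigma) = sigma
g_theta = J.T @ g_lambda @ J         # pulled-back metric in the new coordinates

# The same small displacement, expressed in both coordinate systems:
d_theta = np.array([0.01, 0.02])     # (d mu, d log sigma)
d_lambda = J @ d_theta               # (d mu, d sigma)

print(d_theta @ g_theta @ d_theta)   # identical squared lengths
print(d_lambda @ g_lambda @ d_lambda)
```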

  6. Statistical Formulation of a Multilayer Perceptron (MLP)
  $p(y \mid x, \Theta) = \sum_{h_1, \ldots, h_{L-1}} p(y \mid h_{L-1}, \theta_L) \cdots p(h_2 \mid h_1, \theta_2)\, p(h_1 \mid x, \theta_1).$
  [Figure: the layered graph $x \to h_1 \to \cdots \to h_{L-1} \to y$, with $\theta_1$ parameterizing the map from $x$ to $h_1$ and $\theta_L$ the map from $h_{L-1}$ to $y$.]

  7. The FIM of a MLP
  The FIM of a MLP has the expression
  $g(\Theta) = E_{x \sim \hat{p}(X^n),\, y \sim p(y \mid x, \Theta)}\!\left[\frac{\partial \ell}{\partial \Theta}\,\frac{\partial \ell}{\partial \Theta^{\intercal}}\right] = \frac{1}{n} \sum_{i=1}^{n} E_{p(y \mid x_i, \Theta)}\!\left[\frac{\partial \ell_i}{\partial \Theta}\,\frac{\partial \ell_i}{\partial \Theta^{\intercal}}\right],$
  where
  ◮ $\hat{p}(X^n)$ is the empirical distribution of the samples $X^n = \{x_i\}_{i=1}^{n}$
  ◮ $\ell_i(\Theta) = \log p(y \mid x_i, \Theta)$ is the conditional log-likelihood
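To make the double expectation concrete, the sketch below estimates this FIM for a minimal stand-in model, a single linear soft-max layer rather than the slides' multi-layer construction: it averages over the empirical inputs and samples $y \sim p(y \mid x_i, \Theta)$ for the inner expectation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mlp_fim(W, X, n_y_samples=200, seed=0):
    """Monte-Carlo FIM of a one-layer soft-max model p(y | x, W):
    average over the empirical x-distribution, expectation over y ~ p(y | x, W)."""
    rng = np.random.default_rng(seed)
    D = W.size
    g = np.zeros((D, D))
    for x in X:
        p = softmax(W.T @ x)                          # class probabilities for this x
        for _ in range(n_y_samples):
            y = rng.choice(len(p), p=p)
            onehot = np.eye(len(p))[y]
            score = np.outer(x, onehot - p).ravel()   # d log p(y|x,W) / d vec(W)
            g += np.outer(score, score)
    return g / (len(X) * n_y_samples)

X = np.random.default_rng(1).normal(size=(20, 3))     # toy inputs
W = np.zeros((3, 4))                                  # 3 inputs, 4 classes
print(mlp_fim(W, X).shape)                            # (12, 12)
```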

  8. Meaning of the FIM of a MLP
  Consider a learning step on $M_\Theta$ from $\Theta$ to $\Theta + \delta\Theta$. The squared step size
  $\langle \delta\Theta, \delta\Theta \rangle_{g(\Theta)} = \delta\Theta^{\intercal} g(\Theta)\, \delta\Theta = \frac{1}{n} \sum_{i=1}^{n} \delta\Theta^{\intercal}\, E_{p(y \mid x_i, \Theta)}\!\left[\frac{\partial \ell_i}{\partial \Theta}\,\frac{\partial \ell_i}{\partial \Theta^{\intercal}}\right] \delta\Theta = \frac{1}{n} \sum_{i=1}^{n} E_{p(y \mid x_i, \Theta)}\!\left[\left(\delta\Theta^{\intercal}\,\frac{\partial \ell_i}{\partial \Theta}\right)^{2}\right]$
  measures how much $\delta\Theta$ is statistically aligned with $\partial \ell / \partial \Theta$. Will $\delta\Theta$ make a significant change to the mapping $x \to y$ or not?

  9. Natural Gradient: Seeking a Short Path
  Consider $\min_{\Theta \in M_\Theta} L(\Theta)$. At $\Theta^t \in M_\Theta$, the target is to minimize with respect to $\delta\Theta$
  $\underbrace{L(\Theta^t + \delta\Theta)}_{\text{loss function}} + \frac{1}{2\gamma} \underbrace{\langle \delta\Theta, \delta\Theta \rangle_{g(\Theta^t)}}_{\text{squared step size}} \approx L(\Theta^t) + \delta\Theta^{\intercal} \nabla L(\Theta^t) + \frac{1}{2\gamma}\, \delta\Theta^{\intercal} g(\Theta^t)\, \delta\Theta \qquad (\gamma: \text{learning rate}),$
  giving the learning step
  $\delta\Theta^t = -\gamma\, \underbrace{g^{-1}(\Theta^t)\, \nabla L(\Theta^t)}_{\text{natural gradient}}.$
  ◮ Equivalence with mirror descent (Raskutti & Mukherjee, 2013)
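A minimal sketch of this update rule (the damping term and the callable names `loss_grad` and `fim` are my additions, not from the slides); solving a linear system avoids forming the inverse FIM explicitly.

```python
import numpy as np

def natural_gradient_step(theta, loss_grad, fim, gamma=0.1, damping=1e-4):
    """One natural-gradient step: theta <- theta - gamma * g(theta)^{-1} grad L(theta).
    loss_grad and fim are user-supplied callables; a small damping term keeps the
    (possibly singular) FIM invertible."""
    g = fim(theta) + damping * np.eye(theta.size)
    return theta - gamma * np.linalg.solve(g, loss_grad(theta))

# Toy usage with a quadratic loss and a constant (identity) metric:
theta = np.array([1.0, -2.0])
step = natural_gradient_step(theta,
                             loss_grad=lambda t: 2 * t,          # grad of ||t||^2
                             fim=lambda t: np.eye(t.size))
print(step)   # plain gradient descent is recovered when g = I
```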

  10. Natural Gradient: Intrinsics
  $\delta\Theta^t = -\gamma\, g^{-1}(\Theta^t)\, \nabla L(\Theta^t)$
  This Riemannian metric is a property of the parameter space that is independent of the loss function $L(\Theta)$. The good performance of natural gradient relies on $L(\Theta)$ being curved similarly to $\log p(x \mid \Theta)$ (for $x \sim p(x \mid \Theta)$). Natural gradient is not universally good for every loss function.

  11. Natural Gradient: Pros and Cons
  Pros
  ◮ Invariant (intrinsic) gradient
  ◮ Not trapped in plateaus
  ◮ Achieves Fisher efficiency in online learning
  Cons
  ◮ Too expensive to compute (no closed-form FIM; requires matrix inversion)

  12. Relative FIM — Informal Ideas
  ◮ Decompose the learning system into subsystems
  ◮ The subsystems are interfaced with each other through hidden variables $h_i$
  ◮ Some subsystems are interfaced with the I/O environment through $x_i$ and $y_i$
  ◮ Compute a subsystem's FIM by integrating out its interface variables $h_i$, so that the intrinsic geometry of this subsystem can be discussed regardless of the remaining parts

  13. From FIM to Relative FIM (RFIM)
  FIM: likelihood $\log p(x \mid \theta)$, parameter vector $\theta$. How sensitive is $x$ to tiny movements of $\theta$ on $M_\theta$?
  RFIM: likelihood $\log p(r \mid \theta, \theta_f)$, parameter vector $\theta$. Given $\theta_f$, how sensitive is $r$ to tiny movements of $\theta$?

  14. Relative FIM — Definition
  Given $\theta_f$ (the reference), the Relative Fisher Information Metric (RFIM) of $\theta$ with respect to $h$ (the response) is
  $g^{h}(\theta \mid \theta_f) = E_{p(h \mid \theta, \theta_f)}\!\left[\frac{\partial}{\partial \theta} \ln p(h \mid \theta, \theta_f)\;\frac{\partial}{\partial \theta^{\intercal}} \ln p(h \mid \theta, \theta_f)\right],$
  or simply $g^{h}(\theta)$. Meaning: given $\theta_f$, how variations of $\theta$ affect the response $h$.
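The definition translates directly into a Monte-Carlo estimator; in the sketch below, `sample_h` and `score` are hypothetical callables describing whatever subsystem $p(h \mid \theta, \theta_f)$ is under study (they are not part of the slides).

```python
import numpy as np

def rfim_estimate(theta, theta_f, sample_h, score, n_samples=1000, seed=0):
    """Monte-Carlo estimate of the RFIM g^h(theta | theta_f).

    sample_h(theta, theta_f, rng) -> one response h drawn from p(h | theta, theta_f)
    score(h, theta, theta_f)      -> d/d theta  ln p(h | theta, theta_f)
    Both callables are placeholders for the subsystem being analysed."""
    rng = np.random.default_rng(seed)
    g = np.zeros((theta.size, theta.size))
    for _ in range(n_samples):
        h = sample_h(theta, theta_f, rng)
        s = score(h, theta, theta_f)
        g += np.outer(s, s)
    return g / n_samples
```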

  15. Different Subsystems – Simple Examples
  [Figure: two subsystem patterns: a generator, with parameters $\theta$ producing a response $h_i$; and a discriminator or regressor, with parameters $\theta$ mapping an input $h_i$ to a response $h'_i$.]

  16. A Dynamic Geometry
  Model: $p(y \mid \Theta, x) = \sum_{h_1} \sum_{h_2} p(h_1 \mid \theta_1, x)\, p(h_2 \mid \theta_2, h_1)\, p(y \mid \theta_3, h_2).$
  [Figure: the computational graph $x \to h_1 \to h_2 \to y$ parameterized by $\theta_1, \theta_2, \theta_3$; the whole-model manifold $M_\Theta$ with metric $I(\Theta)$, versus the subsystem manifolds $M_{\theta_1}, M_{\theta_2}, M_{\theta_3}$ with metrics $g^{h_1}(\theta_1)$, $g^{h_2}(\theta_2)$, $g^{y}(\theta_3)$, which move as $x$, $h_1$, $h_2$ change.]
  ◮ As the interface hidden variables $h_i$ change, the subsystem geometry is not absolute but relative to the reference variables provided by adjacent subsystems.

  17. RFIM of One tanh Neuron
  Consider a neuron with input $x$, weights $w$, a hyperbolic tangent activation function, and a stochastic output $y \in \{-1, 1\}$, given by
  $p(y = 1) = \frac{1 + \tanh(w^{\intercal} \tilde{x})}{2}, \qquad \tanh(t) = \frac{\exp(t) - \exp(-t)}{\exp(t) + \exp(-t)},$
  where $\tilde{x} = (x^{\intercal}, 1)^{\intercal}$ denotes the augmented vector of $x$. Then
  $g^{y}(w \mid x) = \nu_{\tanh}(w, x)\, \tilde{x} \tilde{x}^{\intercal}, \qquad \nu_{\tanh}(w, x) = \operatorname{sech}^2(w^{\intercal} \tilde{x}).$
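This closed form is easy to evaluate; a minimal sketch, assuming the bias is handled by augmenting $x$ with a constant 1 as above:

```python
import numpy as np

def tanh_neuron_rfim(w, x):
    """Closed-form RFIM of one tanh neuron: sech^2(w^T x~) x~ x~^T,
    where x~ is x augmented with a constant 1 (bias)."""
    x_aug = np.append(x, 1.0)
    nu = 1.0 / np.cosh(w @ x_aug) ** 2        # sech^2(w^T x~)
    return nu * np.outer(x_aug, x_aug)

w = np.array([0.5, -1.0, 0.2])                # 2 inputs + bias
x = np.array([1.0, 2.0])
print(tanh_neuron_rfim(w, x))                 # 3x3 rank-one PSD matrix
```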

  18. RFIM of a Parametric Rectified Linear Unit
  $p(y \mid w, x) = G(y \mid \mathrm{relu}(w^{\intercal} \tilde{x}), \sigma^2)$ ($G$ denotes a Gaussian),
  $\mathrm{relu}(t) = \begin{cases} t & \text{if } t \ge 0 \\ \iota t & \text{if } t < 0 \end{cases} \qquad (0 \le \iota < 1).$
  Under certain assumptions,
  $g^{y}(w \mid x) = \nu_{\mathrm{relu}}(w, x)\, \tilde{x} \tilde{x}^{\intercal}, \qquad \nu_{\mathrm{relu}}(w, x) = \frac{1}{\sigma^2} \left[\, \iota + (1 - \iota)\, \underbrace{\mathrm{sigm}}_{\text{sigmoid}}\!\left(\frac{1 - \iota}{\omega}\, w^{\intercal} \tilde{x}\right) \right]^{2}.$
  Setting $\sigma = 1$, $\iota = 0$, this simplifies to
  $\nu_{\mathrm{relu}}(w, x) = \mathrm{sigm}^2\!\left(\frac{1}{\omega}\, w^{\intercal} \tilde{x}\right).$

  19. Generic Expression of One-Neuron RFIMs
  Let $f \in \{\tanh, \mathrm{sigm}, \mathrm{relu}, \mathrm{elu}\}$ be an element-wise nonlinear activation function. The RFIM is
  $g^{y}(w \mid x) = \nu_f(w, x)\, \tilde{x} \tilde{x}^{\intercal},$
  where $\nu_f(w, x)$ is a positive coefficient that takes large values in the linear region, i.e. the effective learning zone of the neuron.
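A sketch of this generic form, with the coefficients $\nu_f$ for tanh, sigm, and relu taken from the slides; the smoothing parameter `omega` and its default value here are only illustrative, and the relu case uses the parametric form above.

```python
import numpy as np

def sigm(t):
    return 1.0 / (1.0 + np.exp(-t))

def nu_f(f, w, x_aug, iota=0.0, omega=1.0):
    """Activation-dependent coefficient nu_f(w, x) of the one-neuron RFIM
    (tanh and sigm as on the slides; relu per the parametric form, with
    omega a smoothing parameter whose default here is illustrative)."""
    t = w @ x_aug
    if f == "tanh":
        return 1.0 / np.cosh(t) ** 2                       # sech^2(t)
    if f == "sigm":
        return sigm(t) * (1.0 - sigm(t))
    if f == "relu":
        return (iota + (1.0 - iota) * sigm((1.0 - iota) * t / omega)) ** 2
    raise ValueError(f)

def one_neuron_rfim(f, w, x):
    """Generic one-neuron RFIM: nu_f(w, x) x~ x~^T."""
    x_aug = np.append(x, 1.0)
    return nu_f(f, w, x_aug) * np.outer(x_aug, x_aug)
```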

  20. RFIM of a Linear Layer
  Let $x$ be the input, $W$ the connection weights, and $y$ a stochastic output following
  $p(y \mid W, x) = G(y \mid W^{\intercal} \tilde{x}, \sigma^2 I).$
  We vectorize $W$ by stacking its columns $\{w_i\}$. Then
  $g^{y}(W \mid x) = \frac{1}{\sigma^2}\, \mathrm{diag}\!\left(\tilde{x} \tilde{x}^{\intercal}, \ldots, \tilde{x} \tilde{x}^{\intercal}\right).$

  21. RFIM of a Non-linear Layer
  A non-linear layer applies an element-wise activation to $W^{\intercal} \tilde{x}$. We have
  $g^{y}(W \mid x) = \mathrm{diag}\!\left(\nu_f(w_1, x)\, \tilde{x} \tilde{x}^{\intercal}, \ldots, \nu_f(w_m, x)\, \tilde{x} \tilde{x}^{\intercal}\right),$
  where $\nu_f(w_i, x)$ depends on the activation function $f$.
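Both layer RFIMs share the same block-diagonal assembly. A sketch, where `nu` is the per-unit coefficient $\nu_f$ (pass `None` for the linear layer); `nu` could be the `nu_f` from the previous sketch.

```python
import numpy as np
from scipy.linalg import block_diag

def layer_rfim(W, x, nu=None, sigma=1.0):
    """Block-diagonal layer RFIM. Each column w_i of W contributes one block:
    x~x~^T / sigma^2 for a linear layer (nu is None), or nu(w_i, x~) x~x~^T
    for a non-linear layer, where nu is the activation-dependent coefficient."""
    x_aug = np.append(x, 1.0)                 # augmented input x~ = (x, 1)
    xx = np.outer(x_aug, x_aug)
    blocks = [(1.0 / sigma**2 if nu is None else nu(w_i, x_aug)) * xx
              for w_i in W.T]                 # iterate over the columns of W
    return block_diag(*blocks)

# Example: a tanh layer with 3 augmented input dimensions and 3 units.
W = np.ones((3, 3)) * 0.1                     # shape (dim of x~, number of units)
g = layer_rfim(W, np.array([1.0, -2.0]), nu=lambda w, xa: 1.0 / np.cosh(w @ xa) ** 2)
print(g.shape)                                # (9, 9)
```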

  22. The RFIMs of single-neuron models, a linear layer, a non-linear layer, a soft-max layer, and two consecutive layers all have simple closed-form expressions¹.
  ¹ See the paper.

  23. List of RFIMs
  Subsystem            | RFIM $g^{y}(w)$
  A tanh neuron        | $\operatorname{sech}^2(w^{\intercal}\tilde{x})\, \tilde{x}\tilde{x}^{\intercal}$
  A sigm neuron        | $\mathrm{sigm}(w^{\intercal}\tilde{x})\,[1 - \mathrm{sigm}(w^{\intercal}\tilde{x})]\, \tilde{x}\tilde{x}^{\intercal}$
  A relu neuron        | $\left[\iota + (1 - \iota)\,\mathrm{sigm}\!\left(\frac{1-\iota}{\omega}\, w^{\intercal}\tilde{x}\right)\right]^2 \tilde{x}\tilde{x}^{\intercal}$
  An elu neuron        | $\tilde{x}\tilde{x}^{\intercal}$ if $w^{\intercal}\tilde{x} \ge 0$; $\left(\alpha \exp(w^{\intercal}\tilde{x})\right)^2 \tilde{x}\tilde{x}^{\intercal}$ if $w^{\intercal}\tilde{x} < 0$
  A linear layer       | $\mathrm{diag}\!\left(\tilde{x}\tilde{x}^{\intercal}, \ldots, \tilde{x}\tilde{x}^{\intercal}\right)$
  A non-linear layer   | $\mathrm{diag}\!\left(\nu_f(w_1, x)\,\tilde{x}\tilde{x}^{\intercal}, \ldots, \nu_f(w_m, x)\,\tilde{x}\tilde{x}^{\intercal}\right)$
  A soft-max layer     | block matrix with $(i, j)$ block $(\eta_i \delta_{ij} - \eta_i \eta_j)\, \tilde{x}\tilde{x}^{\intercal}$: diagonal blocks $(\eta_i - \eta_i^2)\, \tilde{x}\tilde{x}^{\intercal}$, off-diagonal blocks $-\eta_i \eta_j\, \tilde{x}\tilde{x}^{\intercal}$
  Two layers           | see the paper
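As one worked entry from this table, here is a sketch of the soft-max layer RFIM; the Kronecker-product layout assumes $\eta = \mathrm{softmax}(W^{\intercal}\tilde{x})$ and that $W$ is vectorized by stacking its columns, as on the linear-layer slide (that ordering is my assumption).

```python
import numpy as np

def softmax_layer_rfim(W, x):
    """RFIM of a soft-max layer, following the table entry: block (i, j) equals
    (eta_i * delta_ij - eta_i * eta_j) x~x~^T, where eta = softmax(W^T x~).
    The Kronecker product assumes vec(W) stacks the columns w_1, ..., w_m."""
    x_aug = np.append(x, 1.0)
    xx = np.outer(x_aug, x_aug)
    z = W.T @ x_aug
    eta = np.exp(z - z.max())
    eta /= eta.sum()
    coef = np.diag(eta) - np.outer(eta, eta)   # m x m block coefficients
    return np.kron(coef, xx)                   # (m * dim(x~))-sized square RFIM

W = np.zeros((4, 3))                           # dim(x~) = 4, m = 3 outputs
print(softmax_layer_rfim(W, np.array([1.0, 2.0, 3.0])).shape)   # (12, 12)
```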

  24. Relative Natural Gradient Descent (RNGD)
  For each subsystem,
  $\theta^{t+1} \leftarrow \theta^{t} - \gamma \cdot \underbrace{\left[\bar{g}^{h}(\theta^{t} \mid \theta_f)\right]^{-1}}_{\text{inverse RFIM}} \cdot \left.\frac{\partial L}{\partial \theta}\right|_{\theta = \theta^{t}},$
  where
  $\bar{g}^{h}(\theta^{t} \mid \theta_f) = \frac{1}{n} \sum_{i=1}^{n} g^{h}(\theta^{t} \mid \theta_f^{i}).$
  By definition, the RFIM is a function of the reference variables; $\bar{g}^{h}(\theta^{t} \mid \theta_f)$ is its expectation w.r.t. an empirical distribution of $\theta_f$.
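A sketch of one RNGD update for a single subsystem; the damping term and the callable names `rfim` and `loss_grad` are my additions, standing in for the subsystem's closed-form RFIM and its back-propagated gradient.

```python
import numpy as np

def rngd_step(theta, references, loss_grad, rfim, gamma=0.1, damping=1e-4):
    """One relative natural gradient step: average the RFIM over an empirical set
    of reference values theta_f, then precondition the loss gradient with the
    (damped) inverse of that averaged metric."""
    g_bar = sum(rfim(theta, theta_f) for theta_f in references) / len(references)
    g_bar = g_bar + damping * np.eye(theta.size)     # damping for numerical stability
    return theta - gamma * np.linalg.solve(g_bar, loss_grad(theta))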
