Non-convex Learning via Replica Exchange Stochastic Gradient MCMC




  1. Non-convex Learning via Replica Exchange Stochastic Gradient MCMC: a scalable parallel tempering algorithm for DNNs
Wei Deng¹, Qi Feng*², Liyao Gao*¹, Faming Liang¹, Guang Lin¹ (* equal contribution)
¹Purdue University, ²University of Southern California
July 27, 2020

  2. Intro

  3. Markov chain Monte Carlo
The increasing concern for AI safety problems draws our attention to Markov chain Monte Carlo (MCMC), which is known for
• Multi-modal sampling [Teh et al., 2016]
• Non-convex optimization [Zhang et al., 2017]

  4. Acceleration strategies for MCMC
Popular strategies to accelerate MCMC:
• Simulated annealing [Kirkpatrick et al., 1983]
• Simulated tempering [Marinari and Parisi, 1992]
• Replica exchange MCMC [Swendsen and Wang, 1986]

  5. Replica exchange stochastic gradient MCMC

  6. Replica exchange Langevin diffusion
Consider two Langevin diffusion processes with $\tau_1 > \tau_2$:
$$d\beta^{(1)}_t = -\nabla U(\beta^{(1)}_t)\,dt + \sqrt{2\tau_1}\,dW^{(1)}_t,$$
$$d\beta^{(2)}_t = -\nabla U(\beta^{(2)}_t)\,dt + \sqrt{2\tau_2}\,dW^{(2)}_t.$$
Moreover, the positions of the two particles swap with probability $r\,S(\beta^{(1)}_t, \beta^{(2)}_t)\,dt$, where
$$S(\beta^{(1)}_t, \beta^{(2)}_t) := e^{\left(\frac{1}{\tau_1} - \frac{1}{\tau_2}\right)\left(U(\beta^{(1)}_t) - U(\beta^{(2)}_t)\right)}.$$
In other words, a jump process is included in the Markov process:
$$\mathbb{P}\left(\beta_{t+dt} = (\beta^{(2)}_t, \beta^{(1)}_t) \,\middle|\, \beta_t = (\beta^{(1)}_t, \beta^{(2)}_t)\right) = r\,S(\beta^{(1)}_t, \beta^{(2)}_t)\,dt,$$
$$\mathbb{P}\left(\beta_{t+dt} = (\beta^{(1)}_t, \beta^{(2)}_t) \,\middle|\, \beta_t = (\beta^{(1)}_t, \beta^{(2)}_t)\right) = 1 - r\,S(\beta^{(1)}_t, \beta^{(2)}_t)\,dt.$$
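
As a concrete illustration, here is a minimal sketch of how this coupled process could be simulated with an Euler-Maruyama discretization and Bernoulli swap attempts. The double-well potential U, the step size dt, the swap intensity r, and the temperatures are illustrative choices, not taken from the slides.

```python
import numpy as np

def U(x):
    # Illustrative 1-D double-well potential (not from the slides).
    return (x**2 - 1.0)**2

def grad_U(x):
    return 4.0 * x * (x**2 - 1.0)

def replica_exchange_langevin(tau1=5.0, tau2=0.5, dt=1e-3, r=1.0,
                              n_steps=10000, seed=0):
    """Euler-Maruyama simulation of two Langevin diffusions (tau1 > tau2)
    whose positions swap with probability r * S * dt at each step."""
    rng = np.random.default_rng(seed)
    b1, b2 = 2.0, -2.0                      # arbitrary starting points
    path = np.empty((n_steps, 2))
    for k in range(n_steps):
        # Langevin updates at the high and low temperatures.
        b1 += -grad_U(b1) * dt + np.sqrt(2.0 * tau1 * dt) * rng.standard_normal()
        b2 += -grad_U(b2) * dt + np.sqrt(2.0 * tau2 * dt) * rng.standard_normal()
        # Attempt a swap with probability r * S * dt,
        # where S = exp((1/tau1 - 1/tau2) * (U(b1) - U(b2))).
        S = np.exp((1.0 / tau1 - 1.0 / tau2) * (U(b1) - U(b2)))
        if rng.uniform() < min(1.0, r * S * dt):
            b1, b2 = b2, b1
        path[k] = b1, b2
    return path
```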

  7. A demo
Figure 1: Trajectory plot for replica exchange Langevin diffusion.

  8. Why the naïve numerical algorithm fails
Consider the scalable stochastic gradient Langevin dynamics algorithm [Welling and Teh, 2011]:
$$\tilde\beta^{(1)}_{k+1} = \tilde\beta^{(1)}_k - \eta_k \nabla\tilde L(\tilde\beta^{(1)}_k) + \sqrt{2\eta_k\tau_1}\,\xi^{(1)}_k,$$
$$\tilde\beta^{(2)}_{k+1} = \tilde\beta^{(2)}_k - \eta_k \nabla\tilde L(\tilde\beta^{(2)}_k) + \sqrt{2\eta_k\tau_2}\,\xi^{(2)}_k.$$
Swap the chains with a naïve swapping rate $r\,S(\tilde\beta^{(1)}_{k+1}, \tilde\beta^{(2)}_{k+1})\,\eta_k$ §:
$$S(\tilde\beta^{(1)}_{k+1}, \tilde\beta^{(2)}_{k+1}) = e^{\left(\frac{1}{\tau_1} - \frac{1}{\tau_2}\right)\left(\tilde L(\tilde\beta^{(1)}_{k+1}) - \tilde L(\tilde\beta^{(2)}_{k+1})\right)}.$$
Exponentiating the unbiased estimators $\tilde L(\tilde\beta^{(\cdot)}_{k+1})$ leads to a large bias.
§ In the implementations, we fix $r\eta_k = 1$ by default.
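
The source of the bias is Jensen's inequality: for Gaussian noise, $\mathbb E[e^{c\tilde L}] = e^{cL + c^2\sigma^2/2} > e^{cL}$. Below is a toy numerical check under an assumed Gaussian noise model; all temperatures, loss gaps, and noise levels are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
tau1, tau2 = 2.0, 1.0                 # illustrative temperatures, tau1 > tau2
c = 1.0 / tau1 - 1.0 / tau2           # the (1/tau1 - 1/tau2) factor
dL, sigma = 0.5, 2.0                  # assumed true loss gap and per-loss noise std

# L~(b1) - L~(b2) is unbiased for dL but has variance 2 * sigma**2.
noisy_dL = dL + sigma * np.sqrt(2.0) * rng.standard_normal(1_000_000)
naive = np.exp(c * noisy_dL).mean()   # Monte Carlo mean of the naive swap rate
exact = np.exp(c * dL)                # swap rate based on the true losses
print(naive / exact)                  # ~= exp(c**2 * sigma**2) ~= 2.7: a systematic bias
```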

  9. A corrected algorithm
Assume $\tilde L(\theta) \sim \mathcal N(L(\theta), \sigma^2)$ and consider the geometric Brownian motion of $\{\tilde S_t\}_{t\in[0,1]}$ in each swap as a martingale:
$$\tilde S_t = e^{\left(\frac{1}{\tau_1}-\frac{1}{\tau_2}\right)\left(\tilde L(\tilde\beta^{(1)}) - \tilde L(\tilde\beta^{(2)}) - \left(\frac{1}{\tau_1}-\frac{1}{\tau_2}\right)\sigma^2 t\right)} = e^{\left(\frac{1}{\tau_1}-\frac{1}{\tau_2}\right)\left(L(\tilde\beta^{(1)}) - L(\tilde\beta^{(2)}) - \left(\frac{1}{\tau_1}-\frac{1}{\tau_2}\right)\sigma^2 t + \sqrt{2}\,\sigma W_t\right)}. \tag{2}$$
Taking the derivative of $\tilde S_t$ with respect to $t$ and $W_t$, Itô's lemma gives (the drift term cancels)
$$d\tilde S_t = \sqrt{2}\,\sigma\left(\frac{1}{\tau_1}-\frac{1}{\tau_2}\right)\tilde S_t\,dW_t.$$
By fixing $t = 1$ in (2), we have the suggested unbiased swapping rate
$$\tilde S_1 = e^{\left(\frac{1}{\tau_1}-\frac{1}{\tau_2}\right)\left(\tilde L(\tilde\beta^{(1)}) - \tilde L(\tilde\beta^{(2)}) - \left(\frac{1}{\tau_1}-\frac{1}{\tau_2}\right)\sigma^2\right)}.$$
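
A minimal sketch of the corrected rate under the Gaussian-noise assumption above; the function name and arguments are illustrative (`loss1`, `loss2` stand for the stochastic loss estimates and `sigma2` for their, here assumed known, variance).

```python
import numpy as np

def corrected_swap_rate(loss1, loss2, tau1, tau2, sigma2):
    """Swap rate S~_1 (t = 1 in eq. (2)): subtracting (1/tau1 - 1/tau2) * sigma2
    cancels the bias introduced by exponentiating noisy loss estimates."""
    c = 1.0 / tau1 - 1.0 / tau2
    return np.exp(c * (loss1 - loss2 - c * sigma2))
```

In expectation over the loss noise, this correction recovers the noise-free rate based on the true losses, which is what the martingale argument above guarantees.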

  10. Unknown corrections in practice
Figure 2: Unknown corrections on CIFAR 10 and CIFAR 100 datasets.

  11. An adaptive algorithm for unknown corrections
Sampling step:
$$\tilde\beta^{(1)}_{k+1} = \tilde\beta^{(1)}_k - \eta^{(1)}_k \nabla\tilde L(\tilde\beta^{(1)}_k) + \sqrt{2\eta^{(1)}_k\tau_1}\,\xi^{(1)}_k,$$
$$\tilde\beta^{(2)}_{k+1} = \tilde\beta^{(2)}_k - \eta^{(2)}_k \nabla\tilde L(\tilde\beta^{(2)}_k) + \sqrt{2\eta^{(2)}_k\tau_2}\,\xi^{(2)}_k.$$
Stochastic approximation step: obtain an unbiased estimate $\tilde\sigma^2_{m+1}$ of $\sigma^2$ and update
$$\hat\sigma^2_{m+1} = (1-\gamma_m)\,\hat\sigma^2_m + \gamma_m\,\tilde\sigma^2_{m+1}.$$
Swapping step: generate a uniform random number $u \in [0,1]$ and compute
$$\hat S_1 = \exp\left\{\left(\frac{1}{\tau_1}-\frac{1}{\tau_2}\right)\left(\tilde L(\tilde\beta^{(1)}_{k+1}) - \tilde L(\tilde\beta^{(2)}_{k+1}) - \left(\frac{1}{\tau_1}-\frac{1}{\tau_2}\right)\hat\sigma^2_{m+1}\right)\right\}.$$
If $u < \hat S_1$: swap $\tilde\beta^{(1)}_{k+1}$ and $\tilde\beta^{(2)}_{k+1}$.
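
Putting the three steps together, here is a rough sketch of one iteration. The gradient estimator `grad_L`, loss estimator `loss_L`, and variance estimator `estimate_sigma2` are hypothetical callables supplied by the user, and the step sizes, temperatures, and gamma schedule are illustrative; this is not the authors' implementation.

```python
import numpy as np

def adaptive_resgld_step(b1, b2, sigma2_hat, m, grad_L, loss_L, estimate_sigma2,
                         eta1=1e-6, eta2=1e-6, tau1=10.0, tau2=1.0, rng=None):
    """One iteration of adaptive replica exchange SGLD:
    sampling step, stochastic approximation step, swapping step."""
    rng = rng if rng is not None else np.random.default_rng()
    # Sampling step: SGLD updates of the two chains at temperatures tau1 > tau2.
    b1 = b1 - eta1 * grad_L(b1) + np.sqrt(2.0 * eta1 * tau1) * rng.standard_normal(b1.shape)
    b2 = b2 - eta2 * grad_L(b2) + np.sqrt(2.0 * eta2 * tau2) * rng.standard_normal(b2.shape)
    # Stochastic approximation step: smooth a fresh unbiased variance estimate
    # into the running estimate sigma2_hat.
    sigma2_tilde = estimate_sigma2(b1)    # hypothetical estimator of sigma^2
    gamma = 1.0 / (m + 1)                 # illustrative step-size schedule
    sigma2_hat = (1.0 - gamma) * sigma2_hat + gamma * sigma2_tilde
    # Swapping step: accept a swap if a uniform draw falls below the corrected rate.
    c = 1.0 / tau1 - 1.0 / tau2
    S_hat = np.exp(c * (loss_L(b1) - loss_L(b2) - c * sigma2_hat))
    if rng.uniform() < S_hat:
        b1, b2 = b2, b1
    return b1, b2, sigma2_hat
```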

  12. Convergence Analysis

  13. Discretization Error
Replica exchange SGLD tracks the replica exchange Langevin diffusion in some sense.
Lemma (Discretization Error). Given the smoothness and dissipativity assumptions in the appendix, and a small (fixed) learning rate $\eta$, we have
$$\mathbb E\Big[\sup_{0\le t\le T}\|\beta_t - \tilde\beta^{\eta}_t\|^2\Big] \le \tilde O\big(\eta + \max_i \mathbb E[\|\phi_i\|^2] + \max_i \mathbb E[|\psi_i|^2]\big),$$
where $\tilde\beta^{\eta}_t$ is the continuous-time interpolation for reSGLD, $\phi := \nabla\tilde U - \nabla U$ is the noise in the stochastic gradient, and $\psi := \tilde S - S$ is the noise in the stochastic swapping rate.

  14. Accelerated exponential decay of $\mathcal W_2$
(i) Log-Sobolev inequality for the Langevin diffusion [Cattiaux et al., 2010]:
• Lyapunov condition: $V(x_1, x_2) := e^{\frac{a}{4}\left(\frac{\|x_1\|^2}{\tau_1} + \frac{\|x_2\|^2}{\tau_2}\right)}$ satisfies $\frac{\mathcal L V(x_1,x_2)}{V(x_1,x_2)} \le \kappa - \gamma\left(\|x_1\|^2 + \|x_2\|^2\right)$.
• Hessian lower bound (smooth gradient condition): $\nabla^2 G \succeq -C I_{2d}$ for some constant $C > 0$.
• Poincaré inequality [Chen et al., 2019]: $\chi^2(\nu_t \,\|\, \pi) \le c_P\, \mathcal E\!\left(\frac{d\nu_t}{d\pi}\right)$.
(ii) Comparison method: acceleration with a larger Dirichlet form
$$\mathcal E_S(f) = \mathcal E(f) + \underbrace{\frac{1}{2}\int S(x_1,x_2)\,\big(f(x_2,x_1) - f(x_1,x_2)\big)^2\, d\pi(x_1,x_2)}_{\text{acceleration}}.$$
