SLIDE 1 Non-convex Learning via Replica Exchange Stochastic Gradient MCMC
A scalable parallel tempering algorithm for DNNs
Wei Deng 1 Qi Feng*2 Liyao Gao* 1 Faming Liang 1 Guang Lin 1 July 27, 2020
1Purdue University 2University of Southern California *Equal contribution
SLIDE 2
Intro
SLIDE 3 Markov chain Monte Carlo
The increasing concern about AI safety draws our attention to Markov chain Monte Carlo (MCMC), which is known for
- Multi-modal sampling [Teh et al., 2016]
- Non-convex optimization [Zhang et al., 2017]
SLIDE 4 Acceleration strategies for MCMC
Popular strategies to accelerate MCMC:
- Simulated annealing [Kirkpatrick et al., 1983]
- Simulated tempering [Marinari and Parisi, 1992]
- Replica exchange MCMC [Swendsen and Wang, 1986]
SLIDE 5
Replica exchange stochastic gradient MCMC
SLIDE 20 Replica exchange Langevin diffusion
Consider two Langevin diffusion processes with τ1 > τ2:

dβ_t^{(1)} = −∇U(β_t^{(1)}) dt + √(2τ1) dW_t^{(1)}
dβ_t^{(2)} = −∇U(β_t^{(2)}) dt + √(2τ2) dW_t^{(2)}.

Moreover, the positions of the two particles swap with a probability

S(β_t^{(1)}, β_t^{(2)}) := e^{(1/τ1 − 1/τ2)(U(β_t^{(1)}) − U(β_t^{(2)}))}.

- In other words, a jump process is included in the Markov process:

P(β_{t+dt} = (β_t^{(2)}, β_t^{(1)}) | β_t = (β_t^{(1)}, β_t^{(2)})) = r S(β_t^{(1)}, β_t^{(2)}) dt
P(β_{t+dt} = (β_t^{(1)}, β_t^{(2)}) | β_t = (β_t^{(1)}, β_t^{(2)})) = 1 − r S(β_t^{(1)}, β_t^{(2)}) dt
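The coupled dynamics can be simulated directly. Below is a minimal Euler-Maruyama sketch (not the paper's code; the function name, step sizes, and the double-well energy are illustrative) of two replicas with stochastic swaps:

```python
import numpy as np

def re_langevin(grad_u, u, tau1=10.0, tau2=1.0, r=1.0, dt=1e-3,
                n_steps=5000, seed=0):
    """Euler-Maruyama sketch of two Langevin diffusions with swaps.

    grad_u / u: gradient and value of the energy U; tau1 > tau2.
    """
    rng = np.random.default_rng(seed)
    b1, b2 = -2.0, 2.0                 # initial positions of the replicas
    path = []
    for _ in range(n_steps):
        b1 += -grad_u(b1) * dt + np.sqrt(2 * tau1 * dt) * rng.standard_normal()
        b2 += -grad_u(b2) * dt + np.sqrt(2 * tau2 * dt) * rng.standard_normal()
        # swap with probability r * S(b1, b2) * dt, with S capped at 1
        s = np.exp(min(0.0, (1 / tau1 - 1 / tau2) * (u(b1) - u(b2))))
        if rng.random() < r * s * dt:
            b1, b2 = b2, b1
        path.append((b1, b2))
    return np.array(path)

# double-well energy U(x) = (x^2 - 1)^2, a standard multi-modal toy example
path = re_langevin(grad_u=lambda x: 4 * x * (x**2 - 1),
                   u=lambda x: (x**2 - 1) ** 2)
```

The high-temperature replica explores both wells while the low-temperature replica exploits locally; swaps let the cold chain inherit the hot chain's discoveries.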
SLIDE 7 A demo
Figure 1: Trajectory plot for replica exchange Langevin diffusion.
SLIDE 8 Why the naïve numerical algorithm fails
Consider the scalable stochastic gradient Langevin dynamics algorithm [Welling and Teh, 2011]

β_{k+1}^{(1)} = β_k^{(1)} − η_k ∇L̃(β_k^{(1)}) + √(2η_k τ1) ξ_k^{(1)}
β_{k+1}^{(2)} = β_k^{(2)} − η_k ∇L̃(β_k^{(2)}) + √(2η_k τ2) ξ_k^{(2)}.

Swap the chains with a naïve swapping rate r S̃(β_{k+1}^{(1)}, β_{k+1}^{(2)}) η_k§:

S̃(β_{k+1}^{(1)}, β_{k+1}^{(2)}) = e^{(1/τ1 − 1/τ2)(L̃(β_{k+1}^{(1)}) − L̃(β_{k+1}^{(2)}))}   (1)

Exponentiating the unbiased estimators L̃(β_{k+1}^{(·)}) leads to a large bias.

§In the implementations, we fix r η_k = 1 by default.
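Exponentiating an unbiased estimator is Jensen's inequality at work: for Gaussian noise, E[exp(X)] = exp(μ + σ²/2) > exp(μ), so the bias grows with the estimator's variance. A quick Monte Carlo illustration (the numbers are illustrative, not from the paper):

```python
import numpy as np

# X is an unbiased estimator of mu, but exp(X) overestimates exp(mu):
# for X ~ N(mu, sigma^2), E[exp(X)] = exp(mu + sigma^2 / 2).
rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0
x = rng.normal(mu, sigma, size=1_000_000)

naive = np.exp(x).mean()     # Monte Carlo estimate of E[exp(X)]
target = np.exp(mu)          # exp(E[X]) = 1.0
# naive concentrates near exp(0.5) ≈ 1.65, well above the target 1.0
```

Since the temperature difference multiplies the noisy loss difference inside the exponent, even a moderate minibatch variance can inflate the swapping rate substantially.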
SLIDE 11 A corrected algorithm
Assume L̃(θ) ∼ N(L(θ), σ²) and consider the geometric Brownian motion {S̃_t}_{t∈[0,1]} in each swap as a martingale:

S̃_t = e^{(1/τ1 − 1/τ2)(L(β^{(1)}) − L(β^{(2)}) − (1/τ1 − 1/τ2)σ² t + √2 σ W_t)}.   (2)

Taking the derivative of S̃_t with respect to t and W_t, Itô's lemma gives

dS̃_t = (∂S̃_t/∂t) dt + (1/2)(∂²S̃_t/∂W_t²) dW_t² + (∂S̃_t/∂W_t) dW_t = √2 (1/τ1 − 1/τ2) σ S̃_t dW_t,

so the drift term vanishes and S̃_t is indeed a martingale. By fixing t = 1 in (2), we have the suggested unbiased swapping rate

S̃ = e^{(1/τ1 − 1/τ2)(L̃(β^{(1)}) − L̃(β^{(2)}) − (1/τ1 − 1/τ2)σ²)}.
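The correction can be sanity-checked by Monte Carlo: with Gaussian noise of known variance σ² on each energy estimate, subtracting (1/τ1 − 1/τ2)σ² inside the exponent removes the bias. A small sketch (the temperatures, energies, and σ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
tau1, tau2, sigma = 10.0, 1.0, 0.5
a = 1 / tau1 - 1 / tau2              # temperature factor, here -0.9
L1, L2 = 3.0, 2.0                    # "true" energies (illustrative)
n = 1_000_000
eps1 = rng.normal(0.0, sigma, n)     # noise on each energy estimate
eps2 = rng.normal(0.0, sigma, n)

true_rate = np.exp(a * (L1 - L2))                           # target swap rate
naive = np.exp(a * ((L1 + eps1) - (L2 + eps2))).mean()      # biased upward
corrected = np.exp(a * ((L1 + eps1) - (L2 + eps2) - a * sigma**2)).mean()
# corrected matches true_rate; naive overshoots by the factor exp(a^2 sigma^2)
```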
SLIDE 13 Unknown corrections in practice
Figure 2: Unknown corrections on CIFAR 10 and CIFAR 100 datasets.
SLIDE 14 An adaptive algorithm for unknown corrections
Sampling step

β_{k+1}^{(1)} = β_k^{(1)} − η_k^{(1)} ∇L̃(β_k^{(1)}) + √(2η_k^{(1)} τ1) ξ_k^{(1)}
β_{k+1}^{(2)} = β_k^{(2)} − η_k^{(2)} ∇L̃(β_k^{(2)}) + √(2η_k^{(2)} τ2) ξ_k^{(2)},

Stochastic approximation step
Obtain an unbiased estimate σ̃²_{m+1} for σ² and update the running estimate

σ̂²_{m+1} = (1 − γ_m) σ̂²_m + γ_m σ̃²_{m+1},

Swapping step
Generate a uniform random number u ∈ [0, 1] and compute

Ŝ = exp{(1/τ1 − 1/τ2)(L̃(β_{k+1}^{(1)}) − L̃(β_{k+1}^{(2)}) − (1/τ1 − 1/τ2) σ̂²_{m+1}/F)}.

If u < Ŝ: swap β_{k+1}^{(1)} and β_{k+1}^{(2)}.
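The three steps can be sketched in a few lines. This is a toy scalar version, not the paper's implementation: all names and step sizes are illustrative, and the fresh variance estimate var_tilde is assumed to be supplied by the caller (e.g. computed from minibatch losses).

```python
import numpy as np

def resgld_step(b1, b2, grad_tilde, loss_tilde, var_tilde, var_hat, gamma,
                eta=1e-3, tau1=10.0, tau2=1.0, F=1.0, rng=None):
    """One adaptive reSGLD iteration on scalar replicas b1 (hot), b2 (cold)."""
    rng = rng or np.random.default_rng()
    # sampling step: SGLD update for each replica at its own temperature
    b1 = b1 - eta * grad_tilde(b1) + np.sqrt(2 * eta * tau1) * rng.standard_normal()
    b2 = b2 - eta * grad_tilde(b2) + np.sqrt(2 * eta * tau2) * rng.standard_normal()
    # stochastic approximation step: smooth the fresh unbiased estimate
    # var_tilde of sigma^2 into the running estimate var_hat
    var_hat = (1 - gamma) * var_hat + gamma * var_tilde
    # swapping step: corrected rate, with the correction shrunk by factor F
    a = 1 / tau1 - 1 / tau2
    s_hat = np.exp(a * (loss_tilde(b1) - loss_tilde(b2) - a * var_hat / F))
    if rng.random() < s_hat:
        b1, b2 = b2, b1
    return b1, b2, var_hat

# toy usage on a noisy double-well loss
rng = np.random.default_rng(0)
b1, b2, var_hat = -2.0, 2.0, 1.0
for m in range(100):
    b1, b2, var_hat = resgld_step(
        b1, b2,
        grad_tilde=lambda b: 4 * b * (b**2 - 1) + 0.1 * rng.standard_normal(),
        loss_tilde=lambda b: (b**2 - 1) ** 2 + 0.1 * rng.standard_normal(),
        var_tilde=0.01, var_hat=var_hat, gamma=1 / (m + 1), rng=rng)
```

The decaying step size γ_m = 1/(m+1) makes the variance estimate a running average, which is the usual stochastic approximation choice.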
SLIDE 15
Convergence Analysis
SLIDE 16 Discretization Error
Replica exchange SGLD tracks the replica exchange Langevin diffusion in some sense.
Lemma (Discretization Error)
Given the smoothness and dissipativity assumptions in the appendix, and a small (fixed) learning rate η, we have

E[sup_{0≤t≤T} ∥β_t − β̃_t^η∥²] ≤ Õ(η + max_i E[∥φ_i∥²] + max_i √(E[|ψ_i|²])),

where β̃_t^η is the continuous-time interpolation for reSGLD, φ := ∇Ũ − ∇U is the noise in the stochastic gradient, and ψ := S̃ − S is the noise in the stochastic swapping rate.
SLIDE 17 Accelerated exponential decay of W2
(i) Log-Sobolev inequality for the Langevin diffusion [Cattiaux et al., 2010]:
- Hessian lower bound: the smooth gradient condition → ∇²G ≽ −C I_{2d} for some constant C > 0.
- Poincaré inequality [Chen et al., 2019] → χ²(ν∥π) ≤ c_P E(√(dν/dπ))
- Lyapunov condition: V(x1, x2) := e^{a/4 · (∥x1∥²/τ1 + ∥x2∥²/τ2)} with LV/V(x1, x2) ≤ κ − γ(∥x1∥² + ∥x2∥²)

(ii) Comparison method: acceleration with a larger Dirichlet form

E_S(f) = E(f) + (1/2) ∫ S(x1, x2) · (f(x2, x1) − f(x1, x2))² dπ(x1, x2),   (3)

where the second term is the acceleration.
SLIDE 20 Convergence of reSGLD
Theorem (Convergence of reSGLD)
Let the smoothness and dissipativity assumptions hold. For the distributions {μ_k}_{k≥0} associated with the discrete dynamics {β̃_k}_{k≥1}, we have the following estimates, for k ∈ N⁺,

W2(μ_k, π) ≤ D₀ e^{−kη(1+δ_S)/c_LS} + Õ(η^{1/2} + max_i (E[∥φ_i∥²])^{1/2} + max_i (E[|ψ_i|²])^{1/4}),

where δ_S = min_i E_S(√(dμ_i/dπ)) / E(√(dμ_i/dπ)) − 1 is the acceleration effect depending on the swapping rate S, and D₀ is a constant depending on the initial distribution μ₀.
SLIDE 21 Acceleration-accuracy trade-off
Larger correction factor F^a → larger acceleration, lower accuracy.
Larger batch size n → larger acceleration, slower evaluation.

^a Where it is defined: Ŝ = exp{(1/τ1 − 1/τ2)(L̃(β_{k+1}^{(1)}) − L̃(β_{k+1}^{(2)}) − (1/τ1 − 1/τ2) σ̂²_{m+1}/F)}.
SLIDE 24
Experiments
SLIDE 25 Sampling from Gaussian mixture distributions
Figure 3: Evaluation of reSGLD on Gaussian mixture distributions, where reSGLD adaptively estimates the unknown corrections while the naïve reSGLD makes no correction to the swapping rates.
SLIDE 26 Supervised Learning (I): Correction factor matters
Figure 4: More swaps don’t necessarily lead to better performance.
SLIDE 27 Supervised Learning (II): Batch size matters
Table 1: Prediction accuracies (%) with different batch sizes on CIFAR10 & CIFAR100 using ResNet-20.

Dataset    Batch   M-SGD        SGHMC        reSGHMC
CIFAR10    256     94.21±0.16   94.22±0.12   94.62±0.18
CIFAR10    1024    94.49±0.12   94.57±0.14   95.01±0.16
CIFAR100   256     72.45±0.20   72.49±0.18   74.14±0.22
CIFAR100   1024    73.31±0.18   73.23±0.20   75.11±0.26
SLIDE 28 Bayesian GAN for Semi-supervised Learning
Table 2: Semi-supervised learning on CIFAR100 and SVHN with different numbers of labels.

           CIFAR100                   SVHN
Ns         SGHMC        reSGHMC      SGHMC        reSGHMC
2000       50.76±0.71   55.53±0.64   88.75±0.44   91.59±0.38
3000       53.07±0.71   57.09±0.77   91.32±0.41   94.03±0.36
4000       57.05±0.59   62.23±0.69   91.92±0.41   94.25±0.31
5000       59.34±0.64   64.83±0.72   92.63±0.46   94.33±0.34
SLIDE 29
Conclusion
SLIDE 30 Summary
Achieved:
- Algorithm: scalable and adaptive.
- Theory: accelerated exponential convergence in W2.
- Experiments: extensive experiments with significant improvements.

Future works:
- Generalization: relax the normality assumption to the heavy-tailed Lévy-stable generalization [Şimşekli et al., 2019].
- Variance reduction: apply variance reduction [Xu et al., 2018] to obtain a larger acceleration effect.
SLIDE 31 References i
Cattiaux, P., Guillin, A., and Wu, L.-M. (2010). A Note on Talagrand's Transportation Inequality and Logarithmic Sobolev Inequality. Prob. Theory and Rel. Fields, 148:285–334.
Chen, Y., Chen, J., Dong, J., Peng, J., and Wang, Z. (2019). Accelerating Nonconvex Learning via Replica Exchange Langevin Diffusion. In Proc. of the International Conference on Learning Representation (ICLR).
SLIDE 32 References ii
Şimşekli, U., Sagun, L., and Gürbüzbalaban, M. (2019). A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks. In Proc. of the International Conference on Machine Learning (ICML).
Kirkpatrick, S., Gelatt Jr., C. D., and Vecchi, M. P. (1983). Optimization by Simulated Annealing. Science, 220(4598):671–680.
Marinari, E. and Parisi, G. (1992). Simulated Tempering: A New Monte Carlo Scheme. Europhysics Letters (EPL), 19(6):451–458.
Swendsen, R. H. and Wang, J.-S. (1986). Replica Monte Carlo Simulation of Spin-Glasses. Phys. Rev. Lett., 57:2607–2609.
SLIDE 33
References iii
Teh, Y. W., Thiéry, A., and Vollmer, S. (2016). Consistency and Fluctuations for Stochastic Gradient Langevin Dynamics. Journal of Machine Learning Research, 17:1–33.
Welling, M. and Teh, Y. W. (2011). Bayesian Learning via Stochastic Gradient Langevin Dynamics. In Proc. of the International Conference on Machine Learning (ICML), pages 681–688.
Xu, P., Chen, J., Zou, D., and Gu, Q. (2018). Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization. In Proc. of the Conference on Advances in Neural Information Processing Systems (NeurIPS).
SLIDE 34
References iv
Zhang, Y., Liang, P., and Charikar, M. (2017). A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics. In Proc. of Conference on Learning Theory (COLT), pages 1980–2022.