slide-1
SLIDE 1

Global convergence of gradient descent for non-convex learning problems

Francis Bach, INRIA - École Normale Supérieure, Paris, France

Joint work with Lénaïc Chizat
Institut Henri Poincaré - April 5, 2019

slide-4
SLIDE 4

Machine learning: scientific context

  • Proliferation of digital data

– Personal data
– Industry
– Scientific: from bioinformatics to humanities

  • Need for automated processing of massive data
  • Series of “hypes”

Big data → Data science → Machine Learning → Deep Learning → Artificial Intelligence

  • Healthy interactions between theory, applications, and hype?
slide-8
SLIDE 8

Recent progress in perception (vision, audio, text)

[Figures: “person ride dog” image-captioning example; from translate.google.fr; from Peyré et al. (2017)]

(1) Massive data (2) Computing power (3) Methodological and scientific progress

“Intelligence” = models + algorithms + data + computing power

slide-11
SLIDE 11

Parametric supervised machine learning

  • Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n
  • Prediction function h(x, θ) ∈ R parameterized by θ ∈ Rd
  • Advertising: n > 10^9

– Φ(x) ∈ {0, 1}^d, d > 10^9
– Navigation history + ad

  • Linear predictions

– h(x, θ) = θ⊤Φ(x)

slide-13
SLIDE 13

Parametric supervised machine learning

  • Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n
  • Prediction function h(x, θ) ∈ R parameterized by θ ∈ Rd

[Figure: example images x1, …, x6 with labels y1 = y2 = y3 = 1 and y4 = y5 = y6 = −1]

– Neural networks (n, d > 10^6): h(x, θ) = θm⊤ σ(θm−1⊤ σ(· · · θ2⊤ σ(θ1⊤ x)))

[Figure: network diagram mapping x to y through weights θ1, θ2, θ3]

slide-15
SLIDE 15

Parametric supervised machine learning

  • Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n
  • Prediction function h(x, θ) ∈ R parameterized by θ ∈ Rd
  • (regularized) empirical risk minimization:

min_{θ∈Rd}  (1/n) Σi=1..n ℓ(yi, h(xi, θ)) + λΩ(θ)  =  (1/n) Σi=1..n fi(θ)

(data fitting term + regularizer)

  • Actual goal: minimize test error Ep(x,y)ℓ(y, h(x, θ))
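
As a concrete illustration (not from the slides), here is a minimal sketch of this regularized empirical risk minimization for a linear predictor h(x, θ) = θ⊤Φ(x) with the logistic loss, minimized by plain gradient descent; the synthetic data, step size, and regularization level are made-up placeholders.

```python
import numpy as np

def predict(theta, Phi):
    # Linear prediction h(x, theta) = theta^T Phi(x), applied row-wise.
    return Phi @ theta

def objective(theta, Phi, y, lam):
    # (1/n) sum_i log(1 + exp(-y_i h(x_i, theta))) + (lam/2) ||theta||^2
    margins = y * predict(theta, Phi)
    return np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * theta @ theta

def gradient(theta, Phi, y, lam):
    margins = y * predict(theta, Phi)
    coef = -y / (1.0 + np.exp(margins))            # derivative of the logistic loss
    return Phi.T @ coef / len(y) + lam * theta

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
n, d = 200, 10
Phi = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
y = np.sign(Phi @ theta_star + 0.1 * rng.standard_normal(n))

theta, lam, step = np.zeros(d), 1e-2, 1.0
for t in range(500):                               # full-batch gradient descent
    theta -= step * gradient(theta, Phi, y, lam)
print(objective(theta, Phi, y, lam))
```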
slide-18
SLIDE 18

Convex optimization problems

  • Convexity in machine learning

– Convex loss and linear predictions h(x, θ) = θ⊤Φ(x)

  • (approximately) Matching theory and practice

– Fruitful discussions between theoreticians and practitioners
– Quantitative theoretical analysis suggests practical improvements

  • Golden years of convexity in machine learning (1995 to 201*)

– Support vector machines and kernel methods
– Inference in graphical models
– Sparsity / low-rank models with first-order methods
– Convex relaxation of unsupervised learning problems
– Optimal transport
– Stochastic methods for large-scale learning and online learning

slide-21
SLIDE 21

Exponentially convergent SGD for smooth finite sums

  • Finite sums: min_{θ∈Rd}  (1/n) Σi=1..n fi(θ)  =  (1/n) Σi=1..n ℓ(yi, h(xi, θ)) + λΩ(θ)

  • Non-accelerated algorithms (with similar properties)

– SAG (Le Roux, Schmidt, and Bach, 2012)
– SDCA (Shalev-Shwartz and Zhang, 2013)
– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
– MISO (Mairal, 2015), Finito (Defazio et al., 2014a)
– SAGA (Defazio, Bach, and Lacoste-Julien, 2014b), etc.

θt = θt−1 − γ [ ∇f_{i(t)}(θt−1) + (1/n) Σi=1..n y_i^{t−1} − y_{i(t)}^{t−1} ]
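
As an illustration (not from the slides), here is a minimal sketch of a SAGA-style method implementing the update displayed above, on a simple least-squares finite sum; the data, step size γ, and number of passes are arbitrary.

```python
import numpy as np

# f_i(theta) = 0.5 * (x_i^T theta - y_i)^2 + 0.5 * lam * ||theta||^2  (illustrative finite sum)
rng = np.random.default_rng(0)
n, d, lam, gamma = 500, 20, 1e-2, 1e-2
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)

def grad_i(theta, i):
    return (X[i] @ theta - y[i]) * X[i] + lam * theta

theta = np.zeros(d)
memory = np.array([grad_i(theta, i) for i in range(n)])   # stored gradients y_i^{t-1}
avg = memory.mean(axis=0)                                 # running average (1/n) sum_i y_i^{t-1}

for t in range(20 * n):                                   # roughly 20 passes over the data
    i = rng.integers(n)
    g = grad_i(theta, i)
    theta -= gamma * (g - memory[i] + avg)                # the update displayed above
    avg += (g - memory[i]) / n                            # keep the running average in sync
    memory[i] = g
```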

slide-23
SLIDE 23

Exponentially convergent SGD for smooth finite sums

  • Finite sums: min_{θ∈Rd}  (1/n) Σi=1..n fi(θ)  =  (1/n) Σi=1..n ℓ(yi, h(xi, θ)) + λΩ(θ)

  • Non-accelerated algorithms (with similar properties)

– SAG (Le Roux, Schmidt, and Bach, 2012)
– SDCA (Shalev-Shwartz and Zhang, 2013)
– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
– MISO (Mairal, 2015), Finito (Defazio et al., 2014a)
– SAGA (Defazio, Bach, and Lacoste-Julien, 2014b), etc.

  • Accelerated algorithms

– Shalev-Shwartz and Zhang (2014); Nitanda (2014)
– Lin et al. (2015b); Defazio (2016), etc.
– Catalyst (Lin, Mairal, and Harchaoui, 2015a)

slide-26
SLIDE 26

Exponentially convergent SGD for finite sums

  • Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent      d × κ × (1/ε)
Gradient descent                 d × nκ × log(1/ε)
Accelerated gradient descent     d × n√κ × log(1/ε)
SAG(A), SVRG, SDCA, MISO         d × (n + κ) × log(1/ε)
Accelerated versions             d × (n + √(nκ)) × log(1/ε)

NB: slightly different (smaller) notion of condition number for batch methods

slide-27
SLIDE 27

Exponentially convergent SGD for finite sums

  • Running-time to reach precision ε (with κ = condition number)

Stochastic gradient descent      d × κ × (1/ε)
Accelerated gradient descent     d × n√κ × log(1/ε)
SAG(A), SVRG, SDCA, MISO         d × (n + κ) × log(1/ε)
Accelerated versions             d × (n + √(nκ)) × log(1/ε)

  • Beating two lower bounds (Nemirovski and Yudin, 1983; Nesterov, 2004), with additional assumptions:
    (1) stochastic gradient: exponential rate for finite sums
    (2) full gradient: better exponential rate using the sum structure

  • Matching lower bounds (Woodworth and Srebro, 2016; Lan, 2015)
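
To make these rates concrete, here is a tiny illustrative computation (not from the slides) of their orders of magnitude for one made-up problem size; the constants are dropped, so only relative magnitudes are meaningful.

```python
import math

# Illustrative orders of magnitude for the running times above (constants dropped).
n, d, kappa, eps = 10**6, 10**3, 10**4, 1e-6
log_term = math.log(1 / eps)

costs = {
    "SGD":                    d * kappa / eps,
    "Gradient descent":       d * n * kappa * log_term,
    "Accelerated GD":         d * n * math.sqrt(kappa) * log_term,
    "SAG(A)/SVRG/SDCA/MISO":  d * (n + kappa) * log_term,
    "Accelerated versions":   d * (n + math.sqrt(n * kappa)) * log_term,
}
for name, cost in costs.items():
    print(f"{name:25s} ~ {cost:.2e}")
```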
slide-28
SLIDE 28

Exponentially convergent SGD for finite sums: from theory to practice and vice-versa

[Figure: log(excess cost) versus time, comparing deterministic, stochastic, and new methods (AFG, L-BFGS, SG, ASG, IAG, SAG-LS)]

  • Empirical performance “matches” theoretical guarantees
  • Theoretical analysis suggests practical improvements

– Non-uniform sampling, acceleration
– Matching upper and lower bounds

slide-31
SLIDE 31

Convex optimization for machine learning: from theory to practice and vice-versa

  • Empirical performance “matches” theoretical guarantees
  • Theoretical analysis suggests practical improvements
  • Many other well-understood areas

– Single-pass SGD and generalization errors
– From least-squares to convex losses
– Non-parametric and high-dimensional regression
– Randomized linear algebra
– Bandit problems
– etc.

  • What about deep learning?
slide-32
SLIDE 32

Theoretical analysis of deep learning

  • Multi-layer neural network h(x, θ) = θm⊤ σ(θm−1⊤ σ(· · · θ2⊤ σ(θ1⊤ x)))

[Figure: network diagram mapping x to y through weights θ1, θ2, θ3]

– NB: already a simplification
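
As an illustration (not from the slides), here is a minimal sketch of this forward pass for a fully connected network; the layer sizes and the choice σ = tanh are arbitrary placeholders.

```python
import numpy as np

def forward(x, thetas, sigma=np.tanh):
    # h(x, theta) = theta_m^T sigma(theta_{m-1}^T sigma(... sigma(theta_1^T x)))
    z = x
    for theta in thetas[:-1]:
        z = sigma(theta.T @ z)        # hidden layers apply the nonlinearity
    return thetas[-1].T @ z           # last layer is linear

rng = np.random.default_rng(0)
sizes = [5, 8, 8, 1]                  # input dim 5, two hidden layers, scalar output
thetas = [rng.standard_normal((sizes[i], sizes[i + 1])) for i in range(len(sizes) - 1)]
print(forward(rng.standard_normal(5), thetas))
```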

slide-34
SLIDE 34

Theoretical analysis of deep learning

  • Multi-layer neural network h(x, θ) = θm⊤ σ(θm−1⊤ σ(· · · θ2⊤ σ(θ1⊤ x)))

[Figure: network diagram mapping x to y through weights θ1, θ2, θ3]

  • Generalization guarantees

– See “MythBusters: A Deep Learning Edition” by Sasha Rakhlin
– Bartlett et al. (2017); Golowich et al. (2018)

  • Optimization

– Non-convex optimization problems

slide-36
SLIDE 36

Optimization for multi-layer neural networks

  • What can go wrong with non-convex optimization problems?

– Local minima
– Stationary points
– Plateaux
– Bad initialization
– etc.

  • Generic local theoretical guarantees

– Convergence to stationary points or local minima
– See, e.g., Lee et al. (2016); Jin et al. (2017)

slide-38
SLIDE 38

Optimization for multi-layer neural networks

  • What can go wrong with non-convex optimization problems?

– Local minima
– Stationary points
– Plateaux
– Bad initialization
– etc.

  • General global performance guarantees impossible to obtain
  • Special case of (deep) neural networks

– Most local minima are equivalent (Choromanska et al., 2015)
– No spurious local minima (Soltanolkotabi et al., 2018)
– NB: see Jain and Kar (2017) for guarantees in other contexts

slide-41
SLIDE 41

Gradient descent for a single hidden layer

  • Predictor: h(x) = θ2⊤ σ(θ1⊤ x) = Σi=1..m θ2(i) · σ(θ1(·, i)⊤ x)

– Family: h = (1/m) Σi=1..m Ψ(wi) with Ψ(wi)(x) = m θ2(i) · σ(θ1(·, i)⊤ x)  (sketched in code below)

  • Goal: minimize R(h) = E_{p(x,y)} ℓ(y, h(x)), with R convex
  • Main insight

– h = (1/m) Σi=1..m Ψ(wi) = ∫_W Ψ(w) dµ(w) with dµ(w) = (1/m) Σi=1..m δ_{wi}
– Overparameterized models with m large ≈ measures µ with densities
– Barron (1993); Kurkova and Sanguineti (2001); Bengio et al. (2006); Rosset et al. (2007); Bach (2014)

slide-43
SLIDE 43

Optimization on measures

  • Minimize with respect to the measure µ:  R(∫_W Ψ(w) dµ(w))

– Convex optimization problem on measures
– Frank-Wolfe techniques for incremental learning
– Non-tractable (Bach, 2014), not what is used in practice

  • Represent µ by a finite set of “particles”: µ = (1/m) Σi=1..m δ_{wi}

– Backpropagation = gradient descent on (w1, . . . , wm)  (see the sketch below)

  • Two questions:

– Algorithm limit when the number of particles m gets large: Wasserstein gradient flow (Nitanda and Suzuki, 2017)
– Global convergence to the optimal measure µ (Chizat and Bach, 2018a)
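
A minimal sketch (not from the paper) of “backpropagation = gradient descent on the particles (w1, . . . , wm)” for a one-hidden-layer ReLU network trained with a least-squares loss; the data, width, step size, and number of iterations are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m, step = 200, 2, 50, 1e-2
X = rng.standard_normal((N, d))
y = np.abs(X[:, 0]) - 0.5 * X[:, 1]                        # arbitrary target function

A = rng.standard_normal((m, d))                            # particle w_i = (A[i], b[i])
b = rng.standard_normal(m) / m

for t in range(5000):
    pre = X @ A.T                                          # pre-activations a_i^T x_n
    act = np.maximum(pre, 0.0)                             # ReLU
    resid = act @ b - y                                    # h(x_n) - y_n
    grad_b = act.T @ resid / N                             # dR/db_i
    grad_A = ((resid[:, None] * (pre > 0)) * b).T @ X / N  # dR/da_i
    A -= step * grad_A
    b -= step * grad_b

print(np.mean((np.maximum(X @ A.T, 0.0) @ b - y) ** 2))    # final training error
```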

slide-47
SLIDE 47

Many particle limit and global convergence (Chizat and Bach, 2018a)

  • General framework: minimize F(µ) = R(∫_W Ψ(w) dµ(w))

– Minimizing Fm(w1, . . . , wm) = R((1/m) Σi=1..m Ψ(wi))
– Gradient flow Ẇ = −m∇Fm(W), with W = (w1, . . . , wm)
– Idealization of (stochastic) gradient descent

  • Limit when m tends to infinity

– Wasserstein gradient flow (Nitanda and Suzuki, 2017; Chizat and Bach, 2018a; Mei, Montanari, and Nguyen, 2018; Sirignano and Spiliopoulos, 2018; Rotskoff and Vanden-Eijnden, 2018)

  • NB: for more details on gradient flows, see Ambrosio et al. (2008)
slide-49
SLIDE 49

(intuitive) link with Wasserstein gradient flows

  • Gradient flow on Euclidean spaces, for a smooth function f : A → R

– Given a = a(t), a(t + dt) is the minimizer of f(b) + (1/(2dt)) ‖b − a‖²
– Optimality conditions: ∇f(b) + (1/dt)(b − a) = 0
– For smooth f, ∇f(b) − ∇f(a) = O(dt)
– Thus a(t + dt) = b = a − (dt)∇f(a) = a(t) − (dt)∇f(a(t))
– Equivalent to the regular ODE: ȧ = −∇f(a)
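
A tiny numeric check (not from the slides) of this reasoning on a quadratic f: the minimizer of f(b) + ‖b − a‖²/(2dt) has a closed form and differs from the explicit Euler step a − dt∇f(a) only at order dt²; the matrix and the step are arbitrary.

```python
import numpy as np

# f(b) = 0.5 * b^T H b, so grad f(b) = H b and the proximal step has a closed form:
# argmin_b f(b) + ||b - a||^2 / (2 dt)  =  (I + dt H)^{-1} a
H = np.array([[2.0, 0.5], [0.5, 1.0]])
a = np.array([1.0, -1.0])
dt = 1e-3

implicit = np.linalg.solve(np.eye(2) + dt * H, a)   # minimizer of the regularized problem
explicit = a - dt * (H @ a)                          # explicit Euler step a - dt * grad f(a)
print(np.linalg.norm(implicit - explicit))           # O(dt^2), i.e. tiny
```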

slide-55
SLIDE 55

(intuitive) link with Wasserstein gradient flows

  • Given the measure µ = µ(t), ν = µ(t + dt) is defined as the minimizer of

F(ν) + W2²(µ, ν)/(2dt) = R(∫ Ψ(v) dν(v)) + (1/(2dt)) inf_{γ∈Π(µ,ν)} ∫ ‖v − w‖² dγ(w, v)

≈ ⟨∇R(∫ Ψdµ), ∫ Ψ(v) dν(v)⟩ + inf_{γ∈Π(µ,ν)} ∫ ‖v − w‖²/(2dt) dγ(w, v) + cst

≈ inf_{γ∈Π(µ,ν)} ∫ [ ⟨∇R(∫ Ψdµ), Ψ(v)⟩ + ‖v − w‖²/(2dt) ] dγ(w, v) + cst

– Π(µ, ν): set of joint distributions with marginals µ and ν
– Given w ∼ µ, ν(·|w) is a Dirac at the minimizer of ⟨∇R(∫ Ψdµ), Ψ(v)⟩ + ‖v − w‖²/(2dt), that is, at v = w − dt ⟨∇R(∫ Ψdµ), ∂Ψ/∂w(w)⟩
– If µ ≈ (1/m) Σi=1..m δ_{wi}, then µ(t + dt) ≈ (1/m) Σi=1..m δ_{vi} with vi = wi − dt ⟨∇R(∫ Ψdµ), ∂Ψ/∂w(wi)⟩
– Evolution of the particles: wi(t + dt) = wi − dt ⟨∇R(∫ Ψdµ), ∂Ψ/∂w(wi)⟩

slide-58
SLIDE 58

(intuitive) link with gradient flows

  • Evolution of the particles: wi(t + dt) = wi − dt ⟨∇R(∫ Ψdµ), ∂Ψ/∂w(wi)⟩

  • Equivalence with the gradient flow Ẇ = −m∇Fm(W)

– with W = (w1, . . . , wm), for Fm(w1, . . . , wm) = R((1/m) Σi=1..m Ψ(wi))
– since ∂Fm/∂wi = ⟨∇R((1/m) Σi=1..m Ψ(wi)), (1/m) ∂Ψ/∂wi⟩  (see the numeric check below)

  • Global convergence?

– Difficulty 1: potentially many local minima and stationary points (even if R is convex)
– Difficulty 2: the globally optimal measure is often singular
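
An illustrative numeric check (not from the paper) of the identity above, for Ψ(w)(x) = m θ2 σ(θ1⊤x) with σ = ReLU and R a least-squares risk over N fixed sample points: the analytic quantity ⟨∇R(∫Ψdµ), ∂Ψ/∂w(wi)⟩ should match m ∂Fm/∂wi computed by finite differences (here checked only for the θ2 component); all sizes and data are made up.

```python
import numpy as np

# Check that m * dF_m/dw_i equals <grad R(h), dPsi/dw(w_i)>, with functions
# represented by their values on N fixed sample points and R(h) = (1/2N) sum_j (h_j - y_j)^2.
rng = np.random.default_rng(0)
N, d, m = 50, 3, 10
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)
A = rng.standard_normal((m, d))
b = rng.standard_normal(m)

def h_values(A, b):
    # h = (1/m) sum_i Psi(w_i), with Psi(w_i)(x) = m * b_i * relu(a_i^T x)
    return np.maximum(X @ A.T, 0.0) @ b

def F(A, b):
    return 0.5 * np.mean((h_values(A, b) - y) ** 2)

i = 3
grad_R = (h_values(A, b) - y) / N                     # gradient of R at h, as a vector
dPsi_db = m * np.maximum(X @ A[i], 0.0)               # dPsi(w_i)(x_j) / db_i
analytic = grad_R @ dPsi_db                           # <grad R, dPsi/db_i>

eps = 1e-6
b_pert = b.copy(); b_pert[i] += eps
numeric = m * (F(A, b_pert) - F(A, b)) / eps          # m * dF_m/db_i (finite difference)
print(analytic, numeric)                              # should agree up to O(eps)
```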

slide-61
SLIDE 61

Many particle limit and global convergence (Chizat and Bach, 2018a)

  • Two ingredients: homogeneity and initialization
  • Homogeneity (see, e.g., Haeffele and Vidal, 2017; Bach et al., 2008)

– Full or partial, e.g., Ψ(wi)(x) = m θ2(i) · σ(θ1(·, i)⊤ x)
– Applies to rectified linear units (but also to sigmoid activations)

  • Sufficiently spread initial measure

– Needs to cover the entire sphere of directions

  • NB 1 : see precise definitions and statement in paper
  • NB 2 : also applies to spike deconvolution
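
A quick numeric illustration (not from the paper) of this (partial) homogeneity for a single ReLU unit: scaling both layers by λ > 0 multiplies Ψ(w)(x) by λ²; the numbers are arbitrary.

```python
import numpy as np

def psi(theta1, theta2, x):
    # Psi(w)(x) = m * theta2 * relu(theta1^T x) for one unit w = (theta1, theta2);
    # the constant factor m plays no role in the homogeneity, so it is set to 1 here.
    return theta2 * np.maximum(theta1 @ x, 0.0)

rng = np.random.default_rng(0)
x, theta1, theta2, lam = rng.standard_normal(4), rng.standard_normal(4), 0.7, 3.0

print(psi(lam * theta1, lam * theta2, x))   # scaling w by lam > 0 ...
print(lam ** 2 * psi(theta1, theta2, x))    # ... multiplies Psi(w)(x) by lam^2
```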
slide-62
SLIDE 62

Simple simulations with neural networks

  • ReLU units with d = 2 (optimal predictor has 5 neurons)

[Figures: learned predictors with 5, 10, and 100 neurons (video)]

slide-63
SLIDE 63

Simple simulations with neural networks

  • ReLU units with d = 100 (optimal predictor has m0 neurons)

– Comparing gradient descent on particles with fixed particles obtained by sampling (and reweighting by convex optimization)
– No quantitative analysis (yet)

slide-65
SLIDE 65

From qualitative to quantitative results ?

  • Adding noise (Mei, Montanari, and Nguyen, 2018)

– On top of SGD, “à la Langevin” ⇒ convergence to a diffusion
– Quantitative analysis of the needed number of neurons
– Recent improvement (Mei, Misiakiewicz, and Montanari, 2019)

  • Recent strong activity on ArXiv

– https://arxiv.org/abs/1810.02054
– https://arxiv.org/abs/1811.03804
– https://arxiv.org/abs/1811.03962
– https://arxiv.org/abs/1811.04918
– See also Jacot et al. (2018)

slide-66
SLIDE 66

From qualitative to quantitative results ?

  • Adding noise (Mei, Montanari, and Nguyen, 2018)

– On top of SGD, “à la Langevin” ⇒ convergence to a diffusion
– Quantitative analysis of the needed number of neurons
– Recent improvement (Mei, Misiakiewicz, and Montanari, 2019)

  • Recent strong activity on ArXiv

– Global quantitative linear convergence of gradient descent
– Zero training loss
– Extends to deep architectures and skip connections

slide-69
SLIDE 69

From qualitative to quantitative results ?

  • Mean-field limit: h(x) = (1/m) Σi=1..m Ψ(wi)(x)

– With wi initialized randomly (with variance independent of m)
– Dynamics equivalent to the Wasserstein gradient flow
– Convergence to the global minimum of R(∫ Ψdµ)

  • Recent strong activity on ArXiv

– Corresponds to initializing with weights that are √m times larger
– Where does it converge to?

  • Equivalence to lazy training (Chizat and Bach, 2018b)

– Convergence to a positive-definite kernel method
– Neurons move infinitesimally

slide-71
SLIDE 71

Lazy training (Chizat and Bach, 2018b)

  • Generic criterion G(W) = R(h(W)) to minimize w.r.t. W

– Example: R the loss, h = (1/m) Σi=1..m Ψ(wi) the prediction function
– Introduce a (large) scale factor α > 0 and Gα(W) = G(αh(W))/α²
– Initialize W(0) such that αh(W(0)) is bounded (using, e.g., EΨ(wi) = 0)

  • Proposition (informal)

– Assume the differential of h at W(0) is surjective
– The gradient flow Ẇ = −∇Gα(W) is such that W(t) − W(0) = O(1/α) and αh(W(t)) → arg min_h R(h) “linearly”

⇒ Equivalent to a linear model h(W) ≈ h(W(0)) + (W − W(0))⊤∇h(W(0))
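
An illustrative sketch (not from the paper) of this lazy-training scaling: running a fixed number of gradient steps on Gα(W) = G(αh(W))/α² for a two-layer ReLU model with a symmetric initialization (so that h(W(0)) = 0), the distance ‖W(T) − W(0)‖ shrinks roughly like 1/α, so for large α the trained model stays close to its linearization around W(0). All sizes, data, and step sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m, step, n_steps = 100, 5, 200, 1.0, 200
X = rng.standard_normal((N, d))
y = np.sin(X[:, 0])

A0 = rng.standard_normal((m, d)); b0 = rng.standard_normal(m)
A0 = np.concatenate([A0, A0]); b0 = np.concatenate([b0, -b0])   # symmetric: h(W(0)) = 0
M = len(b0)

def movement(alpha):
    # Gradient descent on G_alpha(W) = G(alpha * h(W)) / alpha^2,
    # with h(W)(x) = (1/M) sum_i b_i relu(a_i^T x) and G the least-squares risk.
    A, b = A0.copy(), b0.copy()
    for _ in range(n_steps):
        pre = X @ A.T
        act = np.maximum(pre, 0.0)
        r = alpha * (act @ b) / M - y                                     # scaled-model residual
        grad_b = act.T @ r / (N * M * alpha)                              # dG_alpha/db
        grad_A = ((r[:, None] * (pre > 0)) * b).T @ X / (N * M * alpha)   # dG_alpha/dA
        b -= step * grad_b
        A -= step * grad_A
    return np.sqrt(np.sum((A - A0) ** 2) + np.sum((b - b0) ** 2))

for alpha in [1.0, 10.0, 100.0]:
    print(alpha, movement(alpha))    # parameter movement roughly proportional to 1/alpha
```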


Lazy training (Chizat and Bach, 2018b)

  • Equivalence to kernel methods

– Still non-parametric estimation – See details and additional experiments in preprint

  • Does this really “demistify” generalization in deep networks?
slide-76
SLIDE 76

Lazy training (Chizat and Bach, 2018b)

  • Equivalence to kernel methods

– Still non-parametric estimation – See details and additional experiments in preprint

  • Does this really “demistify” generalization in deep networks?

– (first!) Guarantees for deep networks – Deep neural networks = efficient kernel methods? – Neurons don’t move?

slide-77
SLIDE 77

Lazy training (Chizat and Bach, 2018b)

  • Equivalence to kernel methods

– Still non-parametric estimation
– See details and additional experiments in the preprint

  • Does this really “demystify” generalization in deep networks?

– (first!) Guarantees for deep networks
– Deep neural networks = efficient kernel methods?
– Neurons don’t move?

  • What is actually happening in practice? (ongoing work)

– Between the mean-field regime and the lazy regime?
– Empirical comparison for state-of-the-art networks

slide-80
SLIDE 80

Healthy interactions between theory, applications, and hype?

  • Empirical successes of deep learning cannot be ignored
  • Scientific standards should not be lowered

– Criticism and limits of theoretical and empirical results
– Rigor beyond mathematical guarantees

slide-81
SLIDE 81

References

Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
Francis Bach. Breaking the curse of dimensionality with convex neural networks. Technical Report 1412.8690, arXiv, 2014.
Francis Bach, Julien Mairal, and Jean Ponce. Convex sparse matrix factorizations. Technical Report 0812.1869, arXiv, 2008.
A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
Peter L. Bartlett, Dylan J. Foster, and Matus J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.
Y. Bengio, N. Le Roux, P. Vincent, O. Delalleau, and P. Marcotte. Convex neural networks. In Advances in Neural Information Processing Systems (NIPS), 2006.
Lénaïc Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. Technical Report 1805.09545, arXiv, 2018a.
Lénaïc Chizat and Francis Bach. A note on lazy training in supervised differentiable programming. Technical Report, to appear, arXiv, 2018b.
Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015.
A. Defazio, J. Domke, and T. S. Caetano. Finito: A faster, permutable incremental gradient method for big data problems. In Proc. ICML, 2014a.
Aaron Defazio. A simple practical accelerated method for finite sums. In Advances in Neural Information Processing Systems, pages 676–684, 2016.
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014b.
Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Conference On Learning Theory, pages 297–299, 2018.
Benjamin D. Haeffele and René Vidal. Global optimality in neural network training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7331–7339, 2017.
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8580–8589, 2018.
Prateek Jain and Purushottam Kar. Non-convex optimization for machine learning. Foundations and Trends in Machine Learning, 10(3-4):142–336, 2017.
Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887, 2017.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
V. Kurkova and M. Sanguineti. Bounds on rates of variable-basis and neural-network approximation. IEEE Transactions on Information Theory, 47(6):2659–2665, September 2001.
G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.
Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257, 2016.
H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems (NIPS), 2015a.
Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated randomized proximal coordinate gradient method and its application to regularized empirical risk minimization. SIAM Journal on Optimization, 25(4):2244–2273, 2015b.
J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.
Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layers neural networks. Technical Report 1804.06561, arXiv, 2018.
Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. arXiv preprint arXiv:1902.06015, 2019.
A. S. Nemirovski and D. B. Yudin. Problem complexity and method efficiency in optimization. John Wiley, 1983.
Y. Nesterov. Introductory lectures on convex optimization: a basic course. Kluwer, 2004.
A. Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Advances in Neural Information Processing Systems (NIPS), 2014.
Atsushi Nitanda and Taiji Suzuki. Stochastic particle gradient descent for infinite ensembles. arXiv preprint arXiv:1712.05438, 2017.
S. Rosset, G. Swirszcz, N. Srebro, and J. Zhu. ℓ1-regularization in infinite dimensional feature spaces. In Proceedings of the Conference on Learning Theory (COLT), 2007.
Grant M. Rotskoff and Eric Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv preprint arXiv:1805.00915, 2018.
S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013.
S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In Proc. ICML, 2014.
Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks. arXiv preprint arXiv:1805.01053, 2018.
Mahdi Soltanolkotabi, Adel Javanmard, and Jason D. Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 2018.
Blake E. Woodworth and Nati Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems, pages 3639–3647, 2016.
L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.