
Global convergence of gradient descent for non-convex learning - PowerPoint PPT Presentation



  1. Global convergence of gradient descent for non-convex learning problems Francis Bach INRIA - École Normale Supérieure, Paris, France Joint work with Lénaïc Chizat Institut Henri Poincaré - April 5, 2019

  2. Machine learning Scientific context • Proliferation of digital data – Personal data – Industry – Scientific: from bioinformatics to humanities • Need for automated processing of massive data

  3. Machine learning Scientific context • Proliferation of digital data – Personal data – Industry – Scientific: from bioinformatics to humanities • Need for automated processing of massive data • Series of “hypes” Big data → Data science → Machine Learning → Deep Learning → Artificial Intelligence

  4. Machine learning Scientific context • Proliferation of digital data – Personal data – Industry – Scientific: from bioinformatics to humanities • Need for automated processing of massive data • Series of “hypes” Big data → Data science → Machine Learning → Deep Learning → Artificial Intelligence • Healthy interactions between theory, applications, and hype?

  5. Recent progress in perception (vision, audio, text) [images: “person ride dog” detection example; from translate.google.fr; from Peyré et al. (2017)]

  6. Recent progress in perception (vision, audio, text) [images: “person ride dog” detection example; from translate.google.fr; from Peyré et al. (2017)] (1) Massive data (2) Computing power (3) Methodological and scientific progress

  7. Recent progress in perception (vision, audio, text) [images: “person ride dog” detection example; from translate.google.fr; from Peyré et al. (2017)] (1) Massive data (2) Computing power (3) Methodological and scientific progress “Intelligence” = models + algorithms + data + computing power

  9. Parametric supervised machine learning • Data: n observations (x_i, y_i) ∈ X × Y, i = 1, ..., n • Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d

  10. Parametric supervised machine learning • Data: n observations (x_i, y_i) ∈ X × Y, i = 1, ..., n • Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d • Advertising: n > 10^9 – Φ(x) ∈ {0, 1}^d, d > 10^9 – Navigation history + ad – Linear predictions – h(x, θ) = θ^⊤ Φ(x)

  11. Parametric supervised machine learning • Data: n observations (x_i, y_i) ∈ X × Y, i = 1, ..., n • Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d • Advertising: n > 10^9 – Φ(x) ∈ {0, 1}^d, d > 10^9 – Navigation history + ad • Linear predictions – h(x, θ) = θ^⊤ Φ(x)
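For concreteness, here is a minimal Python sketch of the linear prediction h(x, θ) = θ^⊤ Φ(x); the feature map and dimensions are toy assumptions, nothing like the n, d > 10^9 of the advertising setting.

    import numpy as np

    def linear_prediction(theta, phi_x):
        """h(x, theta) = theta^T Phi(x): a dot product between parameters and features."""
        return theta @ phi_x

    # toy example with binary features Phi(x) in {0, 1}^d (dimensions are illustrative)
    rng = np.random.default_rng(0)
    d = 8
    theta = rng.normal(size=d)
    phi_x = rng.integers(0, 2, size=d).astype(float)
    print(linear_prediction(theta, phi_x))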

  12. Parametric supervised machine learning • Data: n observations (x_i, y_i) ∈ X × Y, i = 1, ..., n • Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d [figure: six labeled points x_1, ..., x_6 with labels y_1 = y_2 = y_3 = 1 and y_4 = y_5 = y_6 = −1]

  13. Parametric supervised machine learning • Data: n observations (x_i, y_i) ∈ X × Y, i = 1, ..., n • Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d [figure: six labeled points x_1, ..., x_6 with labels y_1 = y_2 = y_3 = 1 and y_4 = y_5 = y_6 = −1] – Neural networks (n, d > 10^6): h(x, θ) = θ_m^⊤ σ(θ_{m−1}^⊤ σ(··· θ_2^⊤ σ(θ_1^⊤ x))) [figure: network diagram with parameters θ_1, θ_2, θ_3 mapping x to y]
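A rough Python sketch of the layered prediction function above, h(x, θ) = θ_m^⊤ σ(θ_{m−1}^⊤ σ(··· σ(θ_1^⊤ x))); the layer sizes and the choice σ = tanh are illustrative assumptions, not taken from the slides.

    import numpy as np

    def neural_net_prediction(x, thetas, sigma=np.tanh):
        """h(x, theta) = theta_m^T sigma(theta_{m-1}^T sigma(... sigma(theta_1^T x))).

        thetas : list [theta_1, ..., theta_m]; theta_j has shape (d_{j-1}, d_j)
        sigma  : elementwise nonlinearity (tanh chosen here for illustration)
        """
        z = x
        for theta_j in thetas[:-1]:
            z = sigma(theta_j.T @ z)   # hidden layers
        return thetas[-1].T @ z        # last layer is linear

    # toy network with m = 3 parameter matrices (hypothetical sizes)
    rng = np.random.default_rng(0)
    thetas = [rng.normal(size=(4, 5)), rng.normal(size=(5, 3)), rng.normal(size=(3, 1))]
    x = rng.normal(size=4)
    print(neural_net_prediction(x, thetas))   # scalar prediction h(x, theta)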

  14. Parametric supervised machine learning • Data: n observations (x_i, y_i) ∈ X × Y, i = 1, ..., n • Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d • (regularized) empirical risk minimization: min_{θ ∈ R^d} (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ)) + λ Ω(θ) = (1/n) Σ_{i=1}^n f_i(θ) (data fitting term + regularizer)

  15. Parametric supervised machine learning • Data: n observations (x_i, y_i) ∈ X × Y, i = 1, ..., n • Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d • (regularized) empirical risk minimization: min_{θ ∈ R^d} (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ)) + λ Ω(θ) = (1/n) Σ_{i=1}^n f_i(θ) (data fitting term + regularizer) • Actual goal: minimize the test error E_{p(x,y)} ℓ(y, h(x, θ))
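As an illustration of the objective above, a small Python sketch of regularized empirical risk minimization by plain gradient descent, assuming squared loss, linear predictions h(x, θ) = θ^⊤ x, and Ω(θ) = ‖θ‖²/2; the data, step size, and dimensions are made up.

    import numpy as np

    def erm_objective(theta, X, y, lam):
        """(1/n) sum_i loss(y_i, h(x_i, theta)) + lam * Omega(theta),
        with squared loss and Omega(theta) = ||theta||^2 / 2 (one convenient choice)."""
        residuals = X @ theta - y
        return 0.5 * np.mean(residuals ** 2) + 0.5 * lam * (theta @ theta)

    def erm_gradient(theta, X, y, lam):
        n = X.shape[0]
        return X.T @ (X @ theta - y) / n + lam * theta

    # plain gradient descent on the regularized empirical risk (toy data)
    rng = np.random.default_rng(0)
    n, d = 200, 10
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
    theta, lam, gamma = np.zeros(d), 1e-2, 0.1
    for _ in range(500):
        theta -= gamma * erm_gradient(theta, X, y, lam)
    print(erm_objective(theta, X, y, lam))   # small value after convergence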

  16. Convex optimization problems • Convexity in machine learning – Convex loss and linear predictions h(x, θ) = θ^⊤ Φ(x)

  17. Convex optimization problems • Convexity in machine learning – Convex loss and linear predictions h(x, θ) = θ^⊤ Φ(x) • (approximately) Matching theory and practice – Fruitful discussions between theoreticians and practitioners – Quantitative theoretical analysis suggests practical improvements

  18. Convex optimization problems • Convexity in machine learning – Convex loss and linear predictions h(x, θ) = θ^⊤ Φ(x) • (approximately) Matching theory and practice – Fruitful discussions between theoreticians and practitioners – Quantitative theoretical analysis suggests practical improvements • Golden years of convexity in machine learning (1995 to 201*) – Support vector machines and kernel methods – Inference in graphical models – Sparsity / low-rank models with first-order methods – Convex relaxation of unsupervised learning problems – Optimal transport – Stochastic methods for large-scale learning and online learning

  20. Exponentially convergent SGD for smooth finite sums • Finite sums: min_{θ ∈ R^d} (1/n) Σ_{i=1}^n f_i(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ)) + λ Ω(θ)

  21. Exponentially convergent SGD for smooth finite sums • Finite sums: min_{θ ∈ R^d} (1/n) Σ_{i=1}^n f_i(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ)) + λ Ω(θ) • Non-accelerated algorithms (with similar properties) – SAG (Le Roux, Schmidt, and Bach, 2012) – SDCA (Shalev-Shwartz and Zhang, 2013) – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013) – MISO (Mairal, 2015), Finito (Defazio et al., 2014a) – SAGA (Defazio, Bach, and Lacoste-Julien, 2014b), etc. – Update: θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

  23. Exponentially convergent SGD for smooth finite sums • Finite sums: min_{θ ∈ R^d} (1/n) Σ_{i=1}^n f_i(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ)) + λ Ω(θ) • Non-accelerated algorithms (with similar properties) – SAG (Le Roux, Schmidt, and Bach, 2012) – SDCA (Shalev-Shwartz and Zhang, 2013) – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013) – MISO (Mairal, 2015), Finito (Defazio et al., 2014a) – SAGA (Defazio, Bach, and Lacoste-Julien, 2014b), etc. • Accelerated algorithms – Shalev-Shwartz and Zhang (2014); Nitanda (2014) – Lin et al. (2015b); Defazio (2016), etc. – Catalyst (Lin, Mairal, and Harchaoui, 2015a)
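A toy Python sketch of the variance-reduced update shown on slide 21 (a SAGA-style step with a memory of past gradients y_i), applied to a least-squares finite sum; the problem, step size, and helper names are illustrative assumptions.

    import numpy as np

    def saga(grad_fi, theta0, n, gamma, n_steps, rng):
        """theta_t = theta_{t-1} - gamma * ( grad f_{i(t)}(theta_{t-1})
                                             + (1/n) sum_i y_i^{t-1} - y_{i(t)}^{t-1} ),
        where grad_fi(theta, i) returns the gradient of the i-th component f_i."""
        theta = theta0.copy()
        memory = np.zeros((n, theta.size))   # stored gradients y_i
        mean_memory = memory.mean(axis=0)    # running average (1/n) sum_i y_i
        for _ in range(n_steps):
            i = rng.integers(n)
            g = grad_fi(theta, i)
            theta = theta - gamma * (g + mean_memory - memory[i])
            mean_memory = mean_memory + (g - memory[i]) / n   # keep the average in sync
            memory[i] = g
        return theta

    # toy least-squares finite sum: f_i(theta) = 0.5 * (x_i^T theta - y_i)^2
    rng = np.random.default_rng(0)
    n, d = 200, 10
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d)
    grad_fi = lambda theta, i: (X[i] @ theta - y[i]) * X[i]
    theta = saga(grad_fi, np.zeros(d), n, gamma=0.005, n_steps=50 * n, rng=rng)
    print(np.linalg.norm(X @ theta - y) / np.sqrt(n))   # residual, driven close to 0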

  24. Exponentially convergent SGD for finite sums • Running-time to reach precision ε (with κ = condition number)
      – Gradient descent: d × nκ × log(1/ε)
      – Accelerated gradient descent: d × n√κ × log(1/ε)

  25. Exponentially convergent SGD for finite sums • Running-time to reach precision ε (with κ = condition number)
      – Stochastic gradient descent: d × κ × (1/ε)
      – Gradient descent: d × nκ × log(1/ε)
      – Accelerated gradient descent: d × n√κ × log(1/ε)
      – SAG(A), SVRG, SDCA, MISO: d × (n + κ) × log(1/ε)

  26. Exponentially convergent SGD for finite sums • Running-time to reach precision ε (with κ = condition number)
      – Stochastic gradient descent: d × κ × (1/ε)
      – Gradient descent: d × nκ × log(1/ε)
      – Accelerated gradient descent: d × n√κ × log(1/ε)
      – SAG(A), SVRG, SDCA, MISO: d × (n + κ) × log(1/ε)
      – Accelerated versions: d × (n + √(nκ)) × log(1/ε)
      NB: slightly different (smaller) notion of condition number for batch methods
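To make the table concrete, a short Python snippet that plugs illustrative values of n, κ, and ε into the bounds above (ignoring constants and factoring out d); the chosen numbers are made up, not from the slides.

    import numpy as np

    # illustrative values (assumptions, not from the slides)
    n, kappa, eps = 10**6, 10**3, 1e-6
    log_term = np.log(1 / eps)

    costs = {
        "Stochastic gradient descent":  kappa / eps,
        "Gradient descent":             n * kappa * log_term,
        "Accelerated gradient descent": n * np.sqrt(kappa) * log_term,
        "SAG(A), SVRG, SDCA, MISO":     (n + kappa) * log_term,
        "Accelerated versions":         (n + np.sqrt(n * kappa)) * log_term,
    }
    for method, cost in costs.items():
        print(f"{method:30s} ~ d x {cost:.2e}")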
