
What does backpropagation compute? Edouard Pauwels (IRIT, Toulouse 3). PowerPoint presentation transcript.



  1. What does backpropagation compute? Edouard Pauwels (IRIT, Toulouse 3), joint work with Jérôme Bolte (TSE, Toulouse 1). Optimization for machine learning, CIRM, March 2020.

  2. Plan. Motivation: there is something that we do not understand in backpropagation for deep learning. Nonsmooth analysis is not really compatible with calculus. Contribution: conservative set valued fields; analytic, geometric and algorithmic properties.

  3. Backpropagation. Automatic differentiation (AD, 1970s): automatized numerical implementation of the chain rule. For $H : \mathbb{R}^p \to \mathbb{R}^p$, $G : \mathbb{R}^p \to \mathbb{R}^p$ and $f : \mathbb{R}^p \to \mathbb{R}$, all differentiable, $f \circ G \circ H : \mathbb{R}^p \to \mathbb{R}$ and
     $\nabla (f \circ G \circ H)^T = \nabla f^T \times J_G \times J_H.$
     Function = program: smooth elementary operations, combined smoothly, $x \mapsto (H(x), G(H(x)), f(G(H(x))))$.
     Forward mode of AD: $\nabla f^T \times (J_G \times J_H)$. Backward mode of AD: $(\nabla f^T \times J_G) \times J_H$.
     Backpropagation: backward mode of AD for neural network training. It computes the gradient (provided that everything is smooth).
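
     A minimal NumPy sketch (not from the talk) of the two accumulation orders above; the toy maps H, G, f and the evaluation point are made up for the illustration:

        import numpy as np

        # Toy instance of the slide's setting: H, G : R^p -> R^p smooth, f : R^p -> R smooth.
        p = 3
        rng = np.random.default_rng(0)
        A, B = rng.standard_normal((p, p)), rng.standard_normal((p, p))
        H = lambda x: np.tanh(A @ x)            # J_H(x) = diag(1 - tanh(Ax)^2) A
        G = lambda y: B @ y                     # J_G = B
        f = lambda z: 0.5 * np.sum(z ** 2)      # grad f(z) = z

        x = rng.standard_normal(p)
        J_H = np.diag(1.0 - np.tanh(A @ x) ** 2) @ A
        J_G = B
        grad_f = G(H(x))                        # gradient of f at the point G(H(x))

        forward = grad_f @ (J_G @ J_H)          # forward mode: accumulate Jacobians from the input side
        backward = (grad_f @ J_G) @ J_H         # backward mode (backpropagation): from the output side
        assert np.allclose(forward, backward)   # same gradient of f o G o H, different cost profiles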

  4. Neural network / compositional modeling. Input $x \in \mathbb{R}^p$; layers $z_0 \in \mathbb{R}^p$, $z_1 \in \mathbb{R}^{p_1}$, ..., $z_L \in \mathbb{R}^{p_L}$.
     For $i = 1, \dots, L$: $z_i \in \mathbb{R}^{p_i}$ ("layer") and $z_i = \phi_i(W_i z_{i-1} + b_i)$, where $\phi_i : \mathbb{R}^{p_i} \to \mathbb{R}^{p_i}$ are nonlinear "activation functions", $W_i \in \mathbb{R}^{p_i \times p_{i-1}}$, $b_i \in \mathbb{R}^{p_i}$, and $\theta = (W_1, b_1, \dots, W_L, b_L)$ are the model parameters. Hence
     $F_\theta(x) = z_L = \phi_L(W_L \phi_{L-1}(W_{L-1}(\dots \phi_1(W_1 x + b_1) \dots) + b_{L-1}) + b_L).$
     Training set $\{(x_i, y_i)\}_{i=1}^n$ in $\mathbb{R}^p \times \mathbb{R}^{p_L}$ and loss $\ell : \mathbb{R}^{p_L} \times \mathbb{R}^{p_L} \to \mathbb{R}_+$:
     $\min_\theta \; J(\theta) := \frac{1}{n} \sum_{i=1}^n \ell(F_\theta(x_i), y_i) = \frac{1}{n} \sum_{i=1}^n J_i(\theta).$
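
     As a sketch only (the sizes, activations and data below are placeholders, not the talk's setting), the compositional model and the empirical risk fit in a few lines of NumPy:

        import numpy as np

        def forward(theta, x, phis):
            # F_theta(x): z_i = phi_i(W_i z_{i-1} + b_i) for i = 1, ..., L
            z = x
            for (W, b), phi in zip(theta, phis):
                z = phi(W @ z + b)
            return z

        def J(theta, data, phis, loss):
            # J(theta) = (1/n) * sum_i loss(F_theta(x_i), y_i)
            return np.mean([loss(forward(theta, x, phis), y) for x, y in data])

        rng = np.random.default_rng(0)
        sizes = [4, 8, 2]                                     # p, p_1, p_L (made up)
        theta = [(rng.standard_normal((m, n)), np.zeros(m))   # (W_i, b_i)
                 for n, m in zip(sizes[:-1], sizes[1:])]
        phis = [lambda t: np.maximum(t, 0.0), lambda t: t]    # relu, then identity
        loss = lambda yhat, y: np.sum((yhat - y) ** 2)
        data = [(rng.standard_normal(4), rng.standard_normal(2)) for _ in range(10)]
        print(J(theta, data, phis, loss))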

  5. Backpropagation and learning. Stochastic (minibatch) gradient algorithm: given $(I_k)_{k \in \mathbb{N}}$ i.i.d. uniform on $\{1, \dots, n\}$ and $(\alpha_k)_{k \in \mathbb{N}}$ positive step sizes, iterate
     $\theta_{k+1} = \theta_k - \alpha_k \nabla J_{I_k}(\theta_k).$
     Backpropagation: the backward mode of automatic differentiation, used to compute $\nabla J_i$.
     Profusion of numerical tools (e.g. TensorFlow, PyTorch); they democratized the usage of these models and go beyond neural nets (differentiable programming).
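
     A minimal PyTorch sketch of this iteration (the architecture, data, constant step size and batch size of one are placeholders, not the talk's experiments):

        import torch

        torch.manual_seed(0)
        X, Y = torch.randn(100, 4), torch.randn(100, 2)              # made-up training set
        model = torch.nn.Sequential(torch.nn.Linear(4, 8),
                                    torch.nn.ReLU(),
                                    torch.nn.Linear(8, 2))
        opt = torch.optim.SGD(model.parameters(), lr=0.01)           # alpha_k constant here

        for k in range(1000):
            i = torch.randint(0, 100, (1,)).item()                   # I_k uniform on {1, ..., n}
            loss = ((model(X[i]) - Y[i]) ** 2).sum()                 # J_{I_k}(theta_k)
            opt.zero_grad()
            loss.backward()                                          # backpropagation = backward-mode AD
            opt.step()                                               # theta_{k+1} = theta_k - alpha_k * grad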

  6. Nonsmooth activations. Positive part: $\mathrm{relu}(t) = \max\{0, t\}$. Less straightforward examples: max pooling in convolutional networks; k-NN grouping layers and farthest point subsampling layers (Qi et al. 2017, PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space); sorting layers (Anil et al. 2019, Sorting Out Lipschitz Function Approximation, ICML).
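
     These building blocks are readily available in standard toolkits; a small PyTorch illustration (the tensors below are arbitrary):

        import torch

        t = torch.linspace(-2.0, 2.0, 5)
        print(torch.relu(t))                    # positive part max{0, t}: kink at 0
        print(torch.sort(t, descending=True))   # sorting layer: piecewise linear, nonsmooth at ties

        x = torch.randn(1, 1, 4, 4)
        pool = torch.nn.MaxPool2d(2)            # max pooling: nonsmooth where the argmax is not unique
        print(pool(x).shape)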

  7. Nonsmooth backpropagation. Set $\mathrm{relu}'(0) = 0$ and implement the chain rule of smooth calculus: $(f \circ g)' = g' \times f' \circ g$. TensorFlow examples: [plots of relu, abs, leaky_relu and relu6 with the autodiff derivatives relu', abs', leaky_relu' and relu6'].
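
     The same convention can be observed with PyTorch autograd (shown here instead of TensorFlow; the value returned at the kink is a library convention):

        import torch

        x = torch.tensor(0.0, requires_grad=True)
        torch.relu(x).backward()
        print(x.grad)   # typically tensor(0.): the library uses relu'(0) = 0 and the smooth chain rule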

  8. AD acts on programs, not on functions.
     $\mathrm{relu2}(t) = \mathrm{relu}(-t) + t = \mathrm{relu}(t), \qquad \mathrm{relu3}(t) = \tfrac{1}{2}(\mathrm{relu}(t) + \mathrm{relu2}(t)) = \mathrm{relu}(t).$
     [Plots of relu2 and relu3 with their autodiff derivatives relu2' and relu3'.]
     Known from the AD literature (e.g. Griewank 08, Kakade & Lee 2018).
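
     The slide's relu2 and relu3 programs are easy to reproduce; autodiff returns three different values at 0 for three programs computing the same function (a sketch, assuming the usual relu'(0) = 0 convention):

        import torch

        relu = torch.relu
        relu2 = lambda t: relu(-t) + t                  # equals relu as a function
        relu3 = lambda t: 0.5 * (relu(t) + relu2(t))    # still equals relu as a function

        for name, g in [("relu", relu), ("relu2", relu2), ("relu3", relu3)]:
            x = torch.tensor(0.0, requires_grad=True)
            g(x).backward()
            print(name, x.grad.item())   # typically 0.0, 1.0, 0.5: AD differentiates programs, not functions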

  9. Derivative of zero at 0. $\mathrm{zero}(t) = \mathrm{relu2}(t) - \mathrm{relu}(t) = 0$. [Plot of zero with its autodiff derivative zero'.]
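
     The same construction in PyTorch (again assuming the relu'(0) = 0 convention):

        import torch

        relu2 = lambda t: torch.relu(-t) + t
        zero = lambda t: relu2(t) - torch.relu(t)   # identically zero program

        x = torch.tensor(0.0, requires_grad=True)
        zero(x).backward()
        print(x.grad)   # typically tensor(1.), although the true derivative is 0 everywhere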

  10. AD acts on programs, not on functions. Derivative of sine at 0: $\sin' = \cos$. [Plots of sin, mysin, mysin2 and mysin3 with their autodiff derivatives sin', mysin', mysin2' and mysin3'.]
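
      The transcript does not preserve the definitions of mysin, mysin2 and mysin3, but the same phenomenon is easy to reproduce for a smooth function: the hypothetical variant below equals sin everywhere, yet adding the zero program from the previous slide shifts the autodiff output at 0 from cos(0) = 1 to 2 (under the usual relu'(0) = 0 convention):

        import torch

        zero = lambda t: (torch.relu(-t) + t) - torch.relu(t)   # identically zero, AD derivative 1 at 0
        mysin_like = lambda t: torch.sin(t) + zero(t)           # illustrative variant, equal to sin

        x = torch.tensor(0.0, requires_grad=True)
        mysin_like(x).backward()
        print(x.grad)   # typically tensor(2.), while sin'(0) = cos(0) = 1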

  11. Consequences for optimization and learning. No convexity, no calculus: $\partial(f + g) \subset \partial f + \partial g$.
      Minibatch + subgradient: with $J(\theta) = \frac{1}{n} \sum_{i=1}^n J_i(\theta)$, per-term vectors $v_i$, and $I$ uniform on $\{1, \dots, n\}$,
      locally Lipschitz, convex: $v_i \in \partial J_i(\theta)$ for $i = 1, \dots, n$, and $E_I[v_I] \in \partial J(\theta)$;
      locally Lipschitz, no sum rule: $v_i \in \partial J_i(\theta)$, but $E_I[v_I] \not\in \partial J(\theta)$;
      locally Lipschitz, no sum rule, automatic differentiation: $v_i \not\in \partial J_i(\theta)$ and $E_I[v_I] \not\in \partial J(\theta)$.
      Discrepancy. What we analyse: $\theta_{k+1} = \theta_k - \alpha_k (v_k + \epsilon_k)$ with $v_k \in \partial J(\theta_k)$ and $(\epsilon_k)_{k \in \mathbb{N}}$ zero mean (martingale increments); see Davis et al. 2018, Stochastic subgradient method converges on tame functions, FOCM. What we implement: $\theta_{k+1} = \theta_k - \alpha_k D_{I_k}(\theta_k)$.
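
      For instance, with $f = \mathrm{relu}$ and $g = -\mathrm{relu}$ one has $f + g \equiv 0$, so $\partial(f + g)(0) = \{0\}$, while $\partial f(0) + \partial g(0) = [0, 1] + [-1, 0] = [-1, 1]$: the inclusion above can be strict.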

  12. Question. [Diagram comparing the smooth and the nonsmooth case: a loss $J$ and its numerical program $P$; differentiation gives $\nabla J$ (resp. $\partial J$), while autodiff applied to $P$ gives a program $D$. In the smooth case $D$ numerically implements $\nabla J$; in the nonsmooth case the relation between $D$ and $\partial J$ is the question.] A mathematical model for "nonsmooth automatic differentiation"?

  13. Outline. 1. Conservative set valued fields. 2. Properties of conservative fields. 3. Consequences for deep learning.

  14. What is a derivative? Linear operator view:
      $\mathrm{derivative} : C^1(\mathbb{R}) \to C^0(\mathbb{R}), \quad f \mapsto f'.$
      Notions of subgradients inherited from the calculus of variations follow this "operator" view.
