

What does backpropagation compute?

Edouard Pauwels (IRIT, Toulouse 3), joint work with Jérôme Bolte (TSE, Toulouse 1)

Optimization for machine learning, CIRM, March 2020

Plan

Motivation: there is something that we do not understand in backpropagation for deep learning. Nonsmooth analysis is not really compatible with calculus.

Contribution: conservative set valued fields; their analytic, geometric and algorithmic properties.

Backpropagation

Automatic differentiation (AD, 1970s): automated numerical implementation of the chain rule. For H : R^p → R^p, G : R^p → R^p and f : R^p → R, all differentiable, the composition f ∘ G ∘ H : R^p → R satisfies

∇(f ∘ G ∘ H)^T = ∇f^T × J_G × J_H.

Function = program: smooth elementary operations, combined smoothly, x ↦ (H(x), G(H(x)), f(G(H(x)))).

Forward mode of AD: ∇f^T × (J_G × J_H). Backward mode of AD: (∇f^T × J_G) × J_H.

Backpropagation: the backward mode of AD applied to neural network training. It computes the gradient (provided that everything is smooth).
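The forward/backward distinction is only an order of association in the product ∇f^T × J_G × J_H; since matrix multiplication is associative, both orders give the same vector, but backward mode keeps every intermediate a row vector when the output is scalar. A minimal pure-Python check, with made-up random Jacobians not tied to any real program:

```python
# Sketch: forward vs backward accumulation of the chain rule.
# grad(f∘G∘H)^T = grad_f^T @ J_G @ J_H; both association orders agree.
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

p = 4
random.seed(0)
JH = [[random.random() for _ in range(p)] for _ in range(p)]  # Jacobian of H
JG = [[random.random() for _ in range(p)] for _ in range(p)]  # Jacobian of G
grad_f = [[random.random() for _ in range(p)]]                # row vector grad f^T

forward  = matmul(grad_f, matmul(JG, JH))   # forward mode:  grad_f^T × (J_G × J_H)
backward = matmul(matmul(grad_f, JG), JH)   # backward mode: (grad_f^T × J_G) × J_H

assert all(abs(a - b) < 1e-12 for a, b in zip(forward[0], backward[0]))
```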

Neural network / compositional modeling

Input x ∈ R^p; layers z_0 = x ∈ R^p, z_1 ∈ R^{p_1}, ..., z_L ∈ R^{p_L}. For i = 1, ..., L, the "layer" z_i ∈ R^{p_i} is given by

z_i = φ_i(W_i z_{i−1} + b_i),

where φ_i : R^{p_i} → R^{p_i} are nonlinear "activation functions", W_i ∈ R^{p_i × p_{i−1}}, b_i ∈ R^{p_i}, and θ = (W_1, b_1, ..., W_L, b_L) collects the model parameters. Hence

F_θ(x) = z_L = φ_L(W_L φ_{L−1}(W_{L−1}(... φ_1(W_1 x + b_1) ...) + b_{L−1}) + b_L).

Training set {(x_i, y_i)}_{i=1}^n in R^p × R^{p_L}, loss ℓ : R^{p_L} × R^{p_L} → R_+:

min_θ J(θ) := (1/n) ∑_{i=1}^n ℓ(F_θ(x_i), y_i) = (1/n) ∑_{i=1}^n J_i(θ).
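The recursion z_i = φ_i(W_i z_{i−1} + b_i) fits in a few lines of plain Python. The ReLU activation and squared loss below are illustrative choices made for this sketch, not fixed by the slide:

```python
# Minimal forward pass F_theta(x) = z_L plus a loss, with phi_i = relu for
# all i (an illustrative choice) and theta = [(W_1, b_1), ..., (W_L, b_L)].

def relu(v):
    return [max(0.0, t) for t in v]

def affine(W, b, z):
    # W z + b, with W a list of rows
    return [sum(w * zj for w, zj in zip(row, z)) + bi for row, bi in zip(W, b)]

def forward(theta, x):
    z = x
    for W, b in theta:           # z_i = phi_i(W_i z_{i-1} + b_i)
        z = relu(affine(W, b, z))
    return z

def sq_loss(z, y):
    # an illustrative loss ell(z, y)
    return sum((zi - yi) ** 2 for zi, yi in zip(z, y))

# Two identity layers map x = [1, -2] to relu(relu(x)) = [1, 0]
theta = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])] * 2
```

For example, `forward(theta, [1.0, -2.0])` returns `[1.0, 0.0]`, and `sq_loss` of that output against `[0.0, 0.0]` is `1.0`.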

Backpropagation and learning

Stochastic (minibatch) gradient algorithm: given (I_k)_{k∈N} i.i.d. uniform on {1, ..., n} and positive step sizes (α_k)_{k∈N}, iterate

θ_{k+1} = θ_k − α_k ∇J_{I_k}(θ_k).

Backpropagation: the backward mode of automatic differentiation is used to compute ∇J_i.

Profusion of numerical tools (e.g. TensorFlow, PyTorch): they democratized the usage of these models and go beyond neural nets (differentiable programming).
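A toy instance of the iteration θ_{k+1} = θ_k − α_k ∇J_{I_k}(θ_k), on the illustrative smooth one-dimensional problem J_i(θ) = (θ − y_i)², with made-up data:

```python
import random

# Toy SGD: J_i(theta) = (theta - y_i)^2, J = average of the J_i.
# theta_{k+1} = theta_k - alpha_k * grad J_{I_k}(theta_k),
# with I_k uniform on {1, ..., n} and alpha_k = 0.5/k positive, sum divergent.
random.seed(1)
ys = [1.0, 2.0, 3.0]      # made-up targets; the minimizer of J is their mean, 2.0
theta = 0.0
for k in range(1, 5001):
    i = random.randrange(len(ys))          # draw I_k
    grad = 2.0 * (theta - ys[i])           # grad J_{I_k}(theta_k)
    theta -= (0.5 / k) * grad
# theta is now close to 2.0, the minimizer of J
```

With α_k = 0.5/k this update is exactly a running average of the sampled targets, so the iterates settle near the mean of `ys`.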

Nonsmooth activations

Positive part: relu(t) = max{0, t}. Less straightforward examples:

  • Max pooling in convolutional networks.
  • k-NN grouping layers and farthest point subsampling layers (Qi et al. 2017. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space).
  • Sorting layers (Anil et al. 2019. Sorting Out Lipschitz Function Approximation. ICML).

Nonsmooth backpropagation

Set relu′(0) = 0 and implement the chain rule of smooth calculus: (f ∘ g)′ = (f′ ∘ g) × g′. TensorFlow examples:

[Plots: relu, abs, leaky_relu and relu6, each with the derivative relu′, abs′, leaky_relu′, relu6′ returned by AD.]

AD acts on programs, not on functions

relu2(t) = relu(−t) + t = relu(t);  relu3(t) = (1/2)(relu(t) + relu2(t)) = relu(t).

[Plots: relu2 and relu3 with the derivatives relu2′ and relu3′ returned by AD.]

Known from the AD literature (e.g. Griewank 08, Kakade & Lee 2018).
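The phenomenon is easy to reproduce with a toy forward-mode AD built on dual numbers, a pure-Python sketch independent of TensorFlow: the three programs implement the same function, yet AD returns three different "derivatives" at 0.

```python
# Forward-mode AD with dual numbers, showing that AD differentiates
# programs, not functions: relu, relu2 and relu3 implement the same
# function but get different derivatives at 0.

class Dual:
    """Number a + b·eps carrying a value and a forward derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __neg__(self):
        return Dual(-self.val, -self.dot)
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.val * o.dot + self.dot * o.val)
    __rmul__ = __mul__

def relu(x):
    # the usual convention: relu'(t) = 1 if t > 0 else 0, so relu'(0) = 0
    return Dual(x.val, x.dot) if x.val > 0 else Dual(0.0, 0.0)

def relu2(x):
    return relu(-x) + x                   # same function as relu

def relu3(x):
    return 0.5 * (relu(x) + relu2(x))     # still the same function

def deriv(f, t):
    return f(Dual(t, 1.0)).dot

# At t = 0 the three programs disagree, although relu = relu2 = relu3:
# deriv(relu, 0.0) == 0.0, deriv(relu2, 0.0) == 1.0, deriv(relu3, 0.0) == 0.5
```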

Derivative of zero at 0

zero(t) = relu2(t) − relu(t) = 0.

[Plot: the zero function and its AD derivative zero′.]
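The same effect in a self-contained sketch of forward differentiation, under the relu′(0) = 0 convention: the program zero(t) = relu(−t) + t − relu(t) is identically 0, yet AD assigns it derivative 1 at 0.

```python
def d_relu(t, dt):
    """Value and forward derivative of relu, with relu'(0) = 0."""
    return (t, dt) if t > 0 else (0.0, 0.0)

def d_zero(t, dt):
    v1, d1 = d_relu(-t, -dt)   # relu(-t)
    v2, d2 = d_relu(t, dt)     # relu(t)
    return v1 + t - v2, d1 + dt - d2

# d_zero(0.0, 1.0) == (0.0, 1.0): the zero function gets "derivative" 1 at 0,
# while d_zero(0.5, 1.0) == (0.0, 0.0) as expected away from 0.
```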

AD acts on programs, not on functions

Derivative of sine at 0: sin′ = cos.

[Plots: sin, mysin, mysin2, mysin3 with the derivatives sin′, mysin′, mysin2′, mysin3′ returned by AD.]

Consequences for optimization and learning

No convexity, no calculus: only the inclusion ∂(f + g) ⊂ ∂f + ∂g.

Minibatch + subgradient: J is locally Lipschitz, there is no exact sum rule, only automatic differentiation. With

J(θ) = (1/n) ∑_{i=1}^n J_i(θ),

take v_i ∈ ∂J_i(θ), i = 1, ..., n; does E_I[v_I] ∈ ∂J(θ) hold, for I uniform on {1, ..., n}?

Discrepancy between analysis and implementation:

Analysed: θ_{k+1} = θ_k − α_k(v_k + ε_k), v_k ∈ ∂J(θ_k), (ε_k)_{k∈N} zero mean (martingale increments). (Davis et al. 2018. Stochastic subgradient method converges on tame functions. FOCM.)

Implemented: θ_{k+1} = θ_k − α_k D_{I_k}(θ_k).

Question

[Diagram. Smooth case: J, P, ∇J, D connected by differentiation, numerics and autodiff. Nonsmooth case: J, P, ∂J, D.]

A mathematical model for "nonsmooth automatic differentiation"?

Outline

  • 1. Conservative set valued fields
  • 2. Properties of conservative fields
  • 3. Consequences for deep learning

What is a derivative?

Linear operator view: derivative : C^1(R) → C^0(R), f ↦ f′. The notions of subgradients inherited from the calculus of variations follow this "operator" view.

Lebesgue differentiation theorem: if f : R → R is integrable, then F : x ↦ ∫_{−∞}^{x} f(t) dt is differentiable for almost all x, with F′(x) = f(x) (F is absolutely continuous).

Linear map versus relation / equivalence class in L^1.

Technical reminder

Absolutely continuous (AC) path: γ : [0, 1] → R^p is called absolutely continuous if γ is differentiable almost everywhere with integrable derivative γ′ : [0, 1] → R^p and

γ(t) − γ(0) = ∫_0^t γ′(s) ds for all t ∈ [0, 1].

Set valued field: D : R^p ⇒ R^q is a function from R^p to the set of subsets of R^q. Examples: ∂f, the subgradient of a convex function f; ∂_c f, the Clarke subgradient of a locally Lipschitz function f,

∂_c f(x) = conv { v ∈ R^p : ∃ y_k → x with y_k ∈ R, v_k = ∇f(y_k) → v as k → ∞ },

where R is the (full measure) set where f is differentiable.

Closed graph: a notion of continuity for D,

graph D = {(x, z) : x ∈ R^p, z ∈ D(x)} ⊂ R^{p+q}.

If v_k ∈ D(x_k) for all k ∈ N, then lim_{k→∞} v_k ∈ D(lim_{k→∞} x_k), provided the limits exist.
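A tiny numeric illustration of the Clarke construction for the standard example f(x) = |x| at 0 (the sample points near 0 are ad hoc): gradients of f on either side of 0 are −1 and +1, and their convex hull is the interval [−1, 1].

```python
# Clarke subgradient of f(x) = |x| at 0, by sampling nearby gradients:
# f is differentiable on R \ {0} with f'(y) = sign(y); the limit points of
# gradients along sequences y_k -> 0 are {-1, +1}, so conv = [-1, 1].
def grad_abs(y):
    return 1.0 if y > 0 else -1.0     # defined wherever y != 0

limit_points = sorted({grad_abs(y) for y in (-1e-8, -1e-12, 1e-12, 1e-8)})
# limit_points == [-1.0, 1.0]; conv(limit_points) = [-1, 1]
```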

Conservative set valued fields

D : R^p ⇒ R^p, set valued, with closed graph and nonempty compact values.

Conservative field: for any AC loop γ : [0, 1] → R^p with γ(0) = γ(1),

∫_0^1 max_{v ∈ D(γ(t))} ⟨γ̇(t), v⟩ dt = 0 (Lebesgue integral).

Equivalent forms: with min, or with the set valued (Aumann) integral.

Links with physics: a conservative force field produces zero work along any closed loop.

Locally Lipschitz potentials

Potential: let D : R^p ⇒ R^p be a conservative field. Define f : R^p → R by

f(x) = f(0) + ∫_0^1 max_{v ∈ D(γ(t))} ⟨γ̇(t), v⟩ dt,

where γ : [0, 1] → R^p is any AC path with γ(0) = 0, γ(1) = x.

f is well defined and unique up to a constant. f is a potential for D; D is a conservative field for f. Equivalent forms: with min, or with the set valued (Aumann) integral.

D is locally bounded (by assumption) and f is locally Lipschitz.

Examples: if f is C^1, then {∇f} is conservative for f (not the unique conservative field). If f is convex and locally Lipschitz, then ∂f is conservative for f. Not all locally Lipschitz f admit a conservative field.
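A numeric sketch of this path-integral definition for the simplest example, D = ∂_c|·| on R, along the straight path γ(t) = t·x (the path choice and step count are mine, for illustration only):

```python
# Recover the potential f(x) = |x| from its Clarke subgradient
# D(t) = {sign(t)} for t != 0, D(0) = [-1, 1], by discretizing
# f(x) = f(0) + int_0^1 max_{v in D(gamma(t))} <gamma'(t), v> dt
# with gamma(t) = t * x, so gamma'(t) = x.

def D_max(t, direction):
    """max over v in the Clarke subgradient of |.| at t of v * direction."""
    if t > 0:
        return direction
    if t < 0:
        return -direction
    return abs(direction)          # max over v in [-1, 1]

def potential(x, steps=10000):
    # midpoint rule on [0, 1]; f(0) = 0 for f = |.|
    h = 1.0 / steps
    return sum(D_max((i + 0.5) * h * x, x) * h for i in range(steps))

# potential(2.0) recovers |2| = 2 and potential(-3.0) recovers |-3| = 3
```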

An operational chain rule

Lemma: the following are equivalent.

  • D : R^p ⇒ R^p is conservative for f : R^p → R.
  • For any AC curve γ : [0, 1] → R^p, (d/dt) f(γ(t)) = ⟨v, γ̇(t)⟩ for all v ∈ D(γ(t)), for almost every t ∈ [0, 1].

The affine span of D(γ(t)) is "orthogonal" to γ̇(t) for almost all t and any γ.

Theorem: if f is locally Lipschitz and tame then ∂_c f is conservative for f. (Davis et al. 2018. Stochastic subgradient method converges on tame functions. FOCM.)

The chain rule is central for the Lyapunov analysis of minibatch strategies.

Illustration

[Figure omitted.]


Relation to gradients

Let D : R^p ⇒ R^p be a conservative field for f : R^p → R.

Gradient almost everywhere: D = {∇f} Lebesgue almost everywhere.

Consequence: ∂_c f is conservative for f, and for all x ∈ R^p, ∂_c f(x) ⊂ conv(D(x)). Fermat rule: 0 ∈ conv(D) at local minima.

Remark: conservativity is much stronger than "gradient almost everywhere". Take f = ‖·‖² and set D = {∇f} everywhere except on a segment [x, y], where D = {∇f, 0}. Then D is compact valued with closed graph and equals the gradient almost everywhere, but it is not conservative.

Conservative fields and calculus

Informal: conservative set valued fields are compatible with the compositional rules of differential calculus.

Sum rule: let f_1, ..., f_n be locally Lipschitz continuous functions and D_1, ..., D_n respective conservative fields. Then D = ∑_{i=1}^n D_i is conservative for f = ∑_{i=1}^n f_i.

Proof idea: chain rule along AC curves + sum rule for derivatives + a union of zero measure sets has zero measure:

(d/dt)(f_1(γ(t)) + f_2(γ(t))) = ⟨v_1, γ̇(t)⟩ + ⟨v_2, γ̇(t)⟩ = ⟨v_1 + v_2, γ̇(t)⟩ for all v_1 ∈ D_1(γ(t)), v_2 ∈ D_2(γ(t)).

Consequence for AD (informal): a program combines locally Lipschitz elementary functions in a locally Lipschitz way. AD with conservative fields in place of gradients outputs a conservative field for the implemented function.


Deep networks and tameness

Training: given {(x_i, y_i)}_{i=1}^n in R^p × R^{p_L} and a loss ℓ : R^{p_L} × R^{p_L} → R_+,

min_θ J(θ) := (1/n) ∑_{i=1}^n ℓ(F_θ(x_i), y_i) = (1/n) ∑_{i=1}^n J_i(θ).

Assumption: ℓ and the activation functions defining F_θ are univariate (applied coordinatewise), locally Lipschitz, defined piecewise (finitely many pieces), and expressed with polynomials, quotients, exponentials and logarithms.

Tameness: then J is locally Lipschitz and "tame", i.e. definable in an o-minimal structure (one containing all semialgebraic sets and the graph of the exponential function [Wilkie]).

Nonsmooth automatic differentiation for deep networks

Nonsmooth backpropagation: consider the empirical loss J : R^p → R. Set D_i : R^p ⇒ R^p, the output of AD on J_i using the Clarke subgradient in place of derivatives (relu′(0) = 0). Set D = (1/n) ∑_{i=1}^n D_i and crit J = {θ ∈ R^p : 0 ∈ conv(D(θ))}.

Then:

  • Conservativity: D is conservative for J, {J(θ_2) − J(θ_1)} = ∫_0^1 ⟨D((1 − t)θ_1 + tθ_2), θ_2 − θ_1⟩ dt.
  • Gradient: D = {∇J} except on a finite union of smooth manifolds of dimension < p.
  • Morse-Sard: the set of critical values J(crit J) = {J(θ) : 0 ∈ conv(D(θ))} is finite.
  • KL inequality: there is a Kurdyka-Łojasiewicz inequality for D and J.

Tame characterization: stratification, variational projection

Example: projection formula for f(x_1, x_2) = |x_1| + |x_2|.

[Figure omitted.]

Minibatch strategies

Minibatch stochastic approximation: given (I_k)_{k∈N} i.i.d. uniform on {1, ..., n} and positive step sizes (α_k)_{k∈N}, iterate

θ_{k+1} ∈ θ_k − α_k D_{I_k}(θ_k).

Convergence: assume ∑_k α_k = +∞ and α_k = o(1/log(k)). Fix any M > 0 and condition on the event sup_{k∈N} ‖θ_k‖ ≤ M. Let Θ ⊂ R^p be the set of accumulation points of (θ_k)_{k∈N}. Then, almost surely, ∅ ≠ Θ ⊂ crit J and J is constant on Θ.

Ingredients: the differential inclusion approach [Benaïm-Hofbauer-Sorin (2005)]; conservativity (chain rule along AC curves); tameness (Morse-Sard theorem).

Summary and conclusion: functions, programs and numerics

[Diagram. Smooth case: J, P, ∇J, D connected by differentiation, numerics and autodiff. Nonsmooth case: J, P, ∂J, D, with D conservative and ∂J numerically "⊂" the autodiff output.]

A mathematical model for nonsmooth automatic differentiation. Algorithms: nonsmooth AD + minibatching on deep nets behave as in the smooth case.

References

Abadi M., Barham P., Chen J., Chen Z., Davis A., Dean J., Devin M., Ghemawat S., Irving G., Isard M., Kudlur M., Levenberg J., Monga R., Moore S., Murray D., Steiner B., Tucker P., Vasudevan V., Warden P., Wicke M., Yu Y. and Zheng X. (2016). TensorFlow: A system for large-scale machine learning. In Symposium on Operating Systems Design and Implementation.
Aliprantis C.D. and Border K.C. (2005). Infinite Dimensional Analysis (3rd edition). Springer.
Attouch H., Goudou X. and Redont P. (2000). The heavy ball with friction method, I. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system. Communications in Contemporary Mathematics, 2(01), 1-34.
Aubin J.-P. and Cellina A. (1984). Differential Inclusions: Set-Valued Maps and Viability Theory (Vol. 264). Springer.
Aubin J.-P. and Frankowska H. (2009). Set-Valued Analysis. Springer Science & Business Media.
Baydin A., Pearlmutter B., Radul A. and Siskind J. (2018). Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153).
Benaïm M. (1999). Dynamics of stochastic approximation algorithms. In Séminaire de probabilités XXXIII (pp. 1-68). Springer, Berlin, Heidelberg.
Benaïm M., Hofbauer J. and Sorin S. (2005). Stochastic approximations and differential inclusions. SIAM Journal on Control and Optimization, 44(1), 328-348.
Bolte J., Daniilidis A., Lewis A. and Shiota M. (2007). Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2), 556-572.
Bolte J., Sabach S. and Teboulle M. (2014). Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1-2), 459-494.
Borkar V. (2009). Stochastic Approximation: A Dynamical Systems Viewpoint (Vol. 48). Springer.
Borwein J. and Lewis A.S. (2010). Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer Science & Business Media.
Borwein J.M. and Moors W.B. (1997). Essentially smooth Lipschitz functions. Journal of Functional Analysis, 149(2), 305-351.
Borwein J.M. and Moors W.B. (1998). A chain rule for essentially smooth Lipschitz functions. SIAM Journal on Optimization, 8(2), 300-308.
Borwein J., Moors W. and Wang X. (2001). Generalized subdifferentials: a Baire categorical approach. Transactions of the American Mathematical Society, 353(10), 3875-3893.
Bottou L. and Bousquet O. (2008). The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (pp. 161-168).
Bottou L., Curtis F.E. and Nocedal J. (2018). Optimization methods for large-scale machine learning. SIAM Review, 60(2), 223-311.
Castera C., Bolte J., Févotte C. and Pauwels E. (2019). An Inertial Newton Algorithm for Deep Learning. arXiv preprint arXiv:1905.12278.
Clarke F.H. (1983). Optimization and Nonsmooth Analysis. SIAM.
Chizat L. and Bach F. (2018). On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems, 3036-3046.
Corliss G., Faure C., Griewank A., Hascoet L. and Naumann U. (editors) (2002). Automatic Differentiation of Algorithms: From Simulation to Optimization. Springer Science & Business Media.
Correa R. and Jofre A. (1989). Tangentially continuous directional derivatives in nonsmooth analysis. Journal of Optimization Theory and Applications, 61(1), 1-21.
Coste M. (1999). An introduction to o-minimal geometry. RAAG notes, Institut de Recherche Mathématique de Rennes.
Davis D., Drusvyatskiy D., Kakade S. and Lee J.D. (2018). Stochastic subgradient method converges on tame functions. Foundations of Computational Mathematics.
van den Dries L. and Miller C. (1996). Geometric categories and o-minimal structures. Duke Mathematical Journal, 84(2), 497-540.
Evans L.C. and Gariepy R.F. (2015). Measure Theory and Fine Properties of Functions (revised edition). Chapman and Hall/CRC.
Glorot X., Bordes A. and Bengio Y. (2011). Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (pp. 315-323).
Griewank A. and Walther A. (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (Vol. 105). SIAM.
Griewank A. (2013). On stable piecewise linearization and generalized algorithmic differentiation. Optimization Methods and Software, 28(6), 1139-1178.
Griewank A., Walther A., Fiege S. and Bosse T. (2016). On Lipschitz optimization based on gray-box piecewise linearization. Mathematical Programming, 158(1-2), 383-415.
Ioffe A.D. (1981). Nonsmooth analysis: differential calculus of nondifferentiable mappings. Transactions of the American Mathematical Society, 266(1), 1-56.
Ioffe A.D. (2017). Variational Analysis of Regular Mappings. Springer Monographs in Mathematics. Springer, Cham.
Kakade S.M. and Lee J.D. (2018). Provably correct automatic sub-differentiation for qualified programs. In Advances in Neural Information Processing Systems (pp. 7125-7135).
Kurdyka K. (1998). On gradients of functions definable in o-minimal structures. Annales de l'Institut Fourier, 48(3), 769-783.
Kurdyka K., Mostowski T. and Parusinski A. (2000). Proof of the gradient conjecture of R. Thom. Annals of Mathematics, 152(3), 763-792.
Kushner H. and Yin G.G. (2003). Stochastic Approximation and Recursive Algorithms and Applications (Vol. 35). Springer Science & Business Media.
LeCun Y., Bengio Y. and Hinton G. (2015). Deep learning. Nature, 521(7553).
Ljung L. (1977). Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, 22(4), 551-575.
Majewski S., Miasojedow B. and Moulines E. (2018). Analysis of nonsmooth stochastic approximation: the differential inclusion approach. arXiv preprint arXiv:1805.01916.
Mohammadi B. and Pironneau O. (2010). Applied Shape Optimization for Fluids. Oxford University Press.
Moulines E. and Bach F. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems (pp. 451-459).
Moreau J.-J. (1963). Fonctionnelles sous-différentiables.
Mordukhovich B.S. (2006). Variational Analysis and Generalized Differentiation I: Basic Theory. Springer Science & Business Media.
Paszke A., Gross S., Chintala S., Chanan G., Yang E., DeVito Z., Lin Z., Desmaison A., Antiga L. and Lerer A. (2017). Automatic differentiation in PyTorch. In NIPS workshops.
Robbins H. and Monro S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 400-407.
Rockafellar R.T. (1963). Convex Functions and Dual Extremum Problems. Doctoral dissertation, Harvard University.
Rockafellar R.T. (1970). On the maximal monotonicity of subdifferential mappings. Pacific Journal of Mathematics, 33(1), 209-216.
Rockafellar R.T. and Wets R.J.B. (1998). Variational Analysis. Springer.
Rumelhart D.E., Hinton G.E. and Williams R.J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.
Speelpenning B. (1980). Compiling Fast Partial Derivatives of Functions Given by Algorithms. Dept. of Computer Science, University of Illinois, Urbana.
Thibault L. (1982). On generalized differentials and subdifferentials of Lipschitz vector-valued functions. Nonlinear Analysis: Theory, Methods & Applications, 6(10), 1037-1053.
Thibault L. and Zagrodny D. (1995). Integration of subdifferentials of lower semicontinuous functions on Banach spaces. Journal of Mathematical Analysis and Applications, 189(1), 33-58.
Thibault L. and Zlateva N. (2005). Integrability of subdifferentials of directionally Lipschitz functions. Proceedings of the American Mathematical Society, 2939-2948.
Valadier M. (1989). Entraînement unilatéral, lignes de descente, fonctions lipschitziennes non pathologiques. Comptes rendus de l'Académie des Sciences, 308, 241-244.
Wang X. (1995). Pathological Lipschitz Functions in R^n. Master's thesis, Simon Fraser University.