  1. Differential Programming. Gabriel Peyré, www.numerical-tours.com, École Normale Supérieure.

  2. https://mathematical-coffees.github.io, organized by Mérouane Debbah & Gabriel Peyré. Topics: Optimization, Deep Learning, Optimal Transport, Quantum Computing, Compressed Sensing, Artificial Intelligence, Mean Field Games, Topos. Speakers: Alexandre Gramfort (INRIA), Yves Achdou (Paris 6), Frédéric Magniez (CNRS and Paris 7), Olivier Grisel (INRIA), Daniel Bennequin (Paris 7), Edouard Oyallon (CentraleSupelec), Olivier Guéant (Paris 1), Marco Cuturi (ENSAE), Gabriel Peyré (CNRS and ENS), Iordanis Kerenidis (CNRS and Paris 7), Jalal Fadili (ENSICaen), Joris Van den Bossche (INRIA), Guillaume Lecué (CNRS and ENSAE).

  3. Model Fitting in Data Sciences. $\min_\theta E(\theta) \stackrel{\text{def.}}{=} L(f(x,\theta), y)$, with input $x$, output $y$, loss $L$, model $f$ and parameter $\theta$.

  4. Model Fitting in Data Sciences. $\min_\theta E(\theta) \stackrel{\text{def.}}{=} L(f(x,\theta), y)$, with input $x$, output $y$, loss $L$, model $f$ and parameter $\theta$. Deep learning: $f(\cdot, \theta)$ maps the input $x$ to class probabilities $y$. [diagram: network with layer parameters $\theta_1, \dots, \theta_4$]

  5. Model Fitting in Data Sciences. $\min_\theta E(\theta) \stackrel{\text{def.}}{=} L(f(x,\theta), y)$, with input $x$, output $y$, loss $L$, model $f$ and parameter $\theta$. Deep learning: $f(\cdot, \theta)$ maps the input $x$ to class probabilities $y$. Super-resolution: $f(x, \cdot)$ is the degradation operator, $y$ the observation, $\theta$ the unknown image.

  6. Model Fitting in Data Sciences. $\min_\theta E(\theta) \stackrel{\text{def.}}{=} L(f(x,\theta), y)$, with input $x$, output $y$, loss $L$, model $f$ and parameter $\theta$. Deep learning: $f(\cdot, \theta)$ maps the input $x$ to class probabilities $y$. Super-resolution: $f(x, \cdot)$ is the degradation operator, $y$ the observation, $\theta$ the unknown image. Medical imaging registration: $f(\cdot, \theta)$ is a diffeomorphism warping the image $x$ onto $y$.

  7. Gradient-based Methods. $\min_\theta E(\theta) \stackrel{\text{def.}}{=} L(f(x,\theta), y)$. Gradient descent: $\theta^{(\ell+1)} = \theta^{(\ell)} - \tau_\ell \nabla E(\theta^{(\ell)})$. [plot: decay of the objective versus the iteration $\ell$ for the optimal step size $\tau_\ell = \tau^\star$, a small $\tau_\ell$ and a large $\tau_\ell$]

  8. Gradient-based Methods. $\min_\theta E(\theta) \stackrel{\text{def.}}{=} L(f(x,\theta), y)$. Gradient descent: $\theta^{(\ell+1)} = \theta^{(\ell)} - \tau_\ell \nabla E(\theta^{(\ell)})$ (behaviour depends on the step size: optimal $\tau_\ell = \tau^\star$, small $\tau_\ell$, large $\tau_\ell$). Many generalizations: Nesterov / heavy-ball, (quasi-)Newton, stochastic / incremental methods, proximal splitting (for non-smooth $E$), ...
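
As a minimal illustration of the gradient descent update above (my own sketch, not code from the talk), here is a Python example on a hypothetical least-squares objective; the quadratic model and the fixed step size are assumptions made for the example.

```python
import numpy as np

# Hypothetical least-squares instance: E(theta) = 0.5 * ||A theta - y||^2
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
y = rng.standard_normal(50)

def E(theta):
    return 0.5 * np.sum((A @ theta - y) ** 2)

def grad_E(theta):
    return A.T @ (A @ theta - y)

theta = np.zeros(10)
tau = 1.0 / np.linalg.norm(A, 2) ** 2    # step size 1/L for this quadratic objective
for _ in range(200):
    theta = theta - tau * grad_E(theta)  # theta^{l+1} = theta^l - tau * grad E(theta^l)
```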

  9. The Complexity of Gradient Computation. Setup: $E : \mathbb{R}^n \to \mathbb{R}$ computable in $K$ operations. Hypothesis: elementary operations ($a \times b$, $\log(a)$, $\sqrt{a}$, ...) and their derivatives cost $O(1)$.

  10. The Complexity of Gradient Computation. Setup: $E : \mathbb{R}^n \to \mathbb{R}$ computable in $K$ operations. Hypothesis: elementary operations ($a \times b$, $\log(a)$, $\sqrt{a}$, ...) and their derivatives cost $O(1)$. Question: what is the complexity of computing $\nabla E : \mathbb{R}^n \to \mathbb{R}^n$?

  11. The Complexity of Gradient Computation. Setup: $E : \mathbb{R}^n \to \mathbb{R}$ computable in $K$ operations. Hypothesis: elementary operations ($a \times b$, $\log(a)$, $\sqrt{a}$, ...) and their derivatives cost $O(1)$. Question: what is the complexity of computing $\nabla E : \mathbb{R}^n \to \mathbb{R}^n$? Finite differences: $\nabla E(\theta) \approx \frac{1}{\varepsilon}\big(E(\theta + \varepsilon\delta_1) - E(\theta), \dots, E(\theta + \varepsilon\delta_n) - E(\theta)\big)$: $K(n+1)$ operations, intractable for large $n$.
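
A short sketch of the finite-difference approximation above (my own illustration; the function name and the default value of $\varepsilon$ are arbitrary choices):

```python
import numpy as np

def finite_difference_grad(E, theta, eps=1e-6):
    """Approximate grad E(theta) with n+1 evaluations of E (forward differences)."""
    n = theta.size
    E0 = E(theta)
    g = np.zeros(n)
    for i in range(n):
        delta = np.zeros(n)
        delta[i] = eps                       # perturbation along the i-th canonical direction
        g[i] = (E(theta + delta) - E0) / eps
    return g
```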

  12. The Complexity of Gradient Computation. Setup: $E : \mathbb{R}^n \to \mathbb{R}$ computable in $K$ operations. Hypothesis: elementary operations ($a \times b$, $\log(a)$, $\sqrt{a}$, ...) and their derivatives cost $O(1)$. Question: what is the complexity of computing $\nabla E : \mathbb{R}^n \to \mathbb{R}^n$? Finite differences: $K(n+1)$ operations, intractable for large $n$. Theorem [Seppo Linnainmaa, 1970]: there is an algorithm that computes $\nabla E$ in $O(K)$ operations.

  13. The Complexity of Gradient Computation. Setup: $E : \mathbb{R}^n \to \mathbb{R}$ computable in $K$ operations. Question: what is the complexity of computing $\nabla E : \mathbb{R}^n \to \mathbb{R}^n$? Finite differences: $K(n+1)$ operations, intractable for large $n$. Theorem [Seppo Linnainmaa, 1970]: there is an algorithm that computes $\nabla E$ in $O(K)$ operations. This algorithm is reverse-mode automatic differentiation. [photo: Seppo Linnainmaa]

  14. Differentiating Composition of Functions. Chain $x = x_0 \xrightarrow{g_0} x_1 \xrightarrow{g_1} x_2 \to \dots \xrightarrow{g_R} x_{R+1} \in \mathbb{R}$, where $g_r : \mathbb{R}^{n_r} \to \mathbb{R}^{n_{r+1}}$ and $x_{r+1} = g_r(x_r)$. The Jacobian is $\partial g_r(x_r) \in \mathbb{R}^{n_{r+1} \times n_r}$; since the last map is scalar-valued, its gradient is the transposed Jacobian, $\nabla g_R(x_R) = [\partial g_R(x_R)]^\top \in \mathbb{R}^{n_R \times 1}$.

  15. Differentiating Composition of Functions. Chain $x = x_0 \xrightarrow{g_0} x_1 \xrightarrow{g_1} \dots \xrightarrow{g_R} x_{R+1} \in \mathbb{R}$, with $g_r : \mathbb{R}^{n_r} \to \mathbb{R}^{n_{r+1}}$, $x_{r+1} = g_r(x_r)$, Jacobians $\partial g_r(x_r) \in \mathbb{R}^{n_{r+1} \times n_r}$ and $\nabla g_R(x_R) = [\partial g_R(x_R)]^\top$. Chain rule: $\partial g(x) = \partial g_R(x_R) \times \partial g_{R-1}(x_{R-1}) \times \dots \times \partial g_1(x_1) \times \partial g_0(x_0)$, a product of matrices $A_r \stackrel{\text{def.}}{=} \partial g_r(x_r)$ of sizes $1 \times n_R$, $n_R \times n_{R-1}$, ..., $n_2 \times n_1$, $n_1 \times n_0$.

  16. Differentiating Composition of Functions. Chain rule: $\partial g(x) = A_R \times A_{R-1} \times \dots \times A_1 \times A_0$ with $A_r \stackrel{\text{def.}}{=} \partial g_r(x_r) \in \mathbb{R}^{n_{r+1} \times n_r}$ and $n_{R+1} = 1$. Forward ordering: $\partial g(x) = A_R \times (A_{R-1} \times (\dots \times (A_2 \times (A_1 \times A_0)) \dots ))$; each partial product is an $n_{r+1} \times n_0$ matrix, so each multiplication costs $O(n^3)$. Complexity (if $n_r = n$ for $r = 0, \dots, R$): $(R-1)\, n^3 + n^2$ operations.

  17. Differentiating Composition of Functions. Chain rule: $\partial g(x) = A_R \times A_{R-1} \times \dots \times A_1 \times A_0$ with $A_r \stackrel{\text{def.}}{=} \partial g_r(x_r) \in \mathbb{R}^{n_{r+1} \times n_r}$ and $n_{R+1} = 1$. Forward ordering: $\partial g(x) = A_R \times (A_{R-1} \times (\dots \times (A_1 \times A_0) \dots ))$; each partial product is an $n_{r+1} \times n_0$ matrix, so each multiplication costs $O(n^3)$; complexity (if $n_r = n$ for $r = 0, \dots, R$): $(R-1)\, n^3 + n^2$ operations. Backward ordering: $\partial g(x) = (\dots ((A_R \times A_{R-1}) \times A_{R-2}) \times \dots ) \times A_0$; each partial product is a $1 \times n_r$ row vector, so each multiplication costs only $O(n^2)$; complexity (same assumptions): $R\, n^2$ operations.
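
To make the two orderings concrete, here is a small sketch (my own, not from the slides) that accumulates the same chain of Jacobians in both directions; the random Jacobians and sizes are assumptions, and the point is the $O(Rn^3)$ versus $O(Rn^2)$ cost difference.

```python
import numpy as np

rng = np.random.default_rng(0)
R, n = 10, 100
# Jacobians A_r = dg_r(x_r): n x n for r = 0..R-1, and 1 x n for the scalar-valued last map g_R
A = [rng.standard_normal((n, n)) for _ in range(R)] + [rng.standard_normal((1, n))]

# Forward accumulation: start from A_0 and multiply towards the output.
# Each intermediate is an n x n matrix, so each step costs O(n^3).
J_fwd = A[0]
for r in range(1, R + 1):
    J_fwd = A[r] @ J_fwd

# Backward (reverse) accumulation: start from the 1 x n Jacobian of the output.
# Each intermediate is a 1 x n row vector, so each step costs only O(n^2).
J_bwd = A[R]
for r in range(R - 1, -1, -1):
    J_bwd = J_bwd @ A[r]

assert np.allclose(J_fwd, J_bwd)   # same 1 x n Jacobian, very different cost
```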

  18. Feedforward Computational Graphs. $x_{r+1} = g_r(x_r, \theta_r)$, $E \stackrel{\text{def.}}{=} L(x_{R+1}, y)$. [diagram: chain $x = x_0 \to x_1 \to \dots \to x_{R+1}$ through $g_0, \dots, g_R$, each $g_r$ taking the parameter $\theta_r$, followed by the loss $L(\cdot, y)$ producing $E$]

  19. Feedforward Computational Graphs. $x_{r+1} = g_r(x_r, \theta_r)$, $E \stackrel{\text{def.}}{=} L(x_{R+1}, y)$. Example: deep neural network (here fully connected): $x_{r+1} = \rho(A_r x_r + b_r)$, $\theta_r = (A_r, b_r)$, with $x_r \in \mathbb{R}^{d_r}$, $A_r \in \mathbb{R}^{d_{r+1} \times d_r}$, $b_r \in \mathbb{R}^{d_{r+1}}$ and $\rho$ a pointwise nonlinearity. [diagram: graph of $\rho(u)$; network with layer parameters $\theta_1, \dots, \theta_4$]

  20. Feedforward Computational Graphs. $x_{r+1} = g_r(x_r, \theta_r)$, $E \stackrel{\text{def.}}{=} L(x_{R+1}, y)$. Example: deep neural network (here fully connected): $x_{r+1} = \rho(A_r x_r + b_r)$, $\theta_r = (A_r, b_r)$, with $x_r \in \mathbb{R}^{d_r}$, $A_r \in \mathbb{R}^{d_{r+1} \times d_r}$, $b_r \in \mathbb{R}^{d_{r+1}}$. Logistic loss (classification): $L(x_{R+1}, y) \stackrel{\text{def.}}{=} \log \sum_i \exp(x_{R+1,i}) - \sum_i x_{R+1,i}\, y_i$, with gradient $\nabla_{x_{R+1}} L(x_{R+1}, y) = \frac{e^{x_{R+1}}}{\sum_i e^{x_{R+1,i}}} - y$.
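
Below is a minimal NumPy sketch of this fully connected forward pass and logistic loss (my own illustration; the function names and the choice $\rho = \tanh$ are assumptions made for the example):

```python
import numpy as np

def forward(x, params, rho=np.tanh):
    """Forward pass x_{r+1} = rho(A_r x_r + b_r); returns all intermediate x_r."""
    xs = [x]
    for A, b in params:                      # params[r] = (A_r, b_r)
        xs.append(rho(A @ xs[-1] + b))
    return xs

def logistic_loss(x_out, y):
    """L(x, y) = log sum_i exp(x_i) - <x, y>, with y a one-hot vector."""
    return np.log(np.sum(np.exp(x_out))) - x_out @ y

def logistic_loss_grad(x_out, y):
    """grad_x L(x, y) = softmax(x) - y."""
    p = np.exp(x_out - x_out.max())          # shift for numerical stability
    return p / p.sum() - y
```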

  21. Backpropagation Algorithm. Feedforward graph: $x_{r+1} = g_r(x_r, \theta_r)$, $E \stackrel{\text{def.}}{=} L(x_{R+1}, y)$.

  22. Backpropagation Algorithm. Feedforward graph: $x_{r+1} = g_r(x_r, \theta_r)$, $E \stackrel{\text{def.}}{=} L(x_{R+1}, y)$. Proposition: for all $r = R, \dots, 0$, $\nabla_{x_r} E = [\partial_{x_r} g_r(x_r, \theta_r)]^\top (\nabla_{x_{r+1}} E)$ and $\nabla_{\theta_r} E = [\partial_{\theta_r} g_r(x_r, \theta_r)]^\top (\nabla_{x_{r+1}} E)$.

  23. Backpropagation Algorithm. Feedforward graph: $x_{r+1} = g_r(x_r, \theta_r)$, $E \stackrel{\text{def.}}{=} L(x_{R+1}, y)$. Proposition: for all $r = R, \dots, 0$, $\nabla_{x_r} E = [\partial_{x_r} g_r(x_r, \theta_r)]^\top (\nabla_{x_{r+1}} E)$ and $\nabla_{\theta_r} E = [\partial_{\theta_r} g_r(x_r, \theta_r)]^\top (\nabla_{x_{r+1}} E)$. Example (deep neural network, $x_{r+1} = \rho(A_r x_r + b_r)$): for all $r = R, \dots, 0$, setting $M_r \stackrel{\text{def.}}{=} \rho'(A_r x_r + b_r) \odot \nabla_{x_{r+1}} E$, one has $\nabla_{x_r} E = A_r^\top M_r$, $\nabla_{A_r} E = M_r\, x_r^\top$ and $\nabla_{b_r} E = M_r$.
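
Continuing the NumPy sketch given after slide 20 (again my own illustration, reusing the hypothetical `forward` and `logistic_loss_grad` helpers, with $\rho = \tanh$ so that $\rho'(u) = 1 - \tanh(u)^2$), the backward recursion above reads:

```python
import numpy as np

def backprop(xs, params, y):
    """Apply the recursion M_r = rho'(A_r x_r + b_r) * grad_{x_{r+1}} E, then
    grad_{A_r} E = M_r x_r^T, grad_{b_r} E = M_r, grad_{x_r} E = A_r^T M_r."""
    grads = [None] * len(params)
    g = logistic_loss_grad(xs[-1], y)                  # grad_{x_{R+1}} E
    for r in range(len(params) - 1, -1, -1):
        A, b = params[r]
        M = (1.0 - np.tanh(A @ xs[r] + b) ** 2) * g    # rho'(A_r x_r + b_r) ⊙ grad_{x_{r+1}} E
        grads[r] = (np.outer(M, xs[r]), M)             # (grad_{A_r} E, grad_{b_r} E)
        g = A.T @ M                                    # grad_{x_r} E, passed to the previous layer
    return grads
```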

  24. Recurrent Architectures. Shared parameters: $x_{r+1} = g_r(x_r, \theta)$, with the same $\theta$ entering every step. [diagram: feedforward chain $x = x_0 \to \dots \to x_{R+1} \to L(\cdot, y) \to E$ in which all the $g_r$ share the single parameter $\theta$]

  25. Recurrent Architectures. Shared parameters: $x_{r+1} = g_r(x_r, \theta)$, with the same $\theta$ entering every step. Recurrent networks for natural language processing: [diagram: the same cell $g$ with shared parameters $\theta$ is applied at every time step, consuming the inputs $a_0, \dots, a_T$, updating the hidden state from $x_{t-1}$ to $x_t$, and emitting the outputs $b_0, \dots, b_T$]

  26. Recurrent Architectures. Shared parameters: $x_{r+1} = g_r(x_r, \theta)$. Recurrent networks for natural language processing: [diagram as above]. Take-home message: for complicated computational architectures, you do not want to derive and implement these gradient computations by hand.
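
This is exactly what reverse-mode automatic differentiation frameworks provide. As an illustration (not from the slides, and assuming PyTorch is available), the gradient of a loss through an unrolled recurrent computation with shared parameters is obtained without writing any backward pass; the cell, sizes and loss below are arbitrary choices:

```python
import torch

# Hypothetical recurrent cell with shared parameter theta, unrolled over T steps.
theta = torch.randn(8, 8, requires_grad=True)
inputs = torch.randn(20, 8)           # inputs a_0, ..., a_T
x = torch.zeros(8)
for a in inputs:
    x = torch.tanh(theta @ x + a)     # the same theta is reused at every time step
loss = (x ** 2).sum()
loss.backward()                       # reverse-mode AD through the whole unrolled graph
print(theta.grad.shape)               # gradient with respect to the shared parameters
```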

  27. Computational Graph

  28. Computational Graph. Computer program ⇔ directed acyclic graph ⇔ linear ordering of the nodes $(\theta_r)_r$ computing a function $\ell(\theta_1, \dots, \theta_M)$. Forward evaluation: for $r = M+1, \dots, R$, set $\theta_r = g_r(\theta_{\mathrm{Parents}(r)})$; return $\theta_R$. [diagram: example graph with input nodes $\theta_1, \theta_2$, computed nodes $\theta_3 = g_3(\cdot)$, $\theta_4 = g_4(\cdot)$ and output node $\theta_5 = g_5(\cdot)$]
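
A minimal sketch of this forward evaluation over a DAG (my own illustration; the node numbering, parents table and elementary operations are hypothetical):

```python
import numpy as np

# Hypothetical graph: inputs theta_1, theta_2; theta_3 = theta_1 * theta_2;
# theta_4 = log(theta_3); theta_5 = theta_3 + theta_4 (the output).
parents = {3: (1, 2), 4: (3,), 5: (3, 4)}
ops = {3: lambda a, b: a * b, 4: np.log, 5: lambda a, b: a + b}

def forward(inputs):
    """Evaluate the nodes in a linear ordering compatible with the DAG."""
    theta = dict(enumerate(inputs, start=1))        # theta_1, ..., theta_M
    for r in sorted(parents):                       # r = M+1, ..., R
        theta[r] = ops[r](*(theta[p] for p in parents[r]))
    return theta[max(parents)]                      # return theta_R

print(forward([2.0, 3.0]))                          # 6 + log(6)
```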
