On the Iteration Complexity of Hypergradient Computation

1. On the Iteration Complexity of Hypergradient Computation. Riccardo Grazzi, Computational Statistics and Machine Learning, Istituto Italiano di Tecnologia, and Department of Computer Science, University College London. riccardo.grazzi@iit.it. Joint work with Luca Franceschi, Massimiliano Pontil and Saverio Salzo.

2-3. Bilevel Optimization Problem
min_{λ ∈ Λ ⊆ ℝ^m} g(λ) := F(x(λ), λ)   (upper-level)
x(λ) = Φ(x(λ), λ)   (lower-level fixed-point equation)
• Hyperparameter optimization, meta-learning.
• Graph and recurrent neural networks.
How can we solve this optimization problem?
• Black-box methods (random/grid search, Bayesian optimization, ...).
• Gradient-based methods exploiting the hypergradient ∇g(λ).
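For concreteness, here is a minimal Python/JAX sketch of this bilevel structure (not from the slides: the quadratic lower-level map, the matrix A, the vectors b and b_val, and the step size beta are made-up placeholders). It only illustrates that g(λ) is accessed by running fixed-point iterations of Φ(·, λ) rather than through a closed-form x(λ).

```python
import jax.numpy as jnp

A = jnp.array([[1.0, 0.2], [0.2, 2.0]])   # made-up data for the lower-level problem
b = jnp.array([1.0, -1.0])
b_val = jnp.array([0.5, 0.5])             # made-up data for the upper-level objective
beta = 0.1                                # step size, assumed small enough for a contraction

def Phi(x, lam):
    # Lower-level fixed-point map: one gradient step on 0.5*||A x - b||^2 + 0.5*lam*||x||^2,
    # so its fixed point x(lam) is the regularized least-squares solution.
    return x - beta * (A.T @ (A @ x - b) + lam * x)

def F(x, lam):
    # Upper-level objective, evaluated at the (approximate) lower-level solution.
    return 0.5 * jnp.sum((x - b_val) ** 2)

def g_approx(lam, t=200):
    # g(lam) = F(x(lam), lam) can only be approximated: run t fixed-point iterations
    # x_j = Phi(x_{j-1}, lam) and plug the last iterate into F.
    x = jnp.zeros(2)
    for _ in range(t):
        x = Phi(x, lam)
    return F(x, lam)

print(g_approx(0.5))   # approximate upper-level value at lam = 0.5
```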

4-5. Computing the Hypergradient ∇g(λ)
∇g(λ) can be really expensive, or even impossible, to compute exactly. Two common approximation strategies are:
1. Iterative Differentiation (ITD).
2. Approximate Implicit Differentiation (AID).
Which one is the best?
• Previous works provide mostly qualitative and empirical results.
Can we have quantitative results on the approximation error?
• Yes! If the fixed-point map Φ(·, λ) is a contraction.

6. Our Contributions
Upper bounds on the approximation error for both ITD and AID:
• Both methods achieve non-asymptotic linear convergence rates.
• We prove that ITD is generally worse than AID in terms of upper bounds.
Extensive experimental comparison among different AID strategies and ITD:
• If Φ(·, λ) is a contraction, the results confirm the theory.
• If Φ(·, λ) is NOT a contraction, ITD can still be a reliable strategy.
[Figure: hypergradient approximation error versus the number of lower-level iterations t on four problems (Logistic Regression, Kernel Ridge Regression, Biased Regularization, Hyper Representation), comparing ITD with AID using a fixed-point (FP) or conjugate-gradient (CG) linear solver run for k = t or k = 10 iterations.]

7-9. Motivation
• Hyperparameter optimization (learn the kernel/regularization, ...).
• Meta-learning (MAML, L2LOpt, ...). [Image source: S. Ravi, H. Larochelle (2016).]
• Graph Neural Networks. [Image source: snap.stanford.edu/proj/embeddings-www]
• Some Recurrent Models.
• Deep Equilibrium Models.
All can be cast into the same bilevel framework, where at the lower level we seek the solution of a parametric fixed-point equation.

10. Example: Optimizing the Regularization Hyperparameter in Ridge Regression
min_{λ ∈ (0, ∞)} (1/2) ‖Y_val x(λ) − z_val‖²,   where x(λ) = argmin_{x ∈ ℝ^d} { ℓ(x, λ) := (1/2) ‖Y x − z‖² + (λ/2) ‖x‖² }
x(λ) is the unique fixed point of the one-step gradient-descent map Φ(x, λ) = x − β ∇₁ℓ(x, λ).
If the step size β is sufficiently small, Φ(·, λ) is also a contraction.
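A small numerical check of this claim is sketched below in Python/JAX; the matrix Y and targets z are synthetic and only for illustration. For the ridge objective, the Jacobian of the one-step gradient-descent map is I − β(YᵀY + λI), so Φ(·, λ) is a contraction whenever 0 < β < 2 / (λ_max(YᵀY) + λ), and fixed-point iteration then converges linearly to the closed-form ridge solution.

```python
import jax.numpy as jnp

# Synthetic training data (made up for illustration).
Y = jnp.arange(12.0).reshape(6, 2) / 10.0
z = jnp.array([0.1, 0.4, 0.3, 0.8, 0.7, 1.2])
lam, beta = 0.5, 0.2                       # regularization hyperparameter and step size

def Phi(x, lam):
    # One gradient-descent step on l(x, lam) = 0.5*||Y x - z||^2 + 0.5*lam*||x||^2.
    return x - beta * (Y.T @ (Y @ x - z) + lam * x)

# The Jacobian of Phi(., lam) is I - beta*(Y^T Y + lam*I); its spectral norm is the
# contraction constant, and it is < 1 iff 0 < beta < 2 / (lambda_max(Y^T Y) + lam).
H = Y.T @ Y + lam * jnp.eye(2)
q = jnp.linalg.norm(jnp.eye(2) - beta * H, ord=2)
print("contraction constant:", q)          # about 0.9 here, so Phi(., lam) is a contraction

# Fixed-point iteration converges (at linear rate q) to the closed-form ridge solution.
x = jnp.zeros(2)
for _ in range(300):
    x = Phi(x, lam)
print(jnp.linalg.norm(x - jnp.linalg.solve(H, Y.T @ z)))   # small
```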

11-12. The Bilevel Framework
min_{λ ∈ Λ ⊆ ℝ^m} g(λ) := F(x(λ), λ)   (upper-level)
x(λ) = Φ(x(λ), λ)   (lower-level)
• g is usually non-convex and expensive or impossible to evaluate exactly.
• x(λ) ∈ ℝ^d is often not available in closed form.
• ∇g is even harder to evaluate.

13-15. How to Compute the Hypergradient ∇g(λ)?
Iterative Differentiation (ITD):
1. Set x_0(λ) = 0 and compute, for j = 1, 2, ..., t: x_j(λ) = Φ(x_{j−1}(λ), λ).
2. Compute g_t(λ) = F(x_t(λ), λ).
3. Compute ∇g_t(λ) efficiently using reverse (RMAD) or forward (FMAD) mode automatic differentiation.
Approximate Implicit Differentiation (AID):
1. Get x_t(λ) with t steps of a lower-level solver.
2. Compute w_{t,k}(λ) with k steps of a solver for the linear system (I − ∂₁Φ(x_t(λ), λ)^⊤) w = ∇₁F(x_t(λ), λ).
3. Compute the approximate gradient as ∇̂g(λ) := ∇₂F(x_t(λ), λ) + ∂₂Φ(x_t(λ), λ)^⊤ w_{t,k}(λ).
Which one is the best?
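A possible side-by-side implementation of the two procedures, in Python/JAX on the ridge-regression example with made-up data, is sketched below. ITD applies reverse-mode automatic differentiation (jax.grad) through the t unrolled fixed-point iterations; AID solves the linear system at x_t(λ), here with an exact direct solve standing in for k iterations of a fixed-point or conjugate-gradient solver.

```python
import jax
import jax.numpy as jnp

# Made-up training and validation data.
Y = jnp.arange(12.0).reshape(6, 2) / 10.0
z = jnp.array([0.1, 0.4, 0.3, 0.8, 0.7, 1.2])
Y_val = jnp.array([[0.3, 0.6], [0.5, 0.1], [0.9, 0.4]])
z_val = jnp.array([0.7, 0.2, 1.0])
beta = 0.2

def Phi(x, lam):                 # lower-level map: one gradient-descent step on the ridge objective
    return x - beta * (Y.T @ (Y @ x - z) + lam * x)

def F(x, lam):                   # upper-level objective: validation loss
    return 0.5 * jnp.sum((Y_val @ x - z_val) ** 2)

def x_t(lam, t):                 # t fixed-point iterations starting from x_0 = 0
    x = jnp.zeros(2)
    for _ in range(t):
        x = Phi(x, lam)
    return x

# ITD: reverse-mode autodiff through the unrolled iterations gives the gradient of g_t = F(x_t(.), .).
def grad_itd(lam, t=200):
    return jax.grad(lambda l: F(x_t(l, t), l))(lam)

# AID: implicit differentiation at the approximate fixed point x_t(lam).
def grad_aid(lam, t=200):
    x = x_t(lam, t)
    d1Phi = jax.jacobian(Phi, argnums=0)(x, lam)          # ∂1Φ(x_t, λ), a d x d matrix
    v = jax.grad(F, argnums=0)(x, lam)                    # ∇1F(x_t, λ)
    w = jnp.linalg.solve(jnp.eye(2) - d1Phi.T, v)         # solve (I - ∂1Φ^T) w = ∇1F exactly
    d2F = jax.grad(F, argnums=1)(x, lam)                  # ∇2F (zero here: F does not depend on λ)
    d2Phi = jax.jacobian(Phi, argnums=1)(x, lam)          # ∂2Φ(x_t, λ), shape (d,) since λ is scalar
    return d2F + d2Phi @ w                                # ∇̂g(λ) = ∇2F + ∂2Φ^T w

print(grad_itd(0.5), grad_aid(0.5))   # the two approximations of ∇g(0.5) are close for large t
```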

16. A First Comparison
ITD:
• Ignores the bilevel structure.
• Cost in time (RMAD): O(Cost(g_t(λ))).
• Cost in memory (RMAD): O(t d).
• Can we control ‖∇g_t(λ) − ∇g(λ)‖?
AID:
• Can use any lower-level solver.
• Cost in time (k = t): O(Cost(g_t(λ))).
• Cost in memory: O(d).
• Can we control ‖∇̂g(λ) − ∇g(λ)‖?
(Here g_t(λ) = F(x_t(λ), λ).)
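On the toy ridge example these questions can be checked numerically, since the exact hypergradient is available in closed form via the implicit function theorem. The sketch below (Python/JAX, same made-up data as in the sketches above, AID with an exact linear solve) only illustrates the kind of comparison reported in the experiments: both errors shrink as t grows.

```python
import jax
import jax.numpy as jnp

Y = jnp.arange(12.0).reshape(6, 2) / 10.0                 # made-up data, as above
z = jnp.array([0.1, 0.4, 0.3, 0.8, 0.7, 1.2])
Y_val = jnp.array([[0.3, 0.6], [0.5, 0.1], [0.9, 0.4]])
z_val = jnp.array([0.7, 0.2, 1.0])
lam, beta = 0.5, 0.2

Phi = lambda x, l: x - beta * (Y.T @ (Y @ x - z) + l * x)
F = lambda x, l: 0.5 * jnp.sum((Y_val @ x - z_val) ** 2)

def x_t(l, t):
    x = jnp.zeros(2)
    for _ in range(t):
        x = Phi(x, l)
    return x

# Exact hypergradient: x(λ) = H⁻¹ Yᵀz with H = YᵀY + λI, x'(λ) = -H⁻¹ x(λ),
# hence ∇g(λ) = x'(λ)ᵀ ∇1F(x(λ)) = -x(λ)ᵀ H⁻¹ Y_valᵀ (Y_val x(λ) - z_val).
H = Y.T @ Y + lam * jnp.eye(2)
x_star = jnp.linalg.solve(H, Y.T @ z)
grad_exact = -x_star @ jnp.linalg.solve(H, Y_val.T @ (Y_val @ x_star - z_val))

for t in (10, 50, 200):
    g_itd = jax.grad(lambda l: F(x_t(l, t), l))(lam)                      # ITD estimate
    x = x_t(lam, t)                                                       # AID estimate
    w = jnp.linalg.solve(jnp.eye(2) - jax.jacobian(Phi, argnums=0)(x, lam).T,
                         jax.grad(F, argnums=0)(x, lam))
    g_aid = jax.jacobian(Phi, argnums=1)(x, lam) @ w                      # ∇2F = 0 for this F
    print(t, jnp.abs(g_itd - grad_exact), jnp.abs(g_aid - grad_exact))    # both errors shrink with t
```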

17. Previous Work on the Approximation Error
ITD (Franceschi et al., 2018):
• argmin g_t → argmin g as t → ∞.
• We provide non-asymptotic upper bounds on ‖∇g_t(λ) − ∇g(λ)‖.
AID:
• ‖∇̂g(λ) − ∇g(λ)‖ → 0 as t, k → ∞ (Pedregosa, 2016).
• ‖∇̂g(λ) − ∇g(λ)‖ → 0 at a linear rate in t and k for meta-learning with biased regularization (Rajeswaran et al., 2019).
• We provide non-asymptotic upper bounds on ‖∇̂g(λ) − ∇g(λ)‖.
