On the Iteration Complexity of Hypergradient Computation

1. On the Iteration Complexity of Hypergradient Computation. Riccardo Grazzi, Computational Statistics and Machine Learning, Istituto Italiano di Tecnologia, and Department of Computer Science, University College London. riccardo.grazzi@iit.it. Joint work with Luca Franceschi, Massimiliano Pontil and Saverio Salzo.

2-3. Bilevel Optimization Problem
min_{λ ∈ Λ ⊆ ℝ^m} g(λ) := F(x(λ), λ)   (upper-level)
x(λ) = Φ(x(λ), λ)   (lower-level fixed-point equation)
• Hyperparameter optimization, meta-learning.
• Graph and recurrent neural networks.
How can we solve this optimization problem?
• Black-box methods (random/grid search, Bayesian optimization, ...).
• Gradient-based methods exploiting the hypergradient ∇g(λ).
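For concreteness, here is a minimal Python/JAX sketch of this bilevel structure (not from the slides: the quadratic lower-level map, the matrix A, the vectors b and b_val, and the step size beta are made-up placeholders). It only illustrates that g(λ) is accessed by running fixed-point iterations of Φ(·, λ) rather than through a closed-form x(λ).

```python
import jax.numpy as jnp

A = jnp.array([[1.0, 0.2], [0.2, 2.0]])   # made-up data for the lower-level problem
b = jnp.array([1.0, -1.0])
b_val = jnp.array([0.5, 0.5])             # made-up data for the upper-level objective
beta = 0.1                                # step size, assumed small enough for a contraction

def Phi(x, lam):
    # Lower-level fixed-point map: one gradient step on 0.5*||A x - b||^2 + 0.5*lam*||x||^2,
    # so its fixed point x(lam) is the regularized least-squares solution.
    return x - beta * (A.T @ (A @ x - b) + lam * x)

def F(x, lam):
    # Upper-level objective, evaluated at the (approximate) lower-level solution.
    return 0.5 * jnp.sum((x - b_val) ** 2)

def g_approx(lam, t=200):
    # g(lam) = F(x(lam), lam) can only be approximated: run t fixed-point iterations
    # x_j = Phi(x_{j-1}, lam) and plug the last iterate into F.
    x = jnp.zeros(2)
    for _ in range(t):
        x = Phi(x, lam)
    return F(x, lam)

print(g_approx(0.5))   # approximate upper-level value at lam = 0.5
```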

4-5. Computing the Hypergradient ∇g(λ)
∇g(λ) can be really expensive, or even impossible, to compute exactly. Two common approximation strategies are:
1. Iterative Differentiation (ITD).
2. Approximate Implicit Differentiation (AID).
Which one is the best?
• Previous works provide mostly qualitative and empirical results.
Can we have quantitative results on the approximation error?
• Yes! If the fixed-point map Φ(·, λ) is a contraction.

6. Our Contributions
Upper bounds on the approximation error for both ITD and AID:
• Both methods achieve non-asymptotic linear convergence rates.
• We prove that ITD is generally worse than AID in terms of upper bounds.
Extensive experimental comparison among different AID strategies and ITD:
• If Φ(·, λ) is a contraction, the results confirm the theory.
• If Φ(·, λ) is NOT a contraction, ITD can still be a reliable strategy.
[Figure: hypergradient approximation error versus the number of lower-level iterations t on four problems (Logistic Regression, Kernel Ridge Regression, Biased Regularization, Hyper Representation), comparing ITD with AID using a fixed-point (FP) or conjugate-gradient (CG) linear solver run for k = t or k = 10 iterations.]

7-9. Motivation
• Hyperparameter optimization (learn the kernel/regularization, ...).
• Meta-learning (MAML, L2LOpt, ...). [Image source: S. Ravi, H. Larochelle (2016).]
• Graph Neural Networks. [Image source: snap.stanford.edu/proj/embeddings-www]
• Some Recurrent Models.
• Deep Equilibrium Models.
All can be cast into the same bilevel framework, where at the lower level we seek the solution of a parametric fixed-point equation.

10. Example: Optimizing the Regularization Hyperparameter in Ridge Regression
min_{λ ∈ (0, ∞)} (1/2) ‖Y_val x(λ) − z_val‖²,   where x(λ) = argmin_{x ∈ ℝ^d} { ℓ(x, λ) := (1/2) ‖Y x − z‖² + (λ/2) ‖x‖² }
x(λ) is the unique fixed point of the one-step gradient-descent map Φ(x, λ) = x − β ∇₁ℓ(x, λ).
If the step size β is sufficiently small, Φ(·, λ) is also a contraction.
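A small numerical check of this claim is sketched below in Python/JAX; the matrix Y and targets z are synthetic and only for illustration. For the ridge objective, the Jacobian of the one-step gradient-descent map is I − β(YᵀY + λI), so Φ(·, λ) is a contraction whenever 0 < β < 2 / (λ_max(YᵀY) + λ), and fixed-point iteration then converges linearly to the closed-form ridge solution.

```python
import jax.numpy as jnp

# Synthetic training data (made up for illustration).
Y = jnp.arange(12.0).reshape(6, 2) / 10.0
z = jnp.array([0.1, 0.4, 0.3, 0.8, 0.7, 1.2])
lam, beta = 0.5, 0.2                       # regularization hyperparameter and step size

def Phi(x, lam):
    # One gradient-descent step on l(x, lam) = 0.5*||Y x - z||^2 + 0.5*lam*||x||^2.
    return x - beta * (Y.T @ (Y @ x - z) + lam * x)

# The Jacobian of Phi(., lam) is I - beta*(Y^T Y + lam*I); its spectral norm is the
# contraction constant, and it is < 1 iff 0 < beta < 2 / (lambda_max(Y^T Y) + lam).
H = Y.T @ Y + lam * jnp.eye(2)
q = jnp.linalg.norm(jnp.eye(2) - beta * H, ord=2)
print("contraction constant:", q)          # about 0.9 here, so Phi(., lam) is a contraction

# Fixed-point iteration converges (at linear rate q) to the closed-form ridge solution.
x = jnp.zeros(2)
for _ in range(300):
    x = Phi(x, lam)
print(jnp.linalg.norm(x - jnp.linalg.solve(H, Y.T @ z)))   # small
```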

11-12. The Bilevel Framework
min_{λ ∈ Λ ⊆ ℝ^m} g(λ) := F(x(λ), λ)   (upper-level)
x(λ) = Φ(x(λ), λ)   (lower-level)
• g is usually non-convex and expensive or impossible to evaluate exactly.
• x(λ) ∈ ℝ^d is often not available in closed form.
• ∇g is even harder to evaluate.

13-15. How to Compute the Hypergradient ∇g(λ)?
Iterative Differentiation (ITD):
1. Set x_0(λ) = 0 and compute, for j = 1, 2, ..., t: x_j(λ) = Φ(x_{j−1}(λ), λ).
2. Compute g_t(λ) = F(x_t(λ), λ).
3. Compute ∇g_t(λ) efficiently using reverse (RMAD) or forward (FMAD) mode automatic differentiation.
Approximate Implicit Differentiation (AID):
1. Get x_t(λ) with t steps of a lower-level solver.
2. Compute w_{t,k}(λ) with k steps of a solver for the linear system (I − ∂₁Φ(x_t(λ), λ)^⊤) w = ∇₁F(x_t(λ), λ).
3. Compute the approximate gradient as ∇̂g(λ) := ∇₂F(x_t(λ), λ) + ∂₂Φ(x_t(λ), λ)^⊤ w_{t,k}(λ).
Which one is the best?
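A possible side-by-side implementation of the two procedures, in Python/JAX on the ridge-regression example with made-up data, is sketched below. ITD applies reverse-mode automatic differentiation (jax.grad) through the t unrolled fixed-point iterations; AID solves the linear system at x_t(λ), here with an exact direct solve standing in for k iterations of a fixed-point or conjugate-gradient solver.

```python
import jax
import jax.numpy as jnp

# Made-up training and validation data.
Y = jnp.arange(12.0).reshape(6, 2) / 10.0
z = jnp.array([0.1, 0.4, 0.3, 0.8, 0.7, 1.2])
Y_val = jnp.array([[0.3, 0.6], [0.5, 0.1], [0.9, 0.4]])
z_val = jnp.array([0.7, 0.2, 1.0])
beta = 0.2

def Phi(x, lam):                 # lower-level map: one gradient-descent step on the ridge objective
    return x - beta * (Y.T @ (Y @ x - z) + lam * x)

def F(x, lam):                   # upper-level objective: validation loss
    return 0.5 * jnp.sum((Y_val @ x - z_val) ** 2)

def x_t(lam, t):                 # t fixed-point iterations starting from x_0 = 0
    x = jnp.zeros(2)
    for _ in range(t):
        x = Phi(x, lam)
    return x

# ITD: reverse-mode autodiff through the unrolled iterations gives the gradient of g_t = F(x_t(.), .).
def grad_itd(lam, t=200):
    return jax.grad(lambda l: F(x_t(l, t), l))(lam)

# AID: implicit differentiation at the approximate fixed point x_t(lam).
def grad_aid(lam, t=200):
    x = x_t(lam, t)
    d1Phi = jax.jacobian(Phi, argnums=0)(x, lam)          # ∂1Φ(x_t, λ), a d x d matrix
    v = jax.grad(F, argnums=0)(x, lam)                    # ∇1F(x_t, λ)
    w = jnp.linalg.solve(jnp.eye(2) - d1Phi.T, v)         # solve (I - ∂1Φ^T) w = ∇1F exactly
    d2F = jax.grad(F, argnums=1)(x, lam)                  # ∇2F (zero here: F does not depend on λ)
    d2Phi = jax.jacobian(Phi, argnums=1)(x, lam)          # ∂2Φ(x_t, λ), shape (d,) since λ is scalar
    return d2F + d2Phi @ w                                # ∇̂g(λ) = ∇2F + ∂2Φ^T w

print(grad_itd(0.5), grad_aid(0.5))   # the two approximations of ∇g(0.5) are close for large t
```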

16. A First Comparison
ITD:
• Ignores the bilevel structure.
• Cost in time (RMAD): O(Cost(g_t(λ))).
• Cost in memory (RMAD): O(t d).
• Can we control ‖∇g_t(λ) − ∇g(λ)‖?
AID:
• Can use any lower-level solver.
• Cost in time (k = t): O(Cost(g_t(λ))).
• Cost in memory: O(d).
• Can we control ‖∇̂g(λ) − ∇g(λ)‖?
(Here g_t(λ) = F(x_t(λ), λ).)
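On the toy ridge example these questions can be checked numerically, since the exact hypergradient is available in closed form via the implicit function theorem. The sketch below (Python/JAX, same made-up data as in the sketches above, AID with an exact linear solve) only illustrates the kind of comparison reported in the experiments: both errors shrink as t grows.

```python
import jax
import jax.numpy as jnp

Y = jnp.arange(12.0).reshape(6, 2) / 10.0                 # made-up data, as above
z = jnp.array([0.1, 0.4, 0.3, 0.8, 0.7, 1.2])
Y_val = jnp.array([[0.3, 0.6], [0.5, 0.1], [0.9, 0.4]])
z_val = jnp.array([0.7, 0.2, 1.0])
lam, beta = 0.5, 0.2

Phi = lambda x, l: x - beta * (Y.T @ (Y @ x - z) + l * x)
F = lambda x, l: 0.5 * jnp.sum((Y_val @ x - z_val) ** 2)

def x_t(l, t):
    x = jnp.zeros(2)
    for _ in range(t):
        x = Phi(x, l)
    return x

# Exact hypergradient: x(λ) = H⁻¹ Yᵀz with H = YᵀY + λI, x'(λ) = -H⁻¹ x(λ),
# hence ∇g(λ) = x'(λ)ᵀ ∇1F(x(λ)) = -x(λ)ᵀ H⁻¹ Y_valᵀ (Y_val x(λ) - z_val).
H = Y.T @ Y + lam * jnp.eye(2)
x_star = jnp.linalg.solve(H, Y.T @ z)
grad_exact = -x_star @ jnp.linalg.solve(H, Y_val.T @ (Y_val @ x_star - z_val))

for t in (10, 50, 200):
    g_itd = jax.grad(lambda l: F(x_t(l, t), l))(lam)                      # ITD estimate
    x = x_t(lam, t)                                                       # AID estimate
    w = jnp.linalg.solve(jnp.eye(2) - jax.jacobian(Phi, argnums=0)(x, lam).T,
                         jax.grad(F, argnums=0)(x, lam))
    g_aid = jax.jacobian(Phi, argnums=1)(x, lam) @ w                      # ∇2F = 0 for this F
    print(t, jnp.abs(g_itd - grad_exact), jnp.abs(g_aid - grad_exact))    # both errors shrink with t
```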

17. Previous Work on the Approximation Error
ITD (Franceschi et al., 2018):
• argmin g_t → argmin g as t → ∞.
• We provide non-asymptotic upper bounds on ‖∇g_t(λ) − ∇g(λ)‖.
AID:
• ‖∇̂g(λ) − ∇g(λ)‖ → 0 as t, k → ∞ (Pedregosa, 2016).
• ‖∇̂g(λ) − ∇g(λ)‖ → 0 at a linear rate in t and k for meta-learning with biased regularization (Rajeswaran et al., 2019).
• We provide non-asymptotic upper bounds on ‖∇̂g(λ) − ∇g(λ)‖.
