SLIDE 1

On the Iteration Complexity of Hypergradient Computation

Riccardo Grazzi

Computational Statistics and Machine Learning, Istituto Italiano di Tecnologia. Department of Computer Science, University College London. riccardo.grazzi@iit.it

Joint work with Luca Franceschi, Massimiliano Pontil and Saverio Salzo.


SLIDE 2

Bilevel Optimization Problem

min_{μ ∈ Λ ⊆ ℝ^o} g(μ) := F(x(μ), μ)   (upper-level)

x(μ) = Φ(x(μ), μ)   (lower-level)

  • Hyperparameter optimization, meta-learning.
  • Graph and recurrent neural networks.

How can we solve this optimization problem?

  • Black-box methods (random/grid search, Bayesian optimization, ...).
  • Gradient-based methods exploiting the hypergradient βˆ‡π‘”(πœ‡).


SLIDE 4

Computing the Hypergradient βˆ‡π‘”(πœ‡)

Computing the hypergradient exactly can be very expensive or even impossible. Two common approximation strategies are:

  • 1. Iterative Differentiation (ITD).
  • 2. Approximate Implicit Differentiation (AID).

Which one is the best?

  • Previous works provide mostly qualitative and empirical results.

Can we have quantitative results on the approximation error?

  • Yes! If the fixed point map Ξ¦(β‹…, πœ‡) is a contraction.


SLIDE 6

Our Contributions

Upper bounds on the approximation error for both ITD and AID

  • Both methods achieve non-asymptotic linear convergence rates.
  • We prove that ITD is generally worse than AID in terms of upper bounds.

Extensive experimental comparison among different AID strategies and ITD

  • If Ξ¦(β‹…, πœ‡) is a contraction, the results confirm the theory.
  • If Ξ¦(β‹…, πœ‡) is NOT a contraction, ITD can be still a reliable strategy.

[Figure: hypergradient approximation error vs. iterations t for ITD, FP (k = t), FP (k = 10), CG (k = t) and CG (k = 10) on four problems: Logistic Regression, Kernel Ridge Regression, Biased Regularization and Hyper Representation.]

SLIDE 7

Motivation

Source: S. Ravi, H. Larochelle (2016).

  • Hyperparameter optimization (learn the kernel/regularization, ...).

  • Meta-learning (MAML, L2LOpt, ...).

Source: snap.stanford.edu/proj/embeddings-www

  • Graph Neural Networks.
  • Some Recurrent Models.
  • Deep Equilibrium Models.

All can be cast into the same bilevel framework, where at the lower level we seek the solution to a parametric fixed-point equation.


SLIDE 10

Example: Optimizing the Regularization Hyperparameter in Ridge Regression

min_{μ ∈ (0, ∞)}  (1/2) ‖Y_val x(μ) − z_val‖²

x(μ) = argmin_{x ∈ ℝ^d} { ℓ(x, μ) := (1/2) ‖Yx − z‖² + (μ/2) ‖x‖² }

x(μ) is the unique fixed point of the one-step gradient descent map Φ(x, μ) = x − β ∇₁ℓ(x, μ). If the step size β is sufficiently small, Φ(⋅, μ) is also a contraction.


SLIDE 11

The Bilevel Framework

min_{μ ∈ Λ ⊆ ℝ^o} g(μ) := F(x(μ), μ)   (upper-level)

x(μ) = Φ(x(μ), μ)   (lower-level)

  • π‘₯(πœ‡) ∈ ℝ𝑒 is oΔ§ten not available in closed form.
  • 𝑔 is usually non convex and expensive or impossible to evaluate exactly.
  • βˆ‡π‘” is even harder to evaluate.


SLIDE 13

How to Compute the Hypergradient βˆ‡π‘”(πœ‡)?

Iterative Differentiation (ITD)

  • 1. Set x_0(μ) = 0 and compute, for j = 1, 2, …, t: x_j(μ) = Φ(x_{j−1}(μ), μ).
  • 2. Compute g_t(μ) = F(x_t(μ), μ).
  • 3. Compute ∇g_t(μ) efficiently using reverse (RMAD) or forward (FMAD) mode automatic differentiation.

Approximate Implicit Differentiation (AID)

  • 1. Get x_t(μ) with t steps of a lower-level solver.
  • 2. Compute w_{t,k}(μ) with k steps of a solver for the linear system (I − ∂₁Φ(x_t(μ), μ)^⊤) w = ∇₁F(x_t(μ), μ).
  • 3. Compute the approximate gradient as ∇̂g(μ) := ∇₂F(x_t(μ), μ) + ∂₂Φ(x_t(μ), μ)^⊤ w_{t,k}(μ).

Which one is the best?


SLIDE 16

A First Comparison

ITD

  • Ignores the bilevel structure.
  • Cost in time (RMAD): O(Cost(g_t(μ))).
  • Cost in memory (RMAD): O(t d).
  • Can we control ‖∇g_t(μ) − ∇g(μ)‖?

AID

  • Can use any lower-level solver.
  • Cost in time (k = t): O(Cost(g_t(μ))).
  • Cost in memory: O(d).
  • Can we control ‖∇̂g(μ) − ∇g(μ)‖?

Here g_t(μ) = F(x_t(μ), μ).

SLIDE 17

Previous Work on the Approximation Error

ITD

  • argmin g_t → argmin g as t → ∞ (Franceschi et al., 2018).
  • We provide non-asymptotic upper bounds on ‖∇g_t(μ) − ∇g(μ)‖.

AID

  • ‖∇̂g(μ) − ∇g(μ)‖ → 0 as t, k → ∞ (Pedregosa, 2016).
  • ‖∇̂g(μ) − ∇g(μ)‖ → 0 as t, k → ∞ at a linear rate in t and k for meta-learning with biased regularization (Rajeswaran et al., 2019).
  • We provide non-asymptotic upper bounds on ‖∇̂g(μ) − ∇g(μ)‖.

Here g_t(μ) = F(x_t(μ), μ).


SLIDE 19

Preliminaries

Assumptions

  • Ξ¦(β‹…, πœ‡) is a contraction with constant π‘Ÿπœ‡ < 1.
  • πœ–1Ξ¦(β‹…, πœ‡), πœ–2Ξ¦(β‹…, πœ‡), βˆ‡1𝐹(β‹…, πœ‡) and βˆ‡2𝐹(β‹…, πœ‡) are Lipschitz continuous

⟹ 𝑔 difgerentiable and π‘₯β€²(πœ‡) ∢= (𝐽 βˆ’ πœ–1Ξ¦(π‘₯(πœ‡), πœ‡))βˆ’1πœ–2Ξ¦(π‘₯(πœ‡), πœ‡) βˆ‡π‘”(πœ‡) = βˆ‡2𝐹(π‘₯(πœ‡), πœ‡) + π‘₯β€²(πœ‡)βŠ€βˆ‡1𝐹(π‘₯(πœ‡), πœ‡).
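The expression for x′(μ) comes from differentiating the fixed-point equation; the one-line derivation (a standard implicit function theorem step, made explicit here) reads:

    x(\mu) = \Phi(x(\mu), \mu)
    \;\Longrightarrow\;
    x'(\mu) = \partial_1 \Phi(x(\mu), \mu)\, x'(\mu) + \partial_2 \Phi(x(\mu), \mu)
    \;\Longrightarrow\;
    \bigl(I - \partial_1 \Phi(x(\mu), \mu)\bigr)\, x'(\mu) = \partial_2 \Phi(x(\mu), \mu).

Since Φ(⋅, μ) is a contraction, ‖∂₁Φ(x(μ), μ)‖ ≤ r_μ < 1, so I − ∂₁Φ(x(μ), μ) is invertible and x′(μ) is well defined.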


SLIDE 20

Main Contribution

Theorem (ITD error upper bound)

‖∇g_t(μ) − ∇g(μ)‖ ≤ (d₁(μ) + d₂(μ) r_μ t + d₃(μ)) r_μ^t.

Theorem (AID error upper bound)

Let w_t(μ) := (I − ∂₁Φ(x_t(μ), μ)^⊤)^{−1} ∇₁F(x_t(μ), μ) and assume that

  • ‖x_t(μ) − x(μ)‖ ≤ ρ_μ(t) ‖x(μ)‖,
  • ‖w_{t,k}(μ) − w_t(μ)‖ ≤ ρ_μ(k) ‖w_t(μ)‖.

Then ‖∇̂g(μ) − ∇g(μ)‖ ≤ (d₁(μ) + d₂(μ)/(1 − r_μ)) ρ_μ(t) + d₃(μ) ρ_μ(k).

SLIDE 21

Main Contribution (Part 2)

Efficient solvers for the linear system in AID:

  • fixed point method (FP)
  • conjugate gradient (CG)

Theorem (CG and FP error upper bounds)

Assume that the lower-level problem is solved as in ITD. Then

(FP)  ‖∇̂g(μ) − ∇g(μ)‖ ≤ (d₁(μ) + d₂(μ) (1 − r_μ^k)/(1 − r_μ)) r_μ^t + d₃(μ) r_μ^k.

Moreover, when ∂₁Φ(x_t(μ), μ) is symmetric,

(CG)  ‖∇̂g(μ) − ∇g(μ)‖ ≤ (d₁(μ) + d₂(μ)/(1 − r_μ)) r_μ^t + d₃(μ) d̂(μ) q_μ^k,

where q_μ < r_μ.
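A small NumPy/SciPy sketch contrasting the two solvers on a synthetic symmetric system (A and b below are random stand-ins for ∂₁Φ(x_t(μ), μ)^⊤ and ∇₁F(x_t(μ), μ); the contraction constant 0.9 is an arbitrary choice):

    import numpy as np
    from scipy.sparse.linalg import cg

    rng = np.random.default_rng(0)
    M = rng.standard_normal((50, 50))
    A = M + M.T                                     # symmetric stand-in for d1Phi
    A *= 0.9 / np.abs(np.linalg.eigvalsh(A)).max()  # enforce ||A|| = r = 0.9 < 1
    b = rng.standard_normal(50)                     # stand-in for grad1 F
    w_star = np.linalg.solve(np.eye(50) - A, b)     # exact solution (A symmetric)

    # FP: w <- A w + b; the error contracts like r^k.
    w_fp = np.zeros(50)
    for _ in range(50):
        w_fp = A @ w_fp + b

    # CG on I - A, symmetric positive definite (eigenvalues in [0.1, 1.9]).
    w_cg, _ = cg(np.eye(50) - A, b, maxiter=50)

    print(np.linalg.norm(w_fp - w_star), np.linalg.norm(w_cg - w_star))
    # CG is typically the more accurate of the two at equal k, matching q < r.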

SLIDE 22

So... Which method has the best approximation error?

From our analysis:

  • ITD, CG and FP converge linearly (in t and k) to ∇g(μ).
  • FP bound < ITD bound for every t, when k = t.
  • CG bound < FP bound for k big enough, when ∂₁Φ(x_t(μ), μ) is symmetric.

Is this also true for the actual error in practice? What happens when Φ(⋅, μ) is not a contraction?


SLIDE 24

Hypergradient Approximation on Synthetic Data

[Figure: hypergradient approximation error vs. iterations t for ITD, FP (k = t), FP (k = 10), CG (k = t) and CG (k = 10) on four problems: Logistic Regression, Kernel Ridge Regression, Biased Regularization and Hyper Representation.]

Hypergradient approximation errors (mean/std over randomly drawn values of μ). h(μ) equals ∇g_t(μ) for ITD and ∇̂g(μ) for CG and FP. In all settings Φ(⋅, μ) is a contraction and ∂₁Φ(x, μ) is symmetric.

  • After a while, the error decreases linearly for all methods.
  • Methods with lower error bounds have lower error on average.


SLIDE 25

Equilibrium Models¹ on MNIST (Proof of Concept)

min_{δ = (B, C, d), ι}  ∑_{j=1}^{n} CE(x_j(δ)^⊤ ι, z_j),   x_j(δ) = ϱ_j(x_j(δ), δ) = tanh(B x_j(δ) + C y_j + d)

[Figure: training objective vs. time; test accuracy vs. iterations; hypergradient norm vs. iterations; and final test accuracy vs. learning rate, for CG, FP, ITD and their † variants.]

πœšπ‘—(β‹…, 𝛿) NOT a contraction for † methods.

  • When πœšπ‘—(β‹…, 𝛿) is a contraction all the methods perform similarly.
  • ITD is the most stable when the contraction assumption is not satisfied.
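For concreteness, a minimal PyTorch sketch of this lower-level fixed-point layer for one example (the shapes, scales and the spectral rescaling of B are illustrative assumptions that keep ϱ_j(⋅, δ) a contraction; the † runs above remove such a constraint):

    import torch

    h, p = 64, 784                               # hidden and input sizes (made up)
    B = torch.randn(h, h)
    B *= 0.9 / torch.linalg.matrix_norm(B, 2)    # ||B|| < 1 and tanh is 1-Lipschitz,
                                                 # so the map below is a contraction
    C = torch.randn(h, p) * 0.01
    d = torch.zeros(h)

    def equilibrium_state(y, t=100):
        """Iterate x <- tanh(B x + C y + d) towards the fixed point x_j(delta)."""
        x = torch.zeros(h)
        for _ in range(t):
            x = torch.tanh(B @ x + C @ y + d)
        return x

    y = torch.randn(p)                           # one (fake) input example
    x_star = equilibrium_state(y)
    residual = torch.norm(x_star - torch.tanh(B @ x_star + C @ y + d))
    print(residual)                              # small: x_star is near the fixed point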

¹ Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. “Deep equilibrium models”. In Advances in Neural Information Processing Systems, 2019, pp. 688–699.


SLIDE 26

Conclusions

We studied the iteration complexity of two strategies for approximating the hypergradient in bilevel problems: iterative differentiation (ITD) and approximate implicit differentiation (AID). We proved non-asymptotic upper bounds on the approximation error:

  • CG, FP and ITD converge linearly to the exact hypergradient.
  • ITD is generally worse than AID in terms of upper bounds.

We conducted experiments comparing ITD and AID

  • If Ξ¦(β‹…, πœ‡) is a contraction, the results confirm the theory.
  • If Ξ¦(β‹…, πœ‡) is NOT a contraction, ITD can be still a reliable strategy.


SLIDE 27

Thank you for your attention.

CODE (PyTorch): https://github.com/prolearner/hypertorch


SLIDE 28

References

Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. (2018). Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pages 1563–1572.

Pedregosa, F. (2016). Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pages 737–746.

Rajeswaran, A., Finn, C., Kakade, S. M., and Levine, S. (2019). Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems, pages 113–124.
