1. Optimization-Based Meta-Learning (finishing from last time) and Non-Parametric Few-Shot Learning. CS 330

2. Logistics
- Homework 1 due; Homework 2 out this Wednesday
- Fill out poster presentation preferences! (Tues 12/3 or Weds 12/4)
- Course project details & suggestions posted; proposal due Monday 10/28

3. Plan for Today
- Optimization-Based Meta-Learning: recap & discuss advanced topics
- Non-Parametric Few-Shot Learning: Siamese networks, matching networks, prototypical networks
- Properties of Meta-Learning Algorithms: comparison of approaches

4. Recap from Last Time
Fine-tuning: start from pre-trained parameters $\theta$ and adapt on the new task's training data, $\phi \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}(\theta, \mathcal{D}^{\mathrm{tr}})$; evaluate on test data for the new task.
MAML: $\min_\theta \sum_i \mathcal{L}\big(\theta - \alpha \nabla_\theta \mathcal{L}(\theta, \mathcal{D}_i^{\mathrm{tr}}),\ \mathcal{D}_i^{\mathrm{ts}}\big)$
Optimizes for an effective initialization for fine-tuning.
Discussed: performance on extrapolated tasks, expressive power.
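To make the bi-level structure concrete, here is a minimal sketch of one MAML meta-training step in PyTorch, assuming placeholder `model`, `loss_fn`, `tasks`, and `meta_opt` objects; it illustrates the update, and is not the course's reference implementation.

```python
import torch
from torch.func import functional_call

def maml_step(model, loss_fn, tasks, meta_opt, inner_lr=0.01):
    """One MAML meta-training step: adapt per task on D^tr, update theta on D^ts."""
    # `tasks` is assumed to yield (x_tr, y_tr, x_ts, y_ts) tensors for each task.
    names, params = zip(*model.named_parameters())
    meta_loss = 0.0
    for x_tr, y_tr, x_ts, y_ts in tasks:
        # Inner loop: one gradient step on the task's support (training) set.
        tr_loss = loss_fn(functional_call(model, dict(zip(names, params)), (x_tr,)), y_tr)
        grads = torch.autograd.grad(tr_loss, params, create_graph=True)
        adapted = {n: p - inner_lr * g for n, p, g in zip(names, params, grads)}
        # Outer objective: loss of the adapted parameters on the query (test) set.
        meta_loss = meta_loss + loss_fn(functional_call(model, adapted, (x_ts,)), y_ts)
    meta_opt.zero_grad()
    meta_loss.backward()  # backprop through the inner gradient step (second-order)
    meta_opt.step()
```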

5. Probabilistic Interpretation of Optimization-Based Inference
Key idea: Acquire $\phi$ through optimization. Meta-parameters $\theta$ serve as a prior. One form of prior knowledge: the initialization for fine-tuning.
The task-specific parameters are obtained as a MAP estimate under that prior (empirical Bayes).
How to compute the MAP estimate? Gradient descent with early stopping = MAP inference under a Gaussian prior with mean at the initial parameters [Santos '96] (exact in the linear case, approximate in the nonlinear case).
MAML approximates hierarchical Bayesian inference. (Grant et al., ICLR '18)
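For reference, the empirical-Bayes objective behind this interpretation can be written roughly as follows, with the integral over task-specific parameters replaced by a MAP point estimate; the notation here is a reconstruction in the spirit of Grant et al., not copied from the slides.

$$
\max_\theta \sum_i \log p(\mathcal{D}_i^{\mathrm{ts}} \mid \mathcal{D}_i^{\mathrm{tr}}, \theta),
\qquad
p(\mathcal{D}_i^{\mathrm{ts}} \mid \mathcal{D}_i^{\mathrm{tr}}, \theta)
= \int p(\mathcal{D}_i^{\mathrm{ts}} \mid \phi_i)\, p(\phi_i \mid \mathcal{D}_i^{\mathrm{tr}}, \theta)\, d\phi_i
\approx p(\mathcal{D}_i^{\mathrm{ts}} \mid \hat{\phi}_i),
$$
$$
\hat{\phi}_i = \arg\max_{\phi}\ \log p(\mathcal{D}_i^{\mathrm{tr}} \mid \phi) + \log p(\phi \mid \theta).
$$

Running gradient descent on $-\log p(\mathcal{D}_i^{\mathrm{tr}} \mid \phi)$ from the initialization $\theta$ with early stopping implicitly supplies the Gaussian prior term, per the Santos '96 result above.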

6. Optimization-Based Inference
Key idea: Acquire $\phi$ through optimization. Meta-parameters serve as a prior. One form of prior knowledge: the initialization for fine-tuning, $\phi \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}(\theta, \mathcal{D}^{\mathrm{tr}})$.
Gradient descent + early stopping (MAML): implicit Gaussian prior. Other forms of priors?
- Gradient descent with an explicit Gaussian prior (Rajeswaran et al., implicit MAML '19)
- Bayesian linear regression on learned features (Harrison et al., ALPaCA '18)
- Closed-form or convex optimization on learned features: ridge regression, logistic regression, support vector machine (Bertinetto et al., R2-D2 '19; Lee et al., MetaOptNet '19). Current SOTA on few-shot image classification.
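As a concrete example of a closed-form inner loop, here is a minimal sketch of ridge regression on learned features in the style of R2-D2; the encoder, the regularization strength `lam`, and the one-hot targets are placeholder assumptions rather than the paper's exact formulation.

```python
import torch

def ridge_head(features_tr, y_tr_onehot, features_ts, lam=1.0):
    """Closed-form ridge-regression 'inner loop' on learned features.
    features_tr: [N, d], features_ts: [Q, d] embeddings from a meta-learned encoder;
    y_tr_onehot: [N, C] one-hot support labels."""
    d = features_tr.shape[1]
    A = features_tr.T @ features_tr + lam * torch.eye(d, device=features_tr.device)
    W = torch.linalg.solve(A, features_tr.T @ y_tr_onehot)  # [d, C] closed-form solution
    return features_ts @ W                                   # class scores for the query set
```

Because the solve is differentiable, the outer loop can backpropagate through this inner solution into the encoder without unrolling any gradient steps.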

7. Optimization-Based Inference
Key idea: Acquire $\phi$ through optimization.
Challenge: How do you choose an architecture that is effective for the inner gradient step?
Idea: Progressive neural architecture search + MAML (Kim et al., Auto-Meta)
- finds highly non-standard architectures (deep & narrow)
- different from architectures that work well for standard supervised learning
MAML, basic architecture: 63.11% on MiniImagenet 5-way 5-shot; MAML + AutoMeta: 74.65%

8. Optimization-Based Inference
Key idea: Acquire $\phi$ through optimization.
Challenge: Bi-level optimization can exhibit instabilities.
Idea: Automatically learn a per-parameter (vector) inner learning rate and tune the outer learning rate (Li et al., Meta-SGD; Behl et al., AlphaMAML); see the sketch below.
Idea: Optimize only a subset of the parameters in the inner loop (Zhou et al., DEML; Zintgraf et al., CAVIA)
Idea: Decouple the inner learning rate and batch-norm statistics per step (Antoniou et al., MAML++)
Idea: Introduce context variables for increased expressive power (Finn et al., bias transformation; Zintgraf et al., CAVIA)
Takeaway: a range of simple tricks can help optimization significantly.
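A minimal sketch of the Meta-SGD-style inner step referenced above, where the learning rate is a learned tensor per parameter; the names and the surrounding training loop are illustrative assumptions, not the papers' code.

```python
import torch
from torch.func import functional_call

def meta_sgd_inner_step(model, loss_fn, lrs, x_tr, y_tr):
    """One inner step with learned per-parameter learning rates (Meta-SGD style).
    `lrs` maps parameter names to tensors of the same shape, meta-learned
    alongside the initialization by the outer loop."""
    names, params = zip(*model.named_parameters())
    tr_loss = loss_fn(functional_call(model, dict(zip(names, params)), (x_tr,)), y_tr)
    grads = torch.autograd.grad(tr_loss, params, create_graph=True)
    # Elementwise learned step sizes replace the single scalar alpha of vanilla MAML.
    return {n: p - lrs[n] * g for n, p, g in zip(names, params, grads)}
```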

9. Optimization-Based Inference
Key idea: Acquire $\phi$ through optimization.
Challenge: Backpropagating through many inner gradient steps is compute- and memory-intensive.
Idea: [Crudely] approximate the derivative of the inner update as the identity (Finn et al., first-order MAML '17; Nichol et al., Reptile '18); a sketch follows below.
Takeaway: works for simple few-shot problems, but (anecdotally) not for more complex meta-learning problems.
Can we compute the meta-gradient without differentiating through the optimization path? -> whiteboard
Idea: Derive the meta-gradient using the implicit function theorem (Rajeswaran, Finn, Kakade, Levine. Implicit MAML '19)
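A minimal sketch of the first-order approximation, under the same placeholder assumptions as the earlier MAML sketch: the inner gradient is taken without building a graph, so $d\phi/d\theta$ is treated as the identity and the query-set gradient at the adapted parameters is applied directly to the initialization.

```python
import torch
from torch.func import functional_call

def fomaml_step(model, loss_fn, tasks, meta_opt, inner_lr=0.01):
    # `tasks` is assumed to yield (x_tr, y_tr, x_ts, y_ts) tensors per task.
    names, params = zip(*model.named_parameters())
    meta_opt.zero_grad()
    for x_tr, y_tr, x_ts, y_ts in tasks:
        # Inner step WITHOUT create_graph: second-order terms are dropped.
        tr_loss = loss_fn(functional_call(model, dict(zip(names, params)), (x_tr,)), y_tr)
        grads = torch.autograd.grad(tr_loss, params)
        adapted = {n: (p - inner_lr * g).detach().requires_grad_()
                   for n, p, g in zip(names, params, grads)}
        # Outer gradient, evaluated at the adapted parameters...
        ts_loss = loss_fn(functional_call(model, adapted, (x_ts,)), y_ts)
        outer_grads = torch.autograd.grad(ts_loss, list(adapted.values()))
        # ...is applied to the shared initialization (first-order MAML).
        for p, g in zip(params, outer_grads):
            p.grad = g if p.grad is None else p.grad + g
    meta_opt.step()
```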

10. Optimization-Based Inference
Can we compute the meta-gradient without differentiating through the optimization path?
Idea: Derive the meta-gradient using the implicit function theorem (Rajeswaran, Finn, Kakade, Levine. Implicit MAML)
- Memory and computation trade-offs
- Allows for second-order optimizers in the inner loop
- A very recent development (NeurIPS '19), thus all the typical caveats that come with recent work
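For reference, the implicit-gradient idea has roughly the following form (a reconstruction based on the iMAML paper's regularized inner problem, not taken from the slides): the inner loop solves a problem regularized toward $\theta$, and the implicit function theorem gives the derivative of its solution.

$$
\phi_i(\theta) = \arg\min_{\phi}\ \mathcal{L}(\phi, \mathcal{D}_i^{\mathrm{tr}}) + \frac{\lambda}{2}\,\lVert \phi - \theta \rVert^2,
\qquad
\frac{d\phi_i}{d\theta} = \Big(I + \tfrac{1}{\lambda}\,\nabla^2_{\phi}\mathcal{L}(\phi_i, \mathcal{D}_i^{\mathrm{tr}})\Big)^{-1}.
$$

Only the inner-loop solution $\phi_i$ enters this expression, not the path taken to reach it; approximating the inverse (e.g., with conjugate-gradient iterations, as in the paper) is the source of the memory and computation trade-offs noted above.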

11. Optimization-Based Inference
Key idea: Acquire $\phi$ through optimization.
Takeaways: construct a bi-level optimization problem.
+ positive inductive bias at the start of meta-learning
+ consistent procedure, tends to extrapolate better
+ maximally expressive with a sufficiently deep network
+ model-agnostic (easy to combine with your favorite architecture)
- typically requires second-order optimization
- usually compute- and/or memory-intensive
Can we embed a learning procedure without second-order optimization?

12. Plan for Today
- Optimization-Based Meta-Learning: recap & discuss advanced topics
- Non-Parametric Few-Shot Learning: Siamese networks, matching networks, prototypical networks
- Properties of Meta-Learning Algorithms: comparison of approaches

13. So far: learning parametric models. In low-data regimes, non-parametric methods are simple and work well.
During meta-test time: few-shot learning <-> low-data regime
During meta-training: we still want to be parametric
Can we use parametric meta-learners that produce effective non-parametric learners?
Note: some of these methods precede the parametric approaches.

14. Non-parametric methods
Key idea: Use a non-parametric learner.
Given the training data $\mathcal{D}_i^{\mathrm{tr}}$ and a test datapoint, compare the test image with the training images.
In what space do you compare? With what distance metric? Pixel space, $\ell_2$ distance?

15. In what space do you compare? With what distance metric? Pixel space, $\ell_2$ distance? (Zhang et al., arXiv:1801.03924)

16. Non-parametric methods
Key idea: Use a non-parametric learner.
Compare the test image with the training images in $\mathcal{D}_i^{\mathrm{tr}}$. In what space do you compare? With what distance metric? Rather than pixel space and $\ell_2$ distance, learn to compare using the meta-training data!

17-19. Non-parametric methods
Key idea: Use a non-parametric learner. Train a Siamese network to predict whether or not two images are of the same class (these slides step through example image pairs labeled 0, 1, and 0). Koch et al., ICML '15

20. Non-parametric methods
Key idea: Use a non-parametric learner. Train a Siamese network to predict whether or not two images are of the same class (label 1).
Meta-test time: compare the test image to each image in $\mathcal{D}_j^{\mathrm{tr}}$.
Meta-training: binary classification. Meta-test: N-way classification. Can we match meta-train & meta-test?
Koch et al., ICML '15
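A minimal sketch of the Siamese verification setup described above, assuming a placeholder embedding network; it illustrates the idea rather than Koch et al.'s exact architecture.

```python
import torch
import torch.nn as nn

class SiameseVerifier(nn.Module):
    """Scores whether two images belong to the same class (1) or not (0)."""
    def __init__(self, embed: nn.Module, embed_dim: int):
        super().__init__()
        self.embed = embed                   # shared encoder applied to both images
        self.head = nn.Linear(embed_dim, 1)  # scores the elementwise embedding difference

    def forward(self, x1, x2):
        z1, z2 = self.embed(x1), self.embed(x2)
        return torch.sigmoid(self.head(torch.abs(z1 - z2))).squeeze(-1)

# Meta-training: binary cross-entropy on same-class / different-class pairs.
# Meta-test: compare a query image against each support image and predict the
# class of the highest-scoring match.
```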

21. Non-parametric methods
Key idea: Use a non-parametric learner.
Can we match meta-train & meta-test? Use nearest neighbors in a learned embedding space:
$$\hat{y}^{\mathrm{ts}} = \sum_{(x_k, y_k) \in \mathcal{D}_i^{\mathrm{tr}}} f_\theta(x^{\mathrm{ts}}, x_k)\, y_k$$
A convolutional encoder embeds the images and a bidirectional LSTM produces the embeddings used for comparison; the model is trained end-to-end, and the meta-train and meta-test procedures match. Predictions are made for $\mathcal{D}_i^{\mathrm{ts}}$.
Vinyals et al., Matching Networks, NeurIPS '16
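A minimal sketch of the matching-networks prediction rule above (attention over support embeddings), assuming a placeholder encoder and softmaxed cosine-similarity attention; the full-context LSTM embeddings from the paper are omitted.

```python
import torch
import torch.nn.functional as F

def matching_net_predict(encoder, x_support, y_support_onehot, x_query):
    """y_hat = sum_k a(x_query, x_k) * y_k with attention a from softmaxed cosine similarity.
    x_support: [N, ...], y_support_onehot: [N, C], x_query: [Q, ...]."""
    z_s = F.normalize(encoder(x_support), dim=-1)  # [N, d] support embeddings
    z_q = F.normalize(encoder(x_query), dim=-1)    # [Q, d] query embeddings
    attn = F.softmax(z_q @ z_s.T, dim=-1)          # [Q, N] attention over support examples
    return attn @ y_support_onehot                 # [Q, C] predicted label distribution
```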

22. Non-parametric methods
Key idea: Use a non-parametric learner.
General algorithm, comparing the amortized approach to the non-parametric approach (matching networks):
1. Sample task $\mathcal{T}_i$ (or a mini-batch of tasks)
2. Sample disjoint datasets $\mathcal{D}_i^{\mathrm{tr}}$, $\mathcal{D}_i^{\mathrm{test}}$ from $\mathcal{D}_i$
3. Amortized: compute $\phi_i \leftarrow f_\theta(\mathcal{D}_i^{\mathrm{tr}})$. Non-parametric: compute $\hat{y}^{\mathrm{ts}} = \sum_{(x_k, y_k) \in \mathcal{D}^{\mathrm{tr}}} f_\theta(x^{\mathrm{ts}}, x_k)\, y_k$ (task parameters $\phi$ are integrated out, hence non-parametric)
4. Amortized: update $\theta$ using $\nabla_\theta \mathcal{L}(\phi_i, \mathcal{D}_i^{\mathrm{test}})$. Non-parametric: update $\theta$ using $\nabla_\theta \mathcal{L}(\hat{y}^{\mathrm{ts}}, y^{\mathrm{ts}})$
Matching networks perform each comparison independently. What if >1 shot? Can we aggregate class information to create a prototypical embedding?

23. Non-parametric methods
Key idea: Use a non-parametric learner.
Prototypical networks: average the embeddings of each class's support examples into a prototype, then classify with a softmax over negative distances to the prototypes:
$$c_n = \frac{1}{K} \sum_{(x, y) \in \mathcal{D}_i^{\mathrm{tr}}} \mathbf{1}(y = n)\, f_\theta(x),
\qquad
p_\theta(y = n \mid x) = \frac{\exp(-d(f_\theta(x), c_n))}{\sum_{n'} \exp(-d(f_\theta(x), c_{n'}))}$$
d: Euclidean or cosine distance.
Snell et al., Prototypical Networks, NeurIPS '17
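A minimal sketch of the prototype computation and distance-softmax classification above, assuming a placeholder encoder and squared Euclidean distance.

```python
import torch

def proto_net_logits(encoder, x_support, y_support, x_query, n_classes):
    """Class prototypes = mean support embedding per class; logits = -squared distance.
    y_support holds integer class labels in [0, n_classes)."""
    z_s = encoder(x_support)   # [N, d] support embeddings
    z_q = encoder(x_query)     # [Q, d] query embeddings
    prototypes = torch.stack(
        [z_s[y_support == n].mean(dim=0) for n in range(n_classes)])  # [C, d]
    return -torch.cdist(z_q, prototypes) ** 2  # [Q, C]; train with cross-entropy on these
```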

24. Non-parametric methods
So far: Siamese networks, matching networks, prototypical networks. Embed, then apply nearest neighbors.
Challenge: What if you need to reason about more complex relationships between datapoints?
Idea: Learn a non-linear relation module on the embeddings, i.e. learn the distance $d$ used in prototypical networks (Sung et al., Relation Net)
Idea: Learn an infinite mixture of prototypes (Allen et al., IMP, ICML '19)
Idea: Perform message passing on the embeddings (Garcia & Bruna, GNN)

25. Plan for Today
- Optimization-Based Meta-Learning: recap & discuss advanced topics
- Non-Parametric Few-Shot Learning: Siamese networks, matching networks, prototypical networks
- Properties of Meta-Learning Algorithms: comparison of approaches
How can we think about how these methods compare?
