Bayesian Meta-Learning


  1. Bayesian Meta-Learning, CS 330

  2. Reminders: Homework 2 due next Friday. Project group form due today. Project proposal due in one week. Project proposal presentations in one week (full schedule released on Friday).

  3. Plan for Today
     - Why be Bayesian?
     - Bayesian meta-learning approaches: black-box approaches, optimization-based approaches
     - How to evaluate Bayesian meta-learners.
     Goals by the end of lecture:
     - Understand the interpretation of meta-learning as Bayesian inference
     - Understand techniques for representing uncertainty over parameters and predictions

  4. Disclaimers: Bayesian meta-learning is an active area of research (like most of the class content). More questions than answers. This lecture covers some of the most advanced and mathiest topics of the course, so ask questions!

  5. Recap from last week: the computation graph perspective.
     - Black-box: $y^{ts} = f_\theta(\mathcal{D}^{tr}_i, x^{ts})$
     - Optimization-based: $y^{ts} = f_{\phi_i}(x^{ts})$, with $\phi_i$ adapted from $\theta$ via gradient descent on $\mathcal{D}^{tr}_i$
     - Non-parametric: $y^{ts} = \mathrm{softmax}(-d(f_\theta(x^{ts}), c_n))$, where $c_n = \frac{1}{K} \sum_{(x,y) \in \mathcal{D}^{tr}_i} \mathbb{1}(y = n)\, f_\theta(x)$
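
For concreteness, a minimal sketch of the non-parametric (prototype-based) prediction above, assuming a PyTorch feature extractor `f_theta` and squared Euclidean distance as the choice of $d$ (both the helper names and the distance choice are illustrative assumptions, not prescribed by the slide):

```python
import torch
import torch.nn.functional as F

def nonparametric_predict(f_theta, support_x, support_y, query_x, num_classes):
    """Prototype prediction: c_n = mean embedding of class-n support examples,
    y_ts = softmax over negative distances from the query embedding to each prototype."""
    z_support = f_theta(support_x)                 # embeddings of D^tr_i, shape [N*K, dim]
    z_query = f_theta(query_x)                     # embeddings of x^ts, shape [Q, dim]
    prototypes = torch.stack([
        z_support[support_y == n].mean(dim=0)      # c_n: average embedding of class n
        for n in range(num_classes)
    ])                                             # shape [num_classes, dim]
    d = torch.cdist(z_query, prototypes) ** 2      # squared Euclidean distance to each prototype
    return F.softmax(-d, dim=-1)                   # distribution over the N classes
```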

  6. Recap from last week: the algorithmic properties perspective.
     - Expressive power: the ability of f to represent a range of learning procedures. Why? Scalability, applicability to a range of domains.
     - Consistency: the learned learning procedure will solve the task with enough data. Why? Reduces reliance on meta-training tasks, good OOD task performance.
     These properties are important for most applications!

  7. Recap from last week: the algorithmic properties perspective.
     - Expressive power: the ability of f to represent a range of learning procedures. Why? Scalability, applicability to a range of domains.
     - Consistency: the learned learning procedure will solve the task with enough data. Why? Reduces reliance on meta-training tasks, good OOD task performance.
     - Uncertainty awareness: the ability to reason about ambiguity during learning. Why? Active learning, calibrated uncertainty, RL, principled Bayesian approaches. *this lecture*

  8. Plan for Today
     - Why be Bayesian?
     - Bayesian meta-learning approaches: black-box approaches, optimization-based approaches
     - How to evaluate Bayesian meta-learners.

  9. Multi-Task & Meta-Learning Principles
     Training and testing must match. Tasks must share "structure." What does "structure" mean? Statistical dependence on shared latent information $\theta$.
     If you condition on that information:
     - task parameters become independent, i.e. $\phi_{i_1} \perp \phi_{i_2} \mid \theta$, and are not otherwise independent, $\phi_{i_1} \not\perp \phi_{i_2}$
     - hence, you have lower entropy, i.e. $\mathcal{H}(p(\phi_i \mid \theta)) < \mathcal{H}(p(\phi_i))$
     Thought exercise #1: If you can identify $\theta$ (i.e. with meta-learning), when should learning $\phi_i$ be faster than learning from scratch?
     Thought exercise #2: What if $\mathcal{H}(p(\phi_i \mid \theta)) = 0$ for all $i$?
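
To spell out the "statistical dependence on shared latent information" claim, here is one standard way to write the hierarchical structure implied above (a sketch in the slide's notation, not an equation taken from the deck):

```latex
% Task parameters are coupled only through the shared latent variable \theta:
% jointly they are dependent, but conditioned on \theta they factorize.
p(\phi_{i_1}, \phi_{i_2})
  = \int p(\theta)\, p(\phi_{i_1} \mid \theta)\, p(\phi_{i_2} \mid \theta)\, d\theta
  \;\neq\; p(\phi_{i_1})\, p(\phi_{i_2}),
\qquad
p(\phi_{i_1}, \phi_{i_2} \mid \theta) = p(\phi_{i_1} \mid \theta)\, p(\phi_{i_2} \mid \theta).
```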

  10. Multi-Task & Meta-Learning Principles
      Training and testing must match. Tasks must share "structure." What does "structure" mean? Statistical dependence on shared latent information $\theta$.
      What information might $\theta$ contain...
      ...in a toy sinusoid problem? $\theta$ corresponds to the family of sinusoid functions (everything but phase and amplitude).
      ...in multi-language machine translation? $\theta$ corresponds to the family of all language pairs.
      Note that $\theta$ is narrower than the space of all possible functions.
      Thought exercise #3: What if you meta-learn without a lot of tasks? "Meta-overfitting" to the family of training functions.

  11. Recall parametric approaches: use a deterministic $p(\phi_i \mid \mathcal{D}^{tr}_i, \theta)$ (i.e. a point estimate).
      Why/when is this a problem? Few-shot learning problems may be ambiguous (even with the prior).
      Can we learn to generate hypotheses about the underlying function, i.e. sample from $p(\phi_i \mid \mathcal{D}^{tr}_i, \theta)$?
      Important for:
      - safety-critical few-shot learning (e.g. medical imaging)
      - learning to actively learn
      - learning to explore in meta-RL
      Active learning w/ meta-learning: Woodward & Finn '16, Konyushkova et al. '17, Bachman et al. '17

  12. Plan for Today
      - Why be Bayesian?
      - Bayesian meta-learning approaches: black-box approaches, optimization-based approaches
      - How to evaluate Bayesian meta-learners.

  13. The computation graph perspective (recap):
      - Black-box: $y^{ts} = f_\theta(\mathcal{D}^{tr}_i, x^{ts})$
      - Optimization-based: $y^{ts} = f_{\phi_i}(x^{ts})$, with $\phi_i$ adapted from $\theta$ on $\mathcal{D}^{tr}_i$
      - Non-parametric: $y^{ts} = \mathrm{softmax}(-d(f_\theta(x^{ts}), c_n))$, where $c_n = \frac{1}{K} \sum_{(x,y) \in \mathcal{D}^{tr}_i} \mathbb{1}(y = n)\, f_\theta(x)$
      Version 0: Let $f$ output the parameters of a distribution over $y^{ts}$. For example:
      - probability values of a discrete categorical distribution
      - mean and variance of a Gaussian
      - means, variances, and mixture weights of a mixture of Gaussians
      - for multi-dimensional $y^{ts}$: parameters of a sequence of distributions (i.e. an autoregressive model)
      Then, optimize with maximum likelihood.

  14. Version 0: Let $f$ output the parameters of a distribution over $y^{ts}$. For example:
      - probability values of a discrete categorical distribution
      - mean and variance of a Gaussian
      - means, variances, and mixture weights of a mixture of Gaussians
      - for multi-dimensional $y^{ts}$: parameters of a sequence of distributions (i.e. an autoregressive model)
      Then, optimize with maximum likelihood.
      Pros:
      + simple
      + can combine with a variety of methods
      Cons:
      - can't reason about uncertainty over the underlying function [to determine how uncertainty across datapoints relates]
      - only a limited class of distributions over $y^{ts}$ can be expressed
      - tends to produce poorly-calibrated uncertainty estimates
      Thought exercise #4: Can you do the same maximum likelihood training for $\phi$?
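
As a concrete picture of "Version 0" in the Gaussian case, a minimal PyTorch sketch; the module name, hidden size, and the assumption that the black-box network's encoding of $\mathcal{D}^{tr}_i$ and $x^{ts}$ is already computed are all illustrative placeholders:

```python
import torch
import torch.nn as nn

class GaussianOutputHead(nn.Module):
    """Output head for a black-box meta-learner ("Version 0"): given features built
    from (D^tr_i, x^ts), output the mean and log-variance of a Gaussian over y^ts."""
    def __init__(self, in_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, 2))   # [mean, log-variance]

    def forward(self, task_and_query_features):
        mean, log_var = self.net(task_and_query_features).chunk(2, dim=-1)
        return mean, log_var

def gaussian_nll(mean, log_var, y_ts):
    """Maximum likelihood = minimize the Gaussian negative log-likelihood of y^ts
    (dropping the constant 0.5*log(2*pi) term)."""
    return 0.5 * (log_var + (y_ts - mean) ** 2 / log_var.exp()).mean()
```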

  15. The Bayesian Deep Learning Toolbox: a broad one-slide overview (CS 236 provides a thorough treatment).
      Goal: represent distributions (over data, and over everything else) with neural networks.
      - Latent variable models + variational inference (Kingma & Welling '13, Rezende et al. '14): approximate the likelihood of a latent variable model with a variational lower bound
      - Bayesian ensembles (Lakshminarayanan et al. '17): particle-based representation; train separate models on bootstraps of the data
      - Bayesian neural networks (Blundell et al. '15): explicit distribution over the space of network parameters
      - Normalizing flows (Dinh et al. '16): invertible function from latent distribution to data distribution
      - Energy-based models & GANs (LeCun et al. '06, Goodfellow et al. '14): estimate an unnormalized density
      We'll see how we can leverage the first two. The others could be useful in developing new methods.
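
For the second toolbox entry (Bayesian ensembles), a minimal bootstrap-ensemble sketch; `make_model` and `train_step` are hypothetical stand-ins for whatever model constructor and optimizer step the reader chooses:

```python
import torch

def train_bootstrap_ensemble(make_model, train_step, data_x, data_y,
                             num_models=5, num_steps=1000):
    """Particle-based uncertainty: train separate models on bootstrap resamples of the data."""
    models = []
    n = data_x.shape[0]
    for _ in range(num_models):
        model = make_model()
        idx = torch.randint(0, n, (n,))        # resample n points with replacement
        for _ in range(num_steps):
            train_step(model, data_x[idx], data_y[idx])
        models.append(model)
    return models

def ensemble_predict(models, x):
    """The spread of the ensemble's predictions serves as an uncertainty estimate."""
    preds = torch.stack([m(x) for m in models])   # [num_models, ...]
    return preds.mean(dim=0), preds.var(dim=0)
```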

  16. Background: The Variational Lower Bound
      Observed variable $x$, latent variable $z$.
      ELBO: $\log p(x) \ge \mathbb{E}_{q(z \mid x)}[\log p(x, z)] + \mathcal{H}(q(z \mid x))$
      Can also be written as: $\mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{KL}(q(z \mid x) \,\|\, p(z))$
      $p$: model, with model parameters $\theta$; $p(x \mid z)$ represented with a neural net, $p(z)$ represented as $\mathcal{N}(0, I)$
      $q(z \mid x)$: inference network, variational distribution, with variational parameters $\phi$
      Problem: we need to backprop through sampling, i.e. compute the derivative of $\mathbb{E}_q[\cdot]$ w.r.t. $q$'s parameters.
      Reparametrization trick: for Gaussian $q(z \mid x)$, sample $z = \mu_q + \sigma_q \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$.
      Can we use amortized variational inference for meta-learning?
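
A minimal sketch of this ELBO with the reparameterization trick, assuming a Gaussian $q(z \mid x)$, a standard normal prior, and hypothetical `encoder`/`decoder` callables (the decoder is assumed to return a torch distribution over $x$):

```python
import torch
from torch.distributions import Normal, kl_divergence

def elbo(encoder, decoder, x):
    """ELBO = E_{q(z|x)}[log p(x|z)] - KL(q(z|x) || p(z)), with one reparameterized sample."""
    mu_q, log_sigma_q = encoder(x)                    # variational parameters of q(z|x)
    q = Normal(mu_q, log_sigma_q.exp())
    z = q.rsample()                                   # reparameterization: z = mu + sigma * eps
    log_px_given_z = decoder(z).log_prob(x).sum(-1)   # reconstruction term log p(x|z)
    prior = Normal(torch.zeros_like(mu_q), torch.ones_like(mu_q))
    kl = kl_divergence(q, prior).sum(-1)              # closed-form KL between Gaussians
    return (log_px_given_z - kl).mean()               # maximize this (minimize its negative)
```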

  17. Bayesian black-box meta-learning with standard, deep variational inference.
      Inference network $q(\phi_i \mid \mathcal{D}^{tr}_i)$, represented by a neural net. What should $q$ condition on? The task training data $\mathcal{D}^{tr}_i$.
      Standard VAE (observed variable $x$, latent variable $z$):
      ELBO: $\mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{KL}(q(z \mid x) \,\|\, p(z))$
      $p$: model, represented by a neural net; $q$: inference network, variational distribution.
      Meta-learning (observed variable $\mathcal{D}_i$, latent variable $\phi_i$):
      $\max \; \mathbb{E}_{q(\phi \mid \mathcal{D}^{tr})}[\log p(\mathcal{D} \mid \phi)] - D_{KL}(q(\phi \mid \mathcal{D}^{tr}) \,\|\, p(\phi))$
      $\max \; \mathbb{E}_{q(\phi \mid \mathcal{D}^{tr})}[\log p(y^{ts} \mid x^{ts}, \phi)] - D_{KL}(q(\phi \mid \mathcal{D}^{tr}) \,\|\, p(\phi))$
      What about the meta-parameters $\theta$? We can also condition $q$ on $\theta$ here:
      $\mathbb{E}_{q(\phi \mid \mathcal{D}^{tr}, \theta)}[\log p(y^{ts} \mid x^{ts}, \phi)] - D_{KL}(q(\phi \mid \mathcal{D}^{tr}, \theta) \,\|\, p(\phi \mid \theta))$
      Final objective (for completeness), taking an expectation over tasks $\mathcal{T}_i$:
      $\max_\theta \; \mathbb{E}_{\mathcal{T}_i}\big[ \mathbb{E}_{q(\phi_i \mid \mathcal{D}^{tr}_i, \theta)}[\log p(y^{ts}_i \mid x^{ts}_i, \phi_i)] - D_{KL}(q(\phi_i \mid \mathcal{D}^{tr}_i, \theta) \,\|\, p(\phi_i \mid \theta)) \big]$

  18. Bayesian black-box meta-learning with standard, deep variational inference.
      $q(\phi_i \mid \mathcal{D}^{tr}_i)$: inference network (a neural net) over the latent task parameters.
      $\max_\theta \; \mathbb{E}_{\mathcal{T}_i}\big[ \mathbb{E}_{q(\phi_i \mid \mathcal{D}^{tr}_i, \theta)}[\log p(y^{ts}_i \mid x^{ts}_i, \phi_i)] - D_{KL}(q(\phi_i \mid \mathcal{D}^{tr}_i, \theta) \,\|\, p(\phi_i \mid \theta)) \big]$
      Pros:
      + can represent non-Gaussian distributions over $y^{ts}$
      + produces a distribution over functions
      Cons:
      - can only represent Gaussian distributions $p(\phi_i \mid \theta)$
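
A minimal per-task sketch of this objective; the inference network, predictor, and learned prior are hypothetical placeholders, and the task latent $\phi_i$ is represented as a Gaussian-distributed context vector rather than full network weights (one common design choice, not the only one):

```python
import torch
from torch.distributions import Normal, kl_divergence

def task_elbo(inference_net, predictor, prior_mu, prior_log_sigma,
              x_tr, y_tr, x_ts, y_ts):
    """Per-task objective:
    E_{q(phi | D^tr, theta)}[log p(y^ts | x^ts, phi)] - KL(q(phi | D^tr, theta) || p(phi | theta))."""
    # Amortized inference: q conditions on the task training set D^tr_i
    # (and on theta implicitly, through the inference network's weights).
    mu_q, log_sigma_q = inference_net(x_tr, y_tr)
    q_phi = Normal(mu_q, log_sigma_q.exp())
    phi = q_phi.rsample()                                  # reparameterized sample of the task latent
    log_lik = predictor(x_ts, phi).log_prob(y_ts).sum()    # log p(y^ts | x^ts, phi)
    p_phi = Normal(prior_mu, prior_log_sigma.exp())        # learned prior p(phi | theta)
    kl = kl_divergence(q_phi, p_phi).sum()
    return log_lik - kl   # maximize; average over sampled tasks T_i in the outer loop
```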

  19. What about Bayesian optimization-based meta-learning?
      Recasting Gradient-Based Meta-Learning as Hierarchical Bayes (Grant et al. '18): the task-specific parameters are computed as a MAP estimate, while the meta-parameters are fit with empirical Bayes.
      How to compute the MAP estimate? Gradient descent with early stopping = MAP inference under a Gaussian prior with mean at the initial parameters [Santos '96] (exact in the linear case, approximate in the nonlinear case).
      This provides a Bayesian interpretation of MAML.
      But we can't sample from $p(\phi_i \mid \theta, \mathcal{D}^{tr}_i)$!
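
To make "gradient descent with early stopping as approximate MAP inference" concrete, a minimal MAML-style inner-loop sketch; the loss function, learning rate, and step count are illustrative, and the Gaussian-prior interpretation comes from the result cited above rather than anything explicit in the code:

```python
import torch

def map_estimate_inner_loop(theta, loss_fn, x_tr, y_tr, inner_lr=0.01, num_steps=5):
    """Adapt task parameters phi_i from meta-parameters theta with a few gradient steps.
    Per Grant et al. '18, truncating ("early stopping") this inner loop acts like MAP
    inference under an implicit Gaussian prior centered at theta."""
    phi = [p.clone() for p in theta]           # start from the meta-learned initialization
    for _ in range(num_steps):                 # a small number of steps = early stopping
        loss = loss_fn(phi, x_tr, y_tr)
        grads = torch.autograd.grad(loss, phi, create_graph=True)  # keep graph for outer loop
        phi = [p - inner_lr * g for p, g in zip(phi, grads)]
    return phi                                 # a point (MAP-like) estimate, not a sample
```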
