

SLIDE 1

CS 330

Bayesian Meta-Learning

SLIDE 2

Reminders

Homework 2 due next Friday.

Project group form due today. Project proposal due in one week. Project proposal presentations in one week.

(full schedule released on Friday)

SLIDE 3

Plan for Today

Why be Bayesian? Bayesian meta-learning approaches:

  • black-box approaches
  • optimization-based approaches

How to evaluate Bayesian meta-learners.

Goals for the end of lecture:

  • Understand the interpretation of meta-learning as Bayesian inference
  • Understand techniques for representing uncertainty over parameters and predictions
SLIDE 4

Disclaimers

Bayesian meta-learning is an active area of research (like most of the class content), with more questions than answers. This lecture covers some of the most advanced and mathiest topics of the course, so ask questions!

SLIDE 5

Recap from last week.

Computation graph perspective:

Black-box: y^ts = f_θ(D_i^tr, x^ts)

Optimization-based

Non-parametric: p(y^ts = n | x^ts) = softmax(−d(f_θ(x^ts), c_n)),

where c_n = (1/K) Σ_{(x,y)∈D_i^tr} 1(y = n) f_θ(x)
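The non-parametric recap above can be written directly as code. A minimal numpy sketch, using an identity map as a stand-in for the embedding f_θ and Euclidean distance for d (both choices here are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                          # numerical stability
    return np.exp(z) / np.exp(z).sum()

def prototype_predict(x_tr, y_tr, x_ts, n_classes):
    """p(y^ts = n | x^ts) = softmax(-d(f(x^ts), c_n)) with prototype c_n."""
    emb = lambda x: x                        # stand-in for f_theta
    protos = np.stack([emb(x_tr[y_tr == n]).mean(axis=0)   # c_n: class mean
                       for n in range(n_classes)])
    dists = np.linalg.norm(emb(x_ts) - protos, axis=-1)    # d(f(x^ts), c_n)
    return softmax(-dists)

# Two well-separated classes; a query near class 0 should get high p(y=0).
x_tr = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [4.8, 5.0]])
y_tr = np.array([0, 0, 1, 1])
probs = prototype_predict(x_tr, y_tr, np.array([0.1, 0.1]), n_classes=2)
assert probs.argmax() == 0
```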

SLIDE 6

Recap from last week.

Algorithmic properties perspective:

Expressive power: the ability of f to represent a range of learning procedures. Why? Scalability, applicability to a range of domains.

Consistency: the learned learning procedure will solve the task with enough data. Why? Reduced reliance on meta-training tasks, good OOD task performance.

These properties are important for most applications!

SLIDE 7

Recap from last week.

Algorithmic properties perspective:

Expressive power: the ability of f to represent a range of learning procedures. Why? Scalability, applicability to a range of domains.

Consistency: the learned learning procedure will solve the task with enough data. Why? Reduced reliance on meta-training tasks, good OOD task performance.

Uncertainty awareness: the ability to reason about ambiguity during learning. Why? *this lecture*: active learning, calibrated uncertainty, principled Bayesian approaches to RL.

SLIDE 8

Plan for Today

Why be Bayesian? Bayesian meta-learning approaches:

  • black-box approaches
  • optimization-based approaches

How to evaluate Bayesian meta-learners.

SLIDE 9

Multi-Task & Meta-Learning Principles

Training and testing must match. Tasks must share “structure.” What does “structure” mean? Statistical dependence on shared latent information θ.

If you condition on that information:

  • task parameters become independent, i.e. φ_{i1} ⊥ φ_{i2} | θ (while φ_{i1} and φ_{i2} are not otherwise independent, i.e. φ_{i1} ⊥̸ φ_{i2})
  • hence, you have lower entropy, i.e. ℋ(p(φ_i | θ)) < ℋ(p(φ_i))

Thought exercise #1: If you can identify θ (i.e. with meta-learning), when should learning φ_i be faster than learning from scratch?

Thought exercise #2: what if ℋ(p(φ_i | θ)) = 0 ∀i?
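The entropy claim above can be checked numerically. A toy sketch with made-up numbers: θ indexes one of two task families, and within a family the task parameter φ is nearly determined, so conditioning on θ lowers the entropy:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_theta = np.array([0.5, 0.5])                        # p(theta)
p_phi_given_theta = np.array([[0.9, 0.1, 0.0, 0.0],   # p(phi | theta=0)
                              [0.0, 0.0, 0.1, 0.9]])  # p(phi | theta=1)

p_phi = p_theta @ p_phi_given_theta                   # marginal p(phi)

H_marginal = entropy(p_phi)
H_conditional = sum(p_theta[t] * entropy(p_phi_given_theta[t])
                    for t in range(2))

# Conditioning on the shared structure strictly reduces entropy here.
assert H_conditional < H_marginal
```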

SLIDE 10

Multi-Task & Meta-Learning Principles

Training and testing must match. Tasks must share “structure.” What does “structure” mean? Statistical dependence on shared latent information θ.

What information might θ contain…

…in a toy sinusoid problem? θ corresponds to the family of sinusoid functions (everything but phase and amplitude).

…in multi-language machine translation? θ corresponds to the family of all language pairs.

Note that θ is narrower than the space of all possible functions.

Thought exercise #3: What if you meta-learn θ without a lot of tasks? “Meta-overfitting” to the family of training functions.

SLIDE 11

Why/when is this a problem?

  • Few-shot learning problems may be ambiguous (even with the prior).

Recall parametric approaches: use a deterministic p(φ_i | D_i^tr, θ) (i.e. a point estimate).

Can we learn to generate hypotheses about the underlying function, i.e. sample from p(φ_i | D_i^tr, θ)?

Important for:

  • safety-critical few-shot learning (e.g. medical imaging)
  • learning to actively learn
  • learning to explore in meta-RL

Active learning w/ meta-learning: Woodward & Finn ’16, Konyushkova et al. ’17, Bachman et al. ’17

SLIDE 12

Plan for Today

Why be Bayesian? Bayesian meta-learning approaches:

  • black-box approaches
  • optimization-based approaches

How to evaluate Bayesian meta-learners.

SLIDE 13

Computation graph perspective:

Black-box: y^ts = f_θ(D_i^tr, x^ts)

Optimization-based

Non-parametric: p(y^ts = n | x^ts) = softmax(−d(f_θ(x^ts), c_n)), where c_n = (1/K) Σ_{(x,y)∈D_i^tr} 1(y = n) f_θ(x)

Version 0: Let f output the parameters of a distribution over y^ts. Then, optimize with maximum likelihood.

For example:

  • probability values of a discrete categorical distribution
  • mean and variance of a Gaussian
  • means, variances, and mixture weights of a mixture of Gaussians
  • for multi-dimensional y^ts: parameters of a sequence of distributions (i.e. an autoregressive model)
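A minimal numpy sketch of “Version 0” for regression, assuming a hypothetical toy model (a linear mean head plus a constant log-variance head) trained by minimizing the Gaussian negative log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_nll(y, mu, log_var):
    """Average negative log-likelihood of y under N(mu, exp(log_var))."""
    return np.mean(0.5 * (log_var + (y - mu) ** 2 / np.exp(log_var)
                          + np.log(2 * np.pi)))

# Toy data: y = 2 x + small noise.
x = rng.normal(size=50)
y = 2.0 * x + 0.1 * rng.normal(size=50)

# Mean head mu = w * x, fit by plain gradient descent on the squared error;
# then set the (constant) variance head to the maximum-likelihood residual
# variance.  Both heads together parameterize the predicted distribution.
w = 0.0
for _ in range(200):
    w -= 0.1 * np.mean((w * x - y) * x)
log_var = np.log(np.mean((y - w * x) ** 2))

nll = gaussian_nll(y, w * x, log_var)
assert abs(w - 2.0) < 0.1          # recovers the slope
assert np.exp(log_var) < 0.05      # and a small noise variance
```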

SLIDE 14

Version 0: Let f output the parameters of a distribution over y^ts.

For example:

  • probability values of a discrete categorical distribution
  • mean and variance of a Gaussian
  • means, variances, and mixture weights of a mixture of Gaussians
  • for multi-dimensional y^ts: parameters of a sequence of distributions (i.e. an autoregressive model)

Then, optimize with maximum likelihood.

Pros:
+ simple
+ can combine with a variety of methods

Cons:
  • can’t reason about uncertainty over the underlying function [to determine how uncertainty across datapoints relates]
  • limited class of distributions over y^ts can be expressed
  • tends to produce poorly-calibrated uncertainty estimates

Thought exercise #4: Can you do the same maximum likelihood training for φ?

SLIDE 15

The Bayesian Deep Learning Toolbox

A broad one-slide overview. Goal: represent distributions (over data, and everything else) with neural networks. (CS 236 provides a thorough treatment.)

Latent variable models + variational inference (Kingma & Welling ’13, Rezende et al. ’14):
  • approximate the likelihood of a latent variable model with a variational lower bound

Bayesian ensembles (Lakshminarayanan et al. ’17):
  • particle-based representation: train separate models on bootstraps of the data

Bayesian neural networks (Blundell et al. ’15):
  • explicit distribution over the space of network parameters

Normalizing flows (Dinh et al. ’16):
  • invertible function from latent distribution to data distribution

Energy-based models & GANs (LeCun et al. ’06, Goodfellow et al. ’14):
  • estimate an unnormalized density

We’ll see how we can leverage the first two. The others could be useful in developing new methods.

SLIDE 16

Background: The Variational Lower Bound

Observed variable x, latent variable z; model parameters θ, variational parameters ϕ.

ELBO: log p(x) ≥ 𝔼_{q(z|x)}[log p(x, z)] + ℋ(q(z|x))

Can also be written as: 𝔼_{q(z|x)}[log p(x|z)] − D_KL(q(z|x) ‖ p(z))

q(z|x): inference network, the variational distribution; p(x|z): the model, represented w/ a neural net; p(z) represented as 𝒩(0, I).

Reparametrization trick. Problem: we need to backprop through sampling, i.e. compute the derivative of 𝔼_q w.r.t. q. For a Gaussian q(z|x): z = μ_q + σ_q ε, where ε ∼ 𝒩(0, I).

Can we use amortized variational inference for meta-learning?
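The reparametrization trick above can be demonstrated in a few lines. A minimal Monte-Carlo sketch (scalar Gaussian, f(z) = z², all numbers made up): writing z = μ + σε lets the gradient of 𝔼_q[f(z)] w.r.t. μ flow through the samples, and for f(z) = z² the true gradient is 2μ:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 1.5, 0.7
eps = rng.standard_normal(100_000)
z = mu + sigma * eps                 # reparametrized samples from N(mu, sigma^2)

# Pathwise gradient of E[z^2] w.r.t. mu:  d/dmu E[(mu + sigma*eps)^2] = E[2 z].
grad_estimate = np.mean(2 * z)       # true value: 2 * mu = 3.0

assert abs(grad_estimate - 2 * mu) < 0.05
```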

SLIDE 17

Bayesian black-box meta-learning with standard, deep variational inference

Standard VAE: observed variable x, latent variable z.
ELBO: 𝔼_{q(z|x)}[log p(x|z)] − D_KL(q(z|x) ‖ p(z))
q: inference network, the variational distribution; p: the model, represented by a neural net.

Meta-learning: observed variable 𝒟, latent variable φ.
max 𝔼_{q(φ)}[log p(𝒟 | φ)] − D_KL(q(φ) ‖ p(φ))

What should q condition on? The training data:
max 𝔼_{q(φ|𝒟^tr)}[log p(𝒟 | φ)] − D_KL(q(φ | 𝒟^tr) ‖ p(φ))
max 𝔼_{q(φ|𝒟^tr)}[log p(y^ts | x^ts, φ)] − D_KL(q(φ | 𝒟^tr) ‖ p(φ))

What about the meta-parameters θ? q can also condition on θ here:
max_θ 𝔼_{q(φ|𝒟^tr, θ)}[log p(y^ts | x^ts, φ)] − D_KL(q(φ | 𝒟^tr, θ) ‖ p(φ | θ))

[Diagram: a neural net maps D_i^tr to q(φ_i | 𝒟_i^tr); a sampled φ_i produces y^ts from x^ts.]

Final objective (for completeness):
max_θ 𝔼_{𝒯_i}[ 𝔼_{q(φ_i | 𝒟_i^tr, θ)}[log p(y_i^ts | x_i^ts, φ_i)] − D_KL(q(φ_i | 𝒟_i^tr, θ) ‖ p(φ_i | θ)) ]
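A one-task numpy sketch of this amortized-VI objective, under strong simplifying assumptions: φ is a scalar slope, the “inference network” is a hand-coded amortizer mapping D_i^tr to a Gaussian q(φ | D^tr) with a fixed (assumed) posterior std, the prior p(φ | θ) is N(0, 1), and the expected log-likelihood term is estimated by reparametrized Monte-Carlo sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

def amortized_q(x_tr, y_tr):
    """Hand-coded stand-in for the inference network: q(phi | D_tr)."""
    mu = np.sum(x_tr * y_tr) / np.sum(x_tr ** 2)   # least-squares mean
    return mu, 0.1                                 # (mean, assumed std)

def kl_gaussians(mu_q, std_q, mu_p, std_p):
    """KL( N(mu_q, std_q^2) || N(mu_p, std_p^2) ) in closed form."""
    return (np.log(std_p / std_q)
            + (std_q ** 2 + (mu_q - mu_p) ** 2) / (2 * std_p ** 2) - 0.5)

# Task data: true slope 2.0, observation noise std 0.1.
x_tr = rng.normal(size=10); y_tr = 2.0 * x_tr + 0.1 * rng.normal(size=10)
x_ts = rng.normal(size=10); y_ts = 2.0 * x_ts + 0.1 * rng.normal(size=10)

mu_q, std_q = amortized_q(x_tr, y_tr)

# Monte-Carlo ELBO: E_q[log p(y_ts | x_ts, phi)] - KL(q || p(phi | theta)).
phis = mu_q + std_q * rng.standard_normal(64)      # reparametrized samples
log_lik = np.mean([
    np.sum(-0.5 * np.log(2 * np.pi * 0.1 ** 2)
           - (y_ts - phi * x_ts) ** 2 / (2 * 0.1 ** 2))
    for phi in phis])
elbo = log_lik - kl_gaussians(mu_q, std_q, 0.0, 1.0)
assert np.isfinite(elbo)
```

The full objective averages this per-task ELBO over tasks 𝒯_i and optimizes θ (here frozen into the hand-coded pieces) by gradient ascent.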

SLIDE 18

Bayesian black-box meta-learning with standard, deep variational inference

[Diagram: a neural net maps D_i^tr to q(φ_i | 𝒟_i^tr); a sampled φ_i produces y^ts from x^ts.]

max_θ 𝔼_{𝒯_i}[ 𝔼_{q(φ_i | 𝒟_i^tr, θ)}[log p(y_i^ts | x_i^ts, φ_i)] − D_KL(q(φ_i | 𝒟_i^tr, θ) ‖ p(φ_i | θ)) ]

Pros:
+ can represent non-Gaussian distributions over y^ts
+ produces a distribution over functions

Cons:
  • can only represent Gaussian distributions p(φ_i | θ)

SLIDE 19

What about Bayesian optimization-based meta-learning?

Meta-parameters θ, task-specific parameters φ_i (empirical Bayes): use a MAP estimate. How to compute the MAP estimate? Gradient descent with early stopping = MAP inference under a Gaussian prior with mean at the initial parameters [Santos ’96] (exact in the linear case, approximate in the nonlinear case).

This provides a Bayesian interpretation of MAML: Recasting Gradient-Based Meta-Learning as Hierarchical Bayes (Grant et al. ’18). But, we can’t sample from p(φ_i | θ, 𝒟_i^tr)!
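The early-stopping equivalence is easy to verify in the linear case, where it is exact. A 1-D sketch (toy data, all constants made up): k steps of gradient descent from w0 on a linear regression loss coincide with the MAP estimate under a Gaussian prior centered at w0, for a matching prior strength λ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D linear regression: y = 3 x + noise.
x = rng.normal(size=20)
y = 3.0 * x + 0.1 * rng.normal(size=20)
h = np.sum(x ** 2)                      # curvature of the squared loss
sxy = np.sum(x * y)

# k steps of gradient descent on 0.5 * sum((w x - y)^2), from w0.
w0, alpha, k = 0.0, 0.5 / h, 5
w = w0
for _ in range(k):
    w = w - alpha * (h * w - sxy)       # gradient step

# Closed-form shrinkage after k steps, and the matching Gaussian prior
# strength lambda that yields the same MAP estimate.
s = (1 - alpha * h) ** k
lam = h * s / (1 - s)
w_map = (sxy + lam * w0) / (h + lam)    # MAP under prior N(w0, sigma^2/lam)

assert abs(w - w_map) < 1e-8            # exact agreement in the linear case
```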

SLIDE 20

What about Bayesian optimization-based meta-learning?

Recall: Bayesian black-box meta-learning with standard, deep variational inference, with q(φ_i | 𝒟_i^tr) a neural net:

max_θ 𝔼_{𝒯_i}[ 𝔼_{q(φ_i | 𝒟_i^tr, θ)}[log p(y_i^ts | x_i^ts, φ_i)] − D_KL(q(φ_i | 𝒟_i^tr, θ) ‖ p(φ_i | θ)) ]

Amortized Bayesian Meta-Learning (Ravi & Beatson ’19): q is an arbitrary function, so q can include a gradient operator! Here q corresponds to SGD on the mean & variance of the neural network weights (μ_φ, σ²_φ) w.r.t. 𝒟_i^tr.

Pro: runs gradient descent at test time.
Con: p(φ_i | θ) is modeled as a Gaussian.

Can we model a non-Gaussian posterior?

SLIDE 21

What about Bayesian optimization-based meta-learning? Can we model a non-Gaussian posterior over all parameters? Can we use ensembles?

Ensemble of MAMLs (EMAML): train M independent MAML models (or do gradient-based inference on the last layer only).

Stein Variational Gradient (BMAML) (Kim et al., Bayesian MAML ’18): optimize a distribution of M particles to produce high likelihood, using Stein variational gradient descent (SVGD) to push particles away from one another. Won’t work well if ensemble members are too similar.

[Images: “An ensemble of mammals” vs. “A more diverse ensemble” of mammals]

Pros: simple, tends to work well, non-Gaussian distributions. Con: need to maintain M model instances.

Note: can also use ensembles w/ black-box and non-parametric methods!
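The particle/ensemble idea can be sketched without any of the MAML machinery. A hypothetical toy version: M “members” are independently fit linear models on bootstraps of a small dataset, and the spread of their predictions acts as a particle-based representation of posterior uncertainty:

```python
import numpy as np

rng = np.random.default_rng(0)

# Small training set: y = 2 x + noise.
x_tr = rng.uniform(-1, 1, size=8)
y_tr = 2.0 * x_tr + 0.3 * rng.normal(size=8)

M = 10
slopes = []
for _ in range(M):
    idx = rng.integers(0, len(x_tr), size=len(x_tr))   # bootstrap resample
    xb, yb = x_tr[idx], y_tr[idx]
    slopes.append(np.sum(xb * yb) / np.sum(xb ** 2))   # least-squares member
slopes = np.array(slopes)

# Each ensemble member is one "particle"; their disagreement at a query
# point is the uncertainty estimate.
x_query = 0.5
preds = slopes * x_query
mean, std = preds.mean(), preds.std()
assert std > 0          # members disagree, so uncertainty is nonzero
```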

SLIDE 22

What about Bayesian optimization-based meta-learning? Sample parameter vectors with a procedure like Hamiltonian Monte Carlo?

Finn*, Xu*, Levine. Probabilistic MAML ’18

Intuition: learn a prior where a random kick can put us in different modes (e.g. “smiling, hat” vs. “smiling, young”).

SLIDE 23

What about Bayesian optimization-based meta-learning? Sample parameter vectors with a procedure like Hamiltonian Monte Carlo?

Finn*, Xu*, Levine. Probabilistic MAML ’18

Approximate with MAP (no longer a single parameter vector): this is extremely crude but extremely convenient! (Santos ’96, Grant et al. ICLR ’18)

Training can be done with amortized variational inference.

SLIDE 24

What about Bayesian optimization-based meta-learning? Sample parameter vectors with a procedure like Hamiltonian Monte Carlo?

Finn*, Xu*, Levine. Probabilistic MAML ’18

What does ancestral sampling look like? (e.g. “smiling, hat” vs. “smiling, young”)

Pros: non-Gaussian posterior, simple at test time, only one model instance. Con: more complex training procedure.

SLIDE 25

Methods Summary

Version 0: f outputs a distribution over y^ts.
Pros: simple, can combine with a variety of methods.
Cons: can’t reason about uncertainty over the underlying function; limited class of distributions over y^ts can be expressed.

Black-box approaches: use latent variable models + amortized variational inference (q(φ_i | 𝒟_i^tr) as a neural net).
Pros: can represent non-Gaussian distributions over y^ts.
Cons: can only represent Gaussian distributions p(φ_i | θ) (okay when φ_i is a latent vector).

Optimization-based approaches:
Amortized inference: Pro: simple. Con: p(φ_i | θ) modeled as a Gaussian.
Ensembles (or do inference on the last layer only): Pros: simple, tends to work well, non-Gaussian distributions. Con: must maintain M model instances.
Hybrid inference: Pros: non-Gaussian posterior, simple at test time, only one model instance. Con: more complex training procedure.

SLIDE 26

Plan for Today

Why be Bayesian? Bayesian meta-learning approaches. How to evaluate Bayesian meta-learners.

SLIDE 27

Plan for Today

Why be Bayesian? Bayesian meta-learning approaches:

  • black-box approaches
  • optimization-based approaches

How to evaluate Bayesian meta-learners.

SLIDE 28

How to evaluate a Bayesian meta-learner?

Use the standard benchmarks? (i.e. MiniImagenet accuracy)

+ standardized
+ real images
+ good check that the approach didn’t break anything

  • metrics like accuracy don't evaluate uncertainty
  • tasks may not exhibit ambiguity
  • uncertainty may not be useful on this dataset!

What are better problems & metrics? It depends on the problem you care about!

SLIDE 29

Qualitative Evaluation on Toy Problems with Ambiguity

(Finn*, Xu*, Levine, NeurIPS ’18)

[Figures: ambiguous regression; ambiguous classification]

SLIDE 30

Evaluation on Ambiguous Generation Tasks

(Gordon et al., ICLR ’19)

SLIDE 31

Accuracy, Mode Coverage, & Likelihood on Ambiguous Tasks

(Finn*, Xu*, Levine, NeurIPS ’18)

SLIDE 32

Reliability Diagrams & Accuracy

(Ravi & Beatson, ICLR ’19)

[Figure: reliability diagrams comparing MAML, Ravi & Beatson, and Probabilistic MAML]
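Reliability diagrams bin predictions by confidence and compare per-bin confidence against accuracy; the weighted gap is the expected calibration error (ECE). A generic numpy sketch of that bookkeeping (not taken from the papers above):

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Expected calibration error: confidence-vs-accuracy gap per bin,
    weighted by the fraction of predictions falling in the bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            total += mask.mean() * gap
    return total

# A perfectly calibrated toy predictor: within each confidence bin,
# accuracy equals the stated confidence, so ECE is zero.
conf = np.array([0.55] * 100 + [0.95] * 100)
corr = np.concatenate([np.ones(55), np.zeros(45),    # 55% correct at 0.55
                       np.ones(95), np.zeros(5)])    # 95% correct at 0.95
assert ece(conf, corr) < 1e-6
```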

SLIDE 33

Active Learning Evaluation

Finn*, Xu*, Levine, NeurIPS ’18: sinusoid regression. Kim et al., NeurIPS ’18: MiniImageNet.

Both experiments:

  • sequentially choose the datapoint with maximum predictive entropy to be labeled
  • or choose a datapoint at random (MAML)
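The max-predictive-entropy acquisition step above can be sketched generically. A toy version with made-up ensemble predictions: the unlabeled point whose model-averaged class distribution has the highest entropy is selected for labeling:

```python
import numpy as np

def predictive_entropy(probs):
    """probs: (n_members, n_points, n_classes) per-member class probabilities.
    Returns the entropy of the member-averaged predictive distribution."""
    mean_probs = probs.mean(axis=0)                     # marginalize members
    return -np.sum(mean_probs * np.log(mean_probs + 1e-12), axis=-1)

# Three candidate points; the members agree on points 0 and 2 but
# disagree sharply on point 1, making it the most ambiguous.
probs = np.array([
    [[0.99, 0.01], [0.9, 0.1], [0.01, 0.99]],   # member 1
    [[0.98, 0.02], [0.1, 0.9], [0.02, 0.98]],   # member 2
])
query = int(np.argmax(predictive_entropy(probs)))
assert query == 1       # the ambiguous point is chosen for labeling
```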
SLIDE 34

Algorithmic properties perspective:

Expressive power: the ability of f to represent a range of learning procedures. Why? Scalability, applicability to a range of domains.

Consistency: the learned learning procedure will solve the task with enough data. Why? Reduced reliance on meta-training tasks, good OOD task performance.

Uncertainty awareness: the ability to reason about ambiguity during learning. Why? Active learning, calibrated uncertainty, principled Bayesian approaches to RL.

SLIDE 35

Reminders / Next Time

Monday: start of reinforcement learning! Wednesday: project proposal presentations.

Homework 2 due next Friday. Project group form due today. Project proposal due in one week.