

SLIDE 1

Meta-Learning with Shared Amortized Variational Inference

Ekaterina Iakovleva, Jakob Verbeek, Karteek Alahari
Inria, Facebook, Inria

ICML | 2020
Thirty-seventh International Conference on Machine Learning

SLIDE 2

Standard classification task pipeline

SLIDE 3

Meta-learning classification task pipeline

(Figure: meta-learning task pipeline over meta training and meta test data.)

Schmidhuber 1999, Ravi & Larochelle ICLR’17

SLIDE 4

Overview

  • This work focuses on the empirical Bayes meta-learning approach.
  • We propose a novel scheme for amortized variational inference.
  • We demonstrate that earlier work based on Monte Carlo approximation underestimates model variance.
  • We show the advantage of our approach on miniImageNet and FC100.

SLIDE 5

Meta-learning classification task definition

  • K-shot, N-way classification task.
  • Episodic training: each task t is sampled from a distribution over tasks p(T).
  • Support data D_t = {(x_{k,c}^t, y_{k,c}^t)}, k = 1..K, c = 1..N.
  • Query data D̃_t = {(x̃_{m,c}^t, ỹ_{m,c}^t)}, m = 1..M, c = 1..N.
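
To make the episodic setup concrete, here is a minimal NumPy sketch of sampling one K-shot, N-way task from a labeled image collection; the `images_by_class` layout and the function name are illustrative assumptions, not part of the presentation.

```python
import numpy as np

def sample_task(images_by_class, K=5, N=5, M=15, rng=np.random):
    """Sample one K-shot, N-way episode: a support set D_t and a query set D~_t.

    images_by_class: dict mapping class id -> array of images (hypothetical layout).
    """
    classes = rng.choice(list(images_by_class.keys()), size=N, replace=False)
    support_x, support_y, query_x, query_y = [], [], [], []
    for c, cls in enumerate(classes):
        idx = rng.permutation(len(images_by_class[cls]))[: K + M]
        imgs = images_by_class[cls][idx]
        support_x.append(imgs[:K]);  support_y += [c] * K   # K support shots per class
        query_x.append(imgs[K:]);    query_y += [c] * M     # M query samples per class
    return (np.concatenate(support_x), np.array(support_y),
            np.concatenate(query_x), np.array(query_y))
```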

SLIDE 6–8

Meta-learning approaches

  • Distance-based classifiers
    ◦ Learned metric relies on the distance to individual samples or class prototypes.
    ◦ E.g. Prototypical Networks [1], Matching Nets [2].
  • Optimization-based approaches
    ◦ Vanilla SGD is replaced by a trainable update mechanism.
    ◦ E.g. MAML [3], Meta LSTM [4].
  • Latent variable models
    ◦ The model parameters are treated as latent variables.
    ◦ Their variance is explicitly modeled in a Bayesian framework.
    ◦ E.g. Neural Processes [5], VERSA [6].

[1] – Snell et al. NeurIPS’17, [2] – Vinyals et al. NeurIPS’16, [3] – Finn et al. ICML’17, [4] – Ravi & Larochelle ICLR’17, [5] – Garnelo et al. ICML’18, [6] – Gordon et al. ICLR’19

SLIDE 9–12

Multi-task generative model

  • The multi-task graphical model includes:
    ◦ task-agnostic parameters θ
    ◦ task-specific latent parameters {w_t}, t = 1..T
  • Marginal likelihood of the query labels Ỹ = {Ỹ_t}, given query samples X̃ = {X̃_t} and the support sets D = {D_t} = {(X_t, Y_t)}, t = 1..T:

    p(Ỹ | X̃, D, θ) = ∏_{t=1}^{T} ∫ p(Ỹ_t | X̃_t, w_t) p_φ(w_t | D_t, θ) dw_t

  • The intractable integral requires approximation for training and prediction.

SLIDE 13

Monte Carlo approximation

  • Monte Carlo approximation of the marginal log-likelihood using samples w_t^(l) ~ p_φ(w_t | D_t, θ):

    log p(Ỹ | X̃, D, θ) ≈ (1/(T·M)) ∑_{t=1}^{T} ∑_{m=1}^{M} log [ (1/L) ∑_{l=1}^{L} p(ỹ_m^t | x̃_m^t, w_t^(l)) ]

  • This objective function has been used in VERSA [1].
  • Our experiments show that this approach learns a degenerate prior p_φ(w_t | D_t, θ).

[1] – Gordon et al. ICLR’19
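
As a rough illustration of this objective (not the authors' code), the inner log-average over L sampled classifiers can be computed with a log-sum-exp; the tensor shapes and names below are assumptions.

```python
import math
import torch
import torch.nn.functional as F

def monte_carlo_objective(logits, query_labels):
    """Per-task Monte Carlo objective: mean over query points of log (1/L) sum_l p(y | x, w^(l)).

    logits: [L, M, N] class scores for M query samples under L sampled classifiers w^(l).
    query_labels: [M] integer class labels.
    """
    L, M, _ = logits.shape
    log_probs = F.log_softmax(logits, dim=-1)                      # log p(y | x, w^(l))
    picked = log_probs[:, torch.arange(M), query_labels]           # [L, M]
    return (torch.logsumexp(picked, dim=0) - math.log(L)).mean()   # average over query samples
```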

SLIDE 14–18

Amortized variational inference

  • Variational evidence lower bound (ELBO) with the amortized approximate posterior [1] parameterized by ω:

    log p(Ỹ_t | X̃_t, D_t, θ) ≥ E_{q_ω}[ log p(Ỹ_t | X̃_t, w_t) ] − γ · D_KL( q_ω(w_t | Ỹ_t, X̃_t, D_t, θ) ‖ p_φ(w_t | D_t, θ) )

    The expectation term is the reconstruction loss; the KL term acts as regularization.
  • We use the regularization coefficient γ [2] to weight the KL term.
  • Predictions are made via Monte Carlo sampling from the learned prior:

    p(ỹ_m^t | x̃_m^t, D_t, θ) ≈ (1/L) ∑_{l=1}^{L} p(ỹ_m^t | x̃_m^t, w_t^(l)),  where w_t^(l) ~ p_φ(w_t | D_t, θ).

[1] – Kingma & Welling ICLR’14, [2] – Higgins et al. ICLR’17
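
A minimal PyTorch-style sketch of such a training loss, assuming diagonal Gaussian prior and posterior over the classifier weights, reparameterized sampling, and a γ-weighted KL term; all names and shapes are illustrative, not the authors' implementation.

```python
import torch

def elbo_loss(prior_mu, prior_logvar, post_mu, post_logvar, log_lik_fn, gamma=1.0):
    """Negative ELBO: reconstruction term plus gamma-weighted KL(q || p).

    prior_*  : parameters of p(w | D_t, theta)         (from the support set)
    post_*   : parameters of q(w | D~_t, D_t, theta)   (from support + query)
    log_lik_fn(w) -> scalar log p(Y~_t | X~_t, w) for a sampled classifier w.
    """
    # Reparameterized sample from the posterior
    w = post_mu + torch.exp(0.5 * post_logvar) * torch.randn_like(post_mu)
    recon = log_lik_fn(w)
    # KL between two diagonal Gaussians, summed over weight dimensions
    kl = 0.5 * (prior_logvar - post_logvar
                + (torch.exp(post_logvar) + (post_mu - prior_mu) ** 2) / torch.exp(prior_logvar)
                - 1.0).sum()
    return -(recon - gamma * kl)
```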

SLIDE 19–21

Shared amortized variational inference: SAMOVAR

  • Both prior and posterior are conditioned on labeled sets.
  • The inference network can therefore be shared between prior and posterior:

    log p(Ỹ_t | X̃_t, D_t, θ) ≥ E_{q_φ}[ log p(Ỹ_t | X̃_t, w_t) ] − γ · D_KL( q_φ(w_t | Ỹ_t, X̃_t, D_t, θ) ‖ p_φ(w_t | D_t, θ) )

  • Sharing reduces the memory footprint and encourages learning a non-degenerate prior.
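
To illustrate the sharing, here is a hedged sketch of how one inference network `h_phi` could produce both distributions: the prior from class-averaged support features, the posterior from class-averaged support-plus-query features. The helper names are hypothetical.

```python
import torch

def prior_and_posterior(h_phi, support_feats, support_labels, query_feats, query_labels, N):
    """Both distributions come from the SAME network h_phi, applied to different labeled sets."""
    def class_means(feats, labels):
        return torch.stack([feats[labels == c].mean(dim=0) for c in range(N)])  # [N, feat_dim]

    prior = h_phi(class_means(support_feats, support_labels))                   # p_phi(w | D_t)
    joint_feats = torch.cat([support_feats, query_feats])
    joint_labels = torch.cat([support_labels, query_labels])
    posterior = h_phi(class_means(joint_feats, joint_labels))                   # q_phi(w | D~_t, D_t)
    return prior, posterior
```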

SLIDE 22

SAMOVAR design based on VERSA

  • Task-agnostic feature extractor g_θ produces embeddings of the input images x.
  • Task-specific linear classifier w_t predicts labels for the query samples x̃:

    p(ỹ_m^t | x̃_m^t, w_t) = softmax( w_t g_θ(x̃_m^t) )

  • Shared amortized inference network h_φ returns the parameters {μ_c^t, σ_c^t} of a Gaussian over the weight vector w_c^t for each class c:

    p_φ(w_c^t | D_t, θ) = N( μ_c^t, diag(σ_c^t) ),  where  (μ_c^t, σ_c^t) = h_φ( (1/K) ∑_{k=1}^{K} g_θ(x_{k,c}^t) )

VERSA – Gordon et al. ICLR’19
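
Below is a rough PyTorch sketch of this design under simple assumptions (a generic feature extractor producing `feat_dim`-dimensional embeddings, an MLP for h_φ); module names and sizes are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class WeightInferenceNet(nn.Module):
    """Shared amortized inference network h_phi: class embedding -> Gaussian over weights."""
    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * feat_dim))  # mean and log-variance

    def forward(self, class_feats):                 # [N, feat_dim] mean feature per class
        mu, logvar = self.net(class_feats).chunk(2, dim=-1)
        return mu, logvar                           # each [N, feat_dim]

def classify(query_feats, weight_sample):
    """Linear classifier: p(y | x, w) = softmax(w g(x))."""
    return (query_feats @ weight_sample.t()).softmax(dim=-1)       # [M, N]

# Usage sketch: prior from support only; the posterior reuses the same network on support + query.
feat_dim, N, K, M = 64, 5, 5, 15
h_phi = WeightInferenceNet(feat_dim)
support_feats = torch.randn(N, K, feat_dim)         # g_theta applied to support images
query_feats = torch.randn(M, feat_dim)               # g_theta applied to query images
mu_p, logvar_p = h_phi(support_feats.mean(dim=1))    # prior parameters, one Gaussian per class
w = mu_p + torch.exp(0.5 * logvar_p) * torch.randn_like(mu_p)  # sample classifier weights
probs = classify(query_feats, w)                     # [M, N] class probabilities
```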

SLIDE 23

Improved architectural design based on TADAM

  • Scaled cosine similarity (-SC): the linear classifier is replaced with the cosine similarity classifier scaled with α.
  • Task encoding network (-TEN): TEN provides task-conditioned batch norm parameters for the feature maps in g_θ.
  • Auxiliary co-training (-AT): g_θ is shared with an auxiliary classification task across all meta-train classes.

TADAM – Oreshkin et al. NeurIPS’18
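
For the scaled cosine similarity variant, the plain dot-product scoring above could be swapped for something like the following sketch, with α a learnable scale; this is an illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def scaled_cosine_logits(query_feats, class_weights, alpha):
    """Scaled cosine similarity classifier: logits = alpha * cos(g(x), w_c)."""
    q = F.normalize(query_feats, dim=-1)     # [M, feat_dim]
    w = F.normalize(class_weights, dim=-1)   # [N, feat_dim]
    return alpha * (q @ w.t())               # [M, N] logits
```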

SLIDE 24

Experiments: synthetic task

  • Hierarchical generative process: p(w_t) = N(0, 1) and p(y_t | w_t) = N(w_t, σ_y²).
  • T = 250 sampled tasks, K = 5 support observations D_t = {y_k^t}_{k=1}^{K}, and M = 15 query observations D̃_t = {ỹ_m^t}_{m=1}^{M}.
  • Posterior over the latent variable: q(w_t | D_t) = N(ν, τ²).
  • The parameters of the posterior are obtained via an inference network with parameters ρ:

    (ν, log τ²) = ρ_1 ∑_{k=1}^{K} y_k^t + ρ_2
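
A small NumPy sketch of this synthetic setup, sampling tasks and, for reference, computing the conjugate Gaussian posterior in closed form; the default σ_y = 0.5 mirrors one setting from the results figure, and the function names are illustrative.

```python
import numpy as np

def sample_synthetic_tasks(T=250, K=5, M=15, sigma_y=0.5, rng=np.random.default_rng(0)):
    """Hierarchical Gaussian tasks: w_t ~ N(0, 1), observations y ~ N(w_t, sigma_y^2)."""
    w = rng.normal(0.0, 1.0, size=T)                         # task-specific latent
    support = rng.normal(w[:, None], sigma_y, size=(T, K))   # D_t
    query = rng.normal(w[:, None], sigma_y, size=(T, M))     # D~_t
    return w, support, query

def exact_posterior(support, sigma_y=0.5):
    """Closed-form Gaussian posterior p(w_t | D_t) for this conjugate model."""
    K = support.shape[1]
    var = 1.0 / (1.0 + K / sigma_y**2)                       # posterior variance
    mean = var * support.sum(axis=1) / sigma_y**2            # posterior mean
    return mean, var
```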

SLIDE 25

Results: synthetic task

  • Exact marginal log-likelihood log p(D̃_t | D_t).
  • Monte Carlo estimation of log p(D̃_t | D_t) with L samples from the prior.
  • Variational inference for log p(D̃_t | D_t) with L samples from the posterior.
  • Monte Carlo requires large sample sets compared to variational inference.

(Figure: estimated / true log-likelihood; panel (b): τ = 0.5.)

SLIDE 26

Experimental setup for real data

  • 5-shot and 1-shot, 5-way classification tasks.
  • Test data contains 15 query samples per class.
  • Evaluation is performed on 5,000 randomly sampled tasks.
  • We report the mean accuracy over these tasks, and 95% confidence intervals.

SLIDE 27–29

Comparison with VERSA on miniImageNet

  • SAMOVAR-base and VERSA train the same meta-learning model.
  • SAMOVAR with separate prior and posterior is inferior to other models.
  • SAMOVAR is comparable with VERSA on the 1-shot task, and outperforms it on the 5-shot task.

VERSA – Gordon et al. ICLR’19

SLIDE 30–31

Comparison with TADAM on miniImageNet

  • Both models are trained with auxiliary co-training.
  • SAMOVAR consistently improves over TADAM across all ablations.

α: cosine scaling, AT: auxiliary co-training, TEN: task embedding network. Additional ablations can be found in the paper.

TADAM – Oreshkin et al. NeurIPS’18
slide-32
SLIDE 32

ICML | 2020

Comparison with state of the art on miniImageNet

  • SAMOVAR demonstrates competitive results with and without data augmentation.
  • SAMOVAR is complementary to approaches like CTM [Li et al. CVPR’19] or [Gidaris et al.

ICCV’19]

†: Transductive methods.

18

slide-33
SLIDE 33

ICML | 2020

Comparison with state of the art on miniImageNet

  • SAMOVAR demonstrates competitive results with and without data augmentation.
  • SAMOVAR is complementary to approaches like CTM [Li et al. CVPR’19] or [Gidaris et al.

ICCV’19]

†: Transductive methods.

18

slide-34
SLIDE 34

ICML | 2020

Comparison with state of the art on miniImageNet

  • SAMOVAR demonstrates competitive results with and without data augmentation.
  • SAMOVAR is complementary to approaches like CTM [Li et al. CVPR’19] or [Gidaris et al.

ICCV’19]

†: Transductive methods.

18

SLIDE 35

Summary

  • Monte Carlo approximation underestimates the variance in model parameters.
  • We propose SAMOVAR, a meta-learning model based on shared amortized variational inference.
  • Experiments on synthetic data show that the variational inference approach preserves stochasticity.
  • SAMOVAR combined with TADAM shows competitive results on miniImageNet and FC100.

SLIDE 36

Thank you!

ICML | 2020
Thirty-seventh International Conference on Machine Learning