  1. Meta Learning Shengchao Liu

  2. Background
  • Meta Learning (AKA Learning to Learn)
  • A fast-learning algorithm: one that adapts quickly from the source tasks to the target tasks
  • Key terminologies:
  • Support Set & Query Set
  • C-Way K-Shot Learning: C classes, each with K samples
  • Pre-training & Fine-tuning

  3. Meta-Learning
  • Metric-Based: Siamese Networks, Relation Network, Matching Network, Prototypical Networks, Meta GNN
  • Model-Based: MANN, Meta Networks, HyperNetworks
  • Gradient-Based: MAML (FOMAML), Reptile, ANIL

  4. Meta-Learning (taxonomy repeated from slide 3; this section: Metric-Based)

  5. 1. Metric-Based
  • Similar idea to nearest-neighbor algorithms: $p_\theta(y \mid x, S) = \sum_{(x_i, y_i) \in S} k_\theta(x, x_i)\, y_i$, where $k_\theta$ is the kernel function
  • Siamese Neural Networks for One-shot Image Recognition, ICML 2015
  • Learning to Compare: Relation Network for Few-Shot Learning, CVPR 2018
  • Matching Networks for One-Shot Learning, NIPS 2016
  • Prototypical Networks for Few-Shot Learning, NeurIPS 2017
  • Few-Shot Learning with Graph Neural Networks, ICLR 2018
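
As a concrete reading of the kernel formula above, a minimal PyTorch sketch; the encoder `embed` and the softmax-over-negative-distance kernel are illustrative assumptions, not specified on the slide:

    import torch

    def kernel_predict(x, support_x, support_y, embed, n_classes):
        # p(y | x, S) = sum_i k_theta(x, x_i) * y_i, with y_i one-hot
        z = embed(x)                                     # query embedding
        zs = torch.stack([embed(xi) for xi in support_x])
        sims = -torch.norm(zs - z, dim=1)                # assumed kernel: softmax of -L2 distance
        w = torch.softmax(sims, dim=0)                   # kernel weights, sum to 1
        onehot = torch.nn.functional.one_hot(support_y, n_classes).float()
        return w @ onehot                                # class probabilities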

  6. Siamese Neural Network
  • Few-Shot Learning
  • Twin network
  • L1-distance as the metric
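
A minimal PyTorch sketch of the twin network with a weighted L1 metric (the encoder, `feat_dim`, and layer names are illustrative assumptions):

    import torch
    import torch.nn as nn

    class SiameseNet(nn.Module):
        def __init__(self, encoder, feat_dim):
            super().__init__()
            self.encoder = encoder                 # shared ("twin") encoder
            self.head = nn.Linear(feat_dim, 1)     # weights the per-feature L1 distances

        def forward(self, x1, x2):
            z1, z2 = self.encoder(x1), self.encoder(x2)
            dist = torch.abs(z1 - z2)              # component-wise L1 distance
            return torch.sigmoid(self.head(dist)) # probability the pair is the same class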

  7. Siamese Neural Network

  8. Relation Network
  • Few-Shot Learning
  • Similar to the Siamese Network
  • Difference: concatenation and a CNN as the relation module
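
A sketch of what the relation module changes relative to the Siamese setup: embeddings are concatenated and scored by a learned network rather than a fixed distance. The paper uses CNN blocks; a small MLP stands in here, and all names are assumed:

    import torch
    import torch.nn as nn

    class RelationModule(nn.Module):
        def __init__(self, feat_dim, hidden=64):
            super().__init__()
            # learned relation scorer; CNN blocks in the paper, an MLP here for brevity
            self.score = nn.Sequential(
                nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 1), nn.Sigmoid())

        def forward(self, z_query, z_support):
            pair = torch.cat([z_query, z_support], dim=-1)  # concatenation, not a fixed metric
            return self.score(pair)                         # relation score in [0, 1]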

  9. Matching Network
  • Given a support set ($k$ samples): $S = \{(x_i, y_i)\}_{i=1}^{k}$
  • Goal: $P(\hat y \mid \hat x, S) = \sum_{i=1}^{k} a(\hat x, x_i)\, y_i$, with attention $a(\hat x, x_i) = \frac{\exp[\mathrm{cosine}(f(\hat x), g(x_i))]}{\sum_{j=1}^{k} \exp[\mathrm{cosine}(f(\hat x), g(x_j))]}$
  • Two embedding methods are tested for $f, g$
  • Episodic Training
  • Support Set (C-Way K-Shot)
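
A minimal sketch of the attention classifier above (shapes and names are assumptions: `f_query` is the embedded query, `g_support` the k embedded support points, `y_onehot` the k one-hot labels):

    import torch
    import torch.nn.functional as F

    def matching_predict(f_query, g_support, y_onehot):
        # a(x_hat, x_i): softmax over cosine similarities between query and support embeddings
        sims = F.cosine_similarity(f_query.unsqueeze(0), g_support, dim=-1)  # (k,)
        attn = F.softmax(sims, dim=0)
        return attn @ y_onehot      # P(y_hat | x_hat, S) = sum_i a_i * y_i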

  10. Matching Network
  • Simple Embedding: $f = g$, with some CNN model
  • Full Context Embedding:
  • $g(x_i)$ applies a bidirectional LSTM
  • $f(\hat x)$ applies an attention LSTM:
    1. First encode $\hat x$ through a CNN to get $f'(\hat x)$
    2. Then an attention LSTM is trained with a read attention over the full support set $S$:
    $\hat h_k, c_k = \mathrm{LSTM}(f'(\hat x), [h_{k-1}, r_{k-1}], c_{k-1})$
    $h_k = \hat h_k + f'(\hat x)$
    $r_k = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i)) \cdot g(x_i)$
    where $a(h_{k-1}, g(x_i)) = \exp\{h_{k-1}^T g(x_i)\} \,/\, \sum_{j=1}^{|S|} \exp\{h_{k-1}^T g(x_j)\}$
    3. Finally, $f(\hat x) = h_K$, where $K$ is the number of reads

  11. Prototypical Network
  • For each class $k$:
  • Sample a support set and compute the prototype: $c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\phi(x_i)$
  • Sample a query set and classify by distance to the prototypes: $p(y = k \mid x) = \frac{\exp(-d(f_\phi(x), c_k))}{\sum_{k'} \exp(-d(f_\phi(x), c_{k'}))}$
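
A minimal sketch of both equations (names are assumptions; `f_phi` is the embedding network, and this sketch uses plain Euclidean distance where the paper uses the squared version):

    import torch
    import torch.nn.functional as F

    def proto_predict(f_phi, support_x, support_y, query_x, n_classes):
        z_s, z_q = f_phi(support_x), f_phi(query_x)       # embed support and queries
        # class prototype = mean embedding of that class's support points
        protos = torch.stack([z_s[support_y == k].mean(dim=0) for k in range(n_classes)])
        dists = torch.cdist(z_q, protos)                  # distances to each prototype
        return F.softmax(-dists, dim=1)                   # p(y = k | x)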

  12. Prototypical Network

  13. Prototypical Network
  • When viewed as a clustering algorithm: for a Bregman divergence $d_\phi(z, z') = \phi(z) - \phi(z') - (z - z')^T \nabla \phi(z')$, the class mean $c_k$ achieves the minimum total divergence to the points in $S_k$
  • Equivalent to a linear model when the squared Euclidean distance is used
  • Comparison between Matching Network & Prototypical Network:
  • Equivalent in one-shot learning (each prototype is then a single support embedding), but not in K-shot learning
  • Matching Network: $P(\hat y \mid \hat x) = \sum_{i=1}^{k} a(\hat x, x_i)\, y_i$, with $a(\hat x, x_i) = \frac{\exp[\mathrm{cosine}(f(\hat x), g(x_i))]}{\sum_{j=1}^{k} \exp[\mathrm{cosine}(f(\hat x), g(x_j))]}$
  • Prototypical Network: $p(y = k \mid x) = \frac{\exp(-d(f_\phi(x), c_k))}{\sum_{k'} \exp(-d(f_\phi(x), c_{k'}))}$

  14. Meta GNN

  15. Meta GNN
  • For the $k$-th layer:
  • Node update: $x_i^k = \mathrm{GCN}(x_i^{k-1})$
  • Learned adjacency: $A_{i,j}^k = \varphi(x_i^k, x_j^k) = \mathrm{MLP}\big(\mathrm{abs}(x_i^k - x_j^k)\big)$
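
A sketch of the learned-adjacency step (names and the row-softmax normalization are assumptions built around the formula above):

    import torch
    import torch.nn as nn

    class EdgeMLP(nn.Module):
        # builds the learned adjacency A[i, j] = MLP(|x_i - x_j|)
        def __init__(self, feat_dim, hidden=32):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 1))

        def forward(self, x):                                # x: (n_nodes, feat_dim)
            diff = (x.unsqueeze(1) - x.unsqueeze(0)).abs()   # (n, n, feat_dim) pairwise |x_i - x_j|
            A = self.mlp(diff).squeeze(-1)                   # (n, n) edge logits
            return torch.softmax(A, dim=-1)                  # row-normalized adjacency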

  16. Metric-Based
  • Comments:
  • Performance depends heavily on the choice of metric function
  • Robustness: degrades when the target task diverges from the source tasks

  17. Meta-Learning (taxonomy repeated from slide 3; next section: Model-Based)

  18. 2. Model-Based
  • Goal: to learn a model $f_\theta$
  • Solution: learn another model to parameterize $f_\theta$

  21. 2. Model-Based
  • Goal: to learn a base model $f_\theta$
  • Solution: learn a meta model to parameterize $f_\theta$
  • Meta-Learning with Memory-Augmented Neural Networks, ICML 2016
  • Meta Networks, ICML 2017
  • HyperNetworks, ArXiv 2016
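
To make "a meta model parameterizes the base model" concrete, a minimal hypernetwork-style sketch (all names are illustrative assumptions):

    import torch
    import torch.nn as nn

    class HyperLinear(nn.Module):
        # a meta model that emits the weights of a base linear layer from a task embedding
        def __init__(self, task_dim, in_dim, out_dim):
            super().__init__()
            self.in_dim, self.out_dim = in_dim, out_dim
            self.weight_gen = nn.Linear(task_dim, in_dim * out_dim)
            self.bias_gen = nn.Linear(task_dim, out_dim)

        def forward(self, x, task_emb):
            W = self.weight_gen(task_emb).view(self.out_dim, self.in_dim)
            b = self.bias_gen(task_emb)
            return x @ W.t() + b    # base model f_theta, with theta produced by the meta model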

  22. Memory-Augmented Neural Networks (MANN)
  • Basic idea (based on the Neural Turing Machine):
  • Store the useful information of the new task in an external memory
  • The true label is presented with a one-step offset: the label for the input at step t-1 arrives as input at step t

  23. Memory-Augmented Neural Networks (MANN) • Example:

  24. Addressing Mechanism
  • Key vector $k_t$ at step $t$ is generated from the input $x_t$; the memory matrix at step $t$ is $M_t$; the read vector is $r_t$
  • Read weights $w_t^r$, usage weights $w_t^u$, write weights $w_t^w$
  • Read:
    $w_t^r(i) = \mathrm{softmax}\Big(\frac{k_t \cdot M_t(i)}{\|k_t\|\,\|M_t(i)\|}\Big)$
    $r_t = \sum_{i=1}^{N} w_t^r(i)\, M_t(i)$
  • Write (Least Recently Used Access, LRUA):
    $w_t^u = \gamma\, w_{t-1}^u + w_t^r + w_t^w$
    $w_t^w = \sigma(\alpha)\, w_{t-1}^r + (1 - \sigma(\alpha))\, w_{t-1}^{lu}$
    $w_t^{lu}(i) = 0$ if $w_t^u(i) > m(w_t^u, n)$, else $1$, where $m(w_t^u, n)$ is the $n$-th smallest element in the vector $w_t^u$
    $M_t(i) = M_{t-1}(i) + w_t^w(i)\, k_t, \ \forall i$
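
A sketch of the read and LRUA write rules above (tensor shapes and names are assumptions: `M` is N x d, weight vectors are length N):

    import torch

    def mann_read(k, M):
        # content-based read: softmax over cosine similarities, then a weighted sum
        sims = (M @ k) / (M.norm(dim=1) * k.norm() + 1e-8)   # cosine(k_t, M_t(i))
        w_r = torch.softmax(sims, dim=0)                     # read weights w_t^r
        return w_r @ M, w_r                                  # read vector r_t

    def lrua_write(M, k, w_r, w_r_prev, w_u_prev, alpha, gamma=0.95, n=1):
        # least-used weights: 1 at the n smallest usage entries, 0 elsewhere
        w_lu = torch.zeros_like(w_u_prev)
        w_lu[w_u_prev.topk(n, largest=False).indices] = 1.0
        g = torch.sigmoid(alpha)                 # interpolation gate sigma(alpha)
        w_w = g * w_r_prev + (1 - g) * w_lu      # write weights w_t^w
        M = M + w_w.unsqueeze(1) * k             # M_t(i) = M_{t-1}(i) + w_w(i) k_t
        w_u = gamma * w_u_prev + w_r + w_w       # usage decay plus current read/write
        return M, w_w, w_u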

  25. Meta-Learning (taxonomy repeated from slide 3; next section: Gradient-Based)

  27. 3. Gradient-Based
  • Model-Based (recap):
  • Goal: to learn a base model $f_\theta$
  • Solution: learn a meta model to parameterize $f_\theta$
  • Gradient-Based:
  • Goal: to learn a base model $f_\theta$
  • Solution: learn to parameterize $f_\theta$ without a meta model

  28. 3. Gradient-Based
  • Learning to learn with gradients
  • MAML (Model-Agnostic Meta-Learning) & FOMAML, ICML 2017
  • Reptile, ArXiv 2018
  • ANIL (Almost No Inner Loop), ICLR 2020

  29. MAML
  • Model-Agnostic Meta-Learning (MAML)
  • Motivation:
  • Find model parameters that are sensitive to changes in the task
  • So that small changes in the parameters yield large improvements

  30. MAML
  • Inner loop:
  • Sample a batch of tasks $\tau_i \sim p(\tau)$
  • Sample $K$ samples per task
  • Adapt: $\theta_i' = \theta - \alpha \nabla_\theta \ell_{\tau_i}(f_\theta)$
  • Outer loop:
  • Meta-objective: $\min_\theta \sum_{\tau_i \sim p(\tau)} \ell_{\tau_i}(f_{\theta_i'}) = \min_\theta \sum_{\tau_i \sim p(\tau)} \ell_{\tau_i}\big(f_{\theta - \alpha \nabla_\theta \ell_{\tau_i}(f_\theta)}\big)$
  • SGD: $\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\tau_i \sim p(\tau)} \ell_{\tau_i}(f_{\theta_i'}) = \theta - \beta \nabla_\theta \sum_{\tau_i \sim p(\tau)} \ell_{\tau_i}\big(f_{\theta - \alpha \nabla_\theta \ell_{\tau_i}(f_\theta)}\big)$
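
A minimal second-order MAML sketch, assuming PyTorch 2.x's `torch.func.functional_call` and tasks given as (support, query) batch pairs; everything else (names, a single inner step) is an illustrative simplification:

    import torch
    from torch.func import functional_call

    def maml_outer_step(model, tasks, loss_fn, alpha=0.01, beta=0.001):
        params = dict(model.named_parameters())
        meta_loss = 0.0
        for (xs, ys), (xq, yq) in tasks:
            # inner loop: theta'_i = theta - alpha * grad of the support loss
            loss_s = loss_fn(functional_call(model, params, (xs,)), ys)
            grads = torch.autograd.grad(loss_s, tuple(params.values()), create_graph=True)
            adapted = {n: p - alpha * g for (n, p), g in zip(params.items(), grads)}
            # query loss evaluated at the adapted parameters
            meta_loss = meta_loss + loss_fn(functional_call(model, adapted, (xq,)), yq)
        # outer loop: differentiate through the inner update (gradient through a gradient)
        meta_grads = torch.autograd.grad(meta_loss, tuple(params.values()))
        with torch.no_grad():
            for p, g in zip(params.values(), meta_grads):
                p -= beta * g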

  31. FOMAML
  • MAML involves a gradient through a gradient:
    $\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\tau_i \sim p(\tau)} \ell_{\tau_i}\big(f_{\theta - \alpha \nabla_\theta \ell_{\tau_i}(f_\theta)}\big)$
  • First-order approximation, A.K.A. first-order MAML (FOMAML):
  • Omit the second-order derivatives
  • Still compute the meta-gradient at the post-update parameters $\theta_i'$:
    $\theta \leftarrow \theta - \beta \sum_{\tau_i \sim p(\tau)} \nabla_{\theta_i'} \ell_{\tau_i}(f_{\theta_i'})$
  • Almost the same performance, but ~33% faster
  • Note: if the meta-gradient were computed at the pre-update $\theta$ instead, the objective would reduce to plain multi-task learning

  32. MAML (recap of slide 30, repeated for side-by-side comparison with the FOMAML loop on the next slide)

  33. FOMAML
  • Inner loop:
  • Sample a batch of tasks $\tau_i \sim p(\tau)$
  • Sample $K$ samples per task
  • SGD: $\theta_i' = \theta - \alpha \nabla_\theta \ell_{\tau_i}(f_\theta)$
  • Outer loop:
  • Meta-objective: $\min_\theta \sum_{\tau_i \sim p(\tau)} \ell_{\tau_i}(f_{\theta_i'})$
  • SGD: $\theta \leftarrow \theta - \beta \sum_{\tau_i \sim p(\tau)} \nabla_{\theta_i'} \ell_{\tau_i}(f_{\theta_i'})$
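
The first-order variant as a self-contained sketch, under the same assumptions as the MAML sketch above: no graph is kept through the inner step, and the meta-gradient is taken at the adapted parameters and applied directly to theta:

    import torch
    from torch.func import functional_call

    def fomaml_outer_step(model, tasks, loss_fn, alpha=0.01, beta=0.001):
        params = dict(model.named_parameters())
        meta_grads = [torch.zeros_like(p) for p in params.values()]
        for (xs, ys), (xq, yq) in tasks:
            # inner loop: plain SGD step, no graph kept (the second-order term is dropped)
            loss_s = loss_fn(functional_call(model, params, (xs,)), ys)
            grads = torch.autograd.grad(loss_s, tuple(params.values()))
            adapted = {n: (p - alpha * g).detach().requires_grad_(True)
                       for (n, p), g in zip(params.items(), grads)}
            # meta-gradient evaluated at the post-update parameters theta'_i
            loss_q = loss_fn(functional_call(model, adapted, (xq,)), yq)
            g_q = torch.autograd.grad(loss_q, tuple(adapted.values()))
            meta_grads = [m + g for m, g in zip(meta_grads, g_q)]
        with torch.no_grad():
            for p, g in zip(params.values(), meta_grads):
                p -= beta * g   # apply the theta'-gradients directly to theta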

  34. Reptile
  • Same motivation:
  • Pre-training: learn an initialization
  • Fine-tuning: the initialization can be quickly adapted to new tasks

  35. Reptile
  • For each iteration:
  • Sample a task $\tau$ and get the corresponding loss $\ell_\tau$
  • Compute $\tilde\theta = U_\tau^k(\theta)$, i.e. $k$ steps of SGD/Adam on $\ell_\tau$
  • Update $\theta \leftarrow \theta + \epsilon\, (\tilde\theta - \theta)$, or with a batch of $n$ tasks, $\theta \leftarrow \theta + \epsilon\, \frac{1}{n} \sum_{i=1}^{n} (\tilde\theta_i - \theta)$
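
A minimal Reptile sketch (assumes `task_batches` yields k minibatches from one sampled task; names are illustrative):

    import copy
    import torch

    def reptile_step(model, task_batches, loss_fn, lr_inner=0.01, eps=0.1):
        # theta_tilde = U_tau^k(theta): k steps of SGD on a copy of the model
        fast = copy.deepcopy(model)
        opt = torch.optim.SGD(fast.parameters(), lr=lr_inner)
        for x, y in task_batches:
            opt.zero_grad()
            loss_fn(fast(x), y).backward()
            opt.step()
        # interpolate: theta <- theta + eps * (theta_tilde - theta)
        with torch.no_grad():
            for p, p_tilde in zip(model.parameters(), fast.parameters()):
                p += eps * (p_tilde - p)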

  36. Reptile
  • If $k = 1$, Reptile is similar to joint training on $\min_\theta \mathbb{E}_\tau [L_\tau]$ (inner step size absorbed into $U_\tau$):
    $g_{\mathrm{Reptile},\,k=1} = \theta - \tilde\theta = \theta - U_\tau(\theta) = \theta - (\theta - \nabla_\theta L_\tau(\theta)) = \nabla_\theta L_\tau(\theta)$
  • If $k > 1$, Reptile diverges from $\min_\theta \mathbb{E}_\tau [L_\tau]$:
    $\theta - U_\tau^k(\theta) \neq \nabla_\theta L_\tau(\theta)$ in general

  37. ANIL
  • ANIL (Almost No Inner Loop)
  • The question behind it: does MAML work through rapid learning or through feature reuse?

  39. ANIL
  • ANIL: only update the head (the last layer) in the inner loop; the body is updated only by the outer loop
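
A sketch of ANIL under the same assumptions as the MAML sketch earlier, plus one more: the classifier's parameters are named with the prefix 'head' (hypothetical naming). Only those enter the inner loop; the outer loop still updates everything:

    import torch
    from torch.func import functional_call

    def anil_outer_step(model, tasks, loss_fn, alpha=0.01, beta=0.001):
        params = dict(model.named_parameters())
        # hypothetical naming convention: classifier parameters start with 'head'
        head = {n: p for n, p in params.items() if n.startswith('head')}
        meta_loss = 0.0
        for (xs, ys), (xq, yq) in tasks:
            loss_s = loss_fn(functional_call(model, params, (xs,)), ys)
            # inner loop adapts only the head; the body relies on feature reuse
            grads = torch.autograd.grad(loss_s, tuple(head.values()), create_graph=True)
            adapted = dict(params)
            adapted.update({n: p - alpha * g for (n, p), g in zip(head.items(), grads)})
            meta_loss = meta_loss + loss_fn(functional_call(model, adapted, (xq,)), yq)
        # the outer loop still updates all parameters
        meta_grads = torch.autograd.grad(meta_loss, tuple(params.values()))
        with torch.no_grad():
            for p, g in zip(params.values(), meta_grads):
                p -= beta * g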
