Meta Learning: A Brief Introduction
Xiachong Feng TG Ph.D. Student 2018-12-01
Outline
Introduction to Meta Learning
Types of Meta-Learning Models
Papers:
Optimization as a Model for Few-Shot Learning (ICLR 2017)
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (ICML 2017)
Meta-Learning for Low-Resource Neural Machine Translation (EMNLP 2018)
[Figure: Meta Learning / Learning to Learn, in relation to Machine Learning and Reinforcement Learning.]
Meta Learning / Learning to learn: https://zhuanlan.zhihu.com/p/28639662
Learning to Learn: https://zhuanlan.zhihu.com/p/27629294
Machine or deep learning learns a model; meta learning learns the learner itself (e.g. the learning rate, the decay schedule, ...).
Meta-learning builds learning methods tailored to specific circumstances, instead of having all models follow the same techniques.
What’s New in Deep Learning Research: Understanding Meta-Learning
Types of Meta-Learning Models (from the article above)

Few Shots Meta-Learning: train on datasets mimicking the low-data regime, so the model learns from tiny data. Papers: Optimization as a Model for Few-Shot Learning (ICLR 2017); One-shot Learning with Memory-Augmented Neural Networks (ICML 2016).

Optimizer Meta-Learning: learn how to optimize a neural network to better accomplish a task. One network (the meta-learner) learns to update another network (the learner) so that the learner effectively learns the task. Paper: Learning to Learn by Gradient Descent by Gradient Descent (NIPS 2016).

Metric Meta-Learning: learn a metric space in which learning is particularly efficient. This approach can be seen as a subset of few-shots meta-learning, in which a learned metric space is used to evaluate the quality of learning with a few examples, as in one-shot image recognition.

Recurrent Model Meta-Learning: train a recurrent model (e.g. an LSTM) that will process a dataset sequentially and then process new inputs from the task; often applied to reinforcement learning.

Initializations Meta-Learning: learn an initial model that can be effectively fine-tuned from a small number of examples. Papers: Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (ICML 2017); Meta-Learning for Low-Resource Neural Machine Translation (EMNLP 2018).
Papers in this talk:
Supervised Meta-Learning: Optimization as a Model for Few-Shot Learning (ICLR 2017)
Modern Meta-Learning: Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (ICML 2017)
Meta-Learning in NLP: Meta-Learning for Low-Resource Neural Machine Translation (EMNLP 2018)
Optimization as a Model for Few-Shot Learning. Sachin Ravi, Hugo Larochelle (Twitter). ICLR 2017.
Few-shot setting: each task gives a small training set of $k$ examples, $D = \{(x_1, y_1), \dots, (x_k, y_k)\}$.
LSTM recap: the new cell state is formed by forgetting the things we decided to forget earlier and adding the new candidate values:
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
https://www.jianshu.com/p/9dc9f41f0b29
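For reference, the full set of LSTM gate equations behind this update (standard formulation, not spelled out on the slide):

```latex
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{input gate} \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) && \text{candidate values} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{new cell state}
\end{aligned}
```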
Learner: a neural network $f(x) \to y$ (e.g. image $\to$ label), normally trained with a gradient-based optimizer (SGD, Adam, ...).
Meta-learning suggests framing the learning problem at two levels (Thrun, 1998; Schmidhuber et al., 1997):
1. Quick acquisition of knowledge within each separate task presented. (Fast adaptation)
2. Slower extraction of information learned across all the tasks. (Learning)
Motivation: gradient-based optimization (such as SGD and Adam) in high-capacity classifiers requires many iterative steps over many examples to perform well, which is exactly what the few-shot regime cannot afford.
Idea: propose an LSTM-based meta-learner model to learn the exact optimization algorithm used to train another learner neural-network classifier in the few-shot regime. The key observation is that the LSTM cell-state update resembles the gradient-based update.
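Written out, the correspondence the paper draws between the two updates:

```latex
\begin{aligned}
\text{LSTM cell state:} \quad & c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
\text{Gradient descent:} \quad & \theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} \mathcal{L}_t
\end{aligned}
```

The gradient step is the cell-state update with $c_t = \theta_t$, $f_t = 1$, $i_t = \alpha_t$, and $\tilde{c}_t = -\nabla_{\theta_{t-1}} \mathcal{L}_t$; the meta-learner generalizes it by letting an LSTM learn $f_t$ and $i_t$.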
Learner: a neural network classifier. Meta-learner: learns the optimization algorithm that produces the learner's parameters.
Gradient-based optimization: $\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} \mathcal{L}$.
Meta-learner optimization: $\theta_t = \mathrm{metalearner}(\theta_{t-1}, \nabla_{\theta_{t-1}} \mathcal{L})$ — an LSTM-based meta-learner that knows how to quickly optimize the parameters of the learner neural network classifier.
Episode structure: the loss and gradient fed to the meta-learner are given by the learner; within each episode, $D_{\mathrm{train}}$ is used to train the learner and $D_{\mathrm{test}}$ is used to train the meta-learner.
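A minimal sketch of how such episodes can be sampled in the N-way k-shot setting (the function and its arguments are illustrative, not from the paper's code; `dataset` is assumed to be a list of (example, label) pairs):

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=1, k_query=15):
    """Sample one N-way k-shot episode: D_train for the learner,
    D_test for the meta-learner."""
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)
    classes = random.sample(list(by_class), n_way)   # pick N classes
    d_train, d_test = [], []
    for new_label, c in enumerate(classes):
        examples = random.sample(by_class[c], k_shot + k_query)
        d_train += [(x, new_label) for x in examples[:k_shot]]
        d_test  += [(x, new_label) for x in examples[k_shot:]]
    return d_train, d_test
```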
Learner Neural network classifier (!"#$) Loss ℒ
Gradient ∇1234ℒ Meta-learner Learn optimization algorithm(Θ6#$) Loss ℒ Gradient ∇1234ℒ Output of meta learner 7" Output of meta learner 7" Learner Neural network classifier (!")
Update learner
Learner Neural network classifier (!") Loss ℒ",-" Θ6 = Θ6#$ − :∇;<34ℒ",-" Current param !"#$ Learner Update Meta-Learner Update
The meta-learner additionally learns the initial cell state $c_0$, i.e. it learns to determine the optimal initial weights of the learner.
Testing (meta-test): the learner is initialized with the learned initial weights $\theta_0$; at each step it computes the loss $\mathcal{L}$ and gradient $\nabla_{\theta_{t-1}} \mathcal{L}$ on the new task's training set, and the trained meta-learner ($\Theta$) outputs the updated parameters $\theta_t$. Only the learner is updated; the meta-learner stays fixed, and the adapted learner is finally evaluated on the task's test set.
Parameter sharing: because the meta-learner produces updates for deep neural networks, which consist of tens of thousands of parameters, to prevent an explosion of meta-learner parameters we need to employ some sort of parameter sharing — the same LSTM is applied coordinate-wise, with a separate hidden state per parameter.
Batch normalization is used to speed up training by reducing internal covariate shift within the learner's hidden layers.
Takeaway: the LSTM meta-learner learns both the update rule and how to determine the optimal initial weights of the learner.
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. Chelsea Finn, Pieter Abbeel, Sergey Levine (University of California, Berkeley). ICML 2017.
Model-agnostic: unlike the previous approach, MAML does not learn an update function or learning rule; it works with any model trained with gradient descent.
Intuition: if the internal representations are broadly suitable for many tasks, simply fine-tuning the parameters slightly can produce good results.
MAML maximizes the sensitivity of the loss functions of new tasks with respect to the parameters: when the sensitivity is high, small local changes to the parameters can lead to large improvements in the task loss.
Goal: train a model on a variety of learning tasks, such that it can quickly adapt to and solve new learning tasks using only a small number of training samples.
Task definition: each task $\mathcal{T} = \{\mathcal{L}(x_1, a_1, \dots, x_H, a_H),\ q(x_1),\ q(x_{t+1} \mid x_t, a_t),\ H\}$ consists of a loss function $\mathcal{L}$, a distribution over initial observations $q(x_1)$, a transition distribution $q(x_{t+1} \mid x_t, a_t)$, and an episode length $H$.
"#$
Sample tasks from p(!)
Train (
) on the *+,-." using
gradient based method
!
/
!0 !0: !
/:
2/
3
20
3
()4
5
()6
5
ℒ8
4(()4 5 )
ℒ8
6(()6 5 )
2 2
Update ! by:
"#$
%
"#&
%
ℒ(
$("#$ % )
ℒ(
&("#& % )
Object function
+ is easy to fine-tune !
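A minimal runnable sketch of these updates (a toy linear classifier and a synthetic task sampler stand in for the paper's models; names and hyperparameters are illustrative):

```python
import torch
import torch.nn.functional as F

def forward(params, x):
    w, b = params
    return x @ w + b                      # logits of a linear classifier

def sample_task(n_way=5, k_shot=1, dim=16):
    # Synthetic task: classes are Gaussian clouds around random prototypes.
    protos = torch.randn(n_way, dim)
    xs = protos.repeat_interleave(k_shot, 0) + 0.1 * torch.randn(n_way * k_shot, dim)
    ys = torch.arange(n_way).repeat_interleave(k_shot)
    xq = protos + 0.1 * torch.randn(n_way, dim)
    yq = torch.arange(n_way)
    return (xs, ys), (xq, yq)             # support set, query set

def maml_step(params, alpha=0.1, beta=0.01, num_tasks=4):
    meta_loss = 0.0
    for _ in range(num_tasks):
        (xs, ys), (xq, yq) = sample_task()
        # Inner update: theta_i' = theta - alpha * grad_theta L_Ti(f_theta)
        loss = F.cross_entropy(forward(params, xs), ys)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        adapted = [p - alpha * g for p, g in zip(params, grads)]
        # Meta-objective term: L_Ti(f_{theta_i'}) on held-out query data
        meta_loss = meta_loss + F.cross_entropy(forward(adapted, xq), yq)
    # Outer update: theta <- theta - beta * grad_theta sum_i L_Ti(f_{theta_i'})
    meta_grads = torch.autograd.grad(meta_loss, params)
    return [(p - beta * g).detach().requires_grad_(True)
            for p, g in zip(params, meta_grads)]

params = [torch.zeros(16, 5, requires_grad=True), torch.zeros(5, requires_grad=True)]
for _ in range(100):
    params = maml_step(params)
```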
Contribution: an algorithm for meta-learning that trains a model's parameters such that a small number of gradient updates will lead to fast learning on a new task.
Meta-Learning for Low-Resource Neural Machine Translation. Jiatao Gu, Yong Wang, Yun Chen, Kyunghyun Cho, Victor O.K. Li. The University of Hong Kong; New York University. EMNLP 2018.
Background: meta-learning seeks initial parameters that enable fast adaptation on new training data; a major use case of meta-learning has been few-shot (or one-shot) learning.
This paper proposes a meta-learning algorithm for low-resource neural machine translation (NMT): it extends MAML to learn initial parameters that facilitate fast adaptation for a new language pair with a minimal amount of training examples.
Target Tasks: !" Source Tasks: {!$, !&, … , !(} !$ : GermanàEnglish !& : FrenceàEnglish ….. !* : DutchàEnglish ….. !( : PolishàEnglish !" :TurkishàEnglish
Meta-training step:
1. Sample one task $\mathcal{T}^k$ from the source tasks $\{\mathcal{T}^1, \dots, \mathcal{T}^K\}$, e.g. Dutch→English.
2. From that task, sample a training set $D_{\mathcal{T}^k}$ and a test set $D'_{\mathcal{T}^k}$ of sentence pairs.
3. Train NMT($\theta$) on $D_{\mathcal{T}^k}$ to obtain the adapted model NMT($\theta_k'$), then evaluate it on $D'_{\mathcal{T}^k}$ and update $\theta$ — the MAML inner and outer updates.
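Structurally, one such meta-training step looks like the MAML sketch above with language-pair tasks. In this hypothetical sketch, `translation_loss` is a dummy differentiable stand-in for $-\log p(Y \mid X; \theta)$ under a real NMT model, and `corpora` is assumed to map each language pair to a list of sentence pairs:

```python
import random
import torch

SOURCE_TASKS = ["De-En", "Fr-En", "Nl-En", "Pl-En"]   # illustrative pairs

def translation_loss(theta, batch):
    # Stand-in for the NMT negative log-likelihood over the batch.
    return sum((p ** 2).sum() for p in theta) * len(batch)

def meta_step(theta, corpora, alpha=1e-3, beta=1e-3):
    task = random.choice(SOURCE_TASKS)                 # sample T_k
    d_train = random.sample(corpora[task], 4)          # D_Tk
    d_test = random.sample(corpora[task], 4)           # D'_Tk
    loss = translation_loss(theta, d_train)
    grads = torch.autograd.grad(loss, theta, create_graph=True)
    adapted = [p - alpha * g for p, g in zip(theta, grads)]   # NMT(theta_k')
    meta_loss = translation_loss(adapted, d_test)             # evaluate on D'_Tk
    meta_grads = torch.autograd.grad(meta_loss, theta)        # outer update
    return [(p - beta * g).detach().requires_grad_(True)
            for p, g in zip(theta, meta_grads)]
```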
!" : GermanàEnglish !# : FrenceàEnglish !$ : DutchàEnglish !% : PolishàEnglish
….. ….. ….. ….. MALML
Target Tasks: !" !" :TurkishàEnglish initial parameters #"
Objective function: given the source tasks, find initial parameters from which fine-tuning on a sampled task's training set minimizes the loss on that task's test set. Note that the maximum likelihood criterion often used for training a usual NMT system discourages the newly learned model from deviating too much from the initial parameters.
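Written as a MAML-style objective (notation follows the MAML equations above; the exact form here is an assumption, not a quote from the paper):

```latex
\mathcal{L}(\theta) \;=\; \mathbb{E}_{k}\; \mathbb{E}_{D_{\mathcal{T}^k},\, D'_{\mathcal{T}^k}}
\Big[\, \mathcal{L}^{D'_{\mathcal{T}^k}}\!\big(\theta - \alpha \nabla_\theta\, \mathcal{L}^{D_{\mathcal{T}^k}}(\theta)\big) \Big]
```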
Target Tasks: !" Source Tasks: {!$, !&, … , !(} !$ : GermanàEnglish !& : FrenceàEnglish ….. !* : DutchàEnglish ….. !( : PolishàEnglish !" :TurkishàEnglish
Experimental setup: the baseline is multilingual transfer learning, which trains on high-resource language pairs (Fr-En, Pt-En, Es-En) and fine-tunes the system for each target language pair (Ro-En, Lv-En). The vocabulary is shared across all languages, including the source and target tasks.
Lexical representation follows Universal Neural Machine Translation for Extremely Low Resource Languages (NAACL 2018).
[Figure: Universal Lexical Representation — each language $i$ has its own query embedding over its vocabulary $|V_i|$; queries attend to a shared key embedding, and the attention weights mix a shared universal embedding.]
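A sketch of that lookup (sizes, temperature, and names are illustrative assumptions, not the paper's released code):

```python
import torch

vocab_l, n_universal, dim, tau = 1000, 500, 64, 0.05
query_emb = torch.randn(vocab_l, dim)          # language-specific queries (|V_l| x d)
key_emb = torch.randn(n_universal, dim)        # shared key embeddings
universal_emb = torch.randn(n_universal, dim)  # shared universal embeddings

def ulr_lookup(token_ids):
    # Attend from each token's query to the universal keys, then mix
    # the universal embeddings with the attention weights.
    q = query_emb[token_ids]                   # (batch, d)
    alpha = torch.softmax(q @ key_emb.T / tau, dim=-1)
    return alpha @ universal_emb               # (batch, d)
```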
Languages include Greek (El), Spanish (Es), Estonian (Et), French (Fr), Hungarian (Hu), Italian (It), Lithuanian (Lt), Dutch (Nl), Polish (Pl), Portuguese (Pt), Slovak (Sk), Slovene (Sl), Swedish (Sv), and Russian (Ru).
Training details: models are trained with Adam. For each target language, a subset of the training examples is sampled to form a low-resource task. Each experiment is repeated and the average score is reported; training is early-stopped on a validation set and evaluated on a test set.
Fine-tuning variants: adapt only the embedding, the embedding plus encoder (emb+enc), or all the parameters including the decoder. The meta-learned initialization transfers across target tasks regardless of which target task was used for early stopping.
Result: meta-learning consistently outperforms the multilingual transfer-learning approaches.
[Figure: BLEU vs. the size of the target task's training set, for different target languages.]
[Figure: validation BLEU during meta-training — performance improves at first and eventually degrades, as the model overfits to the source tasks.]
Conclusion: meta-learning yields an initialization for low-resource NMT that supports fast adaptation to new language pairs.