A few meta learning papers
Machine Learning Journal Club, September 2017 Guy Gur-Ari
Meta Learning
Mechanisms for faster, better adaptation to new tasks
Integrate prior experience with a small amount of new information
Examples: a game player applied to new games, …
Approximations to this ideal sometimes work well
Learning to learn by gradient descent by gradient descent Andrychowicz et al.
[1606.04474]
θ — target parameters
φ — optimizer parameters
f — target (‘optimizee’) loss function
m — recurrent neural network (the optimizer) with parameters φ
[1606.04474]
$h_t = \tanh(W_h h_{t-1} + W_x x_t), \qquad y_t = W_y h_t$
[Karpathy]
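As a quick illustration, here is a minimal NumPy sketch of this vanilla RNN step; the dimensions and random initialization are placeholder assumptions, not from the slides.

import numpy as np

# Vanilla RNN step: h_t = tanh(W_h h_{t-1} + W_x x_t), y_t = W_y h_t.
# All sizes below are illustrative assumptions.
hidden_dim, input_dim, output_dim = 8, 4, 2
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_y = rng.normal(scale=0.1, size=(output_dim, hidden_dim))

def rnn_step(h_prev, x_t):
    h_t = np.tanh(W_h @ h_prev + W_x @ x_t)
    y_t = W_y @ h_t
    return h_t, y_t

# Unroll over a short input sequence.
h = np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):
    h, y = rnn_step(h, x)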
[Figure: the RNN m unrolled through time — at each step t it consumes x_t and the previous hidden state h_{t−1} and emits y_t and h_t]
Backpropagation through time
Ideal: $\mathcal{L}(\phi) = \mathbb{E}_f\left[f(\theta^*(f, \phi))\right]$
In practice: $\mathcal{L}(\phi) = \mathbb{E}_f\left[\sum_{t=1}^{T} w_t\, f(\theta_t)\right]$ with $w_t \equiv 1$, where $\theta_{t+1} = \theta_t + g_t$, $[g_t, h_{t+1}] = m(\nabla_t, h_t, \phi)$, and $\nabla_t = \nabla_\theta f(\theta_t)$
θ*(f, φ) — optimal target parameters for a given optimizer
m — RNN (2-layer LSTM)
h_t — RNN hidden state
[1606.04474]
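A minimal PyTorch sketch of the update rule $[g_t, h_{t+1}] = m(\nabla_t, h_t, \phi)$; treating coordinates as the batch dimension anticipates the coordinate-wise application described below. The single-layer cell and all sizes are my simplifications (the paper uses a 2-layer LSTM plus gradient preprocessing).

import torch
import torch.nn as nn

# m maps the optimizee gradient (one scalar per coordinate) to an update g_t.
# The coordinate index plays the role of the batch dimension, so the same
# small network is applied to every parameter with its own hidden state.
opt_rnn = nn.LSTMCell(input_size=1, hidden_size=20)  # phi lives here...
opt_out = nn.Linear(20, 1)                           # ...and here

def m(grad_t, state):
    """[g_t, h_{t+1}] = m(grad_t, h_t, phi); grad_t has shape (n_coords, 1)."""
    h, c = opt_rnn(grad_t, state)
    return opt_out(h), (h, c)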
The hidden state allows update behavior analogous to momentum
φ is trained by backpropagation through time
$\nabla_t = \nabla_\theta f(\theta_t), \qquad w_t \equiv 1$
[1606.04474]
(train for 100 optimization steps, unrolling BPTT 20 steps at a time)
[1606.04474]
Graph used for computing the gradient of the optimizer (with respect to φ)
[1606.04474]
The optimizer is applied coordinate-wise to the target parameters
This makes it invariant to parameter ordering
Each coordinate gets a separate hidden state, while the LSTM parameters φ are shared
Simplifying assumption: $\nabla_\phi \nabla_\theta f = 0$, i.e. gradients are not propagated through $\nabla_t$
[1606.04474]
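Putting the pieces together, a hedged sketch of one unrolled meta-training segment using the m defined above; detaching θ before computing the optimizee gradient implements the $\nabla_\phi \nabla_\theta f = 0$ assumption, and the toy quadratic loss stands in for a real optimizee.

import torch

# Meta-train phi on one unrolled segment (cf. 100 steps, unrolled 20 at a time).
meta_opt = torch.optim.Adam(list(opt_rnn.parameters()) + list(opt_out.parameters()))

def unrolled_segment(f, theta, state, T=20):
    """Run T optimizer steps and return sum_t w_t f(theta_t) with w_t = 1."""
    total = 0.0
    for _ in range(T):
        # Gradient of the optimizee, taken on a detached copy of theta:
        # this drops grad_phi grad_theta f, per the simplifying assumption.
        th = theta.detach().requires_grad_(True)
        grad, = torch.autograd.grad(f(th), th)
        g, state = m(grad.unsqueeze(1), state)
        theta = theta + g.squeeze(1)
        total = total + f(theta)  # w_t = 1
    return total, theta, state

theta = torch.randn(10)                              # optimizee parameters
state = (torch.zeros(10, 20), torch.zeros(10, 20))   # per-coordinate hidden state
f = lambda th: (th ** 2).sum()                       # toy quadratic optimizee
loss, theta, state = unrolled_segment(f, theta, state)
meta_opt.zero_grad(); loss.backward(); meta_opt.step()
# (For the next segment, detach theta and state before continuing.)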
Variability across runs comes from the initial target parameters and the choice of mini-batches
[1606.04474]
Separate optimizers for convolutional and fully-connected layers
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks Finn, Abbeel, Levine
[1703.03400]
Goal: fast adaptation to new tasks (‘few-shot learning’)
Find initial parameters θ such that a few gradient steps on a new task significantly improve the loss
Tasks are sampled from a distribution: $\mathcal{T}_i \sim p(\mathcal{T})$
[1703.03400]
The meta-objective is evaluated on new samples from the task (to avoid overfitting?)
[1703.03400]
Applicable to any model trained with gradient descent (e.g. regression, classification, reinforcement learning)
[1703.03400]
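A hedged sketch of the MAML inner/outer loop on toy linear-regression tasks; the task distribution, step sizes, and batch size are illustrative assumptions, not the paper's settings.

import torch

alpha, beta = 0.01, 0.001               # inner / meta step sizes (assumed)
w = torch.zeros(5, 1, requires_grad=True)

def sample_task():
    """Toy task: regress y = x @ w_true for a randomly drawn w_true."""
    w_true = torch.randn(5, 1)
    x_tr, x_val = torch.randn(10, 5), torch.randn(10, 5)
    return x_tr, x_tr @ w_true, x_val, x_val @ w_true

def loss_fn(w, x, y):
    return ((x @ w - y) ** 2).mean()

for _ in range(100):
    meta_loss = 0.0
    for _ in range(4):                  # batch of tasks T_i ~ p(T)
        x_tr, y_tr, x_val, y_val = sample_task()
        # Inner step: theta'_i = theta - alpha * grad_theta L_Ti(theta);
        # create_graph=True lets the meta-gradient flow through this step.
        g, = torch.autograd.grad(loss_fn(w, x_tr, y_tr), w, create_graph=True)
        w_adapted = w - alpha * g
        # Meta-objective is evaluated on new samples from the same task.
        meta_loss = meta_loss + loss_fn(w_adapted, x_val, y_val)
    # Outer step: theta <- theta - beta * grad_theta sum_i L_Ti(theta'_i)
    g_meta, = torch.autograd.grad(meta_loss, w)
    with torch.no_grad():
        w -= beta * g_meta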
Sine regression: a single task = regress a sine wave with a given underlying amplitude and phase
Pretrained baseline = train a single model on data from all tasks
Model is FC network with 2 hidden layers
[1703.03400]
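For concreteness, a sketch of sampling such sine tasks; the amplitude and phase ranges follow the paper's setup as I recall it, while the sign convention, K, and function name are my assumptions.

import numpy as np

rng = np.random.default_rng(0)

def sample_sine_task(k_shot=10):
    """One task: K samples from y = A * sin(x - phase) with task-specific A, phase."""
    amplitude = rng.uniform(0.1, 5.0)
    phase = rng.uniform(0.0, np.pi)
    x = rng.uniform(-5.0, 5.0, size=k_shot)
    return x, amplitude * np.sin(x - phase)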
Few-shot classification: each task is a classification problem over a different subset of classes
[1703.03400]
Reward = negative squared distance from the goal position; for each task, the goal is placed randomly
[1703.03400]
Overcoming catastrophic forgetting in neural networks Kirkpatrick et al.
[1612.00796]
When a network is trained on task A followed by task B, it typically forgets A (‘catastrophic forgetting’)
Idea: slow down learning on weights that are important for A
[1612.00796]
$L(\theta) = L_B(\theta) + \sum_i \frac{\lambda}{2} F_i \left(\theta_i - \theta^*_{A,i}\right)^2$
λ — hyperparameter; F — diagonal of the Fisher information matrix; θ*_A — parameters learned for task A
$F_i \approx \dfrac{\partial^2 L_A}{\partial \theta_i^2}$
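A minimal sketch of this objective; iterating over per-tensor parameter lists is my assumption about how θ, θ*_A, and diag(F) are stored.

import torch

def ewc_loss(loss_b, params, params_star_a, fisher_diag, lam):
    """L_B(theta) + sum_i (lam / 2) * F_i * (theta_i - theta*_A,i)^2."""
    penalty = sum(
        (f * (p - p_star) ** 2).sum()
        for p, p_star, f in zip(params, params_star_a, fisher_diag)
    )
    return loss_b + 0.5 * lam * penalty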
$L(\theta) = -\log p(\theta \mid D_A, D_B) = -\log p(D_B \mid \theta) - \log p(\theta) - \log p(D_A \mid \theta) + \log p(D_A, D_B) \sim L_B(\theta) - \log p(D_A \mid \theta)$
$F_{ij} = \mathbb{E}_{x \sim p_\theta}\left[\nabla_{\theta_i} \log p_\theta(x)\, \nabla_{\theta_j} \log p_\theta(x)\right]$
$-\log p(D_A \mid \theta) = -\sum_i \log p_\theta(x_i) \sim -\sum_x p_A(x) \log p_\theta(x)$
Now suppose $p_{\theta^*} = p_A$; then
$-\sum_x p_{\theta^*}(x) \log p_{\theta^* + d\theta}(x) = S(p_{\theta^*}) + \tfrac{1}{2}\, d\theta^T F\, d\theta + \cdots$
where $S$ is the entropy
$L(\theta) \sim L_B(\theta) + \tfrac{1}{2}\, d\theta^T F\, d\theta, \qquad d\theta = \theta - \theta^*_A$
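Correspondingly, a hedged sketch of estimating diag(F) from task-A data as the mean squared gradient of the log-likelihood; log_prob_fn is a hypothetical callable returning the scalar log p_θ(x).

import torch

def fisher_diagonal(log_prob_fn, params, data_a):
    """diag(F) ≈ mean over x in D_A of (d log p_theta(x) / d theta_i)^2."""
    fisher = [torch.zeros_like(p) for p in params]
    for x in data_a:
        grads = torch.autograd.grad(log_prob_fn(x), params)
        for f, g in zip(fisher, grads):
            f += g ** 2
    return [f / len(data_a) for f in fisher]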