CSC2547: Learning to Search
Lecture 2: Background and gradient estimators. Sept 20, 2019
Admin: Course email: learn.search.2547@gmail.com. Piazza: piazza.com/utoronto.ca/fall2019/csc2547hf (a good place to find project partners).
questions, being helpful or knowledgeable (for letters of rec)
participate in projects or presentations?
paired with someone enrolled. Use piazza to find partners.
project?
fake / toy data
RELAX, discuss where and why it stalled out
intermediate quantities (hidden units, latent variables) to model or produce high-dimensional data (images, sounds, text)
Discrete hiddens or parameters are a no-no
words
action to take.
arbitrary size, like programs, graphs, or large texts.
Adversarial Generation of Natural Language. Sai Rajeswar, Sandeep Subramanian, Francis Dutil, Christopher Pal, Aaron Courville, 2017
“We successfully trained the RL-NTM to solve a number of algorithmic tasks that are simpler than the ones solvable by the fully differentiable NTM.” Reinforcement Learning Neural Turing Machines Wojciech Zaremba, Ilya Sutskever, 2015
information the more parameters you have
the original function
problem than you think
Source: xkcd
can’t use backprop to get gradients
that we don’t know which direction to move to improve
the structure of the function being optimized
function
An illustration of the search space of a sequential tagging example that assigns a part-of-speech tag sequence to the sentence “John saw Mary.” Each state represents a partial
search space to specify the next state.
can’t be evaluated on partial inputs.
satisfiable
cons
to try and when
solutions
learning can all be cast as optimizing an optimization procedure.
[Diagram: unrolled training. Initialize params θ1; repeated gradient updates ∇L produce θ2, …, θt−1, θt; then evaluate validation loss. Inputs: regularization params, training data, validation data.]
1. Snoek, Larochelle and Adams. Practical Bayesian Optimization of Machine Learning Algorithms. NIPS 2012.
2. Golovin et al. Google Vizier: A Service for Black-Box Optimization. SIGKDD 2017.
3. Bengio. Gradient-Based Optimization of Hyperparameters. Neural Computation 2000.
4. Domke. Generic Methods for Optimization-Based Modeling. AISTATS 2012.
Optimized training schedules
[Plot: learning rate vs. schedule index for Layers 1–4.]
P(digit | image)
up training of GANs. E.g. iMAML
Bayesian decision theory: Planning over what you’ll learn and do.
reduction)
did one-step lookahead with a simple approximation, a lot you could take from the approximate dynamic programming literature to make things work better in practice with roughly linear slowdowns I think.”
flexible
standard gradient estimators, optimizing a surrogate function
gradients through discrete optimization, involving a local discrete search
ĝ = [X⊤X]⁻¹ X⊤y = [Iσ²]⁻¹ X⊤y = [Iσ²]⁻¹ (ϵσ)⊤y = ∑ᵢ ϵᵢ yᵢ / σ = ∑ᵢ ϵᵢ f(ϵᵢσ) / σ,   ϵ ∼ 𝒩(0, I)
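The regression view above can be sketched in a few lines of NumPy (a minimal sketch; the function name and defaults are illustrative, not from the lecture): perturb θ with Gaussian noise, evaluate f, and regress the function values on the perturbations.

```python
import numpy as np

def es_gradient(f, theta, sigma=0.1, n_samples=4000, seed=0):
    """Gradient estimate of f at theta by linear regression on perturbations.

    With X = eps * sigma and E[X^T X] = I * sigma^2, the least-squares
    solution [X^T X]^{-1} X^T y reduces (in expectation) to averaging
    eps_i * f(theta + sigma * eps_i) / sigma over samples.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_samples, theta.shape[0]))
    y = np.array([f(theta + sigma * e) for e in eps])
    return eps.T @ y / (n_samples * sigma)
```

Note this only needs function evaluations, no derivatives of f, which is why the same estimator appears in evolution-strategies-style black-box optimization.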
each step
experience replay
any gradient-free optimization algorithm
(Tony) Wu, Jiaming Song
functions
RL approaches (trial and error) don’t work
correspondence)
dependent types
about that.
about that.
beliefs about mathematical statements)
software updates of yourself that will provably improve expected performance. Run one whenever found.
computable prior: set of all computable programs.
Will Grathwohl Dami Choi Yuhuai Wu Geoff Roeder David Duvenaud
is embarrassingly competitive
Shahriari et al., 2016
differentiable or even unknown
ĝ_REINFORCE[f] = f(b) ∂/∂θ log p(b|θ),   b ∼ p(b|θ)
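A minimal single-parameter sketch of this estimator, assuming a Bernoulli p(b|θ) with p = sigmoid(θ) (the names and setup here are mine, not the lecture's):

```python
import numpy as np

def reinforce_grad(f, theta, n_samples=5000, seed=0):
    """REINFORCE estimate of d/dtheta E_{b ~ Bernoulli(sigmoid(theta))}[f(b)].

    ghat = f(b) * d/dtheta log p(b|theta); for this parameterization the
    score function simplifies to d/dtheta log p(b|theta) = b - p.
    """
    rng = np.random.default_rng(seed)
    p = 1.0 / (1.0 + np.exp(-theta))
    b = (rng.random(n_samples) < p).astype(float)  # b ~ p(b|theta)
    return np.mean(f(b) * (b - p))                 # average of per-sample ghat
```

The estimator is unbiased for any f, including discrete or black-box f, which is exactly why it applies where backprop does not; the price is high variance.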
huge continuous models
and differentiable
differentiable
ĝ_reparam[f] = ∂f/∂b · ∂b/∂θ,   b ∼ p(b|θ)
Source: Kingma’s NIPS 2015 workshop slides
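A sketch of the reparameterization trick for a Gaussian location parameter (assuming b ∼ N(θ, σ²) and that we can evaluate f′; names are illustrative):

```python
import numpy as np

def reparam_grad(df, theta, sigma=1.0, n_samples=5000, seed=0):
    """Reparameterization estimate of d/dtheta E_{b ~ N(theta, sigma^2)}[f(b)].

    Writing b = theta + sigma * eps with eps ~ N(0, 1), the chain rule gives
    ghat = df/db * db/dtheta, and db/dtheta = 1 for a location parameter.
    """
    rng = np.random.default_rng(seed)
    b = theta + sigma * rng.standard_normal(n_samples)
    return np.mean(df(b))  # df/db at the sampled b, times db/dtheta = 1
```

Because the estimator uses the derivative of f, it is typically far lower variance than REINFORCE, but it requires f differentiable and b continuous in θ.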
ĝ_reparam[f] = ∂f/∂b · ∂b/∂θ requires f differentiable and does not apply to discrete models.
ĝ_concrete[f] = ∂f/∂σ(z/t) · ∂σ(z/t)/∂θ,   z = T(θ, ϵ), ϵ ∼ p(ϵ)
(z ∼ p(z|θ); the relaxed sample σ(z/t) stands in for the hard sample b in f(b))
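A sketch of the binary case (assuming logistic noise; the function name is illustrative): the hard sample b = H(z) is replaced by the smooth σ(z/t), which gradients can flow through, with the relaxation tightening as t → 0.

```python
import numpy as np

def binary_concrete(theta, t=0.5, seed=0):
    """Binary Concrete-style relaxation of a Bernoulli(sigmoid(theta)) sample.

    z = theta + Logistic(0, 1) noise; b = H(z) = 1[z > 0] is the hard sample,
    and sigmoid(z / t) is its smooth relaxation (approaching b as t -> 0).
    """
    rng = np.random.default_rng(seed)
    u = rng.random()
    z = theta + np.log(u) - np.log1p(-u)   # z = T(theta, eps), logistic noise
    b_hard = float(z > 0)                  # discrete sample H(z)
    b_soft = 1.0 / (1.0 + np.exp(-z / t))  # relaxed sample sigma(z/t)
    return b_hard, b_soft
```

The gradient is biased (we differentiate the relaxation, not the true discrete objective), and the temperature t trades bias against gradient variance.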
corr(g, c) > 0
ĝ_LAX = ĝ_REINFORCE[f] − ĝ_REINFORCE[cφ] + ĝ_reparam[cφ]
      = [f(b) − cφ(b)] ∂/∂θ log p(b|θ) + ∂/∂θ cφ(b)
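A sketch of LAX for a Gaussian b ∼ N(θ, σ²), with a hand-supplied surrogate c and its derivative dc (in LAX proper, cφ is a trained network; names here are illustrative):

```python
import numpy as np

def lax_grad(f, c, dc, theta, sigma=1.0, n_samples=4000, seed=0):
    """LAX estimate of d/dtheta E_{b ~ N(theta, sigma^2)}[f(b)].

    ghat = [f(b) - c(b)] * d/dtheta log p(b|theta) + d/dtheta c(b), with
    b = theta + sigma * eps reparameterized so the last term is just c'(b).
    Unbiased for any surrogate c; low variance when c approximates f well.
    """
    rng = np.random.default_rng(seed)
    b = theta + sigma * rng.standard_normal(n_samples)
    score = (b - theta) / sigma**2       # d/dtheta log N(b; theta, sigma^2)
    return np.mean((f(b) - c(b)) * score + dc(b))
```

With c = 0 this reduces to REINFORCE; with c = f it reduces to the reparameterization estimator.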
We can form unbiased estimates of the gradient of the variance of ĝ_LAX with respect to φ, so the surrogate cφ can be trained jointly with θ by stochastic optimization.
b = H(z),   z ∼ p(z|θ)
ĝ_RELAX = [f(b) − cφ(z̃)] ∂/∂θ log p(b|θ) + ∂/∂θ cφ(z) − ∂/∂θ cφ(z̃),   z̃ ∼ p(z|b, θ)
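A single-sample sketch for a Bernoulli variable with p = sigmoid(θ), drawing the conditional sample z̃ ∼ p(z|b, θ) by inverse-CDF of the logistic (derivable in closed form). Here the surrogate c and its derivative dc are supplied by hand; in RELAX proper cφ is a neural net trained to minimize the estimator's variance. All names are illustrative.

```python
import numpy as np

def relax_grad_sample(f, c, dc, theta, rng):
    """One-sample RELAX estimate for d/dtheta E_{b ~ Bern(sigmoid(theta))}[f(b)].

    ghat = [f(b) - c(z_tilde)] * d/dtheta log p(b|theta)
           + d/dtheta c(z) - d/dtheta c(z_tilde),
    with z ~ p(z|theta), b = H(z) = 1[z > 0], z_tilde ~ p(z|b, theta).
    """
    p = 1.0 / (1.0 + np.exp(-theta))
    u, v = rng.random(), rng.random()
    z = theta + np.log(u) - np.log1p(-u)           # z = theta + Logistic noise
    b = float(z > 0)                               # hard sample b = H(z)
    # inverse-CDF sample of z conditioned on b:
    w = b * ((1 - p) + v * p) + (1 - b) * v * (1 - p)
    z_tilde = theta + np.log(w) - np.log(1 - w)
    # pathwise derivatives: dz/dtheta = 1; dz_tilde/dtheta via w(theta)
    dw = p * (1 - p) * (b * (v - 1) - (1 - b) * v)
    dz_tilde = 1.0 + dw / (w * (1 - w))
    score = b - p                                  # d/dtheta log p(b|theta)
    return (f(b) - c(z_tilde)) * score + dc(z) - dc(z_tilde) * dz_tilde
```

Averaging many such samples gives an unbiased estimate for any surrogate c, without ever differentiating f.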
surrogate only adapts temperature param.
balances REINFORCE variance and reparameterization variance
always smooth
combining REBAR and RELAX:
log p(x) ≥ L(θ) = E_{q(b|x)}[log p(x|b) + log p(b) − log q(b|x)]
cφ(z) = f(σλ(z)) + rρ(z)
estimate of the value function as a control variate
cφ
ĝ_AC = ∑_{t=1}^{T} ∂ log π(aₜ|sₜ, θ)/∂θ [ ∑_{t′=t}^{T} r_{t′} − cφ(sₜ) ]
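Given one recorded trajectory, the actor-critic gradient estimate is a reward-to-go sum minus a state-value baseline. A minimal sketch (array names are mine; in practice grad_logp comes from autodiff through the policy):

```python
import numpy as np

def actor_critic_grad(grad_logp, rewards, values):
    """Policy-gradient estimate with a state-value baseline c(s_t).

    grad_logp: (T, d) array of d/dtheta log pi(a_t | s_t, theta)
    rewards:   length-T rewards r_t
    values:    length-T baseline values c(s_t)
    """
    rewards = np.asarray(rewards, dtype=float)
    to_go = np.cumsum(rewards[::-1])[::-1]         # sum_{t'=t}^{T} r_{t'}
    adv = to_go - np.asarray(values, dtype=float)  # advantage estimate
    return np.sum(np.asarray(grad_logp) * adv[:, None], axis=0)
```

A baseline that exactly matched the reward-to-go would zero out the estimate, which is the intuition for why a good baseline reduces variance.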
squared error from reward or value function. We directly minimize variance.
ĝ_LAX = ∑_{t=1}^{T} [ ∂ log π(aₜ|sₜ, θ)/∂θ ( ∑_{t′=t}^{T} r_{t′} − cφ(sₜ, aₜ) ) + ∂/∂θ cφ(sₜ, aₜ) ]
“The Mirage of Action-Dependent Baselines in Reinforcement Learning” (Tucker et al. 2018). Better experiments would examine the high-dimensional action regime.
unknown, or not differentiable
discrete
f(b),   b ∼ p(b|θ)
need to wait for surrogate to adapt
still awkward in TF and PyTorch
well on categorical variables.
variables between decision branches without making true function jagged + hard to relax