Deep Reinforcement Learning
- M. Soleymani
Sharif University of Technology, Fall 2017. Slides are based on lectures by Fei-Fei Li and colleagues, cs231n, Stanford 2017, and on lectures by Sergey Levine, cs294-112, Berkeley 2016.
Supervised Learning
– x is data, y is label
Unsupervised Learning
– Just data, no labels!
Reinforcement Learning
– Concerned with taking sequences of actions
– Problems involving an agent interacting with an environment, which provides numeric reward signals
– Goal: learn how to take actions that maximize reward
Example (robotics):
– Observations: camera images, joint angles
– Actions: joint torques
– Rewards: stay balanced, navigate to target locations, serve and protect humans
What makes RL different from other machine learning paradigms?
– You don't have full access to the function you're trying to optimize
– You are interacting with a stateful world: the input $x_t$ depends on your previous actions
At each time step $t$:
– Agent selects action $a_t$
– Environment samples reward $r_t \sim R(\cdot \mid s_t, a_t)$
– Environment samples next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$
– Agent receives reward $r_t$ and next state $s_{t+1}$
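As an illustration of this loop, here is a minimal self-contained Python sketch; the ToyEnv class and its dynamics are made up purely for illustration (they are not from the slides), and the "agent" simply acts at random.

```python
import random

# A tiny hypothetical 2-state environment, just to make the loop runnable.
class ToyEnv:
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        reward = 1.0 if (self.state == 0 and action == 1) else 0.0
        self.state = 1 - self.state if action == 1 else self.state
        done = random.random() < 0.1               # episodes end at random
        return self.state, reward, done

env = ToyEnv()
state, total_reward, done = env.reset(), 0.0, False
while not done:
    action = random.choice([0, 1])                 # agent selects action a_t
    next_state, reward, done = env.step(action)    # env samples r_t and s_{t+1}
    total_reward += reward                         # agent receives reward r_t ...
    state = next_state                             # ... and the next state s_{t+1}
print("episode return:", total_reward)
```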
Total discounted reward:
$$r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots = \sum_{t=0}^{\infty} \gamma^t r_t$$
– Maximize the expected sum of rewards!
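A quick sketch of computing this discounted sum for a finite list of sampled rewards (the reward values below are made up):

```python
# Minimal sketch: the discounted return r_0 + γ r_1 + γ² r_2 + ...
def discounted_return(rewards, gamma=0.99):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 0.0, 5.0], gamma=0.9))   # 1 + 0.9**3 * 5 = 4.645
```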
Value function: $V^\pi(s) = \mathbb{E}\left[\sum_{t\ge 0} \gamma^t r_t \mid s_0 = s, \pi\right]$
Q-function: $Q^\pi(s, a) = \mathbb{E}\left[\sum_{t\ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$
– It is simply the expected sum of discounted rewards upon starting in state $s$ (and taking action $a$) and then acting according to $\pi$
Bellman Equations
$$V^*(s) = \max_{a \in A(s)} \mathbb{E}\left[ r + \gamma V^*(s') \mid s, a \right]$$
$$Q^*(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]$$
$$\pi^*(s) = \arg\max_{a \in A(s)} Q^*(s, a)$$
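When the MDP model is known, the Bellman optimality equation can be applied directly as a fixed-point update. Below is a minimal Q-value iteration sketch in numpy; the transition and reward tables are random placeholders, not an example from the slides.

```python
import numpy as np

# Q-value iteration: repeatedly apply Q(s,a) <- E[r + γ max_a' Q(s',a')]
# on a small tabular MDP with known (made-up) dynamics.
n_states, n_actions, gamma = 3, 2, 0.9
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = np.random.rand(n_states, n_actions)                                  # R[s, a]

Q = np.zeros((n_states, n_actions))
for _ in range(200):
    Q = R + gamma * (P @ Q.max(axis=1))       # backup applied to every (s, a) at once
V = Q.max(axis=1)                             # V*(s)  = max_a Q*(s, a)
pi = Q.argmax(axis=1)                         # π*(s) = argmax_a Q*(s, a)
print("V*:", V, "π*:", pi)
```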
Q-learning (tabular):
– Initialize $Q(s, a)$ arbitrarily
– Update: $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
– Action selection: e.g., greedy, ε-greedy
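A minimal sketch of tabular Q-learning with ε-greedy action selection; the environment interface (reset()/step() returning state, reward, done) is an assumption, e.g. the toy environment sketched earlier.

```python
import random
import numpy as np

def q_learning(env, n_states=2, n_actions=2, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))               # initialize Q(s, a) arbitrarily
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # ε-greedy action selection
            if random.random() < epsilon:
                a = random.randrange(n_actions)       # explore
            else:
                a = int(np.argmax(Q[s]))              # exploit
            s_next, r, done = env.step(a)
            # Q(s,a) <- Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

# Example usage with the hypothetical toy environment from earlier:
# Q = q_learning(ToyEnv())
```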
– Problem: not scalable, since we must compute Q(s,a) for every state-action pair
– Solution: use a function approximator to estimate Q(s,a), e.g. a neural network!
Training: iteratively try to make the Q-value close to the target value ($y_i$) it should have according to the Bellman equation. [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]
Last FC layer has 4-d output (if 4 actions): $Q(s_t, a_1)$, $Q(s_t, a_2)$, $Q(s_t, a_3)$, $Q(s_t, a_4)$
Number of actions: between 4 and 18, depending on the Atari game
A single feedforward pass computes the Q-values for all actions from the current state => efficient!
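A minimal PyTorch sketch of such a Q-network, with one output per action so that a single forward pass yields all Q-values; the layer sizes are illustrative and only loosely follow the published architecture.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),               # one Q-value per action
        )

    def forward(self, x):                            # x: (batch, 4, 84, 84) frame stack
        return self.head(self.features(x))

q_net = QNetwork(n_actions=4)
q_values = q_net(torch.zeros(1, 4, 84, 84))          # Q(s, a_1..a_4) in one pass
print(q_values.shape)                                # torch.Size([1, 4])
```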
Learning from batches of consecutive samples is problematic:
– Samples are correlated => inefficient learning
– Current Q-network parameters determine the next training samples => can lead to bad feedback loops
Address these problems with experience replay:
– Continually update a replay memory table of transitions $(s_t, a_t, r_t, s_{t+1})$
– Train the Q-network on random minibatches of transitions drawn from the replay memory
– Each transition can contribute to multiple weight updates => greater data efficiency
– Random sampling smooths out learning and avoids oscillations or divergence in the parameters
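A minimal sketch of such a replay memory in Python; the capacity and the stored "done" flag are implementation choices, not details from the slides.

```python
import random
from collections import deque

class ReplayMemory:
    """Store transitions (s_t, a_t, r_t, s_{t+1}, done) and sample random minibatches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)          # oldest transitions are discarded

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)   # breaks sample correlations
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```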
Putting it together: deep Q-learning with experience replay [Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]
– Initialize replay memory and Q-network
– Play M episodes (full games)
– Initialize the state (starting game screen pixels) at the beginning of each episode
– For each time step of the game:
– With small probability select a random action (explore); otherwise select a greedy action from the current policy (exploit)
– Take the selected action and observe the reward and next state
– Store the transition in the replay memory
– Sample a random minibatch of transitions from the replay memory and perform a gradient descent step
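A minimal PyTorch sketch of the minibatch update at the heart of this loop, assuming the QNetwork and ReplayMemory sketched above; for brevity it omits the separate target network used in the Nature version, so treat it as illustrative rather than a faithful reproduction.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, memory, batch_size=32, gamma=0.99):
    states, actions, rewards, next_states, dones = memory.sample(batch_size)
    states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in states])
    next_states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in next_states])
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # Bellman targets: y_i = r + γ max_a' Q(s', a')  (no bootstrap at terminal states)
    with torch.no_grad():
        targets = rewards + gamma * (1.0 - dones) * q_net(next_states).max(dim=1).values
    # Current estimates Q(s, a) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    loss = F.smooth_l1_loss(q_sa, targets)            # regress Q(s,a) toward the targets
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```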
[V. Mnih et al., Human-level control through deep reinforcement learning, Nature 2015]
Policy Gradients
– The Q-function can be very complicated!
– It can be much more complicated than the policy that must be learnt
– Idea: can we learn the policy directly?
– The policy gradient is an expectation, so we can estimate it with Monte Carlo sampling:
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left(\sum_{t\ge 0} r\big(s_t^{(i)}, a_t^{(i)}\big)\right)\left(\sum_{t\ge 0} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\right)$$
– Equivalently, in terms of whole sampled trajectories $\tau^{(i)}$:
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} r\big(\tau^{(i)}\big)\, \nabla_\theta \log p\big(\tau^{(i)}; \theta\big)$$
REINFORCE algorithm:
– Sample trajectories $\tau^{(i)}$ from $\pi_\theta(a \mid s)$ (run the policy)
– $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left(\sum_{t\ge 0} r\big(s_t^{(i)}, a_t^{(i)}\big)\right)\left(\sum_{t\ge 0} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\right)$
– $\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$
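A minimal PyTorch sketch of this update for a small discrete-action policy network; the network architecture and the trajectory format are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):
        # returns a categorical distribution π_θ(a | s)
        return torch.distributions.Categorical(logits=self.net(obs))

def reinforce_update(policy, optimizer, trajectories):
    """trajectories: list of (states, actions, rewards) tensors, one per sampled episode."""
    loss = 0.0
    for states, actions, rewards in trajectories:
        log_probs = policy(states).log_prob(actions)   # log π_θ(a_t | s_t) for every t
        total_reward = rewards.sum()                   # r(τ^(i))
        loss = loss - total_reward * log_probs.sum()   # -r(τ) Σ_t log π_θ(a_t | s_t)
    loss = loss / len(trajectories)                    # average over the N trajectories
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                   # minimizing this loss ascends J(θ)
    return loss.item()
```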
[Figure: sampled trajectories of states $s_t$ and actions $a_t$ drawn from $\pi_\theta(a_t \mid s_t)$, each labeled with its total reward $r(\tau^{(i)})$.]
– Note that $\nabla_\theta \log p_\theta\big(\tau^{(i)}\big) = \sum_{t\ge 0} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)$, since the transition probabilities do not depend on $\theta$
– Comparison with maximum likelihood: the policy gradient
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} r\big(\tau^{(i)}\big)\, \nabla_\theta \log p_\theta\big(\tau^{(i)}\big)$$
is a reward-weighted version of the maximum likelihood gradient
$$\nabla_\theta J_{ML}(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \log p_\theta\big(\tau^{(i)}\big)$$
so trajectories with high reward are made more likely
Variance reduction
– Gradient estimator:
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left(\sum_{t\ge 0} r\big(s_t^{(i)}, a_t^{(i)}\big)\right)\left(\sum_{t\ge 0} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\right)$$
– First idea: push up the probability of an action only by the cumulative future reward from that state (reward to go):
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t\ge 0}\left(\sum_{t'\ge t} r\big(s_{t'}^{(i)}, a_{t'}^{(i)}\big)\right) \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)$$
– Second idea: use a discount factor $\gamma$ to ignore delayed effects:
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t\ge 0}\left(\sum_{t'\ge t} \gamma^{t'-t} r\big(s_{t'}^{(i)}, a_{t'}^{(i)}\big)\right) \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)$$
– Problem: the raw magnitude of the reward is not very meaningful; for example, if rewards are all positive, you keep pushing up the probabilities of all actions
– What matters is whether a reward is better or worse than what you expect to get
– Third idea: subtract a baseline $b\big(s_t^{(i)}\big)$ that depends on the state:
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t\ge 0}\left(\sum_{t'\ge t} \gamma^{t'-t} r\big(s_{t'}^{(i)}, a_{t'}^{(i)}\big) - b\big(s_t^{(i)}\big)\right) \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)$$
– Simple baseline: the average reward over sampled trajectories, $b = \frac{1}{N}\sum_{i=1}^{N} r\big(\tau^{(i)}\big)$
– The average reward is not the best baseline, but it's pretty good!
Policy gradients in practice:
– This isn't the same as supervised learning: the gradients will be really noisy!
– Adaptive step-size rules like ADAM can be OK-ish; there are also policy-gradient-specific learning rate adjustment methods
REINFORCE with discounted reward to go:
– Sample trajectories $\tau^{(i)}$ from $\pi_\theta(a_t \mid s_t)$ (run the policy)
– $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t\ge 0}\left(\sum_{t'\ge t} \gamma^{t'-t} r\big(s_{t'}^{(i)}, a_{t'}^{(i)}\big)\right) \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)$
– $\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t\ge 0}\underbrace{\left(\sum_{t'\ge t} \gamma^{t'-t} r\big(s_{t'}^{(i)}, a_{t'}^{(i)}\big)\right)}_{\hat{Q}_t^{(i)}:\ \text{reward to go}} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)$$
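A minimal numpy sketch of computing the discounted reward to go for one sampled episode, together with a simple constant baseline; the reward values are made up.

```python
import numpy as np

# Q̂_t = Σ_{t'≥t} γ^{t'-t} r_{t'}, accumulated from the end of the episode.
def reward_to_go(rewards, gamma=0.99):
    q_hat = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        q_hat[t] = running
    return q_hat

episode_rewards = [0.0, 0.0, 1.0]               # made-up rewards for one trajectory
q_hat = reward_to_go(episode_rewards, gamma=0.9)
print(q_hat)                                    # [0.81 0.9  1.  ]
baseline = q_hat.mean()                         # a simple (not the best) constant baseline
advantages = q_hat - baseline                   # weights for ∇ log π_θ(a_t | s_t)
```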
– Better idea: push up the probability of an action only if it was better than the expected value of what we should get from that state
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t\ge 0} \hat{Q}_t^{(i)}\, \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)$$
– $\hat{Q}_t^{(i)}$: a single-sample estimate of the expected reward if we take action $a_t^{(i)}$ in state $s_t^{(i)}$
– The true expected reward to go is the Q-function $Q^\pi\big(s_t^{(i)}, a_t^{(i)}\big)$
– The value function $V^\pi\big(s_t^{(i)}\big)$ is the total expected reward from state $s_t^{(i)}$ and makes a natural state-dependent baseline:
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t\ge 0}\left( Q^\pi\big(s_t^{(i)}, a_t^{(i)}\big) - V^\pi\big(s_t^{(i)}\big)\right) \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)$$
– Remark: the advantage function $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$ measures how much better an action $a$ is than expected from state $s$
– Instead of the unbiased but high-variance single-sample estimate $\sum_{t'\ge t} \gamma^{t'-t} r\big(s_{t'}^{(i)}, a_{t'}^{(i)}\big) - b$, use $A^\pi$, which is an estimate of the expectation:
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t\ge 0} A^\pi\big(s_t^{(i)}, a_t^{(i)}\big)\, \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)$$
– But we don't know $Q^\pi$ and $V^\pi$; can we learn them? Yes, like Q-learning!
Actor-Critic
– The actor (the policy) decides which action to take; the critic (the value function) tells the actor how good its action was and how it should adjust
– This also alleviates the task of the critic, as it only has to learn the values of (state, action) pairs generated by the policy
– Fit the critic by regression to bootstrapped targets: $y_t \approx r(s_t, a_t) + \gamma V_\phi^\pi(s_{t+1})$
– Critic loss: $\mathcal{L}(\phi) = \sum_t \big( V_\phi^\pi(s_t) - y_t \big)^2$
– Repeat: alternate between fitting the critic $V_\phi^\pi$ and taking a policy gradient step on the actor using the resulting advantage estimates
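A minimal PyTorch sketch of one actor-critic update along these lines; the network sizes, the single shared optimizer, and the batch format are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                   nn.Linear(64, n_actions))
        self.critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                    nn.Linear(64, 1))

def actor_critic_update(model, optimizer, states, actions, rewards,
                        next_states, dones, gamma=0.99):
    values = model.critic(states).squeeze(-1)                     # V_φ(s_t)
    with torch.no_grad():
        next_values = model.critic(next_states).squeeze(-1)       # V_φ(s_{t+1})
        targets = rewards + gamma * (1.0 - dones) * next_values   # y_t
    advantages = (targets - values).detach()                      # A_t ≈ Q - V

    critic_loss = F.mse_loss(values, targets)                     # Σ (V_φ(s_t) - y_t)²
    dist = torch.distributions.Categorical(logits=model.actor(states))
    actor_loss = -(advantages * dist.log_prob(actions)).mean()    # policy gradient with A_t

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
```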
Advantages of policy-based RL:
– Better convergence properties
– Effective in high-dimensional or continuous action spaces
– Can learn stochastic policies
Disadvantages:
– Typically converges to a local rather than a global optimum
– Evaluating a policy is typically inefficient and high variance
Example RL formulations:
Hard attention for image classification:
– Observation: current image window
– Action: where to look
– Reward: classification result
Machine translation:
– Observations: words in the source language
– Actions: emit a word in the target language
– Rewards: a sentence-level metric, e.g. the BLEU score
[Ranzato et al., Sequence Level Training with Recurrent Neural Networks, 2015]
Hard attention: motivation
– Inspiration from human perception and eye movements
– Saves computational resources => scalability
– Able to ignore clutter / irrelevant parts of the image
[Mnih et al. 2014]
– Action: where to look next in the image
– Reward: 1 at the final time step if the image is correctly classified, 0 otherwise
– Glimpsing is a non-differentiable operation => learn the policy for how to take glimpse actions using REINFORCE
– RL is used because these evaluation metrics are not differentiable
– BLEU: compares the sequence of actions (emitted words) from the current policy against the reference translation
– At each point it pursues not only one but k next-word candidates (beam search)
– The greedy policy is obtained by maximum likelihood training on the training data
– $r_t$ is estimated by a linear regressor that takes as input the hidden states $h_t$ of the RNN
[Figure: results comparing the XENT and XE+R training settings.]
AlphaGo
– Featurize the board (stone color, move legality, bias, …)
– Initialize the policy network with supervised training from professional Go games, then continue training using policy gradient
– Also learn a value network (critic)
– Finally, combine the policy and value networks in a Monte Carlo Tree Search algorithm to select actions by lookahead search
Summary
– Policy gradients: very general, but suffer from high variance, so they require many samples. Challenge: sample efficiency.
– Q-learning: does not always work, but when it works it is usually more sample-efficient. Challenge: exploration.
Guarantees:
– Policy gradients: converge to a local optimum of $J(\theta)$, often good enough!
– Q-learning: zero guarantees, since you are approximating the Bellman equation with a complicated function approximator.