CS 4803 / 7643: Deep Learning
Zsolt Kira Georgia Tech
Topic:
– Reinforcement Learning (RL) – Overview – Markov Decision Processes
Administrative:
– PS3/HW3 due March 15th!
Projects:
– 2 new FB projects up (https://www.cc.gatech.edu/classes/AY2020/cs7643_spring/fb_projects.html)
– Tentative FB plan:
– Fill out spreadsheet: https://gtvault-my.sharepoint.com/:x:/g/personal/sdharur3_gatech_edu/EVXbNc4oxelMmj1T5WsEIRQBE4Hn532GeLQVcmOnWdG2Jg?e=dIGNfX
3
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Last lecture: – Focus on MDPs – No learning (deep or otherwise)
Slide Credit: David Silver
4
At each step t, the agent: – Executes action a_t – Receives observation o_t – Receives scalar reward r_t
The environment: – Receives action a_t – Emits observation o_{t+1} – Emits scalar reward r_{t+1}
5
Defined by (S, A, R, T, γ):
– S: set of possible states [start state = s_0, optional terminal / absorbing state]
– A: set of possible actions
– R(s, a, s'): distribution of reward given a (state, action, next state) tuple
– T(s, a, s'): transition probability distribution, also written as p(s' | s, a)
– γ: discount factor
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
6
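To make the (S, A, R, T, γ) tuple concrete, here is a minimal Python sketch of an MDP as a plain dictionary plus a sampling helper. The two-state toy MDP, its rewards, and the `step` helper are illustrative assumptions, not the example used on the slides.

```python
import random

# A minimal MDP sketch: states S, actions A, transition probabilities T(s, a, s'),
# rewards R(s, a, s'), and a discount factor gamma. The two-state toy MDP below is
# a made-up illustration, not the gridworld from the lecture.
mdp = {
    "states": ["s0", "s1"],
    "actions": ["stay", "move"],
    "gamma": 0.9,
    # T[(s, a)] -> list of (next_state, probability) pairs
    "T": {
        ("s0", "stay"): [("s0", 1.0)],
        ("s0", "move"): [("s1", 0.8), ("s0", 0.2)],   # stochastic transition
        ("s1", "stay"): [("s1", 1.0)],
        ("s1", "move"): [("s0", 0.8), ("s1", 0.2)],
    },
    # R[(s, a, s')] -> scalar reward (unlisted transitions give 0)
    "R": {
        ("s0", "move", "s1"): +1.0,
        ("s1", "move", "s0"): -1.0,
    },
}

def step(mdp, s, a):
    """Sample one transition: returns (next state, reward) for taking action a in state s."""
    next_states, probs = zip(*mdp["T"][(s, a)])
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, mdp["R"].get((s, a, s_next), 0.0)
```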
The agent's state can be constructed using the history of observations, a belief over the world state, or an RNN, …
Slide Credit: Emma Brunskill, Byron Boots
7
8
– Noisy movement: the intended action (e.g. North) takes the agent North (if there is no wall) most of the time; otherwise it slips 10% West; 10% East
– Small "living" reward at each step (can be negative)
Slide credit: Pieter Abbeel
– Deterministic – Stochastic
9
10
(Typically for a fixed horizon T)
11
Slide Credit: Byron Boots, CS 7641
Reward at every non-terminal state (living reward / penalty)
– How good is a state? – Am I screwed? Am I winning this game?
– How good is a state-action pair? – Should I do this now?
12
13
Following policy π, which produces sample trajectories s_0, a_0, r_0, s_1, a_1, …
How good is a state? The value function at state s is the expected cumulative (discounted) reward from state s (following the policy thereafter):
V^π(s) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, π ]
How good is a state-action pair? The Q-value function at state s and action a is the expected cumulative (discounted) reward from taking action a in state s (and following the policy thereafter):
Q^π(s, a) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
14
Given the optimal policy π*, which produces sample trajectories s_0, a_0, r_0, s_1, a_1, …
How good is a state? The optimal value function at state s is the expected cumulative reward from state s, acting optimally thereafter:
V*(s) = max_π E[ Σ_{t≥0} γ^t r_t | s_0 = s, π ]
How good is a state-action pair? The optimal Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and acting optimally thereafter:
Q*(s, a) = max_π E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
15
Slide credit: Byron Boots, CS 7641
16
Slide credit: Byron Boots, CS 7641
17
Slide credit: Byron Boots, CS 7641
18
Slide credit: Byron Boots, CS 7641
19
Slide credit: Byron Boots, CS 7641
– Initialize values of all states: V_0(s) = 0
– Update: V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ]
– Repeat until convergence (to V*)
20
Slide credit: Byron Boots, CS 7641
– Initialize values of all states: V_0(s) = 0
– Update: V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ]
– Repeat until convergence (to V*)
21
Slide credit: Byron Boots, CS 7641
– Initialize values of all states: V_0(s) = 0
– Update: V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ]
– Repeat until convergence (to V*)
– Convergence guaranteed for γ < 1 – Sketch: approximations get refined towards optimal values – In practice, the policy may converge before the values do
22
Slide credit: Byron Boots, CS 7641
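A minimal sketch of the value-iteration loop above, assuming the dictionary-style MDP layout from the earlier sketch (states, actions, T, R, gamma); the tolerance parameter is an illustrative choice.

```python
def value_iteration(mdp, tol=1e-6):
    """Repeated Bellman-optimality backups until the value estimates stop changing."""
    V = {s: 0.0 for s in mdp["states"]}                  # V_0(s) = 0
    while True:
        V_new = {}
        for s in mdp["states"]:
            # V_{k+1}(s) = max_a sum_{s'} T(s,a,s') [ R(s,a,s') + gamma * V_k(s') ]
            V_new[s] = max(
                sum(p * (mdp["R"].get((s, a, s2), 0.0) + mdp["gamma"] * V[s2])
                    for s2, p in mdp["T"][(s, a)])
                for a in mdp["actions"]
            )
        if max(abs(V_new[s] - V[s]) for s in V) < tol:   # converged (to V*)
            return V_new
        V = V_new
```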
23
Slide credit: Pieter Abbeel [NOTE: Here we are showing calculations for the action we know is the argmax (go right), but in general we have to calculate this for each action and return the max]
24
25
Noise = 0.2 Discount = 0.9 Living reward = 0
Slide Credit: http://ai.berkeley.edu
– How should we act from a state, given only its values? It’s not obvious!
– We need a one-step lookahead (mini-expectimax); this is policy extraction, since it recovers the policy implied by the values
Slide Credit: http://ai.berkeley.edu
Noise = 0.2 Discount = 0.9 Living reward = 0
Slide Credit: http://ai.berkeley.edu
Computing actions from q-values: suppose we have the optimal q-values Q*(s, a):
– Completely trivial to decide! π*(s) = argmax_a Q*(s, a)
– Important lesson: actions are easier to select from q-values than from values!
Slide Credit: http://ai.berkeley.edu
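To make the contrast concrete, a minimal sketch of policy extraction under the same assumed MDP layout: from q-values it is a single argmax, while from state values it requires a one-step lookahead through T and R.

```python
def policy_from_q(Q, mdp):
    """pi*(s) = argmax_a Q*(s, a): trivial once q-values are known."""
    return {s: max(mdp["actions"], key=lambda a: Q[(s, a)]) for s in mdp["states"]}

def policy_from_v(V, mdp):
    """With only V*(s), a one-step lookahead through the model (T, R) is needed."""
    def lookahead(s, a):
        return sum(p * (mdp["R"].get((s, a, s2), 0.0) + mdp["gamma"] * V[s2])
                   for s2, p in mdp["T"][(s, a)])
    return {s: max(mdp["actions"], key=lambda a: lookahead(s, a)) for s in mdp["states"]}
```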
– Policy Iteration
– Value-based RL
Slide Credit: David Silver
31
(C) Dhruv Batra 32
33
– Policy Evaluation: compute V^π (similar to VI) – Policy Refinement: greedily change actions as per Q^π, i.e. π_new(s) = argmax_a Q^π(s, a)
34
– Policy Evaluation: compute V^π (similar to VI) – Policy Refinement: greedily change actions as per Q^π, i.e. π_new(s) = argmax_a Q^π(s, a)
– Repeat until the policy converges
35
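A minimal policy-iteration sketch under the same assumed MDP layout: iterative policy evaluation (Bellman expectation backups, no max), followed by greedy policy refinement.

```python
def policy_iteration(mdp, tol=1e-6):
    """Alternate policy evaluation and greedy refinement until the policy is stable."""
    pi = {s: mdp["actions"][0] for s in mdp["states"]}   # arbitrary initial policy
    while True:
        # Policy evaluation: compute V^pi with repeated Bellman expectation backups
        V = {s: 0.0 for s in mdp["states"]}
        while True:
            V_new = {s: sum(p * (mdp["R"].get((s, pi[s], s2), 0.0) + mdp["gamma"] * V[s2])
                            for s2, p in mdp["T"][(s, pi[s])])
                     for s in mdp["states"]}
            delta = max(abs(V_new[s] - V[s]) for s in V)
            V = V_new
            if delta < tol:
                break
        # Policy refinement: act greedily with respect to Q^pi
        def q(s, a):
            return sum(p * (mdp["R"].get((s, a, s2), 0.0) + mdp["gamma"] * V[s2])
                       for s2, p in mdp["T"][(s, a)])
        pi_new = {s: max(mdp["actions"], key=lambda a: q(s, a)) for s in mdp["states"]}
        if pi_new == pi:                                  # policy converged
            return pi, V
        pi = pi_new
```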
– Value Iteration: Bellman update to state value estimates V(s)
– Q-Value Iteration: Bellman update to (state, action) value estimates Q(s, a)
– Policy Iteration: policy evaluation + refinement
36
37
– T(s, a, s') unknown: how actions affect the environment. – R(s, a, s') unknown: what/when are the good actions?
38
– T(s, a, s') unknown: how actions affect the environment. – R(s, a, s') unknown: what/when are the good actions?
– Gather experience (data) by performing actions. – Approximate unknown quantities from data.
39
(C) Dhruv Batra 40
– https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html
– https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_td.html
[Diagram: from state s, taking action π(s), we observe sample next states s'_1, s'_2, s'_3 and average their outcomes]
Almost! But we can’t rewind time to get sample after sample from state s.
What’s the difficulty of this algorithm?
– Update V(s) each time we experience a transition (s, a, s’, r) – Likely outcomes s’ will contribute updates more often
– Policy still fixed, still doing evaluation! – Move values toward value of whatever successor occurs: running average
Sample of V(s): sample = R(s, π(s), s') + γ V^π(s')
Update to V(s): V^π(s) ← (1 − α) V^π(s) + α · sample
Same update: V^π(s) ← V^π(s) + α (sample − V^π(s))
– The running interpolation update: x̄_n = (1 − α) x̄_{n−1} + α x_n – Makes recent samples more important (exponentially decaying weights on older samples) – Forgets about the past
Why do we want to forget about the past? (distant past values were wrong anyway)
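A minimal sketch of TD policy evaluation with the running-average update above; it assumes the dictionary-style MDP and the hypothetical `step` sampler from the earlier sketches, and the step count and learning rate are illustrative.

```python
def td_evaluate(mdp, pi, step, num_steps=10_000, alpha=0.1, start="s0"):
    """Model-free policy evaluation: V(s) <- V(s) + alpha * (sample - V(s))."""
    V = {s: 0.0 for s in mdp["states"]}
    s = start
    for _ in range(num_steps):
        s_next, r = step(mdp, s, pi[s])             # experience one transition under pi
        sample = r + mdp["gamma"] * V[s_next]       # one-sample estimate of V(s)
        V[s] += alpha * (sample - V[s])             # running (exponentially weighted) average
        s = s_next
    return V
```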
– But can’t compute this update without knowing T, R
– Receive a sample transition (s, a, r, s') – This sample suggests Q(s, a) ≈ r + γ max_{a'} Q(s', a') – But we want to average over results from (s, a) – So keep a running average: Q(s, a) ← (1 − α) Q(s, a) + α [ r + γ max_{a'} Q(s', a') ]
Amazing result: Q-learning converges to the optimal policy even if you’re acting suboptimally! (This is called off-policy learning.)
– You have to explore enough – You have to eventually make the learning rate small enough – … but not decrease it too quickly – Basically, in the limit, it doesn’t matter how you select actions (!)
46
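Putting the update and the caveats together, a minimal tabular Q-learning sketch with an ε-greedy behavior policy; it reuses the assumed `mdp`/`step` sketches from above, and the ε, α, and step-count values are illustrative.

```python
import random

def q_learning(mdp, step, num_steps=50_000, alpha=0.1, eps=0.1, start="s0"):
    """Off-policy Q-learning: learns Q* even while acting eps-greedily."""
    Q = {(s, a): 0.0 for s in mdp["states"] for a in mdp["actions"]}
    s = start
    for _ in range(num_steps):
        # Epsilon-greedy exploration: mostly greedy, occasionally random
        if random.random() < eps:
            a = random.choice(mdp["actions"])
        else:
            a = max(mdp["actions"], key=lambda act: Q[(s, act)])
        s_next, r = step(mdp, s, a)
        # Sample suggests r + gamma * max_a' Q(s', a'); keep a running average
        target = r + mdp["gamma"] * max(Q[(s_next, a2)] for a2 in mdp["actions"])
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next
    return Q
```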
47
– Not scalable to high-dimensional states, e.g. RGB images.
48
– Not scalable to high-dimensional states, e.g. RGB images.
– Use deep neural networks to learn low-dimensional representations.
49
(C) Dhruv Batra 50
– (Deep) Q-Learning: approximating the optimal Q-function Q*(s, a) with a deep Q-network
(C) Dhruv Batra 51
– (Deep) Q-Learning: approximating the optimal Q-function Q*(s, a) with a deep Q-network
– Directly approximate optimal policy with a parametrized policy
(C) Dhruv Batra 52
– (Deep) Q-Learning: approximating the optimal Q-function Q*(s, a) with a deep Q-network
– Directly approximate optimal policy with a parametrized policy
– Approximate transition function and reward function – Plan by looking ahead in the (approx.) future!
(C) Dhruv Batra 53
– (Deep) Q-Learning: approximating the optimal Q-function Q*(s, a) with a deep Q-network
– Directly approximate optimal policy with a parametrized policy
– Approximate transition function and reward function – Plan by looking ahead in the (approx.) future!
(C) Dhruv Batra 54
– Has some theoretical guarantees
56
– But can’t compute this update without knowing T, R
– Receive a sample transition (s, a, r, s') – This sample suggests Q(s, a) ≈ r + γ max_{a'} Q(s', a') – But we want to average over results from (s, a) – So keep a running average: Q(s, a) ← (1 − α) Q(s, a) + α [ r + γ max_{a'} Q(s', a') ]
– In realistic situations, we cannot possibly learn about every single state!
– Too many states to visit them all in training – Too many states to hold the q-tables in memory
– Learn about some small number of training states from experience – Generalize that experience to new, similar situations – This is the fundamental idea in machine learning!
[demo – RL pacman]
Let’s say we discover through experience that this state is bad: In naïve q-learning, we know nothing about this state: Or even this one!
– Solution: describe a state using a vector of features (properties)
– Features are functions from states to real numbers (often 0/1) that capture important properties of the state – Example features:
– Can also describe a q-state (s, a) with features (e.g. action moves closer to food)
– Using a feature representation, we can write a q-function (or value function) for any state using a few weights: Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + … + w_n f_n(s, a)
– Disadvantage: states may share features but actually be very different in value!
Approximate q-update explained: imagine we had only one point x, with features f(x), target value y, and weights w. Minimizing the squared error ½ (y − Σ_k w_k f_k(x))² by gradient descent gives the update w_m ← w_m + α (y − Σ_k w_k f_k(x)) f_m(x). For q-learning, the “target” is y = r + γ max_{a'} Q(s', a') and the “prediction” is Q(s, a) = Σ_k w_k f_k(s, a), so w_m ← w_m + α [ target − prediction ] f_m(s, a).
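A minimal sketch of this approximate (linear) q-update for a single transition; `features(s, a)` is a hypothetical feature function returning a vector, `w` is the weight vector, and the γ and α values are illustrative.

```python
import numpy as np

def approx_q_update(w, features, s, a, r, s_next, actions, gamma=0.9, alpha=0.01):
    """One gradient-style weight update for Q(s, a) = w . f(s, a) from a single transition."""
    f = np.asarray(features(s, a))                     # feature vector f(s, a)
    prediction = np.dot(w, f)                          # current estimate Q(s, a)
    target = r + gamma * max(np.dot(w, features(s_next, a2)) for a2 in actions)
    difference = target - prediction
    return w + alpha * difference * f                  # w_i <- w_i + alpha * difference * f_i
```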
– Has some theoretical guarantees
– Works well in practice – Q-Network can take RGB images
63
Image Credits: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
– Input: stack of 4 image frames after grayscale conversion, down-sampling, and cropping (84 x 84 x 4)
– Output: one dimension per action (predicts Q-values for all actions in a single forward pass)
64
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
65
66
Bellman optimality: Q*(s, a) = E[ r + γ max_{a'} Q*(s', a') ]
Loss: L(θ) = E[ (y − Q(s, a; θ))² ], where y = r + γ max_{a'} Q(s', a') is the target Q-value and Q(s, a; θ) is the predicted Q-value
67
[Diagram: State → Q-Network → Q-values per action]
68
[Diagram: State → Q-Network → Q-values per action, alongside a second copy of the Q-Network]
69
[Diagram: State → Q-Network → Q-values per action]
70
[Diagram: State → Q-Network → Q-values per action]
– Freeze the target network parameters θ⁻ while updating the online parameters θ – Set θ⁻ ← θ at regular intervals
71
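A minimal PyTorch sketch of the DQN loss with a frozen target network as described above; the tiny fully-connected `QNet`, the batch layout, and the sync step are illustrative stand-ins for the Atari conv net on the slides.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Tiny stand-in for the DQN conv net: maps a state vector to one Q-value per action."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_actions))

    def forward(self, x):
        return self.net(x)

def dqn_loss(online, target, batch, gamma=0.99):
    """L(theta) = E[(y - Q(s, a; theta))^2], with y computed from the frozen target net."""
    s, a, r, s_next, done = batch                  # a: LongTensor of action indices; done: 0/1 floats
    q_pred = online(s).gather(1, a.unsqueeze(1)).squeeze(1)   # predicted Q(s, a; theta)
    with torch.no_grad():                                     # target network is frozen
        q_next = target(s_next).max(dim=1).values
        y = r + gamma * (1.0 - done) * q_next                 # target Q-value
    return nn.functional.mse_loss(q_pred, y)

# At regular intervals, copy the online parameters into the target network:
#   target.load_state_dict(online.state_dict())
```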
[Diagram: loop of interacting with the Environment → collecting Data → Train]
Challenge 1: Exploration vs. Exploitation
Challenge 2: Non-iid, highly correlated data
– Greedy? -> Local minima, no exploration
75
– Greedy? -> Local minima, no exploration
– Epsilon-greedy: with (small) probability ε take a random action; otherwise take the greedy (argmax) action
76
– e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side.
77
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
– A replay buffer stores transitions (s_t, a_t, r_t, s_{t+1})
78
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
– A replay buffer stores transitions (s_t, a_t, r_t, s_{t+1}) – Continually update the replay buffer as game (experience) episodes are played, discarding older samples
79
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
– A replay buffer stores transitions (s_t, a_t, r_t, s_{t+1}) – Continually update the replay buffer as game (experience) episodes are played, discarding older samples – Train the Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples
80
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
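A minimal sketch of such a replay buffer; the capacity and minibatch size are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions (s, a, r, s', done); the oldest samples are discarded once full."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)      # deque drops the oldest entries automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # A random minibatch breaks the correlation between consecutive transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```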
81
[Pseudocode: full deep Q-learning algorithm with experience replay; annotations highlight the epsilon-greedy action selection, the Q update, and the experience replay steps]
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
82
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
– Input: stack of 4 image frames after grayscale conversion, down-sampling, and cropping (84 x 84 x 4)
– Output: one dimension per action (predicts Q-values for all actions in a single forward pass)
83
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
84
https://www.youtube.com/watch?v=V1eYniJ0Rnk Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
– Q-Value Iteration – Policy Iteration
– The challenges of (deep) learning based methods – Value-based RL algorithms
– Policy-based RL algorithms
85