CS 4803 / 7643: Deep Learning
Nirbhay Modhe, Georgia Tech
Topics:
– Dynamic Programming (Q-Value Iteration)
– Reinforcement Learning (Intro, Q-Learning, DQNs)
Topics we'll cover:
– Overview of RL
– RL vs. other forms of learning
– Markov Decision Process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{P}, \gamma)$:
$\mathcal{S}$: set of possible states [start state = $s_0$, optional terminal / absorbing state]
$\mathcal{A}$: set of possible actions
$\mathcal{R}(s, a, s')$: distribution of reward given a (state, action, next state) tuple
$\mathbb{P}(s' \mid s, a)$: transition probability distribution, also written as $T(s, a, s')$
$\gamma$: discount factor
(a minimal data-structure sketch follows below)
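To make the definition concrete, here is a minimal sketch of a finite MDP as plain Python data structures; the two-state weather example and every name in it are hypothetical, invented purely for illustration.

```python
# A minimal sketch of a finite MDP as plain Python tables.
states = ["sunny", "rainy"]
actions = ["walk", "drive"]
gamma = 0.9  # discount factor

# P[(s, a)] -> list of (next_state, probability), i.e. T(s, a, s') as a table
P = {
    ("sunny", "walk"):  [("sunny", 0.8), ("rainy", 0.2)],
    ("sunny", "drive"): [("sunny", 0.9), ("rainy", 0.1)],
    ("rainy", "walk"):  [("sunny", 0.3), ("rainy", 0.7)],
    ("rainy", "drive"): [("sunny", 0.5), ("rainy", 0.5)],
}

# R[(s, a, s')] -> reward for the transition (here: ending up sunny is good)
R = {(s, a, sp): (1.0 if sp == "sunny" else -1.0)
     for (s, a), nexts in P.items() for sp, _ in nexts}
```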
– Solving MDPs: Value Iteration
Following a policy $\pi$ that produces sample trajectories $s_0, a_0, r_0, s_1, a_1, \ldots$
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
How good is a state? The value function at state $s$, $V^\pi(s)$, is the expected cumulative reward from state $s$ (and following the policy thereafter):
$V^\pi(s) = \mathbb{E}\left[\sum_{t \geq 0} \gamma^t r_t \mid s_0 = s, \pi\right]$
How good is a state-action pair? The Q-value function at state $s$ and action $a$, $Q^\pi(s, a)$, is the expected cumulative reward from taking action $a$ in state $s$ (and following the policy thereafter):
$Q^\pi(s, a) = \mathbb{E}\left[\sum_{t \geq 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$
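As a quick illustration of "expected cumulative (discounted) reward", here is a sketch of estimating $V^\pi(s)$ by averaging discounted returns over sampled trajectories; `sample_trajectory` is a hypothetical stand-in for rolling out the policy in the environment.

```python
def discounted_return(rewards, gamma=0.9):
    # G = sum_t gamma^t * r_t, the cumulative discounted reward of one trajectory
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def estimate_value(sample_trajectory, s, n=1000, gamma=0.9):
    # Monte Carlo estimate of V^pi(s): average return over n rollouts from s.
    # `sample_trajectory(s)` is a hypothetical helper that follows pi from s
    # and returns the list of observed rewards [r_0, r_1, ...].
    returns = [discounted_return(sample_trajectory(s), gamma) for _ in range(n)]
    return sum(returns) / n
```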
Given an optimal policy $\pi^*$ that produces sample trajectories $s_0, a_0, r_0, s_1, a_1, \ldots$
How good is a state? The optimal value function at state $s$, $V^*(s)$, is the maximum expected cumulative reward achievable from state $s$, acting optimally thereafter:
$V^*(s) = \max_\pi \mathbb{E}\left[\sum_{t \geq 0} \gamma^t r_t \mid s_0 = s, \pi\right]$
How good is a state-action pair? The optimal Q-value function at state $s$ and action $a$, $Q^*(s, a)$, is the expected cumulative reward from taking action $a$ in state $s$ and acting optimally thereafter:
$Q^*(s, a) = \max_\pi \mathbb{E}\left[\sum_{t \geq 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$
Value Iteration:
– Initialize values of all states: $V_0(s) = 0$ for all $s$
– While not converged: for each state, apply the Bellman optimality update
$V_{i+1}(s) \leftarrow \max_a \sum_{s'} \mathbb{P}(s' \mid s, a)\left[\mathcal{R}(s, a, s') + \gamma V_i(s')\right]$
– Repeat until convergence (no change in values); a runnable sketch follows below
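A runnable sketch of tabular value iteration. It assumes the MDP is given as NumPy arrays `P[s, a, s2]` (transition probabilities) and `R[s, a, s2]` (rewards); these array conventions are an assumption of the sketch, not notation from the slides.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Tabular value iteration. P[s, a, s2]: transition probabilities,
    R[s, a, s2]: rewards (array conventions assumed for this sketch)."""
    V = np.zeros(P.shape[0])
    while True:
        # Bellman optimality backup for every state at once:
        # Q[s, a] = sum_{s2} P[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:  # converged: values stopped changing
            return V_new
        V = V_new
```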
Time complexity per iteration: $O(|\mathcal{S}|^2 |\mathcal{A}|)$ (each of the $|\mathcal{S}|$ state updates maximizes over $|\mathcal{A}|$ actions, each summing over $|\mathcal{S}|$ next states). Homework: the analogous question for Q-value iteration.
Policy Iteration:
– Policy Evaluation: compute $V^\pi$ (similar to value iteration, but with actions fixed by the current policy $\pi$)
– Policy Refinement: greedily change actions as per $V^\pi$:
$\pi_{\text{new}}(s) = \arg\max_a \sum_{s'} \mathbb{P}(s' \mid s, a)\left[\mathcal{R}(s, a, s') + \gamma V^\pi(s')\right]$
– Repeat evaluation + refinement until the policy stops changing (sketched in code below)
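A sketch of tabular policy iteration, using the same `P[s, a, s2]` / `R[s, a, s2]` array conventions assumed in the value-iteration sketch above.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-6):
    """Tabular policy iteration (array conventions as in value_iteration)."""
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)  # start from an arbitrary policy
    while True:
        # Policy evaluation: iterate the Bellman expectation backup under `policy`
        V = np.zeros(n_states)
        while True:
            Pa = P[np.arange(n_states), policy]  # [S, S2] rows for chosen actions
            Ra = R[np.arange(n_states), policy]
            V_new = (Pa * (Ra + gamma * V[None, :])).sum(axis=1)
            if np.abs(V_new - V).max() < tol:
                break
            V = V_new
        # Policy refinement: act greedily with respect to the evaluated V
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):  # converged: policy stopped changing
            return policy, V
        policy = new_policy
```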
Recap of dynamic programming approaches:
– Value Iteration: Bellman update to state value estimates $V(s)$
– Q-Value Iteration: Bellman update to (state, action) value estimates $Q(s, a)$
– Policy Iteration: policy evaluation + refinement
Reinforcement learning differs from dynamic programming in that the MDP is not fully known:
– Transition probabilities $\mathbb{P}(s' \mid s, a)$ unknown: how actions affect the environment.
– Reward distribution $\mathcal{R}(s, a, s')$ unknown: what/when are the good actions?
The solution:
– Gather experience (data) by performing actions.
– Approximate the unknown quantities from data.
– Interactive gridworld demo (dynamic programming): https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html
– Interactive gridworld demo (temporal-difference learning): https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_td.html
– Tabular value functions are not scalable to high-dimensional states, e.g. RGB images.
– Use deep neural networks to learn low-dimensional representations.
Value-based RL:
– (Deep) Q-Learning: approximating $Q^*(s, a)$ with a deep Q-network
Policy-based RL:
– Directly approximate the optimal policy $\pi^*$ with a parametrized policy $\pi_\theta$
Model-based RL:
– Approximate the transition function and reward function
– Plan by looking ahead in the (approximate) future!
Why (deep) Q-Learning?
– Has some theoretical guarantees
– Works well in practice
– The Q-network can take RGB images as input
Image Credits: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
From the Bellman optimality equation, the target Q-value is $y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$ (computed with older, frozen parameters $\theta^-$), and the predicted Q-value $Q(s, a; \theta)$ is regressed toward it with the loss $L(\theta) = \left(y - Q(s, a; \theta)\right)^2$.
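A minimal PyTorch sketch of this regression step. It assumes `batch` holds tensors of states, integer actions, rewards, next states, and done flags, and that both networks map states to one Q-value per action; the function names and batch layout are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """One DQN regression step on a batch of (s, a, r, s', done) tensors."""
    s, a, r, s_next, done = batch
    # Predicted Q-value: Q(s, a; theta) for the actions actually taken
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # Target Q-value: r + gamma * max_a' Q(s', a'; theta^-), gradients blocked
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values
        y = r + gamma * (1.0 - done) * q_next  # no bootstrapping past terminal states
    return F.mse_loss(q_pred, y)
```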
[Figure: the Q-network takes the state as input and outputs one Q-value per action; a second copy of the network (the target network) is used to compute targets.]
Target network:
– Freeze the target network's parameters $\theta^-$ and update only the online parameters $\theta$
– Set $\theta^- \leftarrow \theta$ at regular intervals (a one-line sketch follows below)
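A one-line realization of this schedule, reusing the hypothetical `q_net` / `target_net` modules from the loss sketch above:

```python
def sync_target(q_net, target_net):
    # Hard update: set theta^- <- theta. Called at regular intervals; the
    # interval is a hyperparameter (a few thousand steps is a common choice).
    target_net.load_state_dict(q_net.state_dict())
```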
[Figure: the training loop alternates between collecting data from the environment and training on it.]
Challenge 1: Exploration vs. exploitation
Challenge 2: Non-i.i.d., highly correlated data
Challenge 1: how to choose actions while gathering experience?
– Greedy? Gets stuck in local minima: no exploration.
– $\epsilon$-greedy: with probability $\epsilon$ take a random (exploratory) action, otherwise take the greedy action (see the sketch below)
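A minimal sketch of $\epsilon$-greedy action selection over a list of Q-values; the function name is our own.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore uniformly at random,
    otherwise exploit the greedy (argmax) action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```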
– e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side.
Experience replay:
– A replay buffer stores transitions $(s_t, a_t, r_t, s_{t+1})$
– Continually update the replay buffer as game (experience) episodes are played; older samples are discarded
– Train the Q-network on random minibatches of transitions sampled from the replay memory, instead of consecutive samples (a minimal buffer sketch follows below)
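A minimal replay buffer matching these three bullets; the capacity and interface names are assumptions of the sketch.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer of (s, a, r, s', done) transitions; the
    oldest transitions are discarded once capacity is reached."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # A random minibatch breaks the temporal correlation of consecutive samples
        return random.sample(self.buffer, batch_size)
```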
[Algorithm: deep Q-learning with experience replay, annotated with its three components: $\epsilon$-greedy action selection, the Q-update, and experience replay.]
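A condensed sketch of the combined training loop, reusing the hypothetical helpers sketched above (`epsilon_greedy`, `ReplayBuffer`, `dqn_loss`, `sync_target`) and assuming a gym-style `env` with `reset()` / `step()`; it illustrates the structure of the algorithm, not the exact published version.

```python
import torch

def train_dqn(env, q_net, target_net, optimizer, n_steps=100_000,
              batch_size=32, gamma=0.99, epsilon=0.1, sync_every=1_000):
    buffer = ReplayBuffer()
    s = env.reset()
    for step in range(n_steps):
        # 1. Epsilon-greedy action selection using the current Q-network
        with torch.no_grad():
            q = q_net(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0))[0]
        a = epsilon_greedy(q.tolist(), epsilon)
        # 2. Act in the environment and store the transition in the replay buffer
        s_next, r, done, _ = env.step(a)
        buffer.push(s, a, r, s_next, done)
        s = env.reset() if done else s_next
        # 3. Q-update on a random minibatch from the replay memory
        if len(buffer.buffer) >= batch_size:
            cols = list(zip(*buffer.sample(batch_size)))  # s, a, r, s', done
            batch = [torch.as_tensor(c, dtype=torch.float32) for c in cols]
            batch[1] = batch[1].long()  # gather() needs integer action indices
            loss = dqn_loss(q_net, target_net, batch, gamma)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # 4. Refresh the frozen target network at regular intervals
        if step % sync_every == 0:
            sync_target(q_net, target_net)
```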
Case study: playing Atari games.
– Input: stack of 4 image frames, after grayscale conversion, down-sampling, and cropping, giving dimensions (84 × 84 × 4)
– Output: the network's final layer has one unit per action (predicts Q-values)
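A sketch of a DQN-style convolutional Q-network for this input; the layer sizes follow the commonly used DQN architecture and are an assumption of the sketch, not necessarily the exact network on the slide.

```python
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Maps an (84 x 84 x 4) stacked-frame state to one Q-value per action."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84 -> 20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20 -> 9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9 -> 7
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),  # one Q-value per action
        )

    def forward(self, x):  # x: [B, 4, 84, 84]
        return self.net(x)
```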
– Video of the DQN agent playing Atari: https://www.youtube.com/watch?v=V1eYniJ0Rnk
Summary:
– Q-Value Iteration
– Policy Iteration
– The challenges of (deep) learning-based methods
– Value-based RL algorithms
– Policy-based RL algorithms