CSC 411 Lectures 21–22: Reinforcement Learning
Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla
University of Toronto
UofT CSC 411: 21&22-Reinforcement Learning 1 / 44
Reinforcement Learning Problem. In supervised learning, the correct output is given for each input. In reinforcement learning, an agent in a world takes actions, its state changes, and it receives only reward signals, with the goal of achieving high long-term reward.
Discounted return: G = ∑_{t≥0} γ^t R_t. The goal is to find a policy maximizing the expected discounted return E[∑_{t≥0} γ^t R_t].
Action-value function of a policy π: Q^π(s, a) = E[∑_{t≥0} γ^t R_t | S_0 = s, A_0 = a, A_t ∼ π for t ≥ 1]. The greedy policy with respect to Q picks π(s) = argmax_a Q(s, a).
The optimal action-value function Q∗ satisfies the Bellman optimality equation: Q∗(s, a) = r(s, a) + γ ∫ P(ds′|s, a) max_{a′∈A} Q∗(s′, a′).
If Q ≈ Q∗, the greedy policy π(s) = argmax_{a∈A} Q(s, a) is close to optimal: Q^{π(Q)} ≈ Q^{π∗} = Q∗.
Define the Bellman optimality operator T∗ acting on an action-value function Q by (T∗Q)(s, a) = r(s, a) + γ ∫ P(ds′|s, a) max_{a′∈A} Q(s′, a′).
Q∗ is the fixed point of the Bellman operator T∗: T∗Q∗ = Q∗. Value Iteration: start from some Q_0 and repeatedly apply Q_{k+1} ← T∗Q_k; the iterates Q_k converge to Q∗. Explicitly, Q_{k+1}(s, a) = r(s, a) + γ ∫ P(ds′|s, a) max_{a′∈A} Q_k(s′, a′).
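The Value Iteration update Q_{k+1} ← T∗Q_k can be sketched in NumPy. The small random MDP below is an illustrative assumption, not from the slides:

```python
import numpy as np

# A tiny random MDP, purely illustrative: 5 states, 2 actions.
# P[a, s, s'] is P(s'|s, a); r[s, a] is the reward r(s, a).
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
r = rng.standard_normal((n_states, n_actions))

Q = np.zeros((n_states, n_actions))
for k in range(1000):
    # (T* Q)(s, a) = r(s, a) + gamma * sum_{s'} P(s'|s, a) max_{a'} Q(s', a')
    Q_next = r + gamma * np.einsum("asn,n->sa", P, Q.max(axis=1))
    if np.max(np.abs(Q_next - Q)) < 1e-10:
        break
    Q = Q_next

greedy_policy = Q.argmax(axis=1)  # pi(s) = argmax_a Q*(s, a)
```

At convergence Q satisfies the Bellman optimality equation up to the stopping tolerance, and the greedy policy read off from it is optimal for this MDP.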
T∗ is a γ-contraction in the supremum norm. For any Q1, Q2 and any (s, a):
|(T∗Q1)(s, a) − (T∗Q2)(s, a)|
= γ | ∫ P(ds′|s, a) max_{a′∈A} Q1(s′, a′) − ∫ P(ds′|s, a) max_{a′∈A} Q2(s′, a′) |
≤ γ ∫ P(ds′|s, a) | max_{a′∈A} Q1(s′, a′) − max_{a′∈A} Q2(s′, a′) |
≤ γ ∫ P(ds′|s, a) max_{a′∈A} | Q1(s′, a′) − Q2(s′, a′) |
≤ γ max_{(s′,a′)∈S×A} | Q1(s′, a′) − Q2(s′, a′) | ∫ P(ds′|s, a)
= γ ‖Q1 − Q2‖∞.
Since T∗ is a γ-contraction and Q∗ is its fixed point, each Value Iteration step shrinks the error:
max_{(s,a)∈S×A} |Q_{k+1}(s, a) − Q∗(s, a)| ≤ γ max_{(s,a)∈S×A} |Q_k(s, a) − Q∗(s, a)|,
so ‖Q_k − Q∗‖∞ ≤ γ^k ‖Q_0 − Q∗‖∞ → 0, i.e., Q_k → Q∗ geometrically.
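The γ-contraction property can be checked numerically. The random MDP below is an illustrative assumption used only for this check:

```python
import numpy as np

# Numerically verify ||T*Q1 - T*Q2||_inf <= gamma * ||Q1 - Q2||_inf
# on a small random MDP (purely illustrative).
rng = np.random.default_rng(1)
n_states, n_actions, gamma = 4, 3, 0.8
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, s']
r = rng.standard_normal((n_states, n_actions))

def bellman_opt(Q):
    # (T* Q)(s, a) = r(s, a) + gamma * sum_{s'} P(s'|s, a) max_{a'} Q(s', a')
    return r + gamma * np.einsum("asn,n->sa", P, Q.max(axis=1))

Q1 = rng.standard_normal((n_states, n_actions))
Q2 = rng.standard_normal((n_states, n_actions))
lhs = np.max(np.abs(bellman_opt(Q1) - bellman_opt(Q2)))  # ||T*Q1 - T*Q2||_inf
rhs = gamma * np.max(np.abs(Q1 - Q2))                    # gamma * ||Q1 - Q2||_inf
```

Here `lhs` never exceeds `rhs`, matching the derivation above.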
In the batch setting we are given a dataset D_n = {(S_i, A_i, R_i, S′_i)}_{i=1}^n with S′_i ∼ P(·|S_i, A_i). Form the targets t_i = R_i + γ max_{a′∈A} Q(S′_i, a′) from the dataset D_n. Since
E[ R_i + γ max_{a′∈A} Q(S′_i, a′) | S_i, A_i ] = r(S_i, A_i) + γ ∫ P(ds′|S_i, A_i) max_{a′∈A} Q(s′, a′) = (T∗Q)(S_i, A_i),
each R_i + γ max_{a′∈A} Q(S′_i, a′) is a noisy version of (T∗Q)(S_i, A_i). Fitting a function to these targets is a regression problem.
To approximate the Bellman operator T∗: given the dataset D_n = {(S_i, A_i, R_i, S′_i)}_{i=1}^n and an action-value function Q_k, build the regression set {((S_i, A_i), t_i)}_{i=1}^n with t_i = R_i + γ max_{a′∈A} Q_k(S′_i, a′). Because E[t_i | S_i, A_i] = (T∗Q_k)(S_i, A_i), we can estimate T∗Q_k by regressing the targets t_i on (S_i, A_i).
Fitted Q-Iteration approximates each application of the Bellman operator T∗, i.e., each step Q_{k+1} ← T∗Q_k: given the dataset {(S_i, A_i, R_i, S′_i)}_{i=1}^n and the current action-value function Q_k, solve
Q_{k+1} ← argmin_{Q∈F} (1/n) ∑_{i=1}^n ( Q(S_i, A_i) − [ R_i + γ max_{a′∈A} Q_k(S′_i, a′) ] )².
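Fitted Q-Iteration can be sketched with plain least squares. The synthetic environment and the one-hot linear function class F below are illustrative assumptions, not the slides' setup:

```python
import numpy as np

# Fitted Q-Iteration sketch on synthetic transitions (environment, names, and
# the one-hot linear function class are all illustrative assumptions).
rng = np.random.default_rng(2)
n_states, n_actions, gamma, n = 4, 2, 0.9, 5000
true_P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
S = rng.integers(n_states, size=n)
A = rng.integers(n_actions, size=n)
S_next = np.array([rng.choice(n_states, p=true_P[a, s]) for s, a in zip(S, A)])
R = np.cos(S)  # a simple deterministic reward r(s)

def featurize(s, a):
    # One-hot features over (s, a); least squares then reduces to tabular averaging.
    x = np.zeros((len(s), n_states * n_actions))
    x[np.arange(len(s)), s * n_actions + a] = 1.0
    return x

X = featurize(S, A)
w = np.zeros(n_states * n_actions)
for k in range(100):
    Qk = w.reshape(n_states, n_actions)
    t = R + gamma * Qk[S_next].max(axis=1)     # t_i = R_i + gamma * max_a' Q_k(S'_i, a')
    w, *_ = np.linalg.lstsq(X, t, rcond=None)  # Q_{k+1} = argmin_Q sum_i (Q(S_i,A_i) - t_i)^2
Q_hat = w.reshape(n_states, n_actions)
```

With one-hot features each regression is exact tabular averaging, so the iteration inherits the γ-contraction and `Q_hat` settles at the empirical fixed point.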
Q-learning update: Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_t + γ max_{a′∈A} Q(S_{t+1}, a′) − Q(S_t, A_t) ].
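The tabular Q-learning update with ε-greedy exploration can be sketched as follows; the chain-walk environment, its dynamics, and all parameter values are illustrative assumptions:

```python
import numpy as np

# Tabular Q-learning with epsilon-greedy exploration on a hypothetical
# noisy chain: action 1 tends to move right, action 0 left; reward 1 at the end.
rng = np.random.default_rng(3)
n_states, n_actions, gamma, alpha, eps = 5, 2, 0.9, 0.1, 0.2
Q = np.zeros((n_states, n_actions))

def step(s, a):
    # Intended move succeeds with probability 0.8, else it is reversed.
    move = (1 if a == 1 else -1) * (1 if rng.random() < 0.8 else -1)
    s_next = int(np.clip(s + move, 0, n_states - 1))
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

s = 0
for t in range(20000):
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
    s_next, r_t = step(s, a)
    # Q(S_t, A_t) <- Q(S_t, A_t) + alpha [R_t + gamma max_a' Q(S_{t+1}, a') - Q(S_t, A_t)]
    Q[s, a] += alpha * (r_t + gamma * Q[s_next].max() - Q[s, a])
    s = 0 if s_next == n_states - 1 else s_next  # restart the episode at the goal
```

Each update nudges Q(S_t, A_t) toward the sampled Bellman backup; exploration (here ε-greedy) is needed so that all state-action pairs keep being visited.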
The TD error R + γ max_{a′∈A} Q(S′, a′) − Q(S, A) has expectation
E[ R + γ max_{a′∈A} Q(S′, a′) − Q(S, A) | S, A ] = (T∗Q)(S, A) − Q(S, A),
so Q-learning takes stochastic steps toward the Bellman backup.
An RL agent may learn any combination of three quantities: a Policy, a Model, and a Value function.