Dynamic Programming and Reinforcement Learning
Daniel Russo Columbia Business School Decision Risk and Operations Division
Fall, 2017
Daniel Russo (Columbia) Fall 2017 1 / 34
Supervised Machine Learning
◮ Learning from datasets
◮ A passive
[Figure: interaction loop with the labels Environment, Action, Outcome, Reward]
◮ The data one gathers depends on the actions they take.
◮ Rather than maximize the immediate benefit from the
*Picture shamelessly lifted from a slide of Emma Brunskill’s
◮ Hope is to enable direct training of control systems
◮ DeepMind’s DQN learns to play Atari from pixels,
◮ Dynamic programming, stochastic approximation,
Holding cost
Order cost
◮ Over fixed sequences of controls $u_0, u_1, \ldots$?
◮ No, over policies (adaptive ordering strategies).
◮ Decisions have delayed consequences.
◮ Relevant information is revealed during the decision
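The gap between fixed control sequences and policies can be checked numerically on a toy ordering problem. The sketch below is illustrative only: the horizon, demand distribution, order sizes, lost-sales dynamics, and the shortage cost are all assumptions, not taken from the slides. It enumerates every fixed order sequence and compares the best one with the best adaptive policy, computed by a backward recursion.

```python
from functools import lru_cache
from itertools import product

N = 3                                      # ordering periods (assumed)
DEMAND = {0: 0.3, 1: 0.4, 2: 0.3}          # demand value -> probability (assumed)
ORDERS = (0, 1, 2)                         # feasible order sizes (assumed)
C_ORDER, C_HOLD, C_SHORT = 1.0, 0.5, 3.0   # unit costs (assumed)


def step(x, u, w):
    """Return (next stock, stage cost); unmet demand is lost (assumed)."""
    left = x + u - w
    cost = C_ORDER * u + C_HOLD * max(left, 0) + C_SHORT * max(-left, 0)
    return max(left, 0), cost


def open_loop_cost(x0, controls):
    """Exact expected cost of a fixed order sequence u_0, ..., u_{N-1}."""
    total = 0.0
    for demands in product(DEMAND, repeat=N):
        prob, x, cost = 1.0, x0, 0.0
        for u, w in zip(controls, demands):
            prob *= DEMAND[w]
            x, g = step(x, u, w)
            cost += g
        total += prob * cost
    return total


@lru_cache(maxsize=None)
def policy_cost(k, x):
    """Optimal expected cost from stock x at period k, over adaptive policies."""
    if k == N:
        return 0.0                         # zero terminal cost (assumed)
    best = float("inf")
    for u in ORDERS:
        expected = 0.0
        for w, p in DEMAND.items():
            nxt, g = step(x, u, w)
            expected += p * (g + policy_cost(k + 1, nxt))
        best = min(best, expected)
    return best


x0 = 0
best_open = min(open_loop_cost(x0, seq) for seq in product(ORDERS, repeat=N))
best_closed = policy_cost(0, x0)
print(f"best fixed sequence: {best_open:.3f}, best policy: {best_closed:.3f}")
```

Every fixed sequence is also a policy (one that ignores the observed stock), so the policy optimum can never be worse than the best fixed sequence; with random demand it can be strictly better.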
$$J^*(x_0) = \min_{\pi \in \Pi} J_\pi(x_0)$$
◮ Can always take $g_N(x, u, w)$ to be independent of $u, w$.
◮ This can be embedded in the functions $f_k, g_k$.
$$J_N^*(x) = \min_{u \in U_N(x)} E[g_N(x, u, w)]$$
$$J_k^*(x) = \min_{u \in U_k(x)} E[g_k(x, u, w) + J_{k+1}^*(f_k(x, u, w))]$$
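The recursion above can be sketched for a problem with finitely many states, controls, and disturbances. This is a minimal illustration, not the lecture's code: the function name `backward_induction`, the callback signatures, and the small capacitated inventory instance (capacity, demand distribution, cost coefficients) are all assumptions chosen for the example.

```python
def backward_induction(N, states, U, W, f, g):
    """Return J, where J[k][x] is the optimal cost-to-go from state x at stage k.

    U(k, x) -> iterable of feasible controls
    W       -> dict mapping each disturbance value to its probability
    f(k, x, u, w) -> next state (must lie in `states`)
    g(k, x, u, w) -> stage cost
    """
    def expect(h):
        # Expectation of h(w) under the disturbance distribution W.
        return sum(p * h(w) for w, p in W.items())

    J = [dict() for _ in range(N + 1)]
    for x in states:                          # stage N: no future term
        J[N][x] = min(expect(lambda w: g(N, x, u, w)) for u in U(N, x))
    for k in range(N - 1, -1, -1):            # stages N-1, ..., 0
        for x in states:
            J[k][x] = min(
                expect(lambda w: g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                for u in U(k, x)
            )
    return J


# Hypothetical instance: inventory with a storage capacity and lost sales.
CAPACITY = 3
DEMAND = {0: 0.3, 1: 0.4, 2: 0.3}


def feasible_orders(k, x):
    return range(CAPACITY - x + 1)            # cannot exceed capacity


def next_stock(k, x, u, w):
    return max(x + u - w, 0)


def stage_cost(k, x, u, w):
    left = x + u - w
    # order cost + holding cost + shortage penalty (coefficients assumed)
    return 1.0 * u + 0.5 * max(left, 0) + 3.0 * max(-left, 0)


J = backward_induction(2, range(CAPACITY + 1), feasible_orders,
                       DEMAND, next_stock, stage_cost)
print({x: round(v, 3) for x, v in J[0].items()})
```

The work grows as (number of stages) × (states) × (controls) × (disturbance values), which is why this tabular form is only practical for small problems.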
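The optimal cost is $J^*(x) = J_0^*(x)$. The optimal cost to go is attained by a policy $\pi^* = (\mu_0^*, \ldots, \mu_N^*)$ satisfying
$$\mu_N^*(x) \in \arg\min_{u \in U_N(x)} E[g_N(x, u, w)]$$
$$\mu_k^*(x) \in \arg\min_{u \in U_k(x)} E[g_k(x, u, w) + J_{k+1}^*(f_k(x, u, w))].$$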
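One way to see the arg-min characterization at work is to record a minimizing control at every stage and state, then evaluate the resulting policy exactly and compare its cost with the cost-to-go table. The sketch below assumes a finite problem; the names `solve` and `evaluate` and the small inventory instance are hypothetical, introduced only for illustration.

```python
def solve(N, states, U, W, f, g):
    """Return (J, mu): cost-to-go tables and one minimizing control per (k, x)."""
    J = [{} for _ in range(N + 1)]
    mu = [{} for _ in range(N + 1)]
    for k in range(N, -1, -1):
        for x in states:
            def q(u):
                # Expected stage cost plus expected cost-to-go of control u.
                return sum(
                    p * (g(k, x, u, w)
                         + (J[k + 1][f(k, x, u, w)] if k < N else 0.0))
                    for w, p in W.items()
                )
            mu[k][x] = min(U(k, x), key=q)    # ties go to the first minimizer
            J[k][x] = q(mu[k][x])
    return J, mu


def evaluate(policy, N, W, f, g, k, x):
    """Exact expected cost of following `policy` from state x at stage k."""
    u = policy[k][x]
    total = 0.0
    for w, p in W.items():
        cost = g(k, x, u, w)
        if k < N:
            cost += evaluate(policy, N, W, f, g, k + 1, f(k, x, u, w))
        total += p * cost
    return total


# Hypothetical instance: capacitated inventory with lost sales.
CAPACITY, HORIZON = 3, 3
DEMAND = {0: 0.3, 1: 0.4, 2: 0.3}
STATES = range(CAPACITY + 1)


def orders(k, x):
    return range(CAPACITY - x + 1)


def next_stock(k, x, u, w):
    return max(x + u - w, 0)


def cost(k, x, u, w):
    left = x + u - w
    return 1.0 * u + 0.5 * max(left, 0) + 3.0 * max(-left, 0)


J, mu = solve(HORIZON, STATES, orders, DEMAND, next_stock, cost)
for x in STATES:
    print(x, mu[0][x], round(J[0][x], 3),
          round(evaluate(mu, HORIZON, DEMAND, next_stock, cost, 0, x), 3))
```

For each starting stock, the last two printed columns should agree: the policy built from the minimizers achieves the optimal cost-to-go.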
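For any policy $\pi = (\mu_0, \mu_1)$ in the two-stage case:
$$J_\pi(x_0) \ge E\big[g_0(x_0, \mu_0(x_0), w_0) + \min_{u \in U(x_1)} E[g_1(x_1, u, w_1) \mid x_1]\big]$$
$$= E[g_0(x_0, \mu_0(x_0), w_0) + J_1^*(x_1)]$$
$$= E[g_0(x_0, \mu_0(x_0), w_0) + J_1^*(f_0(x_0, \mu_0(x_0), w_0))]$$
$$\ge \min_{u \in U(x_0)} E[g_0(x_0, u, w_0) + J_1^*(f_0(x_0, u, w_0))]$$
$$= J_0^*(x_0)$$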
◮ No!
◮ Transition probabilities depend on the order that is
◮ The asset is sold once an offer is accepted.
◮ Offers are no longer available once declined.
$$J_k^*(t) = 0$$
$$J_N^*(x) = x$$
$$J_k^*(x) = \max\{(1 + r)^{N-k} x, \; E[J_{k+1}^*(w_k)]\}$$
Accept an offer $x$ in period $k$ if $(1 + r)^{N-k} x \ge E[J_{k+1}^*(w_k)]$.
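The asset-selling recursion reduces to tracking a single number per period, the expected value of continuing. The sketch below makes assumptions that are not in the slides: offers are i.i.d. with a known discrete distribution, the first decision is at period 1, and the function name `asset_selling_thresholds` is hypothetical. It returns, for each period $k$, the smallest offer worth accepting, $E[J_{k+1}^*(w_k)] / (1 + r)^{N-k}$.

```python
def asset_selling_thresholds(N, r, offers):
    """Return {k: threshold} for k = N-1, ..., 1.

    offers: dict mapping each possible offer to its probability
            (assumed i.i.d. across periods).
    An offer x at period k is worth accepting when x >= threshold[k].
    """
    # E[J*_N(w)] = E[w], because J*_N(x) = x.
    continuation = sum(p * w for w, p in offers.items())
    thresholds = {}
    for k in range(N - 1, 0, -1):
        growth = (1 + r) ** (N - k)           # value at N of 1 unit sold at k
        thresholds[k] = continuation / growth
        # E[J*_k(w)] = E[max{(1+r)^(N-k) w, E[J*_{k+1}(w)]}]
        continuation = sum(
            p * max(growth * w, continuation) for w, p in offers.items()
        )
    return thresholds


# Hypothetical numbers: three possible offers, 2% interest, 5 periods.
example = asset_selling_thresholds(
    N=5, r=0.02, offers={80: 0.25, 100: 0.5, 120: 0.25}
)
for k in sorted(example):
    print(k, round(example[k], 2))
```

With no interest ($r = 0$) the threshold at period $N-1$ is simply the mean offer, and earlier thresholds are at least as high because more chances to sell remain.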