SLIDE 1

Dynamic Programming and Reinforcement Learning

Daniel Russo, Columbia Business School, Decision, Risk, and Operations Division

Fall, 2017

SLIDE 2

Supervised Machine Learning

Learning from datasets
A passive paradigm
Focus on pattern recognition

SLIDE 3

Reinforcement Learning

[Diagram: the agent-environment interaction loop, with actions, outcomes, and rewards.]

Learning to attain a goal through interaction with a poorly understood environment.

SLIDE 4

Canonical (and toy) RL environments

[Images: the Cart Pole and Mountain Car environments.]

SLIDE 5

Impressive new (and toy) RL environments

Atari from pixels

SLIDE 6

Challenges in Reinforcement Learning

Partial Feedback

◮ The data one gathers depends on the actions one takes.

Delayed Consequences

◮ Rather than maximize the immediate benefit from the next interaction, one must consider the impact on future interactions.

SLIDE 7

Dream Application: Management of Chronic Diseases

Various researchers are working on mobile health interventions

SLIDE 8

Dream Application: Intelligent Tutoring Systems

*Picture shamelessly lifted from a slide of Emma Brunskill’s

SLIDE 9

Dream Application: Beyond Myopia in E-Commerce

Online marketplaces and web services have repeated interactions with users, but are designed to optimize the next interaction. RL provides a framework for optimizing the cumulative value generated by such interactions. How useful will this turn out to be?

SLIDE 10

Deep Reinforcement Learning

RL where function approximation is performed using a deep neural network, instead of using linear models, kernel methods, shallow neural networks, etc.

SLIDE 11

Deep Reinforcement Learning

RL where function approximation is performed using a deep neural network, instead of using linear models, kernel methods, shallow neural networks, etc.

Justified excitement:

◮ Hope is to enable direct training of control systems based on complex sensory inputs (e.g. visual or auditory sensors)

◮ DeepMind’s DQN learns to play Atari from pixels, without learning vision first.

SLIDE 12

Deep Reinforcement Learning

RL where function approximation is performed using a deep neural network, instead of using linear models, kernel methods, shallow neural networks, etc.

Justified excitement:

◮ Hope is to enable direct training of control systems based on complex sensory inputs (e.g. visual or auditory sensors)

◮ DeepMind’s DQN learns to play Atari from pixels, without learning vision first.

Also a lot of less justified hype.

SLIDE 13

Warning

1. This is an advanced PhD course.

2. It will be primarily theoretical. We will prove theorems when we can. The emphasis will be on a precise understanding of why methods work and why they may fail completely in simple cases.

3. There are tons of engineering tricks to Deep RL. I won’t cover these.

SLIDE 14

My Goals

1. Encourage great students to do research in this area.

2. Provide a fun platform for introducing technical tools to operations PhD students.

◮ Dynamic programming, stochastic approximation, exploration algorithms and regret analysis.

3. Sharpen my own understanding.

SLIDE 15

Tentative Course Outline

1. Foundational Material on MDPs

2. Estimating Long Run Value

3. Exploration Algorithms

* Additional topics as time permits: policy gradients and actor-critic; rollout and Monte-Carlo tree search.

SLIDE 16

Markov Decision Processes: A warmup

On the white-board: shortest path in a directed graph.

SLIDE 17

Markov Decision Processes: A warmup

On the white-board: shortest path in a directed graph.

Imagine while traversing the shortest path, you discover one of the routes is closed. How should you adjust your path?
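
As a concrete companion to the whiteboard example, here is a minimal Python sketch (not from the slides) that computes shortest-path costs to a goal node by repeated Bellman updates; the graph, node labels, and edge costs below are made up. The resulting values J give the optimal cost-to-go from every node, which is exactly the object the dynamic programming algorithm later generalizes.

```python
import math

def shortest_path_values(edges, goal, num_nodes):
    """edges: dict mapping node -> list of (neighbor, cost) pairs."""
    J = [math.inf] * num_nodes
    J[goal] = 0.0
    # With nonnegative costs, num_nodes - 1 sweeps of Bellman updates suffice.
    for _ in range(num_nodes - 1):
        for v in range(num_nodes):
            if v == goal:
                continue
            for (u, cost) in edges.get(v, []):
                J[v] = min(J[v], cost + J[u])
    return J

# Toy graph: 0 -> 1 -> 3 and 0 -> 2 -> 3, goal node 3.
edges = {0: [(1, 1.0), (2, 4.0)], 1: [(3, 5.0)], 2: [(3, 1.0)]}
print(shortest_path_values(edges, goal=3, num_nodes=4))  # [5.0, 5.0, 1.0, 0.0]
```

If a route closes, the graph changes, so J must be recomputed on the modified edge set; the point of the exercise is that the cost-to-go values, not a single fixed route, are the useful object to carry around.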

SLIDE 18

Example: Inventory Control

Stochastic demand
Orders have lead time
Non-perishable inventory
Inventory holding costs
Finite selling season

SLIDE 19

Example: Inventory Control

Periods k = 0, 1, 2, . . . , N
xk ∈ {0, . . . , 1000}: current inventory
uk ∈ {0, . . . , 1000 − xk}: inventory order
wk ∈ {0, 1, 2, . . .}: demand (i.i.d. with known distribution)
Transition dynamics: xk+1 = ⌊xk − wk⌋ + uk

SLIDE 20

Example: Inventory Control

Periods k = 0, 1, 2, . . . , N
xk ∈ {0, . . . , 1000}: current inventory
uk ∈ {0, . . . , 1000 − xk}: inventory order
wk ∈ {0, 1, 2, . . .}: demand (i.i.d. with known distribution)
Transition dynamics: xk+1 = ⌊xk − wk⌋ + uk

Cost function:
g(x, u, w) = cH x (holding cost) + cL ⌊w − x⌋ (lost sales) + cO(u) (order cost)
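
A small Python sketch of these dynamics and the stage cost may help fix ideas. The constants cH and cL and the form of the ordering cost below are hypothetical, and the ⌊·⌋ brackets from the slide are read here as the positive part max(·, 0); treat this as an illustration under those assumptions, not the definitive model.

```python
import random

C_HOLD, C_LOST = 1.0, 4.0            # assumed holding and lost-sales unit costs

def order_cost(u):                    # assumed ordering cost: fixed charge plus per-unit cost
    return 0.0 if u == 0 else 2.0 + 0.5 * u

def transition(x, u, w):
    """Next inventory: leftover stock plus the arriving order."""
    return max(x - w, 0) + u

def stage_cost(x, u, w):
    """g(x, u, w) = holding cost + lost-sales cost + order cost."""
    return C_HOLD * x + C_LOST * max(w - x, 0) + order_cost(u)

# One simulated period with x = 5 units on hand, an order of u = 3, and random demand.
w = random.randint(0, 10)
print(stage_cost(5, 3, w), transition(5, 3, w))
```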

SLIDE 21

Example: Inventory Control

Objective: min E[ ∑_{k=0}^{N} g(xk, uk, wk) ]

SLIDE 22

Example: Inventory Control

Objective: min E[ ∑_{k=0}^{N} g(xk, uk, wk) ]

Minimize over what?

SLIDE 23

Example: Inventory Control

Objective: min E[ ∑_{k=0}^{N} g(xk, uk, wk) ]

Minimize over what?

◮ Over fixed sequences of controls u0, u1, . . .?
◮ No, over policies (adaptive ordering strategies).

SLIDE 24

Example: Inventory Control

Objective: min E[ ∑_{k=0}^{N} g(xk, uk, wk) ]

Minimize over what?

◮ Over fixed sequences of controls u0, u1, . . .?
◮ No, over policies (adaptive ordering strategies).

Sequential decision making under uncertainty where

◮ Decisions have delayed consequences.
◮ Relevant information is revealed during the decision process.
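
To make the "adaptive ordering strategies" above concrete, here is a tiny sketch of one such rule, a base-stock policy that maps the observed inventory level to an order. The target level S is made up, and the rule is only an illustration of what a policy looks like, not claimed to be optimal for this model.

```python
def base_stock_policy(x, S=8):
    """Order up to the target level S whenever inventory x is below it."""
    return max(S - x, 0)

# The realized orders adapt to whatever inventory the random demand has left behind.
for x in [0, 3, 10]:
    print(x, base_stock_policy(x))
```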

SLIDE 25

Further Examples

Dynamic pricing (over a selling season)
Trade execution (with market impact)
Queuing admission control
Consumption-savings models in economics
Search models in economics
Timing of maintenance and repairs

SLIDE 26

Finite Horizon MDPs: formulation

A discrete time dynamic system
xk+1 = fk(xk, uk, wk),  k = 0, 1, ..., N
where
xk ∈ Xk: state
uk ∈ Uk(xk): control
wk: disturbance (i.i.d. with known distribution)
Assume state and control spaces are finite.

SLIDE 27

Finite Horizon MDPs: formulation

A discrete time dynamic system
xk+1 = fk(xk, uk, wk),  k = 0, 1, ..., N
where
xk ∈ Xk: state
uk ∈ Uk(xk): control
wk: disturbance (i.i.d. with known distribution)
Assume state and control spaces are finite.

The total cost incurred is
∑_{k=0}^{N} gk(xk, uk, wk)
where gk(xk, uk, wk) is the cost in period k.

SLIDE 28

Finite Horizon MDPs: formulation

A policy is a sequence π = (µ0, µ1, ..., µN) where µk : xk → uk ∈ Uk(xk).

The expected cost of following π from state x0 is
Jπ(x0) = E[ ∑_{k=0}^{N} gk(xk, uk, wk) ]
where xk+1 = fk(xk, uk, wk) and E[·] is over the wk's.
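
For any fixed π, Jπ(x0) can be estimated by simulation. Below is a hedged Python sketch of that idea; the dynamics f, costs g, policy, and disturbance sampler are placeholders to be supplied by a concrete model, and the toy model in the usage example is made up.

```python
import random

def evaluate_policy(x0, policy, f, g, sample_w, N, num_rollouts=10_000):
    """Average total cost of following `policy` from x0 over periods k = 0, ..., N."""
    total = 0.0
    for _ in range(num_rollouts):
        x, cost = x0, 0.0
        for k in range(N + 1):
            u = policy(k, x)
            w = sample_w()
            cost += g(k, x, u, w)
            x = f(k, x, u, w)
        total += cost
    return total / num_rollouts

# Example with a toy model: x' = max(x + u - w, 0), per-period cost x, do-nothing policy.
est = evaluate_policy(
    x0=5,
    policy=lambda k, x: 0,
    f=lambda k, x, u, w: max(x + u - w, 0),
    g=lambda k, x, u, w: x,
    sample_w=lambda: random.randint(0, 2),
    N=3,
)
print(est)
```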

SLIDE 29

Finite Horizon MDPs: formulation

The optimal expected cost-to-go from x0 is
J∗(x0) = min_{π∈Π} Jπ(x0)
where Π consists of all feasible policies.

We will see the same policy π∗ is optimal for all initial states. So
J∗(x) = Jπ∗(x)  ∀x

SLIDE 30

Minor differences with Bertsekas Vol. I

Bertsekas:
Uses a special terminal cost function gN(xN)

◮ Can always take gN(x, u, w) to be independent of u, w.

Lets the distribution of wk depend on k and xk.

◮ This can be embedded in the functions fk, gk.

SLIDE 31

Principle of Optimality

Regardless of the consequences of initial decisions, an optimal policy should be optimal in the sub-problem beginning in the current state and time period.

SLIDE 32

Principle of Optimality

Regardless of the consequences of initial decisions, an optimal policy should be optimal in the sub-problem beginning in the current state and time period.

Sufficiency: Such policies exist and minimize total expected cost from any initial state.
Necessity: A policy that is optimal from some initial state must behave optimally in any subproblem that is reached with positive probability.

SLIDE 33

The Dynamic Programming Algorithm

Set
J∗N(x) = min_{u∈UN(x)} E[gN(x, u, w)]  ∀x ∈ XN.

For k = N − 1, N − 2, . . . , 0, set
J∗k(x) = min_{u∈Uk(x)} E[gk(x, u, w) + J∗k+1(fk(x, u, w))]  ∀x ∈ Xk.
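
The recursion translates directly into code. Below is a minimal sketch of backward induction for finite state, control, and disturbance sets; the argument names (states, controls, f, g, w_dist) are hypothetical interfaces chosen for illustration rather than anything prescribed by the slides.

```python
def backward_induction(states, controls, f, g, w_dist, N):
    """Return cost-to-go tables J[k][x] and a greedy policy mu[k][x] for k = 0, ..., N."""
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N + 1)]
    for k in range(N, -1, -1):
        for x in states[k]:
            best_u, best_val = None, float("inf")
            for u in controls(k, x):
                # E[g_k(x, u, w) + J*_{k+1}(f_k(x, u, w))]; the terminal stage k = N
                # has no continuation term.
                val = 0.0
                for w, p in w_dist:
                    cont = 0.0 if k == N else J[k + 1][f(k, x, u, w)]
                    val += p * (g(k, x, u, w) + cont)
                if val < best_val:
                    best_u, best_val = u, val
            J[k][x], mu[k][x] = best_val, best_u
    return J, mu
```

In the inventory example, states[k] would be {0, ..., 1000}, controls(k, x) the set {0, ..., 1000 − x}, w_dist the demand distribution as (value, probability) pairs, and f, g the transition and cost functions sketched earlier.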

SLIDE 34

The Dynamic Programming Algorithm

Proposition

For all x ∈ X0, J∗(x) = J∗0(x). The optimal cost-to-go is attained by a policy π∗ = (µ∗0, ..., µ∗N) where
µ∗N(x) ∈ arg min_{u∈UN(x)} E[gN(x, u, w)]  ∀x ∈ XN
and for all k ∈ {0, . . . , N − 1}, x ∈ Xk,
µ∗k(x) ∈ arg min_{u∈Uk(x)} E[gk(x, u, w) + J∗k+1(fk(x, u, w))].

SLIDE 35

The Dynamic Programming Algorithm

Class Exercise

Argue this is true for a 2-period problem (N = 1). Hint: recall the tower property of conditional expectation, E[Y] = E[E[Y|X]].

SLIDE 36

A Tedious Proof

For any policy π = (µ0, µ1) and initial state x0,

Eπ[g0(x0, µ0(x0), w0) + g1(x1, µ1(x1), w1)]
= Eπ[g0(x0, µ0(x0), w0) + E[g1(x1, µ1(x1), w1) | x1]]
≥ Eπ[g0(x0, µ0(x0), w0) + min_{u∈U(x1)} E[g1(x1, u, w1) | x1]]
= Eπ[g0(x0, µ0(x0), w0) + J∗1(x1)]
= Eπ[g0(x0, µ0(x0), w0) + J∗1(f0(x0, µ0(x0), w0))]
≥ min_{u∈U(x0)} E[g0(x0, u, w0) + J∗1(f0(x0, u, w0))]
= J∗0(x0)

Under π∗, every inequality is an equality.

SLIDE 37

Markov Property

Markov Chain

A stochastic process (X0, X1, X2, . . .) is a Markov chain if for each n ∈ ℕ, conditioned on Xn−1, Xn is independent of (X0, . . . , Xn−2). That is,
P(Xn = · | Xn−1) = P(Xn = · | X0, . . . , Xn−1)

SLIDE 38

Markov Property

Markov Chain

A stochastic process (X0, X1, X2, . . .) is a Markov chain if for each n ∈ ℕ, conditioned on Xn−1, Xn is independent of (X0, . . . , Xn−2). That is,
P(Xn = · | Xn−1) = P(Xn = · | X0, . . . , Xn−1)

Without loss of generality we can view a Markov chain as the output of a stochastic recursion Xn+1 = fn(Xn, Wn) for an i.i.d. sequence of disturbances (W0, W1, . . .).
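
A quick sketch of this stochastic-recursion view: the code below generates a chain by iterating Xn+1 = f(Xn, Wn) with i.i.d. disturbances. The particular f and disturbance distribution (a lazy random walk reflected at the boundaries) are made up for illustration.

```python
import random

def f(x, w):
    return min(max(x + w, 0), 10)       # reflect the walk at the boundaries 0 and 10

def simulate(x0, n_steps):
    xs = [x0]
    for _ in range(n_steps):
        w = random.choice([-1, 0, 1])   # i.i.d. disturbances W_n
        xs.append(f(xs[-1], w))
    return xs

print(simulate(x0=5, n_steps=20))
```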

SLIDE 39

Markov Property

Our problem is called a Markov decision process because
P(xn+1 = x | x0, u0, w0, . . . , xn, un) = P(fn(xn, un, wn) = x | xn, un) = P(xn+1 = x | xn, un).
This requires that the encoding of the state is sufficiently rich.

SLIDE 40

Inventory Control Revisited

Suppose that inventory has a lead time of 2 periods. Orders can still be placed in any period. Is this an MDP with state=current inventory?

SLIDE 41

Inventory Control Revisited

Suppose that inventory has a lead time of 2 periods. Orders can still be placed in any period. Is this an MDP with state=current inventory?

◮ No!
◮ Transition probabilities depend on the order that is currently in transit.

SLIDE 42

Inventory Control Revisited

Suppose that inventory has a lead time of 2 periods. Orders can still be placed in any period. Is this an MDP with state=current inventory?

◮ No!
◮ Transition probabilities depend on the order that is currently in transit.

This is an MDP if we augment the state space so xn = (current inventory, inventory arriving next period).
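
A small sketch of the augmented transition, assuming a two-period lead time as above and the positive-part reading of leftover stock; the numbers in the usage line are arbitrary.

```python
def transition(state, u, w):
    """state = (on_hand, arriving); u is the order placed now, w is the demand."""
    on_hand, arriving = state
    next_on_hand = max(on_hand - w, 0) + arriving   # the in-transit order arrives
    next_arriving = u                               # today's order arrives in two periods
    return (next_on_hand, next_arriving)

print(transition((5, 2), u=4, w=6))  # -> (2, 4)
```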

SLIDE 43

State Augmentation

In the extreme, choosing the state to be the full history
x̃n−1 = (x0, u0, . . . , un−2, xn−1)
suffices, since
P(x̃n = · | x̃n−1, un−1) = P(x̃n = · | x0, u0, . . . , xn−1, un−1).

SLIDE 44

State Augmentation

In the extreme, choosing the state to be the full history
x̃n−1 = (x0, u0, . . . , un−2, xn−1)
suffices, since
P(x̃n = · | x̃n−1, un−1) = P(x̃n = · | x0, u0, . . . , xn−1, un−1).

For the next few weeks we will assume the Markov property holds.

SLIDE 45

State Augmentation

In the extreme, choosing the state to be the full history
x̃n−1 = (x0, u0, . . . , un−2, xn−1)
suffices, since
P(x̃n = · | x̃n−1, un−1) = P(x̃n = · | x0, u0, . . . , xn−1, un−1).

For the next few weeks we will assume the Markov property holds.
Computational tractability usually requires a compact state representation.

SLIDE 46

Example: selling an asset

An instance of optimal stopping.
Deadline to sell within N periods.
Potential buyers make offers in sequence.
The agent chooses to accept or reject each offer:

◮ The asset is sold once an offer is accepted.
◮ Offers are no longer available once declined.

Offers are statistically independent.
Profits can be invested with interest rate r > 0 per period.

SLIDE 47

Example: selling an asset

An instance of optimal stopping.
Deadline to sell within N periods.
Potential buyers make offers in sequence.
The agent chooses to accept or reject each offer:

◮ The asset is sold once an offer is accepted.
◮ Offers are no longer available once declined.

Offers are statistically independent.
Profits can be invested with interest rate r > 0 per period.

Class Exercise

1. Formulate this as a finite horizon MDP.

2. Write down the DP algorithm.

SLIDE 48

Example: selling an asset

Special terminal state t (costless and absorbing).
xk ≠ t is the offer considered at time k.
x0 = 0 is a fictitious null offer.
gk(xk, sell) = (1+r)^(N−k) xk.
xk = wk−1 for independent w0, w1, . . .

SLIDE 49

Example: selling an asset

DP Algorithm:

J∗k(t) = 0  ∀k
J∗N(x) = x
J∗k(x) = max{ (1+r)^(N−k) x, E[J∗k+1(wk)] }

A threshold policy is optimal:
Sell ⟺ xk ≥ E[J∗k+1(wk)] / (1+r)^(N−k)
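
A hedged Python sketch of this recursion for offers drawn from a finite, made-up distribution; it computes E[J∗k+1(wk)] by backward induction and returns the acceptance threshold E[J∗k+1(wk)] / (1+r)^(N−k) for each period k.

```python
def asset_selling_thresholds(offers, probs, N, r):
    """offers[i] arrives with probability probs[i]; returns the selling threshold for each k."""
    # J_next[i] holds J*_{k+1}(offers[i]); start from the terminal condition J*_N(x) = x.
    J_next = list(offers)
    thresholds = [None] * N
    for k in range(N - 1, -1, -1):
        continuation = sum(p * j for p, j in zip(probs, J_next))   # E[J*_{k+1}(w_k)]
        thresholds[k] = continuation / (1 + r) ** (N - k)
        # J*_k(x) = max{(1 + r)^(N - k) * x, E[J*_{k+1}(w_k)]}
        J_next = [max((1 + r) ** (N - k) * x, continuation) for x in offers]
    return thresholds

print(asset_selling_thresholds(offers=[1, 2, 3, 4], probs=[0.25] * 4, N=5, r=0.05))
```

At period N the offer is accepted regardless (the terminal condition), which is why only N thresholds are returned.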
