Introduction to Reinforcement Learning LEC 01 : Dynamic Programming
Introduction to Reinforcement Learning LEC 01 : Dynamic Programming Professor Scott Moura University of California, Berkeley Tsinghua-Berkeley Shenzhen Institute Summer 2019 Prof. Moura | UC Berkeley | TBSI CE 295 | LEC 01 - Dynamic


SLIDE 1

Introduction to Reinforcement Learning LEC 01 : Dynamic Programming

Professor Scott Moura University of California, Berkeley Tsinghua-Berkeley Shenzhen Institute

Summer 2019

  • Prof. Moura | UC Berkeley | TBSI

CE 295 | LEC 01 - Dynamic Programming Slide 1

SLIDE 2

Motivating Example: Traveling Salesman

What is the shortest path to loop through N cities?

[http://www.informatik.uni-leipzig.de/ meiler] [http://www.superbasescientific.com/]

SLIDE 3

Traveling Salesman

What is the shortest path to loop through N cities?

[Figure: 500 cities, random solution]

SLIDE 4

Traveling Salesman

What is the shortest path to loop through N cities?

[Figure: 500 cities, a better solution]

SLIDE 5

Traveling Salesman

What is the shortest path to loop through N cities?

[Figure: 500 cities, best solution]

SLIDE 6

When to use DP?

Use DP when decisions are made in stages. Sometimes, decisions cannot be made in isolation: one needs to balance immediate cost with future costs.

Applications:
  • Maps and robot navigation
  • Urban traffic planning
  • Network routing protocols
  • Optimal trace routing in PCBs
  • Optimal energy management
  • HR scheduling and project management
  • Routing of telecommunications messages
  • Optimal truck routing through a given traffic congestion pattern

SLIDE 7

Richard Bellman, Ph.D. | 1920 - 1984
University of Southern California | RAND Corporation

SLIDE 8

Coining “Dynamic Programming”

“I spent the Fall quarter (of 1950) at RAND. My first task was to find a name for multistage decision processes ... The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named [Charles] Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word research. I’m not using the term lightly; I’m using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term research in his presence. You can imagine how he felt, then, about the term mathematical. The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning is not a good word for various reasons. I decided therefore to use the word “programming”. I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying... Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities.”

Eye of the Hurricane: An Autobiography (1984)

SLIDE 9

Finite-time Formulation

Discrete-time system:

xk+1 = f(xk, uk),   k = 0, 1, · · · , N − 1

  • k : discrete time index
  • xk : state - summarizes current configuration of system at time k
  • uk : control - decision applied at time k
  • N : time horizon - number of times control is applied

Additive cost:

J = Σ_{k=0}^{N−1} ck(xk, uk) + cN(xN)

  • ck : instantaneous cost - incurred at time k
  • cN : final cost - incurred at time N

SLIDE 10

EX 1: Inventory Control

Order items to meet demand, while minimizing costs.

  • xk : items in stock at the beginning of period k
  • uk : items ordered & delivered immediately at the beginning of period k
  • dk : demand of items during period k (assumed deterministic)

SLIDE 11

EX 1: Inventory Control

Order items to meet demand, while minimizing costs.

  • xk : items in stock at the beginning of period k
  • uk : items ordered & delivered immediately at the beginning of period k
  • dk : demand of items during period k (assumed deterministic)

Stock evolves according to

xk+1 = xk + uk − dk

where negative stock corresponds to backlogged demand.

SLIDE 12

EX 1: Inventory Control

Order items to meet demand, while minimizing costs.

  • xk : items in stock at the beginning of period k
  • uk : items ordered & delivered immediately at the beginning of period k
  • dk : demand of items during period k (assumed deterministic)

Stock evolves according to xk+1 = xk + uk − dk, where negative stock corresponds to backlogged demand.

Three types of cost:
(a) r(xk) : penalty for positive stock (holding cost) or negative stock (shortage cost)
(b) The purchasing cost ckuk, where ck is the cost per unit ordered at time k.
(c) Terminal cost R(xN) for excess stock or unfulfilled orders at time N.

Total cost:

J = Σ_{k=0}^{N−1} [r(xk) + ckuk] + R(xN)

SLIDE 13

EX 1: Inventory Control

Order items to meet demand, while minimizing costs.

  • xk : items in stock at the beginning of period k
  • uk : items ordered & delivered immediately at the beginning of period k
  • dk : demand of items during period k (assumed deterministic)

Stock evolves according to xk+1 = xk + uk − dk, where negative stock corresponds to backlogged demand.

Three types of cost:
(a) r(xk) : penalty for positive stock (holding cost) or negative stock (shortage cost)
(b) The purchasing cost ckuk, where ck is the cost per unit ordered at time k.
(c) Terminal cost R(xN) for excess stock or unfulfilled orders at time N.

Total cost:

J = Σ_{k=0}^{N−1} [r(xk) + ckuk] + R(xN)

Minimize cost by proper choice of {u0, u1, · · · , uN−1} subject to uk ≥ 0.
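The structure above can be sketched as a backward recursion on a small, hypothetical instance. All numbers below (horizon, demand, penalty r(x) = x², unit costs, grid bounds) are illustrative assumptions, not from the slides; only the problem structure follows the formulation.

```python
import math

# Hypothetical small instance of the inventory problem (all numbers are
# illustrative): r(x) = x^2 penalizes holding and backlog alike, c_k = 1 per
# unit ordered, and R(x) = x^2 is the terminal cost.
N = 3
demand = [1, 2, 1]                 # deterministic demand d_k
states = range(-3, 4)              # stock x_k (negative = backlogged demand)
controls = range(0, 4)             # order quantity u_k >= 0

r = lambda x: x * x                # holding / shortage penalty
R = lambda x: x * x                # terminal cost
c = [1, 1, 1]                      # per-unit purchase cost c_k

V = {x: R(x) for x in states}      # boundary condition: V_N(x) = R(x)
policy = []
for k in reversed(range(N)):       # Bellman recursion, backward in time
    Vk, pk = {}, {}
    for x in states:
        best, best_u = math.inf, None
        for u in controls:
            x_next = x + u - demand[k]
            if x_next not in states:
                continue           # keep stock inside the modeled grid
            cost = r(x) + c[k] * u + V[x_next]
            if cost < best:
                best, best_u = cost, u
        Vk[x], pk[x] = best, best_u
    V, policy = Vk, [pk] + policy

print(V[0])                        # minimum total cost starting from x_0 = 0
```

Simulating the greedy policy forward from x0 = 0 reproduces exactly the cost V(0), which is the defining invariant of the backward recursion.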

SLIDE 14

Principle of Optimality (in words)

  • Break the multistage decision problem into subproblems.
  • At time step k, assume you know the optimal decisions for time steps k + 1, · · · , N − 1.
  • Compute the best solution for the current time step, and pair it with the future decisions.
  • Start from the end. Work backwards recursively.

In the words of French researcher Kaufmann: “An optimal policy contains only optimal subpolicies.”

SLIDE 15

Principle of Optimality (in math)

Define Vk(xk) as the optimal “value” from time step k to the end of the time horizon N, given the current state is xk.

SLIDE 16

Principle of Optimality (in math)

Define Vk(xk) as the optimal “value” from time step k to the end of the time horizon N, given the current state is xk. Then the principle of optimality (PoO) can be written in recursive form as:

Vk(xk) = min_{uk} {ck(xk, uk) + Vk+1(xk+1)}   [a.k.a. the “Bellman Equation”]

SLIDE 17

Principle of Optimality (in math)

Define Vk(xk) as the optimal “value” from time step k to the end of the time horizon N, given the current state is xk. Then the principle of optimality (PoO) can be written in recursive form as:

Vk(xk) = min_{uk} {ck(xk, uk) + Vk+1(xk+1)}   [a.k.a. the “Bellman Equation”]

with the boundary condition VN(xN) = cN(xN)

SLIDE 18

Principle of Optimality (in math)

Define Vk(xk) as the optimal “value” from time step k to the end of the time horizon N, given the current state is xk. Then the principle of optimality (PoO) can be written in recursive form as:

Vk(xk) = min_{uk} {ck(xk, uk) + Vk+1(xk+1)}   [a.k.a. the “Bellman Equation”]

with the boundary condition VN(xN) = cN(xN)

Admittedly awkward aspects:
  • You solve the problem backward!
  • You solve the problem recursively!
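For finite state and control sets, the Bellman recursion above can be sketched as a short solver. The dynamics f, stage cost c, and terminal cost cN passed in below are toy placeholders (not from the slides), used only to exercise the recursion:

```python
# A minimal sketch of the Bellman recursion for finite state/control sets.
def solve_dp(states, controls, f, c, cN, N):
    """Backward recursion: V[k][x] = min_u { c(k,x,u) + V[k+1][f(x,u)] }."""
    V = [dict() for _ in range(N + 1)]
    pi = [dict() for _ in range(N)]
    V[N] = {x: cN(x) for x in states}              # boundary condition
    for k in reversed(range(N)):                   # work backwards in time
        for x in states:
            V[k][x], pi[k][x] = min(
                (c(k, x, u) + V[k + 1][f(x, u)], u)
                for u in controls if f(x, u) in V[k + 1]
            )
    return V, pi

# Toy instance: drive x toward 0 in N = 2 steps, paying |u| per move and a
# terminal penalty of 10 unless x = 0.
V, pi = solve_dp(
    states=[0, 1, 2],
    controls=[-1, 0, 1],
    f=lambda x, u: x + u,
    c=lambda k, x, u: abs(u),
    cN=lambda x: 0 if x == 0 else 10,
    N=2,
)
print(V[0][2], pi[0][2])  # → 2 -1
```

From x = 2 the solver steps down once per period (u = −1 twice, total cost 2), which beats paying the terminal penalty.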

SLIDE 19

EX 2: Shortest Path Problem

[Figure: graph with nodes A-H and edge costs]

Let V(i) be the length of the shortest path from node i to node H. Ex: V(H) = 0.

SLIDE 20

EX 2: Shortest Path Problem

[Figure: graph with nodes A-H and edge costs]

Let V(i) be the length of the shortest path from node i to node H. Ex: V(H) = 0.

Let c(i, j) denote the cost of traveling from node i to node j. Ex: c(C, E) = 7.

SLIDE 21

EX 2: Shortest Path Problem

[Figure: graph with nodes A-H and edge costs]

Let V(i) be the length of the shortest path from node i to node H. Ex: V(H) = 0.

Let c(i, j) denote the cost of traveling from node i to node j. Ex: c(C, E) = 7.

c(i, j) + V(j) is the cost of traveling from node i to j, and then from j to H along the shortest path.

SLIDE 22

EX 2: Shortest Path Problem - Dijkstra’s Algorithm

Principle of optimality & boundary condition:

V(i) = min_{j ∈ Ni} {c(i, j) + V(j)},   V(H) = 0

where Ni is the set of downstream neighbors of node i.

V(G) = c(G, H) + V(H) = 2 + 0 = 2
V(E) = min {c(E, G) + V(G), c(E, H) + V(H)} = min {3 + 2, 4 + 0} = 4
V(F) = min {c(F, G) + V(G), c(F, H) + V(H), c(F, E) + V(E)} = min {2 + 2, 5 + 0, 1 + 4} = 4
V(D) = min {c(D, E) + V(E), c(D, H) + V(H)} = min {5 + 4, 11 + 0} = 9
V(C) = min {c(C, F) + V(F), c(C, E) + V(E), c(C, D) + V(D)} = min {5 + 4, 7 + 4, 1 + 9} = 9
V(B) = c(B, F) + V(F) = 6 + 4 = 10
V(A) = min {c(A, B) + V(B), c(A, C) + V(C), c(A, D) + V(D)} = min {2 + 10, 4 + 9, 4 + 9} = 12

Optimal Path: A → B → F → G → H
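The value recursion on this slide can be reproduced directly in code; the edge costs below are read off the computations above, and the nodes are processed so that every successor is already evaluated:

```python
# Backward value recursion on the slide's shortest-path graph.
costs = {
    'A': {'B': 2, 'C': 4, 'D': 4},
    'B': {'F': 6},
    'C': {'F': 5, 'E': 7, 'D': 1},
    'D': {'E': 5, 'H': 11},
    'E': {'G': 3, 'H': 4},
    'F': {'G': 2, 'H': 5, 'E': 1},
    'G': {'H': 2},
}
V = {'H': 0.0}          # boundary condition: V(H) = 0
policy = {}
# Visit nodes in reverse topological order so V(j) exists for every successor j.
for i in ['G', 'E', 'F', 'D', 'C', 'B', 'A']:
    j_best, v_best = min(
        ((j, c + V[j]) for j, c in costs[i].items()), key=lambda t: t[1]
    )
    V[i], policy[i] = v_best, j_best

# Recover the optimal path by following the greedy policy from A.
path, node = ['A'], 'A'
while node != 'H':
    node = policy[node]
    path.append(node)
print(V['A'], path)  # → 12.0 ['A', 'B', 'F', 'G', 'H']
```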

SLIDE 23

EX 3: Optimal Consumption & Saving

Given:
  • Consumer lives over periods k = 0, 1, · · · , N and consumes a finite resource.
  • Consumption in each period is uk; the utility of consuming uk units of resource is ln(uk).
  • xk represents the remaining resources in period k.

Dynamics: xk+1 = xk − uk, where x0 is the initial resource level
Constraints: xk ≥ 0
Total utility:

J = Σ_{k=0}^{N−1} ln(uk)

(the terminal utility is zero)

SLIDE 24

EX 3: Optimal Consumption & Saving

Step 1 (Define Value Function): Let Vk(xk) denote the maximum total utility from time step k to terminal time step N, where the resource level in step k is xk.

Step 2 (PoO Equation):

Vk(xk) = max_{0 ≤ uk ≤ xk} {ln(uk) + Vk+1(xk+1)},   k = 0, 1, · · · , N − 1

Step 3 (Boundary Condition): VN(xN) = 0

SLIDE 25

EX 3: Optimal Consumption & Saving

For k = N − 1:

VN−1(xN−1) = max_{0 ≤ uN−1 ≤ xN−1} {ln(uN−1) + VN(xN)}
. . .
VN−1(xN−1) = ln(xN−1),   u⋆N−1 = xN−1

For k = N − 2:

VN−2(xN−2) = max_{0 ≤ uN−2 ≤ xN−2} {ln(uN−2) + VN−1(xN−1)}
. . .
VN−2(xN−2) = ln((1/4) xN−2²),   u⋆N−2 = (1/2) xN−2

SLIDE 26

EX 3: Optimal Consumption & Saving

For k = N − 3:

VN−3(xN−3) = max_{0 ≤ uN−3 ≤ xN−3} {ln(uN−3) + VN−2(xN−2)}
. . .
VN−3(xN−3) = ln((1/27) xN−3³),   u⋆N−3 = (1/3) xN−3

A pattern emerges. Use mathematical induction to show:

u⋆k = xk / (N − k),   k = 0, · · · , N − 1

This is a “state feedback control policy”, of the form u⋆k = π⋆(xk).

One can show the policy in open-loop form is:

u⋆k = x0 / N,   k = 0, · · · , N − 1
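The equal-split policy u⋆k = x0/N can be sanity-checked numerically. The instance below (N = 3, x0 = 1, a 1/200 consumption grid) is an illustrative assumption, not from the slides; a brute-force search over consumption plans should land near uk = 1/3 each period:

```python
import math

# Brute-force check of the DP result for N = 3, x0 = 1 (illustrative numbers):
# the optimal plan should consume an equal share x0 / N = 1/3 each period.
x0 = 1.0
grid = [i / 200 for i in range(1, 200)]        # candidate consumption levels
best, best_plan = -math.inf, None
for u0 in grid:
    for u1 in grid:
        u2 = x0 - u0 - u1                      # last period: consume the rest
        if u2 <= 0:
            continue                           # infeasible plan
        utility = math.log(u0) + math.log(u1) + math.log(u2)
        if utility > best:
            best, best_plan = utility, (u0, u1, u2)

print(best_plan)  # each component is close to 1/3
```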

SLIDE 27

EX 4: Smart Appliance Scheduling

Objective: Schedule a “smart” dishwasher to complete its cycles in a minimal-cost way, given time-varying electricity prices.

xk | cycle     | power
 1 | prewash   | 1.5 kW
 2 | main wash | 2.0 kW
 3 | rinse 1   | 0.5 kW
 4 | rinse 2   | 0.5 kW
 5 | dry       | 1.0 kW

[Figure: time-varying electricity price over the day, in cents/kWh]

[LEFT] Dishwasher cycles and corresponding power consumption. [RIGHT] Time-varying electricity price. The goal is to determine the dishwasher schedule between some initial time and 24:00 that minimizes the total cost of electricity consumed.
SLIDE 28

EX 4: Smart Appliance Scheduling

Problem Formulation

xk State: index of last completed cycle, xk ∈ {0, 1, 2, 3, 4, 5}
uk Control: wait or continue to next cycle, uk ∈ {0, 1}
ck Electricity price in period k [USD/kW/15-min]

minimize   Σ_{k=0}^{N−1} ck · p(xk+1) · uk

subject to: xk+1 = xk + uk,   k = 0, · · · , N − 1
            x0 = 0,  xN = 5
            uk ∈ {0, 1},   k = 0, · · · , N − 1

SLIDE 29

EX 4: Smart Appliance Scheduling

Dynamic Programming

Step 1 (Define Value Function): Let Vk(xk) denote the minimum total cost from time step k to terminal time step N, where the smart dishwasher in step k is in cycle xk.

Step 2 (PoO Equation):

Vk(xk) = min_{uk ∈ {0,1}} {ck · p(xk+1) · uk + Vk+1(xk+1)}
       = min_{uk ∈ {0,1}} {ck · p(xk + uk) · uk + Vk+1(xk + uk)}
       = min {Vk+1(xk), ck · p(xk + 1) + Vk+1(xk + 1)}

Step 3 (Boundary Condition): VN(5) = 0, VN(i) = ∞ for i ≠ 5

Optimal Control Action:

u⋆(xk) = arg min_{uk ∈ {0,1}} {ck · p(xk + uk) · uk + Vk+1(xk + uk)}
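The three steps above can be sketched directly. The cycle powers p come from the table earlier in this example; the horizon N and the period prices below are made up for illustration (the real price curve lives in the earlier figure):

```python
import math

# Sketch of the dishwasher DP. Cycle powers are from the slide's table;
# N = 12 and the prices are illustrative placeholders.
p = {1: 1.5, 2: 2.0, 3: 0.5, 4: 0.5, 5: 1.0}    # kW drawn by each cycle
N = 12                                          # illustrative horizon
price = [3, 3, 8, 8, 8, 2, 2, 2, 6, 6, 4, 4]    # illustrative price c_k

# Boundary condition: all five cycles must be finished by period N.
V = {x: (0.0 if x == 5 else math.inf) for x in range(6)}
for k in reversed(range(N)):
    Vk = {}
    for x in range(6):
        wait = V[x]                                                   # u_k = 0
        run = math.inf if x == 5 else price[k] * p[x + 1] + V[x + 1]  # u_k = 1
        Vk[x] = min(wait, run)
    V = Vk

print(V[0])  # → 14.0 (cheapest way to finish all five cycles by period N)
```

Infeasible schedules (too few periods left to finish the remaining cycles) are ruled out automatically because ∞ propagates backward from the boundary condition.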

SLIDE 30

EX 4: Smart Appliance Scheduling

Output of DP

[Figure: optimal action (blue = wait, yellow = run next cycle) and value function [cUSD], each plotted over time of day and state]

SLIDE 31

EX 4: Smart Appliance Scheduling

Optimal Dishwasher Profile | Start at 07:00

[Figure: electricity price [cents/kW/15min] and resulting optimal dishwasher state trajectory over the day]

SLIDE 32

EX 4: Smart Appliance Scheduling

Optimal Dishwasher Profile | Start at 10:00

[Figure: electricity price [cents/kW/15min] and resulting optimal dishwasher state trajectory over the day]

SLIDE 33

EX 4: Smart Appliance Scheduling

Optimal Dishwasher Profile | Start at 17:00

[Figure: electricity price [cents/kW/15min] and resulting optimal dishwasher state trajectory over the day]
