SLIDE 1

Budgeted Reinforcement Learning in Continuous State Space

Nicolas Carrara1, Edouard Leurent1,2, Tanguy Urvoy3, Romain Laroche4, Odalric Maillard1, Olivier Pietquin1,5

1Inria SequeL, 2Renault Group, 3Orange Labs, 4Microsoft Montréal, 5Google Research, Brain Team
SLIDE 2

Contents

01. Motivation and Setting
02. Budgeted Dynamic Programming
03. Budgeted Reinforcement Learning
04. Experiments
SLIDE 3

01. Motivation and Setting
SLIDE 4

Learning to act

Optimal Decision-Making: which action $a_t$ should we choose in state $s_t$ to maximise a cumulative reward $R$?

$\max_\pi \; \mathbb{E}_{a_t \sim \pi(a_t \mid s_t)} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]$
SLIDE 5

Learning to act

Optimal Decision-Making: which action $a_t$ should we choose in state $s_t$ to maximise a cumulative reward $R$?

$\max_\pi \; \mathbb{E}_{a_t \sim \pi(a_t \mid s_t)} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]$

  • A very general formulation

Widely used in the industry
SLIDE 6

Learning to act

Optimal Decision-Making: which action $a_t$ should we choose in state $s_t$ to maximise a cumulative reward $R$?

$\max_\pi \; \mathbb{E}_{a_t \sim \pi(a_t \mid s_t)} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]$

  • A very general formulation

✗ Not widely used in the industry
SLIDE 7

Learning to act

Optimal Decision-Making: which action $a_t$ should we choose in state $s_t$ to maximise a cumulative reward $R$?

$\max_\pi \; \mathbb{E}_{a_t \sim \pi(a_t \mid s_t)} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]$

  • A very general formulation

✗ Not widely used in the industry:

> Sample efficiency
> Trial and error
> Unpredictable behaviour
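To make this objective concrete, here is a minimal Monte Carlo sketch of estimating $\mathbb{E}[\sum_t \gamma^t R(s_t, a_t)]$ for a given policy; the `env`/`policy` interface is a hypothetical stand-in, not something from the talk.

```python
import numpy as np

def discounted_return(rewards, gamma=0.9):
    """Compute sum_t gamma^t r_t for a single trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def estimate_objective(env, policy, episodes=100, horizon=50, gamma=0.9):
    """Monte Carlo estimate of E[sum_t gamma^t R(s_t, a_t)] under `policy`."""
    returns = []
    for _ in range(episodes):
        s, rewards = env.reset(), []
        for _ in range(horizon):
            a = policy(s)              # a_t ~ pi(a_t | s_t)
            s, r, done = env.step(a)   # s_{t+1} ~ P(. | s_t, a_t), reward R(s_t, a_t)
            rewards.append(r)
            if done:
                break
        returns.append(discounted_return(rewards, gamma))
    return np.mean(returns)
```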
SLIDE 8

Limitation of Reinforcement Learning

Reinforcement learning relies on a single reward function R

SLIDE 9

Limitation of Reinforcement Learning

Reinforcement learning relies on a single reward function R. A convenient formulation, but:

SLIDE 10

Limitation of Reinforcement Learning

Reinforcement learning relies on a single reward function R. A convenient formulation, but:
✗ R is not always easy to design.

SLIDE 11

Limitation of Reinforcement Learning

Reinforcement learning relies on a single reward function R. A convenient formulation, but:
✗ R is not always easy to design.

Conflicting Objectives: complex tasks involve multiple contradictory objectives. Typically:

Task completion vs Safety

SLIDE 12

Limitation of Reinforcement Learning

Reinforcement learning relies on a single reward function R. A convenient formulation, but:
✗ R is not always easy to design.

Conflicting Objectives: complex tasks involve multiple contradictory objectives. Typically:

Task completion vs Safety

For example...

SLIDE 13

Example problems with conflicts

Dialogue systems. A slot-filling problem: the agent fills in a form by asking the user for each slot. It can either:

  • ask the user to answer by voice (safe/slow);
  • ask the user to answer on a numeric pad (unsafe/fast).
SLIDE 14

Example problems with conflicts

Dialogue systems. A slot-filling problem: the agent fills in a form by asking the user for each slot. It can either:

  • ask the user to answer by voice (safe/slow);
  • ask the user to answer on a numeric pad (unsafe/fast).

Autonomous Driving. The agent is driving on a two-way road with a car in front of it. It can either:

  • stay behind (safe/slow);
  • overtake (unsafe/fast).
SLIDE 15

Limitation of Reinforcement Learning

Reinforcement learning relies on a single reward function R. A convenient formulation, but:
✗ R is not always easy to design.

Conflicting Objectives: complex tasks involve multiple contradictory objectives. Typically:

Task completion vs Safety

For a fixed reward function R, there is no control over the Task Completion vs Safety trade-off: π* is only guaranteed to lie on a Pareto-optimal curve Π*.
SLIDE 16

The Pareto-optimal curve

[Figure: policies plotted with Task Completion $G_1 = \sum_t \gamma^t R_1(s_t, a_t)$ on one axis and Safety $G_2 = \sum_t \gamma^t R_2(s_t, a_t)$ on the other; the Pareto-optimal curve Π* is their upper-right frontier.]

$\pi^* \in \operatorname{argmax}_\pi \, \mathbb{E}\left[\sum_t \gamma^t R(s_t, a_t)\right]$, for a reward $R$ that aggregates $R_1$ and $R_2$.
SLIDE 17

From maximal safety to minimal risk

[Figure: the Safety axis is flipped into a Risk axis: Task Completion $G_r$ vs Risk $G_c$, with the Pareto-optimal curve Π*.]

$\pi^* \in \operatorname{argmax}_\pi \, \mathbb{E}\left[\sum_t \gamma^t R(R_r, -R_c)(s_t, a_t)\right]$
SLIDE 18

The optimal policy can move freely along Π∗

[Figure: as the reward trade-off varies, the optimal policy π* moves along the Pareto-optimal curve Π* in the (Task Completion $G_r$, Risk $G_c$) plane.]

$\pi^* \in \operatorname{argmax}_\pi \, \mathbb{E}\left[\sum_t \gamma^t R(R_r, -R_c)(s_t, a_t)\right]$
SLIDE 19

How to choose a desired trade-off

[Figure: a budget β on the Risk axis selects the point π* on the Pareto-optimal curve Π*.]

$\pi^* \in \operatorname{argmax}_\pi \, \mathbb{E}\left[\sum_t \gamma^t R_r(s_t, a_t)\right] \quad \text{s.t.} \quad \mathbb{E}\left[\sum_t \gamma^t R_c(s_t, a_t)\right] \leq \beta$
SLIDE 20

Constrained Reinforcement Learning

Markov Decision Process. An MDP is a tuple $(S, A, P, R_r, \gamma)$ with:

  • Rewards $R_r \in \mathbb{R}^{S \times A}$

Objective: maximise rewards.

$\max_{\pi \in \mathcal{M}(A)^S} \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_r(s_t, a_t) \,\middle|\, s_0 = s\right]$
SLIDE 21

Constrained Reinforcement Learning

Constrained Markov Decision Process. A CMDP is a tuple $(S, A, P, R_r, R_c, \gamma, \beta)$ with:

  • Rewards $R_r \in \mathbb{R}^{S \times A}$
  • Costs $R_c \in \mathbb{R}^{S \times A}$
  • Budget $\beta$

Objective: maximise rewards while keeping costs under a fixed budget.

$\max_{\pi \in \mathcal{M}(A)^S} \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_r(s_t, a_t) \,\middle|\, s_0 = s\right] \quad \text{s.t.} \quad \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_c(s_t, a_t) \,\middle|\, s_0 = s\right] \leq \beta$
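As an illustration, a minimal Monte Carlo check of this objective, assuming a hypothetical environment whose step() returns both the reward R_r and the cost R_c:

```python
import numpy as np

def evaluate_cmdp(env, policy, beta, episodes=100, horizon=50, gamma=0.9):
    """Estimate (E[G_r], E[G_c]) and check the constraint E[G_c] <= beta."""
    g_r, g_c = [], []
    for _ in range(episodes):
        s, ret_r, ret_c = env.reset(), 0.0, 0.0
        for t in range(horizon):
            a = policy(s)
            s, r, c, done = env.step(a)  # reward R_r(s_t, a_t), cost R_c(s_t, a_t)
            ret_r += gamma ** t * r
            ret_c += gamma ** t * c
            if done:
                break
        g_r.append(ret_r)
        g_c.append(ret_c)
    return np.mean(g_r), np.mean(g_c), np.mean(g_c) <= beta
```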
SLIDE 22

We want to learn Π∗ rather than π∗

[Figure: the budget β moves the solution π* along the Pareto-optimal curve Π* in the (Task Completion $G_r$, Risk $G_c$) plane.]

$\pi^* \in \operatorname{argmax}_\pi \, \mathbb{E}\left[\sum_t \gamma^t R_r(s_t, a_t)\right] \quad \text{s.t.} \quad \mathbb{E}\left[\sum_t \gamma^t R_c(s_t, a_t)\right] \leq \beta$
SLIDE 23

We want to learn Π∗ rather than π∗

[Figure: the budget β moves the solution π* along the Pareto-optimal curve Π* in the (Task Completion $G_r$, Risk $G_c$) plane.]

$\pi^* \in \operatorname{argmax}_\pi \, \mathbb{E}\left[\sum_t \gamma^t R_r(s_t, a_t)\right] \quad \text{s.t.} \quad \mathbb{E}\left[\sum_t \gamma^t R_c(s_t, a_t)\right] \leq \beta$
SLIDE 24

Budgeted Reinforcement Learning

Budgeted Markov Decision Process. A BMDP is a tuple $(S, A, P, R_r, R_c, \gamma, \mathcal{B})$ with:

  • Rewards $R_r \in \mathbb{R}^{S \times A}$
  • Costs $R_c \in \mathbb{R}^{S \times A}$
  • Budget space $\mathcal{B}$

Objective: maximise rewards while keeping costs under an adjustable budget.

$\forall \beta \in \mathcal{B}, \quad \max_{\pi \in \mathcal{M}(A \times \mathcal{B})^{S \times \mathcal{B}}} \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_r(s_t, a_t) \,\middle|\, s_0 = s, \beta_0 = \beta\right] \quad \text{s.t.} \quad \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_c(s_t, a_t) \,\middle|\, s_0 = s, \beta_0 = \beta\right] \leq \beta$
SLIDE 25

Problem formulation

Budgeted policies π:

  • take a budget β as an additional input;
  • output a next budget β′;
  • $\pi : \underbrace{(s, \beta)}_{\bar{s}} \to \underbrace{(a, \beta')}_{\bar{a}}$

Augment the spaces with the budget β.
SLIDE 26

Augmented Setting

Definition (Augmented spaces)

  • States: $\bar{S} = S \times \mathcal{B}$
  • Actions: $\bar{A} = A \times \mathcal{B}$
  • Dynamics $\bar{P}$: from state $(s, \beta)$ and action $(a, \beta_a)$, the next state is $s' \sim P(s' \mid s, a)$ with $\beta' = \beta_a$

Definition (Augmented signals)

  1. Rewards: $\bar{R} = (R_r, R_c)$
  2. Returns: $G^\pi = (G^\pi_r, G^\pi_c) \stackrel{\text{def}}{=} \sum_{t=0}^{\infty} \gamma^t \bar{R}(\bar{s}_t, \bar{a}_t)$
  3. Value: $V^\pi(\bar{s}) = (V^\pi_r, V^\pi_c) \stackrel{\text{def}}{=} \mathbb{E}\left[G^\pi \mid \bar{s}_0 = \bar{s}\right]$
  4. Q-Value: $Q^\pi(\bar{s}, \bar{a}) = (Q^\pi_r, Q^\pi_c) \stackrel{\text{def}}{=} \mathbb{E}\left[G^\pi \mid \bar{s}_0 = \bar{s}, \bar{a}_0 = \bar{a}\right]$
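These definitions amount to a thin wrapper around the original environment; a minimal sketch, assuming a hypothetical env that returns (reward, cost) pairs:

```python
class BudgetedEnv:
    """Augments a (reward, cost) environment with a budget dimension."""

    def __init__(self, env, beta0):
        self.env, self.beta = env, beta0

    def reset(self, beta0=None):
        if beta0 is not None:
            self.beta = beta0
        return (self.env.reset(), self.beta)       # augmented state (s, beta)

    def step(self, augmented_action):
        a, beta_a = augmented_action               # augmented action (a, beta_a)
        s_next, r, c, done = self.env.step(a)      # s' ~ P(s' | s, a)
        self.beta = beta_a                         # beta' = beta_a, deterministically
        return (s_next, self.beta), (r, c), done   # augmented reward (R_r, R_c)
```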
SLIDE 27

02. Budgeted Dynamic Programming
SLIDE 28

Policy Evaluation

Proposition (Budgeted Bellman Expectation). The Bellman Expectation equations are preserved:

$V^\pi(\bar{s}) = \sum_{\bar{a} \in \bar{A}} \pi(\bar{a} \mid \bar{s}) \, Q^\pi(\bar{s}, \bar{a})$

$Q^\pi(\bar{s}, \bar{a}) = \bar{R}(\bar{s}, \bar{a}) + \gamma \sum_{\bar{s}' \in \bar{S}} \bar{P}(\bar{s}' \mid \bar{s}, \bar{a}) \, V^\pi(\bar{s}')$
SLIDE 29

Policy Evaluation

Proposition (Budgeted Bellman Expectation). The Bellman Expectation equations are preserved:

$V^\pi(\bar{s}) = \sum_{\bar{a} \in \bar{A}} \pi(\bar{a} \mid \bar{s}) \, Q^\pi(\bar{s}, \bar{a})$

$Q^\pi(\bar{s}, \bar{a}) = \bar{R}(\bar{s}, \bar{a}) + \gamma \sum_{\bar{s}' \in \bar{S}} \bar{P}(\bar{s}' \mid \bar{s}, \bar{a}) \, V^\pi(\bar{s}')$

Proposition (Contraction). The Bellman Expectation Operator $T^\pi$ is a γ-contraction:

$T^\pi Q(\bar{s}, \bar{a}) \stackrel{\text{def}}{=} \bar{R}(\bar{s}, \bar{a}) + \gamma \sum_{\bar{s}' \in \bar{S}} \sum_{\bar{a}' \in \bar{A}} \bar{P}(\bar{s}' \mid \bar{s}, \bar{a}) \, \pi(\bar{a}' \mid \bar{s}') \, Q(\bar{s}', \bar{a}')$

We can evaluate a budgeted policy π.
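Because $T^\pi$ is a γ-contraction, iterating it evaluates a budgeted policy. A tabular numpy sketch, assuming illustrative arrays P (states × actions × states), R (states × actions × 2, stacking reward and cost) and pi (states × actions):

```python
import numpy as np

def bellman_expectation(Q, P, R, pi, gamma=0.9):
    """One application of T^pi to Q of shape (S, A, 2): R + gamma * P V."""
    V = np.einsum('sa,sak->sk', pi, Q)               # V(s') = sum_a' pi(a'|s') Q(s', a')
    return R + gamma * np.einsum('sap,pk->sak', P, V)

def evaluate_policy(P, R, pi, gamma=0.9, tol=1e-8):
    """Fixed-point iteration on T^pi; converges since T^pi is a gamma-contraction."""
    Q = np.zeros_like(R)
    while True:
        Q_next = bellman_expectation(Q, P, R, pi, gamma)
        if np.max(np.abs(Q_next - Q)) < tol:
            return Q_next
        Q = Q_next
```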
SLIDE 30

Budgeted Optimality

Definition (Budgeted Optimality). In that order, we want to:

(i) Respect the budget β: $\Pi_a(\bar{s}) \stackrel{\text{def}}{=} \{\pi \in \Pi : V^\pi_c(s, \beta) \leq \beta\}$

(ii) Maximise the rewards: $V^*_r(\bar{s}) \stackrel{\text{def}}{=} \max_{\pi \in \Pi_a(\bar{s})} V^\pi_r(\bar{s})$, with $\Pi_r(\bar{s}) \stackrel{\text{def}}{=} \operatorname{argmax}_{\pi \in \Pi_a(\bar{s})} V^\pi_r(\bar{s})$

(iii) Minimise the costs: $V^*_c(\bar{s}) \stackrel{\text{def}}{=} \min_{\pi \in \Pi_r(\bar{s})} V^\pi_c(\bar{s})$, with $\Pi^*(\bar{s}) \stackrel{\text{def}}{=} \operatorname{argmin}_{\pi \in \Pi_r(\bar{s})} V^\pi_c(\bar{s})$

We define the budgeted action-value function Q* similarly.
SLIDE 31

Budgeted Optimality

Theorem (Budgeted Bellman Optimality Equation). Q* verifies the following equation:

$Q^*(\bar{s}, \bar{a}) = \mathcal{T} Q^*(\bar{s}, \bar{a}) \stackrel{\text{def}}{=} \bar{R}(\bar{s}, \bar{a}) + \gamma \sum_{\bar{s}' \in \bar{S}} \bar{P}(\bar{s}' \mid \bar{s}, \bar{a}) \sum_{\bar{a}' \in \bar{A}} \pi_{\text{greedy}}(\bar{a}' \mid \bar{s}'; Q^*) \, Q^*(\bar{s}', \bar{a}')$

where the greedy policy $\pi_{\text{greedy}}$ is defined by:

$\pi_{\text{greedy}}(\bar{a} \mid \bar{s}; Q) \in \operatorname{argmin}_{\rho \in \Pi^Q_r} \mathbb{E}_{\bar{a} \sim \rho} \, Q_c(\bar{s}, \bar{a})$, where $\Pi^Q_r \stackrel{\text{def}}{=} \operatorname{argmax}_{\rho \in \mathcal{M}(\bar{A})} \mathbb{E}_{\bar{a} \sim \rho} \, Q_r(\bar{s}, \bar{a}) \;\; \text{s.t.} \;\; \mathbb{E}_{\bar{a} \sim \rho} \, Q_c(\bar{s}, \bar{a}) \leq \beta$
SLIDE 32

The optimal policy

Proposition (Optimality of the policy). $\pi_{\text{greedy}}(\cdot\,; Q^*)$ is simultaneously optimal in all states $\bar{s} \in \bar{S}$: $\pi_{\text{greedy}}(\cdot\,; Q^*) \in \Pi^*(\bar{s})$. In particular, $V^{\pi_{\text{greedy}}(\cdot; Q^*)} = V^*$ and $Q^{\pi_{\text{greedy}}(\cdot; Q^*)} = Q^*$.

Proposition (Solving the non-linear program). $\pi_{\text{greedy}}$ can be computed efficiently, as a mixture $\pi_{\text{hull}}$ of two points that lie on the convex hull of Q: $\pi_{\text{greedy}} = \pi_{\text{hull}}$.
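A sketch of that computation for a single augmented state: build the top (concave) frontier of the candidate points $(Q_c, Q_r)$, then mix the two frontier points that bracket the budget β. This is an illustrative reimplementation of the idea, not the authors' code:

```python
import numpy as np

def top_frontier(q_c, q_r):
    """Indices of the upper convex hull of the points {(q_c[a], q_r[a])}."""
    hull = []
    for i in np.argsort(q_c):                      # scan points by increasing cost
        while len(hull) >= 2:
            x1, y1 = q_c[hull[-2]], q_r[hull[-2]]
            x2, y2 = q_c[hull[-1]], q_r[hull[-1]]
            # non-right turn: hull[-1] lies under the segment => dominated, drop it
            if (x2 - x1) * (q_r[i] - y1) - (y2 - y1) * (q_c[i] - x1) >= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    best = int(np.argmax(q_r[hull]))               # past the top, cost grows for less reward
    return hull[:best + 1]

def greedy_mixture(q_r, q_c, beta):
    """pi_greedy(.|s; Q) as [(action, probability)] pairs (at most two)."""
    hull = top_frontier(q_c, q_r)
    costs = q_c[hull]
    if beta >= costs[-1]:                          # budget loose: best frontier point
        return [(hull[-1], 1.0)]
    if beta <= costs[0]:                           # budget tight: cheapest frontier point
        return [(hull[0], 1.0)]
    k = int(np.searchsorted(costs, beta))          # bracket beta between two points
    p = (beta - costs[k - 1]) / (costs[k] - costs[k - 1])
    return [(hull[k - 1], 1.0 - p), (hull[k], p)]  # mixture meets E[Q_c] = beta
```

When β falls between two frontier points, the returned mixture saturates the constraint exactly, $\mathbb{E}[Q_c] = \beta$, which is what lets the stochastic mixture dominate any deterministic feasible choice.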
SLIDE 33

Solving the non-linear program: intuition

[Figure: candidate points $(Q_c, Q_r)$ for each action, the top frontier of their convex hull, and the dominated points excluded from it.]
SLIDE 34

Solving the non-linear program: intuition

[Figure: candidate points $(Q_c, Q_r)$ for each action, the top frontier of their convex hull, and the dominated points excluded from it.]
SLIDE 35

Solving the non-linear program: intuition

[Figure: candidate points $(Q_c, Q_r)$ for each action, the top frontier of their convex hull, and the dominated points excluded from it.]
SLIDE 36

Solving the non-linear program: intuition

[Figure: candidate points $(Q_c, Q_r)$ for each action, the top frontier of their convex hull, and the dominated points excluded from it.]
SLIDE 37

Solving the non-linear program: intuition

[Figure: candidate points $(Q_c, Q_r)$ for each action, the top frontier of their convex hull, and the dominated points excluded from it.]
SLIDE 38

Convergence analysis

Recall what we've shown so far:

$\mathcal{T} \xrightarrow{\text{fixed point}} Q^* \xrightarrow{\text{tractable}} \pi_{\text{hull}}(Q^*) \xrightarrow{\text{equal}} \pi_{\text{greedy}}(Q^*) \xrightarrow{} \text{optimal}$
SLIDE 39

Convergence analysis

Recall what we've shown so far:

$\mathcal{T} \xrightarrow{\text{fixed point}} Q^* \xrightarrow{\text{tractable}} \pi_{\text{hull}}(Q^*) \xrightarrow{\text{equal}} \pi_{\text{greedy}}(Q^*) \xrightarrow{} \text{optimal}$

We're almost there! All that is left is to perform Fixed-Point Iteration to compute Q*.
SLIDE 40

Convergence analysis

Recall what we've shown so far:

$\mathcal{T} \xrightarrow{\text{fixed point}} Q^* \xrightarrow{\text{tractable}} \pi_{\text{hull}}(Q^*) \xrightarrow{\text{equal}} \pi_{\text{greedy}}(Q^*) \xrightarrow{} \text{optimal}$

We're almost there! All that is left is to perform Fixed-Point Iteration to compute Q*.

Theorem (Non-Contractivity). For any BMDP $(S, A, P, R_r, R_c, \gamma)$ with $|A| \geq 2$, $\mathcal{T}$ is not a contraction:

$\forall \varepsilon > 0, \; \exists Q^1, Q^2 \in (\mathbb{R}^2)^{\bar{S}\bar{A}} : \; \|\mathcal{T} Q^1 - \mathcal{T} Q^2\|_\infty \geq \tfrac{1}{\varepsilon} \, \|Q^1 - Q^2\|_\infty$

✗ We cannot guarantee the convergence of $\mathcal{T}^n(Q_0)$ to $Q^*$.
SLIDE 41

Not a contraction: intuition

SLIDE 42

Not a contraction: intuition

SLIDE 43

Not a contraction: intuition

SLIDE 44

Convergence analysis

Thankfully,

Theorem (Contractivity on smooth Q-functions). $\mathcal{T}$ is a contraction when restricted to the subset $\mathcal{L}_\gamma$ of Q-functions such that "$Q_r$ is L-Lipschitz with respect to $Q_c$", with $L < \frac{1}{\gamma} - 1$:

$\mathcal{L}_\gamma = \left\{ Q \in (\mathbb{R}^2)^{\bar{S}\bar{A}} \;\text{s.t.}\; \exists L < \frac{1}{\gamma} - 1 : \forall \bar{s} \in \bar{S}, \, \bar{a}_1, \bar{a}_2 \in \bar{A}, \; |Q_r(\bar{s}, \bar{a}_1) - Q_r(\bar{s}, \bar{a}_2)| \leq L \, |Q_c(\bar{s}, \bar{a}_1) - Q_c(\bar{s}, \bar{a}_2)| \right\}$

  • We guarantee convergence under some (strong) assumptions
  • We observe empirical convergence
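As an illustration of the condition, a small membership check, assuming a tabular Q of shape (states, actions, 2) with Q[..., 0] = Q_r and Q[..., 1] = Q_c:

```python
import numpy as np

def in_smooth_set(Q, gamma):
    """Check Q in L_gamma: |Q_r(a1)-Q_r(a2)| <= L |Q_c(a1)-Q_c(a2)| with L < 1/gamma - 1."""
    L_max = 1.0 / gamma - 1.0
    for q in Q:                                   # q: (actions, 2), one state
        dr = np.abs(q[:, None, 0] - q[None, :, 0])
        dc = np.abs(q[:, None, 1] - q[None, :, 1])
        if ((dc == 0) & (dr > 0)).any():          # reward jump at equal cost: no finite L
            return False
        mask = dc > 0
        if mask.any() and np.max(dr[mask] / dc[mask]) >= L_max:
            return False                          # best Lipschitz constant too large
    return True
```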
SLIDE 45

Budgeted Dynamic Programming

Algorithm 1: Budgeted Value-Iteration
Data: P, Rr, Rc
Result: Q*
1  Q0 ← 0
2  repeat
3      Qk+1 ← T Qk
4  until convergence
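The same loop in Python, as a minimal sketch; `apply_T` stands for a hypothetical tabular implementation of $\mathcal{T}$ (e.g. built from greedy_mixture above), and the iteration count is capped since $\mathcal{T}$ is not a contraction in general:

```python
import numpy as np

def budgeted_value_iteration(apply_T, q_shape, tol=1e-6, max_iter=1000):
    """Algorithm 1: Q0 <- 0, then Q_{k+1} <- T Q_k until convergence."""
    Q = np.zeros(q_shape)
    for _ in range(max_iter):                 # capped: T may fail to contract
        Q_next = apply_T(Q)
        if np.max(np.abs(Q_next - Q)) < tol:  # empirical convergence criterion
            return Q_next
        Q = Q_next
    return Q
```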
SLIDE 46

03. Budgeted Reinforcement Learning
SLIDE 47

Extension to the RL setting

We address several limitations of Budgeted Value-Iteration:

  1. If P, Rr and Rc are unknown:
SLIDE 48

Extension to the RL setting

We address several limitations of Budgeted Value-Iteration:

  1. If P, Rr and Rc are unknown:
     > Work with a batch of samples $\mathcal{D} = \{(\bar{s}_i, \bar{a}_i, r_i, \bar{s}'_i)\}_{i \in [0, N]}$
SLIDE 49

Extension to the RL setting

We address several limitations of Budgeted Value-Iteration:

  1. If P, Rr and Rc are unknown:
     > Work with a batch of samples $\mathcal{D} = \{(\bar{s}_i, \bar{a}_i, r_i, \bar{s}'_i)\}_{i \in [0, N]}$
     > Replace $\mathcal{T}$ with a sampling operator $\hat{\mathcal{T}}$:

$\hat{\mathcal{T}} Q(\bar{s}_i, \bar{a}_i, r_i, \bar{s}'_i) \stackrel{\text{def}}{=} r_i + \gamma \sum_{\bar{a}'_i \in \bar{A}_i} \pi_{\text{greedy}}(\bar{a}'_i \mid \bar{s}'_i; Q) \, Q(\bar{s}'_i, \bar{a}'_i)$
SLIDE 50

Extension to the RL setting

We address several limitations of Budgeted Value-Iteration:

  1. If P, Rr and Rc are unknown:
     > Work with a batch of samples $\mathcal{D} = \{(\bar{s}_i, \bar{a}_i, r_i, \bar{s}'_i)\}_{i \in [0, N]}$
     > Replace $\mathcal{T}$ with a sampling operator $\hat{\mathcal{T}}$:

$\hat{\mathcal{T}} Q(\bar{s}_i, \bar{a}_i, r_i, \bar{s}'_i) \stackrel{\text{def}}{=} r_i + \gamma \sum_{\bar{a}'_i \in \bar{A}_i} \pi_{\text{greedy}}(\bar{a}'_i \mid \bar{s}'_i; Q) \, Q(\bar{s}'_i, \bar{a}'_i)$

  2. If S is continuous:
SLIDE 51

Extension to the RL setting

We address several limitations of Budgeted Value-Iteration:

  1. If P, Rr and Rc are unknown:
     > Work with a batch of samples $\mathcal{D} = \{(\bar{s}_i, \bar{a}_i, r_i, \bar{s}'_i)\}_{i \in [0, N]}$
     > Replace $\mathcal{T}$ with a sampling operator $\hat{\mathcal{T}}$:

$\hat{\mathcal{T}} Q(\bar{s}_i, \bar{a}_i, r_i, \bar{s}'_i) \stackrel{\text{def}}{=} r_i + \gamma \sum_{\bar{a}'_i \in \bar{A}_i} \pi_{\text{greedy}}(\bar{a}'_i \mid \bar{s}'_i; Q) \, Q(\bar{s}'_i, \bar{a}'_i)$

  2. If S is continuous:
     > Employ function approximation $Q_\theta$, and minimise a regression loss:

$\mathcal{L}(Q_\theta, Q_{\text{target}}; \mathcal{D}) = \sum_{\mathcal{D}} \left\| Q_\theta(\bar{s}, \bar{a}) - Q_{\text{target}}(\bar{s}, \bar{a}, r, \bar{s}') \right\|_2^2$
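Putting the pieces together, one fitted-Q iteration as a hedged PyTorch sketch. The interface is assumed, not the paper's: `q_net(s, beta)` returns per-action values of shape [batch, 2, actions] (a matching network sketch appears on the implementation slide below), `greedy_mixture` is the hull routine from earlier, and for brevity the target only scans next actions that keep the current budget, whereas the full method also enumerates candidate next budgets β_a′:

```python
import torch
import torch.nn.functional as F

def bftq_iteration(q_net, optimizer, batch, gamma=0.9):
    """One application of the sampling operator T-hat plus one regression step."""
    s, a, r, c, s_next, beta_a = batch                 # tensors built from D; a is long
    with torch.no_grad():
        q_next = q_net(s_next, beta_a)                 # [batch, 2, actions]
        targets = []
        for i in range(len(s)):
            q_r, q_c = q_next[i, 0], q_next[i, 1]
            # pi_greedy(.|s'; Q) as the two-point hull mixture sketched earlier
            mix = greedy_mixture(q_r.numpy(), q_c.numpy(), beta_a[i].item())
            next_q = sum(p * q_next[i, :, j] for j, p in mix)
            targets.append(torch.stack([r[i], c[i]]) + gamma * next_q)
        target = torch.stack(targets)                  # T-hat Q on the batch
    pred = q_net(s, beta_a)[torch.arange(len(s)), :, a]  # (Q_r, Q_c) of taken actions
    loss = F.mse_loss(pred, target)                    # ||Q_theta - Q_target||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```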
SLIDE 52

Scalable implementation

  • CPU parallel computing of the targets $\sum_{\bar{a}'_i \in \bar{A}_i} \pi_{\text{greedy}}(\bar{a}'_i \mid \bar{s}'_i; Q) \, Q(\bar{s}'_i, \bar{a}'_i)$
SLIDE 53

Scalable implementation

  • CPU parallel computing of the targets $\sum_{\bar{a}'_i \in \bar{A}_i} \pi_{\text{greedy}}(\bar{a}'_i \mid \bar{s}'_i; Q) \, Q(\bar{s}'_i, \bar{a}'_i)$
  • Same for interactions with the environment.
SLIDE 54

Scalable implementation

  • CPU parallel computing of the targets $\sum_{\bar{a}'_i \in \bar{A}_i} \pi_{\text{greedy}}(\bar{a}'_i \mid \bar{s}'_i; Q) \, Q(\bar{s}'_i, \bar{a}'_i)$
  • Same for interactions with the environment.
  • Neural Network for function approximation:

[Diagram: the augmented input (s, βa) feeds an Encoder, then Hidden Layer 1 and Hidden Layer 2, and the output layer Q produces Qr(a0), Qr(a1), Qc(a0), Qc(a1).]
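A PyTorch sketch of this diagram with illustrative sizes (the slide does not specify them); the output reshapes into Q_r and Q_c for every action, matching the [batch, 2, actions] interface assumed earlier:

```python
import torch
import torch.nn as nn

class BudgetedQNet(nn.Module):
    """(s, beta_a) -> Encoder -> Hidden Layers -> Q_r and Q_c for each action."""

    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.n_actions = n_actions
        self.body = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),   # encoder of (s, beta_a)
            nn.Linear(hidden, hidden), nn.ReLU(),          # hidden layer 1
            nn.Linear(hidden, hidden), nn.ReLU(),          # hidden layer 2
        )
        self.head = nn.Linear(hidden, 2 * n_actions)       # Qr(a), Qc(a) for all a

    def forward(self, s, beta):
        x = torch.cat([s, beta.unsqueeze(-1)], dim=-1)     # augmented input (s, beta_a)
        return self.head(self.body(x)).view(-1, 2, self.n_actions)
```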
SLIDE 55

04. Experiments
SLIDE 56

A baseline approximate solution

Lagrangian Relaxation. Consider the dual problem, so as to replace the hard constraint by a soft constraint penalised by a Lagrangian multiplier λ:

$\max_\pi \; \mathbb{E}\left[\sum_t \gamma^t R_r(s_t, a_t) - \lambda \gamma^t R_c(s_t, a_t)\right]$

  • Train many policies πk with penalties λk and recover the corresponding cost budgets βk
  • Very data/memory-heavy
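The baseline reduces to reward shaping: fold the cost into the reward with a fixed multiplier λ and run a standard RL agent once per λ_k. A sketch, where `train_rl_agent` and `estimate_cost` are hypothetical helpers:

```python
def lagrangian_reward(r, c, lam):
    """Soft-constrained reward R_r - lambda * R_c."""
    return r - lam * c

# One full training run per multiplier, then recover each policy's budget:
# for lam in (0.0, 0.1, 0.5, 1.0, 5.0):
#     pi_k = train_rl_agent(env, reward_fn=lambda r, c, lam=lam: lagrangian_reward(r, c, lam))
#     beta_k = estimate_cost(env, pi_k)   # expected discounted cost of pi_k
```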
SLIDE 57

Dialogue systems

A slot-filling problem: the agent (the dialogue system) fills in a form by asking the user for each slot. It can either:

  • ask the user to answer by voice (safe/slow);
  • ask the user to answer on a numeric pad (unsafe/fast).

[Figure: reward return r and cost return c measured on the dialogue task.]
SLIDE 58

Autonomous driving

The agent (the car) is driving on a two-way road with a car in front of it. It can either:

  • stay behind (safe/slow);
  • overtake (unsafe/fast).

[Figure: reward return r and cost return c measured on the driving task.]
SLIDE 59

Risk-sensitive exploration

How to collect the batch D? We propose an ε-greedy exploration procedure:

SLIDE 60

Risk-sensitive exploration

How to collect the batch D? We propose an ε-greedy exploration procedure:

  • Sample an initial budget β0 ∼ U(B)

SLIDE 61

Risk-sensitive exploration

How to collect the batch D? We propose an ε-greedy exploration procedure:

  • Sample an initial budget β0 ∼ U(B)
  • At each step, in the augmented state $\bar{s} = (s, \beta)$, only explore feasible budgets: $\bar{a} = (a, \beta_a) \sim \mathcal{U}(\Delta)$, where $\Delta$ is such that the distribution $P(a, \beta_a \mid s, \beta)$ verifies $\mathbb{E}[\beta_a] \leq \beta$
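A sketch of the budget side of this procedure, under the assumption B = [0, 1]: drawing β_a uniformly on [0, β] is one simple way to satisfy E[β_a] ≤ β:

```python
import numpy as np

rng = np.random.default_rng(0)

def initial_budget():
    return rng.uniform(0.0, 1.0)          # beta_0 ~ U(B), with B = [0, 1] assumed

def explore(n_actions, beta):
    """epsilon-branch: a uniform feasible augmented action (a, beta_a)."""
    a = int(rng.integers(n_actions))      # a ~ U(A)
    beta_a = rng.uniform(0.0, beta)       # beta_a in [0, beta] => E[beta_a] <= beta
    return a, beta_a
```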
SLIDE 62

Corridors

Two corridors:

  • 1. one with high costs / high rewards
  • 2. the other with no costs / low rewards

→ Validate the risk-sensitive exploration procedure

SLIDE 63

Corridors

[Figure: reward return r and cost return c of BFTQ(risk-sensitive) vs BFTQ(risk-neutral) on the corridors environment.]
SLIDE 64

Thank You!
