SLIDE 1

Budgeted Reinforcement Learning in Continuous State Space

Nicolas Carrara1, Edouard Leurent1,2, Tanguy Urvoy3, Romain Laroche4, Odalric Maillard1, Olivier Pietquin1,5

1Inria SequeL, 2Renault Group, 3Orange Labs, 4Microsoft Montréal, 5Google Research, Brain Team
SLIDE 2

Contents

01. Motivation and Setting
02. Budgeted Dynamic Programming
03. Budgeted Reinforcement Learning
04. Experiments
SLIDE 3

01. Motivation and Setting
SLIDE 4

Learning to act

Optimal Decision-Making: which action $a_t$ should we choose in state $s_t$ to maximise a cumulative reward $R$?

$\max_\pi \; \mathbb{E}_{a_t \sim \pi(a_t \mid s_t)} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]$
SLIDE 5

Learning to act

Optimal Decision-Making: which action $a_t$ should we choose in state $s_t$ to maximise a cumulative reward $R$?

$\max_\pi \; \mathbb{E}_{a_t \sim \pi(a_t \mid s_t)} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]$

  • A very general formulation

Widely used in the industry
SLIDE 6

Learning to act

Optimal Decision-Making: which action $a_t$ should we choose in state $s_t$ to maximise a cumulative reward $R$?

$\max_\pi \; \mathbb{E}_{a_t \sim \pi(a_t \mid s_t)} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]$

  • A very general formulation

✗ Not widely used in the industry
SLIDE 7

Learning to act

Optimal Decision-Making: which action $a_t$ should we choose in state $s_t$ to maximise a cumulative reward $R$?

$\max_\pi \; \mathbb{E}_{a_t \sim \pi(a_t \mid s_t)} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]$

  • A very general formulation

✗ Not widely used in the industry:

> Sample efficiency
> Trial and error
> Unpredictable behaviour
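To make this objective concrete, here is a minimal Monte Carlo sketch of estimating $\mathbb{E}[\sum_t \gamma^t R(s_t, a_t)]$ for a given policy; the `env`/`policy` interface is a hypothetical stand-in, not something from the talk.

```python
import numpy as np

def discounted_return(rewards, gamma=0.9):
    """Compute sum_t gamma^t r_t for a single trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def estimate_objective(env, policy, episodes=100, horizon=50, gamma=0.9):
    """Monte Carlo estimate of E[sum_t gamma^t R(s_t, a_t)] under `policy`."""
    returns = []
    for _ in range(episodes):
        s, rewards = env.reset(), []
        for _ in range(horizon):
            a = policy(s)              # a_t ~ pi(a_t | s_t)
            s, r, done = env.step(a)   # s_{t+1} ~ P(. | s_t, a_t), reward R(s_t, a_t)
            rewards.append(r)
            if done:
                break
        returns.append(discounted_return(rewards, gamma))
    return np.mean(returns)
```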
SLIDE 8

Limitation of Reinforcement Learning

Reinforcement learning relies on a single reward function R

SLIDE 9

Limitation of Reinforcement Learning

Reinforcement learning relies on a single reward function R. A convenient formulation, but:

SLIDE 10

Limitation of Reinforcement Learning

Reinforcement learning relies on a single reward function R. A convenient formulation, but:
✗ R is not always easy to design.

SLIDE 11

Limitation of Reinforcement Learning

Reinforcement learning relies on a single reward function R. A convenient formulation, but:
✗ R is not always easy to design.

Conflicting Objectives: complex tasks involve multiple contradictory objectives. Typically:

Task completion vs Safety

SLIDE 12

Limitation of Reinforcement Learning

Reinforcement learning relies on a single reward function R. A convenient formulation, but:
✗ R is not always easy to design.

Conflicting Objectives: complex tasks involve multiple contradictory objectives. Typically:

Task completion vs Safety

For example...

SLIDE 13

Example problems with conflicts

Dialogue systems. A slot-filling problem: the agent fills in a form by asking the user for each slot. It can either:

  • ask the user to answer by voice (safe/slow);
  • ask the user to answer on a numeric pad (unsafe/fast).
SLIDE 14

Example problems with conflicts

Dialogue systems. A slot-filling problem: the agent fills in a form by asking the user for each slot. It can either:

  • ask the user to answer by voice (safe/slow);
  • ask the user to answer on a numeric pad (unsafe/fast).

Autonomous Driving. The agent is driving on a two-way road with a car in front of it. It can either:

  • stay behind (safe/slow);
  • overtake (unsafe/fast).
SLIDE 15

Limitation of Reinforcement Learning

Reinforcement learning relies on a single reward function R. A convenient formulation, but:
✗ R is not always easy to design.

Conflicting Objectives: complex tasks involve multiple contradictory objectives. Typically:

Task completion vs Safety

For a fixed reward function R, there is no control over the Task Completion vs Safety trade-off: π* is only guaranteed to lie on a Pareto-optimal curve Π*.
SLIDE 16

The Pareto-optimal curve

[Figure: policies plotted with Task Completion $G_1 = \sum_t \gamma^t R_1(s_t, a_t)$ on one axis and Safety $G_2 = \sum_t \gamma^t R_2(s_t, a_t)$ on the other; the Pareto-optimal curve Π* is their upper-right frontier.]

$\pi^* \in \operatorname{argmax}_\pi \, \mathbb{E}\left[\sum_t \gamma^t R(s_t, a_t)\right]$, for a reward $R$ that aggregates $R_1$ and $R_2$.
SLIDE 17

From maximal safety to minimal risk

[Figure: the Safety axis is flipped into a Risk axis: Task Completion $G_r$ vs Risk $G_c$, with the Pareto-optimal curve Π*.]

$\pi^* \in \operatorname{argmax}_\pi \, \mathbb{E}\left[\sum_t \gamma^t R(R_r, -R_c)(s_t, a_t)\right]$
SLIDE 18

The optimal policy can move freely along Π∗

[Figure: as the reward trade-off varies, the optimal policy π* moves along the Pareto-optimal curve Π* in the (Task Completion $G_r$, Risk $G_c$) plane.]

$\pi^* \in \operatorname{argmax}_\pi \, \mathbb{E}\left[\sum_t \gamma^t R(R_r, -R_c)(s_t, a_t)\right]$
SLIDE 19

How to choose a desired trade-off

[Figure: a budget β on the Risk axis selects the point π* on the Pareto-optimal curve Π*.]

$\pi^* \in \operatorname{argmax}_\pi \, \mathbb{E}\left[\sum_t \gamma^t R_r(s_t, a_t)\right] \quad \text{s.t.} \quad \mathbb{E}\left[\sum_t \gamma^t R_c(s_t, a_t)\right] \leq \beta$
SLIDE 20

Constrained Reinforcement Learning

Markov Decision Process. An MDP is a tuple $(S, A, P, R_r, \gamma)$ with:

  • Rewards $R_r \in \mathbb{R}^{S \times A}$

Objective: maximise rewards.

$\max_{\pi \in \mathcal{M}(A)^S} \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_r(s_t, a_t) \,\middle|\, s_0 = s\right]$
SLIDE 21

Constrained Reinforcement Learning

Constrained Markov Decision Process. A CMDP is a tuple $(S, A, P, R_r, R_c, \gamma, \beta)$ with:

  • Rewards $R_r \in \mathbb{R}^{S \times A}$
  • Costs $R_c \in \mathbb{R}^{S \times A}$
  • Budget $\beta$

Objective: maximise rewards while keeping costs under a fixed budget.

$\max_{\pi \in \mathcal{M}(A)^S} \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_r(s_t, a_t) \,\middle|\, s_0 = s\right] \quad \text{s.t.} \quad \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_c(s_t, a_t) \,\middle|\, s_0 = s\right] \leq \beta$
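As an illustration, a minimal Monte Carlo check of this objective, assuming a hypothetical environment whose step() returns both the reward R_r and the cost R_c:

```python
import numpy as np

def evaluate_cmdp(env, policy, beta, episodes=100, horizon=50, gamma=0.9):
    """Estimate (E[G_r], E[G_c]) and check the constraint E[G_c] <= beta."""
    g_r, g_c = [], []
    for _ in range(episodes):
        s, ret_r, ret_c = env.reset(), 0.0, 0.0
        for t in range(horizon):
            a = policy(s)
            s, r, c, done = env.step(a)  # reward R_r(s_t, a_t), cost R_c(s_t, a_t)
            ret_r += gamma ** t * r
            ret_c += gamma ** t * c
            if done:
                break
        g_r.append(ret_r)
        g_c.append(ret_c)
    return np.mean(g_r), np.mean(g_c), np.mean(g_c) <= beta
```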
SLIDE 22

We want to learn Π∗ rather than π∗

[Figure: the budget β moves the solution π* along the Pareto-optimal curve Π* in the (Task Completion $G_r$, Risk $G_c$) plane.]

$\pi^* \in \operatorname{argmax}_\pi \, \mathbb{E}\left[\sum_t \gamma^t R_r(s_t, a_t)\right] \quad \text{s.t.} \quad \mathbb{E}\left[\sum_t \gamma^t R_c(s_t, a_t)\right] \leq \beta$
SLIDE 23

We want to learn Π∗ rather than π∗

[Figure: the budget β moves the solution π* along the Pareto-optimal curve Π* in the (Task Completion $G_r$, Risk $G_c$) plane.]

$\pi^* \in \operatorname{argmax}_\pi \, \mathbb{E}\left[\sum_t \gamma^t R_r(s_t, a_t)\right] \quad \text{s.t.} \quad \mathbb{E}\left[\sum_t \gamma^t R_c(s_t, a_t)\right] \leq \beta$
SLIDE 24

Budgeted Reinforcement Learning

Budgeted Markov Decision Process. A BMDP is a tuple $(S, A, P, R_r, R_c, \gamma, \mathcal{B})$ with:

  • Rewards $R_r \in \mathbb{R}^{S \times A}$
  • Costs $R_c \in \mathbb{R}^{S \times A}$
  • Budget space $\mathcal{B}$

Objective: maximise rewards while keeping costs under an adjustable budget.

$\forall \beta \in \mathcal{B}, \quad \max_{\pi \in \mathcal{M}(A \times \mathcal{B})^{S \times \mathcal{B}}} \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_r(s_t, a_t) \,\middle|\, s_0 = s, \beta_0 = \beta\right] \quad \text{s.t.} \quad \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_c(s_t, a_t) \,\middle|\, s_0 = s, \beta_0 = \beta\right] \leq \beta$
SLIDE 25

Problem formulation

Budgeted policies π:

  • take a budget β as an additional input;
  • output a next budget β′;
  • $\pi : \underbrace{(s, \beta)}_{\bar{s}} \to \underbrace{(a, \beta')}_{\bar{a}}$

Augment the spaces with the budget β.
SLIDE 26

Augmented Setting

Definition (Augmented spaces)

  • States: $\bar{S} = S \times \mathcal{B}$
  • Actions: $\bar{A} = A \times \mathcal{B}$
  • Dynamics $\bar{P}$: from state $(s, \beta)$ and action $(a, \beta_a)$, the next state is $s' \sim P(s' \mid s, a)$ with $\beta' = \beta_a$

Definition (Augmented signals)

  1. Rewards: $\bar{R} = (R_r, R_c)$
  2. Returns: $G^\pi = (G^\pi_r, G^\pi_c) \stackrel{\text{def}}{=} \sum_{t=0}^{\infty} \gamma^t \bar{R}(\bar{s}_t, \bar{a}_t)$
  3. Value: $V^\pi(\bar{s}) = (V^\pi_r, V^\pi_c) \stackrel{\text{def}}{=} \mathbb{E}\left[G^\pi \mid \bar{s}_0 = \bar{s}\right]$
  4. Q-Value: $Q^\pi(\bar{s}, \bar{a}) = (Q^\pi_r, Q^\pi_c) \stackrel{\text{def}}{=} \mathbb{E}\left[G^\pi \mid \bar{s}_0 = \bar{s}, \bar{a}_0 = \bar{a}\right]$
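These definitions amount to a thin wrapper around the original environment; a minimal sketch, assuming a hypothetical env that returns (reward, cost) pairs:

```python
class BudgetedEnv:
    """Augments a (reward, cost) environment with a budget dimension."""

    def __init__(self, env, beta0):
        self.env, self.beta = env, beta0

    def reset(self, beta0=None):
        if beta0 is not None:
            self.beta = beta0
        return (self.env.reset(), self.beta)       # augmented state (s, beta)

    def step(self, augmented_action):
        a, beta_a = augmented_action               # augmented action (a, beta_a)
        s_next, r, c, done = self.env.step(a)      # s' ~ P(s' | s, a)
        self.beta = beta_a                         # beta' = beta_a, deterministically
        return (s_next, self.beta), (r, c), done   # augmented reward (R_r, R_c)
```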
SLIDE 27

02. Budgeted Dynamic Programming
SLIDE 28

Policy Evaluation

Proposition (Budgeted Bellman Expectation). The Bellman Expectation equations are preserved:

$V^\pi(\bar{s}) = \sum_{\bar{a} \in \bar{A}} \pi(\bar{a} \mid \bar{s}) \, Q^\pi(\bar{s}, \bar{a})$

$Q^\pi(\bar{s}, \bar{a}) = \bar{R}(\bar{s}, \bar{a}) + \gamma \sum_{\bar{s}' \in \bar{S}} \bar{P}(\bar{s}' \mid \bar{s}, \bar{a}) \, V^\pi(\bar{s}')$
SLIDE 29

Policy Evaluation

Proposition (Budgeted Bellman Expectation). The Bellman Expectation equations are preserved:

$V^\pi(\bar{s}) = \sum_{\bar{a} \in \bar{A}} \pi(\bar{a} \mid \bar{s}) \, Q^\pi(\bar{s}, \bar{a})$

$Q^\pi(\bar{s}, \bar{a}) = \bar{R}(\bar{s}, \bar{a}) + \gamma \sum_{\bar{s}' \in \bar{S}} \bar{P}(\bar{s}' \mid \bar{s}, \bar{a}) \, V^\pi(\bar{s}')$

Proposition (Contraction). The Bellman Expectation Operator $T^\pi$ is a γ-contraction:

$T^\pi Q(\bar{s}, \bar{a}) \stackrel{\text{def}}{=} \bar{R}(\bar{s}, \bar{a}) + \gamma \sum_{\bar{s}' \in \bar{S}} \sum_{\bar{a}' \in \bar{A}} \bar{P}(\bar{s}' \mid \bar{s}, \bar{a}) \, \pi(\bar{a}' \mid \bar{s}') \, Q(\bar{s}', \bar{a}')$

We can evaluate a budgeted policy π.
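Because $T^\pi$ is a γ-contraction, iterating it evaluates a budgeted policy. A tabular numpy sketch, assuming illustrative arrays P (states × actions × states), R (states × actions × 2, stacking reward and cost) and pi (states × actions):

```python
import numpy as np

def bellman_expectation(Q, P, R, pi, gamma=0.9):
    """One application of T^pi to Q of shape (S, A, 2): R + gamma * P V."""
    V = np.einsum('sa,sak->sk', pi, Q)               # V(s') = sum_a' pi(a'|s') Q(s', a')
    return R + gamma * np.einsum('sap,pk->sak', P, V)

def evaluate_policy(P, R, pi, gamma=0.9, tol=1e-8):
    """Fixed-point iteration on T^pi; converges since T^pi is a gamma-contraction."""
    Q = np.zeros_like(R)
    while True:
        Q_next = bellman_expectation(Q, P, R, pi, gamma)
        if np.max(np.abs(Q_next - Q)) < tol:
            return Q_next
        Q = Q_next
```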
SLIDE 30

Budgeted Optimality

Definition (Budgeted Optimality). In that order, we want to:

(i) Respect the budget β: $\Pi_a(\bar{s}) \stackrel{\text{def}}{=} \{\pi \in \Pi : V^\pi_c(s, \beta) \leq \beta\}$

(ii) Maximise the rewards: $V^*_r(\bar{s}) \stackrel{\text{def}}{=} \max_{\pi \in \Pi_a(\bar{s})} V^\pi_r(\bar{s})$, with $\Pi_r(\bar{s}) \stackrel{\text{def}}{=} \operatorname{argmax}_{\pi \in \Pi_a(\bar{s})} V^\pi_r(\bar{s})$

(iii) Minimise the costs: $V^*_c(\bar{s}) \stackrel{\text{def}}{=} \min_{\pi \in \Pi_r(\bar{s})} V^\pi_c(\bar{s})$, with $\Pi^*(\bar{s}) \stackrel{\text{def}}{=} \operatorname{argmin}_{\pi \in \Pi_r(\bar{s})} V^\pi_c(\bar{s})$

We define the budgeted action-value function Q* similarly.
SLIDE 31

Budgeted Optimality

Theorem (Budgeted Bellman Optimality Equation). Q* verifies the following equation:

$Q^*(\bar{s}, \bar{a}) = \mathcal{T} Q^*(\bar{s}, \bar{a}) \stackrel{\text{def}}{=} \bar{R}(\bar{s}, \bar{a}) + \gamma \sum_{\bar{s}' \in \bar{S}} \bar{P}(\bar{s}' \mid \bar{s}, \bar{a}) \sum_{\bar{a}' \in \bar{A}} \pi_{\text{greedy}}(\bar{a}' \mid \bar{s}'; Q^*) \, Q^*(\bar{s}', \bar{a}')$

where the greedy policy $\pi_{\text{greedy}}$ is defined by:

$\pi_{\text{greedy}}(\bar{a} \mid \bar{s}; Q) \in \operatorname{argmin}_{\rho \in \Pi^Q_r} \mathbb{E}_{\bar{a} \sim \rho} \, Q_c(\bar{s}, \bar{a})$, where $\Pi^Q_r \stackrel{\text{def}}{=} \operatorname{argmax}_{\rho \in \mathcal{M}(\bar{A})} \mathbb{E}_{\bar{a} \sim \rho} \, Q_r(\bar{s}, \bar{a}) \;\; \text{s.t.} \;\; \mathbb{E}_{\bar{a} \sim \rho} \, Q_c(\bar{s}, \bar{a}) \leq \beta$
SLIDE 32

The optimal policy

Proposition (Optimality of the policy). $\pi_{\text{greedy}}(\cdot\,; Q^*)$ is simultaneously optimal in all states $\bar{s} \in \bar{S}$: $\pi_{\text{greedy}}(\cdot\,; Q^*) \in \Pi^*(\bar{s})$. In particular, $V^{\pi_{\text{greedy}}(\cdot; Q^*)} = V^*$ and $Q^{\pi_{\text{greedy}}(\cdot; Q^*)} = Q^*$.

Proposition (Solving the non-linear program). $\pi_{\text{greedy}}$ can be computed efficiently, as a mixture $\pi_{\text{hull}}$ of two points that lie on the convex hull of Q: $\pi_{\text{greedy}} = \pi_{\text{hull}}$.
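A sketch of that computation for a single augmented state: build the top (concave) frontier of the candidate points $(Q_c, Q_r)$, then mix the two frontier points that bracket the budget β. This is an illustrative reimplementation of the idea, not the authors' code:

```python
import numpy as np

def top_frontier(q_c, q_r):
    """Indices of the upper convex hull of the points {(q_c[a], q_r[a])}."""
    hull = []
    for i in np.argsort(q_c):                      # scan points by increasing cost
        while len(hull) >= 2:
            x1, y1 = q_c[hull[-2]], q_r[hull[-2]]
            x2, y2 = q_c[hull[-1]], q_r[hull[-1]]
            # non-right turn: hull[-1] lies under the segment => dominated, drop it
            if (x2 - x1) * (q_r[i] - y1) - (y2 - y1) * (q_c[i] - x1) >= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    best = int(np.argmax(q_r[hull]))               # past the top, cost grows for less reward
    return hull[:best + 1]

def greedy_mixture(q_r, q_c, beta):
    """pi_greedy(.|s; Q) as [(action, probability)] pairs (at most two)."""
    hull = top_frontier(q_c, q_r)
    costs = q_c[hull]
    if beta >= costs[-1]:                          # budget loose: best frontier point
        return [(hull[-1], 1.0)]
    if beta <= costs[0]:                           # budget tight: cheapest frontier point
        return [(hull[0], 1.0)]
    k = int(np.searchsorted(costs, beta))          # bracket beta between two points
    p = (beta - costs[k - 1]) / (costs[k] - costs[k - 1])
    return [(hull[k - 1], 1.0 - p), (hull[k], p)]  # mixture meets E[Q_c] = beta
```

When β falls between two frontier points, the returned mixture saturates the constraint exactly, $\mathbb{E}[Q_c] = \beta$, which is what lets the stochastic mixture dominate any deterministic feasible choice.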
SLIDE 33

Solving the non-linear program: intuition

[Figure: candidate points $(Q_c, Q_r)$ for each action, the top frontier of their convex hull, and the dominated points excluded from it.]
SLIDE 34

Solving the non-linear program: intuition

[Figure: candidate points $(Q_c, Q_r)$ for each action, the top frontier of their convex hull, and the dominated points excluded from it.]
SLIDE 35

Solving the non-linear program: intuition

[Figure: candidate points $(Q_c, Q_r)$ for each action, the top frontier of their convex hull, and the dominated points excluded from it.]
SLIDE 36

Solving the non-linear program: intuition

[Figure: candidate points $(Q_c, Q_r)$ for each action, the top frontier of their convex hull, and the dominated points excluded from it.]
SLIDE 37

Solving the non-linear program: intuition

[Figure: candidate points $(Q_c, Q_r)$ for each action, the top frontier of their convex hull, and the dominated points excluded from it.]
SLIDE 38

Convergence analysis

Recall what we've shown so far:

$\mathcal{T} \xrightarrow{\text{fixed point}} Q^* \xrightarrow{\text{tractable}} \pi_{\text{hull}}(Q^*) \xrightarrow{\text{equal}} \pi_{\text{greedy}}(Q^*) \xrightarrow{} \text{optimal}$
SLIDE 39

Convergence analysis

Recall what we've shown so far:

$\mathcal{T} \xrightarrow{\text{fixed point}} Q^* \xrightarrow{\text{tractable}} \pi_{\text{hull}}(Q^*) \xrightarrow{\text{equal}} \pi_{\text{greedy}}(Q^*) \xrightarrow{} \text{optimal}$

We're almost there! All that is left is to perform Fixed-Point Iteration to compute Q*.
SLIDE 40

Convergence analysis

Recall what we've shown so far:

$\mathcal{T} \xrightarrow{\text{fixed point}} Q^* \xrightarrow{\text{tractable}} \pi_{\text{hull}}(Q^*) \xrightarrow{\text{equal}} \pi_{\text{greedy}}(Q^*) \xrightarrow{} \text{optimal}$

We're almost there! All that is left is to perform Fixed-Point Iteration to compute Q*.

Theorem (Non-Contractivity). For any BMDP $(S, A, P, R_r, R_c, \gamma)$ with $|A| \geq 2$, $\mathcal{T}$ is not a contraction:

$\forall \varepsilon > 0, \; \exists Q^1, Q^2 \in (\mathbb{R}^2)^{\bar{S}\bar{A}} : \; \|\mathcal{T} Q^1 - \mathcal{T} Q^2\|_\infty \geq \tfrac{1}{\varepsilon} \, \|Q^1 - Q^2\|_\infty$

✗ We cannot guarantee the convergence of $\mathcal{T}^n(Q_0)$ to $Q^*$.
SLIDE 41

Not a contraction: intuition

SLIDE 42

Not a contraction: intuition

SLIDE 43

Not a contraction: intuition

SLIDE 44

Convergence analysis

Thankfully,

Theorem (Contractivity on smooth Q-functions). $\mathcal{T}$ is a contraction when restricted to the subset $\mathcal{L}_\gamma$ of Q-functions such that "$Q_r$ is L-Lipschitz with respect to $Q_c$", with $L < \frac{1}{\gamma} - 1$:

$\mathcal{L}_\gamma = \left\{ Q \in (\mathbb{R}^2)^{\bar{S}\bar{A}} \;\text{s.t.}\; \exists L < \frac{1}{\gamma} - 1 : \forall \bar{s} \in \bar{S}, \, \bar{a}_1, \bar{a}_2 \in \bar{A}, \; |Q_r(\bar{s}, \bar{a}_1) - Q_r(\bar{s}, \bar{a}_2)| \leq L \, |Q_c(\bar{s}, \bar{a}_1) - Q_c(\bar{s}, \bar{a}_2)| \right\}$

  • We guarantee convergence under some (strong) assumptions
  • We observe empirical convergence
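As an illustration of the condition, a small membership check, assuming a tabular Q of shape (states, actions, 2) with Q[..., 0] = Q_r and Q[..., 1] = Q_c:

```python
import numpy as np

def in_smooth_set(Q, gamma):
    """Check Q in L_gamma: |Q_r(a1)-Q_r(a2)| <= L |Q_c(a1)-Q_c(a2)| with L < 1/gamma - 1."""
    L_max = 1.0 / gamma - 1.0
    for q in Q:                                   # q: (actions, 2), one state
        dr = np.abs(q[:, None, 0] - q[None, :, 0])
        dc = np.abs(q[:, None, 1] - q[None, :, 1])
        if ((dc == 0) & (dr > 0)).any():          # reward jump at equal cost: no finite L
            return False
        mask = dc > 0
        if mask.any() and np.max(dr[mask] / dc[mask]) >= L_max:
            return False                          # best Lipschitz constant too large
    return True
```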
SLIDE 45

Budgeted Dynamic Programming

Algorithm 1: Budgeted Value-Iteration
Data: P, Rr, Rc
Result: Q*
1  Q0 ← 0
2  repeat
3      Qk+1 ← T Qk
4  until convergence
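The same loop in Python, as a minimal sketch; `apply_T` stands for a hypothetical tabular implementation of $\mathcal{T}$ (e.g. built from greedy_mixture above), and the iteration count is capped since $\mathcal{T}$ is not a contraction in general:

```python
import numpy as np

def budgeted_value_iteration(apply_T, q_shape, tol=1e-6, max_iter=1000):
    """Algorithm 1: Q0 <- 0, then Q_{k+1} <- T Q_k until convergence."""
    Q = np.zeros(q_shape)
    for _ in range(max_iter):                 # capped: T may fail to contract
        Q_next = apply_T(Q)
        if np.max(np.abs(Q_next - Q)) < tol:  # empirical convergence criterion
            return Q_next
        Q = Q_next
    return Q
```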
SLIDE 46

03. Budgeted Reinforcement Learning
SLIDE 47

Extension to the RL setting

We address several limitations of Budgeted Value-Iteration:

  1. If P, Rr and Rc are unknown:
SLIDE 48

Extension to the RL setting

We address several limitations of Budgeted Value-Iteration:

  1. If P, Rr and Rc are unknown:
     > Work with a batch of samples $\mathcal{D} = \{(\bar{s}_i, \bar{a}_i, r_i, \bar{s}'_i)\}_{i \in [0, N]}$
SLIDE 49

Extension to the RL setting

We address several limitations of Budgeted Value-Iteration:

  1. If P, Rr and Rc are unknown:
     > Work with a batch of samples $\mathcal{D} = \{(\bar{s}_i, \bar{a}_i, r_i, \bar{s}'_i)\}_{i \in [0, N]}$
     > Replace $\mathcal{T}$ with a sampling operator $\hat{\mathcal{T}}$:

$\hat{\mathcal{T}} Q(\bar{s}_i, \bar{a}_i, r_i, \bar{s}'_i) \stackrel{\text{def}}{=} r_i + \gamma \sum_{\bar{a}'_i \in \bar{A}_i} \pi_{\text{greedy}}(\bar{a}'_i \mid \bar{s}'_i; Q) \, Q(\bar{s}'_i, \bar{a}'_i)$
SLIDE 50

Extension to the RL setting

We address several limitations of Budgeted Value-Iteration:

  1. If P, Rr and Rc are unknown:
     > Work with a batch of samples $\mathcal{D} = \{(\bar{s}_i, \bar{a}_i, r_i, \bar{s}'_i)\}_{i \in [0, N]}$
     > Replace $\mathcal{T}$ with a sampling operator $\hat{\mathcal{T}}$:

$\hat{\mathcal{T}} Q(\bar{s}_i, \bar{a}_i, r_i, \bar{s}'_i) \stackrel{\text{def}}{=} r_i + \gamma \sum_{\bar{a}'_i \in \bar{A}_i} \pi_{\text{greedy}}(\bar{a}'_i \mid \bar{s}'_i; Q) \, Q(\bar{s}'_i, \bar{a}'_i)$

  2. If S is continuous:
SLIDE 51

Extension to the RL setting

We address several limitations of Budgeted Value-Iteration:

  1. If P, Rr and Rc are unknown:
     > Work with a batch of samples $\mathcal{D} = \{(\bar{s}_i, \bar{a}_i, r_i, \bar{s}'_i)\}_{i \in [0, N]}$
     > Replace $\mathcal{T}$ with a sampling operator $\hat{\mathcal{T}}$:

$\hat{\mathcal{T}} Q(\bar{s}_i, \bar{a}_i, r_i, \bar{s}'_i) \stackrel{\text{def}}{=} r_i + \gamma \sum_{\bar{a}'_i \in \bar{A}_i} \pi_{\text{greedy}}(\bar{a}'_i \mid \bar{s}'_i; Q) \, Q(\bar{s}'_i, \bar{a}'_i)$

  2. If S is continuous:
     > Employ function approximation $Q_\theta$, and minimise a regression loss:

$\mathcal{L}(Q_\theta, Q_{\text{target}}; \mathcal{D}) = \sum_{\mathcal{D}} \left\| Q_\theta(\bar{s}, \bar{a}) - Q_{\text{target}}(\bar{s}, \bar{a}, r, \bar{s}') \right\|_2^2$
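Putting the pieces together, one fitted-Q iteration as a hedged PyTorch sketch. The interface is assumed, not the paper's: `q_net(s, beta)` returns per-action values of shape [batch, 2, actions] (a matching network sketch appears on the implementation slide below), `greedy_mixture` is the hull routine from earlier, and for brevity the target only scans next actions that keep the current budget, whereas the full method also enumerates candidate next budgets β_a′:

```python
import torch
import torch.nn.functional as F

def bftq_iteration(q_net, optimizer, batch, gamma=0.9):
    """One application of the sampling operator T-hat plus one regression step."""
    s, a, r, c, s_next, beta_a = batch                 # tensors built from D; a is long
    with torch.no_grad():
        q_next = q_net(s_next, beta_a)                 # [batch, 2, actions]
        targets = []
        for i in range(len(s)):
            q_r, q_c = q_next[i, 0], q_next[i, 1]
            # pi_greedy(.|s'; Q) as the two-point hull mixture sketched earlier
            mix = greedy_mixture(q_r.numpy(), q_c.numpy(), beta_a[i].item())
            next_q = sum(p * q_next[i, :, j] for j, p in mix)
            targets.append(torch.stack([r[i], c[i]]) + gamma * next_q)
        target = torch.stack(targets)                  # T-hat Q on the batch
    pred = q_net(s, beta_a)[torch.arange(len(s)), :, a]  # (Q_r, Q_c) of taken actions
    loss = F.mse_loss(pred, target)                    # ||Q_theta - Q_target||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```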
SLIDE 52

Scalable implementation

  • CPU parallel computing of the targets $\sum_{\bar{a}'_i \in \bar{A}_i} \pi_{\text{greedy}}(\bar{a}'_i \mid \bar{s}'_i; Q) \, Q(\bar{s}'_i, \bar{a}'_i)$
SLIDE 53

Scalable implementation

  • CPU parallel computing of the targets $\sum_{\bar{a}'_i \in \bar{A}_i} \pi_{\text{greedy}}(\bar{a}'_i \mid \bar{s}'_i; Q) \, Q(\bar{s}'_i, \bar{a}'_i)$
  • Same for interactions with the environment.
SLIDE 54

Scalable implementation

  • CPU parallel computing of the targets $\sum_{\bar{a}'_i \in \bar{A}_i} \pi_{\text{greedy}}(\bar{a}'_i \mid \bar{s}'_i; Q) \, Q(\bar{s}'_i, \bar{a}'_i)$
  • Same for interactions with the environment.
  • Neural Network for function approximation:

[Diagram: the augmented input (s, βa) feeds an Encoder, then Hidden Layer 1 and Hidden Layer 2, and the output layer Q produces Qr(a0), Qr(a1), Qc(a0), Qc(a1).]
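A PyTorch sketch of this diagram with illustrative sizes (the slide does not specify them); the output reshapes into Q_r and Q_c for every action, matching the [batch, 2, actions] interface assumed earlier:

```python
import torch
import torch.nn as nn

class BudgetedQNet(nn.Module):
    """(s, beta_a) -> Encoder -> Hidden Layers -> Q_r and Q_c for each action."""

    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.n_actions = n_actions
        self.body = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),   # encoder of (s, beta_a)
            nn.Linear(hidden, hidden), nn.ReLU(),          # hidden layer 1
            nn.Linear(hidden, hidden), nn.ReLU(),          # hidden layer 2
        )
        self.head = nn.Linear(hidden, 2 * n_actions)       # Qr(a), Qc(a) for all a

    def forward(self, s, beta):
        x = torch.cat([s, beta.unsqueeze(-1)], dim=-1)     # augmented input (s, beta_a)
        return self.head(self.body(x)).view(-1, 2, self.n_actions)
```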
SLIDE 55

04. Experiments
SLIDE 56

A baseline approximate solution

Lagrangian Relaxation. Consider the dual problem, so as to replace the hard constraint by a soft constraint penalised by a Lagrangian multiplier λ:

$\max_\pi \; \mathbb{E}\left[\sum_t \gamma^t R_r(s_t, a_t) - \lambda \gamma^t R_c(s_t, a_t)\right]$

  • Train many policies πk with penalties λk and recover the corresponding cost budgets βk
  • Very data/memory-heavy
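The baseline reduces to reward shaping: fold the cost into the reward with a fixed multiplier λ and run a standard RL agent once per λ_k. A sketch, where `train_rl_agent` and `estimate_cost` are hypothetical helpers:

```python
def lagrangian_reward(r, c, lam):
    """Soft-constrained reward R_r - lambda * R_c."""
    return r - lam * c

# One full training run per multiplier, then recover each policy's budget:
# for lam in (0.0, 0.1, 0.5, 1.0, 5.0):
#     pi_k = train_rl_agent(env, reward_fn=lambda r, c, lam=lam: lagrangian_reward(r, c, lam))
#     beta_k = estimate_cost(env, pi_k)   # expected discounted cost of pi_k
```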
SLIDE 57

Dialogue systems

A slot-filling problem: the agent (the dialogue system) fills in a form by asking the user for each slot. It can either:

  • ask the user to answer by voice (safe/slow);
  • ask the user to answer on a numeric pad (unsafe/fast).

[Figure: reward return r and cost return c measured on the dialogue task.]
SLIDE 58

Autonomous driving

The agent (the car) is driving on a two-way road with a car in front of it. It can either:

  • stay behind (safe/slow);
  • overtake (unsafe/fast).

[Figure: reward return r and cost return c measured on the driving task.]
SLIDE 59

Risk-sensitive exploration

How to collect the batch D? We propose an ε-greedy exploration procedure:

SLIDE 60

Risk-sensitive exploration

How to collect the batch D? We propose an ε-greedy exploration procedure:

  • Sample an initial budget β0 ∼ U(B)

SLIDE 61

Risk-sensitive exploration

How to collect the batch D? We propose an ε-greedy exploration procedure:

  • Sample an initial budget β0 ∼ U(B)
  • At each step, in the augmented state $\bar{s} = (s, \beta)$, only explore feasible budgets: $\bar{a} = (a, \beta_a) \sim \mathcal{U}(\Delta)$, where $\Delta$ is such that the distribution $P(a, \beta_a \mid s, \beta)$ verifies $\mathbb{E}[\beta_a] \leq \beta$
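A sketch of the budget side of this procedure, under the assumption B = [0, 1]: drawing β_a uniformly on [0, β] is one simple way to satisfy E[β_a] ≤ β:

```python
import numpy as np

rng = np.random.default_rng(0)

def initial_budget():
    return rng.uniform(0.0, 1.0)          # beta_0 ~ U(B), with B = [0, 1] assumed

def explore(n_actions, beta):
    """epsilon-branch: a uniform feasible augmented action (a, beta_a)."""
    a = int(rng.integers(n_actions))      # a ~ U(A)
    beta_a = rng.uniform(0.0, beta)       # beta_a in [0, beta] => E[beta_a] <= beta
    return a, beta_a
```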
SLIDE 62

Corridors

Two corridors:

  • 1. one with high costs / high rewards
  • 2. the other with no costs / low rewards

→ Validate the risk-sensitive exploration procedure

SLIDE 63

Corridors

[Figure: reward return r and cost return c of BFTQ(risk-sensitive) vs BFTQ(risk-neutral) on the corridors environment.]
SLIDE 64

Thank You!
