SLIDE 1

Reinforcement Learning for Safe Decision-Making in Autonomous Driving

Edouard Leurent (1,2,3), Odalric-Ambrym Maillard (1), Denis Efimov (2)

(1) Inria SequeL, (2) Inria Valse, (3) Renault Group

SLIDE 2

01 Motivation and Scope

SLIDE 5

Once upon a time

Classic Autonomous Driving Pipeline

(Bold?) Claim: if we remove the humans from the road, the problem becomes easy. Even with obstacles, partial observability, disturbances, etc., the problems of Route Planning, Motion Planning and Local Feedback Control are basically solved.

SLIDE 6

Scope of this thesis

We focus instead on the (arguably) harder challenge: Behavioural Planning.

What we have

  • In practice, often a hand-crafted rule-based system (FSM).
  • Won’t scale to complex scenes

What we want

  • Handle human agents with uncertain behaviours
  • Handle the interactions between agents

We turn to learning-based approaches

SLIDE 12

Reinforcement Learning — the framework

Markov Decision Processes

  • 1. Observe state s ∈ S;
  • 2. Pick action a ∈ A according to our policy π(a|s);
  • 3. Transition to a next state s′ ∼ P(s′ | s, a);
  • 4. Receive a reward r.

Objective: maximise V = E[∑_{t=0}^{∞} γ^t r_t]

  • States: Ground truth for vehicles, roads, signals, etc. (continuous)
  • Actions: Semantic decisions: change lane, yield, pass, etc. (discrete)
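To make the loop concrete, here is a minimal sketch of the interaction: observe a state, pick an action, transition, receive a reward, and accumulate the discounted return to maximise. It assumes the gymnasium-registered `highway-v0` task from the author's highway-env package; any gym-style environment would do.

```python
import gymnasium as gym
import highway_env  # noqa: F401  (assumed installed; registers the highway environments)

# One episode of the MDP loop with a random placeholder policy pi(a|s).
env = gym.make("highway-v0")
gamma = 0.9

obs, info = env.reset(seed=0)
ret, t, done = 0.0, 0, False
while not done:
    action = env.action_space.sample()                  # stand-in for pi(a|s)
    obs, reward, terminated, truncated, info = env.step(action)
    ret += gamma ** t * reward                          # discounted return sum_t gamma^t r_t
    t += 1
    done = terminated or truncated
print(f"discounted return of the episode: {ret:.2f}")
```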

SLIDE 14

Reinforcement Learning — how?

Model-free

  • 1. Directly optimise π(a|s) through policy evaluation and policy improvement.

Model-based

  • 1. Learn a model of the dynamics T̂(s_{t+1} | s_t, a_t);
  • 2. (Planning) Leverage it to compute
    max_π E[∑_{t=0}^{∞} γ^t r(s_t, a_t) | a_t ∼ π(s_t), s_{t+1} ∼ T̂(s_t, a_t)]

  • + Better sample efficiency, interpretability, priors.

SLIDE 15

Outline

  • 02 Efficient Model-Free
  • 03 Safe Model-Free
  • 04 Efficient Model-Based
  • 05 Safe Model-Based

SLIDE 20

02 Efficient Model-Free

SLIDE 21

Q-learning

Definition (Optimal State-Action Value Function Q∗)
Q∗(s, a) = max_π E_π[∑_{t=0}^{∞} γ^t R(s_t, a_t) | s_0 = s, a_0 = a]

How to learn Q∗?

Proposition (Bellman Optimality Equation)
Q∗(s, a) = R(s, a) + γ E_{s′}[max_{a′} Q∗(s′, a′)]

Represent Q∗ with function approximation (e.g. a neural network in DQN) and apply fixed-point iteration over samples (s, a, s′) until convergence.
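As a toy illustration of this fixed-point iteration, here is a tabular sketch (with a made-up reward matrix R and transition tensor P, rather than the neural network of DQN):

```python
import numpy as np

# Tabular sketch of the Bellman optimality backup Q <- R + gamma * E_s'[max_a' Q(s', a')].
# R has shape (S, A), P has shape (S, A, S); both are made up for illustration.
def q_value_iteration(R, P, gamma=0.9, tol=1e-8):
    Q = np.zeros_like(R)
    while True:
        V = Q.max(axis=1)                  # max_a' Q(s', a')
        Q_new = R + gamma * (P @ V)        # expectation over next states
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

# Tiny 2-state, 2-action example.
R = np.array([[0.0, 1.0], [0.5, 0.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [1.0, 0.0]]])
print(q_value_iteration(R, P).round(2))
```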

SLIDE 22

How to represent the state?

The list of features representation

A joint state s of the N + 1 observed vehicles: s = (s_i)_{i∈[0,N]}, with
s_i = [x_i, y_i, v^x_i, v^y_i, cos ψ_i, sin ψ_i]^T

[Figure: bird's-eye view of the scene with the observed vehicles' coordinates (x_1, y_1), (x_2, y_2), (x_3, y_3).]

SLIDE 23

Limitations

Issues related to function approximation

  • 1. Variable size: usual models accept fixed-size inputs.
  • 2. Sensitivity to the ordering: we want the policy to be permutation-invariant:
    ∀τ ∈ S_N, π(· | (s_0, s_1, …, s_N)) = π(· | (s_0, s_τ(1), …, s_τ(N)))

SLIDE 24

A common solution

Occupancy grid representation

[Figure: the same scene rasterised into a fixed-size occupancy grid.]

  • Fixed size
  • Does not depend on an ordering
  • ✗ Suffers from an accuracy / size trade-off
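For comparison with the list of features, a small sketch (hypothetical cell size and grid shape, not the thesis feature pipeline) of rasterising vehicles into such a grid; the resolution parameter is exactly where the accuracy/size trade-off lives.

```python
import numpy as np

# Illustrative sketch: rasterise vehicle positions (x, y) into a fixed-size,
# ego-centred occupancy grid. Coarser cells shrink the grid but lose precision.
def occupancy_grid(vehicles_xy, ego_xy, size=11, resolution=2.0):
    grid = np.zeros((size, size))
    for x, y in vehicles_xy:
        i = int((x - ego_xy[0]) / resolution) + size // 2
        j = int((y - ego_xy[1]) / resolution) + size // 2
        if 0 <= i < size and 0 <= j < size:
            grid[i, j] = 1.0
    return grid

print(occupancy_grid([(4.0, 1.0), (9.0, -3.0)], ego_xy=(0.0, 0.0)).sum())
```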

SLIDE 25

Proposed architecture

Model architecture: each vehicle (ego, vehicle_1, …, vehicle_N) goes through a shared Encoder, several Ego-attention heads combine the encodings, and a Decoder outputs Q.

Ego-attention block: the ego encoding produces a single query q_0 through a linear layer L_q; every encoding i produces a key k_i and a value v_i through linear layers L_k, L_v. With Q = [q_0], K = [k_0 … k_n] and V = [v_0 … v_n], the output is σ(Q K^T / √d_k) V, i.e. the attention matrix applied to the values.

  • Inputs can have a variable size
  • Based on a dot product: permutation-invariant
  • Compact size with no accuracy loss

SLIDE 26

Experiments

The highway-env environment

Agents compared:
  • FCN/List: input size [15, 7]; layer sizes [128, 128]; 3.0e4 parameters; variable input size: no; permutation invariant: no.
  • CNN/Grid: input size [32, 32, 7]; 3 convolutional layers (kernel size 2, stride 2), head [20]; 3.2e4 parameters; variable input size: no; permutation invariant: yes.
  • Ego-Attention: input size [·, 7]; encoder [64, 64], attention with 2 heads and d_k = 32, decoder [64, 64]; 3.4e4 parameters; variable input size: yes; permutation invariant: yes.

SLIDE 27

Performances

[Figure: total reward, velocity and episode length over 4000 training episodes, for the FCN/List, CNN/Grid and Ego-Attention agents.]

SLIDE 28

Attention Visualization

Head specialisation Distance

SLIDE 29

Attention Visualization

Sensitivity to uncertainty A full episode

SLIDE 30

03 Safe Model-Free

SLIDE 35

Limitation of Reinforcement Learning

Reinforcement learning relies on a single reward function R. A convenient formulation, but ✗ R is not always easy to design.

Conflicting Objectives: complex tasks involve multiple contradictory aspects. Typically:

Task completion vs Safety

For example...

SLIDE 36

Example problems with conflicts

Two-Way Road
The agent is driving on a two-way road with a car in front of it:

  • it can stay behind (safe/slow);
  • it can overtake (unsafe/fast).

SLIDE 37

Limitation of Reinforcement Learning

For a fixed reward function R, π∗ is only guaranteed to lie on a Pareto front Π∗: no control over the Task Completion vs Safety trade-off.

SLIDE 38

The Pareto front

[Figure: objective space with Task Completion on one axis and Safety on the other; the Pareto-optimal curve Π∗, and the solutions obtained by maximising weighted combinations of the two returns.]

SLIDE 39

From maximal safety to minimal risk

[Figure: same objective space, with the Safety axis replaced by a Risk axis to be minimised; the Pareto-optimal curve Π∗.]

SLIDE 40

The optimal policy can move freely along Π∗

[Figure: for a fixed weighting of the objectives, the optimal policy π∗ can land anywhere on the Pareto-optimal curve Π∗.]

SLIDE 41

How to choose a desired trade-off

[Figure: choose the trade-off by maximising Task Completion subject to the Risk staying below a budget β, which selects a point of Π∗.]

SLIDE 42

Constrained Reinforcement Learning

Markov Decision Process: an MDP is a tuple (S, A, P, R_r, γ) with:

  • Rewards R_r ∈ ℝ^{S×A}

Objective: maximise rewards
max_{π∈M(A)^S} E[∑_{t=0}^{∞} γ^t R_r(s_t, a_t) | s_0 = s]

SLIDE 43

Constrained Reinforcement Learning

Constrained Markov Decision Process: a CMDP is a tuple (S, A, P, R_r, R_c, γ, β) with:

  • Rewards R_r ∈ ℝ^{S×A}
  • Costs R_c ∈ ℝ^{S×A}
  • Budget β

Objective: maximise rewards while keeping costs under a fixed budget
max_{π∈M(A)^S} E[∑_{t=0}^{∞} γ^t R_r(s_t, a_t) | s_0 = s]
s.t. E[∑_{t=0}^{∞} γ^t R_c(s_t, a_t) | s_0 = s] ≤ β

SLIDE 44

We want to learn Π∗ rather than π∗

[Figure: the constrained objective selects the point of Π∗ whose risk equals the budget β; we want to learn the whole curve Π∗ rather than a single π∗.]

SLIDE 46

Budgeted Reinforcement Learning

Budgeted Markov Decision Process: a BMDP is a tuple (S, A, P, R_r, R_c, γ, B) with:

  • Rewards R_r ∈ ℝ^{S×A}
  • Costs R_c ∈ ℝ^{S×A}
  • Budget space B

Objective: maximise rewards while keeping costs under an adjustable budget.
∀β ∈ B, max_{π∈M(A×B)^{S×B}} E[∑_{t=0}^{∞} γ^t R_r(s_t, a_t) | s_0 = s, β_0 = β]
s.t. E[∑_{t=0}^{∞} γ^t R_c(s_t, a_t) | s_0 = s, β_0 = β] ≤ β

SLIDE 47

Problem formulation

Budgeted policies π

  • Take a budget β as an additional input
  • Output a next budget β′
  • π : (s, β) → (a, β′)

Augment the spaces with the budget β

SLIDE 48

Augmented Setting

Definition (Augmented spaces)

  • States S̄ = S × B.
  • Actions Ā = A × B.
  • Dynamics P̄: from state (s, β) and action (a, β_a), the next state is (s′, β′) with s′ ∼ P(s′ | s, a) and β′ = β_a.

Definition (Augmented signals)

  • 1. Rewards R̄ = (R_r, R_c)
  • 2. Returns G^π = (G^π_r, G^π_c) := ∑_{t=0}^{∞} γ^t R̄(s̄_t, ā_t)
  • 3. Value V^π(s̄) = (V^π_r, V^π_c) := E[G^π | s̄_0 = s̄]
  • 4. Q-Value Q^π(s̄, ā) = (Q^π_r, Q^π_c) := E[G^π | s̄_0 = s̄, ā_0 = ā]

SLIDE 49

Budgeted Optimality

Definition (Budgeted Optimality) In that order, we want to:

(i) Respect the budget β: Π_a(s̄) := {π ∈ Π : V^π_c(s, β) ≤ β}

(ii) Maximise the rewards: V^∗_r(s̄) := max_{π∈Π_a(s̄)} V^π_r(s̄),  Π_r(s̄) := argmax_{π∈Π_a(s̄)} V^π_r(s̄)

(iii) Minimise the costs: V^∗_c(s̄) := min_{π∈Π_r(s̄)} V^π_c(s̄),  Π^∗(s̄) := argmin_{π∈Π_r(s̄)} V^π_c(s̄)

We define the budgeted action-value function Q∗ similarly.

SLIDE 50

Budgeted Optimality

Theorem (Budgeted Bellman Optimality Equation) Q∗ verifies:
Q∗(s̄, ā) = T Q∗(s̄, ā) := R̄(s̄, ā) + γ ∑_{s̄′∈S̄} P̄(s̄′ | s̄, ā) ∑_{ā′∈Ā} π_greedy(ā′ | s̄′; Q∗) Q∗(s̄′, ā′)

where the greedy policy π_greedy is defined by:
π_greedy(· | s̄; Q) ∈ argmin_{ρ∈Π^Q_r} E_{ā∼ρ} Q_c(s̄, ā),
with Π^Q_r := argmax_{ρ∈M(Ā)} E_{ā∼ρ} Q_r(s̄, ā)  s.t.  E_{ā∼ρ} Q_c(s̄, ā) ≤ β.

SLIDE 51

The optimal policy

Proposition (Optimality of the policy)
π_greedy(·; Q∗) is simultaneously optimal in all states s̄ ∈ S̄: π_greedy(·; Q∗) ∈ Π∗(s̄). In particular, V^{π_greedy(·;Q∗)} = V∗ and Q^{π_greedy(·;Q∗)} = Q∗.

Proposition (Solving the non-linear program)
π_greedy can be computed efficiently, as a mixture π_hull of two points that lie on the convex hull of Q: π_greedy = π_hull.
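A simplified sketch of what this greedy computation amounts to for a single state, using brute force over pairs of actions instead of the convex-hull construction (the values and budget below are made up):

```python
import numpy as np

# Sketch: among all mixtures of at most two actions, pick the one with maximal expected
# reward subject to expected cost <= beta, mirroring the pi_hull mixture of two points.
def greedy_budgeted_policy(Qr, Qc, beta):
    best = (-np.inf, None)                         # (expected reward, distribution)
    A = len(Qr)
    for i in range(A):                             # deterministic candidates
        if Qc[i] <= beta and Qr[i] > best[0]:
            dist = np.zeros(A); dist[i] = 1.0
            best = (Qr[i], dist)
    for i in range(A):                             # two-action mixtures hitting cost = beta
        for j in range(A):
            if Qc[i] <= beta <= Qc[j] and Qc[j] > Qc[i]:
                p = (beta - Qc[i]) / (Qc[j] - Qc[i])   # probability of action j
                r = (1 - p) * Qr[i] + p * Qr[j]
                if r > best[0]:
                    dist = np.zeros(A); dist[i], dist[j] = 1 - p, p
                    best = (r, dist)
    return best[1]                                 # None if no action respects the budget

Qr = np.array([0.2, 0.8, 1.0])
Qc = np.array([0.0, 0.5, 1.0])
print(greedy_budgeted_policy(Qr, Qc, beta=0.25))   # mixes the two cheapest actions
```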

SLIDE 54

Convergence analysis

Recall what we have shown so far: Q∗ is the fixed point of T; the mixture π_hull(Q∗) is tractable; π_hull(Q∗) equals π_greedy(Q∗); and π_greedy(Q∗) is optimal.

We're almost there! All that is left is to perform Fixed-Point Iteration to compute Q∗.

Theorem (Non-Contractivity) For any BMDP (S, A, P, R_r, R_c, γ) with |A| ≥ 2, T is not a contraction:
∀ε > 0, ∃ Q1, Q2 ∈ (ℝ²)^{S̄Ā} : ‖T Q1 − T Q2‖_∞ ≥ (1/ε) ‖Q1 − Q2‖_∞

✗ We cannot guarantee the convergence of T^n(Q0) to Q∗.

SLIDE 55

Convergence analysis

Thankfully,

Theorem (Contractivity on smooth Q-functions) T is a contraction when restricted to the subset L_γ of Q-functions such that "Q_r is L-Lipschitz with respect to Q_c", with L < 1/γ − 1:

L_γ = { Q ∈ (ℝ²)^{S̄Ā} : ∃ L < 1/γ − 1 such that ∀s̄ ∈ S̄, ā1, ā2 ∈ Ā, |Q_r(s̄, ā1) − Q_r(s̄, ā2)| ≤ L |Q_c(s̄, ā1) − Q_c(s̄, ā2)| }

We guarantee convergence under some (strong) assumptions, and we observe empirical convergence.

SLIDE 56

Experiments

Lagrangian Relaxation Baseline: consider the dual problem, so as to replace the hard constraint by a soft constraint penalised by a Lagrangian multiplier λ:
max_π E[∑_t γ^t R_r(s_t, a_t) − λ γ^t R_c(s_t, a_t)]

  • Train many policies π_k with penalties λ_k and recover the corresponding cost budgets β_k
  • Very data/memory-heavy
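A minimal sketch of this baseline's reward shaping as an environment wrapper; the info key "cost" is an assumption about how the environment reports R_c, not part of any specific API.

```python
import gymnasium as gym

# Sketch: feed any standard RL agent the penalised reward R_r - lambda * R_c.
# One agent is trained per penalty value lambda_k.
class LagrangianPenaltyWrapper(gym.Wrapper):
    def __init__(self, env, lam):
        super().__init__(env)
        self.lam = lam

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        penalised = reward - self.lam * info.get("cost", 0.0)   # R_r - lambda * R_c
        return obs, penalised, terminated, truncated, info
```

Sweeping λ_k, training one agent per value and measuring the resulting expected cost recovers one budget β_k per run, which is what makes the baseline so data- and memory-heavy compared to learning the whole budgeted policy at once.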

SLIDE 57

Experiments

[Figure: reward and cost returns of the learned budgeted policies.]
SLIDE 58

04 Efficient Model-Based

SLIDE 59

Principle

Model estimation: learn a model of the dynamics T̂(s_{t+1} | s_t, a_t). For instance:

  • 1. Least-squares estimate: min_{T̂} ∑_t ‖s_{t+1} − T̂(s_t, a_t)‖²₂
  • 2. Maximum likelihood estimate: max_{T̂} ∑_t log T̂(s_{t+1} | s_t, a_t)

Planning: leverage T̂ to compute
max_π E[∑_{t=0}^{∞} γ^t r(s_t, a_t) | a_t ∼ π(s_t), s_{t+1} ∼ T̂(s_t, a_t)]

How?

SLIDE 65

Online Planning

We can use T̂ as a generative model:

[Diagram: the Agent sends the current state to the Planner; the Planner uses T̂ to simulate trajectories and returns an action recommendation; the Agent acts in the Environment and receives the next state and reward.]

SLIDE 66

Planning performance

Online Planning

  • Fixed budget: the model can only be queried n times

Objective: minimise, in expectation, the simple regret r_n = V∗ − V(a_n), where a_n is the action recommended after the n queries.

An exploration-exploitation problem.

SLIDE 70

Optimistic Planning

Optimism in the Face of Uncertainty Given a set of options a ∈ A with uncertain outcomes, try the one with the highest possible outcome.

  • Either you performed well;
  • or you learned something.

Instances

  • Monte-Carlo Tree Search (MCTS) (Coulom, 2006): CrazyStone
  • Reframed in the bandit setting as UCT (Kocsis and Szepesvári, 2006), still very popular (e.g. AlphaGo).
  • Proved asymptotically consistent, but no regret bound.

SLIDE 71

Analysis of UCT

It was analysed by Coquelin and Munos (2007): the sample complexity of UCT is lower-bounded by a quantity of order exp(exp(D)).

SLIDE 72

Failing cases of UCT

Not just a theoretical counter-example.

SLIDE 74

Can we get better guarantees?

OPD: Optimistic Planning for Deterministic systems

  • Introduced by (Hren and Munos, 2008)
  • Another optimistic algorithm
  • Only for deterministic MDPs

Theorem (OPD sample complexity) E[r_n] = O(n^{−(log 1/γ)/(log κ)}), if κ > 1

OLOP: Open-Loop Optimistic Planning

  • Introduced by (Bubeck and Munos, 2010)
  • Extends OPD to the stochastic setting
  • Only considers open-loop policies, i.e. sequences of actions

SLIDE 77

The idea behind OLOP

A direct application of Optimism in the Face of Uncertainty

  • 1. We want max_a V(a)
  • 2. Form upper confidence bounds on sequence values: V(a) ≤ U_a w.h.p.
  • 3. Sample the sequence with the highest UCB: argmax_a U_a

SLIDE 81

Under the hood

Upper-bounding the value of sequences:
V(a) = ∑_{t=1}^{h} γ^t μ_{a_{1:t}} (follow the sequence: each μ_{a_{1:t}} ≤ U^μ_{a_{1:t}}) + ∑_{t≥h+1} γ^t μ_{a∗_{1:t}} (then act optimally: each term ≤ 1)

SLIDE 84

Under the hood

OLOP main tool: the Chernoff-Hoeffding deviation inequality.
U^μ_a(m) := μ̂_a(m) (empirical mean) + √(2 log M / T_a(m)) (confidence interval)

As in OPD, upper-bound all the future rewards by 1:
U_a(m) := ∑_{t=1}^{h} γ^t U^μ_{a_{1:t}}(m) (past rewards) + γ^{h+1}/(1 − γ) (future rewards)

Bounds sharpening: B_a(m) := inf_{1≤t≤L} U_{a_{1:t}}(m)
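A small numerical sketch (with made-up prefix statistics) of how the per-prefix Hoeffding bounds combine into the sequence bound and its sharpened version:

```python
import numpy as np

# Illustrative sketch for one action sequence a_{1:h}: mu_hat[t] and counts[t] are the
# empirical mean reward and visit count of prefix a_{1:t}.
def sequence_upper_bound(mu_hat, counts, M, gamma):
    h = len(mu_hat)
    u_mu = mu_hat + np.sqrt(2.0 * np.log(M) / counts)       # Chernoff-Hoeffding UCB per prefix
    discounts = gamma ** np.arange(1, h + 1)
    past = np.cumsum(discounts * u_mu)                       # sum_{t<=l} gamma^t * U_mu
    tail = gamma ** np.arange(2, h + 2) / (1.0 - gamma)      # future rewards bounded by 1
    U = past + tail                                          # U_{a_{1:l}}(m) for l = 1..h
    return U.min()                                           # bounds sharpening B_a(m)

print(sequence_upper_bound(np.array([0.6, 0.4]), np.array([10, 4]), M=100, gamma=0.9))
```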

SLIDE 85

OLOP guarantees

Theorem (OLOP sample complexity) OLOP satisfies:
E[r_n] = O(n^{−(log 1/γ)/(log κ′)}) if γ√κ′ > 1,   E[r_n] = O(n^{−1/2}) if γ√κ′ ≤ 1.

"Remarkably, in the case κγ² > 1, we obtain the same rate for the simple regret as Hren and Munos (2008). Thus, in this case, we can say that planning in stochastic environments is not harder than planning in deterministic environments."

SLIDE 86

Does it work?

Our objective: understand and bridge this gap. Make OLOP practical.

SLIDE 89

What’s wrong with OLOP?

Explanation: inconsistency

  • Unintended behaviour happens when U^μ_a(m) > 1 for all a:
    U^μ_a(m) = μ̂_a(m) (∈ [0, 1]) + √(2 log M / T_a(m)) (> 0)
  • Then the sequence (U_{a_{1:t}}(m))_t is increasing:
    U_{a_{1:1}}(m) = γ U^μ_{a_1}(m) + γ²·1 + γ³·1 + …
    U_{a_{1:2}}(m) = γ U^μ_{a_1}(m) + γ² U^μ_{a_2}(m) (> 1) + γ³·1 + …
  • Then B_a(m) = U_{a_{1:1}}(m)

SLIDE 90

What’s wrong with OLOP?

What we were promised

SLIDE 91

What’s wrong with OLOP?

What we actually get: OLOP behaves as uniform planning!

SLIDE 93

Our contribution: Kullback-Leibler OLOP

We summon the upper confidence bound from kl-UCB (Cappé et al., 2013):
U^μ_a(m) := max {q ∈ I : T_a(m) d(μ̂_a(m), q) ≤ f(m)}

Algorithm: OLOP / KL-OLOP
  • Interval I: ℝ / [0, 1]
  • Divergence d: d_QUAD / d_BER
  • f(m): 4 log M / 2 log M + 2 log log M

with d_QUAD(p, q) := 2(p − q)² and d_BER(p, q) := p log(p/q) + (1 − p) log((1 − p)/(1 − q)).
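A sketch comparing the two bounds for one prefix: the Hoeffding-style bound derived from d_QUAD and the kl-UCB bound computed by bisection on the Bernoulli divergence. With an empirical mean close to 1, only the latter stays in [0, 1] (all numbers are made up):

```python
import numpy as np

def d_ber(p, q, eps=1e-12):
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def hoeffding_ucb(mu_hat, T_a, f_m):
    return mu_hat + np.sqrt(f_m / (2 * T_a))       # from T_a * 2(mu_hat - q)^2 <= f_m

def kl_ucb(mu_hat, T_a, f_m, iters=50):
    lo, hi = mu_hat, 1.0
    for _ in range(iters):                          # largest q in [mu_hat, 1] satisfying the constraint
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if T_a * d_ber(mu_hat, mid) <= f_m else (lo, mid)
    return lo

M, T_a, mu_hat = 100, 10, 0.9
f_m = 2 * np.log(M) + 2 * np.log(np.log(M))         # KL-OLOP exploration function
print(hoeffding_ucb(mu_hat, T_a, f_m), kl_ucb(mu_hat, T_a, f_m))   # only kl-UCB stays <= 1
```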

SLIDE 97

Our contribution: Kullback-Leibler OLOP

[Figure: the divergence q ↦ T_a(m) d_BER(μ̂_a, q) against the threshold f(m), defining the bounds L^μ_a ≤ μ̂_a ≤ U^μ_a, all within [0, 1].]

And now, U^μ_a(m) ∈ I = [0, 1] for all a:

  • The sequence (U_{a_{1:t}}(m))_t is non-increasing
  • B_a(m) = U_a(m): the bound sharpening step is superfluous.

SLIDE 98

Sample complexity

Theorem (Sample complexity) KL-OLOP enjoys the same regret bounds as OLOP. More precisely, KL-OLOP satisfies:
E[r_n] = O(n^{−(log 1/γ)/(log κ′)}) if γ√κ′ > 1,   E[r_n] = O(n^{−1/2}) if γ√κ′ ≤ 1.

SLIDE 99

Experiments — Expanded Trees

SLIDE 102

Experiments — Performances

[Figure: return vs planning budget (10¹ to 10⁴ model queries) on the Highway environment, for the OPD, KL-OLOP, KL-OLOP(1), OLOP and Random agents.]

SLIDE 103

Experiments — Performances

[Figure: return vs planning budget on the Stochastic Gridworld environment, for the same agents.]

SLIDE 104

05 Safe Model-Based

SLIDE 105

The issue of model bias

Model-based RL learns the dynamics T̂ and optimises
max_π E[∑_{t=0}^{∞} γ^t r(s_t, a_t) | a_t ∼ π(s_t), s_{t+1} ∼ T̂(s_t, a_t)]

Definition (Model Bias): the learned model differs from the true dynamics, T̂ ≠ T.

(Video example)

SLIDE 106

How to deal with model bias?

  • 1. Build a confidence region C_δ around the true dynamics T: P(T ∈ C_δ) > 1 − δ
  • 2. Plan robustly with respect to this ambiguity:
    max_π min_{T∈C_δ} ∑_{t=0}^{∞} γ^t r_t, where the inner minimum defines the robust objective v_r(π).

SLIDE 107

Model Estimation

In order to build C_δ, we rely on a structure assumption.

Assumption (Structure): ẋ(t) = A(θ)x(t) + Bu(t) + d(t), with A(θ) = ∑_{i=1}^{d} θ_i Φ_i

Having observed a history of (ẋ(t), x(t)), we obtain a linear regression problem:
min_θ ∑_t ‖ẋ(t) − A(θ)x(t) − Bu(t)‖²₂

SLIDE 108

Confidence Ellipsoid

Proposition (Confidence ellipsoid (Abbasi-Yadkori, Pál, and Szepesvári, 2011))
Under some assumptions on the disturbance d(t), it holds with probability 1 − δ that:
‖θ − θ_{N_p,λ}‖_{G_{N_p,λ}} ≤ β_{N_p}(δ)
where θ_{N_p,λ} = G_{N_p,λ}^{−1} Φ_{[N_p]}^T Y_{[N_p]} and G_{N_p,λ} = Φ_{[N_p]}^T Φ_{[N_p]} + λ I_d.

[Figure: the confidence ellipsoid C_δ around the estimate θ_{N_p,λ}, which contains the true θ with probability 1 − δ.]
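A small sketch of this estimation step on synthetic data (the feature matrices Φ_i and all numbers below are made up): stack the per-observation features Φ_i x(t), solve the regularised least squares, and form G_{N,λ}.

```python
import numpy as np

# Sketch: regularised least squares for x_dot = A(theta) x + B u + d,
# with A(theta) = sum_i theta_i * Phi_i and made-up feature matrices Phi_i.
rng = np.random.default_rng(0)
n, lam = 2, 0.1
Phi = [np.array([[0.0, 1.0], [0.0, 0.0]]), np.array([[0.0, 0.0], [-1.0, -0.5]])]
theta_true, B = np.array([1.0, 0.8]), np.eye(n)

X = rng.normal(size=(200, n))                        # observed states x(t)
U = rng.normal(size=(200, n))                        # controls u(t)
A_true = sum(th * P for th, P in zip(theta_true, Phi))
Y = X @ A_true.T + U @ B.T + 0.01 * rng.normal(size=(200, n))    # noisy x_dot(t)

# Each observation contributes features Phi_i x(t), stacked over state components.
F = np.stack([X @ P.T for P in Phi], axis=-1).reshape(-1, len(Phi))   # (200*n, d)
target = (Y - U @ B.T).reshape(-1)
G = F.T @ F + lam * np.eye(len(Phi))                 # G_{N,lambda} = Phi^T Phi + lam I
theta_hat = np.linalg.solve(G, F.T @ target)         # theta_{N,lambda}
print(theta_hat.round(3))                            # close to theta_true
```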

SLIDE 109

The prediction goal

Possible trajectories of ẋ(t) = A(θ)x(t) + Bu(t) + d(t). There are two sources of uncertainty:

  • Parametric uncertainty A(θ) ∈ C_δ
  • External perturbations d(t)

[Figure: a bundle of possible trajectories starting from x(0) under all admissible A(θ) and d(t).]

SLIDE 110

The prediction goal

Interval Prediction: can we design an interval predictor [x̲(t), x̄(t)] that verifies:

  • the inclusion property: ∀t, x̲(t) ≤ x(t) ≤ x̄(t);
  • stable dynamics?

We want the predictor to be as tight as possible. How to proceed?

[Figure: the trajectory bundle enclosed between the lower bound x̲(t) and the upper bound x̄(t).]

SLIDE 114

A first idea

Assume that x̲(t) ≤ x(t) ≤ x̄(t), for some t ≥ 0. To propagate the interval to x(t + dt), we need to bound A(θ)x(t). Why not use interval arithmetic?

Lemma (Image of an interval (Efimov et al., 2012)) If A is a known matrix, then
A⁺x̲ − A⁻x̄ ≤ Ax ≤ A⁺x̄ − A⁻x̲,
where A⁺ = max(A, 0) and A⁻ = A⁺ − A.

SLIDE 116

A first idea

Lemma (Product of intervals (Efimov et al., 2012)) If A is unknown but bounded, A̲ ≤ A ≤ Ā, then
A̲⁺x̲⁺ − Ā⁺x̲⁻ − A̲⁻x̄⁺ + Ā⁻x̄⁻ ≤ Ax ≤ Ā⁺x̄⁺ − A̲⁺x̄⁻ − Ā⁻x̲⁺ + A̲⁻x̲⁻.

Since A(θ) belongs to a known C_δ, we can easily compute such bounds A̲ ≤ A(θ) ≤ Ā.
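A direct sketch of these interval-arithmetic bounds (using A⁺ = max(A, 0) and A⁻ = A⁺ − A), checked on the scalar case A ∈ [−2, −1], x ∈ [1, 1.1]:

```python
import numpy as np

def parts(M):
    """Positive and negative parts: M = M_plus - M_minus, both nonnegative."""
    Mp = np.maximum(M, 0.0)
    return Mp, Mp - M

def interval_matvec(A_lo, A_hi, x_lo, x_hi):
    """Bounds on A x when A_lo <= A <= A_hi and x_lo <= x <= x_hi."""
    Alp, Alm = parts(A_lo)
    Ahp, Ahm = parts(A_hi)
    xlp, xlm = parts(x_lo)
    xhp, xhm = parts(x_hi)
    lower = Alp @ xlp - Ahp @ xlm - Alm @ xhp + Ahm @ xhm
    upper = Ahp @ xhp - Alp @ xhm - Ahm @ xlp + Alm @ xlm
    return lower, upper

# Scalar sanity check: A in [-2, -1], x in [1, 1.1]  ->  A x in [-2.2, -1].
print(interval_matvec(np.array([[-2.0]]), np.array([[-1.0]]),
                      np.array([1.0]), np.array([1.1])))
```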

SLIDE 119

A candidate predictor

Following this result, define the predictor (all terms evaluated at time t):

dx̲/dt = A̲⁺x̲⁺ − Ā⁺x̲⁻ − A̲⁻x̄⁺ + Ā⁻x̄⁻ + B⁺d̲ − B⁻d̄,
dx̄/dt = Ā⁺x̄⁺ − A̲⁺x̄⁻ − Ā⁻x̲⁺ + A̲⁻x̲⁻ + B⁺d̄ − B⁻d̲,        (1)
x̲(0) = x̲₀, x̄(0) = x̄₀.

Proposition (Inclusion property) The predictor (1) satisfies x̲(t) ≤ x(t) ≤ x̄(t).

But is it stable?

SLIDE 121

Motivating example

Consider the scalar system, for all t ≥ 0: ẋ(t) = −θ(t)x(t) + d(t), where
x(0) ∈ [x̲₀, x̄₀] = [1.0, 1.1], θ(t) ∈ Θ = [θ̲, θ̄] = [1, 2], d(t) ∈ [d̲, d̄] = [−0.1, 0.1].

[Figure: the trajectory x(t) and the predicted bounds x̲(t), x̄(t) over t ∈ [0, 5].]

The system is always stable.
✗ The predictor (1) is unstable.
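A quick numerical check of this instability, simulating the scalar version of predictor (1) with a crude Euler scheme (a sketch, not the thesis code):

```python
# Scalar example: x_dot = -theta(t) x + d(t), i.e. A(theta) in [A_lo, A_hi] = [-2, -1],
# d in [-0.1, 0.1]. The true system is stable, but the interval [x_lo, x_hi] blows up.
pos = lambda v: max(v, 0.0)
neg = lambda v: pos(v) - v

A_lo, A_hi, d_lo, d_hi, dt = -2.0, -1.0, -0.1, 0.1, 1e-3
x_lo, x_hi = 1.0, 1.1
for step in range(int(5.0 / dt)):                 # simulate t in [0, 5]
    dx_lo = pos(A_lo)*pos(x_lo) - pos(A_hi)*neg(x_lo) - neg(A_lo)*pos(x_hi) + neg(A_hi)*neg(x_hi) + d_lo
    dx_hi = pos(A_hi)*pos(x_hi) - pos(A_lo)*neg(x_hi) - neg(A_hi)*pos(x_lo) + neg(A_lo)*neg(x_lo) + d_hi
    x_lo, x_hi = x_lo + dt * dx_lo, x_hi + dt * dx_hi
print(x_lo, x_hi)   # the enclosure has diverged by t = 5
```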

SLIDE 122

Additional assumption

Assumption (Polytopic Structure) There exist A₀ Metzler and ∆A₁, …, ∆A_N such that:
A(θ) = A₀ (nominal dynamics) + ∑_{i=1}^{N} λ_i(θ) ∆A_i,  with λ_i(θ) ≥ 0 and ∑_{i=1}^{N} λ_i(θ) = 1, for all θ ∈ Θ.

[Figure: the polytope of dynamics spanned by A₀ and the ∆A_i, enclosing the confidence region for A(θ).]

SLIDE 125

Our proposed predictor

Denote ∆A⁺ = ∑_{i=1}^{N} ∆A_i⁺ and ∆A⁻ = ∑_{i=1}^{N} ∆A_i⁻. We define the predictor (all terms evaluated at time t):

dx̲/dt = A₀x̲ − ∆A⁺x̲⁻ − ∆A⁻x̄⁺ + B⁺d̲ − B⁻d̄,
dx̄/dt = A₀x̄ + ∆A⁺x̄⁺ + ∆A⁻x̲⁻ + B⁺d̄ − B⁻d̲,        (2)
x̲(0) = x̲₀, x̄(0) = x̄₀.

Theorem (Inclusion property) The predictor (2) ensures x̲(t) ≤ x(t) ≤ x̄(t).

SLIDE 126

Stability

Theorem (Stability) If there exist diagonal matrices P, Q, Q⁺, Q⁻, Z⁺, Z⁻, Ψ⁺, Ψ⁻, Ψ, Γ ∈ ℝ^{2n×2n} such that the following LMIs are satisfied:
P + min{Z⁺, Z⁻} > 0,  Υ ⪯ 0,  Γ > 0,  Q + min{Q⁺, Q⁻} + 2 min{Ψ⁺, Ψ⁻} > 0,
where Υ = Υ(A₀, ∆A⁻, ∆A⁺, Ψ⁻, Ψ⁺, Ψ), then the predictor (2) is input-to-state stable with respect to the inputs d̲, d̄.

SLIDE 127

Back to our motivating example

Recall the scalar system: ẋ(t) = −θ(t)x(t) + d(t), where
x(0) ∈ [x̲₀, x̄₀] = [1.0, 1.1], θ(t) ∈ Θ = [θ̲, θ̄] = [1, 2], d(t) ∈ [d̲, d̄] = [−0.1, 0.1].

[Figure: the trajectory x(t) and the bounds x̲(t), x̄(t) produced by predictor (2), which remain tight.]

The system is always stable.
The predictor (2) is stable.

SLIDE 128

Prediction Results

The naive predictor (1) quickly diverges. The proposed predictor (2) remains stable.

SLIDE 129

Prediction Results

Prediction during a lane-change manoeuvre. Prediction with uncertainty about the followed lane L_i.

SLIDE 130

Robust Control with Continuous Ambiguity

Approximate the robust objective by a tractable surrogate.

Definition (Robust objective v_r)
v_r(π) := min_{A(θ)∈C_δ} ∑_{t=0}^{H} γ^t R(x_t, π(x_t))        (3)

Definition (Surrogate objective v̂_r)
v̂_r(π) := ∑_{t=0}^{H} γ^t min_{x∈[x̲(t), x̄(t)]} R(x, π(x))        (4)

SLIDE 131

Guarantees

The approximate performance of a policy is guaranteed on the true environment.

Proposition (Lower bound) The surrogate objective v̂_r is a lower bound of the true objective v_r:
∀π, v̂_r(π) ≤ v_r(π)        (5)

SLIDE 132

Experiments

Ambiguity / Agent / Worst-case / Mean ± std
  • None / Oracle / 9.83 / 10.84 ± 0.16
  • Continuous / Nominal / 1.99 / 9.95 ± 2.38
  • Continuous / Robust / 7.88 / 10.73 ± 0.61

SLIDE 134

But what if...

Our linear structure assumption may be wrong. Model Adequacy: you can detect it with statistical tests.

Solution: Multi-Model Prediction. Use many linear models with different features. For instance:

  • Lane-dependent features
  • Neural network features
  • Random features

Maintain a set of admissible experts and perform robust aggregation.

SLIDE 135

Multi-Model Uncertainty

Assumption (Discrete Ambiguity Set) T ∈ {T₁, …, T_M}

  • Optimistic evaluation of paths at the leaves, for all dynamics
  • Worst-case aggregation over the M dynamics: min_m
  • Optimal planning of action sequences: max_a
81 -Reinforcement Learning for Autonomous Driving- Edouard Leurent

slide-136
SLIDE 136

A robust extension of action-values

Definition (Robust sequence value upper-bound) Given a node i ∈ T, define the robust B-value:
B^r_i(n) := min_{m∈[1,M]} [ ∑_{t=0}^{d−1} γ^t r_t + γ^d/(1 − γ) ]  if i ∈ L_n;
B^r_i(n) := max_{a∈A} B^r_{ia}(n)  if i ∈ T_n \ L_n.

Theorem (Regret bound) The corresponding planning algorithm enjoys a simple regret of
R_n = O(n^{−(log 1/γ)/(log κ)})  if κ > 1.        (6)

SLIDE 137

Experiments

Ambiguity / Agent / Worst-case / Mean ± std
  • None / Oracle / 9.83 / 10.84 ± 0.16
  • Continuous / Nominal / 1.99 / 9.95 ± 2.38
  • Continuous / Robust / 7.88 / 10.73 ± 0.61
  • Discrete / Nominal / 2.09 / 8.85 ± 3.53
  • Discrete / Robust / 8.99 / 10.78 ± 0.34

SLIDE 138

Conclusion

Decision-making among interacting drivers with behavioural uncertainty.

Model-free

  • 1. Self-attention model for permutation invariance and variable size
  • 2. Budgeted reinforcement learning to constrain the expected risk

Model-based

  • 3. Efficient tree-based planning with tight statistical bounds
  • 4. Tackle the issue of model bias

  • Build a confidence region around the true model
  • Design a stable interval predictor
  • Perform robust control with respect to this uncertainty

SLIDE 139

Thank You!
