Decentralized Non-Communicating Multi-agent Collision Avoidance - PowerPoint PPT Presentation


SLIDE 1

Decentralized Non-Communicating Multi-agent Collision Avoidance with Deep Reinforcement Learning

By Yu Fan Chen, Miao Liu, Michael Everett, and Jonathan P. How
Presenter: Jared Choi

SLIDE 2

Motivation

  • Finding a path
  • Computationally expensive due to:
  • Collision checking
  • Feasibility checking
  • Efficiency checking

SLIDE 3

Motivation

  • Finding a path
  • Computationally expensive due to:
  • Collision checking
  • Feasibility checking
  • Efficiency checking
  • Offline Learning

SLIDE 4

Background

  • A sequential decision-making problem can be formulated as a Markov Decision Process (MDP)
  • M = <S, A, P, R, γ>

SLIDE 5

Background

  • A sequential decision-making problem can be formulated as a Markov Decision Process (MDP)
  • M = <S, A, P, R, γ>
  • S: state space
  • A: action space
  • P: state transition model
  • R: reward function
  • γ: discount factor
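
As a rough illustration (our own, not from the paper), the tuple can be written down directly in code; all names below are ours:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# A minimal container for the MDP tuple <S, A, P, R, γ> (illustrative only).
# S stays implicit: states are whatever objects these functions accept.
@dataclass
class MDP:
    actions: Callable[[object], Sequence[object]]   # A: permissible actions in a state
    transition: Callable[[object, object], object]  # P: samples s' given (s, a)
    reward: Callable[[object, object], float]       # R: reward for taking a in s
    gamma: float                                    # γ: discount factor in [0, 1)
```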

SLIDE 6

State Space (M = <S, A, P, R, γ>)

  • S (state space)
  • The system’s state is constructed by concatenating the two agents’ individual states

Observable state vector: position (x, y), velocity (x, y), radius
Unobservable state vector: goal position (x, y), preferred speed, heading angle
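
A minimal sketch (our own; variable names and example numbers are assumptions) of how that concatenation might look:

```python
import numpy as np

# Observable part of an agent's state: position, velocity, radius.
def observable_state(px, py, vx, vy, radius):
    return np.array([px, py, vx, vy, radius])

# The ego agent's full state adds the unobservable part:
# goal position, preferred speed, heading angle.
def full_state(px, py, vx, vy, radius, gx, gy, v_pref, theta):
    return np.array([px, py, vx, vy, radius, gx, gy, v_pref, theta])

# The system's joint state concatenates the ego agent's full state
# with the other agent's observable state.
s_joint = np.concatenate([
    full_state(0.0, 0.0, 0.5, 0.0, 0.3, 4.0, 0.0, 1.0, 0.0),
    observable_state(4.0, 0.5, -0.5, 0.0, 0.3),
])
```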

SLIDE 8

Action Space (M = <S, A, P, R, γ>)

  • A (action space):
  • Set of permissible velocity vectors, a(s) = v
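
In practice a continuous set of velocity vectors is often discretized into a finite menu of actions; the grid below is an assumed illustration, not the paper’s exact discretization:

```python
import numpy as np

# Assumed discretization: speeds up to the preferred speed crossed with
# evenly spaced headings, plus a stop action. The paper's exact action
# set may differ.
def build_action_set(v_pref, n_speeds=3, n_headings=12):
    actions = [np.zeros(2)]  # allow stopping
    for speed in np.linspace(v_pref / n_speeds, v_pref, n_speeds):
        for angle in np.linspace(0.0, 2 * np.pi, n_headings, endpoint=False):
            actions.append(speed * np.array([np.cos(angle), np.sin(angle)]))
    return actions
```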

SLIDE 9

State Transition Model (M = <S, A, P, R, γ>)

  • P (state transition model)
  • A probabilistic state transition model
  • Determined by the agents’ kinematics
  • Unknown to us

SLIDE 10

Reward Function (M = <S, A, P, R, γ>)

  • R: reward function
  • Rewards the agent for reaching its goal
  • Penalizes the agent for getting too close to or colliding with the other agent
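
A minimal sketch of such a reward; the thresholds and magnitudes are assumptions in the spirit of the slide, not the paper’s exact constants:

```python
# Illustrative reward (all constants here are assumptions).
# min_separation: closest surface-to-surface gap to the other agent
# during the step; negative means the agents collided.
def reward(min_separation, reached_goal):
    if min_separation < 0:
        return -0.25                         # collision: strong penalty
    if reached_goal:
        return 1.0                           # reaching the goal is rewarded
    if min_separation < 0.2:
        return -0.1 + min_separation / 2.0   # penalty for getting too close
    return 0.0                               # otherwise neutral
```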

SLIDE 11

Discount Factor (M = <S, A, P, R, γ>)

  • γ: discount factor

SLIDE 12

Value Function

  • The value of a state
  • Value depends on γ
  • γ close to 1: we care about our long-term reward
  • γ close to 0: we care only about our immediate reward
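
A quick worked example (our own) of how γ weighs future reward:

```python
# Discounted return: G = r0 + γ·r1 + γ²·r2 + ...
def discounted_return(rewards, gamma):
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [0.0, 0.0, 0.0, 1.0]           # the goal reward arrives 3 steps away
print(discounted_return(rewards, 0.95))  # 0.857375: far-sighted, goal still counts
print(discounted_return(rewards, 0.05))  # 0.000125: myopic, future reward vanishes
```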

SLIDE 13

Optimal Policy

  • The policy that picks, at each state, the action with the best expected discounted return
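
One standard way to turn a value function into a policy is a one-step lookahead; in the sketch below (our illustration), `propagate` and `reward` are assumed helpers:

```python
# Greedy one-step lookahead: pick the action whose predicted next state
# scores highest under the current value estimate V (illustrative sketch;
# `propagate` and `reward` are assumed helpers).
def greedy_policy(state, actions, V, propagate, reward, gamma):
    best_action, best_value = None, float("-inf")
    for a in actions:
        next_state = propagate(state, a)              # predicted next joint state
        q = reward(state, a) + gamma * V(next_state)  # one-step lookahead value
        if q > best_value:
            best_action, best_value = a, q
    return best_action
```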

SLIDE 14

Value Function and Optimal Policy

(Figure from David Silver’s slides.)

SLIDE 15

Value Function and Optimal Policy

  • Every state s has a value V(s)
  • Store it in a lookup table
  • In a grid world: 16 values
  • In motion planning: infinite values (because the state space is continuous)
  • Solution:
  • Approximate the value via a neural network
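
As a toy sketch of the idea (our own; the paper’s network is deeper and these sizes are assumptions), a value network is just a function from the joint state vector to a scalar:

```python
import numpy as np

# A tiny one-hidden-layer value network: V(s) = w2·relu(W1·s + b1) + b2.
# Layer sizes and the 14-dimensional state layout are assumptions.
rng = np.random.default_rng(0)
STATE_DIM, HIDDEN = 14, 32
W1 = rng.normal(0.0, 0.1, (HIDDEN, STATE_DIM))
b1 = np.zeros(HIDDEN)
w2 = rng.normal(0.0, 0.1, HIDDEN)
b2 = 0.0

def V(s):
    h = np.maximum(0.0, W1 @ s + b1)  # ReLU hidden layer
    return w2 @ h + b2                # scalar state value
```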

SLIDE 16

Value Function and Optimal Policy

(Figure from David Silver’s slides.)

SLIDE 17

Value Function and Optimal Policy

SLIDE 18

Collision Avoidance with Deep Reinforcement Learning

1. Train the value network using ORCA
2. Train again with deep reinforcement learning

SLIDE 19

Collision Avoidance with Deep Reinforcement Learning

1. Train the value network using ORCA

  • Why pre-train?

SLIDE 20

Collision Avoidance with Deep Reinforcement Learning

1. Train the value network using ORCA

  • Why pre-train?
  • Initializing the neural network well is crucial for convergence
  • We want the network to output something reasonable

SLIDE 21

Collision Avoidance with Deep Reinforcement Learning

1. Train the value network using ORCA

  • Why pre-train?
  • Initializing the neural network well is crucial for convergence
  • We want the network to output something reasonable
  • Generate 500 trajectories as a training set
  • Each trajectory contains 40 state–value pairs (20,000 pairs in total)
  • Back-propagate to minimize the loss: the squared error between the network’s prediction and the ORCA-derived value
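
A hedged sketch of that supervised step, using a linear model and random stand-in data for brevity (the paper fits a neural network to the real ORCA pairs; everything named here is an assumption):

```python
import numpy as np

# Pre-training sketch: regress V(s) onto ORCA-derived targets y by
# minimizing the squared error sum_k (y_k - V(s_k))^2 with gradient descent.
rng = np.random.default_rng(0)
states = rng.normal(size=(20000, 14))  # stand-in for the 20,000 state vectors
targets = rng.uniform(size=20000)      # stand-in for ORCA-derived values
w = np.zeros(14)                       # linear value model, V(s) = w·s

lr = 1e-3
for _ in range(200):                   # full-batch gradient steps
    preds = states @ w
    grad = states.T @ (preds - targets) / len(targets)
    w -= lr * grad                     # descend the squared-error gradient
```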

SLIDE 22

Collision Avoidance with Deep Reinforcement Learning

1. Train the value network using ORCA
2. Train again with deep reinforcement learning

SLIDE 23

Collision Avoidance with Deep Reinforcement Learning

2. Train again with deep reinforcement learning
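
The following slides step through this refinement loop as a figure. A hedged sketch of such a value-based loop (ε-greedy exploration, an experience buffer, one-step bootstrapped targets; every name here is our assumption, not the paper’s code):

```python
import random

# Illustrative deep RL refinement loop: explore with epsilon-greedy
# actions, store experience, and regress V toward the one-step target
# y = r + γ·V(s'). All helpers (env, V, update_V, choose_best_action)
# are assumed interfaces.
def train_rl(env, V, update_V, choose_best_action, gamma,
             episodes=1000, eps=0.1, batch_size=64):
    replay = []                                    # experience buffer
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:
                a = random.choice(env.actions)     # explore
            else:
                a = choose_best_action(s, V)       # exploit the current value net
            s_next, r, done = env.step(a)
            replay.append((s, r, s_next, done))
            s = s_next
        # after each episode, fit V on a random minibatch of experience
        for s, r, s_next, d in random.sample(replay, min(batch_size, len(replay))):
            y = r if d else r + gamma * V(s_next)  # bootstrapped value target
            update_V(s, y)                         # regress V(s) toward y
```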

SLIDES 24-37

Collision Avoidance with Deep Reinforcement Learning

(Figure: the deep reinforcement learning training loop, built up step by step across these slides.)

SLIDE 38

Collision Avoidance with Deep Reinforcement Learning

2. Train again with deep reinforcement learning

Backpropagation


SLIDES 40-42

Results

(Figures: simulation results.)

SLIDE 43

Q&A

SLIDE 44

Quiz

  • Values are updated after each episode (T/F)
  • The value function needs to be trained with ORCA (T/F)
  • The ORCA paths do not need to be optimal (T/F)