SLIDE 1 Decentralized Non-Communicating Multiagent Collision Avoidance with Deep Reinforcement Learning
By Yu Fan Chen, Miao Liu, Michael Everett, and Jonathan P. How
Presenter: Jared Choi
SLIDE 2 Motivation
- Finding a path
  - Computationally expensive due to
    - Collision checking
    - Feasibility checking
    - Efficiency checking
SLIDE 3 Motivation
- Finding a path
  - Computationally expensive due to
    - Collision checking
    - Feasibility checking
    - Efficiency checking
- Offline Learning
SLIDE 4 Background
- A sequential decision-making problem can be formulated as a Markov Decision Process (MDP)
SLIDE 5 Background
- A sequential decision-making problem can be formulated as a Markov Decision Process (MDP)
- M = <S, A, P, R, γ>
  - S (state space)
  - A (action space)
  - P (state transition model)
  - R (reward function)
  - γ (discount factor)
SLIDE 6 State Space (M = <S, A, P, R, γ>)
- S (state space)
- The system's state is constructed by concatenating the two agents' individual states
- Observable state vector: position (x, y), velocity (x, y), radius
- Unobservable state vector: goal position (x, y), preferred speed, heading angle
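A minimal sketch of how the joint state might be assembled, assuming NumPy; the field names and ordering are illustrative:

```python
import numpy as np

# Observable part of an agent's state: position, velocity, radius.
def observable_state(px, py, vx, vy, radius):
    return np.array([px, py, vx, vy, radius])

# The full state adds the parts hidden from the other agent:
# goal position, preferred speed, heading angle.
def full_state(px, py, vx, vy, radius, gx, gy, v_pref, theta):
    return np.array([px, py, vx, vy, radius, gx, gy, v_pref, theta])

# The system's (joint) state concatenates the agent's own full state
# with the other agent's observable state.
def joint_state(own_full, other_observable):
    return np.concatenate([own_full, other_observable])
```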
SLIDE 8 Action Space (M = <S, A, P, R, γ>)
- A (action space)
- The set of permissible velocity vectors, a(s) = v
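A sketch of one way to build such a set, assuming NumPy; the speed/heading grid and turn limit are illustrative, not the paper's exact discretization:

```python
import numpy as np

# Build a discretized action set: each action is a velocity vector,
# formed from a few speeds (up to the preferred speed) and headings
# within a bounded turn range.
def build_action_space(v_pref, n_speeds=5, n_headings=16, max_turn=np.pi / 3):
    actions = []
    for speed in np.linspace(0.0, v_pref, n_speeds):
        for heading in np.linspace(-max_turn, max_turn, n_headings):
            actions.append(np.array([speed * np.cos(heading),
                                     speed * np.sin(heading)]))
    return actions  # a(s) = v: a set of permissible velocity vectors
```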
SLIDE 9 State Transition Model (M = <S, A, P, R, γ>)
- P (state transition model)
- A probabilistic state transition model
- Determined by the agents' kinematics
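A toy kinematic sketch of why the transition is probabilistic from one agent's point of view: its own motion is a deterministic update, but the other agent's velocity comes from an unknown policy (here a hypothetical `other_policy` callback):

```python
import numpy as np

# One time step of the joint system. The agent controls own_vel directly;
# the other agent's velocity is produced by a policy this agent cannot
# observe, which is the source of uncertainty in P.
def step(own_pos, own_vel, other_pos, other_policy, dt=0.1):
    other_vel = other_policy(other_pos)  # unknown to this agent
    return own_pos + own_vel * dt, other_pos + other_vel * dt
```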
SLIDE 10 Reward Function (M = <S, A, P, R, γ>)
- R (reward function)
- Rewards the agent for reaching its goal
- Penalizes the agent for getting too close to or colliding with the other agent
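A sketch of a reward with this shape; the constants and the `dist_to_other` convention (surface-to-surface gap, negative when radii overlap) are illustrative, not necessarily the paper's exact values:

```python
# Reward the goal, penalize collisions, and apply a smaller graded
# penalty for passing uncomfortably close to the other agent.
def reward(dist_to_goal, dist_to_other, goal_tol=0.1, too_close=0.2):
    if dist_to_other < 0:            # overlapping radii: collision
        return -0.25
    if dist_to_other < too_close:    # uncomfortably close
        return -0.1 - dist_to_other / 2
    if dist_to_goal < goal_tol:      # reached the goal
        return 1.0
    return 0.0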
SLIDE 11 Discount Factor (M = <S, A, P, R, γ>)
SLIDE 12 Value Function
- The value of a state
- The value depends on γ
  - γ close to 1: we care about our long-term reward
  - γ close to 0: we care only about our immediate reward
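Concretely, the value of a state under a policy π is the expected discounted sum of future rewards (the standard MDP definition):

```latex
V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, \pi(s_t)) \,\middle|\, s_0 = s\right]
```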
SLIDE 13 Optimal Policy
- The action yielding the best trajectory from a given state
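In value-based form this is a one-step lookahead: pick the action maximizing the immediate reward plus the discounted value of the expected next state (the paper's variant additionally scales the discount exponent by the time step and preferred speed):

```latex
\pi^{*}(s) = \operatorname*{argmax}_{a \in A}\; R(s, a) + \gamma \int_{s'} P(s' \mid s, a)\, V^{*}(s')\, \mathrm{d}s'
```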
SLIDE 14 Value Function and Optimal Policy
From David Silver’s slide
SLIDE 15 Value Function and Optimal Policy
- Every state s has a value V(s)
  - Store it in a lookup table
  - In a grid world: 16 values
  - In motion planning: infinite values (because it's a continuous state space)
- Solution:
  - Approximate the value with a neural network (sketched below)
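A minimal sketch of such a value network, assuming PyTorch; the input size (own full state plus the other agent's observable state, 9 + 5 = 14 in the layout above) and the hidden widths are illustrative:

```python
import torch
import torch.nn as nn

# A small fully connected network mapping the joint state vector to a
# scalar value estimate V(s).
class ValueNetwork(nn.Module):
    def __init__(self, state_dim=14, hidden=150):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state)
```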
SLIDE 16 Value Function and Optimal Policy
From David Silver’s slides
SLIDE 17 Value Function and Optimal Policy
SLIDE 18 Collision Avoidance Deep Reinforcement Learning
1. Train value network using ORCA
2. Train again with deep reinforcement learning
SLIDE 19 Collision Avoidance Deep Reinforcement Learning
1. Train value network using ORCA
SLIDE 20 Collision Avoidance Deep Reinforcement Learning
1. Train value network using ORCA
- Why pre-train?
  - Initializing the neural network is crucial to convergence
  - We want the network to output something reasonable
SLIDE 21 Collision Avoidance Deep Reinforcement Learning
1. Train value network using ORCA
- Why pre-train?
  - Initializing the neural network is crucial to convergence
  - We want the network to output something reasonable
- Generate 500 trajectories as a training set
  - Each trajectory contains 40 state-value pairs (20,000 pairs in total)
- Back-propagate to minimize the loss function (see the sketch below)
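A sketch of this pre-training step, assuming a squared-error regression loss and PyTorch; the `states`/`values` tensors, epoch count, and learning rate are placeholders, not the paper's exact settings:

```python
import torch
import torch.nn as nn

# Supervised pre-training: regress the network's output onto state-value
# pairs (s_k, y_k) extracted from the ORCA trajectories. `states` is an
# (N, state_dim) tensor and `values` an (N, 1) tensor of targets.
def pretrain(value_net, states, values, epochs=50, lr=1e-3):
    optimizer = torch.optim.SGD(value_net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(value_net(states), values)
        loss.backward()   # back-propagate the regression error
        optimizer.step()
    return value_net
```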
SLIDE 22 Collision Avoidance Deep Reinforcement Learning
1. Train value network using ORCA
2. Train again with deep reinforcement learning
SLIDE 23 Collision Avoidance Deep Reinforcement Learning
2. Train again with deep reinforcement learning
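The second stage refines the same network with reinforcement learning: act (mostly) greedily with respect to the current value estimate, observe the reward, and regress toward a bootstrapped target. A minimal sketch of one such update, assuming PyTorch; the discounting is simplified (the paper's discount exponent also depends on the time step and preferred speed), and experience replay plus ε-greedy exploration are omitted:

```python
import torch

# One temporal-difference update: the target is the observed reward plus
# the discounted value of the next state, held fixed while the network
# is fit toward it.
def td_update(value_net, optimizer, s, r, s_next, gamma=0.97):
    with torch.no_grad():
        target = r + gamma * value_net(s_next)   # bootstrapped target
    loss = torch.nn.functional.mse_loss(value_net(s), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```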
SLIDE 38 Collision Avoidance Deep Reinforcement Learning
2. Train again with deep reinforcement learning
- Backpropagation
SLIDE 40 Result
SLIDE 41 Result
SLIDE 42 Result
SLIDE 43 Q&A
SLIDE 44 Quiz
- Values are updated after each episode (T/F)
- The value function needs to be trained with ORCA (T/F)
- The ORCA path does not need to be optimal (T/F)