

SLIDE 1

LunarLander-v2 using Deep Reinforcement Learning

A project developed for the Autonomous Agents course (PLH513)
Portokalakis Petros
February 2020

SLIDE 2

Simple Game

  • 8-dimensional state space
  • 4 actions per state
  • +100 points for landing
  • -100 points when crashed
  • Infinite fuel, but -0.3 points per frame when firing the main engine
  • +10 points for each leg with ground contact (to encourage a smooth landing)
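
For concreteness, these spaces can be inspected directly in OpenAI Gym; the slides show no code, so the snippet below is only an illustrative usage sketch:

```python
import gym

# LunarLander-v2: 8-dimensional continuous observation, 4 discrete actions
env = gym.make("LunarLander-v2")

print(env.observation_space.shape)  # (8,)  position, velocity, angle, angular velocity, leg contacts
print(env.action_space.n)           # 4     no-op, fire left engine, fire main engine, fire right engine
```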

SLIDE 3

Deep Reinforcement Learning

Objective: approximate the optimal Q-function, which satisfies the Bellman equation.

Neural network:

  • 8-node input layer - dimensionality of the state space
  • 150-node fully connected 1st hidden layer
  • 128-node fully connected 2nd hidden layer
  • 4-node output layer - Q-values for the actions

The 4-layer approach works well with a variety of hidden-layer node counts; 5-layer networks proved insufficient to even train the agent (a sketch of the network follows).
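
The slides do not name a framework or activations; assuming Keras and ReLU (typical choices), a minimal sketch of the described 8-150-128-4 network, trained toward the Bellman optimality target Q*(s,a) = E[r + γ · max_a' Q*(s',a')] (the learning rate 0.001 is taken from the hyperparameter table on slide 9):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_q_network(state_dim=8, n_actions=4, lr=0.001):
    """Hypothetical 4-layer Q-network matching the slide's architecture."""
    model = models.Sequential([
        layers.Dense(150, activation="relu", input_shape=(state_dim,)),  # 1st hidden layer
        layers.Dense(128, activation="relu"),                            # 2nd hidden layer
        layers.Dense(n_actions, activation="linear"),                    # one Q-value per action
    ])
    # MSE against the bootstrapped target is the standard DQN loss choice
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr), loss="mse")
    return model
```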

SLIDE 4

Deep Reinforcement Learning: Advancing performance

Experience replay:

  • Every tuple (s, a, r, s', done) is stored in a replay buffer (max length = 1M)
  • Randomly sample a batch of previous experiences (64) to break the correlation between consecutive samples
  • Predict the best action for all items in the batch via the NN
  • Update the neural network weights
  • Generate episodes via exploration or exploitation
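
A minimal replay-buffer sketch matching these numbers (a 1M-transition deque with uniform batches of 64); the project's actual data structure is not shown on the slides:

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay: store transitions, sample decorrelated batches."""

    def __init__(self, max_length=1_000_000):
        self.buffer = deque(maxlen=max_length)  # oldest transitions fall off the front

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=64):
        # Uniform random sampling breaks the correlation between consecutive steps
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```
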
SLIDE 5

Deep Reinforcement Learning: Advancing performance

  • Calculating the loss between the output Q-value and the target Q-value requires a second pass through the network for the next state
  • s and s' share the same network and are only one step apart
  • Optimization becomes unstable

Target network: use a network identical to the policy network, but update the target network's weights only every C iterations (C is a hyperparameter). The first pass occurs with the policy network; the second pass occurs with the target network (sketched below).
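
A hedged sketch of the two-pass update, reusing the hypothetical build_q_network and ReplayBuffer from the earlier sketches (gamma = 0.99 comes from the hyperparameter table; this is standard DQN wiring, not the project's verbatim code):

```python
import numpy as np

policy_net = build_q_network()
target_net = build_q_network()
target_net.set_weights(policy_net.get_weights())  # start the two networks identical

def train_step(batch, gamma=0.99):
    s, a, r, s_next, done = map(np.array, zip(*batch))
    q = policy_net.predict(s, verbose=0)            # first pass: policy network
    q_next = target_net.predict(s_next, verbose=0)  # second pass: frozen target network
    # Bootstrapped target; terminal states contribute only the immediate reward
    q[np.arange(len(a)), a] = r + gamma * np.max(q_next, axis=1) * (1 - done)
    policy_net.fit(s, q, epochs=1, verbose=0)

# Every C iterations (C is a hyperparameter), re-sync the target network:
# target_net.set_weights(policy_net.get_weights())
```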

SLIDE 6

Deep Reinforcement Learning: Advancing performance

Abstract version of the agent algorithm implemented
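
The slide's diagram is not reproduced in this text; the loop below is a reconstruction from the techniques described on the previous slides (epsilon-greedy exploration, experience replay, periodic target-network sync), not the project's verbatim algorithm:

```python
# Illustrative values; only epsilon's schedule and the buffer/batch sizes come
# from the hyperparameter table. n_episodes and C are placeholders.
n_episodes, C = 500, 100
epsilon, eps_min, eps_decay = 1.0, 0.01, 0.99
buffer = ReplayBuffer()

for episode in range(n_episodes):
    s = env.reset()
    done = False
    while not done:
        # Exploration vs. exploitation
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(policy_net.predict(s[None], verbose=0)))
        s_next, r, done, info = env.step(a)
        buffer.store(s, a, r, s_next, done)
        if len(buffer) >= 64:
            train_step(buffer.sample(64))
        s = s_next
    epsilon = max(eps_min, epsilon * eps_decay)           # decay exploration
    if episode % C == 0:
        target_net.set_weights(policy_net.get_weights())  # target-network sync
```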

SLIDE 7

Deep Reinforcement Learning: Performance of Lunar Lander

SLIDE 8

Deep Reinforcement Learning: Performance of Lunar Lander

Adding a third hidden layer

SLIDE 9

Deep Reinforcement Learning: Hyperparameter Tuning

Hyperparameter             Value
Starting epsilon           1
Minimum epsilon            0.01
Decay factor of epsilon    0.99
Discount factor gamma      0.99
Learning rate              0.001
Batch size                 64
Replay buffer size         1,000,000
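
If epsilon is decayed once per episode (the slides do not state whether decay is per episode or per step), these values imply roughly ln(0.01) / ln(0.99) ≈ 458 episodes of decay before exploration bottoms out at the 0.01 floor.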

SLIDE 10

Thank you!
Questions?

Contact: pportokalakis@gmail.com