SLIDE 1

Chrome Dino DQN

AUTHOR: George Margaritis
INSTRUCTOR: Prof. M. Lagoudakis
COURSE: COMP 513, Autonomous Agents
SCHOOL: ECE, Technical University of Crete
PERIOD: Fall Semester, 2019-2020

SLIDE 2

Overview

  • What is Chrome Dino?
  • Model
  • Deep Q Learning
  • Implementation
  • Results
  • Conclusions
  • Pros – Cons
  • Future Work
  • References
SLIDE 3

What is Chrome Dino?

  • 2D Arcade Game created by Google for Chrome
  • Designed as an “Easter egg” game, shown when Chrome has no internet connection
  • Player: A little Dino
  • Task: The player controls the Dino and can either jump or duck at any time. The goal is to avoid as many obstacles as possible in order to maximize the score. As time progresses, the game becomes harder: the environment moves faster and more obstacles appear.

SLIDE 4

What is Chrome Dino?

SLIDE 5

Model

State space -> Very Large:

  • Each state -> Represented by 4 frames of 84x84 binary images

Actions:

  • Do nothing
  • Jump
  • Duck

Rewards:

  • +0.1 in every frame the Dino is alive
  • -1 when the Dino dies
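
As a concrete illustration of this encoding (a minimal NumPy sketch; the class and helper names are hypothetical, not taken from the project code):

    import numpy as np
    from collections import deque

    class StateBuilder:
        """Keeps the 4 most recent 84x84 binary frames and stacks them into a state."""
        def __init__(self, history=4):
            self.frames = deque(maxlen=history)

        def reset(self, first_frame):
            # At episode start, repeat the first frame to fill the history.
            for _ in range(self.frames.maxlen):
                self.frames.append(first_frame)
            return self.state()

        def push(self, frame):
            self.frames.append(frame)
            return self.state()

        def state(self):
            return np.stack(list(self.frames), axis=0)  # shape: (4, 84, 84)

    def reward(alive):
        # +0.1 for every frame the Dino survives, -1 on death (as above).
        return 0.1 if alive else -1.0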
SLIDE 6

Deep Q Learning
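
In brief: deep Q-learning trains a network Q_θ(s, a) to predict the expected return of each action by minimizing the squared temporal-difference error over transitions (s, a, r, s’) sampled from a replay buffer D, using a separate target network Q_θ⁻ (both mechanisms are described on Slide 8). This is the standard DQN formulation from the Atari DQN paper cited in the references:

    \mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}
        \left[ \left( r + \gamma \max_{a'} Q_{\theta^{-}}(s', a') - Q_{\theta}(s, a) \right)^{2} \right]

where γ is the discount factor.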

SLIDE 7

Implementation

  • The game runs in a browser automated by Selenium
  • Python uses a Chrome WebDriver to communicate with Selenium and play the game
  • Our DQN model is implemented in TensorFlow 2.0
  • The agent interacts with the environment, and the environment returns a transition (s, a, r, s’) where:
    • s: current state (4x84x84 matrix)
    • a: action (0 for do nothing, 1 for jump, 2 for duck)
    • r: numeric reward
    • s’: new state
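
The slides do not show the network architecture itself; the sketch below is an assumption modeled on the Atari DQN paper (cited in the references), written for TensorFlow 2.0 as stated above:

    import tensorflow as tf

    def build_q_network(num_actions=3):
        """CNN mapping 4 stacked 84x84 frames to one Q-value per action."""
        return tf.keras.Sequential([
            # Keras convolutions expect channels-last, so the 4x84x84 state
            # is fed transposed as 84x84x4.
            tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu",
                                   input_shape=(84, 84, 4)),
            tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu"),
            tf.keras.layers.Conv2D(64, 3, strides=1, activation="relu"),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(512, activation="relu"),
            tf.keras.layers.Dense(num_actions),  # a = 0: nothing, 1: jump, 2: duck
        ])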
SLIDE 8

Implementation

For better results and smoother training, our agent uses:

  • Experience replay:
    • Past transitions are stored and replayed in batches during training
    • The same transition can be used multiple times, which improves learning
  • Target network:
    • Use of 2 networks: a target network to estimate the target Q-value and a policy network to get the Q-values
    • Increases training stability
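
A minimal sketch of both mechanisms (the capacity, batch size, discount factor, and sync period below are illustrative assumptions, not the project's actual hyperparameters):

    import random
    from collections import deque

    import numpy as np

    class ReplayBuffer:
        """Stores past transitions; each can be sampled many times for training."""
        def __init__(self, capacity=50_000):
            self.buffer = deque(maxlen=capacity)

        def add(self, s, a, r, s_next, done):
            self.buffer.append((s, a, r, s_next, done))

        def sample(self, batch_size=32):
            batch = random.sample(self.buffer, batch_size)
            s, a, r, s_next, done = map(np.array, zip(*batch))
            return s, a, r, s_next, done

    def td_targets(target_net, r, s_next, done, gamma=0.99):
        # Target Q-values come from the *target* network, not the policy network.
        q_next = target_net(s_next).numpy().max(axis=1)
        return r + gamma * (1.0 - done.astype(np.float32)) * q_next

    # Every K training steps, sync the target network with the policy network:
    #     target_net.set_weights(policy_net.get_weights())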
SLIDE 9

Results

In our experiments we tested 2 different models:

  • Model 1: without the duck action (learning rate = 10⁻³)
  • Model 2: with the duck action (learning rate = 10⁻⁴)

For those models, we measured every 20 episodes (games):

  • The maximum score of the last 20 episodes
  • The average score of the last 20 episodes
  • The minimum score of the last 20 episodes

Then we smoothed the curves in order to better observe the trend.
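
As a sketch of how such curves can be computed (the 20-episode window comes from the text; the moving-average smoothing is an assumption, since the slides do not name the method):

    import numpy as np

    def window_stats(scores, window=20):
        """Max/avg/min over each consecutive window of 20 episodes."""
        chunks = [scores[i:i + window]
                  for i in range(0, len(scores) - window + 1, window)]
        return ([max(c) for c in chunks],
                [sum(c) / len(c) for c in chunks],
                [min(c) for c in chunks])

    def smooth(curve, k=5):
        # Simple moving average to expose the trend.
        return np.convolve(curve, np.ones(k) / k, mode="valid")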

SLIDE 10

Results (max score)

[Plot: maximum score of the last 20 episodes vs. episodes (1k–11k), Model 1 (No duck) vs. Model 2 (Duck); y-axis 50–350]

SLIDE 11

Results (avg score)

[Plot: average score of the last 20 episodes vs. episodes (1k–11k), Model 1 (No duck) vs. Model 2 (Duck); y-axis 20–140]

SLIDE 12

Results (min score)

[Plot: minimum score of the last 20 episodes vs. episodes (1k–11k), Model 1 (No duck) vs. Model 2 (Duck); y-axis 41–47]

SLIDE 13

Conclusions

By reducing the learning rate and allowing the duck action we observe:

  • Slower convergence, BUT
  • Better and more consistent results

Observation: Using the duck action, our agent discovers a hidden strategy:

  • Jump
  • While in the air, hit duck to descend:
    • Minimizes air time
    • Returns to the ground, where the agent has more control -> avoids more obstacles
SLIDE 14

Pros - Cons

Advantages:

  • Can be used without any domain-specific knowledge or assumptions about the environment
  • The exact same model can be used to beat many different games when trained in a different environment

Disadvantages:

  • Slow learning:
    • Training takes a lot of time (1 or 2 days)
  • Scores of nearby episodes are not very consistent:
    • Increased score variation between nearby episodes
SLIDE 15

Future Work

Try to improve DQN using:

  • Better hyperparameter tuning
  • Double DQN (target computation sketched after this list)
  • Prioritized Experience Replay
  • Dueling DQN
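
For the Double DQN item above: the only change from vanilla DQN is the target computation, where the policy network selects the next action and the target network evaluates it, reducing overestimation bias. A minimal sketch, reusing the hypothetical policy/target networks from the implementation slides:

    import numpy as np

    def double_dqn_targets(policy_net, target_net, r, s_next, done, gamma=0.99):
        # Select the next action with the policy network...
        best_a = np.argmax(policy_net(s_next).numpy(), axis=1)
        # ...but evaluate it with the target network.
        q_next = target_net(s_next).numpy()[np.arange(len(best_a)), best_a]
        return r + gamma * (1.0 - done.astype(np.float32)) * q_next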

Try different approaches:

  • Use statistics (Dino height, distance to the next obstacle, etc.) instead of images -> already implemented in the code (enabled with the --use-statistics flag)
  • Use NEAT (NeuroEvolution of Augmenting Topologies) in conjunction with statistics, instead of DQN -> should yield better results

SLIDE 16

Code

The source code is available on GitHub with documentation: https://github.com/margaeor/dino-dqn

SLIDE 17

References

  • Atari DQN – Paper
  • Intro to Deep RL – Article
  • Intro to DQN – Article
  • DQN Hands On – Article
  • DQN by sentdex – Video
  • COMP 513 course lectures