Reinforcement Learning as a Production Tool on AAA Games - PowerPoint PPT Presentation



SLIDE 1

Reinforcement Learning as a Production Tool on AAA Games

Olivier DELALLEAU & Adrien LOGUT

2017-10-18

SLIDE 2

AGENDA

Project Overview
Fighting in For Honor
Driving in Watch_Dogs 2

SLIDE 3

Project Overview

Build AIs that can play games like our players would

FOR HONOR: Olivier Delalleau, Frédéric Doll, Maxim Peter
WATCH_DOGS 2: Adrien Logut, Olivier Lamothe-Penelle

SLIDE 4

Motivations

Automated testing
Design assistance
In-game AI

SLIDE 5

Why Reinforcement Learning?

Evolutionary methods · Imitation learning

[Google Trends chart: genetic algorithm vs. reinforcement learning vs. imitation learning]

SLIDE 6

RL & Video games

Atari · Minecraft · Doom · Universe · SNES · Starcraft II · Dota 2 · Unity

(recent, incomplete list)

SLIDE 7

SLIDE 8

SLIDE 9

CENTURION GLADIATOR HIGHLANDER SHINOBI

SLIDE 10 – SLIDE 32 (image/video slides; no extractable text)

SLIDE 33

Autonomous Driving

SLIDE 34

Watch_Dogs 2

Open world game within a living city

» Takes place in San Francisco
» Living city: cars in the street
» Cars need to be controlled by an AI

SLIDE 35

Objectives

How is it currently done?

» PID controller with custom curves
» Hand-tuned curves
  ▪ Takes a lot of time
  ▪ Not precise
➢ How about Reinforcement Learning?
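For reference, the hand-tuned baseline above is a standard PID loop. A minimal sketch, assuming illustrative gains and a toy first-order vehicle model; nothing here reflects the game's actual controller:

```python
class PID:
    """Textbook PID controller; gains here are illustrative guesses."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, target, measured):
        error = target - measured
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        out = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Clamp to the throttle range used later in the talk: [0, 1]
        return max(0.0, min(1.0, out))

# Hypothetical usage: drive a toy vehicle's speed toward 10 m/s
pid = PID(kp=0.5, ki=0.05, kd=0.1, dt=0.1)
speed = 0.0
for _ in range(200):
    throttle = pid.step(target=10.0, measured=speed)
    speed += throttle * 0.5 - 0.01 * speed  # toy vehicle dynamics
```

The "takes a lot of time" complaint on the slide is about tuning the gains and custom curves per vehicle by hand, which is exactly what the learning approach replaces.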

SLIDE 36

Reinforcement Learning

What the agent can see and do

State: distance to the road: 0.1 · velocity: 0.3 · desired speed: 0.9 · …
Action: acceleration: [0, 1] · brake: [-1, 1] · steering: [-1, 1]
Reward: +3

[Diagram: the standard agent–environment loop: the agent observes a state, takes an action, and receives a reward]
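The state/action/reward interface above can be sketched as a tiny environment API. The field names, the toy dynamics, and the placeholder reward below are assumptions for illustration, not the game's implementation:

```python
from dataclasses import dataclass

@dataclass
class State:
    # Observation fields shown on the slide; the real state has more entries
    distance_to_road: float
    velocity: float
    desired_speed: float

@dataclass
class Action:
    acceleration: float  # in [0, 1]
    brake: float         # in [-1, 1], per the slide
    steering: float      # in [-1, 1]

class DrivingEnv:
    """Toy stand-in for the game loop: step(action) -> (state, reward)."""
    def __init__(self):
        self.state = State(distance_to_road=0.1, velocity=0.3, desired_speed=0.9)

    def step(self, action):
        s = self.state
        # Toy dynamics: throttle raises velocity, braking lowers it
        v = s.velocity + 0.1 * action.acceleration - 0.05 * max(action.brake, 0.0)
        self.state = State(s.distance_to_road, v, s.desired_speed)
        reward = -abs(v - s.desired_speed)  # placeholder, not the shaped reward
        return self.state, reward

env = DrivingEnv()
state, reward = env.step(Action(acceleration=1.0, brake=0.0, steering=0.0))
```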

SLIDE 37

Reinforcement Learning

What the agent can see and do

➢ Continuous states
➢ Neural network to approximate Q(s_t, a_t)

SLIDE 38

Reinforcement Learning

What the agent can see and do

➢ Continuous states
➢ Neural network to approximate Q(s_t, a_t)
➢ Continuous actions
  ➢ Cannot use the greedy policy from DQN (For Honor)
  ➢ Neural network to approximate a policy: a_t ~ π(s_t)

SLIDE 39

Reinforcement Learning

Actor Critic Architecture

» Two neural networks approximate functions
  ▪ Actor: a_t ~ π(s_t)
  ▪ Critic: Q(s_t, a_t)
» Critic update: expected discounted reward, learned by Q-learning (same as For Honor)
» Actor update: policy gradient

∇_{θ^π} J = E[ ∇_a Q(s, a | θ^Q) |_{s=s_t, a=π(s_t)} · ∇_{θ^π} π(s | θ^π) |_{s=s_t} ]

[Diagram: the actor maps s_t to a_t; the critic maps (s_t, a_t) to Q(s_t, a_t); updates flow between the two networks]
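The actor update above chains the critic's gradient with respect to the action through the policy's gradient with respect to its parameters. A minimal numpy sketch of that chain rule, using a tanh-linear actor and a hand-coded stand-in critic with a known optimum (in the talk the critic is itself a network trained by Q-learning; everything else here is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tanh-linear actor: a = pi(s) = tanh(theta . s); theta are the actor weights
theta = rng.normal(size=3) * 0.1

def actor(s):
    return np.tanh(theta @ s)

# Stand-in critic with a known optimum so learning is verifiable:
# Q(s, a) = -(a - g(s))^2 is maximized when a = g(s).
def g(s):
    return np.tanh(s.sum())

def Q(s, a):
    return -(a - g(s)) ** 2

def dQ_da(s, a):
    return -2.0 * (a - g(s))

lr = 0.1
states = rng.normal(size=(256, 3))
before = np.mean([Q(s, actor(s)) for s in states])
for _ in range(500):
    s = states[rng.integers(len(states))]
    a = actor(s)
    # Deterministic policy gradient: (dQ/da) * (d pi / d theta)
    grad_theta = dQ_da(s, a) * (1.0 - a ** 2) * s
    theta += lr * grad_theta  # gradient ascent on Q
after = np.mean([Q(s, actor(s)) for s in states])
```

The update never needs the "true" best action: the critic's local slope in action space is enough to tell the actor which way to move.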

SLIDE 40

(same content as SLIDE 39)

SLIDE 41

Reinforcement Learning

Actor update - Intuition

» Actor update: policy gradient
» Intuition:
  ▪ The critic gives the direction in which to update the actor
  ▪ “In which way should I change the actor parameters in order to maximize the critic output for a given state?”

[Diagram: the actor maps s_t to a_t; the critic maps (s_t, a_t) to Q(s_t, a_t); updates flow between the two networks]

SLIDE 42

First experience

Since we have the PID, what about imitating it?

» Supervised learning on the actor
  ▪ Updated with the mean squared error between the actor output and the PID output:

L_t = (a_{t,actor} − a_{t,PID})²

[Diagram: the actor and the PID both receive s_t and output a_{t,actor} and a_{t,PID}; the loss L_t drives the actor updates]
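The supervised pretraining step above amounts to least-squares regression from states to the PID's actions. A sketch under stated assumptions: the actor is linear and the "PID" is simulated as a fixed linear map, whereas in the talk it is the game's existing controller queried while driving:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dataset: states paired with the PID controller's actions
S = rng.normal(size=(512, 4))
w_pid = np.array([0.5, -0.2, 0.1, 0.3])   # stand-in for the real PID
A_pid = S @ w_pid

# Linear actor trained with the slide's loss: L_t = (a_actor - a_PID)^2
w = np.zeros(4)
lr = 0.05
for _ in range(300):
    A = S @ w
    grad = 2.0 * S.T @ (A - A_pid) / len(S)  # d(mean squared error)/dw
    w -= lr * grad

mse = np.mean((S @ w - A_pid) ** 2)
```

This gives the actor a sensible starting policy, which the reinforcement-learning phase can then improve beyond the PID.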

SLIDE 43

First experience

Supervised vs Original – Slight improvement

SLIDE 44

First experience

Supervised vs Original – Slight improvement

SLIDE 45

Reward Shaping

Defining the reward function

» The reward is the only signal received by the agent
  ▪ “Am I doing well or badly?”
» This is the key part of reinforcement learning
  ▪ Called reward shaping
  ▪ Requires a good understanding of the problem
» For driving:
  ▪ Follow the given path at the right speed
  ▪ Stop when needed

SLIDE 46

Reward Shaping

Defining the reward function - Configuration

» Three main components are measured:
» Velocity along the path: v_x
» Velocity perpendicular to the path: v_y
» Distance from the path: d

[Diagram: a vehicle near its path, annotated with v_x, v_y, and d]

SLIDE 47

Reward Shaping

Defining the reward function – Desired speed

» Positive reward when driving close to the desired speed
» Negative when far from the desired speed
» Punish more when driving faster than when driving slower
» Desired speed shown in red

[Plot: reward vs. velocity along the path (v_x)]

SLIDE 48

Reward Shaping

Defining the reward function – Velocity v_y

» Only negative reward
» Want to punish harder for small values (power < 1)

[Plot: reward vs. velocity perpendicular to the path (v_y)]

SLIDE 49

Reward Shaping

Defining the reward function – Distance

» Only negative reward
» Want to punish less for small values (power > 1)

[Plot: reward vs. distance from the path (d)]
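The three components above (an asymmetric speed term, a power < 1 penalty on perpendicular velocity, a power > 1 penalty on distance) can be combined into a single reward function. Every weight and exponent below is an illustrative guess, not the shipped configuration:

```python
def shaped_reward(v_x, v_y, d, v_desired,
                  w_speed=1.0, w_perp=0.5, w_dist=0.25):
    """Combine the three slide components; all constants are assumptions."""
    # Speed term: positive near the desired speed, negative far from it,
    # asymmetric so driving too fast is punished more than too slow.
    err = v_x - v_desired
    scale = 2.0 if err > 0 else 1.0          # harsher when faster
    r_speed = 1.0 - scale * abs(err) / max(v_desired, 1e-6)

    # Perpendicular velocity: only negative; power < 1 punishes
    # even small drifts hard.
    r_perp = -abs(v_y) ** 0.5

    # Distance from path: only negative; power > 1 punishes small
    # offsets gently and large ones severely.
    r_dist = -abs(d) ** 2

    return w_speed * r_speed + w_perp * r_perp + w_dist * r_dist
```

For example, at the desired speed with no drift the reward is at its maximum, and overshooting the speed by 2 m/s scores worse than undershooting by the same amount.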

SLIDE 50

Results

The learning curve

SLIDE 51

Results

The learning curve

SLIDE 52

Results

Good results after 15 minutes of training

SLIDE 53

Results

One model to rule them all?

SLIDE 54

Results

One model to rule them all?

» Each vehicle has its own physical model
» Accelerate, Steer, and Brake all react differently across vehicles
» We can still group physically similar vehicles
» Bigger vehicles (buses, trucks, …) need more state information

SLIDE 55

Results

One model to rule them all?

SLIDE 56

Results

One model to rule them all?

SLIDE 57

Results

Need to deal with a lot of variance

» The game is not deterministic
» Even with seeding, results differ between runs
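One common way to cope with such non-determinism (an assumption here, not something the talk prescribes) is to average returns over many evaluation runs and compare models on the mean and spread rather than on single rollouts:

```python
import random
import statistics

def evaluate(run_episode, n_runs=20):
    """Run a stochastic evaluation several times and summarize the spread."""
    returns = [run_episode() for _ in range(n_runs)]
    return statistics.mean(returns), statistics.stdev(returns)

# Hypothetical usage with a noisy stand-in for one evaluation episode
random.seed(0)
mean, std = evaluate(lambda: 100.0 + random.gauss(0.0, 5.0))
```

With the spread in hand, a "slight improvement" like the supervised-vs-original comparison earlier can be judged against run-to-run noise instead of being read off one curve.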

SLIDE 58

Tools

Multi-dimensional function visualizer

» Developed with PyQt5
» Load the models and plot the output

SLIDE 59

(same content as SLIDE 58)

SLIDE 60

Tools

Archives reader and comparison tools

» Developed with PyQt5
» Load the metrics and plot them to compare models

SLIDE 61

What’s next?

Awesome stuff!

» Analyze what could be introduced into the game
  ? Level of quality
  ? Robustness
  ? Computation time
  ? Learning time
  ? Size of models in memory
» Try other learning algorithms
» Optimize the workflow with multiple agents

SLIDE 62

Conclusion

Reinforcement learning is promising
  ▪ Found efficient fighting behavior in For Honor
  ▪ Already better driving in Watch_Dogs 2 compared to the PID
It is just the beginning…
  ▪ Still a lot of work and research to do
  ▪ Not ready to use in production... yet
The future? Player-facing AIs

SLIDE 63

Do you have questions?

Thank you!

laforge@ubisoft.com

PS: we’re hiring (!)