REINFORCEMENT LEARNING AS A PRODUCTION TOOL ON AAA GAMES
Olivier Delalleau & Adrien Logut
2017-10-18
Build AIs that can play games like our players would
FOR HONOR: Olivier Delalleau, Frédéric Doll, Maxim Peter
WATCH_DOGS 2: Adrien Logut, Olivier Lamothe-Penelle
Automated testing
Design assistance
In-game AI
Evolutionary methods
Imitation learning
[Google Trends chart: "genetic algorithm" vs. "reinforcement learning" vs. "imitation learning" (recent, incomplete data)]
CENTURION GLADIATOR HIGHLANDER SHINOBI
Autonomous Driving
Open world game within a living city
» Takes place in San Francisco
» Living city: cars in the street
» Cars need to be controlled by an AI
How is it currently done?
» PID controller with custom curves
» Hand-tuned curves
▪ Takes a lot of time
▪ Not precise
➢ How about reinforcement learning?
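For context, a PID controller of the kind used as the baseline might look like the sketch below. The gains, the toy vehicle model, and the target-speed loop are all illustrative assumptions, not the actual shipped controller.

```python
# Minimal PID sketch (hypothetical gains; not Ubisoft's controller).
class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        # Accumulate the integral term and estimate the derivative.
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Example: drive a toy vehicle's speed toward a target.
pid = PID(kp=0.5, ki=0.05, kd=0.1)
speed, target = 0.0, 10.0
for _ in range(1000):
    throttle = pid.step(target - speed, dt=0.1)
    speed += 0.1 * throttle  # toy vehicle dynamics
```

The "takes a lot of time" point is about tuning `kp`, `ki`, `kd` (and the custom curves) by hand for every vehicle and situation.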
What the agent can see and do
» Actions: acceleration [0, 1], brake [-1, 1], steering [-1, 1]
» State: distance to the road: 0.1, velocity: 0.3, desired speed: 0.9, ...
» Reward: +3
[Diagram: agent-environment loop; the environment sends a state and a reward to the agent, and the agent sends back an action]
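The state/action/reward loop above can be sketched as a minimal gym-style interface. The `DrivingEnv` class, its toy dynamics, and its reward are hypothetical stand-ins for the game, using the state and action names from the slide.

```python
# Toy stand-in for the game's agent-environment loop (all dynamics invented).
import random

class DrivingEnv:
    """State = (distance_to_road, velocity, desired_speed)."""
    def reset(self):
        self.state = (0.1, 0.3, 0.9)
        return self.state

    def step(self, action):
        accel, brake, steering = action  # steering ignored in this toy model
        dist, vel, desired = self.state
        vel = max(0.0, vel + 0.1 * accel - 0.1 * max(brake, 0.0))
        self.state = (dist, vel, desired)
        reward = -abs(vel - desired)  # closer to desired speed -> higher reward
        return self.state, reward

env = DrivingEnv()
state = env.reset()
reward = 0.0
for _ in range(10):
    action = (random.uniform(0, 1), random.uniform(-1, 1), random.uniform(-1, 1))
    state, reward = env.step(action)
```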
What the agent can see and do
➢ Continuous states
➢ Neural network to approximate Q(s_t, a_t)
➢ Continuous actions
➢ Cannot use the greedy policy from DQN (For Honor)
➢ Neural network to approximate a policy a_t ~ π(s_t)
Actor-Critic Architecture
» Two neural networks approximate functions
▪ Actor: a_t ~ π(s_t)
▪ Critic: Q(s_t, a_t)
» Critic update: expected discounted reward, Q-learning (same as For Honor)
» Actor update: policy gradient
∇_θπ J = 𝔼[ ∇_a Q(s, a)|_{s=s_t, a=π(s_t)} · ∇_θπ π(s)|_{s=s_t} ]
[Diagram: s_t flows into the actor, which outputs a_t; the critic evaluates Q(s_t, a_t); updates flow back into both networks]
Actor update - Intuition
» Actor update: policy gradient
» Intuition:
▪ The critic gives the direction in which to update the actor
▪ "In which way should I change the actor's parameters in order to maximize the critic's output?"
[Diagram: actor and critic networks with update arrows]
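A minimal sketch of this actor update, using tiny linear "networks" in NumPy so the chain rule is visible. The talk used real neural networks; every name, dimension, and learning rate below is an illustrative assumption.

```python
# Policy-gradient actor update, linear actor/critic (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 3, 2
W_actor = rng.normal(size=(action_dim, state_dim)) * 0.1   # actor params θπ
w_critic = rng.normal(size=state_dim + action_dim) * 0.1   # critic params θQ

def actor(s):            # a = π(s)
    return W_actor @ s

def critic(s, a):        # Q(s, a)
    return w_critic @ np.concatenate([s, a])

def dQ_da(s, a):         # ∇_a Q(s, a): for a linear critic, the action weights
    return w_critic[state_dim:]

s = rng.normal(size=state_dim)
a = actor(s)
q_before = critic(s, a)

# Chain rule: ∇_θπ J = ∇_a Q(s, a)|_{a=π(s)} · ∇_θπ π(s); for a linear
# actor, ∇_W π(s) is just s, so the gradient is an outer product.
grad_W = np.outer(dQ_da(s, a), s)
W_actor = W_actor + 0.01 * grad_W   # ascend: move the actor to raise Q
```

After the step, the critic's value for the actor's action can only go up (or stay equal), which is exactly the intuition on the slide.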
Since we have the PID, what about imitating it?
» Supervised learning on the actor
▪ Updated with the mean squared error between the actor's output and the PID's output
δ_t = (a_{t,actor} − a_{t,PID})²
[Diagram: s_t is fed to both the actor and the PID; the actor's update comes from the MSE between a_{t,actor} and a_{t,PID}]
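This imitation step might be sketched as below: a linear actor regressed onto a stand-in "PID" policy by stochastic gradient descent on the MSE. Everything here (the linear models, the learning rate, the data distribution) is an illustrative assumption, not the production setup.

```python
# Supervised pre-training of the actor toward a stand-in PID policy.
import numpy as np

rng = np.random.default_rng(1)
state_dim, action_dim = 3, 2
W = np.zeros((action_dim, state_dim))             # actor parameters
W_pid = rng.normal(size=(action_dim, state_dim))  # hypothetical PID policy

lr = 0.05
for _ in range(500):
    s = rng.normal(size=state_dim)
    a_actor, a_pid = W @ s, W_pid @ s
    # δ_t = (a_actor - a_pid)²; gradient step on the MSE
    W -= lr * np.outer(a_actor - a_pid, s)

mse = float(np.mean((W - W_pid) ** 2))  # actor has converged to the PID
```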
Supervised vs Original – Slight improvement
Defining the reward function
» The reward is the only signal received by the agent
▪ Am I doing well or badly?
» This is the key part of reinforcement learning
▪ Called reward shaping
▪ Requires a good understanding of the problem
» For driving:
▪ Follow the given path at the right speed
▪ Stop when needed
Defining the reward function - Configuration
» Three main components are measured:
» Velocity along the path: v_x
» Velocity perpendicular to the path: v_y
» Distance from the path: d
[Diagram: vehicle on the path, with v_x, v_y and the distance d annotated]
Defining the reward function – Desired speed
» Positive reward when driving close to the desired speed
» Negative when far from the desired speed
» Punish driving too fast more than driving too slow
» Desired speed in red
[Plot: reward vs. velocity along the path, with the desired speed marked in red]
Defining the reward function – Velocity y
» Only negative reward
» Want to punish harder for small values (power < 1)
[Plot: reward vs. velocity perpendicular to the path]
Defining the reward function – Distance
» Only negative reward
» Want to punish less for small values (power > 1)
[Plot: reward vs. distance from the path]
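Putting the three components together, a reward of the shape described on these slides might look like the following sketch. All constants and exponents are guesses for illustration, not the actual tuned values.

```python
# Shaped driving reward (illustrative constants, not the real tuning).
def reward(v_x, v_y, dist, desired_speed):
    # Positive near the desired speed, negative far from it;
    # overspeed is punished twice as hard as underspeed.
    err = v_x - desired_speed
    speed_term = 1.0 - (2.0 * err if err > 0 else -err) / max(desired_speed, 1e-6)
    # Power < 1: even small sideways velocities are punished hard.
    v_y_term = -abs(v_y) ** 0.5
    # Power > 1: small deviations from the path are tolerated.
    dist_term = -dist ** 2
    return speed_term + v_y_term + dist_term

r_good = reward(v_x=10.0, v_y=0.0, dist=0.0, desired_speed=10.0)
r_bad = reward(v_x=15.0, v_y=2.0, dist=3.0, desired_speed=10.0)
```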
The learning curve
» Good results after 15 minutes
One model to rule them all?
» Each vehicle has its own physics model
» Accelerate, steer and brake all react differently
» We can still group vehicles with similar physics
» Need more state info for bigger vehicles (buses, trucks, …)
Need to deal with a lot of variance
» Game is not deterministic
» Even with seeding, different results
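One common way to handle such variance, sketched below, is to evaluate every model over many runs and compare means and spreads rather than single episodes. The `episode_return` function is a hypothetical noisy stand-in for an evaluation episode in the game.

```python
# Evaluate a policy over many seeded episodes and summarize the spread.
import random
import statistics

def episode_return(seed):
    """Stand-in for one noisy evaluation episode in the game."""
    rng = random.Random(seed)
    return 100.0 + rng.gauss(0.0, 15.0)  # same policy, different outcomes

returns = [episode_return(seed) for seed in range(30)]
mean = statistics.mean(returns)
stdev = statistics.stdev(returns)  # how noisy is this policy's evaluation?
```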
Multi-dimensional function visualizer
» Developed with PyQt5
» Load the models and plot the output
Archives reader and comparison tools
» Developed with PyQt5
» Load the metrics and plot them to compare models
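Minus the PyQt5 GUI, the core idea of such a visualizer is to evaluate the model along one input dimension while the others are held fixed, then plot the resulting 1-D slice. The `model` function below is a hypothetical stand-in for a loaded network.

```python
# 1-D slice through a multi-dimensional function (stand-in for a loaded model).
def model(state):
    dist, vel, desired = state
    return -dist ** 2 + 1.0 - abs(vel - desired)

def slice_1d(model, base_state, dim, lo, hi, n=50):
    """Sweep one input dimension; returns (xs, ys) ready for plotting."""
    xs, ys = [], []
    for i in range(n):
        x = lo + (hi - lo) * i / (n - 1)
        state = list(base_state)
        state[dim] = x          # vary only this dimension
        xs.append(x)
        ys.append(model(state))
    return xs, ys

xs, ys = slice_1d(model, base_state=[0.0, 0.3, 0.9], dim=0, lo=-2.0, hi=2.0)
```

A GUI like the one in the talk would wrap this in widgets that pick `dim`, the range, and the fixed values of the other inputs.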
Awesome stuff!
» Analyze what could be introduced into the game
▪ Level of quality
▪ Robustness
▪ Computation time
▪ Learning time
▪ Size of models in memory
» Try other learning algorithms
» Optimize workflow with multiple agents
Reinforcement learning is promising
» Found efficient fighting behavior in For Honor
» Already better driving in Watch_Dogs 2 compared to PID
It is just the beginning…
» Still a lot of work and research to do
» Not ready to use in production... yet
The future? Player-facing AIs
laforge@ubisoft.com
PS: we’re hiring (!)