Reinforcement Learning as a Production Tool on AAA Games




  1. Reinforcement Learning as a Production Tool on AAA Games – Olivier DELALLEAU & Adrien LOGUT – 2017-10-18

  2. Agenda: Project Overview, Fighting in For Honor, Driving in Watch_Dogs 2

  3. Project Overview: Build AIs that can play games like our players would. FOR HONOR: Olivier Delalleau, Frédéric Doll, Maxim Peter. WATCH_DOGS 2: Adrien Logut, Olivier Lamothe-Penelle.

  4. Motivations: Automated testing, Design assistance, In-game AI

  5. Why Reinforcement Learning? [Google Trends chart comparing “reinforcement learning”, “genetic algorithm” and “imitation learning”] Alternatives considered: evolutionary methods, imitation learning

  6. RL & video games (recent, incomplete list): Atari, Doom, Minecraft, Universe, SNES, StarCraft II, Dota 2, Unity

  7.  

  8. CENTURION GLADIATOR SHINOBI HIGHLANDER

  9. [bullet-point slide; text not captured in the transcript]

  10. Autonomous Driving

  11. Watch_Dogs 2: Open-world game within a living city » Takes place in San Francisco » Living city: cars in the streets » Cars need to be controlled by an AI

  12. Objectives: How is it currently done? » PID controller with custom curves » Hand-tuned curves ▪ Take a lot of time ▪ Not precise ➢ How about Reinforcement Learning?
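For context, the baseline described here is a classic PID controller. Below is a textbook sketch; the gains, the error definition, and the usage line are illustrative assumptions, not the game's tuned curves:

```python
class PID:
    """Textbook PID controller; the gains here are illustrative, not Ubisoft's."""
    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error: float, dt: float) -> float:
        # Accumulate the integral term and estimate the derivative of the error.
        self.integral += error * dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

# Hypothetical usage: steer back toward the path at 30 Hz.
# steering = PID(kp=1.0, ki=0.1, kd=0.05).step(error=lateral_offset, dt=1 / 30)
```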

  13. Reinforcement Learning: What the agent can see and do [Agent-environment loop: the environment sends a state and a reward to the agent; the agent sends back an action] State: distance to the road: 0.1, velocity: 0.3, desired speed: 0.9, … Reward: +3 Actions: acceleration in [0, 1], brake in [-1, 1], steering in [-1, 1]
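The slide's numbers suggest an observation/action interface roughly like the following sketch. The dictionary layout and the gym-style `env.step` call are assumptions for illustration, not the game's actual API:

```python
# State: what the agent observes each tick (example values from the slide).
state = {
    "distance_to_road": 0.1,
    "velocity": 0.3,
    "desired_speed": 0.9,
    # ... (the slide truncates the full state here)
}

# Action: bounded continuous controls produced by the agent.
action = {
    "acceleration": 0.7,  # in [0, 1]
    "brake": -1.0,        # in [-1, 1]
    "steering": 0.05,     # in [-1, 1]
}

# One interaction step (hypothetical gym-style interface):
# next_state, reward = env.step(action)   # e.g. reward = +3 as on the slide
```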

  14. Reinforcement Learning: What the agent can see and do ➢ Continuous states ➢ Neural network to approximate Q(s_t, a_t)

  15. Reinforcement Learning: What the agent can see and do ➢ Continuous states ➢ Neural network to approximate Q(s_t, a_t) ➢ Continuous actions ➢ Cannot use the greedy policy from DQN (For Honor) ➢ Neural network to approximate a policy a_t ~ μ(s_t)
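A minimal sketch of the two approximators these slides call for, written in PyTorch (the framework choice and layer sizes are assumptions; the talk does not say what was used): a Q-network for Q(s_t, a_t) over continuous states, and a deterministic policy network for a_t ~ μ(s_t) with bounded outputs.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 10, 3  # assumed sizes, for illustration only

class Actor(nn.Module):
    """Deterministic policy: a = mu(s), outputs squashed to [-1, 1]."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.ReLU(),
            nn.Linear(64, ACTION_DIM), nn.Tanh(),
        )
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Action-value function: Q(s, a), with state and action concatenated."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```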

  16. Reinforcement Learning: Actor-Critic architecture » Two neural networks approximate the functions: ▪ Actor: a_t ~ μ(s_t) ▪ Critic: Q(s_t, a_t) [Diagram: s_t → Actor → a_t; (s_t, a_t) → Critic → Q(s_t, a_t)] » Critic update: Q-learning on the expected discounted reward (same as For Honor) » Actor update: policy gradient ∇_{θ^μ} J = ∇_a Q(s, a | θ^Q)|_{s=s_t, a=μ(s_t)} · ∇_{θ^μ} μ(s | θ^μ)|_{s=s_t}

  17. (same content as slide 16)

  18. Reinforcement Learning: Actor update - Intuition » Actor update: policy gradient » Intuition: the critic gives the direction in which to update the actor ▪ “In which way should I change the actor parameters in order to maximize the critic output, given a state?” [Diagram: s_t → Actor → a_t; (s_t, a_t) → Critic → Q(s_t, a_t)]
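Slides 16-18 describe a deterministic actor-critic update (DDPG-style). A hedged sketch of one training step, reusing the `Actor`/`Critic` classes from the sketch above; the learning rates and discount are assumed values, and the target networks and replay buffer of a full implementation are omitted for brevity:

```python
import torch
import torch.nn.functional as F

actor, critic = Actor(), Critic()
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
GAMMA = 0.99  # assumed discount factor

def train_step(s, a, r, s_next):
    # Critic update (Q-learning): regress Q(s, a) toward r + gamma * Q(s', mu(s')).
    # A full DDPG implementation would use slowly updated target networks here.
    with torch.no_grad():
        target = r + GAMMA * critic(s_next, actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update (policy gradient): maximize Q(s, mu(s)) w.r.t. the actor
    # parameters by descending its negation.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```

Minimizing −Q(s, μ(s)) with respect to the actor parameters is exactly the chain rule the slide writes out: ∇_a Q flows back through ∇_{θ^μ} μ.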

  19. First experience: Since we have the PID, what about imitating it? » Supervised learning on the actor ▪ Updated with the mean squared error between the actor output and the PID output: δ_t = (a_{t,actor} − a_{t,PID})² [Diagram: s_t → Actor → a_{t,actor}; s_t → PID → a_{t,PID}]
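A minimal sketch of that warm-start, continuing the names from the previous sketches (`actor`, `actor_opt`, `F`): regress the actor's output onto the recorded PID actions with an MSE loss, i.e. δ_t = (a_{t,actor} − a_{t,PID})².

```python
def imitation_step(s, a_pid):
    """One supervised update: make the actor mimic the PID controller.
    s: batch of states; a_pid: PID actions recorded for the same states."""
    loss = F.mse_loss(actor(s), a_pid)  # delta_t = (a_actor - a_pid)^2, averaged
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
```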

  20. First experience: Supervised vs Original – slight improvement

  21. (same text as slide 20)

  22. Reward Shaping: Defining the reward function » The reward is the only signal received by the agent ▪ “Am I doing well or badly?” » This is the key part of reinforcement learning ▪ Called reward shaping ▪ Requires a good understanding of the problem » For driving: ▪ Follow the given path at the right speed ▪ Stop when needed

  23. Reward Shaping: Defining the reward function - Configuration » Three main components are measured: ▪ Velocity along the path v_x ▪ Velocity perpendicular to the path v_y ▪ Distance from the path d [Diagram: the vehicle’s velocity v decomposed into v_x and v_y relative to the path, at distance d from it]

  24. Reward Shaping: Defining the reward function – Desired speed » Positive reward when driving close to the desired speed » Negative when far from the desired speed » Punish more when driving faster than slower [Plot: reward vs. velocity along the path (v_x), desired speed marked in red]

  25. Reward Shaping: Defining the reward function – Velocity y » Only negative reward » Want to punish harder for small values (power < 1) [Plot: reward vs. velocity perpendicular to the path (v_y)]

  26. Reward Shaping: Defining the reward function – Distance » Only negative reward » Want to punish less for small values (power > 1) [Plot: reward vs. distance from the path (d)]
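Putting slides 23-26 together, here is a sketch of what such a shaped reward could look like. Every coefficient and exponent below is an illustrative guess that only respects the stated shapes (asymmetric speed penalty, power < 1 on v_y, power > 1 on d), not the values used in the game:

```python
def shaped_reward(v_x, v_y, d, desired_speed):
    """Illustrative shaped reward for path following.
    v_x: velocity along the path; v_y: velocity perpendicular to it;
    d: distance from the path. All coefficients are assumptions."""
    # Desired-speed term: positive near the target speed, negative far away;
    # overspeed is punished more than underspeed (asymmetric slopes).
    err = v_x - desired_speed
    speed_term = 1.0 - (4.0 * err if err > 0 else -2.0 * err)

    # Perpendicular velocity: only negative; a power < 1 makes even small
    # sideways drift relatively costly.
    side_term = -abs(v_y) ** 0.5

    # Distance from path: only negative; a power > 1 punishes small
    # deviations less than large ones.
    dist_term = -abs(d) ** 2

    return speed_term + side_term + dist_term
```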

  27. Results: The learning curve

  28. (same text as slide 27)

  29. Results: Good results after 15 minutes

  30. Results One model to rule them all?

  31. Results: One model to rule them all? » Each vehicle has its own physical model » Accelerate, steer and brake all react differently across vehicles » We can still group physically similar vehicles » Bigger vehicles (buses, trucks, …) need more state info

  32. (same text as slide 30)

  33. (same text as slide 30)

  34. Results: Need to deal with a lot of variance » The game is not deterministic » Even with seeding, runs give different results

  35. Tools: Multi-dimensional function visualizer » Developed with PyQt5 » Loads the models and plots their output

  36. (same text as slide 35)
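The visualizer's code isn't shown in the talk. As a rough illustration of the core idea (sweep one input dimension of a trained model while holding the others fixed, then plot the outputs), here is a sketch using matplotlib in place of the talk's PyQt5 GUI; the function name and signature are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt
import torch

def plot_model_slice(model, base_state, dim, lo=-1.0, hi=1.0, n=100):
    """Sweep input dimension `dim` of `model` over [lo, hi], holding the
    other inputs at `base_state`, and plot every output dimension."""
    xs = np.linspace(lo, hi, n)
    states = np.tile(base_state, (n, 1)).astype(np.float32)
    states[:, dim] = xs
    with torch.no_grad():
        ys = model(torch.from_numpy(states)).numpy()
    for k in range(ys.shape[1]):
        plt.plot(xs, ys[:, k], label=f"output {k}")
    plt.xlabel(f"input dim {dim}")
    plt.ylabel("model output")
    plt.legend()
    plt.show()

# e.g. plot_model_slice(actor, base_state=np.zeros(STATE_DIM), dim=0)
```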

  37. Tools: Archive reader and comparison tool » Developed with PyQt5 » Loads the metrics and plots them to compare models

  38. What’s next? Awesome stuff! » Analyze what could be introduced into the game: level of quality? robustness? computation time? learning time? size of the models in memory? » Try other learning algorithms » Optimize the workflow with multiple agents

  39. Conclusion: Reinforcement learning is promising » Found efficient fighting behavior in For Honor » Already better driving in Watch_Dogs 2 compared to the PID controller » It is just the beginning… » Still a lot of work and research to do » Not ready for use in production… yet » The future? Player-facing AIs

  40. Thank you! Do you have questions? laforge@ubisoft.com PS: we’re hiring (!)
