A Deep Journey of Playing Games with RL
NSE Seminar
Kim Hammar, kimham@kth.se
January 31, 2020
[Figure: AI & machine learning meets games; a neural network, the agent-environment loop, a convolutional network playing chess, a game tree of depth 3 with branching factor 3, and a shogi board]
Why Combine the Two?
▸ AI & games have a long history (Turing '50 & Minsky '60)
▸ Games are simple to evaluate, reproducible, and controllable, with a quick feedback loop
▸ Games are a common benchmark for the research community
[1] Murray Campbell, A. Joseph Hoane, and Feng-hsiung Hsu. "Deep Blue". In: Artificial Intelligence 134.1-2 (2002), pp. 57–83. doi: 10.1016/S0004-3702(01)00129-1.
[2] Gerald Tesauro. "TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play". In: Neural Computation 6.2 (Mar. 1994), pp. 215–219. doi: 10.1162/neco.1994.6.2.215.
[3] A. L. Samuel. "Some Studies in Machine Learning Using the Game of Checkers". In: IBM Journal of Research and Development 3.3 (July 1959), pp. 210–229. doi: 10.1147/rd.33.0210.
▸ AlphaGo [4]: Nature 2016, 6.5k citations
▸ AlphaGo Zero [5]: Nature 2017, 2.5k citations
▸ AlphaZero [6]: Science 2018, 400 citations
[4] David Silver et al. "Mastering the Game of Go with Deep Neural Networks and Tree Search". In: Nature 529.7587 (Jan. 2016), pp. 484–489. doi: 10.1038/nature16961.
[5] David Silver et al. "Mastering the game of Go without human knowledge". In: Nature 550 (Oct. 2017), pp. 354–359. doi: 10.1038/nature24270.
[6] David Silver et al. "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play". In: Science 362.6419 (2018), pp. 1140–1144. url: http://science.sciencemag.org/content/362/6419/1140/tab-pdf.
▸ Notation: policy π, state s, reward r, action a
▸ Agent's goal: maximize the return R_t = ∑_{k=0}^{∞} γ^k r_{t+k+1}, where 0 ≤ γ ≤ 1
▸ RL's goal: find an optimal policy π* = arg max_π E[R ∣ π]
[Figure: the agent-environment loop; in state s_t the agent takes action a_t, and the environment returns reward r_{t+1} and next state s_{t+1}]
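To make the return concrete, here is a minimal sketch of the discounted return for a finite episode (a hedged Python illustration; the reward sequence and γ are made-up values, not from the talk):

```python
# Minimal sketch: the return R_t = sum_{k=0}^inf gamma^k * r_{t+k+1}, truncated
# to a finite episode. The rewards and gamma below are illustrative values.

def discounted_return(rewards, gamma=0.99):
    """Sum future rewards, weighting the reward k steps ahead by gamma^k."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# A reward of 1 arriving as r_{t+4} (k = 3) is discounted by gamma^3:
print(discounted_return([0.0, 0.0, 0.0, 1.0]))  # 0.99**3 ~ 0.9703
```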
Elevator Agent
[Figure: the elevator agent; a neural network maps observations (elevator positions) to an action a_{t+1} ∈ select(up, down, wait, stop at floor 1, ⋯, n), and receives a reward r_{t+1} ∈ ℝ]
[7] Robert H. Crites and Andrew G. Barto. "Improving Elevator Performance Using Reinforcement Learning". In: Proceedings of the 8th International Conference on Neural Information Processing Systems (NIPS'95). Denver, Colorado: MIT Press, 1995, pp. 1017–1023.
DQN Agent
[Figure: the DQN agent; observations are stacks of screen frames ∈ ℝ^{4×84×84}, the network outputs Q(s, a₁), ⋯, Q(s, a₁₈), and the reward r_{t+1} ∈ ℝ]
[8] Volodymyr Mnih et al. "Human-level control through deep reinforcement learning". In: Nature 518.7540 (Feb. 2015), pp. 529–533. doi: 10.1038/nature14236.
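As a hedged illustration of how a DQN-style agent turns Q-values into moves, here is a sketch of ε-greedy action selection (the Q-values below are random stand-ins for the network's outputs over Atari's 18 actions):

```python
import random

# Sketch of epsilon-greedy action selection over the Q-values Q(s, a_1..a_18).
# The q_values list is a random stand-in for the Q-network's output.

def epsilon_greedy(q_values, epsilon=0.05):
    """With probability epsilon explore uniformly; otherwise take argmax_a Q(s, a)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                 # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

q_values = [random.gauss(0.0, 1.0) for _ in range(18)]  # 18 Atari actions
print(epsilon_greedy(q_values))
```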
The optimal value of a state:

optimal(s_t) = max_π E[∑_{k=1}^{∞} γ^{k−1} r_{t+k} ∣ s_t]
[9] Richard Bellman. Dynamic Programming. Dover Publications, 1957. isbn: 9780486428093.
optimal(s_t) = max_π E[∑_{k=1}^{∞} γ^{k−1} r_{t+k} ∣ s_t]
             = max_π E[r_{t+1} + ∑_{k=2}^{∞} γ^{k−1} r_{t+k} ∣ s_t]
             = max_{a_t} E[r_{t+1} + max_π E[∑_{k=2}^{∞} γ^{k−1} r_{t+k} ∣ s_{t+1}] ∣ s_t]
             = max_{a_t} E[r_{t+1} + γ max_π E[∑_{k=2}^{∞} γ^{k−2} r_{t+k} ∣ s_{t+1}] ∣ s_t]
             = max_{a_t} E[r_{t+1} + γ · optimal(s_{t+1}) ∣ s_t]
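The derivation turns directly into an algorithm: repeatedly apply the backup optimal(s) ← max_a E[r_{t+1} + γ · optimal(s_{t+1})]. Below is a minimal value-iteration sketch on a made-up two-state MDP (the states, transitions, and rewards are purely illustrative):

```python
# Value iteration: apply the Bellman backup
#   V(s) <- max_a sum_{s'} P(s'|s,a) * (r + gamma * V(s'))
# until convergence. The two-state MDP below is made up for illustration.

# transitions[state][action] = list of (probability, reward, next_state)
transitions = {
    "s0": {"stay": [(1.0, 0.0, "s0")],
           "go":   [(0.9, 1.0, "s1"), (0.1, 0.0, "s0")]},
    "s1": {"stay": [(1.0, 2.0, "s1")],
           "go":   [(1.0, 0.0, "s0")]},
}
gamma = 0.9
V = {s: 0.0 for s in transitions}

for _ in range(200):  # enough synchronous sweeps to converge on this tiny MDP
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, r, s2 in outcomes)
                for outcomes in actions.values())
         for s, actions in transitions.items()}

print(V)  # staying in s1 (+2 forever) dominates: V(s1) = 20, V(s0) ~ 18.8
```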
Deep Reinforcement Learning
[Figure: the deep RL pipeline; convolutional layers extract features (x₁, ⋯, xₙ), a model with parameters θ makes a prediction ŷ, a loss L(y, ŷ) is computed, and its gradient ∇_θ L(y, ŷ) updates the model]
Algorithms: DQN, DDPG, Double-DQN
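The predict, loss, gradient loop in the figure fits in a few lines. Here is a hedged sketch of a single gradient-descent step on a linear model with a squared loss (the data, model, and learning rate are illustrative, not the architecture from the talk):

```python
import numpy as np

# One gradient step theta <- theta - alpha * grad_theta L(y, y_hat) for a
# linear model y_hat = theta . x with squared loss. Purely illustrative.
rng = np.random.default_rng(0)
theta = rng.normal(size=3)                 # model parameters theta
x, y = np.array([1.0, 2.0, -1.0]), 4.0     # one (features, target) pair
alpha = 0.1                                # learning rate

y_hat = theta @ x                          # prediction y_hat
loss = (y - y_hat) ** 2                    # loss L(y, y_hat)
grad = -2.0 * (y - y_hat) * x              # gradient of the loss w.r.t. theta
theta -= alpha * grad                      # parameter update

print(loss, (y - theta @ x) ** 2)          # the loss shrinks after the step
```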
The RL landscape:
▸ Deep Reinforcement Learning; algorithms: DQN, DDPG, Double-DQN
▸ Direct Policy Search (search directly in policy space); algorithms: REINFORCE, Evolutionary Search, DPG, Actor-Critic
▸ Value-Based Methods: Value Iteration, Monte-Carlo, TD(0), Q-Learning, Sarsa
▸ Dynamic Programming: back up values V(s) through the state graph
▸ Heuristic Search (selection, expansion, simulation); algorithms: MCTS, Minimax Search
▸ Mathematical Foundations: Markov Chains, Markov Reward Processes, Markov Decision Processes
AlphaGo [11]

[11] David Silver et al. "Mastering the Game of Go with Deep Neural Networks and Tree Search". In: Nature 529.7587 (Jan. 2016), pp. 484–489. doi: 10.1038/nature16961.
▸ The world's oldest game: 3,000 years old, with over 40M players worldwide
▸ To win: capture the most territory on the board
  ▸ Surrounded stones/areas are captured and removed
▸ Why is it so hard for computers? 10^170 unique states and a branching factor of ≈ 250
  ▸ High branching factor, large board (19 × 19), positions that are hard to evaluate, etc.
▸ How do you program a computer to play a board game?
▸ Simplest approach:
  ▸ (1) Program a game tree; (2) assume the opponent thinks like you; (3) look ahead and evaluate each move (see the minimax sketch below)
  ▸ Requires knowledge of the game rules and an evaluation function
[Figure: a game tree of depth 3 and branching factor 3, plies 1–3]
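A hedged sketch of this simplest approach: minimax over an explicit game tree, where leaves carry evaluation-function scores. The tree below has depth 3 and branching factor ≈ 3 like the figure, but its values are made up:

```python
# Minimax over an explicit game tree: look ahead, assume the opponent
# evaluates positions exactly like you, and back up the scores. The tree
# (depth 3, branching factor ~3) and its leaf values are made up.

def minimax(node, maximizing):
    if isinstance(node, (int, float)):            # leaf: evaluation function
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

tree = [[[3, 5, 1], [4, 2]],       # each inner list is one possible reply
        [[6, 1], [0, 7, 2]],
        [[5, 5], [8, 3, 1]]]
print(minimax(tree, maximizing=True))  # 6: best score with optimal play by both
```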
▸ Atoms in the universe: ≈ 10^80
▸ Unique states: Go 10^170, chess 10^47
▸ Game tree complexity: Go 10^360, chess 10^123
▸ Average branching factor: Go 250, chess 35
▸ Board size (positions): Go 361, chess 64
[Plot: number of positions 250^depth as a function of search depth; already on the order of 10^9 around depth 4]
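The blowup behind the plot is plain arithmetic: an exhaustive look-ahead of d plies with branching factor b visits on the order of b^d positions, for example:

```python
# Order-of-magnitude size of an exhaustive look-ahead: b^d positions for
# branching factor b and depth d plies (chess b ~ 35, Go b ~ 250).
for depth in (2, 4, 6, 8):
    print(f"depth {depth}: chess ~ {35 ** depth:.1e}, go ~ {250 ** depth:.1e}")
# Go passes the plot's ~10^9 scale already around depth 4 (250^4 ~ 3.9e9).
```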
Monte-Carlo Tree Search (MCTS), sketched below:
▸ Selection: the selection function is applied recursively until a leaf node is reached
▸ Expansion: one or more nodes are created in the search tree
▸ Simulation: a rollout game is played to a terminal state using the model and policy π
▸ Backpropagation: the result of the rollout is backpropagated up the tree
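Here is a minimal, hedged UCT-style MCTS sketch. The game of Nim (take 1-3 stones; whoever takes the last stone wins) stands in for the game model so that all four phases fit in a few lines; this is not the Go implementation from the talk:

```python
import math, random

# Minimal UCT-style MCTS on Nim: selection, expansion, simulation,
# backpropagation. Nim is a stand-in model, chosen only for brevity.

class Node:
    def __init__(self, stones, to_play):
        self.stones, self.to_play = stones, to_play
        self.children = {}               # move -> child Node
        self.visits, self.wins = 0, 0.0  # wins counted for self.to_play

def uct_move(node, c=1.4):
    """Selection rule: the parent's value of a child is 1 - the child's win rate."""
    def score(move):
        ch = node.children[move]
        return (1 - ch.wins / ch.visits) + c * math.sqrt(math.log(node.visits) / ch.visits)
    return max(node.children, key=score)

def rollout(stones, to_play):
    """Simulation: random play to a terminal state; returns the winner."""
    while stones > 0:
        stones -= random.randint(1, min(3, stones))
        if stones == 0:
            return to_play            # this player took the last stone
        to_play = 1 - to_play
    return 1 - to_play                # already terminal: previous player won

def mcts(root, iterations=5000):
    for _ in range(iterations):
        node, path = root, [root]
        # Selection: descend through fully expanded, non-terminal nodes.
        while node.stones > 0 and len(node.children) == min(3, node.stones):
            node = node.children[uct_move(node)]
            path.append(node)
        # Expansion: create one unexplored child of a non-terminal leaf.
        if node.stones > 0:
            move = next(m for m in range(1, min(3, node.stones) + 1)
                        if m not in node.children)
            node.children[move] = node = Node(node.stones - move, 1 - node.to_play)
            path.append(node)
        # Simulation from the new node, then Backpropagation along the path.
        winner = rollout(node.stones, node.to_play)
        for n in path:
            n.visits += 1
            n.wins += (winner == n.to_play)
    return max(root.children, key=lambda m: root.children[m].visits)

print(mcts(Node(stones=10, to_play=0)))  # usually 2: leaving 8 stones loses for the opponent
```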
▸ Brute-force search does not work
  ▸ At least not until hardware has improved a lot
▸ Human Go professionals rely on a small search guided by intuition/experience
▸ AlphaGo's approach: complement MCTS with "artificial intuition"
  ▸ The artificial intuition is provided by two neural networks: a value network and a policy network
AlphaGo Agent
[Figure: the AlphaGo agent; observations are board features ∈ ℝ^{K×19×19}, the reward r_{t+1} ∈ [−1, 1], and actions a_{t+1} are moves]
Training step 1: supervised learning from human expert moves (dataset D1)
▸ Supervised rollout policy network p_π(a∣s) (3 conv layers, used for fast rollouts); classification objective: min_{p_π} L(predicted move, expert move)
▸ Supervised policy network p_σ(a∣s) (5 conv layers); classification objective: min_{p_σ} L(predicted move, expert move)
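A hedged sketch of this supervised objective: softmax move classification trained with cross-entropy. A linear map stands in for the convolutional network, random vectors stand in for board features, and the labeling rule is an arbitrary stand-in for expert moves:

```python
import numpy as np

# Sketch of min L(predicted move, expert move): softmax classification with a
# cross-entropy loss. Linear weights W stand in for the conv net; random
# vectors stand in for board features; the label rule is arbitrary.
rng = np.random.default_rng(1)
n_features, n_moves = 8, 5
W = rng.normal(scale=0.1, size=(n_moves, n_features))  # "policy network"

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(200):                            # SGD over (position, move) pairs
    s = rng.normal(size=n_features)             # stand-in board features
    expert_move = int(abs(s[0] * 7)) % n_moves  # arbitrary "expert" label
    p = softmax(W @ s)                          # predicted distribution p_sigma(a|s)
    grad = np.outer(p, s)                       # d(cross-entropy)/dW = (p - onehot) s^T
    grad[expert_move] -= s
    W -= 0.1 * grad                             # descend the classification loss

print(softmax(W @ s)[expert_move])              # probability of the expert move grows
```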
Training step 2: reinforcement learning through self-play
▸ RL policy network p_ρ(a∣s): initialized with the weights of p_σ, then improved by policy gradient on self-play games, J(p_ρ) = E_{p_ρ}[∑_{t=0}^{∞} r_t], updated by ρ ← ρ + α∇_ρ J(p_ρ)
▸ The self-play games are collected into a dataset D2
▸ Supervised value network v_θ(s′): trained on D2 to predict game outcomes, min_{v_θ} L(predicted outcome, actual outcome)
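The policy-gradient step ρ ← ρ + α∇_ρ J(p_ρ) can be sketched with REINFORCE: push up the log-probability of moves from won games and push it down for lost ones. A single-state, three-action "game" with made-up win probabilities stands in for self-play Go here:

```python
import numpy as np

# REINFORCE sketch of rho <- rho + alpha * grad_rho J(p_rho). A single-state,
# three-action "game" with hidden win probabilities stands in for self-play Go.
rng = np.random.default_rng(2)
rho = np.zeros(3)                     # policy parameters: one logit per action
win_prob = np.array([0.2, 0.8, 0.5])  # hidden quality of each action (made up)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(2000):
    p = softmax(rho)
    a = rng.choice(3, p=p)                           # sample a move from p_rho
    z = 1.0 if rng.random() < win_prob[a] else -1.0  # game outcome
    grad_log = -p                                    # grad_rho log p_rho(a) ...
    grad_log[a] += 1.0                               # ... = onehot(a) - p
    rho += 0.05 * z * grad_log                       # reinforce winning moves

print(softmax(rho))  # probability mass concentrates on the best action (index 1)
```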
[Figure: the policy network p_σ/ρ(a∣s) evaluated on successive board positions]
[Figure: the value network v_θ(s′) evaluated on candidate successor positions]
How AlphaGo selects a move: input position → ML predictions → artificial intuition by NNs → guided look-ahead search (self-play) → selected move
▸ Input position: the current game state s
▸ ML predictions: the policy network p_σ/ρ(a∣s) outputs an action distribution and the value network v_θ(s′) outputs value estimates; together they provide artificial intuition about high-value actions
▸ Guided look-ahead search: MCTS (selection, expansion, simulation, backpropagation), guided by the two networks, searches ahead from s
▸ The action selected by the search is played
▸ In March 2016, AlphaGo won against Lee Sedol 4–1
▸ Lee Sedol was an 18-time world champion prior to the match
▸ Two famous moves: Move 37 by AlphaGo and Move 78 by Sedol
▸ Supervised learning
▸ Reinforcement learning
▸ Search
▸ Rules/domain knowledge
▸ Whatever it takes to win!
▸ AlphaGo used 1200 CPUs and 176 GPUs
AlphaGo Zero [12]

[12] David Silver et al. "Mastering the game of Go without human knowledge". In: Nature 550 (Oct. 2017), pp. 354–359. doi: 10.1038/nature24270.
▸ AlphaGo Zero is the successor to AlphaGo
▸ AlphaGo Zero is simpler and stronger than AlphaGo
  ▸ AlphaGo Zero beats AlphaGo 100–0 in matches
▸ AlphaGo Zero starts from zero domain knowledge
  ▸ It uses a single neural network (compared to 4 NNs in AlphaGo)
  ▸ It learns by self-play only (no supervised learning as in AlphaGo)
[Figure: a multi-headed ResNet f_θ(s) = (p, v); input state features pass through x × y convolutional kernels, the policy head outputs P[a₁∣s], ⋯, P[aₙ∣s], and the value head outputs v(s) ≈ P[win∣s]]
Self-play training:
[Figure: a self-play game s₁, s₂, ⋯ with moves a_t ∼ π_t sampled from the MCTS policies π₁, π₂, ⋯, ending with outcome z]
▸ Each game yields training data (s_t, π_t, z); the network f_θ(s_t) = (p_t, v_t) is updated by θ′ = θ − α∇_θ L((p_t, v_t), (π_t, z))
▸ Loss: L(f_θ(s_t), (π_t, z)) = (z − v_t)² − π_tᵀ log p_t + c‖θ‖², i.e. a mean-squared error on the value, a cross-entropy loss on the policy, and L2 regularization
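A hedged sketch of this loss computed for one training example (the network outputs, MCTS targets, and parameter vector below are placeholder numbers, not real network outputs):

```python
import numpy as np

# The AlphaGo Zero loss for one position: MSE between predicted value v and
# outcome z, cross-entropy between network policy p and MCTS probabilities pi,
# plus L2 regularization. All inputs here are placeholder numbers.

def agz_loss(p, v, pi, z, theta, c=1e-4):
    mse = (z - v) ** 2                   # (z - v)^2
    ce = -float(np.dot(pi, np.log(p)))   # -pi^T log p
    l2 = c * float(np.sum(theta ** 2))   # c * ||theta||^2
    return mse + ce + l2

p = np.array([0.7, 0.2, 0.1])   # network move probabilities p_t
pi = np.array([0.6, 0.3, 0.1])  # MCTS search probabilities pi_t (the target)
theta = 0.01 * np.ones(10)      # stand-in parameter vector
print(agz_loss(p, v=0.4, pi=pi, z=1.0, theta=theta))
```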
The AlphaGo Zero training loop (see the sketch below):
▸ (1) Self-play: the current network f_θ = (p, v) plays games against itself
▸ (2) Learning: the self-play data D(π, z) is used to update the network, θ_{t+1} = θ_t − α∇_θ L(f_θ, D)
▸ (3) Repeat
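Putting the three steps together, a hedged schematic of the loop; `play_selfplay_game` and `train` are toy stubs standing in for the real MCTS-driven self-play and gradient-descent components:

```python
import random

# Schematic of the AlphaGo Zero loop: self-play -> learning -> repeat.
# The network, self-play, and training step are toy stubs standing in for
# the ResNet f_theta, MCTS-guided self-play, and SGD on L((p,v),(pi,z)).

def play_selfplay_game(network):
    """Stub: would run an MCTS-guided game and return (s_t, pi_t, z) triples."""
    z = random.choice([-1.0, 1.0])     # game outcome
    return [("state", [0.5, 0.5], z)]  # one placeholder triple

def train(network, data):
    """Stub: would take gradient steps on the self-play data D(pi, z)."""
    return network + 1                 # pretend the network improved

def training_loop(network=0, iterations=5, games_per_iteration=10):
    for _ in range(iterations):
        data = []                                 # (1) self-play
        for _ in range(games_per_iteration):
            data.extend(play_selfplay_game(network))
        network = train(network, data)            # (2) learning
    return network                                # (3) repeat happens above

print(training_loop())
```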
AlphaZero [13]

[13] David Silver et al. "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play". In: Science 362.6419 (2018), pp. 1140–1144. url: http://science.sciencemag.org/content/362/6419/1140/tab-pdf.
▸ AlphaGo Zero reaches superhuman level at Go without any domain knowledge...
▸ Since the algorithm is not tied to Go, can it play other games?
▸ AlphaZero extends AlphaGo Zero to play not only Go but also chess and shogi
  ▸ The same algorithm achieves superhuman performance in all three games
[Figure: the same multi-headed ResNet f_θ(s) = (p, v), with a policy head and a value head, consumes Go, shogi, OR chess input features]
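Only the input encoding is game-specific: a position is flattened into a stack of binary feature planes that the shared network consumes. A hedged sketch for a chess-like 8×8 board follows (the plane layout is illustrative, not AlphaZero's exact encoding, which also stacks move history and auxiliary planes):

```python
import numpy as np

# Sketch of encoding a board as binary feature planes, the game-specific input
# to the shared network: one 8x8 plane per (color, piece type). Illustrative
# layout only; the real encoding also includes history and extra planes.
PIECES = ["P", "N", "B", "R", "Q", "K"]

def encode(board):
    """board: 8x8 lists; uppercase = white, lowercase = black, '' = empty."""
    planes = np.zeros((2 * len(PIECES), 8, 8), dtype=np.float32)
    for r in range(8):
        for c in range(8):
            piece = board[r][c]
            if piece:
                color = 0 if piece.isupper() else 1
                planes[color * len(PIECES) + PIECES.index(piece.upper()), r, c] = 1.0
    return planes

board = [[""] * 8 for _ in range(8)]
board[7][4], board[0][4] = "K", "k"                   # just the two kings
print(encode(board).shape, int(encode(board).sum()))  # (12, 8, 8) 2
```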
▸ AlphaZero is one step closer to general AI
▸ AlphaZero was trained with 5000 TPUs
▸ Sometimes a simpler system can be more powerful than a complex one
▸ Universal research principle: strive for generality and simplicity (Occam's Razor)
▸ Self-play: no human bias, learning from first principles
▸ Deep RL is still in its infancy; a lot more can be expected in the next few years
▸ Open challenges: sample efficiency and data efficiency
  ▸ Yes, AlphaGo can learn to play Go after hundreds of game-years of experience, but a human can reach a decent level of play in only a couple of hours
  ▸ How can we make reinforcement learning more efficient? Model-based learning is a research area receiving increasing attention
▸ DQN [14]
▸ AlphaGo [15]
▸ AlphaGo Zero [16]
▸ AlphaZero [17]
▸ AlphaStar [18]
[14] Volodymyr Mnih et al. "Human-level control through deep reinforcement learning". In: Nature 518.7540 (Feb. 2015), pp. 529–533. doi: 10.1038/nature14236.
[15] David Silver et al. "Mastering the Game of Go with Deep Neural Networks and Tree Search". In: Nature 529.7587 (Jan. 2016), pp. 484–489. doi: 10.1038/nature16961.
[16] David Silver et al. "Mastering the game of Go without human knowledge". In: Nature 550 (Oct. 2017), pp. 354–359. doi: 10.1038/nature24270.
[17] David Silver et al. "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play". In: Science 362.6419 (2018), pp. 1140–1144. url: http://science.sciencemag.org/content/362/6419/1140/tab-pdf.
[18] Oriol Vinyals et al. "Grandmaster level in StarCraft II using multi-agent reinforcement learning". In: Nature 575 (Nov. 2019). doi: 10.1038/s41586-019-1724-z.

Thanks to Rolf Stadler for reviewing and discussing drafts of this presentation.