Emergent solutions to high dimensional multi-task reinforcement learning
Humies Competition, GECCO 2018
Emergent solutions to high dimensional multi-task reinforcement learning
Stephen Kelly & Malcolm Heywood
Why does the result qualify as human competitive?
Game title:
- Atari
- Doom
[Diagram: Game Playing Agent receives Visual State s(t), emits Atomic Action a(t); evaluated by End-of-Evaluation Game score]
July 2018 2 Humies
Visual RL dominated by deep learning
- DQN (2015)
  – Visual RL on the Arcade Learning Environment (49 titles)
  – Q-learning with deep learning
  – Cropped visual image (84 × 84)
  – Frame stacking (removes the interleaving of sprites & stochastic properties)
  – “able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games” [Nature (2015) Vol. 518]
- Gorila (2015), Double DQN (2016), Dueling DQN (2016), A3C (2016), Noisy DQN (2017), Distributional DQN (2017), Rainbow (2018)
- One policy per game title
- Learning parameters and DNN topology identified a priori
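The DQN preprocessing bullets above can be sketched as follows. This is a minimal illustration assuming 210×160 RGB Atari frames and NumPy; the published pipeline uses luminance extraction and bilinear resizing, which are approximated here with a channel mean and a strided crop.

```python
import numpy as np
from collections import deque

def preprocess(frame_rgb):
    """Down-sample an RGB Atari frame to an 84x84 grayscale image.

    A sketch of DQN-style preprocessing; the crop offsets and the
    naive 2x down-sample are illustrative assumptions.
    """
    gray = frame_rgb.mean(axis=2)          # (210, 160) luminance proxy
    cropped = gray[26:194, :]              # drop the score bar -> (168, 160)
    resized = cropped[::2, ::2]            # naive 2x down-sample -> (84, 80)
    out = np.zeros((84, 84), dtype=np.float32)
    out[:, :80] = resized                  # pad width to 84
    return out

class FrameStack:
    """Stack the last k processed frames into the network's input state,
    so a single state captures motion across consecutive frames."""
    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def push(self, frame_rgb):
        self.frames.append(preprocess(frame_rgb))
        while len(self.frames) < self.frames.maxlen:
            self.frames.append(self.frames[-1])   # pad with copies at episode start
        return np.stack(self.frames)              # shape (4, 84, 84)
```

Stacking frames is what removes the sprite interleaving mentioned above: a single Atari frame may omit flickering sprites, but a 4-frame stack almost always contains them.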
Visual RL compared to ‘human’
[Figure: bar chart on a log scale of % human score, comparing TPG, DQN, Gorila, Double-DQN, and H-NEAT. Normalized score = 100 × (algorithm − random) / (human − random), so 100 marks human level; bars span each algorithm's best to worst title, above human level on the best and below on the worst. TPG and DQN are statistically equivalent.]
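The human-normalized score used in the figure above is a one-line computation; the numbers in the example are illustrative only, not taken from the presentation.

```python
def human_normalized(score, human, random_play):
    """Agent score as a percentage of human performance, per the slide:
    100 * (algorithm - random) / (human - random).
    100 marks human level; 0 marks random play."""
    return 100.0 * (score - random_play) / (human - random_play)

# Illustrative: an agent scoring 900 on a title where random play
# scores 100 and a human scores 500 sits at 200% of human level.
print(human_normalized(900, 500, 100))  # 200.0
```

Subtracting the random-play score anchors the scale so that an agent doing no better than chance scores 0%, which is why the chart can meaningfully use a log axis relative to human level.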
Visual RL and multi-task learning
- Multiple game titles played by a single agent
- Single-title DQN provides the baseline
- Best DNN result needs prior knowledge regarding parameters and topology
- Constitutes an example of a task pertaining to ‘Artificial General Intelligence’
Multi-title TPG versus single-title DQN
[Figure: bar chart of log(% DQN score) relative to each single-title DQN score, over three groups of titles: Alien, Battle Zone, Asteroids, Bank Heist, Bowling, Chopper Command, Centipede, Fishing Derby, Kangaroo, Frostbite, Krull, Kung-Fu, Ms. Pac-Man, Time Pilot, Private Eye. Bars above 100% are better than DQN, below are worse.]
Why [is our entry] ‘best’ in comparison to other entries?
- Single-title task
  – TPG provides solutions competitive with human and DQN
  – Agents have to be competitive over multiple game titles
- Multi-title task
  – TPG multi-task solution is competitive with DQN trained under a single-title setting
  – DNN state of the art in the single task does not address the multi-title task
- TPG for the single-title task is a special case of TPG for the multi-title task
The ‘icing on the cake’
- TPG addresses multiple issues simultaneously:
  – Complexity of topology is emergent and:
    - Highly modular
    - Unique to the task
    - Explicitly reflects a decomposition of the task
  – No image-specific instructions, just:
    - Four 2-argument operators {+, −, ×, ÷}
    - Three 1-argument operators {log, exp, cos}
    - One conditional operator
  – TPG is highly efficient computationally
  – Some examples…
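To make that instruction set concrete, here is a minimal register-machine interpreter over exactly those operators. The tuple encoding, the protected division and log, and the conditional semantics ("skip the next instruction unless the test holds") are illustrative assumptions, not the authors' exact TPG implementation.

```python
import math

def protected_div(a, b):
    return a / b if abs(b) > 1e-6 else a       # guard against /0 (assumed convention)

def protected_log(a):
    return math.log(abs(a)) if abs(a) > 1e-6 else 0.0

BINARY = {"+": lambda a, b: a + b,             # the four 2-argument operators
          "-": lambda a, b: a - b,
          "*": lambda a, b: a * b,
          "/": protected_div}
UNARY = {"log": protected_log,                 # the three 1-argument operators
         "exp": lambda a: math.exp(min(a, 50.0)),  # clamp to avoid overflow
         "cos": math.cos}

def run(program, inputs, n_regs=8):
    """Execute a linear program against screen-pixel inputs.

    A source index s reads register R[s] when s < n_regs, otherwise
    inputs[s - n_regs].  Instruction forms:
      (op, dst, s1, s2)   binary:  R[dst] = op(src1, src2)
      (op, dst, s)        unary:   R[dst] = op(src)
      ("if<", dst, s)     skip the next instruction unless R[dst] < src
    The program's output (its bid) is read from register 0.
    """
    R = [0.0] * n_regs

    def fetch(s):
        return R[s] if s < n_regs else inputs[s - n_regs]

    skip = False
    for ins in program:
        if skip:
            skip = False
            continue
        op, dst = ins[0], ins[1]
        if op in BINARY:
            R[dst] = BINARY[op](fetch(ins[2]), fetch(ins[3]))
        elif op in UNARY:
            R[dst] = UNARY[op](fetch(ins[2]))
        elif op == "if<":
            skip = not (R[dst] < fetch(ins[2]))
    return R[0]

# Example: R0 = inputs[0] + inputs[1], then square it.
prog = [("+", 0, 8, 9), ("*", 0, 0, 0)]
print(run(prog, [2.0, 3.0]))  # 25.0
```

Because programs only index a small subset of the input, no image-specific operators are needed: which pixels matter is itself discovered by evolution.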
Teams (nodes) per graph emerge… [ditto pixels used]
[Figure: number of teams (up to ~800) versus generation (log scale, 1–200) for Alien, Asteroids, Bowling, Boxing, Ms. Pac-Man, and Rand. Overall solution complexity: teams in the entire champion policy graph. Per-decision complexity: teams visited per decision during test.]
Emergent discovery of multi-title solutions
[Figure: champion multi-title policy graphs; nodes are teams of programs, with subgraphs specializing per title for Ms. Pac-Man, Frostbite, and Centipede.]
Run-time complexity

DQN
- ≈1.6 million weights in MLP
- ≈3.2 million convolution operations in DNN
- 3.2 GHz Intel i7-4700s: 5 decisions per second
- GPU acceleration: 330 decisions per second

TPG
- Single title: 71–2346 instructions (avg)
- Multi-title: 413–869 instructions (avg)
- 2.2 GHz Intel E5-2650
  – Single title: 758–2853 decisions per sec.
  – Multi-title: 1832–2922 decisions per sec.
Questions?