Learning to Prune Dominated Action Sequences in Online Black-box - PowerPoint PPT Presentation

Learning to Prune Dominated Action Sequences in Online Black-box Planning Yuu Jinnai Alex Fukunaga The University of Tokyo

Black-box Planning in Arcade Learning Environment
• What a human sees: [Figure: game screen]
• What the computer sees: an uninterpreted bit vector (0101 1111 0010 …)
Arcade Learning Environment (Bellemare et al. 2013)

General-purpose agents have many irrelevant actions
• The set of actions that are “useful” in each environment (= game) is a subset of the action set available in the ALE
• Yet in a black-box domain, an agent has no prior knowledge of which actions are relevant to the given environment
[Figure: the action set available in the ALE (18 actions: Neutral, Up, Up-left, Left, Down-left, Down, Down-right, Right, Up-right, each with and without fire) next to the actions that are useful in the environment (Neutral, Up, Left, Down, Right)]

State Space Planning Problem
Two ways of describing a domain:
• Transparent model domain (e.g. PDDL)
• Black-box domain

Transparent Model Domain
Input: the initial state, goal condition, and action set are described in logic (e.g. PDDL)
• Easy to compute relevant actions
• Possible to deduce which actions are useful
Example: blocks world
Init: ontable(a), ontable(b), clear(a), clear(b)
Goal: on(a, b)
Action: Move(b, x, y)
Precond: on(b, x), clear(b), clear(y)
Effect: on(b, y), clear(x), ¬on(b, x), ¬clear(y)
[Figure: initial state (blocks A and B side by side) and goal condition (A on B)]

Black-box Domain
• Domain description in a black-box domain:
• s_0: initial state (bit vector)
• suc(s, a): (black-box) successor generator function, which returns the state that results when action a is applied to state s
• r(s, a): (black-box) reward function (or goal condition)
→ No description of which actions are valid/relevant
[Figure: initial state and goal condition shown only as bit vectors (1011 1001 1000 …, 0101 1111 0010 …)]
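The interface on this slide can be sketched in a few lines of Python. This is only an illustrative toy, not the ALE: the class name `BlackBoxDomain`, the 3-bit state, and the dynamics in `_toy_suc` and `_toy_r` are all assumptions made up for the example. The point is that a planner can only call `suc` and `r`; nothing tells it that the "+Fire" variants are redundant.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

State = Tuple[int, ...]  # opaque bit vector; the planner never interprets it
Action = str


@dataclass(frozen=True)
class BlackBoxDomain:
    """Hypothetical container for the three ingredients named on the slide."""
    s0: State                              # initial state
    actions: Sequence[Action]              # available action set
    suc: Callable[[State, Action], State]  # black-box successor generator
    r: Callable[[State, Action], float]    # black-box reward function


def _toy_suc(s: State, a: Action) -> State:
    # Made-up dynamics: the bits encode a position in 0..7; "Fire" does nothing.
    pos = int("".join(map(str, s)), 2)
    if a.startswith("Up"):
        pos = min(pos + 1, 7)
    elif a.startswith("Down"):
        pos = max(pos - 1, 0)
    return tuple(int(b) for b in format(pos, "03b"))


def _toy_r(s: State, a: Action) -> float:
    # Made-up reward: 1 for any transition that lands on position 7.
    return 1.0 if _toy_suc(s, a) == (1, 1, 1) else 0.0


toy = BlackBoxDomain(
    s0=(0, 0, 0),
    actions=("Up", "Down", "Up+Fire", "Down+Fire"),
    suc=_toy_suc,
    r=_toy_r,
)
```

In this toy, `toy.suc(toy.s0, "Up")` and `toy.suc(toy.s0, "Up+Fire")` return the same state, which is exactly the kind of redundancy an agent can only discover by generating both nodes.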

Arcade Learning Environment (ALE): A Black-box Domain (Bellemare et al. 2013)
• Domain description in the ALE:
• State: RAM state (bit vector of 1024 bits)
• Successor generator: complete emulator
• Reward function: complete emulator
• 18 available actions for an agent
• No description of which actions are relevant/required
• Node generation is the main bottleneck of walltime (it requires running the simulator)

Two Lines of Research in the ALE (Bellemare et al. 2013)
• Online planning setting (e.g. Lipovetzky et al. 2015): an agent runs a simulated lookahead every k (= 5) frames and chooses an action to execute next (no prior learning)
• Learning setting (e.g. Mnih et al. 2015): an agent generates a reactive controller that maps states to actions
We focus on the online planning setting in this talk (applying our method to RL is future work)

Online Planning on the ALE (Bellemare et al. 2013)
For each planning iteration (= planning episode):
1. Run a simulated lookahead with a limited amount of computational resource (e.g. # of simulation frames)
2. Choose the action that leads to the best accumulated reward
[Figure: lookahead tree rooted at the current game state, branching on Up/Down, with accumulated rewards r=5, r=8, r=9, r=10 at the leaves]
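One planning iteration can be sketched as follows. This is a minimal sketch under stated assumptions: it uses a plain breadth-first lookahead rather than any specific planner from the cited work, it measures the budget in generated nodes rather than simulation frames, and `toy_suc` and `toy_r` are made-up stand-ins for the emulator.

```python
from collections import deque


def lookahead(s0, actions, suc, r, budget):
    """Breadth-first lookahead limited to `budget` node generations.

    Returns (first action of the best path found, its accumulated reward).
    """
    best_action, best_return = actions[0], float("-inf")
    # Each entry: (state, first action on the path from the root, accumulated reward)
    frontier = deque([(s0, None, 0.0)])
    generated = 0
    while frontier and generated < budget:
        s, first, acc = frontier.popleft()
        for a in actions:
            if generated >= budget:
                break
            s2 = suc(s, a)          # one call to the (expensive) simulator
            generated += 1
            acc2 = acc + r(s, a)
            root_action = a if first is None else first
            if acc2 > best_return:
                best_action, best_return = root_action, acc2
            frontier.append((s2, root_action, acc2))
    return best_action, best_return


# Made-up toy dynamics: integer position, reward 1 for stepping onto position 3.
def toy_suc(s, a):
    return s + 1 if a == "Up" else s - 1


def toy_r(s, a):
    return 1.0 if toy_suc(s, a) == 3 else 0.0


choice = lookahead(0, ("Up", "Down"), toy_suc, toy_r, budget=14)
```

With two actions, a budget of 14 generations covers the full tree to depth 3 (2 + 4 + 8 nodes). Every extra action in the action set multiplies that cost per level, which is why redundant actions are expensive when each generation runs the simulator.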

General-purpose agents have many irrelevant actions
• The set of actions that are “useful” in each environment (= game) is a subset of the action set available in the ALE
• The set of actions that are “useful” in each state in the environment is a smaller subset
[Figure: the action set available in the ALE (18 actions), the actions that are useful in the environment (Neutral, Up, Left, Down, Right), and the actions that are useful in the state (Neutral, Up, Left)]

General-purpose agents have many irrelevant actions
[Figure: node expansion with actions Up, Left, Neutral, Up-left, Down-left, Down, Up-right (+ fire), Down-right (+ fire), Right (+ fire)]
• Generated duplicate nodes can be pruned by duplicate detection
• However, in a simulation-based black-box domain, node generation is the main bottleneck of walltime performance, and a duplicate is detected only after the node has already been generated
→ By pruning irrelevant actions before they are applied, we can use the computational resource more efficiently

Dominated action sequence pruning (DASP)
• Goal: find action sequences that are useful in the environment (for simplicity, we explain using action sequences of length 1)
• Prune redundant actions in the course of online planning
• Find a minimal action set that can reproduce the previous search graphs, and use that action set for the next planning episode

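Dominated action sequence pruning (DASP)
• Action set available to the agent: {Up, Down, Up+Fire, Down+Fire}
• Minimal action set: {Up, Down}
[Figure: a search graph built with all four actions, in which Up+Fire and Down+Fire reach the same nodes as Up and Down, next to the same graph built with only Up and Down]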

DASP: Find a minimal action set
• Algorithm: find a minimal action set A
1. Each vertex v_i ∈ V corresponds to action i in a hypergraph G = (V, E). A hyperedge e(v_0, v_1, …, v_n) ∈ E exists iff there are one or more duplicate search nodes generated by all of v_0, v_1, …, v_n but not by any other action.
2. Add the minimal vertex cover of G to A. In the example, A = {Up, Down}.
[Figure: search graphs from previous episodes with actions Up, Down, Up+Fire, Down+Fire; the hypergraph G with hyperedges {Up, Up+Fire} and {Down, Down+Fire}; and the search graph obtained using A]
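The two steps above can be sketched for action sequences of length 1. Several things here are assumptions for illustration rather than the authors' implementation: the function names, the representation of a search graph as (parent, action, child) triples, the choice to treat a node generated by a single action as a one-vertex hyperedge, and the greedy heuristic, which returns a vertex cover that is not guaranteed to be minimum.

```python
from collections import defaultdict


def build_hyperedges(transitions):
    """Step 1: one hyperedge per group of actions that generate the same node.

    `transitions` is an iterable of (parent_state, action, child_state)
    triples recorded from previous search graphs.
    """
    generators = defaultdict(set)
    for parent, action, child in transitions:
        generators[(parent, child)].add(action)
    return {frozenset(actions) for actions in generators.values()}


def greedy_vertex_cover(hyperedges):
    """Step 2 (approximate): pick actions until every hyperedge is hit."""
    uncovered = set(hyperedges)
    cover = set()
    while uncovered:
        counts = defaultdict(int)
        for edge in uncovered:
            for v in edge:
                counts[v] += 1
        # Most frequent action first; ties broken alphabetically for determinism.
        best = max(sorted(counts), key=lambda v: counts[v])
        cover.add(best)
        uncovered = {e for e in uncovered if best not in e}
    return cover


# Made-up search graph in which "+Fire" never changes the successor.
transitions = [
    (0, "Up", 1), (0, "Up+Fire", 1), (0, "Down", -1), (0, "Down+Fire", -1),
    (1, "Up", 2), (1, "Up+Fire", 2), (1, "Down", 0), (1, "Down+Fire", 0),
]
edges = build_hyperedges(transitions)
A = greedy_vertex_cover(edges)
```

On this toy graph the hyperedges are {Up, Up+Fire} and {Down, Down+Fire}, and the cover is {Up, Down}, matching the example on the slide. An exact minimum cover would need a different solver, since minimum vertex cover on hypergraphs is NP-hard in general.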

Experimental Result: acquired minimal action set
• DASP finds and uses a minimal action set at each planning episode, except for the first 12 planning episodes
• Restricted action set: a hand-coded set of minimal actions for each game
[Figure: comparison of the default action set (= 18 actions) and DASP (jittered)]
