Verification of Agents learning through Reinforcement (PowerPoint PPT Presentation)



SLIDE 1

Outline: Reinforcement Learning · Air hockey as case-study · Air hockey as RL task · Verification · Repair · Conclusion

Verification of Agents learning through Reinforcement

Shashank Pathak (1,2), Giorgio Metta (1,2), Luca Pulina (3), Armando Tacchella (2)

(1) Robotics, Brain and Cognitive Sciences (RBCS), Istituto Italiano di Tecnologia (IIT), Via Morego, 30 – 16163 Genova – Italy. Shashank.Pathak@iit.it, Giorgio.Metta@iit.it

(2) Dipartimento di Informatica, Bioingegneria, Robotica e Ingegneria dei Sistemi (DIBRIS), Università degli Studi di Genova, Via Opera Pia, 13 – 16145 Genova – Italy. Armando.Tacchella@unige.it

(3) POLCOMING, Università degli Studi di Sassari, Viale Mancini 5 – 07100 Sassari – Italy. lpulina@uniss.it

Shashank Pathak, Giorgio Metta, Luca Pulina, Armando Tacchella Verification of RL

SLIDE 2

SLIDE 3


Some relevant features:

Figure: Reinforcement Learning

• Learning through experiences, i.e. tuples (s_t, a_t, r_t, s_{t+1})
• The objective is to attain a policy π(s_i) → a_i
• The policy should maximize some measure of "rewards" R_i

SLIDE 4


Finite-window rewards

Assume a finite time horizon t ∈ (t, T) and a discount factor γ ∈ [0, 1):

    R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + · · · + γ^{T−t−1} r_T

Define the value as the expected return under policy π:

    V^π(s) = E_π(R_t | s_t = s)

A Monte-Carlo style update is V(s_t) ← V(s_t) + α (R_t − V(s_t)); bootstrapping on the next state's value gives the temporal-difference update:

    V(s_t) ← V(s_t) + α δ,  where δ = r_{t+1} + γ V(s_{t+1}) − V(s_t)    (1)
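To make the update in Eq. (1) concrete, here is a minimal tabular TD(0) sketch. The two-state chain, the state names, and the parameter values are illustrative stand-ins, not taken from the slides:

```python
# Sketch of Eq. (1): V(s_t) <- V(s_t) + alpha * delta,
# with delta = r_{t+1} + gamma * V(s_{t+1}) - V(s_t).
# State names and parameters are made up for illustration.

def td0_update(V, s, s_next, reward, alpha=0.1, gamma=0.9):
    """One temporal-difference step on the value table V (a dict)."""
    delta = reward + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    return delta

V = {"s0": 0.0, "s1": 0.0}
delta = td0_update(V, "s0", "s1", reward=1.0)
# delta = 1 + 0.9*0 - 0 = 1.0, so V["s0"] moves to 0.1
```

Repeating this step over many transitions drives V toward the expected discounted return defined above.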

SLIDE 5


Air hockey

Figure: Platform and simulator

SLIDE 6


Reasons for choosing air hockey

• Air hockey is a challenging platform and has been used in the past to demonstrate learning
• As a robotic setup, it has been included as one of the benchmarks for robotics & humanoids
• Our previous work was performed on a real air-hockey setup, with supervised learning

SLIDE 7


Simulator

• For the current study, we chose a simulator instead of the real setup
• Our goal was to demonstrate safety in a model-free learning approach, and ways to improve it
• Sophisticated semi-supervised approaches would be needed to apply RL on the real setup
• Showing the benefits of verification and repair is independent of these approaches
• Simulation, or at least some logging, would be required even if the real setup were used

SLIDE 8


Simulator ...

• The simulator was implemented in C++, using libraries such as OpenCV, Boost and Pantheios
• For simplicity, no game engine was used; instead, 2D physics was implemented directly
• Physical and geometric considerations were also modelled
• Extensive logging and a GUI-based parameter search were provided

SLIDE 9


Learning Problem

• Given: an air-hockey platform and a robotic arm
• Objective: learn to defend the goal as well as possible
• The action of the robotic arm was constrained to be a minimum-jerk trajectory (for joint kinematics and safety)
• The state was defined in trajectory space rather than Cartesian coordinates
• Discrete states and discrete actions were considered
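The minimum-jerk constraint has a standard closed-form point-to-point profile (the well-known fifth-order polynomial). A small sketch, with the caveat that the slides do not specify the authors' actual trajectory-space parameterization:

```python
# Sketch of the standard minimum-jerk point-to-point profile, illustrating
# the trajectory constraint on the arm's actions. The authors' actual
# parameterization may differ; endpoints and duration below are made up.

def min_jerk(x0, xf, T, t):
    """Position at time t of a minimum-jerk move from x0 to xf over duration T."""
    tau = t / T
    return x0 + (xf - x0) * (10 * tau**3 - 15 * tau**4 + 6 * tau**5)

# Endpoints are met exactly; velocity and acceleration vanish at both ends,
# and the profile is symmetric about the midpoint.
start = min_jerk(0.0, 1.0, 0.5, 0.0)   # 0.0
mid = min_jerk(0.0, 1.0, 0.5, 0.25)    # 0.5
end = min_jerk(0.0, 1.0, 0.5, 0.5)     # 1.0
```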

SLIDE 10


Learn

Algorithm 1: Pseudo-code for learning to play air hockey using Reinforcement Learning

Initialize Q ← 0; ∆t ← 20 ms
function Learn(N_e, N_b, N_r)
    for all i ∈ {1, . . . , N_e} do
        Send Start signal to Simulator
        j ← 1
        repeat
            Receive s_j ← (p_j, α_j, θ_j) from Simulator
            ∆θ_j ← ComputePolicy(Q, s_j)
            Send (∆θ_j, ∆t) to Simulator
            Receive s_{j+1} ← (p_{j+1}, α_{j+1}, θ_{j+1}) and f_{j+1} ← (m, g, w, r)
            r_{j+1} ← ComputeReward(s_{j+1}, f_{j+1})
            E_j ← (s_j, ∆θ_j, r_{j+1}, s_{j+1}, f_{j+1})
            Q ← Update(Q, E_j)
            j ← j + 1
            if (j = N_b) then
                for all k ∈ {1, . . . , N_r} do
                    Choose random m ∈ {1, . . . , N_b}
                    Q ← Update(Q, E_m)
                end for
                j ← 1
            end if
        until r = TRUE
    end for
    return Q
end function
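The structure of Algorithm 1 (act, store each experience, and replay randomly chosen stored experiences every N_b steps) can be sketched as runnable Python. The one-dimensional "defend cell 2" toy environment, the epsilon-greedy exploration, and all constants below are stand-ins for the slides' simulator and ComputePolicy, chosen only to make the loop executable:

```python
# Runnable sketch of Algorithm 1's control flow with a made-up toy task:
# a paddle on cells 0..4 must reach the defended cell 2.
import random

def compute_policy(Q, s, actions=(-1, 0, 1)):
    """Greedy action from the Q-table (stand-in for ComputePolicy)."""
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def update(Q, exp, alpha=0.5, gamma=0.9, actions=(-1, 0, 1)):
    """One Q-learning backup from a stored experience (s, a, r, s_next)."""
    s, a, r, s_next = exp
    q = Q.get((s, a), 0.0)
    target = r + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = q + alpha * (target - q)

def learn(n_episodes=200, n_batch=4, n_replay=2, seed=0):
    """Act, store E_j, and replay n_replay random experiences every n_batch steps."""
    rng = random.Random(seed)
    Q, memory = {}, []
    for _ in range(n_episodes):
        s = 0                                 # episode start: paddle at cell 0
        for _ in range(20):
            # epsilon-greedy exploration stands in for the slides' policy
            a = compute_policy(Q, s) if rng.random() > 0.2 else rng.choice((-1, 0, 1))
            s_next = max(0, min(4, s + a))
            r = 1.0 if s_next == 2 else 0.0   # reward for covering the goal
            memory.append((s, a, r, s_next))
            update(Q, memory[-1])
            if len(memory) % n_batch == 0:    # the "if j = N_b" replay block
                for _ in range(n_replay):
                    update(Q, rng.choice(memory))
            s = s_next
            if r == 1.0:                      # episode ends on success
                break
    return Q

Q = learn()
```

After enough episodes, the greedy policy moves the paddle toward the defended cell from either side of it.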

SLIDE 11


Verification of DTMC

• The discrete state-action space allowed us to model the learned policy as a Discrete-Time Markov Chain (DTMC)
• The learnt policy π(s) → a was a softmax distribution over Q-values:

    π(s, a_i) = e^(κ Q(s, a_i)) / Σ_{a ∈ A} e^(κ Q(s, a))    (2)

• Next states were observed via simulation and the transition probabilities were adjusted empirically
• We considered two approaches: treating unsafe states as failures, and as faults
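Eq. (2) is the Boltzmann (softmax) distribution over Q-values. A minimal sketch, with made-up action names and Q-values; the max-shift is a standard numerical-stability trick, not something from the slides:

```python
# Sketch of Eq. (2): pi(s, a_i) = exp(kappa * Q(s, a_i)) / sum_a exp(kappa * Q(s, a)).
import math

def softmax_policy(q_values, kappa=1.0):
    """q_values: {action: Q(s, action)} for a fixed state s; returns action probabilities."""
    m = max(q_values.values())                        # shift for numerical stability
    expq = {a: math.exp(kappa * (q - m)) for a, q in q_values.items()}
    z = sum(expq.values())                            # the normalizing sum over A
    return {a: e / z for a, e in expq.items()}

probs = softmax_policy({"left": 0.0, "stay": 1.0, "right": 2.0}, kappa=1.0)
# probabilities sum to 1 and favour the highest-valued action
```

Larger κ makes the policy greedier; κ → 0 approaches the uniform distribution. These probabilities are exactly the transition weights used when building the DTMC.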

SLIDE 12


Verification of DTMC: Unsafe states as failures

• unsafe flag ⇒ halt
• On practical setups, there is usually low-level control
• Some approaches to address this: Lyapunov candidates, safety-conscious rewarding, etc.
• For the sake of generality, yet effectiveness, we used a safety-conscious rewarding schema while avoiding Lyapunov candidates
• In our case, the safety of the agent is the reachability probability of the unsafe states
• Using the safety property, we applied both PRISM and MRMC to obtain a quantitative measure of safety
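The safety measure above, the probability of eventually reaching an unsafe state in a DTMC, can be computed by simple fixed-point iteration (PRISM and MRMC use far more efficient numerical and symbolic methods). The four-state chain below is a made-up illustration:

```python
# Sketch of probabilistic reachability on a DTMC: iterate
# x(s) = 1 if s is unsafe, else sum_t P(s, t) * x(t), until it converges
# to Pr(eventually reach an unsafe state from s). Toy chain is illustrative.

def reach_probability(P, unsafe, n_iter=100):
    """P[s] = {s_next: prob}; returns {s: Pr(eventually reach unsafe from s)}."""
    x = {s: (1.0 if s in unsafe else 0.0) for s in P}
    for _ in range(n_iter):
        x = {s: (1.0 if s in unsafe else sum(p * x[t] for t, p in P[s].items()))
             for s in P}
    return x

# Tiny DTMC: s0 branches to s1 or the safe sink s3; s1 may reach unsafe s2.
P = {
    "s0": {"s1": 0.5, "s3": 0.5},
    "s1": {"s2": 0.2, "s3": 0.8},
    "s2": {"s2": 1.0},   # unsafe, absorbing
    "s3": {"s3": 1.0},   # safe, absorbing
}
probs = reach_probability(P, unsafe={"s2"})
# probs["s0"] is 0.5 * 0.2 = 0.1
```

Iterating from below converges to the least fixed point, which is exactly the reachability probability checked against the safety bound.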

SLIDE 13


Repairing DTMC

• Intuition: the badness of a state depends on its forward proximity to a bad state
• In general, changing Q-values in a way similar to an eligibility trace makes the policy safer
• While this is more effective than incorporating safety while learning, it could deteriorate the learnt policy
• Our experiments show this need not be the case
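The eligibility-trace-like intuition can be sketched as follows: given a path ending in an unsafe state, penalize the Q-values along it, with a penalty that decays with distance from the unsafe state. This is an illustrative reading of the slide, not the authors' exact Repair operator; the penalty and decay constants are made up:

```python
# Illustrative sketch: discounted backward penalty along a path into an
# unsafe state, so states closer to the unsafe state are "worse".

def repair_path(Q, path, penalty=1.0, decay=0.8):
    """path: [(s, a), ...] ordered from start to the step entering the unsafe state."""
    p = penalty
    for (s, a) in reversed(path):      # closest to the unsafe state is hit hardest
        Q[(s, a)] = Q.get((s, a), 0.0) - p
        p *= decay                     # penalty fades with backward distance
    return Q

Q = {("s0", "a"): 0.5, ("s1", "a"): 0.5}
repair_path(Q, [("s0", "a"), ("s1", "a")])
# ("s1", "a") is penalized by 1.0, ("s0", "a") by 0.8
```

Lowering these Q-values shrinks their softmax probabilities in Eq. (2), and hence the DTMC's probability mass on paths into unsafe states.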

SLIDE 14


Repairing DTMC: Using COMICS

We used the tool COMICS to generate the counterexamples
We then proceeded with repairing the paths

Algorithm 2: Pseudo-code for Verification and Repair of Learn

1: Given agent A, learning algorithm Learn, safety bound P_bound
2: Using A, perform Learn
3: Obtain policy π(s, a)
4: Construct a DTMC D from policy π(s, a)
5: Use MRMC or PRISM on D to obtain P_unsafe of violating P
6: repeat
7:     repeat
8:         Use COMICS to generate the set S_unsafe negating P with bound P_unsafe
9:         Apply Repair on S_unsafe
10:    until S_unsafe = ∅
11:    P_unsafe ← P_unsafe − ε, with ε ∈ (0, P_unsafe − P_bound]
12: until P_unsafe < P_bound
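The control flow of Algorithm 2 can be sketched as a runnable skeleton in which the verifier, the counterexample generator, and the repair step are passed in as callables (standing in for PRISM/MRMC, COMICS, and Repair). The toy stand-ins below are invented purely to exercise the loop; a fixed ε is used instead of the adaptive choice in step 11:

```python
# Runnable skeleton of Algorithm 2: repair until no counterexamples remain,
# then tighten the bound, until the unsafety bound drops below p_bound.

def verify_and_repair(policy, verify, counterexamples, repair, p_bound, eps):
    """verify/counterexamples/repair are stand-ins for PRISM-MRMC, COMICS, Repair."""
    p_unsafe = verify(policy)                        # step 5: model-check the DTMC
    while p_unsafe >= p_bound:                       # outer repeat ... until (step 12)
        while True:                                  # inner repeat ... until (step 10)
            cex = counterexamples(policy, p_unsafe)  # step 8: paths violating the bound
            if not cex:
                break                                # S_unsafe is empty
            policy = repair(policy, cex)             # step 9
        p_unsafe -= eps                              # step 11: tighten the bound
    return policy

# Toy stand-ins, purely to exercise the control flow:
toy_verify = lambda pol: pol["risk"]
toy_cex = lambda pol, bound: ["path"] if pol["risk"] >= bound else []
def toy_repair(pol, cex):
    return {"risk": pol["risk"] - 0.1}               # each repair lowers unsafety

safe = verify_and_repair({"risk": 0.3}, toy_verify, toy_cex, toy_repair,
                         p_bound=0.05, eps=0.1)
```

Termination relies on each repair round actually lowering the reachability probability, mirroring the assumption behind steps 9-11.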

SLIDE 15


Results

Figure: Learn vs. Test performance as a function of the number of episodes (1000-10000), shown in six panels.

SLIDE 16


Thanks to the audience and my colleagues¹! Questions or comments?

¹ Armando Tacchella, Giorgio Metta, & Luca Pulina