Verification of Agents learning through Reinforcement (PowerPoint PPT Presentation)



SLIDE 1

Outline: Reinforcement Learning · Air hockey as case-study · Air hockey as RL task · Verification · Repair · Conclusion

Verification of Agents learning through Reinforcement

Shashank Pathak (1,2), Giorgio Metta (1,2), Luca Pulina (3), Armando Tacchella (2)

(1) Robotics, Brain and Cognitive Sciences (RBCS), Istituto Italiano di Tecnologia (IIT), Via Morego, 30 – 16163 Genova – Italy. Shashank.Pathak@iit.it, Giorgio.Metta@iit.it

(2) Dipartimento di Informatica, Bioingegneria, Robotica e Ingegneria dei Sistemi (DIBRIS), Università degli Studi di Genova, Via Opera Pia, 13 – 16145 Genova – Italy. Armando.Tacchella@unige.it

(3) POLCOMING, Università degli Studi di Sassari, Viale Mancini 5 – 07100 Sassari – Italy. lpulina@uniss.it

Shashank Pathak, Giorgio Metta, Luca Pulina, Armando Tacchella Verification of RL

SLIDE 2

SLIDE 3


Some relevant features:

Figure: Reinforcement Learning

• Learning through experiences, i.e. tuples (s_t, a_t, r_t, s_{t+1})
• The objective is to attain a policy π(s_i) → a_i
• The policy should maximize some measure of "rewards" R_i

SLIDE 4


Finite-window rewards

Assume a finite time horizon t ∈ (t, T) and a discount factor γ ∈ [0, 1):

    R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + · · · + γ^{T−t−1} r_T

Define the value as the expected return under policy π:

    V^π(s) = E_π(R_t | s_t = s)

A Monte-Carlo style update is V(s_t) ← V(s_t) + α (R_t − V(s_t)); bootstrapping on the next state's value gives the temporal-difference update:

    V(s_t) ← V(s_t) + α δ,  where δ = r_{t+1} + γ V(s_{t+1}) − V(s_t)    (1)
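To make the update in Eq. (1) concrete, here is a minimal tabular TD(0) sketch. The two-state chain, the state names, and the parameter values are illustrative stand-ins, not taken from the slides:

```python
# Sketch of Eq. (1): V(s_t) <- V(s_t) + alpha * delta,
# with delta = r_{t+1} + gamma * V(s_{t+1}) - V(s_t).
# State names and parameters are made up for illustration.

def td0_update(V, s, s_next, reward, alpha=0.1, gamma=0.9):
    """One temporal-difference step on the value table V (a dict)."""
    delta = reward + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    return delta

V = {"s0": 0.0, "s1": 0.0}
delta = td0_update(V, "s0", "s1", reward=1.0)
# delta = 1 + 0.9*0 - 0 = 1.0, so V["s0"] moves to 0.1
```

Repeating this step over many transitions drives V toward the expected discounted return defined above.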

SLIDE 5


Air hockey

Figure: Platform and simulator

SLIDE 6


Reasons for choosing air hockey

• Air hockey is a challenging platform and has been used in the past to demonstrate learning
• As a robotic setup, it has been included as one of the benchmarks for robotics & humanoids
• Our previous work was performed on a real air-hockey setup, with supervised learning

SLIDE 7


Simulator

• For the current study, we chose a simulator instead of the real setup
• Our goal was to demonstrate safety in a model-free learning approach, and ways to improve it
• Sophisticated semi-supervised approaches would be needed to apply RL on the real setup
• Showing the benefits of verification and repair is independent of these approaches
• Simulation, or at least some logging, would be required even if the real setup were used

SLIDE 8


Simulator ...

• The simulator was implemented in C++, using libraries such as OpenCV, Boost and Pantheios
• For simplicity, no game engine was used; instead, 2D physics was implemented directly
• Physical and geometric considerations were also modelled
• Extensive logging and a GUI-based parameter search were provided

SLIDE 9


Learning Problem

• Given: an air-hockey platform and a robotic arm
• Objective: learn to defend the goal as well as possible
• The action of the robotic arm was constrained to be a minimum-jerk trajectory (for joint kinematics and safety)
• The state was defined in trajectory space rather than Cartesian coordinates
• Discrete states and discrete actions were considered
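The minimum-jerk constraint has a standard closed-form point-to-point profile (the well-known fifth-order polynomial). A small sketch, with the caveat that the slides do not specify the authors' actual trajectory-space parameterization:

```python
# Sketch of the standard minimum-jerk point-to-point profile, illustrating
# the trajectory constraint on the arm's actions. The authors' actual
# parameterization may differ; endpoints and duration below are made up.

def min_jerk(x0, xf, T, t):
    """Position at time t of a minimum-jerk move from x0 to xf over duration T."""
    tau = t / T
    return x0 + (xf - x0) * (10 * tau**3 - 15 * tau**4 + 6 * tau**5)

# Endpoints are met exactly; velocity and acceleration vanish at both ends,
# and the profile is symmetric about the midpoint.
start = min_jerk(0.0, 1.0, 0.5, 0.0)   # 0.0
mid = min_jerk(0.0, 1.0, 0.5, 0.25)    # 0.5
end = min_jerk(0.0, 1.0, 0.5, 0.5)     # 1.0
```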

SLIDE 10


Learn

Algorithm 1: Pseudo-code for learning to play air hockey using Reinforcement Learning

Initialize Q ← 0; ∆t ← 20 ms
function Learn(N_e, N_b, N_r)
    for all i ∈ {1, . . . , N_e} do
        Send Start signal to Simulator
        j ← 1
        repeat
            Receive s_j ← (p_j, α_j, θ_j) from Simulator
            ∆θ_j ← ComputePolicy(Q, s_j)
            Send (∆θ_j, ∆t) to Simulator
            Receive s_{j+1} ← (p_{j+1}, α_{j+1}, θ_{j+1}) and f_{j+1} ← (m, g, w, r)
            r_{j+1} ← ComputeReward(s_{j+1}, f_{j+1})
            E_j ← (s_j, ∆θ_j, r_{j+1}, s_{j+1}, f_{j+1})
            Q ← Update(Q, E_j)
            j ← j + 1
            if (j = N_b) then
                for all k ∈ {1, . . . , N_r} do
                    Choose random m ∈ {1, . . . , N_b}
                    Q ← Update(Q, E_m)
                end for
                j ← 1
            end if
        until r = TRUE
    end for
    return Q
end function
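The structure of Algorithm 1 (act, store each experience, and replay randomly chosen stored experiences every N_b steps) can be sketched as runnable Python. The one-dimensional "defend cell 2" toy environment, the epsilon-greedy exploration, and all constants below are stand-ins for the slides' simulator and ComputePolicy, chosen only to make the loop executable:

```python
# Runnable sketch of Algorithm 1's control flow with a made-up toy task:
# a paddle on cells 0..4 must reach the defended cell 2.
import random

def compute_policy(Q, s, actions=(-1, 0, 1)):
    """Greedy action from the Q-table (stand-in for ComputePolicy)."""
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def update(Q, exp, alpha=0.5, gamma=0.9, actions=(-1, 0, 1)):
    """One Q-learning backup from a stored experience (s, a, r, s_next)."""
    s, a, r, s_next = exp
    q = Q.get((s, a), 0.0)
    target = r + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = q + alpha * (target - q)

def learn(n_episodes=200, n_batch=4, n_replay=2, seed=0):
    """Act, store E_j, and replay n_replay random experiences every n_batch steps."""
    rng = random.Random(seed)
    Q, memory = {}, []
    for _ in range(n_episodes):
        s = 0                                 # episode start: paddle at cell 0
        for _ in range(20):
            # epsilon-greedy exploration stands in for the slides' policy
            a = compute_policy(Q, s) if rng.random() > 0.2 else rng.choice((-1, 0, 1))
            s_next = max(0, min(4, s + a))
            r = 1.0 if s_next == 2 else 0.0   # reward for covering the goal
            memory.append((s, a, r, s_next))
            update(Q, memory[-1])
            if len(memory) % n_batch == 0:    # the "if j = N_b" replay block
                for _ in range(n_replay):
                    update(Q, rng.choice(memory))
            s = s_next
            if r == 1.0:                      # episode ends on success
                break
    return Q

Q = learn()
```

After enough episodes, the greedy policy moves the paddle toward the defended cell from either side of it.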

SLIDE 11


Verification of DTMC

• The discrete state-action space allowed us to model the learned policy as a Discrete-Time Markov Chain (DTMC)
• The learnt policy π(s) → a was a softmax distribution over Q-values:

    π(s, a_i) = e^(κ Q(s, a_i)) / Σ_{a ∈ A} e^(κ Q(s, a))    (2)

• Next states were observed via simulation and the transition probabilities were adjusted empirically
• We considered two approaches: treating unsafe states as failures, and as faults
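Eq. (2) is the Boltzmann (softmax) distribution over Q-values. A minimal sketch, with made-up action names and Q-values; the max-shift is a standard numerical-stability trick, not something from the slides:

```python
# Sketch of Eq. (2): pi(s, a_i) = exp(kappa * Q(s, a_i)) / sum_a exp(kappa * Q(s, a)).
import math

def softmax_policy(q_values, kappa=1.0):
    """q_values: {action: Q(s, action)} for a fixed state s; returns action probabilities."""
    m = max(q_values.values())                        # shift for numerical stability
    expq = {a: math.exp(kappa * (q - m)) for a, q in q_values.items()}
    z = sum(expq.values())                            # the normalizing sum over A
    return {a: e / z for a, e in expq.items()}

probs = softmax_policy({"left": 0.0, "stay": 1.0, "right": 2.0}, kappa=1.0)
# probabilities sum to 1 and favour the highest-valued action
```

Larger κ makes the policy greedier; κ → 0 approaches the uniform distribution. These probabilities are exactly the transition weights used when building the DTMC.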

SLIDE 12


Verification of DTMC: Unsafe states as failures

• unsafe flag ⇒ halt
• On practical setups, there is usually low-level control
• Some approaches to address this: Lyapunov candidates, safety-conscious rewarding, etc.
• For the sake of generality, yet effectiveness, we used a safety-conscious rewarding schema while avoiding Lyapunov candidates
• In our case, the safety of the agent is the reachability probability of the unsafe states
• Using the safety property, we applied both PRISM and MRMC to obtain a quantitative measure of safety
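The safety measure above, the probability of eventually reaching an unsafe state in a DTMC, can be computed by simple fixed-point iteration (PRISM and MRMC use far more efficient numerical and symbolic methods). The four-state chain below is a made-up illustration:

```python
# Sketch of probabilistic reachability on a DTMC: iterate
# x(s) = 1 if s is unsafe, else sum_t P(s, t) * x(t), until it converges
# to Pr(eventually reach an unsafe state from s). Toy chain is illustrative.

def reach_probability(P, unsafe, n_iter=100):
    """P[s] = {s_next: prob}; returns {s: Pr(eventually reach unsafe from s)}."""
    x = {s: (1.0 if s in unsafe else 0.0) for s in P}
    for _ in range(n_iter):
        x = {s: (1.0 if s in unsafe else sum(p * x[t] for t, p in P[s].items()))
             for s in P}
    return x

# Tiny DTMC: s0 branches to s1 or the safe sink s3; s1 may reach unsafe s2.
P = {
    "s0": {"s1": 0.5, "s3": 0.5},
    "s1": {"s2": 0.2, "s3": 0.8},
    "s2": {"s2": 1.0},   # unsafe, absorbing
    "s3": {"s3": 1.0},   # safe, absorbing
}
probs = reach_probability(P, unsafe={"s2"})
# probs["s0"] is 0.5 * 0.2 = 0.1
```

Iterating from below converges to the least fixed point, which is exactly the reachability probability checked against the safety bound.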

SLIDE 13


Repairing DTMC

• Intuition: the badness of a state depends on its forward proximity to a bad state
• In general, changing Q-values in a way similar to an eligibility trace makes the policy safer
• While this is more effective than incorporating safety while learning, it could deteriorate the learnt policy
• Our experiments show this need not be the case
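The eligibility-trace-like intuition can be sketched as follows: given a path ending in an unsafe state, penalize the Q-values along it, with a penalty that decays with distance from the unsafe state. This is an illustrative reading of the slide, not the authors' exact Repair operator; the penalty and decay constants are made up:

```python
# Illustrative sketch: discounted backward penalty along a path into an
# unsafe state, so states closer to the unsafe state are "worse".

def repair_path(Q, path, penalty=1.0, decay=0.8):
    """path: [(s, a), ...] ordered from start to the step entering the unsafe state."""
    p = penalty
    for (s, a) in reversed(path):      # closest to the unsafe state is hit hardest
        Q[(s, a)] = Q.get((s, a), 0.0) - p
        p *= decay                     # penalty fades with backward distance
    return Q

Q = {("s0", "a"): 0.5, ("s1", "a"): 0.5}
repair_path(Q, [("s0", "a"), ("s1", "a")])
# ("s1", "a") is penalized by 1.0, ("s0", "a") by 0.8
```

Lowering these Q-values shrinks their softmax probabilities in Eq. (2), and hence the DTMC's probability mass on paths into unsafe states.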

SLIDE 14


Repairing DTMC: Using COMICS

We used the tool COMICS to generate the counterexamples
We then proceeded with repairing the paths

Algorithm 2: Pseudo-code for Verification and Repair of Learn

1: Given agent A, learning algorithm Learn, safety bound P_bound
2: Using A, perform Learn
3: Obtain policy π(s, a)
4: Construct a DTMC D from policy π(s, a)
5: Use MRMC or PRISM on D to obtain P_unsafe of violating P
6: repeat
7:     repeat
8:         Use COMICS to generate the set S_unsafe negating P with bound P_unsafe
9:         Apply Repair on S_unsafe
10:    until S_unsafe = ∅
11:    P_unsafe ← P_unsafe − ε, with ε ∈ (0, P_unsafe − P_bound]
12: until P_unsafe < P_bound
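The control flow of Algorithm 2 can be sketched as a runnable skeleton in which the verifier, the counterexample generator, and the repair step are passed in as callables (standing in for PRISM/MRMC, COMICS, and Repair). The toy stand-ins below are invented purely to exercise the loop; a fixed ε is used instead of the adaptive choice in step 11:

```python
# Runnable skeleton of Algorithm 2: repair until no counterexamples remain,
# then tighten the bound, until the unsafety bound drops below p_bound.

def verify_and_repair(policy, verify, counterexamples, repair, p_bound, eps):
    """verify/counterexamples/repair are stand-ins for PRISM-MRMC, COMICS, Repair."""
    p_unsafe = verify(policy)                        # step 5: model-check the DTMC
    while p_unsafe >= p_bound:                       # outer repeat ... until (step 12)
        while True:                                  # inner repeat ... until (step 10)
            cex = counterexamples(policy, p_unsafe)  # step 8: paths violating the bound
            if not cex:
                break                                # S_unsafe is empty
            policy = repair(policy, cex)             # step 9
        p_unsafe -= eps                              # step 11: tighten the bound
    return policy

# Toy stand-ins, purely to exercise the control flow:
toy_verify = lambda pol: pol["risk"]
toy_cex = lambda pol, bound: ["path"] if pol["risk"] >= bound else []
def toy_repair(pol, cex):
    return {"risk": pol["risk"] - 0.1}               # each repair lowers unsafety

safe = verify_and_repair({"risk": 0.3}, toy_verify, toy_cex, toy_repair,
                         p_bound=0.05, eps=0.1)
```

Termination relies on each repair round actually lowering the reachability probability, mirroring the assumption behind steps 9-11.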

SLIDE 15


Results

Figure: Learn vs. Test performance as a function of the number of episodes (1000-10000), shown in six panels.

SLIDE 16


Thanks to the audience and my colleagues¹! Questions or comments?

¹ Armando Tacchella, Giorgio Metta, & Luca Pulina