AIPOL: Anti Imitation-based Policy Learning


  1. AIPOL: Anti Imitation-based Policy Learning. Michèle Sebag, Riad Akrour, Basile Mayeur, Marc Schoenauer. TAO, CNRS − INRIA − LRI, UPSud, Université Paris-Saclay, France. ECML PKDD 2016, Riva del Garda

  2. Reinforcement Learning: the ultimate challenge
  ◮ Learning improves survival expectation
  RL and the value function
  ◮ State space $S$, action space $A$
  ◮ Transition model $p(s, a, s')$
  ◮ Reward function $R : S \to \mathbb{R}$
  ◮ Policy $\pi : S \to A$
  For each $\pi$, define the reward expectation
  $V^\pi(s) = R(s) + \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R(s_{t+1}) \,\Big|\, s_0 = s,\ s_{t+1} \sim p(s_t, a_t = \pi(s_t), \cdot)\Big]$
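To make the reward-expectation definition concrete, here is a minimal Monte Carlo sketch that estimates $V^\pi(s)$ by averaging discounted returns over rollouts. The `env.reset_to` / `env.step` interface and the `policy` callable are hypothetical names used only for illustration; they are not part of the slides.

```python
import numpy as np

def monte_carlo_value(env, policy, s0, gamma=0.99, horizon=200, n_rollouts=50):
    """Estimate V^pi(s0) by averaging discounted returns over rollouts.

    `env.reset_to(s)` and `env.step(a) -> (s', r, done)` are hypothetical
    interfaces, assumed here only for illustration.
    """
    returns = []
    for _ in range(n_rollouts):
        s = env.reset_to(s0)
        g, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r, done = env.step(a)
            g += discount * r          # accumulate gamma^t * reward
            discount *= gamma
            if done:
                break
        returns.append(g)
    return float(np.mean(returns))     # Monte Carlo estimate of V^pi(s0)
```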

  3. 1: Do we really need the value function?
  YES: Bellman optimality equation
  $V^*(s) = \max_\pi V^\pi(s)$
  $Q^*(s, a) = R(s) + \gamma \, \mathbb{E}_{s' \sim p(s, a, \cdot)} V^*(s')$
  $\pi^*(s) = \arg\max_a Q^*(s, a)$
  NO: Learning the value function $Q : S \times A \to \mathbb{R}$ is more complex than learning the policy $\pi : S \to A$
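Either way, acting from a (pseudo-)value relies on the greedy rule $\pi(s) = \arg\max_a Q(s, a)$. A minimal sketch for a finite action set, with illustrative names only:

```python
def greedy_policy(Q, actions):
    """Greedy policy pi(s) = argmax_a Q(s, a).

    `Q` is any callable Q(s, a) -> float; `actions` is a finite action set.
    """
    def pi(s):
        return max(actions, key=lambda a: Q(s, a))
    return pi
```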

  4. Value function and energy-based learning (LeCun et al., 2006)
  Goal: learn $h : X \to Y$, e.g. with $Y$ structured
  Energy-based learning
  1. Learn $g : X \times Y \to \mathbb{R}$ s.t. $g(x, y_x) > g(x, y')$ for all $y' \neq y_x$
  2. Set $h(x) = \arg\max_y g(x, y)$
  EbL pros and cons: − more complex, + more robust
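The energy-based learning recipe (learn a scoring function, then take its argmax) can be sketched as below. The scoring function `g`, the candidate set, and the pairwise hinge loss are illustrative placeholders: the hinge is one common way to enforce $g(x, y_x) > g(x, y')$, not necessarily the exact loss of LeCun et al.

```python
def ebl_hinge_loss(g, x, y_true, candidates, margin=1.0):
    """Pairwise hinge loss enforcing g(x, y_true) > g(x, y') + margin for y' != y_true."""
    return sum(max(0.0, margin - (g(x, y_true) - g(x, y)))
               for y in candidates if y != y_true)

def ebl_predict(g, x, candidates):
    """Energy-based decision rule: h(x) = argmax_y g(x, y)."""
    return max(candidates, key=lambda y: g(x, y))
```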

  5. 2: Which human expertise is required for RL?
                   Agent learns        Human designs / yields
  RL               Policy π*           S, A, R
  Inverse RL       Reward R            (optimal) trajectories [1]        + RL
  Preference RL    Policy return       ranked trajectories [2,3,4,5]     + DPS
  [1] Abbeel, P.: Apprenticeship Learning and Reinforcement Learning. PhD thesis, 2008
  [2] Fürnkranz, J. et al.: Preference-based reinforcement learning. MLJ, 2012
  [3] Wilson et al.: A Bayesian Approach for Policy Learning from Trajectory Preference Queries. NIPS 2012
  [4] Jain et al.: Learning Trajectory Preferences for Manipulators via Iterative Improvement. NIPS 2013
  [5] Akrour et al.: Programming by Feedback. ICML 2014

  6. This talk: relaxing the expertise requirement
  The expert is only required to know what can go wrong.
  Counter-trajectories: $CD \stackrel{\mathrm{def}}{=} (s_1, \ldots, s_T)$ s.t. $V^*(s_t) > V^*(s_{t+1})$, with $V^*$ the (unknown) optimal value function.
  Example
  ◮ Take a bicycle in equilibrium ($s_1$)
  ◮ Apply a random policy
  ◮ The bicycle soon falls down... ($s_T$)
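Such a counter-trajectory can be collected with almost no expertise: start from a good state and let an uninformed controller degrade it, as in the bicycle example. The sketch below assumes a hypothetical environment interface (`reset_to`, `step`, `sample_action`); it illustrates the idea and is not code from the paper.

```python
import numpy as np

def generate_counter_trajectory(env, s_start, length, rng=None):
    """Collect a counter-demonstration: start from a 'good' state and degrade it
    with a random policy (e.g. a bicycle in equilibrium that ends up falling).
    """
    rng = rng or np.random.default_rng()
    s = env.reset_to(s_start)
    trajectory = [s]
    for _ in range(length - 1):
        a = env.sample_action(rng)     # random, uninformed action
        s, _, done = env.step(a)
        trajectory.append(s)
        if done:                       # e.g. the bicycle has fallen
            break
    return trajectory                  # states ordered from better to worse
```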

  7. Anti-Imitation Policy Learning 1/3
  Given counter-trajectories $E = \{(s_{i,1}, \ldots, s_{i,T_i}),\ i = 1 \ldots n\}$,
  learn a pseudo-value $U^*$ s.t. $U^*(s_{i,t}) > U^*(s_{i,t+1})$.
  Formally: $U^* = \arg\min \{\mathrm{Loss}(U, E) + \mathcal{R}(U)\}$, with
  ◮ $\mathrm{Loss}(U, E) = \sum_i \sum_{t < t'} [U(s_{i,t'}) - U(s_{i,t}) + 1]_+$
  ◮ $\mathcal{R}(U)$ a regularization term
  If the transition model is known, the AiPOL policy is $\pi_{U^*}(s) = \arg\max_a \mathbb{E}_{s' \sim p(s,a,\cdot)} U^*(s')$
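A minimal sketch of this learning-to-rank step, using a linear pseudo-value and subgradient descent on the hinge ranking loss above. The paper itself uses a Ranking SVM with a Gaussian kernel (see slide 11); the feature map `phi` and the optimization loop are assumptions made for illustration.

```python
import numpy as np

def learn_pseudo_value(counter_trajs, phi, dim, lam=1e-3, lr=1e-2, epochs=100):
    """Fit a linear pseudo-value U(s) = w . phi(s) so that states earlier in each
    counter-trajectory rank above later ones (hinge ranking loss + L2 regularizer).
    """
    w = np.zeros(dim)
    for _ in range(epochs):
        grad = 2.0 * lam * w                         # gradient of lam * ||w||^2
        for traj in counter_trajs:
            feats = [phi(s) for s in traj]
            for t in range(len(feats)):
                for t2 in range(t + 1, len(feats)):
                    # hinge term [U(s_{t'}) - U(s_t) + 1]_+ is active when positive
                    if w @ feats[t2] - w @ feats[t] + 1.0 > 0.0:
                        grad += feats[t2] - feats[t]
        w -= lr * grad
    return lambda s: w @ phi(s)                      # the learned pseudo-value U
```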

  8. Anti-Imitation Policy Learning 2/3
  If the transition model is unknown:
  1. Given $U^*$, the pseudo-value function
  2. Given $G = \{(s_i, a_i, s'_i),\ i = 1 \ldots m\}$ s.t. $U^*(s'_i) > U^*(s'_{i+1})$,
  learn a pseudo-Q-value s.t. $Q^*(s_i, a_i) > Q^*(s_{i+1}, a_{i+1})$.
  Formally (learning to rank): $Q^* = \arg\min \{\mathrm{Loss}(Q, G) + \mathcal{R}(Q)\}$, with
  ◮ $\mathrm{Loss}(Q, G) = \sum_{i < j} [Q(s_j, a_j) - Q(s_i, a_i) + 1]_+$
  ◮ $\mathcal{R}(Q)$ a regularization term
  AiPOL policy: $\pi_{Q^*}(s) = \arg\max_a Q^*(s, a)$
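The same ranking machinery applies to the pseudo-Q-value. A linear sketch with an assumed state-action feature map `psi` is shown below, together with the resulting greedy AiPOL policy; again, the paper uses a kernelized Ranking SVM rather than this illustrative subgradient loop.

```python
import numpy as np

def learn_pseudo_q(transitions, U, psi, dim, lam=1e-3, lr=1e-2, epochs=100):
    """Fit a linear pseudo-Q, Q(s, a) = w . psi(s, a), from transitions (s, a, s')
    ranked by the pseudo-value of their successor: if U(s'_i) > U(s'_j), require
    Q(s_i, a_i) > Q(s_j, a_j) (same hinge ranking loss as for U).
    """
    # sort transitions by decreasing U(s'), so index i should outrank index j > i
    ordered = sorted(transitions, key=lambda tr: U(tr[2]), reverse=True)
    feats = [psi(s, a) for (s, a, _) in ordered]
    w = np.zeros(dim)
    for _ in range(epochs):
        grad = 2.0 * lam * w
        for i in range(len(feats)):
            for j in range(i + 1, len(feats)):
                if w @ feats[j] - w @ feats[i] + 1.0 > 0.0:   # constraint violated
                    grad += feats[j] - feats[i]
        w -= lr * grad
    q = lambda s, a: w @ psi(s, a)
    policy = lambda s, actions: max(actions, key=lambda a: q(s, a))  # AiPOL policy
    return q, policy
```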

  9. Anti-Imitation Policy Learning 3/3
  Proposition. If
  i) $V^*$ is continuous on $S$;
  ii) $U^*$ is monotonous w.r.t. $V^*$ on $S$;
  iii) there is a margin $M$ between the best and the other actions: $\forall a' \neq a = \pi_{U^*}(s),\ \mathbb{E}[U^*(s'_{s,a})] > \mathbb{E}[U^*(s'_{s,a'})] + M$;
  iv) $U^*$ is Lipschitz with constant $L$;
  v) the transition model is $\beta$-sub-Gaussian: $\forall t \in \mathbb{R}^+,\ \mathbb{P}\big(\| \mathbb{E}[s'_{s,a}] - s'_{s,a} \|_2 > t\big) < 2 e^{-\beta t^2}$;
  then, if $2L < M\beta$, $\pi_{U^*}$ is an optimal policy.

  10. Experimental validation
  Goals of the experiments
  ◮ How many CDs are needed?
  ◮ How much expertise is needed to generate the CDs (starting state, controller)?
  Experimental setting
                    Mountain Car    Bicycle    Pendulum
  # CDs             1               20         1
  CD length         1,000           5          1,000
  Starting state    target state    random     target state
  Controller        neutral         random     neutral

  11. Experimental setting, 2
  Learning to rank: Ranking SVM with a Gaussian kernel (Joachims 06)
                         Mountain Car    Bicycle     Pendulum
  U*    C1               10^3            10^3        10^-5
        1/σ^2            10^-3           10^-3.5     1
  Q*    # constraints    500             5,000       −
        C2               10^3            10^3        −
        1/σ^2            10^-3           10^-3       −
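For reference, a Gaussian-kernel Ranking SVM of this kind can be approximated by mapping states with random Fourier features and solving the pairwise ranking problem as a linear SVM on feature differences, as sketched below with scikit-learn. The hyperparameters play the role of the slide's C and 1/σ² (up to convention), but the values and the approximation itself are illustrative, not the authors' exact setup.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.svm import LinearSVC

def rank_svm_rbf(states, pairs, gamma=1e-3, C=1e3, n_components=300, seed=0):
    """Approximate a Gaussian-kernel Ranking SVM: RBFSampler approximates the
    RBF kernel with random Fourier features, then a linear SVM is trained on
    pairwise feature differences.

    `pairs` lists index pairs (i, j) meaning state i must rank above state j.
    """
    rbf = RBFSampler(gamma=gamma, n_components=n_components, random_state=seed)
    Z = rbf.fit_transform(np.asarray(states))
    X = np.array([Z[i] - Z[j] for (i, j) in pairs])
    X = np.vstack([X, -X])                          # symmetrize the pairwise data
    y = np.array([1] * len(pairs) + [-1] * len(pairs))
    svm = LinearSVC(C=C, fit_intercept=False).fit(X, y)
    # the ranking score of a single state s
    return lambda s: float(svm.decision_function(rbf.transform(np.atleast_2d(s)))[0])
```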

  12. Mountain Car, 1/3
  AiPOL vs SARSA depending on the friction level
  [Figure: steps to the goal (after learning) vs. friction level; Mountain car, 20 runs]

  13. Mountain Car, 2/3
  AiPOL pseudo-value vs SARSA value (1,000 iterations)
  [Figure: two surface plots over (Position, Speed): the AiPOL pseudo-value and the SARSA value function]

  14. Mountain Car, 3/3
  AiPOL policy vs SARSA policy
  [Figure: two policy maps over (Position, Speed); actions: forward, backward, neutral]

  15. Bicycle
  Sensitivity w.r.t. the number and length of CDs
  [Figure: % of success vs. number of CDs, for CD lengths 2, 5, and 10]

  16. Inverted Pendulum
  Sensitivity w.r.t. the Ranking-SVM hyper-parameters
  Interpretation
  ◮ Kernel width too small: no generalization (the pendulum does not reach the top)
  ◮ Kernel width too large: $U^*$ is imprecise (the pendulum goes to the top and falls on the other side)

  17. AiPOL: Discussion
  Pro
  ◮ Compared to inverse RL, AiPOL involves relaxed expertise requirements, with lower computational requirements (greedification as opposed to RL)
  Limitations
  ◮ Latency of transitions (e.g. bicycle): with extended transitions $(s_i, a_i, s'_i, a'_i, s''_i)$, require $Q(s_i, a_i) > Q(s_j, a_j)$ if $U^*(s''_i) > U^*(s''_j)$
  ◮ The cost of learning $Q^*$ is quadratic in the number of triplets
  Further work
  ◮ Non-reversible MDPs need to be addressed
