
Exploration Conscious Reinforcement Learning Revisited (Lior Shani, Yonathan Efroni, Shie Mannor) - PowerPoint Presentation Transcript



  1. Exploration Conscious Reinforcement Learning Revisited
  Lior Shani*, Yonathan Efroni*, Shie Mannor
  Technion – Israel Institute of Technology

  2–3. Why?
  • To learn a good policy, an RL agent must explore!
  • However, exploration can cause hazardous behavior during training.
  (Cartoon: an agent declares "I LOVE ε-GREEDY", then laments "Damn you, Exploration!")

  4–5. Exploration Conscious Reinforcement Learning
  • Objective: find the optimal policy while knowing that exploration might occur.
  • For example, ε-greedy exploration (α = ε):
    $\pi^*_\alpha \in \operatorname{argmax}_{\pi \in \Pi}\ \mathbb{E}^{(1-\alpha)\pi + \alpha\pi^0}\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t)\right]$
    where $\pi^0$ is the fixed exploration policy (the uniform policy for ε-greedy).
  • Solving the exploration-conscious problem = solving an MDP.
  • We describe a bias vs. error-sensitivity tradeoff in α.
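To make the objective concrete, here is a minimal Monte-Carlo sketch in Python that estimates the exploration-conscious return of a policy by rolling out the mixture $(1-\alpha)\pi + \alpha\pi^0$. The `env`, `pi`, and `pi0_sample` interfaces are hypothetical stand-ins, not part of the talk.

    import numpy as np

    def mixture_return(env, pi, pi0_sample, alpha, gamma, horizon, rng):
        # One-episode Monte-Carlo estimate of the exploration-conscious
        # objective: act with pi w.p. 1 - alpha and with the exploration
        # policy pi^0 otherwise, accumulating discounted reward.
        s = env.reset()
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = pi(s) if rng.random() >= alpha else pi0_sample(s)
            s, r, done = env.step(a)
            total += discount * r
            discount *= gamma
            if done:
                break
        return total

    # Example: average over rollouts to estimate the objective for pi.
    # returns = [mixture_return(env, pi, pi0, 0.1, 0.99, 1000,
    #                           np.random.default_rng(i)) for i in range(100)]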

  6. Exploration Conscious Reinforcement Learning
  • Objective: find the optimal policy while knowing that exploration might occur.
  (Cartoon: "I'm Exploration Conscious.")

  7–12. Fixed Exploration Schemes (e.g., ε-greedy)
  The interaction loop: choose the greedy action → draw the exploratory action → act → receive the reward and next state.
  • Choose greedy action: $a^{greedy} \in \operatorname{argmax}_a Q^\pi_\alpha(s, a)$
  • Draw exploratory action (for α-greedy): $a^{act} = a^{greedy}$ w.p. $1-\alpha$; otherwise $a^{act} \sim \pi^0$.
  • Act with $a^{act}$ and receive $r, s'$.
  • Normally used information: $(s, a^{act}, r, s')$.
  • Using information about the exploration process: $(s, a^{greedy}, a^{act}, r, s')$ (sketched below).
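A small Python sketch of one loop iteration in the tabular case (illustrative: `Q` is assumed to be a NumPy array indexed as `Q[s, a]`, and $\pi^0$ is uniform). It returns both actions so the learner can use information about the exploration process:

    import numpy as np

    def alpha_greedy_step(Q, s, alpha, rng):
        # Choose greedy action: a_greedy in argmax_a Q(s, a).
        a_greedy = int(np.argmax(Q[s]))
        # Draw exploratory action: keep a_greedy w.p. 1 - alpha,
        # otherwise sample from pi^0, here uniform over actions.
        if rng.random() < alpha:
            a_act = int(rng.integers(Q.shape[1]))
        else:
            a_act = a_greedy
        return a_greedy, a_act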

  13–14. Two Approaches – the Expected approach
  1. Update $Q^\pi_\alpha(s_t, a^{act}_t)$.
  2. Expect that the agent might explore in the next state:
     $Q^\pi_\alpha(s_t, a^{act}_t) \mathrel{+}= \eta\big[\, r_t + \gamma\, \mathbb{E}_{a \sim (1-\alpha)\pi + \alpha\pi^0}\, Q^\pi_\alpha(s_{t+1}, a) - Q^\pi_\alpha(s_t, a^{act}_t) \,\big]$
  • Calculating expectations can be hard.
  • Requires sampling in the continuous case!
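In the tabular case with a uniform $\pi^0$, the expected update can be sketched as follows (my illustration, not code from the talk); note how the target mixes the greedy value with the mean over actions:

    import numpy as np

    def expected_update(Q, s, a_act, r, s_next, alpha, gamma, eta):
        # Bootstrap target: E_{a ~ (1-alpha)*greedy + alpha*uniform} Q(s', a).
        expected_next = (1 - alpha) * np.max(Q[s_next]) + alpha * Q[s_next].mean()
        # TD update on the acted action.
        Q[s, a_act] += eta * (r + gamma * expected_next - Q[s, a_act])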

  15–16. Two Approaches – the Surrogate approach
  • Exploration is incorporated into the environment!
  1. Update $Q^\pi_\alpha(s_t, a^{greedy}_t)$.
  2. The reward and next state $r_t, s_{t+1}$ are generated by the acted action $a^{act}_t$:
     $Q^\pi_\alpha(s_t, a^{greedy}_t) \mathrel{+}= \eta\big[\, r_t + \gamma\, Q^\pi_\alpha(s_{t+1}, a^{greedy}_{t+1}) - Q^\pi_\alpha(s_t, a^{greedy}_t) \,\big]$
  • NO NEED TO SAMPLE! (A tabular sketch follows.)
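A matching tabular sketch (again my illustration): the update is SARSA-like over greedy actions, while the transition itself was produced by the acted action, so no expectation and no sampling are needed:

    def surrogate_update(Q, s, a_greedy, r, s_next, a_greedy_next, gamma, eta):
        # r and s_next were generated by the acted action a_act, but the
        # surrogate view treats exploration as part of the environment and
        # updates the greedy action toward the next greedy action's value.
        target = r + gamma * Q[s_next, a_greedy_next]
        Q[s, a_greedy] += eta * (target - Q[s, a_greedy])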

  17. Deep RL Experimental Results
  [Plots: training and evaluation performance curves.]

  18–19. Summary
  • We define Exploration Conscious RL and analyze its properties.
  • Exploration Conscious RL can improve performance in both the training and evaluation regimes.
  • Conclusion: Exploration-Conscious RL, and specifically the Surrogate approach, can easily help improve a variety of RL algorithms.
  SEE YOU AT POSTER #90
