Exploration Conscious Reinforcement Learning Revisited
Lior Shani*, Yonathan Efroni*, Shie Mannor
Technion - Israel Institute of Technology
Why?
• To learn a good policy, an RL agent must explore!
• However, exploration can cause hazardous behavior during training.
(Speech bubbles: "I LOVE ε-GREEDY" / "Damn you, Exploration!")
Exploration Conscious Reinforcement Learning
• Objective: Find the optimal policy knowing that exploration might occur ("I'm Exploration Conscious").
• For example, ε-greedy exploration (α = ε):

  π*_α ∈ argmax_{π∈Π} E_{(1−α)π + α·π̄} [ Σ_{t=0}^∞ γ^t r(s_t, a_t) ],

  where π̄ is the fixed exploration policy (uniform over actions for ε-greedy).
• Solving the Exploration-Conscious problem = solving an MDP.
• We describe a bias-error sensitivity tradeoff in α.
Fixed Exploration Schemes (e.g. ε-greedy)
Pipeline: choose greedy action a_greedy → draw exploratory action a_act → act with a_act → receive r, s'.
• a_greedy ∈ argmax_a Q_α(s, a)
• For ε-greedy: a_act ← a_greedy w.p. 1 − α, and an action drawn from the exploration policy (uniform) otherwise.
• Normally used information: (s, a_act, r, s').
• Using information about the exploration process: (s, a_greedy, a_act, r, s') (see the sketch below).
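Below is a minimal sketch of this acting loop in a tabular setting, using hypothetical names (env, Q, rng) and assuming uniform exploration and a Gym-style env.step returning (s', r, done, info); it records both the greedy and the acted action so that either of the two approaches on the next slides can use them.

```python
import numpy as np

def act_and_record(env, Q, s, alpha, rng):
    """One exploration-conscious interaction step (sketch).

    Q is an |S| x |A| table of Q_alpha estimates; alpha is the
    exploration probability (alpha = epsilon for epsilon-greedy).
    Returns (s, a_greedy, a_act, r, s_next).
    """
    a_greedy = int(np.argmax(Q[s]))              # greedy w.r.t. Q_alpha
    if rng.random() < alpha:                     # explore: uniform random action
        a_act = int(rng.integers(Q.shape[1]))
    else:                                        # exploit: act greedily
        a_act = a_greedy
    s_next, r, done, info = env.step(a_act)      # the environment sees a_act
    return s, a_greedy, a_act, r, s_next
```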
Two Approaches - Expected approach
1. Update Q_α(s_t, a_t^act).
2. Expect that the agent might explore in the next state:

  Q_α(s_t, a_t^act) += η [ r_t + γ E_{a ∼ (1−α)·greedy + α·π̄} [ Q_α(s_{t+1}, a) ] − Q_α(s_t, a_t^act) ]

• Calculating expectations can be hard.
• It requires sampling in the continuous case! (A tabular sketch follows.)
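A hedged sketch of the Expected-approach update in the tabular, discrete-action case, assuming uniform exploration (so the expectation over exploratory actions reduces to a simple average over the Q row); Q, eta, gamma are hypothetical names:

```python
import numpy as np

def expected_update(Q, s, a_act, r, s_next, alpha, gamma, eta):
    """Expected-approach update of Q_alpha at (s, a_act) (sketch).

    The bootstrap target averages over the next action drawn from the
    mixture (1 - alpha) * greedy + alpha * uniform.
    """
    greedy_value = np.max(Q[s_next])    # value if the agent exploits next
    explore_value = np.mean(Q[s_next])  # value under uniform exploration next
    expected_next = (1.0 - alpha) * greedy_value + alpha * explore_value
    Q[s, a_act] += eta * (r + gamma * expected_next - Q[s, a_act])
    return Q
```

With continuous actions the average over Q[s_next] is no longer a closed-form mean over a finite table, which is exactly the sampling difficulty noted above.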
Two Approaches - Surrogate approach
• Exploration is incorporated into the environment!
1. Update Q_α(s_t, a_t^greedy).
2. The reward r_t and next state s_{t+1} are generated by the acted action a_t^act:

  Q_α(s_t, a_t^greedy) += η [ r_t + γ Q_α(s_{t+1}, a_{t+1}^greedy) − Q_α(s_t, a_t^greedy) ]

• NO NEED TO SAMPLE! (A matching sketch follows.)
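A matching tabular sketch of the Surrogate-approach update (same hypothetical names as above). The updated table entry is keyed on the greedy action, while r and s_next come from the acted, possibly exploratory, action, so the exploration step is absorbed into the surrogate dynamics and no expectation over actions is needed:

```python
import numpy as np

def surrogate_update(Q, s, a_greedy, r, s_next, gamma, eta):
    """Surrogate-approach update of Q_alpha at (s, a_greedy) (sketch).

    r and s_next were produced by the acted action a_act, but the
    updated entry is (s, a_greedy): exploration is treated as part of
    the environment rather than of the policy.
    """
    next_greedy_value = np.max(Q[s_next])  # Q_alpha(s_{t+1}, a_{t+1}^greedy)
    Q[s, a_greedy] += eta * (r + gamma * next_greedy_value - Q[s, a_greedy])
    return Q
```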
Deep RL Experimental Results
(Plots: performance during training and during evaluation.)
Summary
• We define Exploration Conscious RL and analyze its properties.
• Exploration Conscious RL can improve performance in both the training and the evaluation regimes.
• Conclusion: Exploration-Conscious RL, and specifically the Surrogate approach, can easily help improve a variety of RL algorithms.

SEE YOU AT POSTER #90