Distributional Reinforcement Learning for Efficient Exploration




  1. Distributional Reinforcement Learning for Efficient Exploration. Hengshuai Yao, Huawei Hi-Silicon, June 2019.

  2. The exploration problem
  ◮ Exploration is a long-standing problem in Reinforcement Learning.
  ◮ One major fundamental principle is optimism in the face of uncertainty.
  ◮ Both count-based methods and Bayesian methods follow this optimism principle.
  ◮ Here the uncertainty refers to parametric uncertainty, which arises from the variance of parameter estimates computed from finite samples.

  3. Intrinsic uncertainties
  ◮ Estimation is not the only source of uncertainty.
  ◮ Most environments are themselves stochastic.
  ◮ Even in a deterministic game like Go, the opponent is a huge source of uncertainty.
  ◮ The learning process cannot eliminate intrinsic uncertainty.

  4. A naive approach
  ◮ A naive approach to exploration would be to use the variance of the estimated return distribution as a bonus.
  ◮ Consider a multi-armed bandit environment with 10 arms, where the reward of arm k follows a normal distribution N(µ_k, σ_k).
  ◮ In the setting of multi-armed bandits, this approach leads to picking the arm a such that

      a = argmax_k ( µ̄_k + c σ̄_k ),   (1)

    where µ̄_k and σ̄²_k are the estimated mean and variance of the k-th arm, computed from the corresponding quantile distribution estimate.
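  As a concrete illustration, here is a minimal sketch of the naive bonus (1) on a 10-armed Gaussian bandit. The environment, the constant c, and the use of empirical mean/variance estimates in place of quantile-based estimates are illustrative assumptions, not details from the deck.

      import numpy as np

      # Hypothetical 10-armed Gaussian bandit, used only to illustrate the naive bonus (1).
      rng = np.random.default_rng(0)
      K, c, steps = 10, 2.0, 5000
      true_mu = rng.normal(0.0, 1.0, size=K)        # true means
      true_sigma = rng.uniform(0.1, 2.0, size=K)    # true stddevs (intrinsic uncertainty)
      rewards = [[] for _ in range(K)]              # observed rewards per arm

      def naive_pick():
          # Naive bonus: estimated mean plus c times the estimated stddev.
          for k in range(K):
              if len(rewards[k]) < 2:
                  return k                          # pull every arm at least twice first
          scores = [np.mean(rewards[k]) + c * np.std(rewards[k]) for k in range(K)]
          return int(np.argmax(scores))

      for t in range(steps):
          a = naive_pick()
          rewards[a].append(rng.normal(true_mu[a], true_sigma[a]))

  Because the estimated σ̄_k converges to the true σ_k rather than to zero, the bonus for a noisy arm never decays, which is exactly the failure mode discussed on the next slide.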

  5. The naive approach is not optimal
  [Figure: naive exploration bonus over time steps, showing intrinsic uncertainty and parametric uncertainty]
  ◮ The naive approach favors actions with high intrinsic uncertainty forever.

  6. The motivation for a decaying exploration bonus
  [Figure: decaying bonus over time steps, showing intrinsic uncertainty and parametric uncertainty]
  ◮ To suppress the intrinsic uncertainty, we propose a decaying schedule in the form of a multiplier.

  7. The DLTV exploration bonus
  ◮ For instantiating optimism in the face of uncertainty, the upper-tail variability is more relevant than the lower-tail variability.
  ◮ To increase stability, we use a left-truncated measure of the variability, σ²₊.
  ◮ Combining the decaying schedule with σ²₊, we obtain a new exploration bonus for picking an action, which we call Decaying Left Truncated Variance (DLTV):

      c_t √(σ²₊),   where c_t = c √(log t / t).
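  For reference, below is a minimal sketch of the DLTV action-selection step, assuming the value estimates are N ordered quantiles per action (as in QR-DQN) and that σ²₊ is computed from the deviations of the upper half of the quantiles from their median; the exact normalization and the constant c are illustrative assumptions, not taken from the deck.

      import numpy as np

      def dltv_action(quantiles, t, c=50.0):
          # quantiles: array of shape (num_actions, N), each row sorted; t >= 1 is the time step.
          num_actions, N = quantiles.shape
          q_mean = quantiles.mean(axis=1)                     # mean return per action
          median = quantiles[:, N // 2]                       # per-action median quantile
          upper = quantiles[:, N // 2:]                       # upper (left-truncated) tail
          sigma2_plus = ((upper - median[:, None]) ** 2).sum(axis=1) / (2 * N)
          c_t = c * np.sqrt(np.log(t) / t)                    # decaying schedule
          return int(np.argmax(q_mean + c_t * np.sqrt(sigma2_plus)))

      # Usage with 4 actions and 32 quantiles each (random placeholder values).
      theta = np.sort(np.random.default_rng(1).normal(size=(4, 32)), axis=1)
      action = dltv_action(theta, t=100)

  In an Atari setting this rule would sit on top of a QR-DQN-style agent in place of ε-greedy action selection; the √(log t / t) schedule suppresses the intrinsic-variance part of the bonus over time while parametric uncertainty still drives exploration early on.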

  8. Results
  ◮ Our approach achieved a 483% average gain in cumulative rewards on the set of 49 Atari games.
  ◮ None of the learning curves exhibit plummeting behaviour.
  ◮ Notably, the performance gain is obtained in hard games such as Venture, PrivateEye, Montezuma's Revenge and Seaquest.
  [Figure: learning curves comparing DLTV and QR-DQN-1]

  9. Application to driving safety
  ◮ A particularly interesting application of the (distributional) RL approach is driving safety.
  ◮ DLTV learns significantly faster than DQN and QR-DQN, achieving higher rewards for safe driving.

  10. Summary
  ◮ Exploration is important.
  ◮ The guiding principle is optimism in the face of uncertainty.
  ◮ Optimism without decaying is not optimal.
  ◮ The truncated measure is more stable.
  ◮ Combining the decaying schedule and the truncated measure, we obtain DLTV.
  ◮ And it works.
