SLIDE 1

Distributional Reinforcement Learning for Efficient Exploration

Hengshuai Yao

Huawei Hi-Silicon

June, 2019

SLIDE 2

The exploration problem

◮ Exploration is a long-standing problem in Reinforcement Learning.

◮ One fundamental principle is optimism in the face of uncertainty.

◮ Both count-based methods and Bayesian methods follow this optimism principle.

◮ Here the uncertainty refers to parametric uncertainty, which arises from the variance in the estimates of certain parameters given finite samples.

SLIDE 3

Intrinsic uncertainties

◮ Estimation is not the only source of uncertainty.

◮ Most environments are themselves stochastic.

◮ Even in a deterministic game like Go, the opponent is a huge source of uncertainty.

◮ The learning process cannot eliminate intrinsic uncertainty.

SLIDE 4

A naive approach

◮ A naive approach to exploration would be to use the variance of the estimated distribution as a bonus.

◮ Consider a multi-armed bandit environment with 10 arms, where each arm's reward follows a normal distribution N(µ_k, σ_k²).

◮ In the setting of multi-armed bandits, this approach leads to picking the arm a such that

    a = argmax_k { µ̄_k + c·σ_k },   (1)

where µ̄_k and σ_k² are the estimated mean and variance of the k-th arm, computed from the corresponding quantile distribution estimation.
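As a quick illustration (not part of the talk), here is a minimal sketch of the rule in Eq. (1). The 10-arm Gaussian bandit comes from the slide; the true means and standard deviations, the quantile count, the horizon, and the coefficient c are assumptions made for the demo:

```python
# Minimal sketch of the naive variance-bonus rule in Eq. (1); the arm
# parameters, quantile count, and horizon are assumptions for this demo.
import numpy as np

rng = np.random.default_rng(0)
K = 10                                  # number of arms (from the slide)
mu = rng.normal(0.0, 1.0, size=K)       # true means (assumed)
sigma = rng.uniform(0.5, 2.0, size=K)   # true std devs (assumed)
c = 1.0                                 # exploration coefficient (assumed)
rewards = [[] for _ in range(K)]

def quantile_stats(samples, n_quantiles=51):
    """Mean and variance of an empirical quantile distribution,
    standing in for a learned quantile estimate."""
    taus = (np.arange(n_quantiles) + 0.5) / n_quantiles
    q = np.quantile(samples, taus)
    return q.mean(), q.var()

for t in range(1, 2001):
    if t <= K:                          # pull each arm once to initialize
        a = t - 1
    else:
        scores = []
        for k in range(K):
            m, v = quantile_stats(np.asarray(rewards[k]))
            scores.append(m + c * np.sqrt(v))   # Eq. (1): mu_k + c * sigma_k
        a = int(np.argmax(scores))
    rewards[a].append(rng.normal(mu[a], sigma[a]))

print("pull counts:", [len(r) for r in rewards])
```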

SLIDE 5

The naive approach is not optimal

[Figure: parametric uncertainty and intrinsic uncertainty vs. time steps, with the naive exploration bonus]

◮ The naive approach favors actions with high intrinsic uncertainty forever.
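The gap between the two uncertainties is easy to check numerically. The sketch below (an illustration, not from the talk; the noise level is assumed) shows that the standard error of the mean, i.e. parametric uncertainty, shrinks as 1/√n, while the sample standard deviation, i.e. intrinsic uncertainty, converges to the true σ and never vanishes, so a bonus built on σ alone keeps favoring noisy arms:

```python
# Parametric uncertainty (std error of the mean) decays with more samples;
# intrinsic uncertainty (sample std) converges to the true sigma instead.
import numpy as np

rng = np.random.default_rng(1)
sigma_true = 2.0                          # assumed noise level for the demo
samples = rng.normal(0.0, sigma_true, size=10_000)
for n in (10, 100, 1_000, 10_000):
    x = samples[:n]
    s = x.std(ddof=1)
    print(f"n={n:>6}  std error of mean={s / np.sqrt(n):.4f}  sample std={s:.4f}")
```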

SLIDE 6

Motivation for a decaying exploration bonus

[Figure: parametric uncertainty and intrinsic uncertainty vs. time steps, with the decaying bonus]

◮ To suppress the intrinsic uncertainty, we propose a decaying schedule in the form of a multiplier.

SLIDE 7

The DLTV exploration bonus

◮ To instantiate optimism in the face of uncertainty, the upper-tail variability is more relevant than the lower tail.

◮ To increase stability, we use a left-truncated measure of the variability, σ₊².

◮ Combining the decaying schedule with σ₊², we obtain a new exploration bonus for picking an action, which we call Decaying Left Truncated Variance (DLTV):

    c_t · √(σ₊²),   where c_t = c·√(log t / t).
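A minimal sketch of how the DLTV bonus could be computed from quantile estimates (e.g., the outputs of a QR-DQN head) is shown below. This is an illustration, not the authors' implementation: in particular, σ₊² is taken here to be the spread of the upper-half quantiles around the median, which is one natural reading of "left truncated" (the slide does not spell out the definition), and the function names are made up for the demo:

```python
# Sketch of DLTV action selection from quantile estimates; the definition
# of sigma^2_+ below (upper-half quantiles around the median) is an
# assumption, as the slide does not spell it out.
import numpy as np

def left_truncated_variance(theta):
    """sigma^2_+ : mean squared deviation of the quantiles at or above
    the median, measured around the median quantile."""
    theta = np.sort(theta)
    median = theta[len(theta) // 2]
    upper = theta[len(theta) // 2:]
    return np.mean((upper - median) ** 2)

def dltv_action(theta_per_action, t, c=1.0):
    """Pick argmax_a [ Q(a) + c_t * sqrt(sigma^2_+(a)) ] with the
    decaying schedule c_t = c * sqrt(log t / t) from the slide."""
    c_t = c * np.sqrt(np.log(t) / t)
    scores = [theta.mean() + c_t * np.sqrt(left_truncated_variance(theta))
              for theta in theta_per_action]
    return int(np.argmax(scores))

# Usage: 51 quantiles per action for a 4-action state (made-up numbers).
rng = np.random.default_rng(2)
theta_per_action = [np.sort(rng.normal(q, 1.0, size=51)) for q in range(4)]
print(dltv_action(theta_per_action, t=100))
```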

SLIDE 8

Results

◮ Our approach achieved a 483% average gain in cumulative rewards on the set of 49 Atari games.

◮ None of the learning curves exhibits plummeting behaviour.

◮ Notably, the performance gain is obtained in hard games such as Venture, PrivateEye, Montezuma's Revenge, and Seaquest.

[Figure: learning curves for DLTV vs. QR-DQN-1]

SLIDE 9

Application on driving safety

◮ A particularly interesting application of the (distributional) RL approach is driving safety.

◮ DLTV learns significantly faster than DQN and QR-DQN, achieving higher rewards for safe driving.

SLIDE 10

Summary

◮ Exploration is important.

◮ The guiding principle is optimism in the face of uncertainty.

◮ Optimism without decaying is not optimal.

◮ The truncated measure is more stable.

◮ Combining the decaying schedule and the truncated variance measure, we obtain DLTV.

◮ And it works.
