SLIDE 1

Distributional Reinforcement Learning for Efficient Exploration

Hengshuai Yao

Huawei Hi-Silicon

June, 2019

SLIDE 2

The exploration problem

◮ Exploration is a long-standing problem in Reinforcement Learning.

◮ One fundamental principle is optimism in the face of uncertainty.

◮ Both count-based methods and Bayesian methods follow this optimism principle.

◮ Here the uncertainty refers to parametric uncertainty, which arises from the variance in the estimates of certain parameters given finite samples.

SLIDE 3

Intrinsic uncertainties

◮ Estimation is not the only source of uncertainty.

◮ Most environments are themselves stochastic.

◮ Even in a deterministic game like Go, the opponent is a huge source of uncertainty.

◮ The learning process cannot eliminate intrinsic uncertainty.

SLIDE 4

A naive approach

◮ A naive approach to exploration would be to use the variance of the estimated distribution as a bonus.

◮ Consider a multi-armed bandit environment with 10 arms, where each arm's reward follows a normal distribution N(µ_k, σ_k²).

◮ In the setting of multi-armed bandits, this approach leads to picking the arm a such that

    a = argmax_k { µ̄_k + c·σ_k },   (1)

where µ̄_k and σ_k² are the estimated mean and variance of the k-th arm, computed from the corresponding quantile distribution estimation.
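As a quick illustration (not part of the talk), here is a minimal sketch of the rule in Eq. (1). The 10-arm Gaussian bandit comes from the slide; the true means and standard deviations, the quantile count, the horizon, and the coefficient c are assumptions made for the demo:

```python
# Minimal sketch of the naive variance-bonus rule in Eq. (1); the arm
# parameters, quantile count, and horizon are assumptions for this demo.
import numpy as np

rng = np.random.default_rng(0)
K = 10                                  # number of arms (from the slide)
mu = rng.normal(0.0, 1.0, size=K)       # true means (assumed)
sigma = rng.uniform(0.5, 2.0, size=K)   # true std devs (assumed)
c = 1.0                                 # exploration coefficient (assumed)
rewards = [[] for _ in range(K)]

def quantile_stats(samples, n_quantiles=51):
    """Mean and variance of an empirical quantile distribution,
    standing in for a learned quantile estimate."""
    taus = (np.arange(n_quantiles) + 0.5) / n_quantiles
    q = np.quantile(samples, taus)
    return q.mean(), q.var()

for t in range(1, 2001):
    if t <= K:                          # pull each arm once to initialize
        a = t - 1
    else:
        scores = []
        for k in range(K):
            m, v = quantile_stats(np.asarray(rewards[k]))
            scores.append(m + c * np.sqrt(v))   # Eq. (1): mu_k + c * sigma_k
        a = int(np.argmax(scores))
    rewards[a].append(rng.normal(mu[a], sigma[a]))

print("pull counts:", [len(r) for r in rewards])
```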

SLIDE 5

The naive approach is not optimal

[Figure: parametric uncertainty and intrinsic uncertainty vs. time steps, with the naive exploration bonus]

◮ The naive approach favors actions with high intrinsic uncertainty forever.
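The gap between the two uncertainties is easy to check numerically. The sketch below (an illustration, not from the talk; the noise level is assumed) shows that the standard error of the mean, i.e. parametric uncertainty, shrinks as 1/√n, while the sample standard deviation, i.e. intrinsic uncertainty, converges to the true σ and never vanishes, so a bonus built on σ alone keeps favoring noisy arms:

```python
# Parametric uncertainty (std error of the mean) decays with more samples;
# intrinsic uncertainty (sample std) converges to the true sigma instead.
import numpy as np

rng = np.random.default_rng(1)
sigma_true = 2.0                          # assumed noise level for the demo
samples = rng.normal(0.0, sigma_true, size=10_000)
for n in (10, 100, 1_000, 10_000):
    x = samples[:n]
    s = x.std(ddof=1)
    print(f"n={n:>6}  std error of mean={s / np.sqrt(n):.4f}  sample std={s:.4f}")
```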

SLIDE 6

Motivation for a decaying exploration bonus

[Figure: parametric uncertainty and intrinsic uncertainty vs. time steps, with the decaying bonus]

◮ To suppress the intrinsic uncertainty, we propose a decaying schedule in the form of a multiplier.

SLIDE 7

The DLTV exploration bonus

◮ To instantiate optimism in the face of uncertainty, the upper-tail variability is more relevant than the lower tail.

◮ To increase stability, we use a left-truncated measure of the variability, σ₊².

◮ Combining the decaying schedule with σ₊², we obtain a new exploration bonus for picking an action, which we call Decaying Left Truncated Variance (DLTV):

    c_t · √(σ₊²),   where c_t = c·√(log t / t).
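A minimal sketch of how the DLTV bonus could be computed from quantile estimates (e.g., the outputs of a QR-DQN head) is shown below. This is an illustration, not the authors' implementation: in particular, σ₊² is taken here to be the spread of the upper-half quantiles around the median, which is one natural reading of "left truncated" (the slide does not spell out the definition), and the function names are made up for the demo:

```python
# Sketch of DLTV action selection from quantile estimates; the definition
# of sigma^2_+ below (upper-half quantiles around the median) is an
# assumption, as the slide does not spell it out.
import numpy as np

def left_truncated_variance(theta):
    """sigma^2_+ : mean squared deviation of the quantiles at or above
    the median, measured around the median quantile."""
    theta = np.sort(theta)
    median = theta[len(theta) // 2]
    upper = theta[len(theta) // 2:]
    return np.mean((upper - median) ** 2)

def dltv_action(theta_per_action, t, c=1.0):
    """Pick argmax_a [ Q(a) + c_t * sqrt(sigma^2_+(a)) ] with the
    decaying schedule c_t = c * sqrt(log t / t) from the slide."""
    c_t = c * np.sqrt(np.log(t) / t)
    scores = [theta.mean() + c_t * np.sqrt(left_truncated_variance(theta))
              for theta in theta_per_action]
    return int(np.argmax(scores))

# Usage: 51 quantiles per action for a 4-action state (made-up numbers).
rng = np.random.default_rng(2)
theta_per_action = [np.sort(rng.normal(q, 1.0, size=51)) for q in range(4)]
print(dltv_action(theta_per_action, t=100))
```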

SLIDE 8

Results

◮ Our approach achieved a 483% average gain in cumulative rewards on the set of 49 Atari games.

◮ None of the learning curves exhibits plummeting behaviour.

◮ Notably, the performance gain is obtained in hard games such as Venture, PrivateEye, Montezuma's Revenge, and Seaquest.

[Figure: learning curves for DLTV vs. QR-DQN-1]

SLIDE 9

Application on driving safety

◮ A particularly interesting application of the (distributional) RL approach is driving safety.

◮ DLTV learns significantly faster than DQN and QR-DQN, achieving higher rewards for safe driving.

SLIDE 10

Summary

◮ Exploration is important.

◮ The guiding principle is optimism in the face of uncertainty.

◮ Optimism without decaying is not optimal.

◮ The truncated measure is more stable.

◮ Combining the decaying schedule and the truncated variance measure, we obtain DLTV.

◮ And it works.
