

SLIDE 1

Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics

Arsenii Kuznetsov¹, Pavel Shvechikov¹,², Alexander Grishin¹,³, Dmitry Vetrov¹,³

¹ Samsung AI Center, Moscow   ² Higher School of Economics, Moscow   ³ Samsung-HSE Laboratory

SLIDE 2

Overestimation bias in off-policy learning

1. Value estimates are imprecise
2. Agent pursues erroneous estimates
3. Errors propagate through time
4. Performance degrades

SLIDE 3

Overestimation bias in off-policy learning

1. Value estimates are imprecise
2. Agent pursues erroneous estimates
3. Errors propagate through time
4. Performance degrades

We propose a novel method: Truncated Quantile Critics (TQC).

SLIDE 4

Key elements of TQC

1. Distributional critics

  • Impressive empirical performance
  • Capture info about return variance

SLIDE 5

Key elements of TQC

1. Distributional critics

  • Impressive empirical performance
  • Capture info about return variance

2. Ensembling of the critics

  • Increases performance and stability

SLIDE 6

Key elements of TQC

1. Distributional critics

  • Impressive empirical performance
  • Capture info about return variance

2. Ensembling of the critics

  • Increases performance and stability

3. Truncating the mixture of distributions

  • Alleviates overestimation

SLIDE 7

TQC's novelties

1. Incorporates stochasticity of returns into the overestimation control

SLIDE 8

TQC's novelties

1. Incorporates stochasticity of returns into the overestimation control
2. Provides a fine-grained and adjustable level of overestimation control

SLIDE 9

TQC's novelties

1. Incorporates stochasticity of returns into the overestimation control
2. Provides a fine-grained and adjustable level of overestimation control
3. Decouples the overestimation control from the number of critics

SLIDE 10

TQC is a new SOTA on MuJoCo

SLIDE 11

TQC is a new SOTA on MuJoCo

SLIDE 12

Stochastic Continuous Control in MDP

[Diagram: agent-environment loop: a stochastic agent receives the state and reward from the environment and emits a continuous action]

SLIDE 13

Stochastic Continuous Control in MDP

[Diagram repeated from slide 12]

SLIDE 14

Stochastic Continuous Control in MDP

[Diagram repeated from slide 12]
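A minimal sketch of this interaction loop, assuming a Gym-style MuJoCo environment with the classic (state, reward, done, info) step API; the environment name and the uniform action sampling are placeholders, not part of the slides:

    import gym

    # Continuous-action locomotion task from the MuJoCo suite (name is an assumption)
    env = gym.make("Hopper-v3")
    state = env.reset()
    done, episode_return = False, 0.0
    while not done:
        # A stochastic agent would sample a continuous action from pi(a | state);
        # uniform sampling stands in for the policy here.
        action = env.action_space.sample()
        state, reward, done, info = env.step(action)
        episode_return += reward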

SLIDE 15

Overestimation: intuition

[Plot: value Q(a) vs. actions a, showing noisy samples around the true Q]

SLIDE 16

Overestimation: intuition

[Plot: value Q(a) vs. actions a, showing noisy samples, the true Q, and the approximation Q fitted to them]

SLIDE 17

Overestimation: intuition

[Plot as on slide 16]

Sources of distortion, U:

1. Insufficient data
2. Limited model capacity
3. SGD noise
4. Environment's stochasticity
5. Ongoing policy changes

SLIDE 18

Overestimation: intuition

[Plot as on slide 16, additionally marking the approximate maximizer a_approx and the error between the approximate and true values at a_approx]

Sources of distortion, U:

1. Insufficient data
2. Limited model capacity
3. SGD noise
4. Environment's stochasticity
5. Ongoing policy changes
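A tiny numerical illustration of this effect (not from the slides; the values are arbitrary): averaging the maximum of noisy value estimates overshoots the true maximum.

    import numpy as np

    rng = np.random.default_rng(0)
    q_true = np.array([0.0, 0.5, 1.0])                  # true Q(a) for three candidate actions
    noise = rng.normal(0.0, 1.0, size=(100_000, 3))     # zero-mean distortion U(a)

    print(np.max(q_true))                               # true maximum: 1.0
    print(np.mean(np.max(q_true + noise, axis=1)))      # average predicted maximum: well above 1.0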

SLIDE 19

Overestimation: mathematical model [1]

Predicted maximum:

[1]: Thrun, Sebastian, and Anton Schwartz. "Issues in using function approximation for reinforcement learning."

SLIDE 20

Overestimation: mathematical model [1]

Predicted maximum averaged over zero-mean distortion:

[1]: Thrun, Sebastian, and Anton Schwartz. "Issues in using function approximation for reinforcement learning."

SLIDE 21

Overestimation: mathematical model [1]

Predicted maximum averaged over zero-mean distortion (Jensen's inequality):

[1]: Thrun, Sebastian, and Anton Schwartz. "Issues in using function approximation for reinforcement learning."

SLIDE 22

Overestimation: mathematical model [1]

Predicted maximum averaged over zero-mean distortion:

[1]: Thrun, Sebastian, and Anton Schwartz. "Issues in using function approximation for reinforcement learning."

SLIDE 23

Overestimation: mathematical model [1]

Predicted maximum averaged over zero-mean distortion:

Predicted ≥ True

[1]: Thrun, Sebastian, and Anton Schwartz. "Issues in using function approximation for reinforcement learning."
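The equations themselves did not survive in this transcript; the following LaTeX reconstructs the standard argument of [1] that the slides outline, writing the distortion as a zero-mean error U added to the true Q (the notation is an assumption, not copied from the slides):

    % Approximation model: \hat{Q}(s,a) = Q(s,a) + U(s,a), with \mathbb{E}_U[U(s,a)] = 0.
    % Averaging the predicted maximum over the zero-mean distortion and applying
    % Jensen's inequality (the maximum is convex) gives
    \mathbb{E}_U\!\left[\max_a \big(Q(s,a) + U(s,a)\big)\right]
      \;\ge\; \max_a \mathbb{E}_U\big[Q(s,a) + U(s,a)\big]
      \;=\; \max_a Q(s,a)
    % Hence Predicted >= True: the averaged predicted maximum overestimates the true maximum.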

SLIDE 24

Overestimation: mathematical model [1]

Predicted maximum averaged over zero-mean distortion:

Predicted ≥ True

1. Policy exploits critic's erroneous estimates
2. TD-learning propagates estimation errors
3. Positive feedback loop may occur

[1]: Thrun, Sebastian, and Anton Schwartz. "Issues in using function approximation for reinforcement learning."

SLIDE 25

Soft Actor-Critic [2]

[2]: Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor."

SLIDE 26

Soft Actor-Critic [2]

Soft Policy Evaluation:

[2]: Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor."

SLIDE 27

Soft Actor-Critic [2]

Soft Policy Evaluation:

[2]: Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor."

SLIDE 28

Soft Actor-Critic [2]

Soft Policy Evaluation:

[2]: Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor."

SLIDE 29

Soft Actor-Critic [2]

Soft Policy Evaluation:

Overestimation alleviation (Clipped Double Estimate [3]):

[2]: Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor."
[3]: Fujimoto, Scott, Herke van Hoof, and David Meger. "Addressing Function Approximation Error in Actor-Critic Methods."

SLIDE 30

Soft Actor-Critic [2]

Soft Policy Evaluation:

Overestimation alleviation (Clipped Double Estimate [3]):

[2]: Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor."
[3]: Fujimoto, Scott, Herke van Hoof, and David Meger. "Addressing Function Approximation Error in Actor-Critic Methods."

SLIDE 31

Soft Actor-Critic [2]

Soft Policy Evaluation:

Overestimation alleviation (Clipped Double Estimate [3]):

[2]: Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor."
[3]: Fujimoto, Scott, Herke van Hoof, and David Meger. "Addressing Function Approximation Error in Actor-Critic Methods."
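The equations on these slides (numbered 1 and 2 on the original) are also missing from the transcript; a sketch of the quantities they refer to, following [2] and [3], with notation that is an assumption: for a transition (s, a, r, s'),

    % (1) Soft policy evaluation target (SAC, [2]):
    y(r, s') \;=\; r + \gamma \, \mathbb{E}_{a' \sim \pi(\cdot \mid s')}
        \big[ Q_{\bar\theta}(s', a') - \alpha \log \pi(a' \mid s') \big]
    % (2) Clipped Double Estimate ([3]): the target critic value is replaced by the
    % minimum over two target critics to counteract overestimation:
    Q_{\bar\theta}(s', a') \;\leftarrow\; \min_{i = 1, 2} Q_{\bar\theta_i}(s', a')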

SLIDE 32

Overestimation alleviation (Clipped Double Estimate [3]):

Limitations:

  • Coarse bias control
  • Wasteful aggregation

Solution: Truncated Quantile Critics

[2]: Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor."
[3]: Fujimoto, Scott, Herke van Hoof, and David Meger. "Addressing Function Approximation Error in Actor-Critic Methods."

SLIDE 33

TQC step 1: Prediction of N distributions

SLIDE 34

TQC step 2: Pooling

SLIDE 35

TQC step 3: Truncation

SLIDE 36

TQC step 4: Discounting and Shifting
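A minimal NumPy sketch of these four steps for computing the target atoms (not the authors' code; the function name, arguments, and the per-network drop count are illustrative):

    import numpy as np

    def tqc_target_atoms(z_next, reward, done, log_prob_next,
                         gamma=0.99, alpha=0.2, drop_per_net=2):
        # Step 1 happened upstream: z_next is an (N, M) array of M quantile atoms
        # predicted by each of N critics at the next state-action pair (s', a').
        n_nets, n_quantiles = z_next.shape
        # Step 2: pool all N * M atoms into one empirical mixture and sort them
        atoms = np.sort(z_next.reshape(-1))
        # Step 3: truncate the mixture by dropping the drop_per_net * N largest atoms
        kept = atoms[: n_nets * (n_quantiles - drop_per_net)]
        # Step 4: discount and shift each kept atom (entropy term as in SAC's soft backup)
        return reward + gamma * (1.0 - done) * (kept - alpha * log_prob_next)

Each critic is then trained to regress its own M atoms toward these pooled, truncated targets, as described on the training slides that follow.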

SLIDE 37

Training

For each Z-network:

SLIDE 38

Training

For each Z-network:

[4]: Dabney, Will, et al. "Distributional reinforcement learning with quantile regression." Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

SLIDE 39

Training

For each Z-network:

[4]: Dabney, Will, et al. "Distributional reinforcement learning with quantile regression." Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

SLIDE 40

Training

For each Z-network:

Policy: maximizes the non-truncated average of all atoms of the mixture

[4]: Dabney, Will, et al. "Distributional reinforcement learning with quantile regression." Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
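The per-network loss on these slides is the quantile regression (Huber) loss of [4]; a self-contained sketch of it (array shapes and the kappa default are assumptions), applied between one Z-network's M predicted atoms and the truncated target atoms:

    import numpy as np

    def quantile_huber_loss(pred_atoms, target_atoms, kappa=1.0):
        # pred_atoms: (M,) atoms of one Z-network; target_atoms: (K,) truncated target atoms
        m = pred_atoms.shape[0]
        taus = (np.arange(m) + 0.5) / m                       # quantile midpoints tau_i = (2i - 1) / (2M)
        delta = target_atoms[None, :] - pred_atoms[:, None]   # pairwise TD errors
        abs_delta = np.abs(delta)
        huber = np.where(abs_delta <= kappa,
                         0.5 * delta ** 2,
                         kappa * (abs_delta - 0.5 * kappa))
        weight = np.abs(taus[:, None] - (delta < 0.0))        # asymmetric quantile weighting
        return (weight * huber / kappa).mean()

The policy, in contrast, is trained to maximize the non-truncated average of all N * M atoms, as stated on slide 40.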

SLIDE 41

Our contribution: Truncated Quantile Critics

1. Uses return stochasticity for overestimation control

A novel direction: interplay between overestimation and stochasticity

SLIDE 42

Our contribution: Truncated Quantile Critics

1. Uses return stochasticity for overestimation control

A novel direction: interplay between overestimation and stochasticity

[Figure: two return distributions with marked means; as the variance increases, overestimation increases]

SLIDE 43

Our contribution: Truncated Quantile Critics

1. Uses return stochasticity for overestimation control

A novel direction: interplay between overestimation and stochasticity

[Figure: as the variance increases, overestimation increases; truncating the distribution compensates for the overestimation]

SLIDE 44

Our contribution: Truncated Quantile Critics

1. Uses return stochasticity for overestimation control
2. Provides an adjustable and fine-grained level of overestimation bias control

[Figure: overestimation compensation is set by the fraction of dropped quantiles (e.g. 1/5 or 2/5 with M = 5 quantiles, 3/10 with M = 10); resolution increases with the number of quantiles]

SLIDE 45

Our contribution: Truncated Quantile Critics

1. Uses return stochasticity for overestimation control
2. Provides an adjustable and fine-grained level of overestimation bias control
3. Decouples overestimation control from the number of approximators

[Table: number of networks N = 1, 2, 3 vs. fraction of dropped quantiles; more networks give better performance at the cost of more computation, without changing the overestimation control]

SLIDE 46

Our contribution: Truncated Quantile Critics

1. Uses return stochasticity for overestimation control
2. Provides fine-grained overestimation bias control
3. Decouples overestimation control from the multiplicity of approximators
4. New SOTA on the MuJoCo locomotion suite

SLIDE 47

Substantial improvement on all environments

SLIDE 48

Overestimation measurement

[Plot: TQC value estimates against the Monte-Carlo estimate of the return]

SLIDE 49

Overestimation measurement

There is a clear optimum in terms of performance.

SLIDE 50

Overestimation measurement

SAC with a minimum over multiple critics, N = 1..5

SLIDE 51

Overestimation measurement

SAC with a linear combination of the minimum and maximum Q-functions [5]:

[5]: Fujimoto, Scott, David Meger, and Doina Precup. "Off-Policy Deep Reinforcement Learning without Exploration."

SLIDE 52

Overestimation measurement

Sources of TQC performance:

  • Distributional critics
  • Different procedure of overestimation correction
  • Benefits from larger networks

[Plots: TQC vs. SAC]

SLIDE 53

Links

1. arXiv paper: https://arxiv.org/abs/2005.04269
2. Method page: https://bayesgroup.github.io/tqc/
3. TensorFlow code: https://github.com/bayesgroup/tqc
4. PyTorch code: https://github.com/bayesgroup/tqc_pytorch
5. Video: https://www.youtube.com/watch?v=idp4k1L9UhM