SLIDE 1
Statistics and Samples in Distributional Reinforcement Learning
Rowland, Dadashi, Kumar, Munos, Bellemare, Dabney
Topic: Distributional RL
Presenter: Isaac Waller
SLIDE 2 Distributional RL
Instead of approximating the return with a value function, learn the distribution of the return, η(x, a).
- A better model for multi-modal return distributions
Image https://reinforcement-learning-kr.github.io/2018/09/27/Distributional_intro/
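For reference, the objects involved, written in the slides' notation (standard definitions, stated here for completeness):

```latex
G_0 = \sum_{t \ge 0} \gamma^t R_t, \qquad
Q^{\pi}(x, a) = \mathbb{E}\left[ G_0 \mid X_0 = x,\, A_0 = a \right], \qquad
\eta(x, a) = \mathrm{Law}\left( G_0 \mid X_0 = x,\, A_0 = a \right)
```

Value-based RL keeps only the mean Q^π(x, a); distributional RL models the whole law η(x, a).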
SLIDE 3 Categorical Distributional RL (CDRL)
Assumes a categorical form for the return distributions η(x, a)
Fixed set of supports z_1 … z_K
Learn a probability p_k(x, a) for each support z_k
Image https://joshgreaves.com/reinforcement-learning/understanding-rl-the-bellman-equations/
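In symbols, the categorical parametrisation is (standard form, matching the slide's quantities):

```latex
\eta(x, a) = \sum_{k=1}^{K} p_k(x, a)\, \delta_{z_k},
\qquad \sum_{k=1}^{K} p_k(x, a) = 1
```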
SLIDE 4 Quantile Distributional RL (QDRL)
Learn K quantiles of the return distribution η(x, a)
Each learnable parameter z_k has equal probability mass
Image https://joshgreaves.com/reinforcement-learning/understanding-rl-the-bellman-equations/
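A minimal numpy sketch of the loss behind QDRL updates (the standard quantile-regression loss at midpoint levels τ_k = (2k − 1)/2K; function and variable names here are illustrative, not from the paper's code):

```python
import numpy as np

def quantile_regression_loss(quantile_estimates, target_samples):
    """Quantile-regression loss driving QDRL updates (numpy sketch).

    quantile_estimates: array of K values z_1..z_K, where z_k estimates the
    tau_k = (2k - 1) / (2K) quantile of the return distribution.
    target_samples: array of samples from the Bellman target distribution.
    """
    K = len(quantile_estimates)
    taus = (2.0 * np.arange(1, K + 1) - 1.0) / (2.0 * K)  # midpoint quantile levels
    loss = 0.0
    for z_k, tau in zip(quantile_estimates, taus):
        u = target_samples - z_k  # positive where the target exceeds the estimate
        # asymmetric absolute loss: weight tau above the estimate, (1 - tau) below
        loss += np.mean(np.where(u > 0, tau * u, (tau - 1.0) * u))
    return loss
```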
SLIDE 5 Motivation
Lack of a unifying framework for these distributional RL algorithms. A general framework would:
- Assess how well these algorithms model return distributions
- Inform the development of new distributional RL algorithms
SLIDE 6 Contributions
- Demonstrates that distributional RL algorithms can be decomposed into a set of statistics and an imputation strategy
- Shows that CDRL and QDRL inherently cannot exactly learn the true statistics of the return distribution
- Develops a new algorithm, EDRL, which can exactly learn the true expectiles of the return distribution
- Empirically demonstrates that EDRL is competitive with, and sometimes an improvement on, past algorithms
SLIDE 7
Bellman equations
Bellman equation
Distributional Bellman equation?
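For reference, the two equations being contrasted (standard forms; Z^π denotes a random return whose law is η(x, a)):

```latex
% Classical Bellman equation (expectations):
Q^{\pi}(x, a) \;=\; \mathbb{E}\!\left[ R_0 + \gamma\, Q^{\pi}(X_1, A_1) \,\middle|\, X_0 = x,\, A_0 = a \right]
% Distributional Bellman equation (equality in distribution):
Z^{\pi}(x, a) \;\overset{D}{=}\; R_0 + \gamma\, Z^{\pi}(X_1, A_1), \qquad X_0 = x,\; A_0 = a
```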
SLIDE 8 CDRL and QDRL Bellman updates
CDRL: update p_k(x, a) towards the probability mass assigned to z_k when the Bellman target distribution is projected onto the supports z_1 … z_K. (See Appendix A.2)
QDRL: update the quantiles z_k towards the observed quantiles of the Bellman target distribution. (See Appendix A.3)
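A sketch of the projection step CDRL needs (the standard C51-style projection onto an evenly spaced support; function and variable names are mine):

```python
import numpy as np

def categorical_projection(target_atoms, target_probs, z):
    """Project a categorical distribution with atoms `target_atoms` and
    masses `target_probs` onto the fixed, evenly spaced support `z`.
    Standard C51-style projection; preserves mean within the support."""
    dz = z[1] - z[0]                        # assumes an evenly spaced support
    p = np.zeros(len(z))
    for atom, prob in zip(target_atoms, target_probs):
        atom = np.clip(atom, z[0], z[-1])   # clip to the support's range
        b = (atom - z[0]) / dz              # fractional index of the atom
        lo, hi = int(np.floor(b)), int(np.ceil(b))
        if lo == hi:                        # atom lands exactly on a grid point
            p[lo] += prob
        else:                               # split mass between the two neighbours
            p[lo] += prob * (hi - b)
            p[hi] += prob * (b - lo)
    return p
```

The CDRL target for p_k(x, a) is then the projection of the shifted-and-scaled atoms r + γ z_j of the next state's distribution back onto z_1 … z_K.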
SLIDE 9 Any algorithm = Statistics + imputation strategies
QDRL
Statistics: s_1 … s_K, the K quantiles of the return distribution
Imputation strategy: ψ(s_{1:K}) = (1/K) Σ_{k=1}^{K} δ_{s_k}
CDRL
Statistics: s_1 … s_K, the K probability masses of the return distribution projected onto the supports z_1 … z_K
Imputation strategy: ψ(s_{1:K}) = Σ_{k=1}^{K} s_k δ_{z_k}
Bellman update: apply the distributional Bellman operator to the imputed distribution, then re-extract the statistics
SLIDE 10
Any algorithm = Statistics + imputation strategies
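A minimal sketch of one update under this decomposition (the `impute` and `extract_stats` callables stand in for the algorithm-specific choices; my illustration, not the paper's code):

```python
import numpy as np

def statistics_imputation_update(next_stats, r, gamma, impute, extract_stats):
    """One update in the statistics + imputation view of distributional RL.

    impute:        the imputation strategy psi, mapping statistic estimates
                   to the atoms of an imputed distribution.
    extract_stats: reads the tracked statistics off a set of atoms.
    """
    atoms = impute(np.asarray(next_stats))  # 1. impute a distribution from statistics
    target_atoms = r + gamma * atoms        # 2. apply the distributional Bellman operator
    return extract_stats(target_atoms)      # 3. re-extract the statistics as new targets
```

Plugging in the slide-9 statistics and imputation strategies should recover, up to sampling details, the CDRL and QDRL updates.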
SLIDE 11
Bellman closedness
Bellman closedness: a set of statistics is Bellman closed if, for each (x, a) ∈ X × A, the statistics s_{1:K}(η(x, a)) can be expressed purely in terms of the random variables R_0 and s_{1:K}(η(X_1, A_1)) | X_0 = x, A_0 = a, and the discount factor γ.
Theorem 4.3: Collections of moments are "effectively" the only finite sets of statistics that are Bellman closed. Proof in Appendix B.2
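As a concrete check that moments are Bellman closed, expand G_0 = R_0 + γ G_1 and use the Markov property (standard algebra, added to make the argument explicit):

```latex
\mathbb{E}[G_0 \mid x, a]
  = \mathbb{E}\bigl[ R_0 + \gamma\, \mathbb{E}[G_1 \mid X_1, A_1] \bigm| X_0 = x,\, A_0 = a \bigr]
\mathbb{E}[G_0^2 \mid x, a]
  = \mathbb{E}\bigl[ R_0^2 + 2\gamma R_0\, \mathbb{E}[G_1 \mid X_1, A_1]
      + \gamma^2\, \mathbb{E}[G_1^2 \mid X_1, A_1] \bigm| X_0 = x,\, A_0 = a \bigr]
```

Each moment at (x, a) thus depends only on R_0, γ, and the moments at (X_1, A_1); no analogous identity exists for quantiles, which is why QDRL's statistics fail to be Bellman closed (slide 12).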
SLIDE 12 Bellman closedness
The sets of statistics used by CDRL and QDRL are not Bellman closed
Those algorithms are therefore not capable of exactly learning their statistics (* but in practice they seem to be effective anyway…)
This does not imply that they are incapable of correctly learning expected returns, only the full return distribution
SLIDE 13
New algorithm: EDRL
Uses expectiles, which can be exactly learned using Bellman updates
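For reference, the τ-expectile of a distribution μ is the minimiser of an asymmetric squared loss (standard definition; τ = 1/2 recovers the mean):

```latex
e_{\tau}(\mu) = \operatorname*{arg\,min}_{q \in \mathbb{R}}\;
  \mathbb{E}_{G \sim \mu}\!\left[\, \bigl| \tau - \mathbb{1}\{ G \le q \} \bigr|\, (G - q)^2 \,\right]
```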
SLIDE 14
New algorithm: EDRL
Imputation strategy: find a distribution satisfying equation (7), or (equivalently) one that minimizes equation (8)
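A sketch of that imputation step, assuming the imputed distribution is K equally weighted Diracs and using a generic optimizer in place of the paper's solver (`expectile_residual`, `impute`, and the Nelder-Mead choice are my illustration):

```python
import numpy as np
from scipy.optimize import minimize

def expectile_residual(tau, atoms, e):
    """Residual of the tau-expectile condition for the empirical distribution
    over `atoms`, evaluated at e; it is zero iff e is the tau-expectile."""
    diff = atoms - e
    return (tau * np.mean(np.maximum(diff, 0.0))
            - (1.0 - tau) * np.mean(np.maximum(-diff, 0.0)))

def impute(expectiles, taus):
    """Impute K equally weighted Dirac locations whose expectiles at levels
    `taus` match `expectiles`, by minimising the sum of squared residuals."""
    def objective(atoms):
        return sum(expectile_residual(t, atoms, e) ** 2
                   for t, e in zip(taus, expectiles))
    res = minimize(objective, x0=np.asarray(expectiles, dtype=float),
                   method="Nelder-Mead")
    return np.sort(res.x)
```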
SLIDE 15
Learnt return distributions
SLIDE 16 Experimental Results
Above: estimation error of the learned statistics (x-axis: distance to goal). EDRL best approximates the statistics.
SLIDE 17
Experimental Results
EDRL does the best job of estimating the true mean
SLIDE 18
Experimental Results
Figure 8. Mean and median human normalised scores across all 57 Atari games. Number of statistics learnt for each algorithm indicated in parentheses.
SLIDE 19 Discussion of results
- EDRL matches or exceeds the performance of the other distributional RL algorithms
- Using imputation strategies grounded in the theoretical framework can improve the accuracy of learned statistics
- Conclusion: the theoretical framework is sound and useful, and should be incorporated into future study of distributional RL
SLIDE 20 Critique / Limitations / Open Issues
- EDRL does not give enormous improvements in performance over other distributional RL algorithms, and is significantly more complex
- Is it truly important to learn the exact return distribution? Learning an inexact distribution appears to perform fine with regard to policy performance, which is what matters in the end
- Or: perhaps the test scenarios are not complex enough to allow distributional RL to showcase its true power
SLIDE 21 Contributions (Recap)
- Demonstrates that distributional RL algorithms can be decomposed into a set of statistics and an imputation strategy
- Shows that CDRL and QDRL inherently cannot exactly learn the true statistics of the return distribution
- Develops a new algorithm, EDRL, which can exactly learn the true expectiles of the return distribution
- Empirically demonstrates that EDRL is competitive with, and sometimes an improvement on, past algorithms
SLIDE 22 Practice questions
1. Prove that the set of statistics learned under QDRL is not Bellman closed. (Hint: prove by counterexample.)
2. Give an example of a set of statistics that is Bellman closed and is not the expectiles or the mean.