

SLIDE 1

Actor-Critic Algorithms

CS 285

Instructor: Sergey Levine UC Berkeley

SLIDE 2

Recap: policy gradients

[RL loop diagram: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]

“reward to go”
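
The “reward to go” is the sum of rewards from time t to the end of the sampled trajectory, and it multiplies the corresponding grad_theta log pi_theta(a_t | s_t) term in the recap estimator. A minimal NumPy sketch of that computation (array names are illustrative, not the course code):

    import numpy as np

    def reward_to_go(rewards):
        # Q_hat_t = sum_{t'=t}^{T} r_{t'}: reverse cumulative sum of the sampled rewards
        return np.cumsum(rewards[::-1])[::-1]

    rewards = np.array([1.0, 0.0, 2.0])   # rewards from one sampled trajectory
    print(reward_to_go(rewards))          # [3. 2. 2.]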

SLIDE 3

Improving the policy gradient

“reward to go”

SLIDE 4

What about the baseline?

SLIDE 5

State & state-action value functions

the better this estimate, the lower the variance
unbiased, but high-variance single-sample estimate
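
For reference, the quantities on this slide, written out with the standard definitions (reconstructed from the usual notation, not copied verbatim from the slide):

    Q^pi(s_t, a_t) = sum_{t'=t}^{T} E_pi[ r(s_{t'}, a_{t'}) | s_t, a_t ]   (total reward from taking a_t in s_t)
    V^pi(s_t)      = E_{a_t ~ pi(a_t | s_t)}[ Q^pi(s_t, a_t) ]             (total reward from s_t)
    A^pi(s_t, a_t) = Q^pi(s_t, a_t) - V^pi(s_t)                            (how much better a_t is than average)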

[RL loop diagram: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]

SLIDE 6

Value function fitting

[RL loop diagram: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]

SLIDE 7

Policy evaluation

[RL loop diagram: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]
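
Policy evaluation here means estimating V^pi(s_t) = sum_{t'=t}^{T} E_pi[ r(s_{t'}, a_{t'}) | s_t ] for the current policy; the Monte Carlo estimate simply sums the rewards actually obtained after visiting s_t in a sampled rollout, V^pi(s_t) ≈ sum_{t'=t}^{T} r(s_{t'}, a_{t'}).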

SLIDE 8

Monte Carlo evaluation with function approximation

the same function should fit multiple samples!
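
A minimal sketch of this fitting step as supervised regression of a neural network onto the Monte Carlo targets (PyTorch; the network size and names are illustrative assumptions, not the course code):

    import torch
    import torch.nn as nn

    obs_dim = 4   # example observation dimension
    value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
    optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

    def fit_value_function(states, mc_returns, n_steps=100):
        # states: (N, obs_dim) visited states; mc_returns: (N,) Monte Carlo reward-to-go targets.
        # Because one function must fit many samples, nearby states receive averaged estimates.
        for _ in range(n_steps):
            pred = value_net(states).squeeze(-1)
            loss = ((pred - mc_returns) ** 2).mean()   # squared-error regression onto the targets
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()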

SLIDE 9

Can we do better?
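
One way to do better: replace the single-sample Monte Carlo target with a bootstrapped target that reuses the previous value function fit, y_t ≈ r(s_t, a_t) + V_phi(s_{t+1}). This lowers variance, at the cost of bias whenever the value estimate is imperfect.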

SLIDE 10

Policy evaluation examples

TD-Gammon, Gerald Tesauro 1992 AlphaGo, Silver et al. 2016

SLIDE 11

From Evaluation to Actor Critic

SLIDE 12

An actor-critic algorithm

[RL loop diagram: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]
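
A sketch of one policy update in this batch actor-critic loop, after the critic has been fit (PyTorch; the policy is assumed to return a torch.distributions object given observations, and all names and shapes are illustrative):

    import torch

    def actor_critic_policy_update(policy, value_net, policy_opt,
                                   obs, acts, rews, next_obs, dones, gamma=1.0):
        # gamma: discount factor (the next slides discuss using gamma < 1)
        with torch.no_grad():
            v = value_net(obs).squeeze(-1)
            v_next = value_net(next_obs).squeeze(-1)
            # advantage estimate: A(s, a) = r + gamma * V(s') - V(s)
            # (1 - done) zeroes the bootstrap term at terminal states
            adv = rews + gamma * (1.0 - dones) * v_next - v
        log_prob = policy(obs).log_prob(acts)   # log pi_theta(a | s)
        loss = -(log_prob * adv).mean()         # minimizing this ascends the policy gradient
        policy_opt.zero_grad()
        loss.backward()
        policy_opt.step()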

SLIDE 13

Aside: discount factors

episodic tasks vs. continuous/cyclical tasks
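
With a discount factor gamma in (0, 1), the bootstrapped target becomes y_t ≈ r(s_t, a_t) + gamma * V_phi(s_{t+1}). For continuous/cyclical tasks this keeps the targets finite, and intuitively it says that rewards are worth more sooner than later.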

SLIDE 14

Aside: discount factors for policy gradients

SLIDE 15

Which version is the right one?

Further reading: Philip Thomas, Bias in natural actor-critic algorithms. ICML 2014

SLIDE 16

Actor-critic algorithms (with discount)

SLIDE 17

Actor-Critic Design Decisions

SLIDE 18

Architecture design

two-network design:
+ simple & stable
- no shared features between actor & critic

shared network design (one network with two heads)
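
A minimal sketch of the shared design: one trunk with a policy head and a value head (PyTorch; layer sizes are illustrative):

    import torch.nn as nn

    obs_dim, n_actions, hidden = 8, 4, 64   # example dimensions

    trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, hidden), nn.ReLU())
    policy_head = nn.Linear(hidden, n_actions)   # actor: action logits
    value_head = nn.Linear(hidden, 1)            # critic: V(s)

    # features = trunk(obs); both the actor loss and the critic loss backpropagate
    # through the shared trunk, which is why this design needs more careful tuning
    # than the simple-and-stable two-network design.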

SLIDE 19

Online actor-critic in practice

Online updates from a single transition are very noisy; this works best with a batch (e.g., parallel workers). Two variants: synchronized parallel actor-critic (all workers step the environment, then one update is applied to the shared parameters) and asynchronous parallel actor-critic (each worker updates the shared parameters as soon as its transition is ready, as in A3C).

SLIDE 20

Critics as Baselines

SLIDE 21

Critics as state-dependent baselines

Monte Carlo policy gradient (with a constant baseline):
+ no bias
- higher variance (because single-sample estimate)

Actor-critic (bootstrapped critic used as the advantage):
+ lower variance (due to critic)
- not unbiased (if the critic is not perfect)

Critic as a state-dependent baseline (Monte Carlo returns minus V_phi(s)):
+ no bias
+ lower variance (baseline is closer to rewards)
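
The third option written out (a reconstruction of the standard form, not copied from the slide): keep the unbiased Monte Carlo reward-to-go, but subtract the learned value function as a state-dependent baseline,

    grad_theta J(theta) ≈ (1/N) sum_i sum_t grad_theta log pi_theta(a_{i,t} | s_{i,t})
                          * ( sum_{t'=t}^{T} gamma^{t'-t} r(s_{i,t'}, a_{i,t'}) - V_phi(s_{i,t}) )

Because the baseline depends only on the state, the estimator stays unbiased; because V_phi tracks the actual returns closely, it removes much of the variance.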

SLIDE 22

Control variates: action-dependent baselines

State-dependent baseline b(s_t):
+ no bias
- higher variance (because single-sample estimate)

Action-dependent baseline, e.g. Q_phi(s_t, a_t):
+ the remaining term goes to zero in expectation if the critic is correct!
- the estimator is no longer correct (biased) without a correction term

Adding the correction term back lets us use a critic without the bias (still unbiased), provided the second term can be evaluated. Gu et al. 2016 (Q-Prop)
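
Roughly, the Q-Prop-style estimator is the Monte Carlo term with the action-dependent baseline subtracted, plus a correction term that restores unbiasedness (a hedged reconstruction; see Gu et al. 2016 for the exact form):

    grad_theta J(theta) ≈ (1/N) sum_{i,t} grad_theta log pi_theta(a_{i,t} | s_{i,t}) * ( Q_hat_{i,t} - Q_phi(s_{i,t}, a_{i,t}) )
                        + (1/N) sum_{i,t} grad_theta E_{a ~ pi_theta(a | s_{i,t})}[ Q_phi(s_{i,t}, a) ]

The first term goes to zero in expectation if the critic is correct; the second term requires no new environment samples and can be evaluated analytically (or with many cheap samples) for suitable policy and critic classes.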

SLIDE 23

Eligibility traces & n-step returns

Bootstrapped (actor-critic) advantage estimator:
+ lower variance
- higher bias if the value estimate is wrong (it always is)

Monte Carlo advantage estimator:
+ no bias
- higher variance (because single-sample estimate)

Can we combine these two, to control the bias/variance tradeoff?
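
The n-step return uses the sampled rewards for the first n steps and then cuts to the critic: A_n(s_t, a_t) = sum_{t'=t}^{t+n-1} gamma^{t'-t} r(s_{t'}, a_{t'}) + gamma^n V_phi(s_{t+n}) - V_phi(s_t). A minimal NumPy sketch (the array layout is an assumption):

    import numpy as np

    def n_step_advantages(rewards, values, gamma=0.99, n=5):
        # rewards: length T; values: length T + 1 (V_phi at each visited state plus the final one)
        T = len(rewards)
        adv = np.zeros(T)
        for t in range(T):
            end = min(t + n, T)
            # sampled (discounted) rewards for the first n steps ...
            ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
            # ... then cut over to the critic's estimate
            ret += gamma ** (end - t) * values[end]
            adv[t] = ret - values[t]
        return adv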

SLIDE 24

Generalized advantage estimation

Schulman, Moritz, Levine, Jordan, Abbeel ‘16

Do we have to choose just one n? Cut everywhere all at once: use a weighted combination of n-step returns. How to weight? Mostly prefer cutting earlier (less variance), i.e. weights with an exponential falloff in n. This has a similar effect as the discount (remember that?): the discount itself is a variance reduction trick!
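
GAE builds that weighted combination from the one-step TD errors delta_t = r_t + gamma * V_phi(s_{t+1}) - V_phi(s_t), giving A_t = sum_{t'>=t} (gamma * lambda)^{t'-t} delta_{t'}. A minimal NumPy sketch (the array layout is an assumption):

    import numpy as np

    def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
        # rewards: length T; values: length T + 1 (V_phi at each visited state plus the final one)
        T = len(rewards)
        adv = np.zeros(T)
        running = 0.0
        for t in reversed(range(T)):
            delta = rewards[t] + gamma * values[t + 1] - values[t]   # one-step TD error
            running = delta + gamma * lam * running                  # A_t = delta_t + gamma * lambda * A_{t+1}
            adv[t] = running
        return adv

lambda = 1 recovers the (discounted) Monte Carlo advantage; lambda = 0 recovers the one-step actor-critic advantage.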

SLIDE 25

Review, Examples, and Additional Readings

SLIDE 26

Review

  • Actor-critic algorithms:
    • Actor: the policy
    • Critic: value function
    • Reduce variance of policy gradient
  • Policy evaluation
    • Fitting value function to policy
  • Discount factors
    • Carpe diem Mr. Robot
    • …but also a variance reduction trick
  • Actor-critic algorithm design
    • One network (with two heads) or two networks
    • Batch-mode, or online (+ parallel)
  • State-dependent baselines
    • Another way to use the critic
    • Can combine: n-step returns or GAE

[RL loop diagram: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]

SLIDE 27

Actor-critic examples

  • High-dimensional continuous control with generalized advantage estimation (Schulman, Moritz, L., Jordan, Abbeel ‘16)
  • Batch-mode actor-critic
  • Blends Monte Carlo and function approximator estimators (GAE)

SLIDE 28

Actor-critic examples

  • Asynchronous methods for deep reinforcement learning (Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu ‘16)
  • Online actor-critic, parallelized batch
  • N-step returns with N = 4
  • Single network for actor and critic
SLIDE 29

Actor-critic suggested readings

  • Classic papers
    • Sutton, McAllester, Singh, Mansour (1999). Policy gradient methods for reinforcement learning with function approximation: actor-critic algorithms with value function approximation
  • Deep reinforcement learning actor-critic papers
    • Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu (2016). Asynchronous methods for deep reinforcement learning: A3C -- parallel online actor-critic
    • Schulman, Moritz, L., Jordan, Abbeel (2016). High-dimensional continuous control using generalized advantage estimation: batch-mode actor-critic with blended Monte Carlo and function approximator returns
    • Gu, Lillicrap, Ghahramani, Turner, L. (2017). Q-Prop: sample-efficient policy gradient with an off-policy critic: policy gradient with Q-function control variate