Optimal Control and Planning
CS 285, Instructor: Sergey Levine, UC Berkeley
Today’s Lecture
- 1. Introduction to model-based reinforcement learning
- 2. What if we know the dynamics? How can we make decisions?
- 3. Stochastic optimization methods
- 4. Monte Carlo tree search (MCTS)
- 5. Trajectory optimization
- Goals:
- Understand how we can perform planning with known dynamics models in discrete and continuous spaces
Recap: the reinforcement learning objective
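As a reference for the recap, the objective in the course's usual notation (a reconstruction, not the verbatim slide):

```latex
\theta^{\star} = \arg\max_{\theta} \; \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t) \right],
\qquad
p_\theta(\tau) = p(\mathbf{s}_1) \prod_{t=1}^{T} \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t) \, p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)
```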
Recap: model-free reinforcement learning
- Assume the transition dynamics are unknown, and don't even attempt to learn them.
What if we knew the transition dynamics?
- Often we do know the dynamics
1. Games (e.g., Atari games, chess, Go)
2. Easily modeled systems (e.g., navigating a car)
3. Simulated environments (e.g., simulated robots, video games)
- Often we can learn the dynamics
1. System identification – fit unknown parameters of a known model
2. Learning – fit a general-purpose model to observed transition data
Does knowing the dynamics make things easier? Often, yes!
- 1. Model-based reinforcement learning: learn the transition dynamics, then figure out how to choose actions
- 2. Today: how can we make decisions if we know the dynamics?
a. How can we choose actions under perfect knowledge of the system dynamics?
b. Optimal control, trajectory optimization, planning
- 3. Next week: how can we learn unknown dynamics?
- 4. How can we then also learn policies? (e.g., by imitating optimal control)
Model-based reinforcement learning
[figure: the policy interacting with the system dynamics]
The objective
[figure: planning with a known model; candidate action sequences: 1. run away, 2. ignore, 3. pet]
The deterministic case
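In equations (reconstructed), deterministic planning picks the action sequence that maximizes total reward subject to the known dynamics $f$:

```latex
\mathbf{a}_1, \ldots, \mathbf{a}_T
  = \arg\max_{\mathbf{a}_1, \ldots, \mathbf{a}_T} \sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t)
  \quad \text{s.t.} \quad \mathbf{s}_{t+1} = f(\mathbf{s}_t, \mathbf{a}_t)
```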
The stochastic open-loop case
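In the stochastic case, the action sequence is chosen to maximize expected reward under the state distribution it induces (reconstructed):

```latex
p(\mathbf{s}_1, \ldots, \mathbf{s}_T \mid \mathbf{a}_1, \ldots, \mathbf{a}_T)
  = p(\mathbf{s}_1) \prod_{t=1}^{T} p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t), \\
\mathbf{a}_1, \ldots, \mathbf{a}_T
  = \arg\max_{\mathbf{a}_1, \ldots, \mathbf{a}_T}
    \mathbb{E}\left[ \sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t) \,\middle|\, \mathbf{a}_1, \ldots, \mathbf{a}_T \right]
```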
- Why is this suboptimal? Committing to the entire action sequence in advance means the agent cannot react to the states it actually visits; under stochastic dynamics, later actions should depend on earlier outcomes, which an open-loop plan ignores.
Aside: terminology
- What is this "loop"? In the closed-loop setting, the agent observes a state and sends back an action at every time step (two-way interaction).
- In the open-loop setting, the whole action sequence is only sent at t = 1, then it's one-way!
The stochastic closed-loop case
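Here the optimization is over a policy, which sees each state before acting (reconstructed):

```latex
p(\tau) = p(\mathbf{s}_1) \prod_{t=1}^{T} \pi(\mathbf{a}_t \mid \mathbf{s}_t) \, p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t),
\qquad
\pi = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim p(\tau)} \left[ \sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t) \right]
```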
Open-Loop Planning
But for now, open-loop planning
Stochastic optimization
Simplest method: guess & check, the "random shooting method": sample candidate action sequences from some fixed distribution, evaluate each one under the known model, and pick the best.
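A minimal sketch in Python, under an assumed model interface (`dynamics(s, a)` returns the next state, `reward(s, a)` the reward; these names, and the uniform sampling range, are illustrative):

```python
import numpy as np

def random_shooting(dynamics, reward, s0, horizon, n_samples, action_dim):
    """Guess & check: sample action sequences, return the best one found."""
    best_return, best_actions = -np.inf, None
    for _ in range(n_samples):
        # assumption: actions sampled uniformly from [-1, 1]
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = s0, 0.0
        for a in actions:
            total += reward(s, a)
            s = dynamics(s, a)
        if total > best_return:
            best_return, best_actions = total, actions
    return best_actions
```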
Cross-entropy method (CEM)
Can we do better?
- Typically uses a Gaussian sampling distribution.
- See also: CMA-ES (sort of like CEM with momentum).
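A minimal CEM sketch, reusing the same hypothetical `dynamics`/`reward` interface as above; the hyperparameters (population size, number of elites, iterations) are illustrative:

```python
import numpy as np

def cross_entropy_method(dynamics, reward, s0, horizon, action_dim,
                         n_samples=500, n_elites=50, n_iters=5):
    """CEM: iteratively refit a Gaussian to the best ("elite") samples."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))

    def rollout_return(actions):
        s, total = s0, 0.0
        for a in actions:
            total += reward(s, a)
            s = dynamics(s, a)
        return total

    for _ in range(n_iters):
        # 1. sample action sequences from the current Gaussian
        samples = mean + std * np.random.randn(n_samples, horizon, action_dim)
        # 2. evaluate all samples and keep the elites
        returns = np.array([rollout_return(a) for a in samples])
        elites = samples[np.argsort(returns)[-n_elites:]]
        # 3. refit the Gaussian to the elites
        mean, std = elites.mean(axis=0), elites.std(axis=0)
    return mean  # the final mean is the plan
```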
What’s the problem?
- 1. Very harsh dimensionality limit
- 2. Only open-loop planning
What’s the upside?
- 1. Very fast if parallelized
- 2. Extremely simple
Discrete case: Monte Carlo tree search (MCTS)
- Generic MCTS sketch (a compact code sketch follows the figure summary below):
1. Find a leaf node using a TreePolicy, expanding the tree one node at a time.
2. Evaluate the new leaf using a DefaultPolicy (e.g., a random policy).
3. Back up the result: update the value Q and visit count N of every node between the root and the leaf.
[figures: MCTS iterations on an example tree, showing rollout returns (+10, +15) and per-node statistics (value Q, visit count N) accumulating as results are backed up]
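A compact Python sketch of MCTS with the UCT tree policy; the environment interface (`actions`, `step`, `rollout`) and the exploration constant `c` are hypothetical stand-ins for illustration:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}   # action -> Node
        self.Q, self.N = 0.0, 0

def uct_score(child, parent, c=1.0):
    # exploit (average value) + explore (visit-count bonus)
    return child.Q / child.N + c * math.sqrt(2 * math.log(parent.N) / child.N)

def mcts(root, actions, step, rollout, n_iters=1000):
    for _ in range(n_iters):
        node = root
        # 1. TreePolicy: descend through fully expanded nodes via UCT
        while node.children and len(node.children) == len(actions(node.state)):
            node = max(node.children.values(),
                       key=lambda ch: uct_score(ch, node))
        untried = [a for a in actions(node.state) if a not in node.children]
        if untried:  # expand one new child
            a = random.choice(untried)
            node.children[a] = Node(step(node.state, a), parent=node)
            node = node.children[a]
        # 2. DefaultPolicy: evaluate the leaf, e.g., with a random rollout
        value = rollout(node.state)
        # 3. back up Q and N along the path to the root
        while node is not None:
            node.Q += value
            node.N += 1
            node = node.parent
    # commit to the most-visited action at the root
    return max(root.children, key=lambda a: root.children[a].N)
```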
Additional reading
- 1. Browne, Powley, Whitehouse, Lucas, Cowling, Rohlfshagen, Tavener, Perez, Samothrakis, Colton. (2012). A Survey of Monte Carlo Tree Search Methods.
- Survey of MCTS methods and basic summary.
Trajectory Optimization with Derivatives
Can we use derivatives?
Shooting methods vs collocation
- Shooting method: optimize over actions only; states are obtained by rolling the actions forward through the dynamics, so early actions have an outsized effect on the objective.
- Collocation method: optimize over actions and states, with constraints that enforce the dynamics.
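In equations (reconstructed; minimization of a cost $c$ is the convention in the optimal-control part of the lecture):

```latex
% shooting: states are substituted away, optimize over actions only
\min_{\mathbf{u}_1, \ldots, \mathbf{u}_T} \;
  c(\mathbf{x}_1, \mathbf{u}_1) + c(f(\mathbf{x}_1, \mathbf{u}_1), \mathbf{u}_2)
  + \cdots + c(f(f(\cdots)\cdots), \mathbf{u}_T)

% collocation: optimize states and actions jointly, dynamics as constraints
\min_{\mathbf{u}_1, \ldots, \mathbf{u}_T, \; \mathbf{x}_1, \ldots, \mathbf{x}_T}
  \sum_{t=1}^{T} c(\mathbf{x}_t, \mathbf{u}_t)
  \quad \text{s.t.} \quad \mathbf{x}_{t+1} = f(\mathbf{x}_t, \mathbf{u}_t)
```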
Linear case: LQR
- "LQR" = linear quadratic regulator: the dynamics are linear and the cost is quadratic:

$$f(\mathbf{x}_t, \mathbf{u}_t) = \mathbf{F}_t \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix} + \mathbf{f}_t
\qquad
c(\mathbf{x}_t, \mathbf{u}_t) = \frac{1}{2} \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix}^T \mathbf{C}_t \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix} + \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix}^T \mathbf{c}_t$$

- Under these assumptions the planning problem can be solved exactly by dynamic programming: a backward pass computes a time-varying linear feedback controller, and a forward pass recovers the optimal trajectory.
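For reference, a sketch of the LQR backward recursion in the usual notation (from $t = T$ down to $t = 1$, with $\mathbf{V}_{T+1} = 0$, $\mathbf{v}_{T+1} = 0$; a reconstruction, not a verbatim slide):

```latex
\mathbf{Q}_t = \mathbf{C}_t + \mathbf{F}_t^T \mathbf{V}_{t+1} \mathbf{F}_t, \qquad
\mathbf{q}_t = \mathbf{c}_t + \mathbf{F}_t^T \mathbf{V}_{t+1} \mathbf{f}_t + \mathbf{F}_t^T \mathbf{v}_{t+1} \\
\mathbf{K}_t = -\mathbf{Q}_{\mathbf{u}_t, \mathbf{u}_t}^{-1} \mathbf{Q}_{\mathbf{u}_t, \mathbf{x}_t}, \qquad
\mathbf{k}_t = -\mathbf{Q}_{\mathbf{u}_t, \mathbf{u}_t}^{-1} \mathbf{q}_{\mathbf{u}_t}, \qquad
\mathbf{u}_t = \mathbf{K}_t \mathbf{x}_t + \mathbf{k}_t \\
\mathbf{V}_t = \mathbf{Q}_{\mathbf{x}_t, \mathbf{x}_t} + \mathbf{Q}_{\mathbf{x}_t, \mathbf{u}_t} \mathbf{K}_t + \mathbf{K}_t^T \mathbf{Q}_{\mathbf{u}_t, \mathbf{x}_t} + \mathbf{K}_t^T \mathbf{Q}_{\mathbf{u}_t, \mathbf{u}_t} \mathbf{K}_t \\
\mathbf{v}_t = \mathbf{q}_{\mathbf{x}_t} + \mathbf{Q}_{\mathbf{x}_t, \mathbf{u}_t} \mathbf{k}_t + \mathbf{K}_t^T \mathbf{q}_{\mathbf{u}_t} + \mathbf{K}_t^T \mathbf{Q}_{\mathbf{u}_t, \mathbf{u}_t} \mathbf{k}_t
```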
LQR for Stochastic and Nonlinear Systems
Stochastic dynamics
- If the dynamics are Gaussian with the mean given by the same linear function (see the formula below), the LQR solution carries over unchanged: the zero-mean noise does not affect the argmax of the expected quadratic cost, so the optimal controller is still $\mathbf{u}_t = \mathbf{K}_t \mathbf{x}_t + \mathbf{k}_t$. Executing this feedback controller is exactly the stochastic closed-loop case: each action reacts to the state actually reached.
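The assumed Gaussian dynamics model (reconstructed in the course's notation):

```latex
p(\mathbf{x}_{t+1} \mid \mathbf{x}_t, \mathbf{u}_t)
  = \mathcal{N}\!\left( \mathbf{F}_t \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix} + \mathbf{f}_t, \; \Sigma_t \right)
```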
Nonlinear case: DDP/iterative LQR
- Idea: approximate the nonlinear system as linear-quadratic around the current trajectory, using a first-order Taylor expansion of the dynamics and a second-order expansion of the cost; run LQR on the approximation; repeat until convergence (a sketch of one iteration follows).
- Differential dynamic programming (DDP) also keeps second-order terms of the dynamics; iterative LQR drops them. Either way, the procedure is analogous to Newton's method for trajectory optimization, and a line search on the forward pass keeps the new trajectory close to the expansion point.
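A reconstruction of the iLQR iteration in the usual notation ($\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t$ is the current trajectory; a sketch, not a verbatim slide):

```latex
\text{until convergence:} \\
\quad \mathbf{F}_t = \nabla_{\mathbf{x}_t, \mathbf{u}_t} f(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t), \quad
\mathbf{c}_t = \nabla_{\mathbf{x}_t, \mathbf{u}_t} c(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t), \quad
\mathbf{C}_t = \nabla^2_{\mathbf{x}_t, \mathbf{u}_t} c(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \\
\quad \text{run the LQR backward pass on } \delta\mathbf{x}_t = \mathbf{x}_t - \hat{\mathbf{x}}_t, \ \delta\mathbf{u}_t = \mathbf{u}_t - \hat{\mathbf{u}}_t \\
\quad \text{run the forward pass with the real dynamics and } \mathbf{u}_t = \mathbf{K}_t \delta\mathbf{x}_t + \alpha \mathbf{k}_t + \hat{\mathbf{u}}_t \\
\quad \text{set } \hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t \text{ to the resulting trajectory, choosing } \alpha \text{ by line search}
```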
Case Study and Additional Readings
Case study: nonlinear model-predictive control
- Model-predictive control (MPC) replans at every step: observe the current state, optimize a short-horizon trajectory (e.g., with iLQR), execute only the first action of the plan, then replan from the newly observed state. Constant replanning turns an open-loop optimizer into a closed-loop controller and compensates for model errors; a minimal loop sketch follows.
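A minimal sketch of the MPC loop, assuming hypothetical `observe`, `plan_trajectory`, and `execute` interfaces (the planner could be any method above, e.g., CEM or iLQR):

```python
def mpc_loop(observe, plan_trajectory, execute, horizon, n_steps):
    """Model-predictive control: replan from the latest observed state every step."""
    for _ in range(n_steps):
        x = observe()                        # current state of the real system
        plan = plan_trajectory(x, horizon)   # short-horizon trajectory optimization
        execute(plan[0])                     # commit only to the first action
        # the rest of the plan is discarded; we replan at the next step
```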
Additional reading
- 1. Mayne, Jacobson. (1970). Differential dynamic programming.
- Original differential dynamic programming algorithm.
- 2. Tassa, Erez, Todorov. (2012). Synthesis and Stabilization of Complex Behaviors through Online Trajectory Optimization.
- Practical guide for implementing nonlinear iterative LQR.
- 3. Levine, Abbeel. (2014). Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics.
- Probabilistic formulation and trust region alternative to deterministic line search.