Introduction to Reinforcement Learning
Finale Doshi-Velez, Harvard University, Buenos Aires MLSS 2018


SLIDE 1

Introduction to Reinforcement Learning Finale Doshi-Velez

Harvard University Buenos Aires MLSS 2018

SLIDE 2

We often must make decisions under uncertainty.

How to get to work: walk or bus?

SLIDE 3

We often must make decisions under uncertainty.

What projects to work on?

https://imgs.xkcd.com/comics/automation.png

SLIDE 4

We often must make decisions under uncertainty.

How to improvise with a new recipe?

https://s-media-cache-ak0.pinimg.com/originals/23/ce/4b/23ce4b2fc9014b26d4b811209550ef5b.jpg

SLIDE 5

Some Real Applications of RL

SLIDE 6

Why are these problems hard?

  • Must learn from experience (may have prior experience on the same or a related task)
  • Delayed rewards: actions may have long-term effects (delayed credit assignment)
  • Explore or exploit? Learn and plan together.
  • Generalization (new developments; don’t assume all information has been identified)

SLIDE 7

Reinforcement learning formalizes this problem

World: black box. Agent: models, policies, etc. The agent sends actions to the world; the world returns an observation and a reward.

Objective: Maximize (finite or infinite horizon) E[ Σt γ^t rt ]
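To make the loop concrete, here is a minimal sketch in Python with a made-up two-state toy world (the class and function names are my own, not from the tutorial code): the agent sends actions, the black-box world returns observations and rewards, and one rollout gives a single sample of the discounted objective E[ Σt γ^t rt ].

```python
# A minimal sketch of the agent-world loop and the discounted-return objective.
# The two-state toy world below is invented for illustration.
import random

class ToyWorld:
    """Black-box world: the agent only ever sees observations and rewards."""
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 tries to flip the state (succeeds with prob 0.75); action 0 does nothing.
        if action == 1 and random.random() < 0.75:
            self.state = 1 - self.state
        reward = 1.0 if self.state == 1 else 0.0   # reward for being in state 1
        return self.state, reward

def rollout_return(world, policy, gamma=0.9, horizon=50):
    """Run one episode and accumulate one sample of sum_t gamma^t r_t."""
    obs = world.reset()
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        action = policy(obs)
        obs, reward = world.step(action)
        total += discount * reward
        discount *= gamma
    return total

print(rollout_return(ToyWorld(), policy=lambda s: 1))  # always take action 1
```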

SLIDE 8

Concept Check: Reward Adjustment

  • If I replace every reward r with r + c, does the policy change?
  • If I replace every reward r with c·r, does the policy change?

SLIDE 9

Key Terms

  • Policy π(s,a), or π(s) = a for a deterministic policy
  • State s
  • History h = { s0, a0, r0, s1, a1, … }

Markov Property: p( st+1 | ht ) = p( st+1 | ht-1 , st , at ) = p( st+1 | st , at ) … we'll come back to identifying state later!

SLIDE 10

Markov Decision Process

  • T( s' | s , a ) = Pr( state s' after taking action a in state s )
  • R( s , a , s' ) = E[ reward after taking action a in state s and transitioning to s' ] … but may depend on less, e.g. R( s , a ) or even R( s )

Notice that given a policy, we have a Markov chain to analyze!

[Diagram: a two-state Markov chain (State 0, State 1) with transitions labeled P = 1, R = 0; P = 1, R = 3; P = 1, R = 2; P = .25, R = 1; P = .75, R = 0.]
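As a concrete data structure, a small tabular MDP can be stored as arrays T[s, a, s'] and R[s, a, s']; a minimal sketch below, with numbers invented for illustration (not the exact chain from the slide):

```python
# A minimal sketch of storing a tabular MDP as transition and reward arrays.
import numpy as np

n_states, n_actions = 2, 2
T = np.zeros((n_states, n_actions, n_states))  # T[s, a, s'] = Pr(s' | s, a)
R = np.zeros((n_states, n_actions, n_states))  # R[s, a, s'] = expected reward

# Action 0: stay put, no reward.
T[0, 0, 0] = 1.0
T[1, 0, 1] = 1.0
# Action 1: from state 0, reach state 1 with prob .75; reward 1 on the .25 self-loop.
T[0, 1, 1], T[0, 1, 0] = 0.75, 0.25
R[0, 1, 0] = 1.0
# Action 1: from state 1, return to state 0 with reward 3.
T[1, 1, 0] = 1.0
R[1, 1, 0] = 3.0

assert np.allclose(T.sum(axis=2), 1.0)  # each (s, a) row must be a valid distribution
```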

SLIDE 11

How to Solve an MDP: Value Functions

Value: Vπ(s) = Eπ[ Σt γ^t rt | s0 = s ] … in s, follow π

SLIDE 12

How to Solve an MDP: Value Functions

Value: Vπ(s) = Eπ[ Σt γ^t rt | s0 = s ] … in s, follow π

SLIDE 13

Concept Check: Discounts

[Diagram: a start state S and policies A, B, C, with rewards of 4, 5, 5, and 1 along the paths.]

(1) As functions of γ, what are the values of policies A, B, and C?
(2) When is it better to do B? C?

SLIDE 14

How to Solve an MDP: Value Functions

Value: Vπ(s) = Eπ[ Σt γ^t rt | s0 = s ] … in s, follow π
Action-Value: Qπ(s,a) = Eπ[ Σt γ^t rt | s0 = s, a0 = a ] … in s, do a, then follow π

SLIDE 15

Expanding the expression...

Vπ(s) = Eπ[ Σt γ^t rt | s0 = s ]

Vπ(s) = Σa π(a|s) Σs' T(s'|s,a) [ r(s,a,s') + γ Eπ[ Σt γ^t rt | s0 = s' ] ]

Term by term: the next action (you choose, via π), the next state (the world chooses, via T), the next reward, then the discounted future rewards.

SLIDE 16

Expanding the expression...

Vπ(s) = Eπ[ Σt γ^t rt | s0 = s ]

Vπ(s) = Σa π(a|s) Σs' T(s'|s,a) [ r(s,a,s') + γ Eπ[ Σt γ^t rt | s0 = s' ] ]

Term by term: next action, next state, next reward, discounted future rewards.

SLIDE 17

Expanding the expression...

Vπ(s) = Eπ[ Σt γ^t rt | s0 = s ]

Vπ(s) = Σa π(a|s) Σs' T(s'|s,a) [ r(s,a,s') + γ Eπ[ Σt γ^t rt | s0 = s' ] ]

Term by term: next action, next state, next reward, discounted future rewards.

Vπ(s) = Σa π(a|s) Σs' T(s'|s,a) [ r(s,a,s') + γ Vπ(s') ]

SLIDE 18

Expanding the expression...

Vπ(s) = Eπ[ Σt γ^t rt | s0 = s ]

Vπ(s) = Σa π(a|s) Σs' T(s'|s,a) [ r(s,a,s') + γ Eπ[ Σt γ^t rt | s0 = s' ] ]

Exercise: Rewrite in the finite-horizon case, making the rewards and transitions depend on time t … notice how thinking about the future is the same as thinking backward from the end!

Term by term: next action, next state, next reward, discounted future rewards.

Vπ(s) = Σa π(a|s) Σs' T(s'|s,a) [ r(s,a,s') + γ Vπ(s') ]

SLIDE 19

Optimal Value Functions

V(s) = maxa Q(s,a)
V(s) = maxa Σs' T(s'|s,a) [ r(s,a,s') + γ V(s') ]

Don't average, take the best! The Q-table is the set of values Q(s,a).
Note: we still have problems: the system must be Markov in s, and the size of {s} might be large.

SLIDE 20

Can we solve this? Policy Evaluation

This is a system of linear equations!

Vπ(s) = Σa π(a|s) Σs' T(s'|s,a) [ r(s,a,s') + γ Vπ(s') ]
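For a fixed policy this really is just linear algebra. Below is a minimal sketch (the helper name is mine) that solves (I − γ Pπ) Vπ = rπ directly, assuming tabular arrays T[s,a,s'], R[s,a,s'] and a stochastic policy pi[s,a]:

```python
# A minimal sketch: exact policy evaluation by solving a linear system.
import numpy as np

def policy_evaluation_exact(T, R, pi, gamma):
    n_states = T.shape[0]
    # P_pi[s, s'] = sum_a pi(a|s) T(s'|s, a)
    P_pi = np.einsum('sa,sap->sp', pi, T)
    # r_pi[s] = sum_a pi(a|s) sum_s' T(s'|s, a) R(s, a, s')
    r_pi = np.einsum('sa,sap,sap->s', pi, T, R)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Example (hypothetical): evaluate the uniform-random policy on the toy T, R above.
# pi = np.full((2, 2), 0.5); print(policy_evaluation_exact(T, R, pi, gamma=0.9))
```

A direct solve like this is fine when the number of states is small; otherwise the iterative version on the next slide is the practical choice.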

SLIDE 21

Can we solve this? Policy Evaluation

This is a system of linear equations! We can also solve it iteratively. The iteration will converge because the Bellman operator is a contraction: the initial value V0(s) is pushed into the past as the “collected data” r(s,a) takes over.

Vπ(s) = Σa π(a|s) Σs' T(s'|s,a) [ r(s,a,s') + γ Vπ(s') ]

Vπ0(s) = c
Vπk(s) = Σa π(a|s) Σs' T(s'|s,a) [ r(s,a,s') + γ Vπk−1(s') ]

SLIDE 22

Can we solve this? Policy Evaluation

This is a system of linear equations! We can also solve it iteratively. The iteration will converge because the Bellman operator is a contraction: the initial value V0(s) is pushed into the past as the “collected data” r(s,a) takes over. Finally, we can apply Monte Carlo: run many simulations from s and see what Vπ(s) comes out to be.

Vπ(s) = Σa π(a|s) Σs' T(s'|s,a) [ r(s,a,s') + γ Vπ(s') ]

Vπ0(s) = c
Vπk(s) = Σa π(a|s) Σs' T(s'|s,a) [ r(s,a,s') + γ Vπk−1(s') ]
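A minimal sketch of the iterative version, under the same assumptions as before (tabular T, R, pi arrays; names are mine): repeatedly apply the Bellman operator until the values stop changing.

```python
# A minimal sketch of iterative policy evaluation.
import numpy as np

def policy_evaluation_iterative(T, R, pi, gamma, tol=1e-8):
    V = np.zeros(T.shape[0])                     # V_0(s) = c, here c = 0
    expected_R = (T * R).sum(axis=2)             # E[r | s, a]
    while True:
        # V_k(s) = sum_a pi(a|s) [ E[r|s,a] + gamma * sum_s' T(s'|s,a) V_{k-1}(s') ]
        Q = expected_R + gamma * T @ V           # shape (S, A)
        V_new = (pi * Q).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```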

SLIDE 23

Policy Improvement Theorem

Let π, π' be two policies that are the same except for the action they recommend at state s. If Qπ( s, π'(s) ) > Qπ( s, π(s) ), then Vπ'(s) > Vπ(s). This gives us a way to improve policies: just be greedy with respect to Q!

SLIDE 24

Policy Iteration

Will converge; each step requires a potentially expensive policy evaluation computation.

Select some policy π → solve for Vπ → improve policy π → solve for Vπ → …
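Putting policy evaluation and the improvement theorem together gives the loop above; here is a minimal sketch, reusing the hypothetical policy_evaluation_exact helper from the earlier slide:

```python
# A minimal sketch of policy iteration: evaluate the current policy, then act greedily.
import numpy as np

def policy_iteration(T, R, gamma):
    n_states, n_actions = T.shape[0], T.shape[1]
    policy = np.zeros(n_states, dtype=int)             # arbitrary initial deterministic policy
    expected_R = (T * R).sum(axis=2)
    while True:
        pi = np.eye(n_actions)[policy]                 # one-hot (S, A) representation
        V = policy_evaluation_exact(T, R, pi, gamma)   # expensive step: solve for V^pi
        Q = expected_R + gamma * T @ V                 # greedy improvement w.r.t. Q^pi
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```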

SLIDE 25

Value Iteration

Vk(s) = maxa Σs' T(s'|s,a) [ r(s,a,s') + γ Vk−1(s') ]

Each backup folds a one-step policy evaluation and a policy improvement (the max over actions) into a single update.

Also converges (contraction). Note that in the tabular case, this is a bunch of inexpensive matrix operations!
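A minimal sketch of value iteration on the same tabular arrays (names are mine); the whole backup really is a couple of matrix operations:

```python
# A minimal sketch of value iteration.
import numpy as np

def value_iteration(T, R, gamma, tol=1e-8):
    V = np.zeros(T.shape[0])
    expected_R = (T * R).sum(axis=2)            # E[r | s, a]
    while True:
        Q = expected_R + gamma * T @ V          # Q_k(s, a)
        V_new = Q.max(axis=1)                   # don't average, take the best
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)      # optimal values and a greedy policy
        V = V_new
```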

SLIDE 26

Linear programming

min Σs μ(s) V(s)
s.t. V(s) ≥ Σs' T(s'|s,a) [ r(s,a,s') + γ V(s') ]   for all a, s

For any (positive) weights μ; equality holds for the best action at optimality.
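A minimal sketch of that LP using scipy.optimize.linprog, with the arbitrary choice μ(s) = 1 (the helper name is mine):

```python
# A minimal sketch of the LP formulation for an MDP.
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(T, R, gamma):
    n_states, n_actions = T.shape[0], T.shape[1]
    A_ub, b_ub = [], []
    for s in range(n_states):
        for a in range(n_actions):
            # V(s) >= sum_s' T(s'|s,a) [ r(s,a,s') + gamma V(s') ]
            # rewritten as (gamma * T[s,a,:] - e_s) @ V <= -sum_s' T(s'|s,a) r(s,a,s')
            row = gamma * T[s, a]                 # new array, safe to modify
            row[s] -= 1.0
            A_ub.append(row)
            b_ub.append(-np.dot(T[s, a], R[s, a]))
    res = linprog(c=np.ones(n_states),            # mu(s) = 1 for every state
                  A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n_states)  # V is unconstrained in sign
    return res.x
```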

SLIDE 27

Learning from Experience: Reinforcement Learning

Now, instead of the transition T and reward R, we assume that we only have histories. Why is this case interesting?

  • May not have the model
  • Even if we have the model (e.g. the rules of Go, or Atari simulator code), learning from experience focuses attention on the right place

SLIDE 28

Taxonomy of Approaches

  • Forward Search/Monte Carlo: Simulate the future, pick the best one (with or without a model).
  • Value function: Learn V(s).
  • Policy Search: Parametrize the policy πθ(s) and search for the best parameters θ; often good for systems in which the cardinality of θ is small.

SLIDE 29

Taxonomy of Approaches

  • Forward Search/Monte Carlo: Simulate the future, pick the best one (with or without a model).
  • Value function: Learn V(s).
  • Policy Search: Parametrize the policy πθ(s) and search for the best parameters θ; often good for systems in which the cardinality of θ is small.

SLIDE 30

Monte Carlo Policy Evaluation

1) Generate N sequences of length T from state s0, following π, to estimate Vπ(s0).
2) If π has some randomness, or if we do s0, a0 and then follow π, we can do policy improvement.
… might need a lot of data! But okay if we have a black-box simulator.

SLIDE 31

Monte Carlo Policy Evaluation

1) Generate N sequences of length T from state s0, following π, to estimate Vπ(s0).
2) If π has some randomness, or if we do s0, a0 and then follow π, we can do policy improvement.
… might need a lot of data! But okay if we have a black-box simulator.

Add sophistication: UCT, MCTS
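A minimal sketch of Monte Carlo policy evaluation; here `simulate` is a stand-in for whatever black-box simulator you have and is assumed to return the rewards of one length-T rollout from s0 under π:

```python
# A minimal sketch of Monte Carlo policy evaluation with a black-box simulator.
import numpy as np

def mc_value_estimate(simulate, s0, policy, gamma, n_rollouts=1000, horizon=100):
    returns = []
    for _ in range(n_rollouts):
        rewards = simulate(s0, policy, horizon)               # one sampled trajectory from s0
        g = sum(gamma ** t * r for t, r in enumerate(rewards))
        returns.append(g)
    return np.mean(returns)                                   # estimate of V^pi(s0)
```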

SLIDE 32

Taxonomy of Approaches

  • Forward Search/Monte Carlo: Simulate the future, pick the best one (with or without a model).
  • Value function: Learn V(s).
  • Policy Search: Parametrize the policy πθ(s) and search for the best parameters θ; often good for systems in which the cardinality of θ is small.

SLIDE 33

Temporal Difference

Vπ(s) = Eπ[ Σt γ^t rt | s0 = s ] = Eπ[ r0 + γ Vπ(s') ]

(The first expectation is what a Monte Carlo estimate targets; the second is the dynamic-programming form.)

TD: Start with some V(s), do π(s), and update:

Vπ(s) ← Vπ(s) + αt ( r0 + γ Vπ(s') − Vπ(s) )

Original value, plus the temporal difference: the error between the sampled value of where you went and the stored value.

Will converge if Σt αt → ∞ and Σt αt² is finite.
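A minimal sketch of the tabular TD(0) update applied along one long stream of experience (the env_step and policy callables are placeholders, not the tutorial code):

```python
# A minimal sketch of tabular TD(0) policy evaluation from experience.
import numpy as np

def td0(env_step, policy, n_states, s0, gamma, alpha=0.1, n_steps=10000):
    V = np.zeros(n_states)
    s = s0
    for _ in range(n_steps):
        a = policy(s)
        s_next, r = env_step(s, a)                       # sample one transition
        V[s] += alpha * (r + gamma * V[s_next] - V[s])   # TD-error update
        s = s_next
    return V
```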

SLIDE 34

Monte Carlo (only one trajectory)

SLIDE 35

Value Iteration (all actions)

SLIDE 36

Temporal Difference

SLIDE 37

Example (S&B 6.4, Let γ = 1)

Two states (A,B). Two rewards (0,1). Suppose we have seen the histories: A0B0 B1 B1 B1 B1 B1 B1 B0 MC estimate of V(B)? TD estimate of V(B)?

SLIDE 38

Example (S&B 6.4, Let γ = 1)

Two states (A,B). Two rewards (0,1). Suppose we have seen the histories: A0B0 B1 B1 B1 B1 B1 B1 B0 MC estimate of V(B)? VMC(B) = ¾ TD estimate of V(B)? VTD(B) = ¾

SLIDE 39

Example (S&B 6.4, Let γ = 1)

Two states (A,B). Two rewards (0,1). Suppose we have seen the histories: A0B0 B1 B1 B1 B1 B1 B1 B0 MC estimate of V(B)? VMC(B) = ¾ TD estimate of V(B)? VTD(B) = ¾ MC estimate of V(A)? TD estimate of V(A)?

SLIDE 40

Example (S&B 6.4, Let γ = 1)

Two states (A,B). Two rewards (0,1). Suppose we have seen the histories: A0B0 B1 B1 B1 B1 B1 B1 B0 MC estimate of V(B)? VMC(B) = ¾ TD estimate of V(B)? VTD(B) = ¾ MC estimate of V(A)? VMC(A) = 0 TD estimate of V(A)? VTD(A) = ¾ (because A→B)

SLIDE 41

Concept Check: DP, MC, TD

[Diagram: states A, B, C, D, E, F with a reward of 100; values are initialized with the rewards.]

(1) What would one round of value iteration do?
(2) What would MC do after ABCF?
(3) What would TD do after ABCF? (α = 1)

SLIDE 42

From Policy Evaluation to Optimization

SARSA (on-policy): improve what you did.

Q(s,a) ← Q(s,a) + αt ( rt + γ Q(s',a') − Q(s,a) )

SLIDE 43

From Policy Evaluation to Optimization

SARSA (on-policy): improve what you did.
Q(s,a) ← Q(s,a) + αt ( rt + γ Q(s',a') − Q(s,a) )

Q-learning (off-policy): improve what you could do.
Q(s,a) ← Q(s,a) + αt ( rt + γ maxa' Q(s',a') − Q(s,a) )
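A minimal sketch of the two updates on a tabular Q array (helper names are mine); the only difference is which next-step value they bootstrap from:

```python
# A minimal sketch of the SARSA and Q-learning update rules.
import numpy as np

def q_learning_step(Q, s, a, r, s_next, gamma, alpha):
    # Off-policy: bootstrap from the best action you *could* take next.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def sarsa_step(Q, s, a, r, s_next, a_next, gamma, alpha):
    # On-policy: bootstrap from the action a' you actually take next.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
```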

SLIDE 44

Concept Check

[Diagram: a gridworld with start state S and rewards of 100, +1, and +10.]

Let δ be the transition noise. All actions cost −0.1.
(1) What is the optimal policy for (γ=.1, δ=.5)? (γ=.1, δ=0)? (γ=.99, δ=.5)? (γ=.99, δ=0)?
(2) Using an ε-greedy policy with ε=.5, γ=.99, δ=0: what will SARSA learn? What will Q-learning learn?

SLIDE 45

MC + TD: Eligibility Traces

TD(0): V(s) ← V(s) + αt ( rt + γ V(s') − V(s) )   … biased estimate of the future
TD(1): V(s) ← V(s) + αt ( rt + γ rt+1 + γ² V(s'') − V(s) )   … less bias, more variance
…
Until we get to MC (all variance, no bias).

SLIDE 46

Eligibility traces average over all backups

Image from S&B.

Forward view (can't implement):

(1−λ) Σn λ^(n−1) [ rt + γ rt+1 + γ² rt+2 + … + γ^n V(st+n) ]

An average of the n-step returns (we don't know all these future values).

SLIDE 47

Eligibility traces average over all backups

Image from S&B

Backward view: credit assignment back in time.

zt(s) = γ λ zt−1(s)   for all s except st
zt(st) = 1 + γ λ zt−1(st)

∀ s: V(s) ← V(s) + αt zt(s) ( rt + γ V(st+1) − V(st) )
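A minimal sketch of the backward view, TD(λ) with accumulating traces, under the same placeholder env_step/policy assumptions as the TD(0) sketch:

```python
# A minimal sketch of TD(lambda) with accumulating eligibility traces.
import numpy as np

def td_lambda(env_step, policy, n_states, s0, gamma, lam, alpha=0.1, n_steps=10000):
    V = np.zeros(n_states)
    z = np.zeros(n_states)                      # eligibility trace z_t(s)
    s = s0
    for _ in range(n_steps):
        a = policy(s)
        s_next, r = env_step(s, a)
        z *= gamma * lam                        # decay all traces
        z[s] += 1.0                             # bump the state just visited
        delta = r + gamma * V[s_next] - V[s]    # TD error
        V += alpha * z * delta                  # every state gets credit via its trace
        s = s_next
    return V
```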

SLIDE 48

Interlude: What about actions??

Given some Q(s,a), how do you choose the action to take? Want to balance exploration with exploitation. Two simple strategies:

  • Epsilon-greedy: take argmaxa Q(s,a) with probability (1−ε), else take a random action.
  • Softmax: take actions with probability proportional to exp( τ Q(s,a) ).
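A minimal sketch of both strategies in one helper (the function name is mine; τ is an inverse temperature, matching the slide's exp( τ Q(s,a) ) convention, and rng is assumed to be a numpy Generator such as np.random.default_rng()):

```python
# A minimal sketch of epsilon-greedy and softmax action selection over Q(s, .).
import numpy as np

def select_action(Q_s, rng, epsilon=None, tau=None):
    if epsilon is not None:                          # epsilon-greedy
        if rng.random() < epsilon:
            return int(rng.integers(len(Q_s)))       # random exploratory action
        return int(np.argmax(Q_s))                   # greedy action
    prefs = tau * Q_s                                # softmax preferences
    prefs = prefs - prefs.max()                      # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(Q_s), p=probs))
```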

SLIDE 49

More general principles

Lots of research about curiosity, value of future information, etc. Important ideas:

  • Learning has utility (succeed-or-learn)
  • Optimism under uncertainty

Examples: interval exploration, UCB/UCT, E3, RMAX. Recent advances in PSRL.

SLIDE 50

Taxonomy of Approaches

  • Forward Search/Monte Carlo: Simulate the future, pick the best one (with or without a model).
  • Value function: Learn V(s).
  • Policy Search: Parametrize the policy πθ(s) and search for the best parameters θ; often good for systems in which the cardinality of θ is small.

Next Speaker: Sergey Levine

SLIDE 51

Practical time!

Clone the code from https://github.com/dtak/tutorial-rl.git and follow the instructions in tutorial.py.