
Machine Learning 10-701

Tom M. Mitchell Machine Learning Department Carnegie Mellon University April 26, 2011

Today:

  • Learning of control policies
  • Markov Decision Processes
  • Temporal difference learning
  • Q learning

Readings:

  • Mitchell, chapter 13
  • Kaelbling et al., Reinforcement Learning: A Survey

Thanks to Aarti Singh for several slides


Reinforcement Learning

[Sutton and Barto 1981; Samuel 1957; ...]


Reinforcement Learning: Backgammon

[Tesauro, 1995]

Learning task:

  • choose moves at arbitrary board states

Training signal:

  • final win or loss

Training:

  • played 300,000 games against itself

Algorithm:

  • reinforcement learning + neural network

Result:

  • World-class Backgammon player


Outline

  • Learning control strategies

– Credit assignment and delayed reward
– Discounted rewards

  • Markov Decision Processes

– Solving a known MDP

  • Online learning of control strategies

– When next-state function is known: value function V*(s)
– When next-state function unknown: learning Q*(s,a)

  • Role in modeling reward learning in animals

Markov Decision Process = Reinforcement Learning Setting

  • Set of states S
  • Set of actions A
  • At each time, agent observes state st ∈ S, then chooses action at ∈ A
  • Then receives reward rt , and state changes to st+1
  • Markov assumption: P(st+1 | st, at, st-1, at-1, ...) = P(st+1 | st, at)
  • Also assume reward Markov: P(rt | st, at, st-1, at-1, ...) = P(rt | st, at)
  • The task: learn a policy π: S → A for choosing actions that maximizes the expected discounted sum of rewards

    E[r0 + γ r1 + γ² r2 + ...],  where 0 ≤ γ < 1

for every possible starting state s0
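To make these ingredients concrete, here is a minimal sketch of a toy two-state MDP and one sampled discounted return; the state names, transition probabilities, rewards, and the example policy are illustrative assumptions, not from the slides.

```python
import random

# Toy MDP (illustrative numbers): states, actions, P(s'|s,a), r(s,a), discount gamma
states = ["A", "B"]
actions = ["stay", "move"]
P = {  # P[s][a] = list of (next_state, probability)
    "A": {"stay": [("A", 0.9), ("B", 0.1)], "move": [("B", 1.0)]},
    "B": {"stay": [("B", 1.0)],             "move": [("A", 1.0)]},
}
r = {("A", "stay"): 0.0, ("A", "move"): 1.0,
     ("B", "stay"): 2.0, ("B", "move"): 0.0}
gamma = 0.9

def step(s, a):
    """Act in the environment: return reward r(s,a) and next state sampled from P(.|s,a)."""
    next_states, probs = zip(*P[s][a])
    return r[(s, a)], random.choices(next_states, weights=probs)[0]

# Discounted return of one sampled trajectory under a fixed (arbitrary) policy pi
pi = {"A": "move", "B": "stay"}
s, G = "A", 0.0
for t in range(50):
    reward, s = step(s, pi[s])
    G += (gamma ** t) * reward
print("sampled discounted return from s0 = A:", G)
```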


HMM, Markov Process, Markov Decision Process



Reinforcement Learning Task for Autonomous Agent

Execute actions in environment, observe results, and

  • Learn control policy π: S → A that maximizes the expected discounted sum of rewards

    E[r0 + γ r1 + γ² r2 + ...]

from every state s ∈ S

Example: Robot grid world, deterministic reward r(s,a)


Reinforcement Learning Task for Autonomous Agent

Execute actions in environment, observe results, and

  • Learn control policy π: S → A that maximizes the expected discounted sum of rewards

from every state s ∈ S

Yikes!!

  • Function to be learned is π: S → A
  • But training examples are not of the form <s, a>
  • They are instead of the form < <s,a>, r >

Value Function for each Policy

  • Given a policy π: S → A, define

    Vπ(s) ≡ E[rt + γ rt+1 + γ² rt+2 + ... | st = s]

    assuming the action sequence is chosen according to π, starting at state s

  • Then we want the optimal policy π* where

    π* ≡ argmaxπ Vπ(s), for all s

  • For any MDP, such a policy exists!
  • We’ll abbreviate Vπ*(s) as V*(s)
  • Note if we have V*(s) and P(st+1|st,a), we can compute π*(s)
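A minimal sketch of how the last bullet can be carried out, reusing the toy `states`, `actions`, `P`, `r`, and `gamma` assumed above: the optimal action in s is the one with the largest one-step backed-up value under V*.

```python
def greedy_policy(V, states, actions, P, r, gamma):
    """Extract pi*(s) = argmax_a [ r(s,a) + gamma * sum_s' P(s'|s,a) * V(s') ]."""
    pi = {}
    for s in states:
        def backed_up(a, s=s):
            return r[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[s][a])
        pi[s] = max(actions, key=backed_up)
    return pi
```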


Value Function – what are the Vπ(s) values?


Value Function – what are the V*(s) values?


Grid world figure: immediate rewards r(s,a); state values V*(s)


Recursive definition for V*(s)

    V*(s) = E[r(s, π*(s))] + γ Σs' P(s'|s, π*(s)) V*(s')

assuming actions are chosen according to the optimal policy, π*


Value Iteration for learning V*: assumes P(st+1|st, a) known

Initialize V(s) arbitrarily
Loop until policy good enough
  Loop for s in S
    Loop for a in A
      Q(s,a) ← r(s,a) + γ Σs' P(s'|s,a) V(s')
    End loop
    V(s) ← maxa Q(s,a)
  End loop
End loop

V(s) converges to V*(s)
Dynamic programming
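A minimal tabular sketch of this loop in Python, again reusing the toy `states`, `actions`, `P`, `r`, and `gamma` assumed earlier; the fixed sweep count stands in for "until policy good enough".

```python
def value_iteration(states, actions, P, r, gamma, n_sweeps=100):
    """Dynamic programming: back up V(s) <- max_a [ r(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    V = {s: 0.0 for s in states}          # arbitrary initialization
    for _ in range(n_sweeps):             # stand-in for "until policy good enough"
        for s in states:
            V[s] = max(
                r[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                for a in actions
            )
    return V
```

With the toy MDP above, `greedy_policy(value_iteration(states, actions, P, r, gamma), states, actions, P, r, gamma)` then recovers the greedy policy with respect to the converged V.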


Value Iteration

Interestingly, value iteration works even if we randomly traverse the environment instead of looping through each state and action methodically

  • but we must still visit each state infinitely often on an infinite run

  • For details: [Bertsekas 1989]
  • Implications: online learning as agent randomly roams

If the max (over states) difference between two successive value function estimates is less than ε, then the value of the greedy policy differs from the optimal policy by no more than 2εγ/(1−γ)


So far: learning optimal policy when we know P(st | st-1, at-1)

What if we don’t?


Q learning

Define new function, closely related to V*:

    Q(s,a) ≡ r(s,a) + γ Σs' P(s'|s,a) V*(s')

  • If agent knows Q(s,a), it can choose the optimal action without knowing P(st+1|st,a):  π*(s) = argmaxa Q(s,a)
  • And, it can learn Q without knowing P(st+1|st,a)


Grid world figure: immediate rewards r(s,a); state values V*(s); state-action values Q*(s,a)

Bellman equation:  V*(s) = maxa [ r(s,a) + γ Σs' P(s'|s,a) V*(s') ]

Consider first the case where P(s'|s,a) is deterministic
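In the deterministic case, the training rule presented in Mitchell, chapter 13 is Q̂(s,a) ← r + γ maxa' Q̂(s',a'), applied to each observed transition. Below is a minimal tabular sketch of that rule; the `step` function and MDP names are the illustrative ones assumed earlier, and the purely random action selection is an assumption made for brevity, not the slides' exploration strategy.

```python
from collections import defaultdict
import random

def q_learning(states, actions, step, gamma, n_steps=10000):
    """Learn Q from sampled transitions, without knowing P(s'|s,a),
    using the deterministic-world rule Q(s,a) <- r + gamma * max_a' Q(s',a')."""
    Q = defaultdict(float)                 # tabular Q, initialized to 0
    s = random.choice(states)
    for _ in range(n_steps):
        a = random.choice(actions)         # illustrative: purely random exploration
        reward, s_next = step(s, a)        # observe r and s' by acting in the world
        Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
        s = s_next
    return Q
```

For stochastic transitions the update is typically softened with a learning rate α; the rule above is the deterministic-case version.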



Use general fact:

    | maxa f1(a) − maxa f2(a) |  ≤  maxa | f1(a) − f2(a) |
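The surrounding slides (the convergence argument for Q learning in the deterministic case) are images in the original. As a hedged sketch of where this fact is typically applied, following the standard argument in Mitchell, chapter 13: let Δn ≡ maxs,a |Q̂n(s,a) − Q(s,a)| be the largest error in the table. One application of the training rule then gives

$$
\begin{aligned}
|\hat{Q}_{n+1}(s,a) - Q(s,a)|
  &= \gamma \,\bigl|\max_{a'} \hat{Q}_n(s',a') - \max_{a'} Q(s',a')\bigr| \\
  &\le \gamma \max_{a'} \bigl|\hat{Q}_n(s',a') - Q(s',a')\bigr|
   \;\le\; \gamma\, \Delta_n ,
\end{aligned}
$$

so the maximum error shrinks by at least a factor of γ per full sweep of updates, and Q̂ converges to Q provided every state-action pair is visited infinitely often.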



MDPs and RL: What You Should Know

  • Learning to choose optimal actions A
  • From delayed reward
  • By learning evaluation functions like V(S), Q(S,A)

Key ideas:

  • If next-state function St × At → St+1 is known
    – can use dynamic programming to learn V(S)
    – once learned, choose action At that maximizes V(St+1)
  • If next-state function St × At → St+1 is unknown
    – learn Q(St,At) = E[rt + γ V(St+1)]
    – to learn, sample St × At → St+1 transitions in the actual world
    – once learned, choose action At that maximizes Q(St,At)


MDPs and Reinforcement Learning: Further Issues

  • What strategy for choosing actions will optimize
    – learning rate? (explore uninvestigated states)
    – obtained reward? (exploit what you know so far)
    – a simple ε-greedy trade-off is sketched after this list

  • Partially observable Markov Decision Processes
    – state is not fully observable
    – maintain probability distribution over possible states you’re in

  • Convergence guarantee with function approximators?
    – our proof assumed a tabular representation for Q, V
    – some types of function approximators still converge (e.g., nearest neighbor) [Gordon, 1999]

  • Correspondence to human learning?
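As referenced in the first bullet above, a minimal ε-greedy action-selection sketch; the function name, ε value, and tabular Q are illustrative assumptions, not from the slides. With probability ε the agent explores a random action, otherwise it exploits the action that currently looks best.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Trade off exploration (random action) against exploitation (argmax_a Q(s,a))."""
    if random.random() < epsilon:
        return random.choice(actions)              # explore uninvestigated actions
    return max(actions, key=lambda a: Q[(s, a)])   # exploit what you know so far
```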