Q-learning (3-23-16)
Markov Decision Processes (MDPs)
- States: S
- Actions
○ A vs. A_s (a single global action set vs. per-state action sets)
- Transition probabilities: P(s’ | s, a)
- Rewards
○ R(s) vs. R(s,a) vs. R(s,s’)
- Discount factor (sometimes considered part of the environment, sometimes part of the agent).
Reward vs. Value
- Reward is how the agent receives feedback in the moment.
- The agent wants to maximize reward over the long term.
- Value is the total reward the agent expects to accumulate in the future.
○ Expected sum of discounted future rewards: V(s) = E[ r_0 + γ·r_1 + γ²·r_2 + … ] = E[ Σ_t γ^t·r_t ], where γ = discount and r_t = reward at time t.
Why do we use discounting?
We want the agent to act over infinite horizons.
- Without discounting, the sum of rewards would be infinite.
- With discounting, as long as rewards are bounded, the sum converges (a quick bound follows below).
We also want the agent to accomplish its goals quickly when possible.
- Discounting causes the agent to prefer receiving rewards sooner.
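For example (this is just the standard geometric-series bound, not from the slides): if every reward satisfies |r_t| ≤ R_max and 0 ≤ γ < 1, then |Σ_t γ^t·r_t| ≤ R_max·(1 + γ + γ² + …) = R_max / (1 − γ), which is finite.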
Known vs. unknown MDPs
If we know the full MDP:
- All states and actions
- All transition probabilities
- All rewards
Then we can use value iteration to find an optimal policy before we start acting (see the sketch below).
If we don’t know the full MDP:
- Missing states (we generally assume we know actions)
- Missing transition probabilities
- Missing rewards
Then we need to try out various actions to see what happens. This is reinforcement learning (RL).
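For the known-MDP case above, here is a minimal sketch of tabular value iteration (Python; the dictionary-based representation of P and R, and the use of R(s)-style rewards, are assumptions for illustration, not something fixed by the slides):

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    """P[(s, a)] is a list of (next_state, probability) pairs; R[s] is the reward in s."""
    V = {s: 0.0 for s in states}          # value estimates, initially 0
    while True:
        delta = 0.0
        for s in states:
            # Bellman update: reward now plus best discounted expected future value
            best = max(sum(p * V[s2] for s2, p in P[(s, a)]) for a in actions)
            new_v = R[s] + gamma * best
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:                   # stop once values have essentially converged
            return V
```

An optimal policy can then be read off greedily: in each state, pick the action whose expected next-state value is highest.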
Temporal difference (TD) learning
Key idea: use the differences in utilities between successive states.
Update rule: V(s) ← V(s) + α·[ r + γ·V(s') − V(s) ]
Equivalently: V(s) ← (1 − α)·V(s) + α·[ r + γ·V(s') ]
where α = learning rate, γ = discount, s' = next state, and the bracketed quantity r + γ·V(s') − V(s) is the temporal difference.
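As a sketch in code (Python; the dictionary-based value table and the default parameter values are illustrative choices, not from the slides):

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference update of the value table V (state -> estimated value)."""
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)  # the temporal difference
    V[s] = V.get(s, 0.0) + alpha * td_error
    return V
```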
How the heck does TD learning work?
TD learning maintains no model of the environment.
- It never learns transition probabilities.
Yet TD learning converges to correct value estimates. Why? Consider how values will be modified (a worked one-step example follows this list)...
- when all values are initially 0.
- when future value is higher than current.
- when future value is lower than current.
- when discount is close to 1.
- when discount is close to 0.
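For intuition, one update with illustrative numbers (not from the slides): take α = 0.5, γ = 0.9, V(s) = 0, reward 0, and V(s') = 1. The temporal difference is 0 + 0.9·1 − 0 = 0.9, so V(s) moves up to 0.45. If V(s') were also 0, nothing would change; that is why value information spreads backward from rewarding states one step per visit, and why a discount near 0 keeps updates local while a discount near 1 lets them propagate far.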
Q-learning
Key idea: temporal difference learning on (state, action) pairs.
- Q(s,a) denotes the expected value of doing action a in state s.
- Store Q values in a table, and update them incrementally.
Update rule: Q(s,a) ← Q(s,a) + α·[ r + γ·max_a' Q(s',a') − Q(s,a) ]
Equivalently: Q(s,a) ← (1 − α)·Q(s,a) + α·[ r + γ·max_a' Q(s',a') ]
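A corresponding sketch in code (Python; the (state, action)-keyed dictionary is an implementation choice, not specified on the slides):

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.2, gamma=0.9):
    """One Q-learning update. Q maps (state, action) pairs to estimated values."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)  # max_a' Q(s', a')
    td_error = r + gamma * best_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q
```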
Exercise: carry out Q-learning
Discount: 0.9. Learning rate: 0.2. We’ve already seen the terminal states. Use these exploration traces:
- (0,0)→(1,0)→(2,0)→(2,1)→(3,1)
- (0,0)→(0,1)→(0,2)→(1,2)→(2,2)→(3,2)
- (0,0)→(0,1)→(0,2)→(1,2)→(2,2)→(2,1)→(3,1)
- (0,0)→(1,0)→(2,0)→(2,1)→(2,2)→(3,2)
- (0,0)→(1,0)→(2,0)→(3,0)→(3,1)
- (0,0)→(0,1)→(0,2)→(1,2)→(2,2)→(3,2)
[Grid-world figure: a 4×3 grid with all Q-values initialized to 0, a +1 terminal state at (3,2), and a −1 terminal state at (3,1).]
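A hedged sketch of how the exercise could be replayed in code (Python). The reward convention assumed here (+1 for entering (3,2), −1 for entering (3,1), 0 otherwise) and the identification of each action with the neighbor moved to are my reading of the grid, so check them against the lab setup:

```python
GAMMA, ALPHA = 0.9, 0.2
TERMINAL = {(3, 2): 1.0, (3, 1): -1.0}   # assumed terminal rewards

traces = [
    [(0, 0), (1, 0), (2, 0), (2, 1), (3, 1)],
    [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (3, 2)],
    # ... the remaining traces from the slide
]

Q = {}  # maps (state, next_state) to a value; the "action" is the neighbor moved to

for trace in traces:
    for s, s_next in zip(trace, trace[1:]):
        r = TERMINAL.get(s_next, 0.0)
        if s_next in TERMINAL:
            best_next = 0.0              # terminal states have no future value
        else:
            seen = [v for (st, _), v in Q.items() if st == s_next]
            best_next = max(seen, default=0.0)   # max over actions tried from s_next so far
        td_error = r + GAMMA * best_next - Q.get((s, s_next), 0.0)
        Q[(s, s_next)] = Q.get((s, s_next), 0.0) + ALPHA * td_error

print(Q)
```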
Exploration policy vs. optimal policy
Where do the exploration traces come from?
- We need some policy for acting in the environment before we understand it.
- We’d like to get decent rewards while exploring.
○ Explore/exploit tradeoff.
In lab, we’re using an epsilon-greedy exploration policy (sketched below).
After exploration, taking random bad moves doesn’t make much sense.
- If Q-value estimates are correct, a greedy policy is optimal.
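A minimal epsilon-greedy sketch (Python; the Q-table format and the value of epsilon are placeholders, not prescribed by the slides):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon take a random action (explore); otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)                       # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))   # exploit
```

Setting epsilon to 0 recovers the purely greedy policy, which is optimal once the Q-value estimates are correct.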