Q-Learning
2/22/17
MDP Examples
MDPs model environments where state transitions are affected both by the agent’s action and by external random elements.
Gridworld: randomness from noisy movement control
PacMan: randomness from ghost movement
The value of a state (or action) is the expected sum of discounted rewards:

    V(s) = E[ Σ_{t=0..∞} γ^t r_t | s_0 = s ]    (for an action a, also condition on taking a in s_0)

    γ = discount factor
    r_t = reward at time t
    values = {state: R(state) for each state}
    until values don’t change:
        prev = copy of values
        for each state s:
            initialize best_EV to -infinity
            for each action:
                EV = 0
                for each next state ns:
                    EV += prob * prev[ns]    # prob = transition probability of reaching ns from s under this action
                best_EV = max(EV, best_EV)
            values[s] = R(s) + gamma * best_EV
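As a rough sketch of the same algorithm in runnable Python, assuming the MDP is given as a reward dict R, a transition dict T mapping (state, action) to a {next_state: probability} dict, and an actions dict listing the actions available in each state (all illustrative names, not from the slides):

    def value_iteration(R, T, actions, gamma=0.9, tol=1e-6):
        # Start from the immediate rewards, then repeatedly back up one step.
        values = {s: R[s] for s in R}
        while True:
            prev = dict(values)           # snapshot from the previous iteration
            delta = 0.0
            for s in R:
                if not actions[s]:        # terminal state: value is just R(s)
                    values[s] = R[s]
                    continue
                # Best expected next-state value over actions, under the transition probabilities.
                best_EV = max(
                    sum(p * prev[ns] for ns, p in T[(s, a)].items())
                    for a in actions[s]
                )
                values[s] = R[s] + gamma * best_EV
                delta = max(delta, abs(values[s] - prev[s]))
            if delta < tol:               # stop once values no longer change
                return values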
Once we know values, the optimal policy is easy: in each state, pick the action with the highest expected value of the next state, weighted by the transition probabilities.
Why does this work? Why don’t we need to consider the future? The state-values already incorporate the future.
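A sketch of this policy extraction, using the same hypothetical R/T/actions representation as the value-iteration sketch above:

    def extract_policy(values, R, T, actions):
        # For each state, choose the action whose expected next-state value is highest.
        policy = {}
        for s in R:
            if not actions[s]:
                policy[s] = None          # terminal state: no action to take
                continue
            policy[s] = max(
                actions[s],
                key=lambda a: sum(p * values[ns] for ns, p in T[(s, a)].items())
            )
        return policy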
If we know the full MDP (rewards and transition probabilities):
Then we can use value iteration to find an optimal policy before we start acting.
If we don’t know the MDP (rewards and transition probabilities):
Then we need to try out various actions to see what happens. This is Reinforcement Learning.
Key idea: Update estimates based on experience, using differences in utilities between successive states.
Update rule:     V(s) += α [R(s) + γ V(s′) − V(s)]
Equivalently:    V(s) = α [R(s) + γ V(s′)] + (1 − α) V(s)
The bracketed quantity R(s) + γ V(s′) − V(s) is the temporal difference.
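A minimal sketch of a single TD(0) update in Python, with values stored in a dict V; the function name and defaults are illustrative:

    def td_update(V, s, reward, s_next, alpha=0.2, gamma=0.9):
        # Move V(s) a fraction alpha toward the one-step target R(s) + gamma * V(s').
        td_error = reward + gamma * V[s_next] - V[s]
        V[s] += alpha * td_error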
TD learning maintains no model of the environment.
Yet TD learning converges to correct value estimates. Why? Consider how values will be modified...
Key idea: TD learning on (state, action) pairs.
Update rule:     Q(s, a) += α [R(s) + γ max_{a′} Q(s′, a′) − Q(s, a)]
Equivalently:    Q(s, a) = α [R(s) + γ max_{a′} Q(s′, a′)] + (1 − α) Q(s, a)
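A sketch of this update in Python, assuming Q is a dict keyed by (state, action) pairs (missing entries treated as 0) and next_actions lists the actions available in s′; the names are illustrative:

    def q_update(Q, s, a, reward, s_next, next_actions, alpha=0.2, gamma=0.9):
        # Target uses the best action available from the next state (off-policy).
        best_next = max((Q.get((s_next, a2), 0.0) for a2 in next_actions), default=0.0)
        old = Q.get((s, a), 0.0)
        Q[(s, a)] = old + alpha * (reward + gamma * best_next - old)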
[Gridworld figure: grid of estimated values V(s′), including a +1 terminal state]
discount: 0.9, learning rate: 0.2
We’ve already seen the terminal states. Use these exploration traces:
(0,0)→(1,0)→(2,0)→(2,1)→(3,1)
(0,0)→(0,1)→(0,2)→(1,2)→(2,2)→(3,2)
(0,0)→(0,1)→(0,2)→(1,2)→(2,2)→(2,1)→(3,1)
(0,0)→(1,0)→(2,0)→(2,1)→(2,2)→(3,2)
(0,0)→(1,0)→(2,0)→(3,0)→(3,1)
(0,0)→(0,1)→(0,2)→(1,2)→(2,2)→(3,2)
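To make the mechanics concrete, here is a sketch of replaying the first trace with TD updates at the given discount and learning rate. The grid size, the zero rewards in non-terminal states, and the terminal values are placeholder assumptions to be filled in from the figure:

    # Assumed 4x3 grid with reward 0 in non-terminal states (fill in terminals from the figure).
    R = {(x, y): 0.0 for x in range(4) for y in range(3)}
    V = {s: 0.0 for s in R}
    # "We've already seen the terminal states": set their values before replaying, e.g.
    # V[(3, 1)] = R[(3, 1)] = ...   # terminal values from the figure
    # V[(3, 2)] = R[(3, 2)] = ...

    trace = [(0, 0), (1, 0), (2, 0), (2, 1), (3, 1)]    # first exploration trace
    for s, s_next in zip(trace, trace[1:]):
        V[s] += 0.2 * (R[s] + 0.9 * V[s_next] - V[s])   # TD update, alpha = 0.2, gamma = 0.9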
Once we know values, the optimal policy is easy: in each state, pick the action with the highest Q-value, since Q(s, a) directly gives the value of each action.
If our value estimates are correct, then this policy is optimal.
What policy should we follow while we’re learning (before we have good value estimates)?
We only learn about actions we try, so each action needs to be tried enough times that we have a good estimate of its value.
The Q-learning update is based on the best action, so we want a good estimate of the value of the best action.
We need a policy that handles this exploration/exploitation tradeoff.
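One standard policy that balances this tradeoff is ε-greedy (named here as a common example, not necessarily the one these slides go on to cover): with probability ε take a random action, otherwise take the action that currently looks best. A sketch:

    import random

    def epsilon_greedy(Q, s, actions, epsilon=0.1):
        # Explore with probability epsilon; otherwise exploit the current Q estimates.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((s, a), 0.0))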