SLIDE 1

Recap: MDPs


ØMarkov decision processes:

  • States S
  • Start state s0
  • Actions A
  • Transition p(s’|s,a) (or T(s,a,s’))
  • Reward R(s,a,s’) (and discount γ)

ØMDP quantities:

  • Policy = Choice of action for each (MAX) state
  • Utility (or return) = sum of discounted rewards
SLIDE 2

Optimal Utilities

ØThe value of a state s:

  • V*(s) = expected utility starting in s and acting optimally

ØThe value of a Q-state (s,a):

  • Q*(s,a) = expected utility starting in s, taking action a, and thereafter acting optimally

ØThe optimal policy:

  • π*(s) = optimal action from state s
SLIDE 3

Solving MDPs


ØValue iteration

  • Start with V1(s) = 0
  • Given Vi, calculate the values for all states for depth i+1:
  • Repeat until convergence
  • Use Vi as evaluation function when computing Vi+1

ØPolicy iteration

  • Step 1: policy evaluation: calculate utilities for some fixed policy
  • Step 2: policy improvement: update policy using one-step look-ahead with resulting utilities as future values
  • Repeat until policy converges

$$V_{i+1}(s) \leftarrow \max_a \sum_{s'} T(s,a,s')\left[R(s,a,s') + \gamma V_i(s')\right]$$
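Below is a minimal value-iteration sketch in Python implementing the update above. The data-structure names (`states`, `actions`, `T`, `R`) are illustrative assumptions, not part of the slides:

```python
# Minimal value-iteration sketch (assumed data structures, not from the slides):
#   states:  iterable of states
#   actions: dict mapping state -> list of available actions
#   T:       dict mapping (s, a) -> list of (s_next, prob) pairs
#   R:       dict mapping (s, a, s_next) -> reward
#   gamma:   discount factor in [0, 1)

def value_iteration(states, actions, T, R, gamma, num_iterations=100):
    V = {s: 0.0 for s in states}          # start with V_1(s) = 0
    for _ in range(num_iterations):       # or loop until the values stop changing
        V_next = {}
        for s in states:
            if not actions[s]:            # terminal state: value stays 0
                V_next[s] = 0.0
                continue
            # V_{i+1}(s) = max_a sum_{s'} T(s,a,s') [ R(s,a,s') + gamma * V_i(s') ]
            V_next[s] = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
                for a in actions[s]
            )
        V = V_next                        # use V_i as the evaluation function for V_{i+1}
    return V
```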

SLIDE 4

Reinforcement Learning

ØDon’t know T and/or R, but can observe R

  • Learn by doing
  • Can have multiple trials

SLIDE 5

The Story So Far: MDPs and RL


Things we know how to do:

ØIf we know the MDP

  • Compute V*, Q*, π* exactly
  • Evaluate a fixed policy π

ØIf we don’t know T and R

  • If we can estimate the MDP, then solve it
  • We can estimate V for a fixed policy π
  • We can estimate Q*(s,a) for the optimal policy while executing an exploration policy

Techniques:

  • Computation: value and policy iteration, policy evaluation
  • Model-based RL: sampling
  • Model-free RL: Q-learning
SLIDE 6

Model-Free Learning


ØModel-free (temporal difference) learning

  • Experience world through trials

(s,a,r,s’,a’,r’,s’’,a’’,r’’,s’’’…)

  • Update estimates each transition (s,a,r,s’)
  • Over time, updates will mimic Bellman updates

Q-Value Iteration (model-based, requires known MDP):

$$Q_{i+1}(s,a) \leftarrow \sum_{s'} T(s,a,s')\left[R(s,a,s') + \gamma \max_{a'} Q_i(s',a')\right]$$

Q-Learning (model-free, requires only experienced transitions):

$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a')\right]$$
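A minimal sketch of the tabular Q-learning update above, assuming states and actions are hashable and the Q-table is kept as a dictionary (names are illustrative, not from the slides):

```python
from collections import defaultdict

# Tabular Q-learning sketch: Q[s][a] defaults to 0.0 for unseen state-action pairs.
Q = defaultdict(lambda: defaultdict(float))

def q_learning_update(s, a, r, s_next, actions_next, alpha, gamma):
    """Apply one Q-learning update for the observed transition (s, a, r, s_next)."""
    # max_{a'} Q(s', a'); 0 if s_next is terminal (no available actions)
    max_next = max((Q[s_next][a2] for a2 in actions_next), default=0.0)
    sample = r + gamma * max_next                       # one-sample estimate of the target
    # Running-average form from the slide: Q(s,a) <- (1 - alpha) Q(s,a) + alpha * sample
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * sample
```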

SLIDE 7

Q-Learning


ØQ-learning produces tables of q-values:

SLIDE 8

Exploration / Exploitation


ØRandom actions (ε-greedy)

  • Every time step, flip a coin
  • With probability ε, act randomly
  • With probability 1-ε, act according to the current policy
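A minimal sketch of ε-greedy action selection, assuming access to a Q-table like the one in the Q-learning sketch above (function and argument names are illustrative):

```python
import random

def epsilon_greedy_action(Q, s, legal_actions, epsilon):
    """With probability epsilon explore (random action); otherwise exploit (greedy action)."""
    if random.random() < epsilon:
        return random.choice(legal_actions)               # explore
    return max(legal_actions, key=lambda a: Q[s][a])      # exploit current Q estimates
```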

SLIDE 9

Today: Q-Learning with state abstraction


ØIn realistic situations, we cannot possibly learn about every single state!

  • Too many states to visit them all in training
  • Too many states to hold the Q-tables in memory

ØInstead, we want to generalize:

  • Learn about some small number of training states from experience
  • Generalize that experience to new, similar states
  • This is a fundamental idea in machine learning, and we’ll see it over and over again

SLIDE 10

Example: Pacman


ØLet’s say we discover through experience that this state is bad.

ØIn naive Q-learning, we know nothing about this state or its Q-states.

ØOr even this one!

SLIDE 11

Feature-Based Representations


ØSolution: describe a state using a vector of features (properties)

  • Features are functions from states to real numbers (often 0/1) that capture important properties of the state
  • Example features:
    • Distance to closest ghost
    • Distance to closest dot
    • Number of ghosts
    • 1 / (distance to dot)²
    • Is Pacman in a tunnel? (0/1)
    • …etc.
  • Can also describe a Q-state (s,a) with features (e.g. action moves closer to food)

Similar to an evaluation function

SLIDE 12

Linear Feature Functions


ØUsing a feature representation, we can write a Q function for any state using a few weights:

$$Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + \cdots + w_n f_n(s,a)$$

ØAdvantage: more efficient learning from samples

ØDisadvantage: states may share features but actually be very different in value!
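A minimal sketch of such a linear Q-function, assuming a hypothetical feature extractor that returns features as a dict (names are illustrative, not from the slides):

```python
# Linear Q-function sketch: Q(s,a) = sum_i w_i * f_i(s,a).
# `extract_features(s, a)` is a hypothetical feature extractor returning a dict
# {feature_name: value}; `weights` is a dict {feature_name: weight}.

def linear_q_value(weights, extract_features, s, a):
    features = extract_features(s, a)
    return sum(weights.get(name, 0.0) * value for name, value in features.items())
```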

SLIDE 13

Function Approximation


ØQ-learning with linear Q-functions:

$$Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + \cdots + w_n f_n(s,a)$$

For each observed transition (s,a,r,s’):

$$\text{difference} = \left[r + \gamma \max_{a'} Q(s',a')\right] - Q(s,a)$$

$$Q(s,a) \leftarrow Q(s,a) + \alpha\,[\text{difference}] \qquad \text{(exact Q's)}$$

$$w_i \leftarrow w_i + \alpha\,[\text{difference}]\,f_i(s,a) \qquad \text{(approximate Q's)}$$

ØIntuitive interpretation:

  • Adjust weights of active features
  • E.g. if something unexpectedly bad happens, disprefer all states with that state’s features
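A minimal sketch of this approximate Q-learning weight update, reusing the hypothetical `extract_features` convention from the previous slide's sketch:

```python
def approximate_q_update(weights, extract_features, s, a, r, s_next,
                         legal_actions_next, alpha, gamma):
    """One weight update for the observed transition (s, a, r, s_next)."""
    features = extract_features(s, a)
    q_sa = sum(weights.get(k, 0.0) * v for k, v in features.items())
    # max_{a'} Q(s', a'); 0 if s_next is terminal (no legal actions)
    max_next = max(
        (sum(weights.get(k, 0.0) * v for k, v in extract_features(s_next, a2).items())
         for a2 in legal_actions_next),
        default=0.0,
    )
    difference = (r + gamma * max_next) - q_sa
    # w_i <- w_i + alpha * difference * f_i(s,a): only active features are adjusted
    for k, v in features.items():
        weights[k] = weights.get(k, 0.0) + alpha * difference * v
```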

SLIDE 14

Example: Q-Pacman


Transition: s → s’ with a = NORTH, r = -500

Q(s,a) = 4.0 f_DOT(s,a) - 1.0 f_GST(s,a)

f_DOT(s,NORTH) = 0.5    f_GST(s,NORTH) = 1.0

Q(s,a) = +1    R(s,a,s’) = -500    difference = -501

w_DOT ← 4.0 + α[-501] 0.5
w_GST ← -1.0 + α[-501] 1.0

Q(s,a) = 3.0 f_DOT(s,a) - 3.0 f_GST(s,a)
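A small check of the arithmetic above using the update sketched on the previous slide. The learning rate is not stated on the slide; α = 0.004 is an assumed value that approximately reproduces the final weights shown:

```python
# Plugging the slide's numbers into the approximate Q-learning weight update.
# alpha = 0.004 is an assumption chosen to roughly match the final weights 3.0 and -3.0.
alpha, gamma = 0.004, 1.0
w_dot, w_gst = 4.0, -1.0
f_dot, f_gst = 0.5, 1.0                      # features of (s, NORTH)
q_sa = w_dot * f_dot + w_gst * f_gst         # = +1.0
r, max_next = -500.0, 0.0                    # difference = -501 implies max_{a'} Q(s',a') = 0
difference = (r + gamma * max_next) - q_sa   # = -501.0
w_dot += alpha * difference * f_dot          # ≈ 3.0
w_gst += alpha * difference * f_gst          # ≈ -3.0
print(difference, round(w_dot, 2), round(w_gst, 2))
```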

SLIDE 15

Linear Regression


Prediction with one feature:

$$\hat{y} = w_0 + w_1 f_1(x)$$

Prediction with two features:

$$\hat{y} = w_0 + w_1 f_1(x) + w_2 f_2(x)$$

SLIDE 16

Ordinary Least Squares (OLS)


$$\text{total error} = \sum_i \left(y_i - \hat{y}_i\right)^2 = \sum_i \left(y_i - \sum_k w_k f_k(x_i)\right)^2$$
SLIDE 17

Minimizing Error


Imagine we had only one point x with features f(x):

$$\text{error}(w) = \frac{1}{2}\left(y - \sum_k w_k f_k(x)\right)^2$$

$$\frac{\partial\,\text{error}(w)}{\partial w_m} = -\left(y - \sum_k w_k f_k(x)\right) f_m(x)$$

$$w_m \leftarrow w_m + \alpha\left(y - \sum_k w_k f_k(x)\right) f_m(x)$$

Approximate Q update as a one-step gradient descent:

$$w_m \leftarrow w_m + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right] f_m(s,a)$$

where r + γ max_{a'} Q(s',a') is the “target” and Q(s,a) is the “prediction”.
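A minimal sketch of this single-point gradient step for the regression case (names are illustrative, not from the slides):

```python
# One gradient-descent step on the squared error for a single point (x, y).
# `features` is a dict {name: f_m(x)}; `weights` is a dict {name: w_m}.
def gradient_step(weights, features, y, alpha):
    prediction = sum(weights.get(k, 0.0) * v for k, v in features.items())
    error = y - prediction                       # (y - sum_k w_k f_k(x))
    for k, v in features.items():
        # w_m <- w_m + alpha * (y - prediction) * f_m(x)
        weights[k] = weights.get(k, 0.0) + alpha * error * v
    return weights
```

The approximate Q-learning weight update has exactly this form, with the target r + γ max_{a'} Q(s',a') playing the role of y and Q(s,a) playing the role of the prediction.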

SLIDE 18

How many features should we use?

ØAs many as possible?

  • computational burden
  • overfitting

ØFeature selection is important

  • requires domain expertise

SLIDE 19

Overfitting


SLIDE 20

Overview of Project 3

ØMDPs

  • Q1: value iteration
  • Q2: find parameters that lead to a certain optimal policy
  • Q3: similar to Q2

ØQ-learning

  • Q4: implement the Q-learning algorithm
  • Q5: implement ε-greedy action selection
  • Q6: try the algorithm

ØApproximate Q-learning and state abstraction

  • Q7: Pacman

ØTips

  • make your implementation general