Todays Outline Reinforcement Learning Dan Weld Reinforcement - PDF document

5/7/2012 CSE 473: Artificial Intelligence Today’s Outline Reinforcement Learning Dan Weld  Reinforcement Learning  Q-value iteration  Q-learning  Exploration / exploitation  Linear function approximation  Linear function approximation Many slides adapted from either Dan Klein, Stuart Russell, Luke Zettlemoyer or Andrew Moore 1 Recap: MDPs Bellman Equations  Markov decision processes: s  States S a  Actions A s, a  Transitions T(s,a,s ʼ ) aka P(s ʼ |s,a) s,a,s ʼ  Rewards R(s,a,s ʼ ) (and discount  ) s ʼ  Start state s 0 (or distribution P 0 ) 0 ( 0 )  Algorithms  Value Iteration  Q-value iteration Q*(a, s) =  Quantities:  Policy = map from states to actions  Utility = sum of discounted future rewards  Q-Value = expected utility from a q-state Andrey Markov  Ie. from a state/action pair (1856 ‐ 1922) 4 Bellman Backup Q-Value Iteration  Regular Value iteration: find successive approx optimal values Q 5 (s,a 1 ) = 2 +  0  Start with V 0 * (s) = 0 ~ 2  Given V i * , calculate the values for all states for depth i+1: Q 5 (s,a 2 ) = 5 +  0.9~ 1 s 1 a 1 V 4 = 0 Q i+1 (s,a) +  0.1~ 2 V 5 = 6.5 ~ 6.1 5 s 0 s 0 a 2 a 2 Q 5 (s,a 3 ) = 4.5 +  2 s 2 V 4 = 1 ~ 6.5 a 3  Storing Q-values is more useful! s 3  Start with Q 0 V 4 = 2 * (s,a) = 0 max  Given Q i * , calculate the q-values for all q-states for depth i+1: V i (s’) ] 1

5/7/2012 Q-Value Iteration Reinforcement Learning  Markov decision processes: Initialize each q-state: Q 0 (s,a) = 0  States S s  Actions A Repeat a  Transitions T(s,a,s ʼ ) aka P(s ʼ |s,a) For all q-states, s,a s, a  Rewards R(s,a,s ʼ ) (and discount  ) Compute Q i+1 (s,a) from Q i by Bellman backup at s,a. s,a,s ʼ  Start state s 0 (or distribution P 0 ) 0 ( 0 ) Until max s,a |Q i+1 (s,a) – Q i (s,a)| <  s ʼ  Algorithms  Q-value iteration  Q-learning  Approaches for mixing exploration & exploitation V i (s’) ]   -greedy  Exploration functions Applications Stanford Autonomous Helicopter http://heli.stanford.edu/  Robotic control  helicopter maneuvering, autonomous vehicles  Mars rover - path planning, oversubscription planning g g  elevator planning  Game playing - backgammon, tetris, checkers  Neuroscience  Computational Finance, Sequential Auctions  Assisting elderly in simple tasks  Spoken dialog management  Communication Networks – switching, routing, flow control  War planning, evacuation planning 10 Two main reinforcement learning Two main reinforcement learning approaches approaches  Model-based approaches:  Model-based approaches:  explore environment & learn model, T=P( s ʼ | s , a ) and R( s , a ), Learn T + R (almost) everywhere |S| 2 |A| + |S||A| parameters (40,000)  use model to plan policy, MDP-style  approach leads to strongest theoretical results h l d t t t th ti l lt  often works well when state-space is manageable  Model-free approach:  Model-free approach: Learn Q  don ʼ t learn a model; learn value function or policy directly |S||A| parameters (400)  weaker theoretical results  often works better when state space is large 2

5/7/2012 Recap: Sampling Expectations Recap: Exp. Moving Average  Want to compute an expectation weighted by P(x):  Exponential moving average  Makes recent samples more important  Model-based: estimate P(x) from samples, compute expectation  Forgets about the past (distant past values were wrong anyway)  Easy to compute from the running average  Model-free: estimate expectation directly from samples  Decreasing learning rate can give converging averages  Why does this work? Because samples appear with the right frequencies! Q-Learning Update Exploration-Exploitation tradeoff  You have visited part of the state space and found a  Q-Learning = sample-based Q-value iteration reward of 100  is this the best you can hope for???  How learn Q*(s,a) values?  Exploitation : should I stick with what I know and find  Receive a sample (s a s ʼ r)  Receive a sample (s,a,s ,r) a good policy w.r.t. this knowledge? d li hi k l d ?  Consider your old estimate:  at risk of missing out on a better reward somewhere  Consider your new sample estimate:  Exploration : should I look for states w/ more reward?  Incorporate the new estimate into a running average:  at risk of wasting time & getting some negative reward 16 Q-Learning:  Greedy Exploration / Exploitation  Several schemes for action selection  Simplest: random actions (  greedy )  Every time step, flip a coin  With probability  , act randomly  With probability 1-  , act according to current policy With b bilit 1 t di t t li QuickTime™ and a H.264 decompressor  Problems with random actions? are needed to see this picture.  You do explore the space, but keep thrashing around once learning is done  One solution: lower  over time  Another solution: exploration functions 3

5/7/2012 Exploration Functions Q-Learning Final Solution  When to explore  Q-learning produces tables of q-values:  Random actions: explore a fixed amount  Better idea: explore areas whose badness is not (yet) established  Exploration function Exploration function  Takes a value estimate and a count, and returns an optimistic utility , e.g. (exact form not important)  Exploration policy π ( s ’ )= vs. Q-Learning – Small Problem Q-Learning Properties  Doesn’t work  Amazing result: Q-learning converges to optimal policy  If you explore enough  If you make the learning rate small enough  In realistic situations, we can’t possibly learn about  … but not decrease it too quickly! every single state!  Not too sensitive to how you select actions (!) y ( )  Too many states to visit them all in training Too many states to visit them all in training  Too many states to hold the q-tables in memory  Neat property: off-policy learning  learn optimal policy without following it (some caveats)  Instead, we need to generalize :  Learn about a few states from experience  Generalize that experience to new, similar states (Fundamental idea in machine learning) S E S E Example: Pacman Feature-Based Representations  Let ʼ s say we discover  Solution: describe a state using a vector of features (properties) through experience  Features are functions from states to that this state is bad: real numbers (often 0/1) that capture important properties of the state  Example features:  Example features:  In naïve Q learning,  Distance to closest ghost we know nothing  Distance to closest dot about related states  Number of ghosts  1 / (dist to dot) 2 and their Q values:  Is Pacman in a tunnel? (0/1)  …… etc.  Or even this third one!  Can also describe a q-state (s, a) with features (e.g. action moves closer to food) 4

5/7/2012 Linear Feature Functions Function Approximation  Using a feature representation, we can write a q function (or value function) for any state  Q-learning with linear q-functions: using a linear combination of a few weights: Exact Q ʼ s  Advantage: our experience is summed up in Approximate Q ʼ s a few powerful numbers  Intuitive interpretation: |S| 2 |A| ? |S||A| ?  Adjust weights of active features  E.g. if something unexpectedly bad happens, disprefer all states  Disadvantage: states may share features but with that state ʼ s features actually be very different in value!  Formal justification: online least squares Example: Q-Pacman Linear Regression 40 26 24 20 22 20 20 30 40 20 30 20 10 0 10 0 20 0 0 Prediction Prediction Ordinary Least Squares (OLS) Minimizing Error Imagine we had only one point x with features f(x): Error or “ residual ” Error or residual Observation Prediction Approximate q update: “ target ” “ prediction ” 0 0 20 5

5/7/2012 Overfitting Which Algorithm? 30 Q-learning, no features, 50 learning trials: 25 20 Degree 15 polynomial 15 10 QuickTime™ and a 5 GIF decompressor are needed to see this picture. 0 -5 -10 -15 0 2 4 6 8 10 12 14 16 18 20 Which Algorithm? Which Algorithm? Q-learning, no features, 1000 learning trials: Q-learning, simple features, 50 learning trials: QuickTime™ and a QuickTime™ and a GIF decompressor GIF decompressor are needed to see this picture. are needed to see this picture. Partially observable MDPs A POMDP: Ghost Hunter  Markov decision processes:  States S  Actions A b  Transitions P(s ʼ |s,a) (or T(s,a,s ʼ ))  Rewards R(s,a,s ʼ ) (and discount  ) a  Start state distribution b 0 =P(s 0 ) b, a QuickTime™ and a H.264 decompressor are needed to see this picture. o  POMDPs, just add: b ʼ  Observations O  Observation model P(o|s,a) (or O(s,a,o)) 6

Todays Outline Reinforcement Learning Dan Weld Reinforcement - PDF document

5/7/2012 CSE 473: Artificial Intelligence Todays Outline Reinforcement Learning Dan Weld Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Linear function

Ins Domingues Breast Cancer Workshop April 7th 2015 Outline Outline Outline Outline

What is the League Today 1 1/23/2017 What is the League Today What is the League Today 2

Social/Network/Analysis mohamed.bouguessa@uqo.ca/ 1 Web/today 2

Lecture 15 Logistics HW4 is due today HW5 posted today HW5 posted today Exam

WIEMANN LAMPHERE ARCHITECTS MONTPELIER TODAY MONTPELIER TODAY PARKING! VEHICLES ARE

Today. Types of graphs. Today. Types of graphs. Complete Graphs. Trees. Hypercubes. Today.

Welcome back. Today. Welcome back. Today. Continue Sampling combinatorial structures. Welcome

AST 1420 Galactic Structure and Dynamics Today: disks! NGC 5907 M31 Today: disks! Outline

1. Abertis today 2. 2016 Financial Year 3. Outlook 4. Conclusions Abertis today 2016

Matt Fisher EUA Coordinator Overview of Parramatta today Overview of Parramatta today Overview

Course Business New dataset on CourseWeb: bpd.csv Midterm project due today Today

Featherweight Scala Week 14 January 31 1 Today Previously: Featherweight Java Today:

Stuff New HW on the web later today No lab today Tests graded by Thurs Last Time

Welcome back. Today. Welcome back. Today. Review: Spectral gap, Edge expansion h ( G ) ,

Sorting 15-121 Fall 2020 Margaret Reid-Miller Today Margaret will have office hours today

Exceptions Announcements Exceptions Today's Topic: Handling Errors 4 Today's Topic: Handling

Single-pass restore after a media failure Caetano Sauer , Goetz Graefe, Theo Hrder 20% of

ECE232: Hardware Organization and Design Lecture 7: Binary Numbers and Adders Adapted from

Program Generation Workflow C**program compiler assemblycode nd libraryobjectcode

5DV118 Computer Organization and Architecture Ume University Department of Computing Science

Facultat d'Informtica de Barcelona Univ. Politcnica de Catalunya Administraci de Sistemes

RNN Tau Identi f ication IN THE ATLAS High-Level Trigger M ARIEL P ETTEE October 26 th , 2018 US

Doing PITR Right (Point-In-Time-Recovery) Who Am I? Stephen Frost Sr. Database Engineer @

Virtual File System Don Porter CSE 506 History Early OSes provided a single file system

Todays Outline Reinforcement Learning Dan Weld Reinforcement - PDF document

5/7/2012 CSE 473: Artificial Intelligence Todays Outline Reinforcement Learning Dan Weld Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Linear function

Ins Domingues Breast Cancer Workshop April 7th 2015 Outline Outline Outline Outline

What is the League Today 1 1/23/2017 What is the League Today What is the League Today 2

Social/Network/Analysis mohamed.bouguessa@uqo.ca/ 1 Web/today 2

Lecture 15 Logistics HW4 is due today HW5 posted today HW5 posted today Exam

WIEMANN LAMPHERE ARCHITECTS MONTPELIER TODAY MONTPELIER TODAY PARKING! VEHICLES ARE

Today. Types of graphs. Today. Types of graphs. Complete Graphs. Trees. Hypercubes. Today.

Welcome back. Today. Welcome back. Today. Continue Sampling combinatorial structures. Welcome

AST 1420 Galactic Structure and Dynamics Today: disks! NGC 5907 M31 Today: disks! Outline

1. Abertis today 2. 2016 Financial Year 3. Outlook 4. Conclusions Abertis today 2016

Matt Fisher EUA Coordinator Overview of Parramatta today Overview of Parramatta today Overview

Course Business New dataset on CourseWeb: bpd.csv Midterm project due today Today

Featherweight Scala Week 14 January 31 1 Today Previously: Featherweight Java Today:

Stuff New HW on the web later today No lab today Tests graded by Thurs Last Time

Welcome back. Today. Welcome back. Today. Review: Spectral gap, Edge expansion h ( G ) ,

Sorting 15-121 Fall 2020 Margaret Reid-Miller Today Margaret will have office hours today

Exceptions Announcements Exceptions Today's Topic: Handling Errors 4 Today's Topic: Handling

Single-pass restore after a media failure Caetano Sauer , Goetz Graefe, Theo Hrder 20% of

ECE232: Hardware Organization and Design Lecture 7: Binary Numbers and Adders Adapted from

Program Generation Workflow C**program compiler assembly*code nd* library*object*code

5DV118 Computer Organization and Architecture Ume University Department of Computing Science

Facultat d'Informtica de Barcelona Univ. Politcnica de Catalunya Administraci de Sistemes

RNN Tau Identi f ication IN THE ATLAS High-Level Trigger M ARIEL P ETTEE October 26 th , 2018 US

Doing PITR Right (Point-In-Time-Recovery) Who Am I? Stephen Frost Sr. Database Engineer @

Virtual File System Don Porter CSE 506 History Early OSes provided a single file system

Program Generation Workflow C**program compiler assemblycode nd libraryobjectcode