Wentworth Institute of Technology
COMP4050 – Machine Learning | Fall 2015 | Derbinsky

Lecture 8: Reinforcement Learning
November 24, 2015
Outline
1. Context
2. TD Learning
3. Issues
Machine Learning Tasks
• Supervised – Given a training set and a target variable, generalize; performance is measured over a testing set
• Unsupervised – Given a dataset, find "interesting" patterns; potentially no "right" answer
• Reinforcement – Learn an optimal action policy over time; given an environment that provides states, affords actions, and provides feedback as numerical reward, maximize the expected future reward
  – Never given I/O pairs
  – Focus: online learning (balancing exploration/exploitation)
Success Stories
The Agent-Environment Interface
[Figure: at each step t, the agent in state s_t takes action a_t; the (stochastic) environment responds with reward r_{t+1} and next state s_{t+1}]
Pole Balancing
Multi-Armed Bandit
Types of Tasks
• Some tasks are continuing, meaning they are an ongoing sequence of decisions with no natural end
• Some tasks are episodic, meaning there exist terminal states that reset the problem
Policies
A policy π(s, a) is a function that associates a probability with taking a particular action a in a particular state s.
The goal of RL is to learn an "effective" policy for a particular task.
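A stochastic policy can be sketched as a table mapping each state to a probability distribution over actions. The state and action names below are purely illustrative, not from the lecture:

```python
# A stochastic policy pi(s, a): for each state, a probability
# distribution over the actions available in that state.
# (State/action names here are hypothetical.)
policy = {
    "low_battery":  {"recharge": 0.8, "search": 0.1, "wait": 0.1},
    "high_battery": {"recharge": 0.0, "search": 0.7, "wait": 0.3},
}

def pi(s, a):
    """Probability of taking action a in state s under this policy."""
    return policy[s].get(a, 0.0)

# Sanity check: each state's action probabilities must sum to 1.
for s, dist in policy.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9

print(pi("low_battery", "recharge"))  # 0.8
```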
Objective
Select actions so that the sum of the discounted rewards the agent receives over the future is maximized
– Discount rate: 0 ≤ γ ≤ 1

R_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + … = Σ_{k=0}^{∞} γ^k · r_{t+k+1}
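The discounted return above can be computed directly from a (finite) observed reward sequence; a minimal sketch, where `rewards[0]` plays the role of r_{t+1}:

```python
def discounted_return(rewards, gamma):
    """R_t = sum_{k=0}^{inf} gamma^k * r_{t+k+1}, truncated to the
    observed reward sequence. rewards[k] corresponds to r_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# With gamma = 0.5: 1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_return([1, 2, 3], 0.5))  # 2.75
```

Note that with γ < 1 the infinite sum stays finite even for continuing tasks, which is why the discount rate is bounded by 1.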
Environmental Modeling
• An important issue in RL is state representation
  – Current sensors (observability!)
  – Past history?
• A stochastic process has the Markov property if the conditional probability distribution of future states of the process depends only upon the present state
  – Given the present, the future does not depend on the past
  – Memoryless, pathless
Implications of the Markov Property
Often the process is not strictly Markovian, but we can either (i) approximate it as such and yield good results, or (ii) include a fixed window of history as state.

Thus we can approximate

P(s_{t+1} = s′, r_{t+1} = r | s_t, a_t, s_{t−1}, a_{t−1}, …, r_1, s_0, a_0)

via

P(s_{t+1} = s′, r_{t+1} = r | s_t, a_t)
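Option (ii) above, folding a fixed window of history into the state, can be sketched as follows (the interface is an illustrative assumption, not a standard API):

```python
from collections import deque

def make_windowed_state(window):
    """Return a function that folds the last `window` observations
    into a single tuple-valued state, approximating the Markov
    property when one observation alone is not enough."""
    history = deque(maxlen=window)
    def observe(obs):
        history.append(obs)       # oldest observation drops off
        return tuple(history)     # the tuple is the agent's "state"
    return observe

observe = make_windowed_state(2)
print(observe("a"))  # ('a',)
print(observe("b"))  # ('a', 'b')
print(observe("c"))  # ('b', 'c')
```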
Markov Decision Processes
If a process is Markovian, we can model it as a 5-tuple MDP: (S, A, P(·,·), R(·,·), γ)
– S: set of states
– A: set of actions
– P_a(s, s′): transition function (probability of moving from s to s′ via action a)
– R_a(s, s′): immediate reward
– γ: discount rate
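A tiny MDP can be written out literally as that 5-tuple; the two-state example below is a hypothetical illustration (loosely in the spirit of a battery-powered robot), not the lecture's model:

```python
# A toy MDP as a 5-tuple (S, A, P, R, gamma).
# P[(s, a)] maps next state s' -> probability;
# R[(s, a, s')] is the immediate reward. All values illustrative.
S = {"high", "low"}
A = {"search", "wait"}
gamma = 0.9

P = {
    ("high", "search"): {"high": 0.7, "low": 0.3},
    ("high", "wait"):   {"high": 1.0},
    ("low",  "search"): {"low": 0.6, "high": 0.4},
    ("low",  "wait"):   {"low": 1.0},
}
R = {
    ("high", "search", "high"): 2.0, ("high", "search", "low"): 2.0,
    ("high", "wait",   "high"): 1.0,
    ("low",  "search", "low"):  2.0, ("low",  "search", "high"): -3.0,
    ("low",  "wait",   "low"):  1.0,
}

# Sanity check: transition probabilities out of each (s, a) sum to 1.
for sa, dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```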
Recycling Robot MDP
Value Functions
Almost all RL algorithms are based on estimating value functions: functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state).
Value functions are defined with respect to particular policies.
State-Value Function

V^π(s) = E_π[ R_t | s_t = s ] = E_π[ Σ_{k=0}^{∞} γ^k · r_{t+k+1} | s_t = s ]
Action-Value Function

Q^π(s, a) = E_π[ R_t | s_t = s, a_t = a ] = E_π[ Σ_{k=0}^{∞} γ^k · r_{t+k+1} | s_t = s, a_t = a ]
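The two value functions are linked: the state value is the policy-weighted average of the action values, V^π(s) = Σ_a π(s, a)·Q^π(s, a). A minimal sketch of that identity for a single state (action names hypothetical):

```python
def v_from_q(pi_s, q_s):
    """V^pi(s) = sum_a pi(s, a) * Q^pi(s, a).
    pi_s: dict action -> probability under the policy at state s.
    q_s:  dict action -> estimated action value at state s."""
    return sum(pi_s[a] * q_s[a] for a in pi_s)

pi_s = {"left": 0.5, "right": 0.5}
q_s  = {"left": 2.0, "right": 4.0}
print(v_from_q(pi_s, q_s))  # 3.0
```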
Example: Golf
Temporal Difference (TD) Learning
• Combines ideas from Monte Carlo sampling and dynamic programming
• Learns directly from raw experience without a model of environment dynamics
• Updates estimates based in part on other learned estimates, without waiting for a final outcome
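The "update estimates from other estimates" idea is visible in the simplest TD rule, TD(0) for state values, which moves V(s) toward the bootstrapped target r + γ·V(s′) after a single step; a minimal sketch:

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """TD(0) state-value update: nudge V(s) toward the bootstrapped
    target r + gamma * V(s'), without waiting for episode end."""
    V[s] = V[s] + alpha * (r + gamma * V[s_next] - V[s])

V = {"A": 0.0, "B": 1.0}
td0_update(V, "A", r=1.0, s_next="B", alpha=0.5, gamma=0.9)
print(V["A"])  # 0 + 0.5 * (1 + 0.9*1 - 0) = 0.95
```

The target uses the current estimate V(s′) rather than an observed final return, which is exactly the dynamic-programming-style bootstrapping the slide describes.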
Visual TD Learning
Q-Learning: Off-Policy TD Control
1. Initialize Q(s, a)
   – Random, optimistic, realistic, or knowledge-based
2. Repeat (for each episode):
   a. Initialize s
   b. Repeat (for each step of episode):
      i.   Choose action a via Q
      ii.  Take action a, observe r, s′
      iii. Q(s, a) ← Q(s, a) + α [ r + γ·max_{a′} Q(s′, a′) − Q(s, a) ]
      iv.  s ← s′
      until s is terminal
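The loop above can be sketched as tabular Q-learning on a toy episodic task. The environment interface (`reset`/`step`/`actions`) and the chain task are assumptions for illustration, not part of the lecture:

```python
import random
from collections import defaultdict

class ChainEnv:
    """Toy episodic task: states 0, 1, 2; moving right from 1
    reaches terminal state 2 and yields reward +1."""
    def reset(self):
        return 0
    def actions(self, s):
        return ["left", "right"]
    def step(self, s, a):
        s2 = min(s + 1, 2) if a == "right" else max(s - 1, 0)
        done = (s2 == 2)
        return (1.0 if done else 0.0), s2, done

def q_learning(env, episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                 # step 1: initialize Q(s, a)
    for _ in range(episodes):              # step 2: for each episode
        s = env.reset()                    # 2a: initialize s
        done = False
        while not done:                    # 2b: for each step
            acts = env.actions(s)
            if random.random() < epsilon:  # i: choose action via Q
                a = random.choice(acts)    #    (epsilon-greedy)
            else:
                a = max(acts, key=lambda x: Q[(s, x)])
            r, s2, done = env.step(s, a)   # ii: take action, observe r, s'
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in env.actions(s2))
            # iii: off-policy TD update toward r + gamma * max_a' Q(s', a')
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2                         # iv: s <- s'
    return Q

random.seed(0)
Q = q_learning(ChainEnv(), episodes=300)
```

Q-learning is *off-policy*: the update target uses the greedy max over next actions even when the behavior policy explored, so it learns the optimal action values regardless of the ε-greedy behavior.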
Choosing Actions
• Given a Q function, a common approach to selecting an action is ε-greedy:
  1. Select a random value in [0, 1]
     – If > ε, take the action with the highest estimated value
     – Else, select an action at random
• In the limit, every action will be sampled an infinite number of times
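The ε-greedy rule above is a few lines of code; a minimal sketch, with action values passed in as a dict:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """q_values: dict mapping action -> estimated value.
    With probability epsilon explore uniformly at random;
    otherwise exploit the greedy (highest-value) action."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore
    return max(q_values, key=q_values.get)     # exploit

# With epsilon = 0, selection is purely greedy.
print(epsilon_greedy({"a": 1.0, "b": 3.0, "c": 2.0}, epsilon=0.0))  # b
```

Because every action keeps a nonzero selection probability for ε > 0, each action is sampled infinitely often in the limit, which is the condition the slide notes.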
Function Representation
• Given large state-action spaces, there is a practical problem of how to sample the space, and how to represent it
• Modern approaches include hierarchical methods and neural networks
Application: Michigan Liar's Dice
• Multi-agent opponents
• Varying degrees of background knowledge
  – Opponent modeling
  – Probabilistic calculation
  – Symbolic heuristics
Evaluation: Learning vs. Static
[Figure: Games Won vs. Blocks of Training (250 Games/Block)]
Evaluation: Learning vs. Learned
[Figure: Games Won vs. Blocks of Training (250 Games/Block)]
Evaluation: Value-Function Initialization
[Figure: Games Won vs. Blocks of Training (250 Games/Block); series: PMH, PM, PH, P, PMH-0, PM-0, PH-0, P-0]
Summary
• Reinforcement Learning (RL) is the problem of learning an effective action policy for obtaining reward
• Most RL algorithms model the task as a Markov Decision Process (MDP) and estimate the value of states/state-actions in a value function
• Temporal-Difference (TD) Learning is one effective method that is online and model-free