
SLIDE 1

Reinforcement Learning

Steven J Zeil

Old Dominion Univ.

Fall 2010

Outline: Introduction, Model-based Learning, Temporal Difference Learning, Partially Observable States

SLIDE 2

Reinforcement Learning

Learning policies for which the ultimate payoff comes only after many steps

e.g., games, robotics

Unlike supervised learning, correct I/O pairs are not available

There may not be a single “correct” output

Heavy emphasis on on-line learning

SLIDE 7

Short-term versus Long-term Reward

Goal is to optimize a reward that may be given at the end of a sequence of state transitions

Approximated by a series of immediate rewards after each transition

Requires balance of short-term versus long-term planning

At any given step, the agent may engage in exploitation of what we know or exploration of unknown states

SLIDE 8

Basic Components

set of states S
set of actions A
rules for transitioning between states
rules for the immediate reward of a transition
rules for what the agent can observe
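To make these components concrete, here is a minimal sketch (not from the slides) of how they might be encoded for a tiny, fully specified problem. The state and action names, probabilities, and rewards are made up for illustration.

```python
# Hypothetical encoding of the basic RL components for a tiny problem.
states = ["s0", "s1"]                # set of states S
actions = ["left", "right"]          # set of actions A

# Transition rules: P(s' | s, a), stored as {(s, a): {s': prob}}
P = {
    ("s0", "left"):  {"s0": 0.9, "s1": 0.1},
    ("s0", "right"): {"s0": 0.2, "s1": 0.8},
    ("s1", "left"):  {"s0": 0.7, "s1": 0.3},
    ("s1", "right"): {"s0": 0.0, "s1": 1.0},
}

# Immediate reward of a transition: E[r | s, a]
R = {
    ("s0", "left"): 0.0, ("s0", "right"): 1.0,
    ("s1", "left"): 0.5, ("s1", "right"): -1.0,
}

# Observation rule (only interesting for partially observable problems):
# here the state is fully observed, so the observation is the state itself.
def observe(state, action):
    return state
```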

SLIDE 9

K-armed Bandit

Among K levers, choose the one that pays best

Q(a): value of action a. Reward is r_a. Set Q(a) = r_a. Choose a* if Q(a*) = max_a Q(a)

Rewards stochastic (keep an expected reward): Q_{t+1}(a) ← Q_t(a) + η [r_{t+1}(a) − Q_t(a)]
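A minimal sketch of the stochastic update above. The lever payout probabilities, the learning rate η, and the exploration rate are made-up illustration values.

```python
import random

K = 3
true_means = [0.2, 0.5, 0.8]      # hypothetical expected payouts of the K levers
Q = [0.0] * K                     # estimated value Q(a) of each action
eta = 0.1                         # learning rate

def pull(a):
    """Stochastic reward r_a: 1 with probability true_means[a], else 0."""
    return 1.0 if random.random() < true_means[a] else 0.0

for t in range(1000):
    # Explore occasionally, otherwise choose a* = argmax_a Q(a)
    a = random.randrange(K) if random.random() < 0.1 else max(range(K), key=lambda i: Q[i])
    r = pull(a)
    Q[a] += eta * (r - Q[a])      # Q_{t+1}(a) = Q_t(a) + eta * (r_{t+1}(a) - Q_t(a))

print(Q)   # estimates drift toward the true means; lever 2 ends up highest
```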

SLIDE 10

K-armed Bandit variants

This problem becomes more interesting if we don’t know all the r_a

Trade-off of exploitation and exploration

SLIDE 11

Policies and Cumulative Rewards

Policy: π : S → A, with a_t = π(s_t)

Value of a policy: V^π(s_t), the cumulative reward starting from s_t

Finite-horizon (episodic): V^π(s_t) = E[ Σ_{i=1}^{T} r_{t+i} ]

Infinite horizon: V^π(s_t) = E[ Σ_{i=1}^{∞} γ^{i−1} r_{t+i} ], where 0 ≤ γ < 1 is the discount rate
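As a quick illustration of the two cumulative-value definitions, the sketch below evaluates the finite-horizon and discounted sums for a made-up reward sequence and discount rate.

```python
# Rewards r_{t+1}, r_{t+2}, ... received after time t (illustration values).
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]
gamma = 0.9   # discount rate, 0 <= gamma < 1

# Finite-horizon (episodic) value: plain sum of the T rewards.
V_finite = sum(rewards)

# Infinite-horizon value: sum of gamma^(i-1) * r_{t+i} over the observed rewards.
V_discounted = sum(gamma ** i * r for i, r in enumerate(rewards))

print(V_finite, V_discounted)   # 6.0 and 0.81 + 5 * 0.9**4 = 4.0905
```

Distant rewards are worth less under discounting, which is what makes the infinite sum finite when γ < 1.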

SLIDE 12

State-Action pairs

V(s_t) is a measure of how good it is for the agent to be in state s_t

Alternatively, we can talk about Q(s_t, a_t): how good it is to perform action a_t when in state s_t

Q*(s_t, a_t) is the expected cumulative reward of action a_t taken in state s_t, assuming we follow an optimal policy afterwards

SLIDE 13

Optimal Policies

V*(s_t) = max_π V^π(s_t), ∀s_t

V*(s_t) = max_{a_t} Q*(s_t, a_t)

Bellman’s equation:

V*(s_t) = max_{a_t} ( E[r_{t+1}] + γ Σ_{s_{t+1}} P(s_{t+1} | s_t, a_t) V*(s_{t+1}) )

Q*(s_t, a_t) = E[r_{t+1}] + γ Σ_{s_{t+1}} P(s_{t+1} | s_t, a_t) max_{a_{t+1}} Q*(s_{t+1}, a_{t+1})

Choose the a_t that maximizes Q*(s_t, a_t) (greedy)
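A small sketch of a single Bellman backup for a hypothetical two-state, two-action problem: given P(s'|s,a), the expected rewards, and current estimates of V*, it computes Q*(s,a) and the greedy action in each state. All numbers are illustrative only.

```python
states, actions = [0, 1], [0, 1]
gamma = 0.9

# P[s][a][s'] = P(s' | s, a) and R[s][a] = E[r_{t+1} | s, a]  (made-up numbers)
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[0.0, 1.0],
     [0.5, 2.0]]

V = [0.0, 10.0]   # current estimates of V*(s)

for s in states:
    # One Bellman backup: Q(s,a) = E[r|s,a] + gamma * sum_s' P(s'|s,a) V(s')
    Q = [R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in states) for a in actions]
    greedy = max(actions, key=lambda a: Q[a])
    print(s, Q, "greedy action:", greedy)
```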

SLIDE 14

Model-based Learning

The environment model, P(s_{t+1} | s_t, a_t) and p(r_{t+1} | s_t, a_t), is known

There is no need for exploration

Can be solved using dynamic programming

SLIDE 15

Value Iteration

Initialize all V(s) to arbitrary values
repeat
    for all s ∈ S do
        for all a ∈ A do
            Q(s, a) ← E[r | s, a] + γ Σ_{s′∈S} P(s′ | s, a) V(s′)
        end for
        V(s) ← max_a Q(s, a)
    end for
until V(s) converges
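The pseudocode above translates almost line for line into Python. The sketch below runs it on a hypothetical two-state, two-action MDP (the transition probabilities and rewards are made up), sweeping all states until the values stop changing.

```python
states, actions, gamma = [0, 1], [0, 1], 0.9

# Known model: P[s][a][s'] = P(s' | s, a), R[s][a] = E[r | s, a]  (illustration values)
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[0.0, 1.0],
     [0.5, 2.0]]

V = [0.0 for _ in states]          # initialize V(s) arbitrarily
while True:
    V_new = list(V)
    for s in states:
        Q = [R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in states)
             for a in actions]
        V_new[s] = max(Q)          # V(s) <- max_a Q(s, a)
    if max(abs(V_new[s] - V[s]) for s in states) < 1e-8:   # convergence test
        break
    V = V_new

print(V_new)
```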

SLIDE 16

Policy Iteration

Initialize a policy π arbitrarily
repeat
    π′ ← π
    Compute the values V^π(s) using π by solving Bellman’s equation
    Improve the policy by choosing the best a at each state
until π = π′
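A compact sketch of policy iteration on the same kind of made-up two-state model: policy evaluation solves Bellman's (linear) equation for V^π exactly with numpy, and policy improvement picks the best action in each state. The model values are illustrative, not from the slides.

```python
import numpy as np

gamma = 0.9
# P[s][a][s'] and R[s][a] for a hypothetical 2-state, 2-action MDP
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.5, 2.0]])
n_states, n_actions = R.shape

pi = np.zeros(n_states, dtype=int)          # initialize a policy arbitrarily
while True:
    # Policy evaluation: solve V = R_pi + gamma * P_pi V  (Bellman's equation for pi)
    P_pi = np.array([P[s, pi[s]] for s in range(n_states)])
    R_pi = np.array([R[s, pi[s]] for s in range(n_states)])
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

    # Policy improvement: choose the best a at each state
    Q = R + gamma * P @ V                   # Q[s, a]
    pi_new = np.argmax(Q, axis=1)
    if np.array_equal(pi_new, pi):          # until pi = pi'
        break
    pi = pi_new

print(pi, V)
```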

SLIDE 17

Temporal Difference Learning

If we do not know the entire environment, we must do some exploration

Exploration will, in effect, take a sample from P(s_{t+1} | s_t, a_t) and p(r_{t+1} | s_t, a_t)

Use the reward received in the next time step to update the value of the current state (or state-action pair)

SLIDE 18

ε-greedy

For some ε:

with probability 1 − ε, choose the action a with the highest estimated value (exploit)

otherwise choose a random action (explore)

Softmax: P(a | s) = exp(Q(s, a) / T) / Σ_{b∈A} exp(Q(s, b) / T)

Simulated annealing with temperature T: start with a large T (near-uniform exploration) and decrease it over time, moving gradually toward greedy exploitation
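Both selection rules on this slide are easy to express directly. The sketch below assumes Q is a dict from actions to estimated values; the action names and numbers are made up.

```python
import math
import random

Q = {"left": 1.0, "right": 2.0, "listen": 0.5}   # illustrative value estimates

def epsilon_greedy(Q, epsilon=0.1):
    """With probability 1 - epsilon exploit the best-known action, else explore."""
    if random.random() < epsilon:
        return random.choice(list(Q))
    return max(Q, key=Q.get)

def softmax_choice(Q, T=1.0):
    """P(a) proportional to exp(Q(a) / T); lower T behaves more greedily."""
    weights = [math.exp(Q[a] / T) for a in Q]
    return random.choices(list(Q), weights=weights, k=1)[0]

print(epsilon_greedy(Q), softmax_choice(Q, T=0.5))
```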

SLIDE 19

Nondeterministic Rewards and Actions

When next states and rewards are nondeterministic (there is an opponent, or randomness in the environment), we keep running averages (expected values) instead of making direct assignments

Q-learning update:

Q̂(s_t, a_t) ← Q̂(s_t, a_t) + η [ r_{t+1} + γ max_{a_{t+1}} Q̂(s_{t+1}, a_{t+1}) − Q̂(s_t, a_t) ]

Off-policy (Q-learning) vs on-policy (Sarsa)
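The off-policy/on-policy distinction shows up only in the target term of the update. A minimal sketch (function and variable names are mine, not from the slides):

```python
# Q is a dict mapping (state, action) to the current estimate Q_hat(s, a).

def q_learning_update(Q, s, a, r, s_next, actions, eta=0.1, gamma=0.9):
    """Off-policy: bootstrap from the best action in s_next, whatever we actually do next."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += eta * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, eta=0.1, gamma=0.9):
    """On-policy (Sarsa): bootstrap from the action a_next the policy actually chose."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += eta * (target - Q[(s, a)])
```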

SLIDE 20

Q-learning

Initialize all Q(s, a) to arbitrary values
for all episodes do
    Initialize s
    repeat
        Choose a using a policy derived from Q (e.g., ε-greedy)
        Take action a, observe r and s′
        Update Q(s, a) (previous slide)
        s ← s′
    until s is a terminal state
end for
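A direct translation of the loop above, assuming a hypothetical episodic environment object `env` with `reset()` and `step(a)` returning (next state, reward, done flag); none of these names come from the slides.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, eta=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                       # Q(s, a), arbitrary (zero) initial values
    for _ in range(episodes):
        s = env.reset()                          # initialize s
        done = False
        while not done:                          # repeat until s is terminal
            # choose a from a policy derived from Q (epsilon-greedy)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s_next, r, done = env.step(a)        # take action a, observe r and s'
            best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += eta * (r + gamma * best_next - Q[(s, a)])
            s = s_next                           # s <- s'
    return Q
```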

SLIDE 21

Partially Observable States

The agent does not know its state but receives an observation o_{t+1}, drawn from p(o_{t+1} | s_t, a_t), which can be used to infer a belief about the underlying states

SLIDE 22

The Tiger Problem

Two doors, behind one of which there is a tiger

p: probability that the tiger is behind the left door

z is the hidden state: the location of the tiger

Rewards r(A, Z):

    r(A, Z)        Tiger left    Tiger right
    Open left         −100           +80
    Open right         +90          −100

Expected rewards: R(a_L) = −100p + 80(1 − p),  R(a_R) = 90p − 100(1 − p)
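The two expected rewards are simple functions of the belief p; a short sketch that evaluates them and reports which door is worth opening for a few sample beliefs (the p values are arbitrary).

```python
def R_open_left(p):
    """Expected reward of opening the left door when the tiger is behind it with prob p."""
    return -100 * p + 80 * (1 - p)

def R_open_right(p):
    return 90 * p - 100 * (1 - p)

for p in (0.2, 0.5, 0.8):
    better = "open left" if R_open_left(p) > R_open_right(p) else "open right"
    print(p, round(R_open_left(p), 1), round(R_open_right(p), 1), better)

# The two lines cross where -100p + 80(1-p) = 90p - 100(1-p), i.e. p = 180/370 ≈ 0.486.
```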

SLIDE 23

. . . with Microphones

We can sense, with a reward of R(a_S) = −1

We have unreliable sensors:
P(O_L | Z_L) = 0.7,  P(O_L | Z_R) = 0.3
P(O_R | Z_L) = 0.3,  P(O_R | Z_R) = 0.7
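Hearing the tiger changes the belief p by Bayes' rule; the sketch below applies the 0.7/0.3 sensor model from the slide (the Bayes update itself is standard, not spelled out on the slide).

```python
def updated_belief(p, heard_left, p_correct=0.7):
    """Posterior P(tiger left | observation), from prior p and the 0.7-accurate sensor."""
    if heard_left:
        like_left, like_right = p_correct, 1 - p_correct   # P(O_L | Z_L), P(O_L | Z_R)
    else:
        like_left, like_right = 1 - p_correct, p_correct   # P(O_R | Z_L), P(O_R | Z_R)
    return like_left * p / (like_left * p + like_right * (1 - p))

print(updated_belief(0.5, heard_left=True))    # 0.7: one left-ward observation from an even prior
print(updated_belief(0.7, heard_left=True))    # about 0.845 after a second consistent observation
```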

SLIDE 24

Effects of Sensors
