Reinforcement Learning Yingyu Liang yliang@cs.wisc.edu Computer - PowerPoint PPT Presentation

Introduction to Reinforcement Learning Yingyu Liang yliang@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison [Based on slides from David Page, Mark Craven]

Goals for the lecture
You should understand the following concepts:
• the reinforcement learning task
• Markov decision process
• value functions
• value iteration

Reinforcement learning (RL)
Task of an agent embedded in an environment:
repeat forever
1) sense world
2) reason
3) choose an action to perform
4) get feedback (usually reward = 0)
5) learn
The environment may be the physical world or an artificial one.

Example: RL Backgammon Player [Tesauro, CACM 1995]
• world
– 30 pieces, 24 locations
• actions
– roll dice, e.g. 2, 5
– move one piece 2
– move one piece 5
• rewards
– win, lose
• TD-Gammon 0.0
– trained against itself (300,000 games)
– as good as best previous BG computer program (also by Tesauro)
• TD-Gammon 2
– beat human champion

Example: AlphaGo [Nature, 2017]
• world
– 19x19 locations
• actions
– put one stone on some empty location
• rewards
– win, lose
• in 2016, beat world champion Lee Sedol 4-1
• subsequent systems (AlphaGo Master / AlphaGo Zero) outperform humans
• trained by supervised learning + reinforcement learning

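Reinforcement learning
• set of states $S$
• set of actions $A$
• at each time $t$, agent observes state $s_t \in S$, then chooses action $a_t \in A$
• then receives reward $r_t$ and changes to state $s_{t+1}$
[Figure: agent-environment loop; the environment sends state and reward to the agent, and the agent sends an action to the environment]
Trajectory: $s_0 \xrightarrow{a_0,\ r_0} s_1 \xrightarrow{a_1,\ r_1} s_2 \xrightarrow{a_2,\ r_2} \cdots$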

Reinforcement learning as a Markov decision process (MDP)
• Markov assumption:
$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t)$
• also assume reward is Markovian:
$P(r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(r_{t+1} \mid s_t, a_t)$
[Figure: agent-environment loop with the trajectory $s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2$]
Goal: learn a policy $\pi : S \to A$ for choosing actions that maximizes
$E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots]$ where $0 \le \gamma < 1$
for every possible starting state $s_0$
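The discounted sum in the goal can be made concrete with a small calculation. The sketch below is only an illustration for a finite reward sequence; the function name `discounted_return` and the sample rewards are assumptions, not part of the slides.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a finite reward list."""
    total = 0.0
    # Work backwards through the sequence: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: two steps with reward 0, then a reward of 100.
print(discounted_return([0, 0, 100], gamma=0.9))
```

With $\gamma = 0.9$ the reward of 100 received two steps in the future is worth about $0.9^2 \cdot 100 = 81$ now, which is why rewards that arrive sooner are preferred.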

Reinforcement learning task
• Suppose we want to learn a control policy $\pi : S \to A$ that maximizes $E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$ from every state $s \in S$
[Figure: grid world with goal state G; each arrow represents an action $a$ and the associated number (0 or 100) represents the deterministic reward $r(s, a)$]

VALUE FUNCTION

Value function for a policy
• given a policy $\pi : S \to A$, define
$V^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$, assuming the action sequence is chosen according to $\pi$ starting at state $s$
• we want the optimal policy $\pi^*$, where
$\pi^* = \arg\max_{\pi} V^{\pi}(s)$ for all $s$
• we’ll denote the value function for this optimal policy as $V^*(s)$

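Value function for a policy $\pi$
• Suppose $\pi$ is shown by red arrows, $\gamma = 0.9$
[Figure: grid world with goal state G and rewards of 0 or 100 on the arrows; $V^{\pi}(s)$ values are shown in red: 73 and 81 in the top row (left of G), and 66, 90, 100 in the bottom row]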
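The values in the figure can be reproduced by evaluating the policy directly. The sketch below assumes a 2x3 grid reconstructed from the figure (states 0, 1, 2 on the top row with 2 as G, states 3, 4, 5 on the bottom row) and a policy that moves right, down, right, up into G, with the bottom-left state moving up. The layout, the policy encoding and all names are assumptions made for illustration.

```python
GAMMA = 0.9
GOAL = 2  # assumed layout: top row 0, 1, 2 (G); bottom row 3, 4, 5

# Assumed fixed policy, encoded as state -> (next state, immediate reward).
POLICY_STEP = {
    0: (1, 0),    # top-left moves right
    1: (4, 0),    # top-middle moves down
    3: (0, 0),    # bottom-left moves up
    4: (5, 0),    # bottom-middle moves right
    5: (2, 100),  # bottom-right moves up into G
}


def evaluate_policy(policy_step, gamma, goal, sweeps=100):
    """Repeatedly apply V(s) <- r + gamma * V(s') for a deterministic policy."""
    values = {s: 0.0 for s in policy_step}
    values[goal] = 0.0  # no further reward is collected after reaching G
    for _ in range(sweeps):
        for s, (s_next, r) in policy_step.items():
            values[s] = r + gamma * values[s_next]
    return values


V_pi = evaluate_policy(POLICY_STEP, GAMMA, GOAL)
for s in sorted(V_pi):
    print(s, round(V_pi[s]))
```

Under these assumptions the rounded values match the figure: 100 and 90 along the bottom row, 81 for the top-middle state, then roughly 73 and 66 for the two states that take the longest route.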

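Value function for an optimal policy $\pi^*$
• Suppose $\pi^*$ is shown by red arrows, $\gamma = 0.9$
[Figure: grid world with goal state G and rewards of 0 or 100 on the arrows; $V^*(s)$ values are shown in red: 90 and 100 in the top row (left of G), and 81, 90, 100 in the bottom row]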

Using a value function
If we know $V^*(s)$, $r(s_t, a)$, and $P(s_t \mid s_{t-1}, a_{t-1})$, we can compute $\pi^*(s)$:
$\pi^*(s_t) = \arg\max_{a \in A} \left[ r(s_t, a) + \gamma \sum_{s \in S} P(s_{t+1} = s \mid s_t, a)\, V^*(s) \right]$
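A minimal sketch of this computation, assuming the 2x3 grid world from the figures (states 0, 1, 2 on the top row with 2 as G, states 3, 4, 5 on the bottom row) and the $V^*$ values shown there. The dictionary layout `MODEL[s][a] = (reward, {next_state: probability})` and the names `V_STAR` and `greedy_action` are hypothetical choices for illustration.

```python
GAMMA = 0.9

# Assumed model: MODEL[s][a] = (reward, {next_state: probability}).
MODEL = {
    0: {"right": (0, {1: 1.0}), "down": (0, {3: 1.0})},
    1: {"left": (0, {0: 1.0}), "right": (100, {2: 1.0}), "down": (0, {4: 1.0})},
    3: {"up": (0, {0: 1.0}), "right": (0, {4: 1.0})},
    4: {"left": (0, {3: 1.0}), "up": (0, {1: 1.0}), "right": (0, {5: 1.0})},
    5: {"left": (0, {4: 1.0}), "up": (100, {2: 1.0})},
}

# V* values read from the figure; the goal state (2) has value 0.
V_STAR = {0: 90.0, 1: 100.0, 2: 0.0, 3: 81.0, 4: 90.0, 5: 100.0}


def greedy_action(s, model, values, gamma):
    """Return argmax_a [ r(s, a) + gamma * sum_s' P(s' | s, a) * V(s') ]."""
    def backup(a):
        r, dist = model[s][a]
        return r + gamma * sum(p * values[s2] for s2, p in dist.items())
    return max(model[s], key=backup)


policy = {s: greedy_action(s, MODEL, V_STAR, GAMMA) for s in MODEL}
print(policy)
```

Some states have two equally good actions (for example, from the bottom-left state both moves lead to a state with value 90), so the optimal policy is not unique. `max` simply returns the first maximizer it encounters.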

Value iteration for learning $V^*(s)$
initialize $V(s)$ arbitrarily
loop until policy good enough {
    loop for $s \in S$ {
        loop for $a \in A$ {
            $Q(s, a) \leftarrow r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V(s')$
        }
        $V(s) \leftarrow \max_a Q(s, a)$
    }
}
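The loop above can be sketched in Python as follows. The example assumes the 2x3 grid world from the earlier figures (states 0, 1, 2 on the top row with 2 as G, states 3, 4, 5 on the bottom row). It also replaces "until policy good enough" with a stopping rule based on the largest change in $V$ during a sweep. The model layout, the tolerance and all names are assumptions.

```python
GAMMA = 0.9

# Assumed model: MODEL[s][a] = (reward, {next_state: probability}).
MODEL = {
    0: {"right": (0, {1: 1.0}), "down": (0, {3: 1.0})},
    1: {"left": (0, {0: 1.0}), "right": (100, {2: 1.0}), "down": (0, {4: 1.0})},
    3: {"up": (0, {0: 1.0}), "right": (0, {4: 1.0})},
    4: {"left": (0, {3: 1.0}), "up": (0, {1: 1.0}), "right": (0, {5: 1.0})},
    5: {"left": (0, {4: 1.0}), "up": (100, {2: 1.0})},
}


def value_iteration(model, gamma, terminal_states, tol=1e-9):
    """Sweep over all states until no value changes by more than tol."""
    values = {s: 0.0 for s in model}      # arbitrary initialization
    for s in terminal_states:
        values[s] = 0.0                   # terminal states have no actions
    while True:
        delta = 0.0
        for s in model:
            # Q(s, a) <- r(s, a) + gamma * sum_s' P(s' | s, a) * V(s')
            q = {
                a: r + gamma * sum(p * values[s2] for s2, p in dist.items())
                for a, (r, dist) in model[s].items()
            }
            best = max(q.values())        # V(s) <- max_a Q(s, a)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


V = value_iteration(MODEL, GAMMA, terminal_states=[2])
print({s: round(v, 1) for s, v in sorted(V.items())})
```

On this small deterministic grid the values settle after a few sweeps and agree with the $V^*$ values in the figure (90 and 100 on the top row, 81, 90 and 100 on the bottom row).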

Value iteration for learning $V^*(s)$
• $V(s)$ converges to $V^*(s)$
• works even if we randomly traverse environment instead of looping through each state and action methodically
– but we must visit each state infinitely often
• implication: we can do online learning as an agent roams around its environment
• assumes we have a model of the world: i.e. know $P(s_t \mid s_{t-1}, a_{t-1})$
• What if we don’t?

Q-LEARNING

Q functions
define a new function, closely related to $V^*$:
$Q(s, a) = E[r(s, a)] + \gamma\, E_{s' \mid s, a}[V^*(s')]$
if agent knows $Q(s, a)$, it can choose optimal action without knowing $P(s' \mid s, a)$:
$\pi^*(s) = \arg\max_a Q(s, a)$ and $V^*(s) = \max_a Q(s, a)$
and it can learn $Q(s, a)$ without knowing $P(s' \mid s, a)$
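Combining the two relations on this slide gives a recursion in $Q$ alone, which is what makes learning without a model possible. The short derivation below uses only the definitions above; the deterministic special case assumes each action has a single successor $s'$ and a fixed reward $r$.

```latex
\begin{align*}
Q(s, a) &= E[r(s, a)] + \gamma\, E_{s' \mid s, a}\big[ V^*(s') \big] \\
        &= E[r(s, a)] + \gamma\, E_{s' \mid s, a}\big[ \max_{a'} Q(s', a') \big]
           && \text{since } V^*(s') = \max_{a'} Q(s', a') \\
        &= r + \gamma \max_{a'} Q(s', a')
           && \text{in a deterministic world}
\end{align*}
```

The last line no longer mentions $P(s' \mid s, a)$: the agent only needs the reward and successor state it actually observes.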

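Q values
[Figure: three copies of the grid world with goal state G, showing the $r(s, a)$ (immediate reward) values of 0 or 100 on the arrows, the $V^*(s)$ values (90 and 100 in the top row; 81, 90, 100 in the bottom row), and the $Q(s, a)$ values (72, 81, 90 or 100 on each arrow)]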

Q learning for deterministic worlds
for each $s, a$ initialize table entry $\hat{Q}(s, a) \leftarrow 0$
observe current state $s$
do forever
    select an action $a$ and execute it
    receive immediate reward $r$
    observe the new state $s'$
    update table entry $\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$
    $s \leftarrow s'$
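The procedure above can be sketched in Python as follows. The example assumes the 2x3 grid world from the earlier figures (states 0, 1, 2 on the top row with 2 as G, states 3, 4, 5 on the bottom row), picks actions uniformly at random, and restarts from a random state whenever G is reached instead of running forever. The restart scheme, the episode count and all names are assumptions.

```python
import random

GAMMA = 0.9
GOAL = 2  # assumed layout: top row 0, 1, 2 (G); bottom row 3, 4, 5

# Assumed deterministic world: STEP[s][a] = (next state, immediate reward).
STEP = {
    0: {"right": (1, 0), "down": (3, 0)},
    1: {"left": (0, 0), "right": (2, 100), "down": (4, 0)},
    3: {"up": (0, 0), "right": (4, 0)},
    4: {"left": (3, 0), "up": (1, 0), "right": (5, 0)},
    5: {"left": (4, 0), "up": (2, 100)},
}


def q_learning(step, gamma, goal, episodes=2000, seed=0):
    """Tabular Q-learning for a deterministic world with random exploration."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in step for a in step[s]}  # initialize table to 0
    for _ in range(episodes):
        s = rng.choice(list(step))            # observe current state
        while s != goal:
            a = rng.choice(list(step[s]))     # select an action and execute it
            s_next, r = step[s][a]            # receive reward, observe new state
            # The goal has no actions, so its best next value defaults to 0.
            best_next = max(
                (q[(s_next, a2)] for a2 in step.get(s_next, {})), default=0.0
            )
            q[(s, a)] = r + gamma * best_next  # update table entry
            s = s_next
    return q


Q_hat = q_learning(STEP, GAMMA, GOAL)
for (s, a), value in sorted(Q_hat.items()):
    print(s, a, round(value, 1))
```

The agent never consults a transition model when choosing or updating; `STEP` plays the role of the environment and is only queried for the action actually taken.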

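Updating Q
[Figure: the agent takes action $a_{right}$ from state $s_1$ to state $s_2$; the table entry $\hat{Q}(s_1, a_{right})$ changes from 72 to 90; the actions available in $s_2$ have $\hat{Q}$ values 63, 81 and 100]
$\hat{Q}(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a')$
$\leftarrow 0 + 0.9 \max\{63, 81, 100\}$
$\leftarrow 90$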

Q’s vs. V’s
• Which action do we choose when we’re in a given state?
• V’s (model-based)
– need to have a ‘next state’ function to generate all possible states
– choose next state with highest V value
• Q’s (model-free)
– need only know which actions are legal
– generally choose the action with the highest Q value

Exploration vs. Exploitation
• in order to learn about better alternatives, we shouldn’t always follow the current policy (exploitation)
• sometimes, we should select random actions (exploration)
• one way to do this: select actions probabilistically according to:
$P(a_i \mid s) = \dfrac{c^{\hat{Q}(s, a_i)}}{\sum_j c^{\hat{Q}(s, a_j)}}$
where $c > 0$ is a constant that determines how strongly selection favors actions with higher Q values
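The selection rule above can be sketched as a small sampling routine. The function names and the sample Q values are assumptions for illustration. Note that $c = 1$ gives a uniform choice, and values of $c$ larger than 1 shift probability toward actions with higher $\hat{Q}$.

```python
import random


def action_probabilities(q_values, c):
    """Return P(a_i | s) = c^Q(s, a_i) / sum_j c^Q(s, a_j) for one state."""
    weights = {a: c ** q for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def sample_action(q_values, c, rng=random):
    """Draw one action according to the probabilities above."""
    probs = action_probabilities(q_values, c)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


# Hypothetical Q estimates for the actions available in one state.
q_for_state = {"left": 1.0, "up": 2.0}
print(action_probabilities(q_for_state, c=2.0))
print(sample_action(q_for_state, c=2.0, rng=random.Random(0)))
```

Because the weights grow exponentially in $\hat{Q}$, large Q values combined with a large $c$ can overflow floating-point range. Rescaling the Q values before exponentiating is one common remedy, though the slides do not discuss it.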
