Reinforcement Learning Part 1 Yingyu Liang yliang@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison [Based on slides from David Page, Mark Craven]
Goals for the lecture
you should understand the following concepts:
• the reinforcement learning task
• Markov decision process
• value functions
• value iteration
Reinforcement learning (RL)
Task of an agent embedded in an environment:
repeat forever
1) sense world
2) reason
3) choose an action to perform
4) get feedback (usually reward = 0)
5) learn
the environment may be the physical world or an artificial one
Example: RL Backgammon Player [Tesauro, CACM 1995]
• world
– 30 pieces, 24 locations
• actions
– roll dice, e.g. 2, 5
– move one piece 2
– move one piece 5
• rewards
– win, lose
• TD-Gammon 0.0
– trained against itself (300,000 games)
– as good as the best previous backgammon computer program (also by Tesauro)
• TD-Gammon 2
– beat human champion
Example: AlphaGo [Nature, 2017]
• world
– 19×19 board locations
• actions
– put one stone on an empty location
• rewards
– win, lose
• in 2016, beat World Champion Lee Sedol 4-1
• subsequent systems (AlphaGo Master/Zero) surpass human performance
• trained by supervised learning + reinforcement learning
Reinforcement learning
• set of states S
• set of actions A
• at each time t, the agent observes state s_t ∈ S, then chooses action a_t ∈ A
• it then receives reward r_t and the environment changes to state s_{t+1}

[figure: agent–environment loop — the agent sends an action to the environment; the environment returns the new state and reward]

trajectory: s_0 —(a_0, r_0)→ s_1 —(a_1, r_1)→ s_2 —(a_2, r_2)→ ...
Reinforcement learning as a Markov decision process (MDP)
• Markov assumption
  P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ...) = P(s_{t+1} | s_t, a_t)
• also assume reward is Markovian
  P(r_t | s_t, a_t, s_{t-1}, a_{t-1}, ...) = P(r_t | s_t, a_t)

Goal: learn a policy π : S → A for choosing actions that maximizes
  E[r_t + γ r_{t+1} + γ² r_{t+2} + ...]   where 0 ≤ γ < 1
for every possible starting state s_0
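As a concrete illustration of the discounted sum above, the sketch below computes it for one short sampled reward sequence (the reward values are made up for illustration, not taken from the slides):

```python
# Discounted return r_0 + γ·r_1 + γ²·r_2 + ... for one sampled trajectory.
# The reward sequence is illustrative only.
gamma = 0.9
rewards = [0, 0, 100]  # r_0, r_1, r_2: the reward arrives only at the third step

discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_return)  # 0.9² · 100 ≈ 81
```

Because γ < 1, rewards received sooner count for more: the same 100 received at t = 0 would contribute 100 rather than 81.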
Reinforcement learning task
• Suppose we want to learn a control policy π : S → A that maximizes
  E[ Σ_{t=0}^∞ γ^t r_t ]
from every state s ∈ S

[figure: a six-cell grid world with goal state G; each arrow represents an action a and the associated number represents the deterministic reward r(s, a) — 100 for actions entering G, 0 otherwise]
Value function for a policy
• given a policy π : S → A, define
  V^π(s) = E[ Σ_{t=0}^∞ γ^t r_t ]   assuming the action sequence is chosen according to π, starting at state s
• we want the optimal policy π* where
  π* = argmax_π V^π(s)   for all s
we'll denote the value function for this optimal policy as V*(s)
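For a fixed policy, V^π can be computed by repeatedly applying its defining recurrence V(s) ← r + γ·V(s'). A minimal sketch on a made-up two-state deterministic MDP (the states, rewards, and policy here are assumptions for illustration, not the grid world from the slides):

```python
gamma = 0.9

# For each state, the fixed policy π yields (immediate reward, next state).
# Deterministic toy MDP; all values assumed for illustration.
policy_step = {"a": (0, "a"), "b": (10, "b")}

V = {s: 0.0 for s in policy_step}
for _ in range(500):  # iterate V(s) <- r + γ·V(s') until numerically converged
    V = {s: r + gamma * V[s_next] for s, (r, s_next) in policy_step.items()}

print(V)  # V^π("b") approaches 10 / (1 - 0.9) = 100; V^π("a") stays 0
```

State "b" collects reward 10 forever, so its value is the geometric series 10 + 10γ + 10γ² + ... = 10/(1-γ).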
Value function for a policy π
• Suppose π is shown by red arrows, γ = 0.9

[figure: the grid world with V^π(s) values shown in red — top row: 73, 81; bottom row: 66, 90, 100; goal G]
Value function for an optimal policy π*
• Suppose π* is shown by red arrows, γ = 0.9

[figure: the grid world with V*(s) values shown in red — top row: 90, 100; bottom row: 81, 90, 100; goal G]
Using a value function
If we know V*(s), r(s, a), and P(s_{t+1} | s_t, a_t), we can compute π*(s):
  π*(s) = argmax_{a ∈ A} [ r(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) V*(s') ]
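The argmax can be computed directly once V* is known. A sketch on a toy two-state MDP (the transition model, rewards, and the precomputed V* values are all assumptions for illustration):

```python
gamma = 0.9

# r[s][a]: immediate reward; P[s][a]: distribution over next states.
# Toy two-state MDP; all numbers assumed for illustration.
r = {"a": {"stay": 0, "go": 5},  "b": {"stay": 10, "go": 0}}
P = {"a": {"stay": {"a": 1.0}, "go": {"b": 1.0}},
     "b": {"stay": {"b": 1.0}, "go": {"a": 1.0}}}
V_star = {"a": 95.0, "b": 100.0}  # optimal values, precomputed for this toy MDP

def pi_star(s):
    # π*(s) = argmax_a [ r(s,a) + γ Σ_{s'} P(s'|s,a) V*(s') ]
    return max(P[s], key=lambda a: r[s][a] +
               gamma * sum(p * V_star[s2] for s2, p in P[s][a].items()))

print(pi_star("a"), pi_star("b"))  # go stay
```

In state "a" the one-step lookahead prefers "go" (5 + 0.9·100 = 95 beats 0 + 0.9·95 = 85.5), while in state "b" it prefers "stay".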
Value iteration for learning V*(s)

initialize V(s) arbitrarily
loop until policy good enough {
    loop for s ∈ S {
        loop for a ∈ A {
            Q(s, a) ← r(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) V(s')
        }
        V(s) ← max_a Q(s, a)
    }
}
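The loop above can be sketched in Python on the slides' grid world. The state names and exact transition layout are my assumptions from the figure: deterministic moves between adjacent cells, reward 100 on entering G, 0 otherwise, with G absorbing.

```python
GAMMA = 0.9

# Grid world (layout assumed from the figure):  s1 s2 G
#                                               s4 s5 s6
# Deterministic actions move to an adjacent cell; entering G gives reward 100,
# every other move gives 0; the goal G is absorbing with value 0.
moves = {
    "s1": ["s2", "s4"],
    "s2": ["s1", "G", "s5"],
    "s4": ["s1", "s5"],
    "s5": ["s2", "s4", "s6"],
    "s6": ["s5", "G"],
}

V = {s: 0.0 for s in moves}
V["G"] = 0.0
for _ in range(50):  # sweep over states until V stops changing
    for s in moves:
        # Q(s,a) = r(s,a) + γ Σ_{s'} P(s'|s,a) V(s'); transitions are deterministic here
        Q = [(100 if s2 == "G" else 0) + GAMMA * V[s2] for s2 in moves[s]]
        V[s] = max(Q)

print({s: round(v) for s, v in V.items()})
# matches the slide: V*(s1)=90, V*(s2)=100, V*(s4)=81, V*(s5)=90, V*(s6)=100
```

Note how the values fall off geometrically with distance from the goal: 100, then 0.9·100 = 90, then 0.9·90 = 81, exactly the red numbers on the V* slide.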
Value iteration for learning V*(s)
• V(s) converges to V*(s)
• works even if we randomly traverse the environment instead of looping through each state and action methodically
  – but we must visit each state infinitely often
• implication: we can do online learning as an agent roams around its environment
• assumes we have a model of the world: i.e. we know P(s_{t+1} | s_t, a_t)
• What if we don't?