
SLIDE 1

Reinforcement Learning

Slides by Rich Sutton; modifications by Dan Lizotte. Refer to “Reinforcement Learning: An Introduction” by Sutton and Barto, and to Alpaydin, Chapter 16.

Up until now we have been…

  • Supervised Learning
     – Classifying, mostly
     – Also saw some regression
     – Also doing some probabilistic analysis
  • In comes data
  • Then we think for a while
  • Out come predictions
  • Reinforcement learning is in some ways similar, in some ways very different. (Like this font!)

SLIDE 2

Complete Agent

  • Temporally situated
  • Continual learning and planning
  • Objective is to affect the environment
  • Environment is stochastic and uncertain

[Diagram: agent–environment loop; the agent sends an action to the environment, which returns a state and a reward]

What is Reinforcement Learning?

  • An approach to Artificial Intelligence
  • Learning from interaction
  • Goal-oriented learning
  • Learning about, from, and while interacting with an external environment
  • Learning what to do (how to map situations to actions) so as to maximize a numerical reward signal

SLIDE 3

Chapter 1: Introduction

[Diagram: Reinforcement Learning (RL) at the intersection of Psychology, Artificial Intelligence, Control Theory and Operations Research, Artificial Neural Networks, and Neuroscience]

Key Features of RL

  • Learner is not told which actions to take
  • Trial-and-Error search
  • Possibility of delayed reward
     – Sacrifice short-term gains for greater long-term gains
  • The need to explore and exploit
  • Considers the whole problem of a goal-directed agent interacting with an uncertain environment

SLIDE 4

Examples of Reinforcement Learning

  • RoboCup Soccer Teams (Stone & Veloso, Riedmiller et al.)
     – World’s best player of simulated soccer, 1999; runner-up 2000
  • Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis)
     – 10–15% improvement over industry standard methods
  • Dynamic Channel Assignment (Singh & Bertsekas, Nie & Haykin)
     – World’s best assigner of radio channels to mobile telephone calls
  • Elevator Control (Crites & Barto)
     – (Probably) world’s best down-peak elevator controller
  • Many Robots
     – navigation, bipedal walking, grasping, switching between skills, …
  • TD-Gammon and Jellyfish (Tesauro, Dahl)
     – World’s best backgammon player

Supervised Learning

[Diagram: a supervised learning system maps inputs to outputs; training info = desired (target) outputs; error = (target output − actual output)]

SLIDE 5

Reinforcement Learning

[Diagram: an RL system maps inputs to outputs (“actions”); training info = evaluations (“rewards” / “penalties”)]

Objective: get as much reward as possible

Today

  • Give an overview of the whole RL problem…
     – Before we break it up into parts to study individually
  • Introduce the cast of characters
     – Experience (reward)
     – Policies
     – Value functions
     – Models of the environment
  • Tic-Tac-Toe example
SLIDE 6

Elements of RL

  • Policy: what to do
  • Reward: what is good
  • Value: what is good because it predicts reward
  • Model: what follows what

[Diagram: policy, reward, value, and model of environment as the elements of RL]

A Somewhat Less Misleading View…

[Diagram: the RL agent builds its state and reward from external sensations, internal sensations, and memory, and emits actions]

SLIDE 7

An Extended Example: Tic-Tac-Toe

[Diagram: tic-tac-toe game tree from the empty board, alternating x’s moves and o’s moves]

  • Assume an imperfect opponent: he/she sometimes makes mistakes

An RL Approach to Tic-Tac-Toe

  1. Make a table with one entry per state, V(s) = the estimated probability of winning from that state:
     – 0.5 (unknown) initially, 1 for a won position, 0 for a lost or drawn position
  2. Now play lots of games. To pick our moves, look ahead one step from the current state to the various possible next states:
     – Just pick the next state with the highest estimated probability of winning, the largest V(s): a greedy move.
     – But 10% of the time pick a move at random: an exploratory move.

SLIDE 8

RL Learning Rule for Tic-Tac-Toe

Let s denote the state before our greedy move and s′ the state after it. We increment each V(s) toward V(s′), a backup:

    V(s) ← V(s) + α [ V(s′) − V(s) ]

where α is a small positive fraction, e.g., α = 0.1, the step-size parameter.

[Diagram: a sample game shown as a sequence of positions (starting position, a, b, c, d, e, e′, f, g), alternating our moves and the opponent’s moves; one of our moves is marked as an “exploratory” move, and arrows show backups of V toward the value of the state after each greedy move]
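Purely as an illustration (not part of the slides), here is a minimal Python sketch of the value table, the 90%-greedy / 10%-exploratory move selection, and the backup above; the state representation and helper names are assumptions.

```python
import random

ALPHA = 0.1          # step-size parameter (alpha = 0.1, as on the slide)
EXPLORE_PROB = 0.1   # pick a random move 10% of the time

V = {}  # value table: hashable board state -> estimated probability of winning

def value(state):
    """Look up V(state), initializing unseen states to 0.5."""
    return V.setdefault(state, 0.5)

def choose_next_state(candidate_next_states):
    """Greedy in V, but return a random candidate 10% of the time (an exploratory move)."""
    if random.random() < EXPLORE_PROB:
        return random.choice(candidate_next_states)
    return max(candidate_next_states, key=value)

def backup(state_before, state_after):
    """V(s) <- V(s) + alpha [ V(s') - V(s) ]: move V(s) toward V(s')."""
    V[state_before] = value(state_before) + ALPHA * (value(state_after) - value(state_before))
```

Terminal states would be seeded directly (V = 1 for a win, 0 for a loss or draw, per the table above), and backup applied after each greedy move during play.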

How can we improve this T.T.T. player?

  • Take advantage of symmetries
     – representation/generalization
     – How might this backfire?
  • Do we need “random” moves? Why?
     – Do we need the full 10%?
  • Can we learn from “random” moves?
  • Can we learn offline?
     – Pre-training from self-play?
     – Using learned models of the opponent?
  • . . .
SLIDE 9

e.g. Generalization

[Diagram: a table of state values V(s1) … V(sN) contrasted with a generalizing function approximator; “Train here” marks the state being updated]


SLIDE 10

How is Tic-Tac-Toe Too Easy?

  • Finite, small number of states
  • One-step look-ahead is always possible
  • State completely observable
  • . . .

Chapter 2: Evaluative Feedback

  • Evaluating actions vs. instructing by giving correct actions
  • Pure evaluative feedback depends totally on the action taken. Pure instructive feedback depends not at all on the action taken.
  • Supervised learning is instructive; optimization is evaluative
  • Associative vs. Nonassociative:
     – Associative: inputs mapped to outputs; learn the best output for each input
     – Nonassociative: “learn” (find) one best output, ignoring inputs
  • A simpler example: the n-armed bandit (at least how we treat it) is:
     – Nonassociative
     – Evaluative feedback

SLIDE 11

= Pause for Stats =

  • Suppose X is a real-valued random variable
  • Expectation (“Mean”):

        E{X} = lim (n → ∞) of (x1 + x2 + x3 + … + xn) / n

  • Normal Distribution
     – Mean μ
     – Standard Deviation σ
     – Almost all values will fall within μ − 3σ < x < μ + 3σ

The n-Armed Bandit Problem

  • Choose repeatedly from one of n actions; each choice is called a play
  • After each play at, you get a reward rt, whose distribution depends only on at; the expected rewards Q*(a) are the unknown action values
  • Objective is to maximize the reward in the long term, e.g., over 1000 plays

To solve the n-armed bandit problem, you must explore a variety of actions and exploit the best of them.

SLIDE 12

The Exploration/Exploitation Dilemma

  • Suppose you form estimates Qt(a) ≈ Q*(a), the action value estimates
  • The greedy action at t is at* = argmax_a Qt(a)
     – at = at*  →  exploitation
     – at ≠ at*  →  exploration
  • If you need to learn, you can’t exploit all the time; if you need to do well, you can’t explore all the time
  • You can never stop exploring; but you should always reduce exploring. Maybe.

Action-Value Methods

  • Methods that adapt action-value estimates and nothing else. E.g.: suppose that by the t-th play, action a had been chosen ka times, producing rewards r1, r2, …, rka. Then:

        Qt(a) = (r1 + r2 + … + rka) / ka        (“sample average”)

  • As ka → ∞, Qt(a) → Q*(a)

SLIDE 13

ε-Greedy Action Selection

  • Greedy action selection:

        at = at* = argmax_a Qt(a)

  • ε-Greedy:

        at = at*              with probability 1 − ε
             a random action  with probability ε

… the simplest way to balance exploration and exploitation.
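As a small illustrative sketch (not from the slides), ε-greedy selection over a list of action-value estimates might look like this; the function name and the ε = 0.1 default are assumptions.

```python
import random

def epsilon_greedy(Q, epsilon=0.1):
    """Return an action index: random with probability epsilon, greedy in Q otherwise."""
    if random.random() < epsilon:
        return random.randrange(len(Q))                 # explore: random action
    return max(range(len(Q)), key=lambda a: Q[a])       # exploit: argmax_a Q[a]
```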

10-Armed Testbed

  • n = 10 possible actions
  • Each true value Q*(a) is chosen randomly from a normal distribution with mean 0 and variance 1
  • Each reward rt is also normal, with mean Q*(at) and variance 1
  • 1000 plays
  • Repeat the whole thing 2000 times and average the results
  • Use sample averages to estimate Q*(a)
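Putting the pieces together, a single testbed run could be sketched as below (illustrative only; it reuses the hypothetical epsilon_greedy helper above and the incremental form of the sample average introduced two slides later). Averaging over 2000 runs and plotting are omitted.

```python
import random

def run_testbed(n=10, plays=1000, epsilon=0.1):
    """One run of the n-armed testbed with epsilon-greedy selection and sample-average estimates."""
    q_true = [random.gauss(0.0, 1.0) for _ in range(n)]  # true action values Q*(a) ~ N(0, 1)
    Q = [0.0] * n                                        # estimated action values
    counts = [0] * n                                     # times each action has been chosen
    total = 0.0
    for _ in range(plays):
        a = epsilon_greedy(Q, epsilon)
        r = random.gauss(q_true[a], 1.0)                 # reward ~ N(Q*(a), 1)
        counts[a] += 1
        Q[a] += (r - Q[a]) / counts[a]                   # incremental sample average
        total += r
    return total / plays                                 # average reward over the run
```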

SLIDE 14

ε-Greedy Methods on the 10-Armed Testbed

[Figure: learning curves for ε-greedy methods on the 10-armed testbed]

Softmax Action Selection

  • Softmax action selection methods grade action probabilities by estimated values.
  • The most common softmax uses a Gibbs, or Boltzmann, distribution: choose action a on play t with probability

        exp(Qt(a)/τ) / Σb=1..n exp(Qt(b)/τ),

    where τ is the “computational temperature”.
  • Actions with greater value are more likely to be selected.

SLIDE 15

Softmax and ‘Temperature’

Choose action a on play t with probability

    exp(Qt(a)/τ) / Σb=1..n exp(Qt(b)/τ),

where τ is the “computational temperature”. For example, with Q(a1) = 1.0, Q(a2) = 2.0, Q(a3) = -3.0:

    τ = 100.0:  P(a1) = 0.3366, P(a2) = 0.3400, P(a3) = 0.3234
    τ = 10.0:   P(a1) = 0.3603, P(a2) = 0.3982, P(a3) = 0.2415
    τ = 1.0:    P(a1) = 0.2676, P(a2) = 0.7275, P(a3) = 0.0049
    τ = 0.5:    P(a1) = 0.1192, P(a2) = 0.8808, P(a3) < 0.0001
    τ = 0.25:   P(a1) = 0.0180, P(a2) = 0.9820, P(a3) < 0.0001

Small τ is like ‘max.’ Big τ is like ‘uniform.’
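An illustrative sketch of Gibbs/Boltzmann (softmax) selection, not from the slides; the function name and the use of random.choices are assumptions.

```python
import math
import random

def softmax_action(Q, tau=1.0):
    """Choose an action index with probability exp(Q[a]/tau) / sum_b exp(Q[b]/tau)."""
    prefs = [math.exp(q / tau) for q in Q]
    total = sum(prefs)
    return random.choices(range(len(Q)), weights=[p / total for p in prefs], k=1)[0]
```

For example, softmax_action([1.0, 2.0, -3.0], tau=1.0) returns the second action roughly 73% of the time, matching the τ = 1.0 row in the table above.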

Incremental Implementation

Recall the sample average estimation method. The average of the first k rewards is (dropping the dependence on a):

    Qk = (r1 + r2 + … + rk) / k

Can we do this incrementally (without storing all the rewards)? We could keep a running sum and count, or, equivalently:

    Qk+1 = Qk + (1 / (k + 1)) [ rk+1 − Qk ]

This is a common form for update rules:

    NewEstimate = OldEstimate + StepSize [ Target − OldEstimate ]
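A tiny sketch (illustrative, not from the slides) showing that the incremental rule reproduces the ordinary sample average without storing the rewards:

```python
def incremental_average(rewards):
    """Apply Q_{k+1} = Q_k + (1/(k+1)) [ r_{k+1} - Q_k ] over a reward sequence."""
    Q = 0.0
    for k, r in enumerate(rewards):      # k = 0, 1, 2, ...
        Q += (r - Q) / (k + 1)           # after this line, Q equals the mean of rewards[:k+1]
    return Q

assert abs(incremental_average([1.0, 0.0, 2.0]) - 1.0) < 1e-12   # mean of 1, 0, 2 is 1
```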

SLIDE 16

Tracking a Nonstationary Problem

Choosing Qk to be a sample average is appropriate in a stationary problem, i.e., when none of the Q*(a) change over time, but not in a nonstationary problem.

Better in the nonstationary case is:

    Qk+1 = Qk + α [ rk+1 − Qk ]      for constant α, 0 < α ≤ 1,

which gives

    Qk = (1 − α)^k Q0 + Σi=1..k α (1 − α)^(k−i) ri

an exponential, recency-weighted average.

Optimistic Initial Values

  • All methods so far depend on Q0(a), i.e., they are biased.
  • Suppose instead we initialize the action values optimistically, e.g., on the 10-armed testbed, use Q0(a) = 5 for all a.
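As a rough sketch combining the last two slides (constant step size plus optimistic initial values); the closure-based structure is an assumption, while α = 0.1 and Q0 = 5 follow the testbed setting.

```python
def optimistic_constant_alpha_agent(n=10, alpha=0.1, q0=5.0):
    """Return (choose, update) for a purely greedy agent with optimistic initial values."""
    Q = [q0] * n                                   # Q0(a) = 5 for all a: optimistic
    def choose():
        return max(range(n), key=lambda a: Q[a])   # greedy; optimism itself drives early exploration
    def update(a, r):
        Q[a] += alpha * (r - Q[a])                 # Q_{k+1} = Q_k + alpha [ r_{k+1} - Q_k ]
    return choose, update
```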

SLIDE 17

Conclusions

  • These are all very simple methods
     – but they are complicated enough; we will build on them
     – we should understand them completely
  • Ideas for improvements:
     – estimating uncertainties … interval estimation
     – new “action elimination” methods (see ICML’03)
     – approximating Bayes optimal solutions
     – Gittins indices
  • The full RL problem offers some ideas for solution …
     – see work of Duff, e.g., at ICML’03, or Tao Wang

SLIDE 18

Chapter 3: The Reinforcement Learning Problem

Objectives of this chapter:

  • describe the RL problem we will be studying for the remainder of the course;
  • present an idealized form of the RL problem for which we have precise theoretical results;
  • introduce key components of the mathematics: value functions and Bellman equations;
  • describe trade-offs between applicability and mathematical tractability.

The Agent-Environment Interface

Agent and environment interact at discrete time steps: t = 0, 1, 2, …

  • Agent observes state at step t: st ∈ S
  • produces action at step t: at ∈ A(st)
  • gets resulting reward: rt+1
  • and resulting next state: st+1

    … st, at, rt+1, st+1, at+1, rt+2, st+2, at+2, rt+3, st+3, at+3, …

SLIDE 19

The Agent Learns a Policy

Policy at step t, πt: a mapping from states to action probabilities; πt(s, a) = probability that at = a when st = s.

  • Reinforcement learning methods specify how the agent changes its policy as a result of experience.
  • Roughly, the agent’s goal is to get as much reward as it can over the long run.

Getting the Degree of Abstraction Right

  • Time steps need not refer to fixed intervals of real time.
  • Actions can be low level (e.g., voltages to motors), or high level (e.g., accept a job offer), “mental” (e.g., shift in focus of attention), etc.
  • States can be low-level “sensations”, or they can be abstract, symbolic, based on memory, or subjective (e.g., the state of being “surprised” or “lost”).
  • An RL agent is not like a whole animal or robot.
  • Reward computation is in the agent’s environment because the agent cannot change it arbitrarily.
  • The environment is not necessarily unknown to the agent, only incompletely controllable.
SLIDE 20

Goals and Rewards

  • Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible.
  • A goal should specify what we want to achieve, not how we want to achieve it.
  • A goal must be outside the agent’s direct control, thus outside the agent.
  • The agent must be able to measure success:
     – explicitly;
     – frequently during its lifespan.

The reward hypothesis

  • That all of what we mean by goals and purposes can be well thought of as the maximization of the cumulative sum of a received scalar signal (reward)
  • A sort of null hypothesis.
     – Probably ultimately wrong, but so simple we have to disprove it before considering anything more complicated

SLIDE 21

Returns

Suppose the sequence of rewards after step t is: rt+1, rt+2, rt+3, …

What do we want to maximize? In general, we want to maximize the expected return, E{Rt}, for each step t.

Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.

    Rt = rt+1 + rt+2 + … + rT,

where T is a final time step at which a terminal state is reached, ending an episode.

Returns for Continuing Tasks

Continuing tasks: interaction does not have natural episodes. Discounted return:

    Rt = rt+1 + γ rt+2 + γ² rt+3 + … = Σk=0..∞ γ^k rt+k+1,

where γ, 0 ≤ γ ≤ 1, is the discount rate.

    shortsighted: γ → 0        farsighted: γ → 1
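A one-function sketch (illustrative only, not from the slides) of the discounted return over a finite window of future rewards:

```python
def discounted_return(future_rewards, gamma=0.9):
    """R_t = r_{t+1} + gamma r_{t+2} + gamma^2 r_{t+3} + ... for a finite reward list."""
    return sum((gamma ** k) * r for k, r in enumerate(future_rewards))

# A constant reward of +1 forever approaches 1 / (1 - gamma) = 10 for gamma = 0.9:
print(discounted_return([1.0] * 1000, gamma=0.9))   # ~= 10.0
```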

SLIDE 22

An Example

Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.

As an episodic task where the episode ends upon failure:
    reward = +1 for each step before failure
    return = number of steps before failure

As a continuing task with discounted return:
    reward = −1 upon failure; 0 otherwise
    return = −γ^k, for k steps before failure

In either case, return is maximized by avoiding failure for as long as possible.

Another Example

Get to the top of the hill as quickly as possible.

    reward = −1 for each step where not at top of hill
    return = −(number of steps before reaching top of hill)

Return is maximized by minimizing the number of steps to reach the top of the hill.

SLIDE 23

A Unified Notation

  • In episodic tasks, we number the time steps of each episode starting from zero.
  • We usually do not have to distinguish between episodes, so we write st instead of st,j for the state at step t of episode j.
  • Think of each episode as ending in an absorbing state that always produces reward of zero.
  • We can cover all cases by writing

        Rt = Σk=0..∞ γ^k rt+k+1,

    where γ can be 1 only if a zero-reward absorbing state is always reached.

The Markov Property

  • By “the state” at step t, the book means whatever information is available to the agent at step t about its environment.
  • The state can include immediate “sensations,” highly processed sensations, and structures built up over time from sequences of sensations.
  • Ideally, a state should summarize past sensations so as to retain all “essential” information, i.e., it should have the Markov Property:

        Pr{ st+1 = s′, rt+1 = r | st, at, rt, st−1, at−1, …, r1, s0, a0 }
          = Pr{ st+1 = s′, rt+1 = r | st, at }

    for all s′, r, and histories st, at, rt, st−1, at−1, …, r1, s0, a0.

SLIDE 24

Markov Decision Processes

  • If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
  • If state and action sets are finite, it is a finite MDP.
  • To define a finite MDP, you need to give:
     – state and action sets
     – one-step “dynamics” defined by transition probabilities:

           Pss′^a = Pr{ st+1 = s′ | st = s, at = a }          for all s, s′ ∈ S, a ∈ A(s)

     – expected rewards:

           Rss′^a = E{ rt+1 | st = s, at = a, st+1 = s′ }      for all s, s′ ∈ S, a ∈ A(s)

Recycling Robot

An Example Finite MDP

  • At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
  • Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
  • Decisions are made on the basis of the current energy level: high, low.
  • Reward = number of cans collected
SLIDE 25

Recycling Robot MDP

    S = { high, low }
    A(high) = { search, wait }
    A(low) = { search, wait, recharge }

    R^search = expected no. of cans while searching
    R^wait   = expected no. of cans while waiting
    R^search > R^wait
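The slides only name the states, actions, and reward ordering; as an illustration, the finite MDP could be written down as a table of one-step dynamics like the sketch below. The probabilities ALPHA and BETA (chance the battery level is unchanged while searching) and the numeric rewards are placeholders, not values from the slides.

```python
# P[state][action] -> list of (probability, next_state, expected_reward)
ALPHA, BETA = 0.8, 0.6                       # placeholder transition probabilities
R_SEARCH, R_WAIT, R_RESCUE = 2.0, 1.0, -3.0  # placeholder rewards, with R_SEARCH > R_WAIT

P = {
    "high": {
        "search": [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
        "wait":   [(1.0, "high", R_WAIT)],
    },
    "low": {
        "search": [(BETA, "low", R_SEARCH), (1 - BETA, "high", R_RESCUE)],  # ran out of power: rescued
        "wait":   [(1.0, "low", R_WAIT)],
        "recharge": [(1.0, "high", 0.0)],
    },
}
```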

Value Functions

  • The value of a state is the expected return starting from that state; it depends on the agent’s policy. The state-value function for policy π:

        V^π(s) = E_π{ Rt | st = s } = E_π{ Σk=0..∞ γ^k rt+k+1 | st = s }

  • The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π. The action-value function for policy π:

        Q^π(s, a) = E_π{ Rt | st = s, at = a } = E_π{ Σk=0..∞ γ^k rt+k+1 | st = s, at = a }

SLIDE 26

Bellman Equation for a Policy π

The basic idea:

    Rt = rt+1 + γ rt+2 + γ² rt+3 + γ³ rt+4 + …
       = rt+1 + γ ( rt+2 + γ rt+3 + γ² rt+4 + … )
       = rt+1 + γ Rt+1

So:

    V^π(s) = E_π{ Rt | st = s }
           = E_π{ rt+1 + γ V^π(st+1) | st = s }

Or, without the expectation operator:

    V^π(s) = Σa π(s, a) Σs′ Pss′^a [ Rss′^a + γ V^π(s′) ]

More on the Bellman Equation

    V^π(s) = Σa π(s, a) Σs′ Pss′^a [ Rss′^a + γ V^π(s′) ]

  • This is a set of equations (in fact, linear), one for each state. The value function for π is its unique solution.

[Backup diagrams for V^π and Q^π]
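The slides do not give an algorithm here, but since the Bellman equation is a linear system it can also be solved iteratively. Below is a sketch of iterative policy evaluation, assuming the MDP is stored in the hypothetical P[state][action] table format of the recycling-robot sketch above.

```python
def policy_evaluation(P, policy, gamma=0.9, tol=1e-8):
    """Repeatedly apply V(s) <- sum_a pi(s,a) sum_s' P_ss'^a [ R_ss'^a + gamma V(s') ]."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for a, outcomes in P[s].items()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```

For example, policy_evaluation(P, {"high": {"search": 0.5, "wait": 0.5}, "low": {"search": 1/3, "wait": 1/3, "recharge": 1/3}}) would evaluate an equiprobable policy on the recycling-robot sketch.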
SLIDE 27

Gridworld

  • Actions: north, south, east, west; deterministic.
  • If an action would take the agent off the grid: no move, but reward = −1
  • Other actions produce reward = 0, except actions that move the agent out of special states A and B as shown.

[Figure: the gridworld and its state-value function for the equiprobable random policy; γ = 0.9]

Golf

  • State is ball location
  • Reward of −1 for each stroke until the ball is in the hole
  • Value of a state?
  • Actions:
     – putt (use putter)
     – driver (use driver)
  • putt succeeds anywhere on the green

SLIDE 28

Optimal Value Functions

  • For finite MDPs, policies can be partially ordered:

        π ≥ π′ if and only if V^π(s) ≥ V^π′(s) for all s ∈ S

  • There are always one or more policies that are better than or equal to all the others. These are the optimal policies. We denote them all π*.
  • Optimal policies share the same optimal state-value function:

        V*(s) = max_π V^π(s)        for all s ∈ S

  • Optimal policies also share the same optimal action-value function:

        Q*(s, a) = max_π Q^π(s, a)  for all s ∈ S and a ∈ A(s)

    This is the expected return for taking action a in state s and thereafter following an optimal policy.

Optimal Value Function for Golf

  • We can hit the ball farther with driver than with putter, but with less accuracy
  • Q*(s, driver) gives the value of using driver first, then using whichever actions are best

SLIDE 29

Bellman Optimality Equation for V*

The value of a state under an optimal policy must equal the expected return for the best action from that state:

    V*(s) = max_{a ∈ A(s)} Q^π*(s, a)
          = max_{a ∈ A(s)} E{ rt+1 + γ V*(st+1) | st = s, at = a }
          = max_{a ∈ A(s)} Σs′ Pss′^a [ Rss′^a + γ V*(s′) ]

V* is the unique solution of this system of nonlinear equations. [Backup diagram for V*]

Bellman Optimality Equation for Q*

    Q*(s, a) = E{ rt+1 + γ max_{a′} Q*(st+1, a′) | st = s, at = a }
             = Σs′ Pss′^a [ Rss′^a + γ max_{a′} Q*(s′, a′) ]

Q* is the unique solution of this system of nonlinear equations. [Backup diagram for Q*]
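Again as a sketch rather than anything from the slides: applying the V* equation as a repeated update (value iteration) converges for γ < 1, and a greedy policy with respect to the result is optimal. The same hypothetical P[state][action] table format is assumed.

```python
def value_iteration(P, gamma=0.9, tol=1e-8):
    """Repeatedly apply V(s) <- max_a sum_s' P_ss'^a [ R_ss'^a + gamma V(s') ]."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in P[s].values()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    # A policy greedy with respect to V is optimal once V has converged to V*
    policy = {
        s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
        for s in P
    }
    return V, policy
```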

SLIDE 30

Why Optimal State-Value Functions are Useful

  • Any policy that is greedy with respect to V* is an optimal policy.
  • Therefore, given V*, one-step-ahead search produces the long-term optimal actions.

[Figure: back to the gridworld: V* and an optimal policy π*]

What About Optimal Action-Value Functions?

Given Q*, the agent does not even have to do a one-step-ahead search:

    π*(s) = argmax_{a ∈ A(s)} Q*(s, a)

SLIDE 31

Solving the Bellman Optimality Equation

  • Finding an optimal policy by solving the Bellman Optimality Equation requires the following:
     – accurate knowledge of environment dynamics;
     – enough space and time to do the computation;
     – the Markov Property.
  • How much space and time do we need?
     – polynomial in the number of states (via dynamic programming methods; Chapter 4),
     – BUT, the number of states is often huge (e.g., backgammon has about 10^20 states).
  • We usually have to settle for approximations.
  • Many RL methods can be understood as approximately solving the Bellman Optimality Equation.

Summary

  • Agent-environment interaction
     – States
     – Actions
     – Rewards
  • Policy: stochastic rule for selecting actions
  • Return: the function of future rewards the agent tries to maximize
  • Episodic and continuing tasks
  • Markov Property
  • Markov Decision Process
     – Transition probabilities
     – Expected rewards
  • Value functions
     – State-value function for a policy
     – Action-value function for a policy
     – Optimal state-value function
     – Optimal action-value function
  • Optimal value functions
  • Optimal policies
  • Bellman Equations
  • The need for approximation