Introduction to RL Robert Platt Northeastern University (some slides/material borrowed from Rich Sutton)
What is reinforcement learning? RL is learning through trial-and-error without a model of the world. This is different from standard control/planning systems, which:
– require a model of the world, i.e. you need to hand-code the “successor function”
– often require the world to be expressed in a certain way, e.g. symbolic planners assume a symbolic representation; optimal control assumes an algebraic representation
RL doesn’t require any of this:
– RL intuitively resembles natural learning
– RL is harder than planning b/c you don’t get the model
– RL can be less efficient than control/planning b/c of its generality
The RL Setting [diagram: Agent sends Action to World; World returns Observation and Reward]
On a single time step, the agent does the following:
1. observe some information
2. select an action to execute
3. take note of any reward
Goal of agent: select actions that maximize cumulative reward in the long run
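The three-step loop above can be sketched as follows. This is a toy illustration: the `Env` and `Agent` classes here are invented for the example, not from any RL library.

```python
class Env:
    """Toy 1-D world: agent starts at position 0, reward at position 3."""
    def __init__(self):
        self.pos = 0

    def step(self, action):
        # action is -1 (left) or +1 (right); position is clamped to [0, 3]
        self.pos = max(0, min(3, self.pos + action))
        reward = 1 if self.pos == 3 else 0
        return self.pos, reward


class Agent:
    def select_action(self, obs):
        return +1  # trivial policy for illustration: always move right


env, agent = Env(), Agent()
obs, total_reward = 0, 0
for t in range(5):
    action = agent.select_action(obs)   # 2. select an action to execute
    obs, reward = env.step(action)      # 1. observe info, 3. note reward
    total_reward += reward              # goal: maximize cumulative reward
print(total_reward)                     # → 3 (reward on steps 3, 4, 5)
```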
Example: rat in a maze [diagram: Agent ↔ World]
– Action: move left/right/up/down
– Observation: position in maze
– Reward: +1 for getting the cheese
– Goal: maximize cheese eaten
Example: robot makes coffee [diagram: Agent ↔ World]
– Action: move robot joints
– Observation: camera image
– Reward: +1 if coffee in cup
– Goal: maximize coffee produced
Example: agent plays pong [diagram: Agent ↔ World]
– Action: joystick command
– Observation: screen pixels
– Reward: game score
– Goal: maximize game score
Think-Pair-Share Question How would you express the problem of playing online Texas hold ’em as an RL problem? [diagram: Agent ↔ World] Action = ? Observation = ? Reward = ? Goal: ?
RL example Let’s say you want to program the computer to play tic-tac-toe How might you do it?
RL example Let’s say you want to program the computer to play tic-tac-toe How might you do it?
1. search:
– minimax tree search
– plans against an optimal opponent, not the actual opponent
2. evolutionary computation:
– start w/ a population of random policies; have them play each other
– can view this as hillclimbing in policy space w.r.t. a fitness function
RL example Let’s say you want to program the computer to play tic-tac-toe How might you do it?
3. RL:
Value function:
– estimate a value function V(s) over states s (the states are board positions)
– V(s) denotes the expected reward from state s (+1 win, -1 lose, 0 draw)
Game play:
– the agent selects actions that lead to states with high values V(s)
– the agent gradually accumulates experience of the results of executing various actions from different states
But how do we estimate the value function?
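The greedy game-play rule above can be sketched in a few lines. The value table and successor function here are hypothetical stand-ins: in a real agent, V would be learned from experience and the successors would be the board positions reachable by legal moves.

```python
# Hypothetical learned values for three reachable successor states
V = {"s1": 0.2, "s2": 0.9, "s3": -0.5}

def successors(state):
    """Hypothetical successor function: states reachable from `state`."""
    return ["s1", "s2", "s3"]

def greedy_move(state):
    # select the action leading to the state with the highest value V(s)
    return max(successors(state), key=lambda s: V[s])

print(greedy_move("start"))   # → s2, the highest-value successor
```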
RL example: MENACE Donald Michie teaching MENACE to play tic-tac-toe (1960) Can a “machine” composed only of matchboxes learn to play tic-tac-toe?
RL example: MENACE How it works:
Bead initialization:
– first-move boxes: 4 beads per move
– second-move boxes: 3 beads per move
– third-move boxes: 2 beads per move
– fourth-move boxes: 1 bead per move
Gameplay:
– each tic-tac-toe board position corresponds to a matchbox
– at the beginning of play, each matchbox is filled with beads of different colors
– there are nine bead colors: one for each board position
– when it is MENACE’s turn, open the matchbox corresponding to the current board configuration and select a bead at random; make the corresponding move; leave the bead on the table and the matchbox open
Reward:
– play an entire game to its conclusion: win/lose/draw
– if MENACE loses the game, remove the beads from the table and throw them away
– if MENACE draws, return each bead to the box it came from and add ONE extra bead of the same color to each box
– if MENACE wins, return each bead to the box it came from and add THREE extra beads of the same color to each box
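MENACE’s matchbox mechanics amount to a weighted-random policy with a reinforcement update. A minimal sketch, assuming a `boxes` table mapping board states to bead counts per move (the state name and single-state setup are invented for illustration; a full MENACE would have one box per reachable position):

```python
import random

# One first-move box: 4 beads for each of the 9 squares
boxes = {"empty_board": {move: 4 for move in range(9)}}

def pick_move(state):
    """Draw a bead at random, weighted by bead counts (MENACE's policy)."""
    moves, counts = zip(*boxes[state].items())
    return random.choices(moves, weights=counts)[0]

def update(history, outcome):
    """history = [(state, move), ...] of beads left on the table.

    lose: beads are thrown away (net -1); draw: returned plus one
    extra (net +1); win: returned plus three extras (net +3).
    """
    bonus = {"win": 3, "draw": 1, "lose": -1}[outcome]
    for state, move in history:
        boxes[state][move] = max(0, boxes[state][move] + bonus)

move = pick_move("empty_board")
update([("empty_board", move)], "win")
# the winning move's bead count grows from 4 to 7, so it is drawn
# more often in future games
```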
Think-Pair-Share Question Questions: – why did Michie use that particular bead initialization? – why add an extra bead when you get to a draw? – how might this learning algorithm fail? How would you fix it? What tradeoff do you face?
Where does RL live?
Key challenges in RL
– no model of the environment
– the agent only gets a scalar reward signal
– delayed feedback
– need to balance exploration of the world vs. exploitation of learned knowledge
– real-world problems can be non-stationary
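One standard way to handle the exploration/exploitation tradeoff (a general technique, not specific to these slides) is epsilon-greedy action selection: with small probability explore a random action, otherwise exploit the action with the highest estimated value. The Q table below is a made-up example.

```python
import random

# Hypothetical estimated values for four actions
Q = {"left": 0.1, "right": 0.8, "up": 0.3, "down": 0.0}

def epsilon_greedy(Q, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(Q))   # explore: uniformly random action
    return max(Q, key=Q.get)            # exploit: highest-value action

random.seed(0)
counts = {a: 0 for a in Q}
for _ in range(1000):
    counts[epsilon_greedy(Q)] += 1
# "right" dominates, but every action still gets sampled occasionally
```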
Major historical RL successes
• Learned the world’s best player of Backgammon (Tesauro 1995)
• Learned acrobatic helicopter autopilots (Ng, Abbeel, Coates et al 2006+)
• Widely used in the placement and selection of advertisements and pages on the web (e.g., A-B tests)
• Used to make strategic decisions in Jeopardy! (IBM’s Watson 2011)
• Achieved human-level performance on Atari games from pixel-level visual input, in conjunction with deep learning (Google Deepmind 2015)
• In all these cases, performance was better than could be obtained by any other method, and was obtained without human instruction
Example: TD-Gammon
RL + Deep Learning on Atari Games
The singularity