SLIDE 1

CS 4100: Artificial Intelligence

Reinforcement Learning

Jan-Willem van de Meent

Northeastern University

[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Reinforcement Learning

SLIDE 2

Reinforcement Learning

  • Basic idea:
      • Receive feedback in the form of rewards
      • Agent’s utility is defined by the reward function
      • Must (learn to) act so as to maximize expected rewards
      • All learning is based on observed samples of outcomes!

[Diagram: agent-environment loop. The Agent sends actions a to the Environment; the Environment returns a state s and a reward r]

Example: Learning to Walk (RoboCup)

Initial | A Learning Trial | After Learning [1K Trials]

[Kohl and Stone, ICRA 2004]

SLIDE 3

Example: Learning to Walk

Initial (lab-trained)

[Video: AIBO WALK – initial] [Kohl and Stone, ICRA 2004]

Example: Learning to Walk

Training

[Video: AIBO WALK – training] [Kohl and Stone, ICRA 2004]

SLIDE 4

Example: Learning to Walk

Finished

[Video: AIBO WALK – finished] [Kohl and Stone, ICRA 2004]

Example: Sidewinding

[Andrew Ng] [Video: SNAKE – climbStep+sidewinding]

SLIDE 5

Example: Toddler Robot

[Tedrake, Zhang and Seung, 2005] [Video: TODDLER – 40s]

The Crawler!

[Demo: Crawler Bot (L10D1)] [You, in Project 3]

SLIDE 6

Video of Demo Crawler Bot

Reinforcement Learning

  • Still assume a Markov decision process (MDP):
      • A set of states s ∈ S
      • A set of actions (per state) A
      • A model T(s,a,s’)
      • A reward function R(s,a,s’)
  • Still looking for a policy π(s)
  • New twist: we don’t know T or R
      • I.e. we don’t know which states are good or what the actions do
      • Must actually try out actions and states to learn
SLIDE 7

Offline (MDPs) vs. Online (RL)

Offline Solution | Online Learning

Model-Based Learning

SLIDE 8

Model-Based Learning

  • Model-based idea:
      • Learn an approximate model based on experiences
      • Solve for values as if the learned model were correct

  • Step 1: Learn empirical MDP model
      • Count outcomes s’ for each s, a
      • Normalize to give an estimate of T(s,a,s’)
      • Discover each R(s,a,s’) when we experience (s, a, s’)

  • Step 2: Solve the learned MDP
      • For example, use value iteration, as before

Example: Model-Based Learning

Input Policy π (Assume: γ = 1)

[Diagram: gridworld with states A, B, C, D, E]

Observed Episodes (Training):

  Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10

Learned Model:

  T(s,a,s’):
    T(B, east, C) = 1.00
    T(C, east, D) = 0.75
    T(C, east, A) = 0.25
    …

  R(s,a,s’):
    R(B, east, C) = -1
    R(C, east, D) = -1
    R(D, exit, x) = +10
    …
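To make Step 1 concrete, here is a minimal Python sketch (not from the slides) that learns the empirical model from the four episodes above; the tuple encoding of episodes is an assumption chosen for illustration, and it reproduces the learned T and R entries shown.

    # Learn the empirical MDP model from observed episodes (Step 1).
    from collections import Counter, defaultdict

    episodes = [
        [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
        [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
        [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
        [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
    ]

    counts = defaultdict(Counter)  # counts[(s, a)][s'] = times s' was observed
    R = {}                         # R[(s, a, s')] = observed reward

    for episode in episodes:
        for s, a, s2, r in episode:
            counts[(s, a)][s2] += 1
            R[(s, a, s2)] = r

    # Normalize counts to estimate the transition model T(s, a, s').
    T = {(s, a, s2): n / sum(c.values())
         for (s, a), c in counts.items() for s2, n in c.items()}

    print(T[("C", "east", "D")])  # 0.75
    print(T[("C", "east", "A")])  # 0.25
    print(R[("B", "east", "C")])  # -1

Step 2 is then ordinary value iteration on the learned (T, R), exactly as in the MDP lectures.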

SLIDE 9

Example: Expected Age

Goal: Compute expected age of CS 4100 students

Known P(A): E[A] = Σa P(a) · a

Unknown P(A), “Model Based”: collect samples [a1, a2, … aN], estimate P̂(a) = num(a)/N, then E[A] ≈ Σa P̂(a) · a.
Why does this work? Because eventually you learn the right model.

Unknown P(A), “Model Free”: average the samples directly, E[A] ≈ (1/N) Σi ai.
Why does this work? Because samples appear with the right frequencies.
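A minimal sketch (the ages are invented) contrasting the two estimators; on the same samples they agree, which is why both approaches work.

    # Estimate E[A] from samples, with and without building a model of P(A).
    from collections import Counter

    samples = [19, 20, 20, 21, 22, 20, 19, 23]  # hypothetical ages a1..aN
    N = len(samples)

    # "Model based": estimate P(a) = num(a)/N first, then take the expectation.
    P_hat = {a: n / N for a, n in Counter(samples).items()}
    model_based = sum(P_hat[a] * a for a in P_hat)

    # "Model free": average the samples directly, skipping the model.
    model_free = sum(samples) / N

    print(model_based, model_free)  # both 20.5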

Model-Free Learning

SLIDE 10

Passive Reinforcement Learning

  • Simplified task: policy evaluation
      • Input: a fixed policy π(s)
      • You don’t know the transitions T(s,a,s’)
      • You don’t know the rewards R(s,a,s’)
      • Goal: learn the state values V(s)

  • In this case:
      • Learner is “along for the ride”
      • No choice about what actions to take
      • Just execute the policy and learn from experience
      • This is NOT offline planning! You actually take actions in the world.
SLIDE 11

Direct Evaluation

  • Goal: Compute values for each state under π
  • Idea: Average over observed sample values
      • Act according to π
      • Every time you visit a state, write down what the sum of discounted rewards turned out to be
      • Average those samples
  • This is called direct evaluation

Example: Direct Evaluation

Input Policy π (Assume: γ = 1)

[Diagram: gridworld with states A, B, C, D, E]

Observed Episodes (Training):

  Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10

Output Values: A = -10, B = +8, C = +4, D = +10, E = -2
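A minimal sketch (not from the slides) of direct evaluation on these four episodes; it reproduces the output values above.

    # Direct evaluation: average observed returns (sums of discounted rewards).
    from collections import defaultdict

    gamma = 1.0
    episodes = [
        [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
        [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
        [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
        [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
    ]

    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        for s, a, s2, r in reversed(episode):  # accumulate the return backwards
            G = r + gamma * G
            returns[s].append(G)

    V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
    print(V)  # A = -10, B = +8, C = +4, D = +10, E = -2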
SLIDE 12

Problems with Direct Evaluation

  • What’s good about direct evaluation?
      • It’s easy to understand
      • It doesn’t require any knowledge of T, R
      • It eventually computes the correct average values, using just sample transitions

  • What’s bad about it?
      • It wastes information about state connections
      • Each state must be learned separately
      • So, it takes a long time to learn

Output Values: A = -10, B = +8, C = +4, D = +10, E = -2

If B and E both go to C under this policy, how can their values be different?

Why Not Use Policy Evaluation?

  • Simplified Bellman updates calculate V for a fixed policy:
      • Each round, replace V with a one-step look-ahead:

        V_{k+1}^π(s) ← Σ_{s’} T(s, π(s), s’) [ R(s, π(s), s’) + γ V_k^π(s’) ]

      • This approach fully exploited the connections between the states
      • Unfortunately, we need T and R to do it!

  • Key question: how can we do this update to V without knowing T and R?
      • In other words, how do we take a weighted average without knowing the weights?

[Diagram: one-step lookahead tree with nodes s, (s, π(s)), and (s, π(s), s’)]
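For contrast, here is a minimal sketch (not from the slides) of this fixed-policy Bellman update when T and R are known, run on the model learned in the model-based example earlier; the dict encodings of T and R are assumptions for illustration.

    # Exact policy evaluation: requires the full model (T, R).
    def evaluate_policy(states, policy, T, R, gamma=1.0, iterations=100):
        V = {s: 0.0 for s in states}
        for _ in range(iterations):
            V = {s: sum(p * (R[(s, policy[s], s2)] + gamma * V.get(s2, 0.0))
                        for s2, p in T[(s, policy[s])])
                 for s in states}
        return V

    policy = {"A": "exit", "B": "east", "C": "east", "D": "exit", "E": "north"}
    T = {("B", "east"): [("C", 1.0)], ("E", "north"): [("C", 1.0)],
         ("C", "east"): [("D", 0.75), ("A", 0.25)],
         ("D", "exit"): [("x", 1.0)], ("A", "exit"): [("x", 1.0)]}
    R = {("B", "east", "C"): -1, ("E", "north", "C"): -1, ("C", "east", "D"): -1,
         ("C", "east", "A"): -1, ("D", "exit", "x"): +10, ("A", "exit", "x"): -10}

    print(evaluate_policy(list("ABCDE"), policy, T, R))
    # {'A': -10.0, 'B': 3.0, 'C': 4.0, 'D': 10.0, 'E': 3.0}

Note that B and E come out equal here (both inherit their value through C), unlike under direct evaluation.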

SLIDE 13

Sample-Based Policy Evaluation?

  • We want to improve our estimate of V by computing these averages:

    V_{k+1}^π(s) ← Σ_{s’} T(s, π(s), s’) [ R(s, π(s), s’) + γ V_k^π(s’) ]

  • Idea: Take samples of outcomes s’ (by doing the action!) and average:

    sample_1 = R(s, π(s), s_1’) + γ V_k^π(s_1’)
    sample_2 = R(s, π(s), s_2’) + γ V_k^π(s_2’)
    …
    sample_n = R(s, π(s), s_n’) + γ V_k^π(s_n’)

    V_{k+1}^π(s) ← (1/n) Σ_i sample_i

[Diagram: lookahead tree from s through (s, π(s)) to sampled outcomes s1’, s2’, s3’]

Almost! But we can’t rewind time to get sample after sample from state s.

Temporal Difference Learning

SLIDE 14

Temporal Difference Learning

  • Big idea: learn from every experience!
      • Update V(s) each time we experience a transition (s, a, s’, r)
      • Likely outcomes s’ will contribute updates more often

  • Temporal difference learning of values
      • Policy is still fixed, still doing evaluation!
      • Move values toward value of whatever successor occurs: running average

    Sample of V(s):  sample = R(s, π(s), s’) + γ V^π(s’)
    Update to V(s):  V^π(s) ← (1 − α) V^π(s) + α · sample
    Same update:     V^π(s) ← V^π(s) + α · (sample − V^π(s))

Exponential Moving Average

  • Exponential moving average
      • The running interpolation update:

        x̄_n = (1 − α) · x̄_{n−1} + α · x_n

      • Makes recent samples more important:

        x̄_n = [ x_n + (1 − α) · x_{n−1} + (1 − α)² · x_{n−2} + … ] / [ 1 + (1 − α) + (1 − α)² + … ]

      • Forgets about the past (distant past values were wrong anyway)

  • Decreasing learning rate α can give converging averages
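A minimal sketch (not from the slides) of the running interpolation update, showing how recent samples dominate older ones.

    # Exponential moving average: x_bar <- (1 - alpha)·x_bar + alpha·x
    def ema(samples, alpha):
        x_bar = samples[0]
        for x in samples[1:]:
            x_bar = (1 - alpha) * x_bar + alpha * x
        return x_bar

    print(ema([0, 0, 10], alpha=0.5))  # 5.0, the recent 10 dominates
    print(ema([10, 0, 0], alpha=0.5))  # 2.5, the old 10 is mostly forgotten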

SLIDE 15

Example: Temporal Difference Learning

Assume: γ = 1, α = 1/2

Observed Transitions: B, east, C, -2  then  C, east, D, -2

[Diagram: gridworld with states A, B, C, D, E]

  Values before any update:   V(B) = 0, V(C) = 0, V(D) = 8
  After (B, east, C, -2):     V(B) = -1, V(C) = 0, V(D) = 8
  After (C, east, D, -2):     V(B) = -1, V(C) = 3, V(D) = 8
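A minimal sketch (not from the slides) of the TD update, reproducing the two updates in this example.

    # TD value update: V(s) <- (1 - alpha)·V(s) + alpha·(r + gamma·V(s'))
    gamma, alpha = 1.0, 0.5
    V = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 8.0, "E": 0.0}

    def td_update(s, s2, r):
        sample = r + gamma * V[s2]                  # value of what actually happened
        V[s] = (1 - alpha) * V[s] + alpha * sample  # move V(s) toward the sample

    td_update("B", "C", -2); print(V["B"])  # -1.0
    td_update("C", "D", -2); print(V["C"])  # 3.0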

Problems with TD Value Learning

  • TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages
  • However, if we want to turn values into a (new) policy, we’re sunk:

    π(s) = argmax_a Q(s,a)
    Q(s,a) = Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ V(s’) ]

  • Idea: learn Q-values, not values
      • Makes action selection model-free too!

[Diagram: expectimax tree with nodes s, (s, a), and (s, a, s’)]

SLIDE 16

Active Reinforcement Learning

  • Full reinforcement learning: optimal policies (like value iteration)
      • You don’t know the transitions T(s,a,s’)
      • You don’t know the rewards R(s,a,s’)
      • You choose the actions now
      • Goal: learn the optimal policy / values

  • In this case:
      • Learner makes choices!
      • Fundamental tradeoff: exploration vs. exploitation
      • This is NOT offline planning! You actually take actions in the world and find out what happens…

SLIDE 17

Detour: Q-Value Iteration

  • Value iteration: find successive (depth-limited) values
      • Start with V0(s) = 0, which we know is right
      • Given Vk, calculate the depth k+1 values for all states:

        V_{k+1}(s) ← max_a Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ V_k(s’) ]

  • But Q-values are more useful, so compute them instead
      • Start with Q0(s,a) = 0, which we know is right
      • Given Qk, calculate the depth k+1 q-values for all q-states:

        Q_{k+1}(s,a) ← Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ max_{a’} Q_k(s’,a’) ]
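A minimal sketch (not from the slides) of one synchronous round of this update; T maps (s, a) to (s’, prob) pairs and R maps (s, a, s’) to rewards, both encodings assumed for illustration, and both must be known (this is still planning, not learning).

    # One round of Q_{k+1}(s,a) = sum_s' T(s,a,s')·[R(s,a,s') + gamma·max_a' Q_k(s',a')]
    def q_iteration_step(Q, actions, T, R, gamma):
        def max_q(s):
            # Max over actions available at s; 0.0 at terminal states.
            return max((Q[(s, a)] for a in actions.get(s, [])), default=0.0)
        return {(s, a): sum(p * (R[(s, a, s2)] + gamma * max_q(s2))
                            for s2, p in T[(s, a)])
                for (s, a) in T}

    # Usage: start from Q0 = 0, which we know is right, and iterate:
    #   Q = {(s, a): 0.0 for (s, a) in T}
    #   for _ in range(100): Q = q_iteration_step(Q, actions, T, R, gamma=0.9)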

Q-Learning

  • Q-Learning: sample-based Q-value iteration
  • Learn Q(s,a) values as you go
      • Receive a sample (s, a, s’, r)
      • Consider your old estimate: Q(s,a)
      • Consider your new sample estimate:

        sample = r + γ max_{a’} Q(s’,a’)

      • Incorporate the new estimate into a running average:

        Q(s,a) ← (1 − α) Q(s,a) + α · sample

[Demo: Q-learning – gridworld (L10D2)] [Demo: Q-learning – crawler (L10D3)]
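A minimal sketch (not from the slides) of the tabular Q-learning update, with an epsilon-greedy action chooser of the kind commonly paired with it; the exploration scheme and all names here are illustrative assumptions, not something this slide specifies.

    # Q-learning: fold each sample (s, a, s', r) into a running average of Q(s, a).
    import random
    from collections import defaultdict

    Q = defaultdict(float)  # Q[(s, a)], implicitly 0.0 for unseen pairs

    def q_update(s, a, s2, r, next_actions, alpha=0.5, gamma=1.0):
        sample = r + gamma * max((Q[(s2, a2)] for a2 in next_actions), default=0.0)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample

    def epsilon_greedy(s, actions, epsilon=0.1):
        # Mostly exploit current Q-values; sometimes explore at random.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])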

SLIDE 18

Q-Learning -- Gridworld

Q-Learning -- Crawler

SLIDE 19

Q-Learning Properties

  • Amazing result: Q-learning converges to the optimal policy even if you’re acting suboptimally!
  • This is called off-policy learning
  • Caveats:
      • You have to explore enough
      • You have to eventually make the learning rate small enough
      • … but not decrease it too quickly
      • Basically, in the limit, it doesn’t matter how you select actions (!)