

SLIDE 1

CS 4100: Artificial Intelligence

Reinforcement Learning II

Jan-Willem van de Meent – Northeastern University

[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Reinforcement Learning

  • Still assume a Markov decision process (MDP):
    • A set of states s ∈ S
    • A set of actions A (per state)
    • A model T(s,a,s')
    • A reward function R(s,a,s')
  • Still looking for a policy π(s)
  • New twist: we don't know T or R
  • Idea: we don't have T and R, but we do have samples (s, a, r, s')

The Story So Far: MDPs and RL

Known MDP: Offline Solution

  Goal                         Technique
  Compute V*, Q*, π*           Value / policy iteration
  Evaluate a fixed policy π    Policy evaluation

Model-Based: Learn T

  Goal                         Technique
  Compute V*, Q*, π*           VI/PI on approx. MDP
  Evaluate a fixed policy π    PE on approx. MDP

Model-Free: Estimate Q or V

  Goal                         Technique
  Compute V*, Q*, π*           Q-Learning
  Evaluate a fixed policy π    Value Learning

Model-Free Learning

  • Model-free (temporal difference) learning
    • Experience the world through episodes (s, a, r, s', a', r', s'', a'', r'', …)
    • Update estimates after each transition (s, a, r, s')
    • Over time, the updates will mimic Bellman updates

[Diagram: transition chain s → (s, a) → s' → (s', a') → s'' → …, with rewards r along the way]

Q-Learning

  • We'd like to do Q-value updates for all (s, a):
    • But we can't compute this update without knowing T, R
  • Instead, compute the average as we go
    • Receive a sample transition (s, a, r, s')
    • Use this sample to estimate a new Q(s, a)
    • But we want to average over all samples for (s, a)
    • Solution: compute a running average
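
As a concrete illustration of the running average, here is a minimal tabular sketch; the `actions(s)` helper and the α, γ values are assumptions for illustration, not part of the slides:

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.9          # assumed learning rate and discount
Q = defaultdict(float)           # Q[(s, a)] defaults to 0.0

def q_update(s, a, r, s_next, actions):
    """Running-average Q-update from one sample transition (s, a, r, s')."""
    best_next = max((Q[(s_next, a2)] for a2 in actions(s_next)), default=0.0)
    sample = r + gamma * best_next                   # sampled one-step target
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```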

Q-Learning Properties

  • Amazing result: Q-learning converges to the optimal policy even if you're acting suboptimally!
  • This is called off-policy learning
  • Caveats:
    • You have to explore enough
    • You have to eventually make the learning rate small enough
    • … but not decrease it too quickly
    • Basically, in the limit, it doesn't matter how you select actions (!)

Q-Learning for Cliff Grid

Exploration vs. Exploitation

SLIDE 2

How to Explore?

  • Several schemes for forcing exploration
  • Simplest: random actions (ε-greedy)
    • Every time step, flip a coin
    • With (small) probability ε, act randomly
    • With (large) probability 1-ε, act on the current policy
  • Problems with random actions?
    • You do eventually explore the space, but keep thrashing around once learning is done
    • One solution: lower ε over time
    • Another solution: exploration functions
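
A minimal sketch of ε-greedy action selection, assuming a Q table keyed on (s, a) and a hypothetical `actions(s)` helper returning the legal actions:

```python
import random

def epsilon_greedy_action(s, Q, actions, epsilon=0.05):
    """With (small) probability epsilon act randomly; otherwise act greedily
    with respect to the current Q-values (i.e. follow the current policy)."""
    legal = actions(s)
    if random.random() < epsilon:
        return random.choice(legal)                  # explore
    return max(legal, key=lambda a: Q[(s, a)])       # exploit
```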

[Demo: Q-learning – manual exploration – bridge grid (L11D2)] [Demo: Q-learning – epsilon-greedy -- crawler (L11D3)]

Q-learning – Epsilon-Greedy – Crawler

  • When to explore?
    • Random actions: explore a fixed amount
    • Better idea: explore areas whose badness is not (yet) established, eventually stop exploring
  • Exploration function
    • Takes a value estimate u and a visit count n, and returns an optimistic utility (e.g. the form sketched under Exploration Functions below)
    • Note: this propagates the "bonus" back to states that lead to unknown states!

Exploration Functions

Regular Q-Update vs. Modified Q-Update:
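
One common form of these updates (following the Berkeley CS188 convention, where ←α denotes a running-average update with learning rate α, N(s,a) counts visits, and k is a bonus constant) is roughly:

```latex
% Regular Q-update: move Q(s,a) toward the sampled one-step target
\[ Q(s,a) \xleftarrow{\;\alpha\;} R(s,a,s') + \gamma \max_{a'} Q(s',a') \]

% One common exploration function: optimistic utility from estimate u and visit count n
\[ f(u, n) = u + \frac{k}{n} \]

% Modified Q-update: use the optimistic utility for the next state
\[ Q(s,a) \xleftarrow{\;\alpha\;} R(s,a,s') + \gamma \max_{a'} f\bigl(Q(s',a'),\, N(s',a')\bigr) \]
```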

[Demo: exploration – Q-learning – crawler – exploration function (L11D4)]

Q-learning – Exploration Function – Crawler

Regret

  • To learn the optimal policy, you need to make mistakes along the way!
  • Regret measures the cost of your mistakes:
    • Difference between your (expected) rewards, including youthful suboptimality, and the (expected) rewards of an agent that optimally balances exploration and exploitation
  • Minimizing regret is a stronger condition than acting optimally – it also measures whether you have learned optimally
  • Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret

Approximate Q-Learning

Generalizing Across States

  • Basic Q-Learning keeps a table of all q-values
  • In realistic situations, we cannot possibly learn about every single state!
    • Too many states to visit them all in training
    • Too many states to hold the q-tables in memory
  • Instead, we want to generalize:
    • Learn about some small number of training states from experience
    • Generalize that experience to new, similar situations
    • This is a fundamental idea in machine learning, and we'll see it over and over again

[demo – RL pacman]

Q-Learning Pacman – Tiny World

SLIDE 3

Q-Learning Pacman – Tiny – Silent Train

Q-Learning Pacman – Tricky

Example: Pacman

[Demo: Q-learning – pacman – tiny – watch all (L11D5)],[Demo: Q-learning – pacman – tiny – silent train (L11D6)], [Demo: Q-learning – pacman – tricky – watch all (L11D7)]

Let's say we discover through experience that this state is bad. In naïve q-learning, we know nothing about this (similar) state, or even this one!

Feature-Based Representations

  • Idea: describe a state s using a vector of hand-crafted features (properties)
  • Features are functions from states to real numbers (often 0/1) that capture properties of the state
  • Example features:
    • Distance to closest ghost
    • Distance to closest dot
    • Number of ghosts
    • 1 / (distance to closest dot)²
    • Is Pacman in a tunnel? (0/1)
    • … etc.
    • Is it the exact state on this slide?
  • Can also describe a q-state (s, a) with features (e.g. action moves closer to food); a sketch of such a feature function follows
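
A hypothetical q-state feature extractor in the spirit of the list above; the helper functions (`successor`, `closest_food_distance`, `ghosts_one_step_away`, `eats_food`) are assumptions used only for illustration:

```python
def features(state, action):
    """Map a q-state (s, a) to a small dictionary of real-valued features."""
    next_state = successor(state, action)   # hypothetical: where the action leads
    return {
        "bias": 1.0,
        "dist-to-closest-dot": closest_food_distance(next_state) / 10.0,
        "ghosts-one-step-away": ghosts_one_step_away(next_state),
        "eats-food": 1.0 if eats_food(state, action) else 0.0,
    }
```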

Linear Value Functions

  • Using a feature representation, we can define a q-function (or value function) for any state s using a small number of weights w (the linear forms are shown below)
  • Advantage: our experience is summed up in a few powerful numbers
  • Disadvantage: states that share features may have very different values!
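
The linear forms referred to above can be written as (a standard formulation, with n features f_i and weights w_i):

```latex
\[ V(s)   = w_1 f_1(s)   + w_2 f_2(s)   + \dots + w_n f_n(s) \]
\[ Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + \dots + w_n f_n(s,a) \]
```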

Approximate Q-Learning

  • Q-learning with linear Q-functions:
  • Intuitive interpretation:
    • Adjust the weights of active features
    • If something unexpectedly bad happens, blame the features that were 'on'
  • Formal justification: online least squares (a sketch of the weight update follows)
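
A minimal sketch of the linear (approximate) Q-update, with Q(s,a) = Σᵢ wᵢ fᵢ(s,a); the `features(s, a)` and `actions(s)` helpers are assumptions (e.g. the extractor sketched earlier), as are the α, γ defaults:

```python
def approximate_q_update(s, a, r, s_next, weights, features, actions,
                         alpha=0.01, gamma=0.9):
    """Adjust the weights of the features active in (s, a) by the TD difference."""
    def q_value(state, action):
        return sum(weights.get(k, 0.0) * v
                   for k, v in features(state, action).items())

    best_next = max((q_value(s_next, a2) for a2 in actions(s_next)), default=0.0)
    difference = (r + gamma * best_next) - q_value(s, a)   # target minus prediction

    for k, v in features(s, a).items():
        weights[k] = weights.get(k, 0.0) + alpha * difference * v
```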

Exact Q's vs. Approximate Q's

Example: Q-Pacman

[Demo: approximate Q- learning pacman (L11D10)]

Approximate Q-Learning -- Pacman

SLIDE 4

Q-Learning and Least Squares


Linear Approximation: Regression*


Optimization: Least Squares*


[Figure: regression fit showing the error or "residual" between prediction and observation]

Minimizing Error*

Approximate q-update explained: imagine we had only one point x, with features f(x), target value y, and weights w (the derivation is sketched below).
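
A reconstruction of this standard derivation: the squared error for a single point, its gradient with respect to one weight, the resulting update, and finally the same update with the Q-learning target and prediction substituted in:

```latex
% Squared error for a single point x with target y and prediction \sum_k w_k f_k(x)
\[ \mathrm{error}(w) = \tfrac{1}{2}\Bigl(y - \sum_k w_k f_k(x)\Bigr)^{2} \]

% Gradient with respect to one weight w_m
\[ \frac{\partial\, \mathrm{error}(w)}{\partial w_m} = -\Bigl(y - \sum_k w_k f_k(x)\Bigr) f_m(x) \]

% Gradient-descent update
\[ w_m \leftarrow w_m + \alpha \Bigl(y - \sum_k w_k f_k(x)\Bigr) f_m(x) \]

% Substituting the Q-learning target and prediction gives the approximate Q-update
\[ w_m \leftarrow w_m + \alpha \Bigl[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \Bigr] f_m(s,a) \]
```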

[Figure: degree-15 polynomial fit to sample points, illustrating overfitting]

Overfitting: Why Limiting Capacity Can Help*

Policy Search

  • Problem: often the feature-based policies that work well (win games, maximize utilities) aren't the ones that have the best V / Q approximation
    • E.g. your value functions from project 2 were probably horrible estimates of future rewards, but they still produced good decisions
  • Q-learning's priority: get Q-values close (modeling)
  • Action selection priority: get the ordering of Q-values right (prediction)
    • We'll see this distinction between modeling and prediction again later in the course
  • Solution: learn policies that maximize rewards, not the values that predict them
  • Policy search: start with an ok solution (e.g. Q-learning), then fine-tune by hill climbing on the feature weights

Policy Search

  • Simplest policy search:
    • Start with an initial linear value function or Q-function
    • Nudge each feature weight up and down and see if your policy is better than before (a hill-climbing sketch follows this list)
  • Problems:
    • How do we tell the policy got better?
    • Need to run many sample episodes!
    • If there are a lot of features, this can be impractical
  • Better methods exploit lookahead structure, sample wisely, change multiple parameters…
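
A rough sketch of this simplest hill-climbing scheme; `evaluate_policy(weights, episodes)` is an assumed helper that runs sample episodes under the greedy policy for the given weights and returns the average return:

```python
def hill_climb_policy_search(weights, evaluate_policy, step=0.05, episodes=50):
    """Nudge each feature weight up and down, keeping a change only if the
    estimated episode return improves."""
    best_score = evaluate_policy(weights, episodes)
    for name in list(weights):
        for delta in (+step, -step):
            candidate = dict(weights)
            candidate[name] += delta
            score = evaluate_policy(candidate, episodes)
            if score > best_score:
                weights, best_score = candidate, score   # keep the improvement
    return weights, best_score
```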

SLIDE 5

RL: Helicopter Flight

[Andrew Ng] [Video: HELICOPTER]

RL: Learning Locomotion

[Video: GAE]

[Schulman, Moritz, Levine, Jordan, Abbeel, ICLR 2016]

RL: Learning Soccer

[Bansal et al, 2017]

RL: Learning Manipulation

[Levine*, Finn*, Darrell, Abbeel, JMLR 2016]

RL: NASA SUPERball

[Geng*, Zhang*, Bruce*, Caluwaerts, Vespignani, Sunspiral, Abbeel, Levine, ICRA 2017]

RL: In-Hand Manipulation


OpenAI: Dactyl

Trained with domain randomization [OpenAI]

Conclusion

  • We're done with Part I: Search and Planning!
  • We've seen how AI methods can solve problems in:
    • Search
    • Constraint Satisfaction Problems
    • Games
    • Markov Decision Problems
    • Reinforcement Learning
  • Next up, Part II: Uncertainty and Learning!