RL LECTURE 3: LEARNING FROM INTERACTION


SLIDE 1

RL LECTURE 3 LEARNING FROM INTERACTION

– with environment – to achieve some goal

Baby playing. No teacher. Sensorimotor connection to environment.

– Cause – effect
– Action – consequences
– How to achieve goals

Learning to drive car, hold conversation, etc.

– Environment’s response affects our subsequent actions
– We find out the effects of our actions later

SLIDE 2

SIMPLE LEARNING TAXONOMY

Supervised Learning

– “Teacher” provides required response to inputs. Desired behaviour known. “Costly”

Unsupervised Learning

– Learner looks for patterns in inputs. No “right” answer

Reinforcement Learning

– Learner not told which actions to take, but gets reward/punishment from environment and adjusts/learns the action to pick next time.

SLIDE 3

REINFORCEMENT LEARNING

Learning a mapping from situations to actions in order to maximise a scalar reward/reinforcement signal

HOW?

Try out actions to learn which produces highest reward – trial-and-error search

Actions affect
  • immediate reward
  • next situation
  • all subsequent rewards – delayed effects, delayed reward

Situations, Actions, Goals
  • Sense situations, choose actions to achieve goals
  • Environment uncertain

SLIDE 4

EXPLORATION/EXPLOITATION TRADE-OFF

High rewards from trying previously-well-rewarded actions – EXPLOITATION

BUT which actions are best? Must try ones not tried before – EXPLORATION

MUST DO BOTH

Especially if the task is stochastic, try each action many times per situation to get a reliable estimate of its reward. Gradually prefer those actions that prove to lead to high reward.

(Doesn’t arise in supervised learning)

SLIDE 5

EXAMPLES

Animal learning to find food and avoid predators
Robot trying to learn how to dock with charging station
Backgammon player learning to beat opponent
Football team trying to find strategies to score goals
Infant learning to feed itself with spoon
Cornet player learning to produce beautiful sounds
Temperature controller keeping FH warm while minimising fuel consumption

SLIDE 6

FRAMEWORK

[Figure: agent–environment loop – the agent in situation s_t sends action a_t to the environment, which returns reward r_{t+1} and new situation s_{t+1}]

Agent in situation s_t chooses action a_t.

One tick later it is in situation s_{t+1} and gets reward r_{t+1}.

POLICY

π_t(s, a) = Pr(a_t = a | s_t = s)

Given the situation s at time t, the policy gives the probability that the agent’s action will be a.

Reinforcement learning: get/find/learn the policy
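As a concrete illustration (not from the slides; situation and action names are made up), a policy of this kind can be stored as a table of action probabilities per situation and sampled from:

```python
import random

# A stochastic policy: pi[s][a] is the probability of choosing
# action a when in situation s (each row sums to 1).
pi = {
    "corridor": {"turn left": 0.1, "straight on": 0.8, "turn right": 0.1},
    "doorway":  {"go through door": 1.0},
}

def choose_action(pi, situation):
    """Sample an action according to the policy's probabilities."""
    actions = list(pi[situation])
    weights = [pi[situation][a] for a in actions]
    return random.choices(actions, weights=weights)[0]

print(choose_action(pi, "doorway"))  # always "go through door"
```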

SLIDE 7

EXAMPLE POLICIES

Find the coffee machine

[Figure: corridor with cells labelled 1 2 3 4, agent at ‘start’]

A deterministic policy puts probability 1 on one action in each state, e.g.

π(s1, turn left) = 1
π(s2, straight on) = 1
π(s3, turn right) = 1
π(s4, go through door) = 1

etc.

Bandit problem: 10 arms, a Q table gives the Q value for each arm.

ε-greedy policy:

π(a*) = 1 − ε + ε/|A|   where a* = argmax_a Q(a)
π(a) = ε/|A|            for a ≠ a*

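A sketch of the ε-greedy rule for the 10-armed bandit (illustrative only; note that under this formulation the greedy arm is chosen with total probability 1 − ε + ε/|A|, since the random draw can also land on it):

```python
import random

def epsilon_greedy(Q, epsilon):
    """With probability epsilon pick a uniformly random arm (explore),
    otherwise pick the arm with the highest Q value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(Q))
    return max(range(len(Q)), key=lambda a: Q[a])

# 10-armed bandit: Q[a] is the current estimate of arm a's reward.
Q = [0.0] * 10
Q[3] = 1.5  # suppose arm 3 currently looks best
print(epsilon_greedy(Q, epsilon=0.0))  # -> 3 (pure exploitation)
```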
SLIDE 8

JARGON

Policy π(s, a)
– Decision on what action to do in that state

Reward function
– Defines goal, and good and bad experience for learner

Value function
– Predicts reward. Estimate of total future reward

Model of the environment
– Maps states and actions onto states: if in state s_t we take action a_t, the model predicts s_{t+1} (and sometimes reward r_{t+1}).

Not all agents use models. Reward function and environmental model are fixed, external to the agent. Policy, value function and estimate of the model are adjusted during learning.
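As an illustration (names made up), a deterministic model can simply be a lookup table from (state, action) to (next state, reward):

```python
# A deterministic environment model as a lookup table:
# model[(s, a)] -> (next_state, reward).
model = {
    ("corridor", "straight on"): ("doorway", 0.0),
    ("doorway", "go through door"): ("coffee machine", 1.0),
}

def predict(model, s, a):
    """What the model says happens if we take action a in state s."""
    return model[(s, a)]

print(predict(model, "doorway", "go through door"))  # ('coffee machine', 1.0)
```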

8

slide-9
SLIDE 9

VALUE FUNCTIONS

How desirable is it to be in a certain state?

What is its value?

Value Value is (an estimate of) the expected future reward from that state

Value vs. reward

Long-term vs. immediate

Want actions that lead to states of high value, not necessarily high immediate reward

Learn policy via learning value – when we know the values of states we can choose to go to states of high value
  • cf. GA/GP, which discover the policy directly

Genotypical vs. phenotypical learning? (GA/GP vs. RL)

SLIDE 10

GENERAL RL ALGORITHM

1. Initialise learner’s internal state (e.g. Q values, other statistics)

2. Do for a long time:
  • Observe current world state s
  • Choose action a using the policy
  • Execute action a
  • Let r be the immediate reward, s′ the new world state
  • Update internal state based on (s, a, r, s′) and the previous internal state

3. Output a policy based on, e.g., learnt Q values and follow it

We need:
  • Decision on what constitutes an internal state
  • Decision on what constitutes a world state
  • Sensing of a world state
  • Action-choice mechanism (policy), based usually on an evaluation (of current world and internal state) function
  • A means of executing the action
  • A way of updating the internal state
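A minimal sketch of this general loop, using a Q-learning-style update and a made-up two-state world (the toy environment, state/action counts, and learning parameters are all assumptions for illustration, not part of the slides):

```python
import random

def run_rl(env_step, n_states, n_actions, episodes=500,
           alpha=0.1, gamma=0.9, epsilon=0.1):
    """General RL loop: initialise, then repeatedly observe, choose an
    action (epsilon-greedy policy), execute it, and update from (s, a, r, s')."""
    Q = [[0.0] * n_actions for _ in range(n_states)]      # 1. initialise
    for _ in range(episodes):                             # 2. do for a long time
        s, done = 0, False
        while not done:
            if random.random() < epsilon:                 # choose action via policy
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            r, s2, done = env_step(s, a)                  # execute action
            # update internal state based on (s, a, r, s')
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q                                              # 3. output learnt Q values

def toy_step(s, a):
    """Hypothetical world: action 1 reaches the goal (reward 1), action 0 stays put."""
    return (1.0, 1, True) if a == 1 else (0.0, 0, False)

Q = run_rl(toy_step, n_states=2, n_actions=2)
print(Q[0][1] > Q[0][0])  # the goal-reaching action ends up valued higher
```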

SLIDE 11

Environment (simulator?) provides

  • Transitions between world states, i.e. a model
  • A reward function

But of course the learner has to discover what these are while exploring the world.

SLIDE 12

EXAMPLE – O AND X

See Sutton and Barto Section 1.4 and Figure 1.1.

SLIDE 13

EXAMPLE

Construct a player to play against an imperfect opponent

For each board state, set up
– an estimate of the probability of winning from that state
  • three X’s in a row: 1 (we have won)
  • three O’s in a row: 0 (we have lost)
  • rest: 0.5 initially

Play many games

Move selection
  • mostly pick the move leading to the state with the highest value
  • sometimes explore

Value adjustment
  • back up the value of states after non-exploratory moves to the states preceding the moves, e.g.

    V(s_t) ← V(s_t) + α [ V(s_{t+1}) − V(s_t) ]

Reduce α over time
  • converges to probabilities of winning – optimal policy
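The back-up rule can be sketched as follows (state names are made up for illustration; α is the step size):

```python
def td_backup(V, s, s_next, alpha):
    """Move V(s) toward V(s'):  V(s) <- V(s) + alpha * (V(s') - V(s))."""
    V[s] += alpha * (V[s_next] - V[s])

# Win states are worth 1.0; other states start at 0.5.
V = {"mid-game": 0.5, "winning": 1.0}
td_backup(V, "mid-game", "winning", alpha=0.1)
print(V["mid-game"])  # 0.5 + 0.1 * (1.0 - 0.5) = 0.55
```

Repeated backups with a decreasing α move each state's value toward the observed outcomes, i.e. toward the probability of winning from that state.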
