SLIDE 1

ARTIFICIAL INTELLIGENCE

Reinforcement learning

Lecturer: Silja Renooij

Utrecht University, The Netherlands
INFOB2KI 2019-2020

These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

SLIDE 2

Outline

  • Reinforcement learning basics
  • Relation with MDPs
  • Model‐based and model‐free learning
  • Exploitation vs. exploration
  • (Approximate Q‐learning)

SLIDE 3

Reinforcement learning

RL methods are employed to address two related problems: the Prediction Problem and the Control Problem.

  • Prediction: learn the value function for a (fixed) policy and use it to predict the reward of future actions.

  • Control: learn, by interacting with the environment, a policy which maximizes the reward when traveling through state space → obtain an optimal policy which allows for action planning and optimal control.

SLIDE 4

Control learning

Consider learning to choose actions, e.g.

  • Robot learns to dock on battery charger
  • Learn to optimize factory output
  • Learn to play Backgammon

Note several problem characteristics:

  • Delayed reward
  • Opportunity for active exploration
  • Possibility that the state is only partially observable

SLIDE 5

Examples of Reinforcement Learning

  • Robocup Soccer Teams (Stone & Veloso, Riedmiller et al.)

– World’s best player of simulated soccer, 1999; Runner‐up 2000

  • Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis)

– 10‐15% improvement over industry standard methods

  • Dynamic Channel Assignment (Singh & Bertsekas, Nie & Haykin)

– World's best assigner of radio channels to mobile telephone calls

  • Elevator Control (Crites & Barto)

– (Probably) world's best down‐peak elevator controller

  • Many Robots

– navigation, bi‐pedal walking, grasping, switching between skills...

  • Games: TD‐Gammon, Jellyfish (Tesauro, Dahl), AlphaGo (DeepMind)

– World's best backgammon & Go players

(AlphaGo: https://www.youtube.com/watch?v=SUbqykXVx0A)

SLIDE 6

Key Features of RL

  • Agent learns by interacting with environment
  • Agent learns from the consequences of its actions, rather than from being explicitly taught, by receiving a reinforcement signal
  • Because of chance, agent has to try things repeatedly
  • Agent makes mistakes, even if it learns intelligently (regret)
  • Agent selects its actions based on its past experiences (exploitation) and also on new choices (exploration) → trial-and-error learning
  • Possibly sacrifices short‐term gains for larger long‐term gains

SLIDE 7

Reinforcement vs Supervised Learning

[Diagram: a learning system receives input x from the environment, produces output (based on) h(x), and receives training info.]

The general learning task: learn a model or function h that approximates the true function f, from a training set. Training info is of the following form:

  • (x, ~f(x)) for supervised learning
  • (x, reinforcement signal from environment) for reinforcement learning

SLIDE 8

Reinforcement Learning: idea

  • Basic idea:

– Receive feedback in the form of rewards
– Agent's return in the long run is defined by the reward function
– Must (learn to) act so as to maximize expected return
– All learning is based on observed samples of outcomes!

[Diagram: the agent sends actions a to the environment; the environment returns state s and reward r.]

SLIDE 9

The Agent-Environment Interface

[Diagram: agent and environment in a loop; the agent emits action a_t, the environment returns reward r_{t+1} and state s_{t+1}, yielding the trajectory ... s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} → r_{t+2}, s_{t+2}, a_{t+2} → r_{t+3}, s_{t+3}, a_{t+3} → ...]

Agent:

  • Interacts with environment at discrete time steps t = 0, 1, 2, ...
  • Observes state at step t: s_t ∈ S
  • Produces action at step t: a_t ∈ A(s_t)
  • Gets resulting reward: r_{t+1}
  • And resulting next state: s_{t+1}
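
As a minimal sketch, this interaction loop might look as follows in Python (the reset/step/act/observe interfaces on `env` and `agent` are hypothetical stand-ins, not from any particular library):

```python
# Hypothetical agent-environment loop; `env` and `agent` interfaces are assumed.

def run_episode(env, agent, max_steps=100):
    s = env.reset()                      # observe initial state s_0
    for t in range(max_steps):
        a = agent.act(s)                 # produce action a_t in A(s_t)
        r, s_next, done = env.step(a)    # get reward r_{t+1}, next state s_{t+1}
        agent.observe(s, a, r, s_next)   # learn from the experience tuple
        s = s_next
        if done:
            break
```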

SLIDE 10

Degree of Abstraction

  • Time steps: need not be fixed intervals of real time.
  • Actions:

– low level (e.g., voltages to motors)
– high level (e.g., accept a job offer)
– “mental” (e.g., shift in focus of attention), etc.

  • States:

– low‐level “sensations”
– abstract, symbolic, based on memory, ...
– subjective (e.g., the state of being “surprised” or “lost”)

  • Reward computation: in the agent’s environment (because the agent cannot change it arbitrarily)
  • The environment is not necessarily unknown to the agent, only incompletely controllable.

SLIDE 11

RL as MDP

The best studied case is when RL can be formulated as a (finite) Markov Decision Process (MDP), i.e. we assume:

  • A (finite) set of states s ∈ S
  • A set of actions (per state) A
  • A model T(s,a,s’)
  • A reward function R(s,a,s’)
  • Markov assumption
  • Still looking for a policy π(s)
  • New twist: we don’t know T or R!

– I.e. we don’t know which states are good or what the actions do
– Must actually try actions and states out to learn

SLIDE 12

An Example: Recycling robot

  • At each step, robot has to decide whether it should

– (1) actively search for a can,
– (2) wait for someone to bring it a can, or
– (3) go to home base and recharge.

  • Searching is better but runs down the battery; if it runs out of power while searching, it has to be rescued (which is bad).

  • Actions are chosen based on current energy level (states):

high, low.

  • Reward = number of cans collected

SLIDE 13

Recycling Robot MDP

[Transition diagram: under search, the high state stays high with probability α (reward R^search) and drops to low with probability 1−α; the low state stays low with probability β (reward R^search) and runs out of power with probability 1−β (reward −3, robot rescued to high); wait keeps the current state with probability 1 (reward R^wait); recharge moves low to high with probability 1 (reward 0).]

S = {high, low}
A(high) = {search, wait}
A(low) = {search, wait, recharge}

R^search = expected no. of cans while searching
R^wait = expected no. of cans while waiting
R^search > R^wait
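
As a concrete sketch, these dynamics could be encoded as transition/reward tables in Python, assuming the diagram as reconstructed above; alpha, beta, R_search and R_wait are the free parameters:

```python
# Recycling-robot MDP as a table: T[(s, a)] = [(probability, next_state, reward)].
# Dynamics follow the diagram above; -3 is the penalty for running out of power.

def recycling_mdp(alpha, beta, R_search, R_wait):
    return {
        ('high', 'search'):   [(alpha, 'high', R_search), (1 - alpha, 'low', R_search)],
        ('high', 'wait'):     [(1.0, 'high', R_wait)],
        ('low',  'search'):   [(beta, 'low', R_search), (1 - beta, 'high', -3.0)],
        ('low',  'wait'):     [(1.0, 'low', R_wait)],
        ('low',  'recharge'): [(1.0, 'high', 0.0)],
    }
```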

SLIDE 14

MDPs and RL

Known MDP: offline solution, no learning

Goal                Technique
Compute Vπ          Policy evaluation
Compute V*, π*      Value / policy iteration

Unknown MDP: Model‐Based

Goal                Technique
Compute V*, π*      VI/PI on approximated MDP

Unknown MDP: Model‐Free

Goal                Technique
Compute Vπ          Direct evaluation, TD‐learning
Compute Q*, π*      Q‐learning

SLIDE 15

Model-Based Learning

  • Model‐Based Idea:

– Learn an approximate model based on experiences
– Solve for values, as if the learned model were correct

  • Step 1: Learn empirical MDP model (sketched below)

– Count outcomes s’ for each s, a
– Normalize to give an estimate of T̂(s,a,s’)
– Discover each R̂(s,a,s’) when we experience (s, a, s’)

  • Step 2: Solve the learned MDP

– For example, use value iteration, as before
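
A minimal sketch of Step 1 in Python, assuming experience arrives as (s, a, r, s') tuples; record and T_hat are illustrative names:

```python
from collections import Counter, defaultdict

counts = defaultdict(Counter)   # counts[(s, a)][s'] = observed transition counts
rewards = {}                    # rewards[(s, a, s')] = observed reward

def record(s, a, r, s_next):
    counts[(s, a)][s_next] += 1       # count outcomes s' for each s, a
    rewards[(s, a, s_next)] = r       # discover R(s, a, s') as we experience it

def T_hat(s, a, s_next):
    total = sum(counts[(s, a)].values())
    return counts[(s, a)][s_next] / total if total else 0.0   # normalize
```

Step 2 then hands T_hat and rewards to value iteration exactly as for a known MDP.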

SLIDE 16

Model-Free Learning

  • Model‐Free idea:

– Directly learn (approximate) state values, based on experiences

  • Methods (among others):

I. Direct evaluation (passive: use fixed policy)
II. Temporal difference learning (passive: use fixed policy)
III. Q‐learning (active: ‘off‐policy’)

Remember: this is NOT offline planning! You actually take actions in the world.

SLIDE 17

I: Direct Evaluation

  • Goal: Compute V(s) under given π
  • Idea: Average ‘reward to go’ over visits

1. First act according to π for several episodes/epochs, recording experience tuples <s, π(s), r_t, s’>
2. Afterwards, for every state s and every time t that s is visited: determine the rewards r_t ... r_⊤ subsequently received in that epoch
3. Sample for s at time t = sum of discounted future rewards:

   sample_t(s) = Σ_{t'=t..⊤} γ^(t'−t) · r_t'

4. Average the samples over all visits of s

Note: this is the simplest Monte Carlo method (a sketch follows below)
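
A sketch of direct evaluation in Python, assuming each episode is given as a list of (state, action, reward) steps; with γ = 1 it reproduces the output values of the example on the next slides:

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=1.0):
    """episodes: list of episodes, each a list of (s, a, r) steps under policy pi."""
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        for s, a, r in reversed(episode):   # accumulate reward-to-go backwards
            G = r + gamma * G
            returns[s].append(G)            # one sample per visit of s
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```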

SLIDE 18

Example: Direct Evaluation

Input: Policy π. Assume: γ = 1

[Gridworld figure: states A, B, C, D, E; B and E lead into C, which leads to the exits at D (+10) and A (−10).]

Observed Episodes (Training):

Episode 1: B, east, −1, C; C, east, −1, D; D, exit, +10, (end)
Episode 2: B, east, −1, C; C, east, −1, D; D, exit, +10, (end)
Episode 3: E, north, −1, C; C, east, −1, D; D, exit, +10, (end)
Episode 4: E, north, −1, C; C, east, −1, A; A, exit, −10, (end)

Output Values: A = −10, B = +8, C = +4, D = +10, E = −2

SLIDE 19

Example: Direct Evaluation

Assume: γ = 1

Observed Episodes (Training): as on the previous slide; step t_i‐j denotes the j-th transition of episode i.

Sample computations:

A: sample_t4‐3 = −10
B: sample_t1‐1 = −1 − 1 + 10 = 8; sample_t2‐1 = −1 − 1 + 10 = 8
C: sample_t1‐2 = −1 + 10 = 9; sample_t2‐2 = −1 + 10 = 9; sample_t3‐2 = −1 + 10 = 9; sample_t4‐2 = −1 − 10 = −11
D: sample_t1‐3 = 10; sample_t2‐3 = 10; sample_t3‐3 = 10
E: sample_t3‐1 = −1 − 1 + 10 = 8; sample_t4‐1 = −1 − 1 − 10 = −12

SLIDE 20

Properties of Direct Evaluation

  • Benefits:

– easy to understand
– doesn’t require any knowledge of T, R
– eventually computes the correct average values, using just sample transitions

  • Drawbacks:

– wastes information about state connections
– each state must be learned separately → takes a long time to learn

Output Values (from previous slide): A = −10, B = +8, C = +4, D = +10, E = −2

If B and E both go to C under this policy, how can their values be different?

SLIDE 21

II: Temporal Difference Learning

  • Goal: Compute V(s) under given π
  • Big idea: update after every experience!

→ Likely outcomes will contribute updates more often

  • Temporal difference learning of values:

1. Initialize each V(s) with some value
2. Observe experience tuple <s, π(s), r, s’>
3. Use observation in rough estimate of long‐term reward V(s):

   sample = r + γ · V(s’)

4. Update V(s) by moving values slightly towards estimate:

   V(s) ← V(s) + α · (sample − V(s))

where 0 ≤ α ≤ 1 is the learning rate.
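
A sketch of this update in Python; with the initialisation used in the worked example on the following slides (V(D) = 8, all other values 0, α = 1/2, γ = 1), td_update(V, 'B', -2, 'C') yields V(B) = −1, after which td_update(V, 'C', -2, 'D') yields V(C) = 3:

```python
def td_update(V, s, r, s_next, alpha=0.5, gamma=1.0):
    # One TD update per observed experience tuple <s, pi(s), r, s'>.
    sample = r + gamma * V[s_next]           # rough estimate of long-term reward
    V[s] = V[s] + alpha * (sample - V[s])    # move V(s) slightly towards the sample
```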

SLIDE 22

Example: TD‐Learning

Assume: γ = 1, α = 1/2. Input: Policy π

[Same gridworld as in the direct-evaluation example.]

Each V(s) can be initialised with an arbitrary value. The reward function is unknown; but perhaps we do know that we receive a reward of 8 after ending up in D… this can be exploited:

Initial values: V(D) = 8, all other V(s) = 0
SLIDE 23

Example: TD‐Learning

Assume: γ = 1, α = 1/2. Input: Policy π

Experience <s, π(s), r, s’>: B, east, −2, C

V(s) ← V(s) + α · (sample − V(s)) = (1 − α) · V(s) + α · sample, with sample = r + γ · V(s’)

sample(B) = −2 + γ · 0 = −2
Update: V(B) ← (1 − α) · 0 + α · sample(B) = −1

Values after update: V(B) = −1, V(D) = 8, all others 0

SLIDE 24

Example: TD‐Learning

Assume: γ = 1, α = 1/2. Input: Policy π

Experience <s, π(s), r, s’>: C, east, −2, D

sample(C) = −2 + γ · 8 = 6
Update: V(C) ← (1 − α) · 0 + α · sample(C) = 3

Values after update: V(B) = −1, V(C) = 3, V(D) = 8, all others 0

SLIDE 25

Properties of TD Value Learning

  • Benefits:

– Model free
– ≈ Bellman updates: connections between states used
– Updates upon each action

  • Drawback:

– Values are learnt per policy
→ Good for policy evaluation
→ Long way from establishing optimal policy (note that the same holds for Direct evaluation)

SLIDE 26

Golf example: how ‘valuable’ is a state?

[Figure: contours of the state-value function V_putt over the course; values run from −1 on the green down to −6 farthest from the hole, and −∞ in the sand.]

  • State is ball location
  • Reward of −1 for each stroke until the ball is in the hole
  • Actions:

– putt (use putter)
– driver (use driver)

  • putt succeeds anywhere on the green
  • Value of a state??

SLIDE 27

Optimal quantities revisited

  • State s has value V*(s):

V*(s) = expected reward starting in s and acting optimally

  • q‐state (s,a) has value Q*(s,a):

Q*(s,a) = expected reward having taken action a from state s and (thereafter) acting optimally

  • The optimal policy:

π*(s) = optimal action from state s

[Diagram: from state s, action a leads to q‐state (s,a), which transitions to state s’.]

SLIDE 28

Bellman equation revisited

Recall the Bellman equation for the optimal value function:

V*(s) = max_a Σ_{s’} T(s,a,s’) · [R(s,a,s’) + γ · V*(s’)] = max_a Q*(s,a)

The optimal policy now directly (no look‐ahead) follows with argmax:

π*(s) = argmax_a Q*(s,a)

Now, since also V*(s’) = max_{a’} Q*(s’,a’), we have that

Q*(s,a) = Σ_{s’} T(s,a,s’) · [R(s,a,s’) + γ · max_{a’} Q*(s’,a’)]
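
When T and R are known, this Q-form of the Bellman equation can be solved by iterating it as an update rule (Q-value iteration). A sketch under assumed states/actions/transitions interfaces:

```python
# Q-value iteration sketch; transitions(s, a) -> [(prob, s_next, reward)] is assumed,
# and every state is assumed to have at least one action.

def q_value_iteration(states, actions, transitions, gamma=0.9, iterations=100):
    Q = {(s, a): 0.0 for s in states for a in actions(s)}
    for _ in range(iterations):
        V = {s: max(Q[(s, a)] for a in actions(s)) for s in states}  # V(s) = max_a Q(s, a)
        Q = {(s, a): sum(p * (r + gamma * V[s2])
                         for p, s2, r in transitions(s, a))
             for s in states for a in actions(s)}
    return Q
```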

SLIDE 29

Gridworld: V and Q values

Noise = 0.2, Discount γ = 0.9, Living reward R(s) = 0. Optimal policy?

SLIDE 30

III: Q-Learning

  • Idea: do Q‐value updates to each q‐state (like VI):

– But: can’t compute this update without knowing T, R

  • Instead: incorporate estimates as we go (like TD):

1. Initialize Q(s,a) = 0 for each s,a pair
2. Select action a and observe experience <s, a, r, s’>
3. Use observation in rough estimate of Q(s,a):

   sample(s,a) = r + γ · max_{a’} Q(s’,a’)

4. Update Q(s,a) by moving values slightly towards estimate:

   Q(s,a) ← (1 − α) · Q(s,a) + α · sample(s,a), equivalently Q(s,a) ← Q(s,a) + α · (sample(s,a) − Q(s,a))
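
A sketch of steps 1-4 as tabular Q-learning in Python; the actions(s) helper is an assumption:

```python
from collections import defaultdict

Q = defaultdict(float)   # step 1: Q(s, a) = 0 for each s, a pair

def q_update(s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    # step 3: rough estimate from the observed experience <s, a, r, s'>
    sample = r + gamma * max((Q[(s_next, a2)] for a2 in actions(s_next)),
                             default=0.0)    # 0 when s' is terminal
    # step 4: move Q(s, a) slightly towards the estimate
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```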

SLIDE 31

Optimal Q-Function for Golf

[Figure: contours of Q*(s, driver) over the course, with values −1 to −3.]

  • We can hit the ball farther with driver than with putter, but with less accuracy
  • Q*(s,driver) gives the value of using driver first, then using whichever actions are best

SLIDE 32

Updating Q-values: example

Current Q‐values indicated; experience <s1, a_right, 0, s2>. Assume γ = 0.9, α = 1.

sample(s1, a_right) = r + γ · max_{a’} Q(s2, a’) = 0 + 0.9 · max{63, 81, 100} = 90

Q(s1, a_right) ← (1 − α) · Q(s1, a_right) + α · sample(s1, a_right) = (1 − α) · 72 + 1 · 90 = 90

SLIDE 33
Q-Learning Properties I

  • Q‐learning is off‐policy learning
  • If rewards ≥ 0 then Q‐values ≥ 0 and non‐decreasing with each update
  • If each (s,a) pair is visited infinitely often, the process converges to the true (optimal) Q

→ Amazing result: Q‐learning converges to the optimal policy ‐‐ even if you’re acting suboptimally! …Basically, in the limit, it doesn’t matter how you select actions (!)

SLIDE 34

Q-Learning Properties II

Caveats:

– You have to explore enough
– You have to eventually make the learning rate α small enough
– … but not decrease it too quickly

SLIDE 35

Exploration vs. Exploitation

Multi‐armed bandit: each machine provides a random reward from a distribution specific to that machine. Which machine should you play, and how many times?

SLIDE 36

Exploration vs Exploitation

  • The policy indicates the exploration strategy: which action to take in which state
  • Standard Q‐learning uses Q‐values associated with the best action: pure exploitation, using what it already knows
  • We can add randomness for true exploration: sometimes try to learn something new by picking a random action (e.g. ε‐greedy, sketched below)
  • The exploration‐exploitation trade‐off is highly influenced by context: online or offline?
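
A sketch of ε-greedy action selection over a Q-table; all names are illustrative:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # With probability epsilon explore a random action, else exploit the best one.
    if random.random() < epsilon:
        return random.choice(actions)                 # exploration
    return max(actions, key=lambda a: Q[(s, a)])      # exploitation
```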

SLIDE 37

Q-learning to crawl

SLIDE 38

Approximate Q-Learning

SLIDE 39

Generalizing Across States

  • Basic Q‐Learning keeps a table of all q‐values
  • In realistic situations, we cannot possibly learn about every single state!

– Too many states to visit them all in training
– Too many states to hold the q‐tables in memory

  • Instead, we want to generalize:

– Learn about some small number of training states from experience
– Generalize that experience to new, similar situations
– This is a fundamental idea in machine learning!

SLIDE 40

Example: Pacman

Let’s say we discover through experience that this state is bad. In naïve Q‐learning, we know nothing about this state. Or even this one!

[Three similar Pacman screenshots.]

SLIDE 41

Feature-Based Representations

  • Solution: describe a state using a vector of features (properties)

– Features are functions from states to real numbers (often 0/1) that capture important properties of the state
– Example features:

  • Distance to closest ghost
  • Distance to closest dot
  • Number of ghosts
  • 1 / (dist to dot)²
  • Is Pacman in a tunnel? (0/1)
  • … etc.
  • Is it the exact state on this slide?

– Can also describe a q‐state (s, a) with features (e.g. action moves closer to food)

SLIDE 42

Linear Value Functions

  • Using a feature representation, we can write a Q or V function for any state using a few weights:

V(s) = w1·f1(s) + w2·f2(s) + … + wn·fn(s)
Q(s,a) = w1·f1(s,a) + w2·f2(s,a) + … + wn·fn(s,a)

  • Advantage: our experience is summed up in a few powerful numbers
  • Disadvantage: states may share features but actually be very different in value!
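
A minimal sketch of such a linear Q-function, assuming a features(s, a) helper that returns the list of feature values:

```python
def q_value(weights, features, s, a):
    # Q(s, a) = w1*f1(s, a) + ... + wn*fn(s, a)
    return sum(w * f for w, f in zip(weights, features(s, a)))
```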

SLIDE 43

Approximate Q-Learning

  • In Q‐learning, use the difference between current Q(s,a) and the new sample to update the weights of active features:

difference = [r + γ · max_{a’} Q(s’,a’)] − Q(s,a), for sample transition <s, a, r, s’>

before (exact Q): Q(s,a) ← Q(s,a) + α · difference
now (approximate Q with w updates): w_i ← w_i + α · difference · f_i(s,a)

  • Intuitive interpretation: if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state’s features
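
A sketch of this weight update, under the same assumed features(s, a) helper as before; on each transition, the weight of every active feature is nudged by α · difference · f_i(s, a):

```python
def approx_q_update(weights, features, s, a, r, s_next, actions,
                    alpha=0.05, gamma=0.9):
    def q(st, ac):
        return sum(w * f for w, f in zip(weights, features(st, ac)))
    sample = r + gamma * max((q(s_next, a2) for a2 in actions(s_next)),
                             default=0.0)     # 0 when s' is terminal
    difference = sample - q(s, a)
    for i, f_i in enumerate(features(s, a)):
        weights[i] += alpha * difference * f_i   # blame/credit active features
```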

SLIDE 44

Example: Q-Pacman

(no noise)

SLIDE 45

Subtleties and ongoing research

  • Replace approximate‐Q table with neural net or other generalizer
  • Handle case where state is only partially observable
  • Design optimal exploration strategies
  • Extend to continuous action, state

SLIDE 46

Summary

  • Reinforcement learning: learn from experience, not from a ‘teacher’
  • Reinforcement learning problem can be cast as MDP with unknown T and R
  • Model‐based RL: estimate R and T from experience
  • Model‐free RL: estimate V(s) or Q(s,a) from experience
  • The latter can be done actively, using Q‐learning, which gives an optimal policy
  • Large state‐spaces: use approximate Q‐learning with domain‐specific features