

SLIDE 1

Introduction to Reinforcement Learning

Yingyu Liang (yliang@cs.wisc.edu)
Computer Sciences Department, University of Wisconsin, Madison

[Based on slides from David Page, Mark Craven]

SLIDE 2

Goals for the lecture

you should understand the following concepts

  • the reinforcement learning task
  • Markov decision process
  • value functions
  • value iteration


SLIDE 3

Reinforcement learning (RL)

Task of an agent embedded in an environment:

repeat forever:
  1) sense world
  2) reason
  3) choose an action to perform
  4) get feedback (usually reward = 0)
  5) learn

the environment may be the physical world or an artificial one


SLIDE 4
Example: RL Backgammon Player

[Tesauro, CACM 1995]

  • world

– 30 pieces, 24 locations

  • actions

– roll dice, e.g. 2, 5
– move one piece 2 positions
– move one piece 5 positions

  • rewards

– win, lose

  • TD-Gammon 0.0

– trained against itself (300,000 games)
– as good as the best previous backgammon computer program (also by Tesauro)

  • TD-Gammon 2

– beat human champion

SLIDE 5
Example: AlphaGo

[Nature, 2017]

  • world

– 19x19 locations

  • actions

– put one stone on some empty location

  • rewards

– win, lose

  • in 2016, AlphaGo beat world champion Lee Sedol 4-1
  • subsequent systems (AlphaGo Master/Zero) show performance superior to humans
  • trained by supervised learning + reinforcement learning

SLIDE 6

Reinforcement learning

[Diagram: agent-environment interaction loop; the agent observes states s0, s1, s2, ..., chooses actions a0, a1, a2, ..., and receives rewards r0, r1, r2, ...]

  • set of states S
  • set of actions A
  • at each time t, agent observes state st ∈ S, then chooses action at ∈ A
  • then receives reward rt and changes to state st+1
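
A minimal Python sketch of this interaction loop (the env and agent objects and their reset/step/choose_action/learn methods are hypothetical placeholders, not anything defined in these slides):

def run_episode(env, agent, max_steps=100):
    s = env.reset()                    # observe initial state s0 in S
    for t in range(max_steps):
        a = agent.choose_action(s)     # choose action at in A
        s_next, r = env.step(a)        # receive reward rt and next state st+1
        agent.learn(s, a, r, s_next)   # use the feedback to learn
        s = s_next
    return agent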


SLIDE 7

Reinforcement learning as a Markov decision process (MDP)

[Diagram: the same agent-environment interaction loop as on the previous slide, with states st, actions at, and rewards rt]

  • Markov assumption:

P(st+1 | st, at, st-1, at-1, ...) = P(st+1 | st, at)

  • also assume reward is Markovian:

P(rt | st, at, st-1, at-1, ...) = P(rt | st, at)

Goal: learn a policy π : S → A for choosing actions that maximizes

E[rt + γ rt+1 + γ² rt+2 + ...], where 0 ≤ γ < 1

for every possible starting state s0
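
As a concrete illustration of the discounted sum (an example added here, not on the original slide): with γ = 0.9 and a reward sequence rt = 0, rt+1 = 0, rt+2 = 100, the quantity being maximized is 0 + 0.9·0 + 0.9²·100 = 81.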

SLIDE 8

Reinforcement learning task

  • Suppose we want to learn a control policy π : S → A that maximizes

E[ Σt≥0 γ^t rt ]

from every state s ∈ S

[Grid-world diagram with goal state G; two arrows entering G are labeled 100]

each arrow represents an action a and the associated number represents the deterministic reward r(s, a)

SLIDE 9

VALUE FUNCTION

SLIDE 10

Value function for a policy

  • given a policy π : S → A define

Vπ(s) ≡ E[ Σt≥0 γ^t rt ]

assuming the action sequence is chosen according to π, starting at state s

  • we want the optimal policy π* where

π* = argmaxπ Vπ(s) for all s

we’ll denote the value function for this optimal policy as V*(s)

SLIDE 11

Value function for a policy π

  • Suppose π is shown by red arrows, γ = 0.9

[Grid-world diagram with goal state G and rewards of 100 on the arrows entering G; the policy π is shown by red arrows; Vπ(s) values are shown in red: 100, 90, 81, 73, 66]
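
These values are consistent with γ = 0.9 discounting (a check added here, not spelled out on the slide): the state whose red arrow enters G receives the immediate reward of 100, so Vπ = 100; each state one step further back receives 0 immediately plus 0.9 times its successor’s value, giving 0.9·100 = 90, 0.9·90 = 81, then roughly 73 and 66.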

SLIDE 12

Value function for an optimal policy π*

  • Suppose π* is shown by red arrows, γ = 0.9

[Grid-world diagram with goal state G and rewards of 100 on the arrows entering G; the optimal policy π* is shown by red arrows; V*(s) values are shown in red: 100, 90, 100, 90, 81]

SLIDE 13

Using a value function

If we know V*(s), r(st, a), and P(st | st-1, at-1) we can compute π*(s):

π*(st) = argmaxa [ r(st, a) + γ Σs∈S P(st+1 = s | st, a) V*(s) ]

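A small Python sketch of this argmax computation (the dictionary formats for r, P, and V_star below are assumptions made for illustration, not structures given in the slides):

def optimal_action(s, actions, r, P, V_star, gamma=0.9):
    # pi*(s) = argmax_a [ r(s, a) + gamma * sum_s' P(s'|s, a) * V*(s') ]
    # assumed formats:
    #   r[(s, a)]  : deterministic immediate reward r(s, a)
    #   P[(s, a)]  : dict mapping next state s' -> P(s' | s, a)
    #   V_star[s]  : optimal value V*(s)
    def backup(a):
        expected_future = sum(p * V_star[s2] for s2, p in P[(s, a)].items())
        return r[(s, a)] + gamma * expected_future
    return max(actions, key=backup)
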
SLIDE 14

Value iteration for learning V*(s)

initialize V(s) arbitrarily
loop until policy good enough {
    loop for s ∈ S {
        loop for a ∈ A {
            Q(s, a) ← r(s, a) + γ Σs’∈S P(s’ | s, a) V(s’)
        }
        V(s) ← maxa Q(s, a)
    }
}
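
A runnable Python sketch of this loop, with a numeric convergence threshold standing in for "policy good enough" (the threshold and the dictionary formats for r and P are assumptions made for illustration):

def value_iteration(states, actions, r, P, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}         # initialize V(s) arbitrarily (here: 0)
    while True:                          # loop until policy good enough
        delta = 0.0
        for s in states:                 # loop for s in S
            # Q(s, a) <- r(s, a) + gamma * sum_s' P(s'|s, a) * V(s')
            q_values = [
                r[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions         # loop for a in A
            ]
            new_v = max(q_values)        # V(s) <- max_a Q(s, a)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:                # stop once V has (approximately) converged
            return V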

SLIDE 15

Value iteration for learning V*(s)

  • V(s) converges to V*(s)
  • works even if we randomly traverse the environment instead of looping through each state and action methodically

– but we must visit each state infinitely often

  • implication: we can do online learning as an agent roams around its environment
  • assumes we have a model of the world: i.e. we know P(st | st-1, at-1)
  • What if we don’t?


SLIDE 16

Q-LEARNING

SLIDE 17

Q functions

define a new function, closely related to V*:

Q(s, a) ≡ E[r(s, a)] + γ Es’|s,a[ V*(s’) ]

so that

V*(s) = maxa Q(s, a)

π*(s) = argmaxa Q(s, a)

if the agent knows Q(s, a), it can choose the optimal action without knowing P(s’ | s, a), and it can learn Q(s, a) without knowing P(s’ | s, a)

SLIDE 18

Q values

[Three grid-world diagrams for the same example:
– r(s, a) (immediate reward) values: the arrows entering G are labeled 100
– Q(s, a) values: 100, 90, 100, 90, 81, 81, 72, 81, 81, 72, 90, 81
– V*(s) values: 100, 90, 100, 90, 81]

SLIDE 19

Q learning for deterministic worlds

for each s, a initialize table entry Q̂(s, a) ← 0

observe current state s

do forever
    select an action a and execute it
    receive immediate reward r
    observe the new state s’
    update table entry Q̂(s, a) ← r + γ maxa’ Q̂(s’, a’)
    s ← s’

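A Python sketch of this table-based algorithm; the env object (with reset/step methods) and the purely random action selection are assumptions made here for illustration, since the slide leaves action selection unspecified:

import random
from collections import defaultdict

def q_learning_deterministic(env, actions, gamma=0.9, n_steps=10000):
    Q = defaultdict(float)           # table entries Q_hat(s, a), initialized to 0
    s = env.reset()                  # observe current state s
    for _ in range(n_steps):         # "do forever" (truncated here)
        a = random.choice(actions)   # select an action a and execute it
        s_next, r = env.step(a)      # receive immediate reward r, observe new state s'
        # update table entry: Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a')
        Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        s = s_next                   # s <- s'
    return Q
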
SLIDE 20

Updating Q

[Diagram: the agent takes action aright, moving from state s1 to state s2; the Q̂ values for the actions available from s2 are 63, 81, and 100, and the entry Q̂(s1, aright) changes from 72 to 90]

Q̂(s1, aright) ← r + γ maxa’ Q̂(s2, a’) = 0 + 0.9 · max{63, 81, 100} = 90

SLIDE 21

Q’s vs. V’s

  • Which action do we choose when we’re in a given state?
  • V’s (model-based)

– need to have a ‘next state’ function to generate all possible next states
– choose the next state with the highest V value

  • Q’s (model-free)

– need only know which actions are legal
– generally choose the action with the highest Q value

SLIDE 22

Exploration vs. Exploitation

  • in order to learn about better alternatives, we shouldn’t always follow the current policy (exploitation)
  • sometimes, we should select random actions (exploration)
  • one way to do this: select actions probabilistically according to

P(ai | s) = c^Q̂(s, ai) / Σj c^Q̂(s, aj)

where c > 0 is a constant that determines how strongly selection favors actions with higher Q values
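
A Python sketch of this selection rule (the Q̂ table format and the particular value of c are assumptions made for illustration):

import random

def select_action(s, actions, Q_hat, c=2.0):
    # P(a_i | s) = c**Q_hat(s, a_i) / sum_j c**Q_hat(s, a_j)
    # larger c favors exploitation of high-valued actions; c close to 1 approaches uniform random exploration
    weights = [c ** Q_hat.get((s, a), 0.0) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]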