RL Overview of topics About Reinforcement Learning The - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Introduction to Reinforcement Learning

RL

slide-2
SLIDE 2

Overview of topics

  • About Reinforcement Learning
  • The Reinforcement Learning Problem
  • Inside an RL agent
  • Temporal difference learning
slide-3
SLIDE 3

Many faces of Reinforcement Learning

slide-4
SLIDE 4

Reinforcement Learning 4

What is Reinforcement Learning?

  • Learning from interaction
  • Goal-oriented learning
  • Learning about, from, and while interacting with an external environment
  • Learning what to do (how to map situations to actions) so as to maximize a numerical reward signal

slide-5
SLIDE 5

Branches of AI

Machine Learning

  • Supervised Learning
  • Unsupervised Learning
  • Reinforcement Learning

slide-6
SLIDE 6

Supervised Learning

Inputs → Supervised Learning System → Outputs

Training Info = desired (target) outputs

Error = (target output – actual output)

slide-7
SLIDE 7

Reinforcement Learning

Inputs → RL System → Outputs (“actions”)

Training Info = evaluations (“rewards” / “penalties”)

Objective: get as much reward as possible

slide-8
SLIDE 8

Recipe for creative behavior: explore & exploit

  • Creativity: finding a new approach / solution / …

– Exploration (random / systematic / …)
– Evaluation (utility = expected rewards)
– Selection (ongoing behavior and learning)

slide-9
SLIDE 9

Coli bacteria and creativity

  • Escherichia coli searches for food using trial and error:

– Choose a random direction by tumbling and then start swimming straight
– Evaluate progress
– Continue longer or cancel earlier depending on progress

http://biology.about.com/library/weekly/aa081299.htm
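
The tumble, swim, evaluate loop above can be sketched in a few lines of Python. This is only an illustrative toy, not a biological model: the 2-D world, the food position, the step size and the two tumble probabilities are all assumed values.

```python
import math
import random

def run_and_tumble(food=(10.0, 10.0), steps=2000, step_size=0.1, seed=0):
    """Toy 2-D trial-and-error search (all parameters are hypothetical)."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    heading = rng.uniform(0.0, 2.0 * math.pi)  # tumble: pick a random direction
    last = closest = math.dist((x, y), food)
    for _ in range(steps):
        # Swim straight along the current heading.
        x += step_size * math.cos(heading)
        y += step_size * math.sin(heading)
        # Evaluate progress: did the distance to the food shrink?
        now = math.dist((x, y), food)
        improving = now < last
        last = now
        closest = min(closest, now)
        # Continue longer when improving, cancel earlier (tumble) when not.
        tumble_prob = 0.05 if improving else 0.5
        if rng.random() < tumble_prob:
            heading = rng.uniform(0.0, 2.0 * math.pi)
    return last, closest

final_distance, closest_distance = run_and_tumble()
```

No direction is ever computed from the food position; the only feedback is whether the last step helped, which is what makes this trial and error.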

slide-10
SLIDE 10

Zebra finch: from singing in the shower to performing artist

  • 1. A newborn zebra finch can’t sing
  • 2. The baby bird listens to father’s song
  • 3. The baby starts to “babble”, with father’s song as a target template
  • 4. The song develops through trial and error – “singing in the shower”
  • 5. No exploration when singing to a female

http://www.brain.riken.jp/bsi-news/bsinews34/no34/speciale.html

slide-11
SLIDE 11

Zebra finch: from singing in the shower to performing artist

  • http://www.youtube.com/watch?v=Md6bsvkauPg
slide-12
SLIDE 12

Key Features of RL

  • Learner is not told which actions to take
  • Trial-and-Error search
  • Possibility of delayed reward (sacrifice short-term gains for greater long-term gains)
  • The need to explore and exploit
  • Considers the whole problem of a goal-directed agent interacting with an uncertain environment

slide-13
SLIDE 13

Complete Agent

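  • Temporally situated
  • Continual learning and planning
  • Object is to affect the environment
  • Environment is stochastic and uncertain

[Figure: agent and environment in a loop. The agent sends an action to the environment; the environment returns a state and a reward.]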
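
The agent and environment loop can be sketched as below. The class names, the coin-guessing task and the `act` / `learn` / `step` method names are all hypothetical, chosen only to make the flow of action, state and reward visible.

```python
import random

class CoinGuessEnv:
    """Hypothetical toy environment: the agent guesses a biased coin flip."""
    def __init__(self, p_heads=0.7, seed=0):
        self.rng = random.Random(seed)
        self.p_heads = p_heads
        self.state = 0  # last coin outcome (0 = tails, 1 = heads)

    def step(self, action):
        """Apply the action and return (next_state, reward)."""
        outcome = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == outcome else 0.0
        self.state = outcome
        return outcome, reward

class RandomAgent:
    """Placeholder agent that acts at random and does not learn."""
    def __init__(self, seed=1):
        self.rng = random.Random(seed)

    def act(self, state):
        return self.rng.choice([0, 1])

    def learn(self, state, action, reward, next_state):
        pass  # a learning agent would update its policy or values here

def interact(env, agent, steps=100):
    """Run the loop: state -> action -> (next state, reward) -> learn."""
    state, total_reward = env.state, 0.0
    for _ in range(steps):
        action = agent.act(state)
        next_state, reward = env.step(action)
        agent.learn(state, action, reward, next_state)
        total_reward += reward
        state = next_state
    return total_reward

total = interact(CoinGuessEnv(), RandomAgent(), steps=100)
```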

slide-14
SLIDE 14

Elements of RL

  • Policy: what to do
  • Reward: what is good
  • Value: what is good because it predicts reward
  • Model: what follows what

[Figure: Policy, Reward, Value, Model of environment]

slide-15
SLIDE 15

An Extended Example: Tic-Tac-Toe

[Figure: tic-tac-toe game tree, with levels alternating between x’s move and o’s move]

  • Assume an imperfect opponent: he/she sometimes makes mistakes

slide-16
SLIDE 16

An RL Approach to Tic-Tac-Toe

  • 1. Make a table with one entry per state:

State              V(s) – estimated probability of winning
[board diagram]    .5   ?
[board diagram]    .5   ?
. . .              . . .
[board diagram]    1    win
. . .              . . .
[board diagram]    0    loss
. . .              . . .
[board diagram]    0    draw

  • 2. Now play lots of games. To pick our moves, look ahead one step:

[Figure: the current state and the various possible next states, one of them marked *]

Just pick the next state with the highest estimated prob. of winning, the largest V(s); a greedy move. But 10% of the time pick a move at random; an exploratory move.
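
A minimal sketch of this move selection, assuming states are hashable keys in a dict `V` and that unseen states default to 0.5 as in the table. The function name and signature are hypothetical.

```python
import random

def choose_next_state(next_states, V, epsilon=0.1, rng=random):
    """Pick the next state; return (state, was_exploratory).

    With probability epsilon pick at random (exploratory move),
    otherwise pick the state with the largest V(s) (greedy move).
    """
    if rng.random() < epsilon:
        return rng.choice(next_states), True
    # Unseen states get the assumed initial estimate of 0.5.
    best = max(next_states, key=lambda s: V.get(s, 0.5))
    return best, False

# Usage with made-up state labels:
V = {"a": 0.2, "b": 0.9}
state, exploratory = choose_next_state(["a", "b", "c"], V, epsilon=0.1)
```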

slide-17
SLIDE 17

RL Learning Rule for Tic-Tac-Toe

[Figure: a sequence of game states with greedy moves and one “exploratory” move]

s: the state before our greedy move
s′: the state after our greedy move

We increment each V(s) toward V(s′), a backup:

$$V(s) \leftarrow V(s) + \alpha \left[ V(s') - V(s) \right]$$

α: the step-size parameter, a small positive fraction, e.g., α = 0.1
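
The backup can be written as a one-line update on a value table. A minimal sketch, assuming a dict `V` keyed by state and a default estimate of 0.5 for unseen states; the function name is hypothetical.

```python
def backup(V, s, s_next, alpha=0.1, default=0.5):
    """Move V(s) a fraction alpha of the way toward V(s_next)."""
    v_s = V.get(s, default)
    v_next = V.get(s_next, default)
    V[s] = v_s + alpha * (v_next - v_s)
    return V[s]

# Usage: a state that led to a win (value 1) is nudged upward.
V = {"before": 0.5, "win": 1.0}
backup(V, "before", "win")  # V["before"] becomes 0.5 + 0.1 * (1.0 - 0.5)
```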

slide-18
SLIDE 18

How can we improve this T.T.T. player?

  • Take advantage of symmetries

– representation/generalization

  • Do we need “random” moves? Why?

– Do we always need a full 10%?

  • Can we learn from “random” moves?
slide-19
SLIDE 19

Temporal difference learning

  • Solution to temporal credit assignment problem
  • Replace the reward signal by the change in expected future reward

– Prediction moves the rewards from the future as close to the actions as possible
– Primary reward such as sugar replaced with secondary (or higher order) rewards such as money
– In the brain, dopamine ≈ temporal difference signal
– Supervised learning is used for channelling information in predictive stimuli to learning
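
One common way to write the “change in expected future reward” is the temporal difference error. The notation below (reward r, discount factor γ, value estimate V, step size α, time index t) is standard textbook notation assumed here, not taken from the slide:

```latex
\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t),
\qquad
V(s_t) \leftarrow V(s_t) + \alpha \, \delta_t
```

With no intermediate reward (r = 0) and no discounting (γ = 1), this reduces to the tic-tac-toe rule, where V(s) is moved toward V(s′).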

slide-20
SLIDE 20

Reinforcement learning example

[Figure: maze with states Start, S2, S3, S4, S5, Goal, S7, S8. Arrows indicate strength between two problem states.]

Start maze …

slide-21
SLIDE 21

Start S2 S3 S4 S5 Goal S7 S8

The first response leads to S2 …
The next state is chosen by randomly sampling from the possible next states, weighted by their associative strength.
Associative strength = line width
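
Sampling weighted by associative strength can be sketched with `random.choices` from the Python standard library. The dict layout and the strength values are assumptions for illustration.

```python
import random

def sample_next_state(strengths, rng=random):
    """Sample a next state with probability proportional to its strength.

    `strengths` maps each possible next state to a non-negative number
    (at least one must be positive).
    """
    states = list(strengths)
    weights = [strengths[s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]

# Usage with made-up strengths for the choices available at S3:
next_state = sample_next_state({"S2": 1.0, "S4": 1.0, "S7": 1.0})
```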

slide-22
SLIDE 22

Start S2 S3 S4 S5 Goal S7 S8

Suppose the randomly sampled response leads to S3 …

slide-23
SLIDE 23

Start S2 S3 S4 S5 Goal S7 S8

At S3, choices lead to either S2, S4, or S7. S7 was picked (randomly)

slide-24
SLIDE 24

Start S2 S3 S4 S5 Goal S7 S8

By chance, S3 was picked next…

slide-25
SLIDE 25

Start S2 S3 S4 S5 Goal S7 S8

Next response is S4

slide-26
SLIDE 26

Start S2 S3 S4 S5 Goal S7 S8

And S5 was chosen next (randomly)

slide-27
SLIDE 27

Start S2 S3 S4 S5 Goal S7 S8

And the goal is reached …

slide-28
SLIDE 28

Start S2 S3 S4 S5 Goal S7 S8

Goal is reached: strengthen the associative connection between goal state and last response.
Next time S5 is reached, part of the associative strength is passed back to S4...

slide-29
SLIDE 29

Start S2 S3 S4 S5 Goal S7 S8

Start maze again…

slide-30
SLIDE 30

Start S2 S3 S4 S5 Goal S7 S8

Let’s suppose after a couple of moves, we end up at S5 again

slide-31
SLIDE 31

Start S2 S3 S4 S5 Goal S7 S8

S5 is likely to lead to GOAL through the strengthened route.
In reinforcement learning, strength is also passed back to the last state.
This paves the way for the next time going through the maze.

slide-32
SLIDE 32

Start S2 S3 S4 S5 Goal S7 S8

The situation after lots of restarts …
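
The whole maze walk-through can be simulated in a short sketch. Several things here are assumptions: the slides name the states but only partly reveal the connections, so the `MAZE` edges are hypothetical, and the update is a Q-learning-style rule picked for illustration, since the slides do not give an exact formula.

```python
import random

# Hypothetical maze topology (assumed, not given in full by the slides).
MAZE = {
    "Start": ["S2"],
    "S2": ["Start", "S3"],
    "S3": ["S2", "S4", "S7"],
    "S4": ["S3", "S5"],
    "S5": ["S4", "Goal"],
    "S7": ["S3", "S8"],
    "S8": ["S7"],
    "Goal": [],
}

def train(episodes=200, alpha=0.5, gamma=0.9, max_steps=1000, seed=0):
    """Restart the maze many times and return the learned strengths."""
    rng = random.Random(seed)
    # Every connection starts with the same small associative strength.
    strength = {s: {n: 0.1 for n in nbrs} for s, nbrs in MAZE.items()}
    for _ in range(episodes):
        state = "Start"
        for _ in range(max_steps):
            if state == "Goal":
                break
            options = MAZE[state]
            weights = [strength[state][n] for n in options]
            # Sample the next state weighted by associative strength.
            nxt = rng.choices(options, weights=weights, k=1)[0]
            if nxt == "Goal":
                target = 1.0  # reaching the goal is the reward
            else:
                # Part of the best strength at the next state is passed back.
                target = gamma * max(strength[nxt].values())
            strength[state][nxt] += alpha * (target - strength[state][nxt])
            state = nxt
    return strength

strength = train()
```

After many restarts the connections along Start, S2, S3, S4, S5, Goal should carry more strength than the detours, which is the picture the last maze slide describes.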

slide-33
SLIDE 33

Stanford autonomous helicopter

  • https://www.youtube.com/watch?v=VCdxqn0fcnE
slide-34
SLIDE 34

RL applications in robotics

  • Robot Learns to Flip Pancakes
  • Autonomous spider learns to walk forward by reinforcement learning
  • Reinforcement learning for a robotic soccer goalkeeper

slide-35
SLIDE 35

Conclusion

  • The Reinforcement Learning Problem
  • Inside an RL agent

– Policy
– Reward
– Value
– Model

  • Temporal difference learning