

SLIDE 1

CS 309: Autonomous Intelligent Robotics

Instructor: Jivko Sinapov

http://www.cs.utexas.edu/~jsinapov/teaching/cs309_spring2017/

SLIDE 2

Reinforcement Learning

SLIDE 3

A little bit about next semester...

  • New robots: robot arm, HSR-1 robot
  • Virtually all of the grade will be based on a project
  • There will still be some lectures and tutorials, but much of the class time will be used to give updates on your projects and for discussions

SLIDE 4

Reinforcement Learning

SLIDE 5

Activity: You are the Learner

  • At each time step, you receive an observation (a color)
  • You have three actions: “clap”, “wave”, and “stand”
  • After performing an action, you may receive a reward

SLIDE 6

Next time...

How can we formalize the strategy for solving this RL problem into an algorithm?

SLIDE 7

Project Breakout Session

  • Meet with your group
  • Summarize what you've done so far and identify next steps
  • Come up with questions for me, the TAs, and the mentors

SLIDE 8

Main Reference

Sutton, R. S. and Barto, A. G. (2012). Reinforcement Learning: An Introduction, Chapters 1–3

SLIDE 9

What is Reinforcement Learning (RL)?

SLIDE 10
SLIDE 11

Ivan Pavlov (1849-1936)

SLIDE 12
SLIDE 13

From Pavlov to Markov

SLIDE 14

Andrey Andreyevich Markov (1856 – 1922)

[http://en.wikipedia.org/wiki/Andrey_Markov]

SLIDE 15

Markov Chain

SLIDE 16

Markov Decision Process

SLIDE 17

The Multi-Armed Bandit Problem

a.k.a. how to pick between slot machines (one-armed bandits) so that you walk out of the casino with the most $$$

[Figure: a row of k slot machines: Arm 1, Arm 2, ..., Arm k]

SLIDE 18

How should we decide which slot machine to pull next?

SLIDE 19

How should we decide which slot machine to pull next?

Observed payouts so far: 0, 1, 1, 0, 1, 0, 0, 0, 50, 0

SLIDE 20

How should we decide which slot machine to pull next?

  • Machine 1: pays 1 with prob. 0.6, and 0 otherwise
  • Machine 2: pays 50 with prob. 0.01, and 0 otherwise
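The expected values make the difficulty concrete: machine 1 pays 1 × 0.6 = 0.6 per pull on average, machine 2 pays 50 × 0.01 = 0.5, yet a short run on machine 2 usually shows nothing but zeros. A minimal simulation sketch (function names are illustrative):

```python
import random

def pull_machine_1():
    # pays 1 with probability 0.6, else 0 -> expected value 0.6
    return 1 if random.random() < 0.6 else 0

def pull_machine_2():
    # pays 50 with probability 0.01, else 0 -> expected value 0.5
    return 50 if random.random() < 0.01 else 0

n = 100_000
print(sum(pull_machine_1() for _ in range(n)) / n)  # ~0.6
print(sum(pull_machine_2() for _ in range(n)) / n)  # ~0.5
```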

SLIDE 21

Value Function

A value function encodes the “value” of performing a particular action (i.e., pulling a particular bandit's arm)

$$Q(a) = \frac{\sum_{i=1}^{n_a} r_i}{n_a}$$

where $r_i$ are the rewards observed when performing action $a$, and $n_a$ is the number of times the agent has picked action $a$.
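As a sketch of this sample-average estimate (the names record, reward_sum, and pull_count, and the zero default for unseen actions, are illustrative choices, not from the slides):

```python
from collections import defaultdict

reward_sum = defaultdict(float)   # total reward observed per action
pull_count = defaultdict(int)     # number of times each action was picked

def record(action, reward):
    reward_sum[action] += reward
    pull_count[action] += 1

def Q(action):
    # Sample-average estimate; returns 0.0 before any data
    # (initialization strategies are discussed later in the deck)
    if pull_count[action] == 0:
        return 0.0
    return reward_sum[action] / pull_count[action]
```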

SLIDE 22

How do we choose the next action?

  • Greedy: pick the action that maximizes the value function, i.e., $a^* = \arg\max_a Q(a)$
  • ε-Greedy: with probability ε pick a random action; otherwise, be greedy
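A minimal sketch of both rules, assuming Q maps an action to its current value estimate (epsilon = 0.1 is just a common default, not something fixed by the slides):

```python
import random

def epsilon_greedy(actions, Q, epsilon=0.1):
    """With probability epsilon pick a random action; otherwise be greedy."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=Q)  # argmax_a Q(a)
```

Calling epsilon_greedy(actions, Q, epsilon=0.0) reduces to the pure greedy rule.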

SLIDE 23

10-armed Bandit Example

SLIDE 24

Soft-Max Action Selection

Actions are chosen with probability proportional to $e^{Q(a)/\tau}$:

$$P(a) = \frac{e^{Q(a)/\tau}}{\sum_{b} e^{Q(b)/\tau}}$$

where $e \approx 2.718$ is the base of the natural logarithm and $\tau$ is the “temperature”. As the temperature goes up, all actions become nearly equally likely to be selected; as it goes down, actions with higher value-function outputs become more likely.
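A sketch of soft-max (Boltzmann) selection under the same assumed Q interface; temperature = 1.0 is an arbitrary default:

```python
import math
import random

def softmax_select(actions, Q, temperature=1.0):
    # P(a) proportional to e^(Q(a)/temperature)
    prefs = [math.exp(Q(a) / temperature) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs], k=1)[0]
```

With a very large temperature the weights become nearly equal; as the temperature approaches zero the highest-valued action dominates, recovering the greedy rule in the limit.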

SLIDE 25

What happens after choosing an action?

Batch: store every reward and recompute the average, $Q(a) = \frac{1}{n_a}\sum_{i=1}^{n_a} r_i$

Incremental: keep a running estimate, $Q(a) \leftarrow Q(a) + \frac{1}{n_a}\,(r - Q(a))$
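A small sketch showing the two forms compute the same quantity (the reward list is illustrative):

```python
rewards = [0, 1, 1, 0, 1]          # illustrative observations for one action

# Batch: store everything and recompute the average
Q_batch = sum(rewards) / len(rewards)

# Incremental: keep only a count and a running estimate
Q, n = 0.0, 0
for r in rewards:
    n += 1
    Q += (1 / n) * (r - Q)

assert abs(Q - Q_batch) < 1e-12    # the two forms agree
```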

SLIDE 26

Updating the Value Function

SLIDE 27

What happens when the payout of a bandit is changing over time?

SLIDE 28

What happens when the payout of a bandit is changing over time?

Earlier rewards may not be indicative of how the bandit performs now

SLIDE 29

What happens when the payout of a bandit is changing over time?

Use a constant step size, $Q(a) \leftarrow Q(a) + \alpha\,(r - Q(a))$, instead of the sample-average step size $1/n_a$, so that recent rewards count more than older ones.
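A sketch of the constant-step-size update (alpha = 0.1 is an illustrative choice):

```python
def update(q, reward, alpha=0.1):
    # Exponential recency-weighted average: with a constant alpha, old
    # rewards decay geometrically, so the estimate tracks a drifting payout
    return q + alpha * (reward - q)
```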

SLIDE 30

How do we construct a value function at the start (before any actions have been taken)?

SLIDE 31

How do we construct a value function at the start (before any actions have been taken)?

  • Zeros: every arm starts at 0
  • Random: e.g., 0.23, 0.76, 0.9, ...
  • Optimistic: every arm starts at +5
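A sketch of the three initialization choices for a k-armed bandit (k = 10 is illustrative):

```python
import random

k = 10  # number of arms

q_zeros = [0.0] * k                             # neutral start
q_random = [random.random() for _ in range(k)]  # e.g., 0.23, 0.76, 0.9, ...
q_optimistic = [5.0] * k                        # +5 is above any plausible payout,
                                                # so a greedy learner is pushed to
                                                # try every arm before settling
```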

SLIDE 32
SLIDE 33

The Multi-Armed Bandit Problem

The casino always wins – so why is this problem important?

SLIDE 34

The Reinforcement Learning Problem

SLIDE 35

RL in the context of MDPs

SLIDE 36

The Markov Assumption

The reward and state transition observed at time t after picking action a in state s are independent of anything that happened before time t

SLIDE 37

Maze Example

[slide credit: David Silver]

SLIDE 38

Maze Example: Value Function

[slide credit: David Silver]

SLIDE 39

Maze Example: Policy

[slide credit: David Silver]

SLIDE 40

Maze Example: Model

[slide credit: David Silver]

SLIDE 41

Notation

Set of states: $S$
Set of actions: $A$
Transition function: $T(s, a, s') = P(s' \mid s, a)$
Reward function: $R(s, a)$

SLIDE 42

Action-Value Function

SLIDE 43

Action-Value Function

$$Q(s, a) = R(s, a) + \gamma \sum_{s'} T(s, a, s') \max_{a'} Q(s', a')$$

  • $Q(s,a)$: the value of taking action a in state s
  • $R(s,a)$: the reward received after taking action a in state s
  • $T(s,a,s')$: the probability of going to state s' from s after a
  • $\gamma$: the discount factor (between 0 and 1)
  • $a'$: the action with the highest action-value in state s'
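A small worked instance (all numbers are illustrative): with $R(s,a) = 1$, $\gamma = 0.9$, and two successors $s_1, s_2$ where $T(s,a,s_1) = 0.8$, $T(s,a,s_2) = 0.2$, $\max_{a'} Q(s_1,a') = 2$, and $\max_{a'} Q(s_2,a') = 0$:

$$Q(s,a) = 1 + 0.9\,(0.8 \cdot 2 + 0.2 \cdot 0) = 1 + 1.44 = 2.44$$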

SLIDE 44

Action-Value Function

Common algorithms for learning the action-value function include Q-Learning and SARSA. The policy consists of always taking the action that maximizes the action-value function.

SLIDE 45

Q-Learning Example

  • Example Slides
SLIDE 46

Q-Learning Algorithm
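The slide's algorithm isn't reproduced in this transcript, so here is a sketch of standard tabular Q-learning; the env interface (reset, actions, step) and the hyperparameter values are assumptions for illustration:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning. Assumes a hypothetical env with reset() -> state,
    actions(state) -> list, step(state, action) -> (next_state, reward, done)."""
    Q = defaultdict(float)  # Q[(state, action)], zero-initialized
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            acts = env.actions(s)
            # epsilon-greedy behavior policy
            if random.random() < epsilon:
                a = random.choice(acts)
            else:
                a = max(acts, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(s, a)
            # off-policy update toward reward plus discounted best next value
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in env.actions(s2))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```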

SLIDE 47

Pac-Man RL Demo

SLIDE 48

How does Pac-Man “see” the world?

SLIDE 49

How does Pac-Man “see” the world?

SLIDE 50

The state-space may be continuous...

[Diagram: agent-environment loop; the agent observes a state and reward and outputs an action]

SLIDE 51

How does Pac-Man “see” the world?

SLIDE 52

Q-Function Approximation

$Q(s, a) \approx a_1 x_1 + a_2 x_2 + \dots + a_n x_n$, where the $x_i$ are features describing how Pac-Man “sees” the world and the $a_i$ are learned weights
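A sketch of how such a linear approximation is evaluated and updated; the feature values x_i (e.g., distance to the nearest ghost or food dot) would come from a hypothetical feature extractor, and alpha = 0.05 is illustrative:

```python
def q_value(weights, features):
    # Linear Q-function: a1*x1 + a2*x2 + ... + an*xn
    return sum(w * x for w, x in zip(weights, features))

def td_update(weights, features, target, alpha=0.05):
    # Move each weight in proportion to its feature (gradient of the linear Q)
    error = target - q_value(weights, features)
    return [w + alpha * error * x for w, x in zip(weights, features)]
```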

SLIDE 53

Example Learning Curve

Sinapov, J. et al. (2015). Learning Inter-Task Transferability in the Absence of Target Task Samples. In Proceedings of the 2015 ACM Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), Istanbul, Turkey, May 4–8, 2015.

SLIDE 54

Curriculum Development for RL Agents


[Figure: gridworld with agent A and a Goal]

SLIDE 55

Curriculum Development for RL Agents


[Figure: the same gridworld with agent A, the Goal, and the most difficult region highlighted]

SLIDE 56

Main Approach


[Figure: saved game states at steps ..., t-21, t-20, t-19, ..., t]

SLIDE 57

Main Approach


[Figure: saved game states at steps ..., t-21, t-20, t-19, ..., t]

Rewind back k game steps and branch out

SLIDE 58

Narvekar, S., Sinapov, J., Leonetti, M. and Stone, P. (2016). Source Task Creation for Curriculum Learning. To appear in Proceedings of the 2016 ACM Conference on Autonomous Agents and Multi-Agent Systems (AAMAS).

SLIDE 59

Resources

  • BURLAP: Java RL Library: http://burlap.cs.brown.edu/
  • Reinforcement Learning: An Introduction: http://people.inf.elte.hu/lorincz/Files/RL_2006/SuttonBook.pdf

SLIDE 60

THE END

SLIDE 61