slide-1
SLIDE 1

Reinforcement Learning: A Tutorial

Satinder Singh

Computer Science & Engineering University of Michigan, Ann Arbor

with special thanks to Rich Sutton, Michael Kearns, Andy Barto, Michael Littman, Doina Precup, Peter Stone, Andrew Ng,...

http://www.eecs.umich.edu/~baveja/NIPS05Tutorial/

slide-2
SLIDE 2

Outline

  • History and Place of RL
  • Markov Decision Processes (MDPs)
  • Planning in MDPs
  • Learning in MDPs
  • Function Approximation and RL
  • Partially Observable MDPs (POMDPs)
  • Beyond MDP/POMDPs
  • Applications
slide-3
SLIDE 3

RL is Learning from Interaction

[Diagram: Agent and Environment interacting through action, perception, and reward]

  • complete agent
  • temporally situated
  • continual learning and planning
  • object is to affect environment
  • environment is stochastic and uncertain

RL is like Life!

slide-4
SLIDE 4

RL (another view)

Agent chooses actions so as to maximize expected cumulative reward over a time horizon. Observations can be vectors or other structures; actions can be multi-dimensional; rewards are scalar but can be arbitrarily uninformative. The agent has partial knowledge about its environment.

[Diagram: the agent's life as a sequence of interaction steps; one step is a unit of experience]

slide-5
SLIDE 5

Key Ideas in RL

  • Temporal Differences (or updating a guess on the basis of another guess)

  • Eligibility traces
  • Off-policy learning
  • Function approximation for RL
  • Hierarchical RL (options)
  • Going beyond MDPs/POMDPs towards AI
slide-6
SLIDE 6

Demos...

slide-7
SLIDE 7

Stone & Sutton

slide-8
SLIDE 8

Stone & Sutton

slide-9
SLIDE 9

Keepaway Soccer (Stone & Sutton)

  • 4 vs 3 keepaway
  • Learned policy could keep the ball for 10.2 seconds
  • Random policy could keep the ball for 6.3 seconds
  • 5 vs 4 keepaway
  • Learned policy could keep the ball for 12.3 seconds
  • Random policy could keep the ball for 8.3 seconds
slide-10
SLIDE 10

Stone & Sutton

slide-11
SLIDE 11

Tetris Demo Learned by J Bagnell & J Schneider

slide-12
SLIDE 12

History & Place (of RL)

slide-13
SLIDE 13

Place

[Diagram: Reinforcement Learning at the intersection of Control Theory (optimal control), (Mathematical) Psychology, Artificial Intelligence, Operations Research, and Neuroscience]

slide-14
SLIDE 14

(Partial) History

  • “Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond.”

  • (Thorndike, 1911, p. 244)
  • Law of Effect
slide-15
SLIDE 15

(Partial) History...

  • Idea of programming a computer to learn by trial and error (Turing, 1954)
  • SNARCs (Stochastic Neural-Analog Reinforcement Calculators) (Minsky)
  • Checkers playing program (Samuel, 59)
  • Lots of RL in the 60s (e.g., Waltz & Fu 65; Mendel 66; Fu 70)
  • MENACE (Matchbox Educable Noughts and Crosses Engine) (Michie, 63)
  • RL based Tic Tac Toe learner (GLEE) (Michie, 68)
  • Classifier Systems (Holland, 75)
  • Adaptive Critics (Barto & Sutton, 81)
  • Temporal Differences (Sutton, 88)

slide-16
SLIDE 16

RL and Machine Learning

  • 1. Supervised Learning (error correction)
  • learning approaches to regression & classification
  • learning from examples, learning from a teacher
  • 2. Unsupervised Learning
  • learning approaches to dimensionality reduction, density estimation, recoding data based on some principle, etc.

  • 3. Reinforcement Learning
  • learning approaches to sequential decision making
  • learning from a critic, learning from delayed reward
slide-17
SLIDE 17

(Partial) List of Applications

  • Robotics
  • Navigation, Robosoccer, walking, juggling, ...
  • Control
  • factory processes, admission control in telecomm, resource control in multimedia networks, helicopters, elevators, ....

  • Games
  • Backgammon, Chess, Othello, Tetris, ...
  • Operations Research
  • Warehousing, transportation, scheduling, ...
  • Others
  • HCI, Adaptive treatment design, biological modeling, ...
slide-18
SLIDE 18

List of Conferences and Journals

  • Conferences
  • Neural Information Processing Systems (NIPS)
  • International Conference on Machine Learning (ICML)
  • AAAI, IJCAI, Agents,COLT,...
  • Journals
  • Journal of Artificial Intelligence Research (JAIR) [free online]
  • Journal of Machine Learning Research (JMLR) [free online]
  • Neural Computation, Neural Networks
  • Machine Learning, AI journal, ...
slide-19
SLIDE 19

Model of Agent-Environment Interaction

Model?

slide-20
SLIDE 20

Markov Decision Processes (MDPs)

Markov Assumption: the next state depends only on the current state and action,
P(st+1 | st, at, st-1, at-1, …, s0, a0) = P(st+1 | st, at)

slide-21
SLIDE 21

MDP Preliminaries

  • S: finite state space
  • A: finite action space
  • P: transition probabilities P(i | j, a) [or Pa(ij)]
  • R: payoff function R(i) or R(i, a)
  • π: deterministic non-stationary policy, π : S -> A
  • Vπ: return for policy π when started in state i
  • Discounted framework: Vπ(i) = Eπ{ r0 + γ r1 + γ² r2 + … | s0 = i }

Also, average-reward framework: Vπ = limT→∞ Eπ (1/T) { r0 + r1 + … + rT-1 }

slide-22
SLIDE 22

MDP Preliminaries...

  • In MDPs there always exists a deterministic stationary policy (one that simultaneously maximizes the value of every state)

slide-23
SLIDE 23

Bellman Optimality Equations

Policy Evaluation (Prediction):  Vπ(s) = R(s) + γ Σs' P(s' | s, π(s)) Vπ(s').  Markov assumption!

slide-24
SLIDE 24

Bellman Optimality Equations

Optimal Control:  V*(s) = maxa [ R(s, a) + γ Σs' P(s' | s, a) V*(s') ]

slide-25
SLIDE 25

Graphical View of MDPs

[Diagram: trajectory of states and actions unrolled over time]

Temporal Credit Assignment Problem!!

Learning from Delayed Reward

Distinguishes RL from other forms of ML

slide-26
SLIDE 26

Planning & Learning in MDPs

slide-27
SLIDE 27

Planning in MDPs

  • Given an exact model (i.e., reward function, transition probabilities) and a fixed policy π
  • Arbitrary initialization: V0
  • Value Iteration (Policy Evaluation): for k = 0, 1, 2, ... compute Vk+1 from Vk until a stopping criterion is met

slide-28
SLIDE 28

Planning in MDPs

Given an exact model (i.e., reward function, transition probabilities) and a fixed policy π. Arbitrary initialization: Q0. Value Iteration (Policy Evaluation): for k = 0, 1, 2, ... compute Qk+1 from Qk until a stopping criterion is met.

slide-29
SLIDE 29

Planning in MDPs

Given an exact model (i.e., reward function, transition probabilities). Value Iteration (Optimal Control): for k = 0, 1, 2, ... compute Vk+1 from Vk until a stopping criterion is met (see the sketch below).
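
Below is a minimal sketch (not from the slides) of value iteration for optimal control on a small tabular MDP; the array layout P[s, a, s'] and R[s] is an assumption made for illustration.

    import numpy as np

    def value_iteration(P, R, gamma=0.9, tol=1e-6):
        """P[s, a, s2]: transition probabilities; R[s]: payoff. Returns V* and a greedy policy."""
        n_states, n_actions, _ = P.shape
        V = np.zeros(n_states)                      # arbitrary initialization V0
        while True:
            # Bellman optimality backup: Q(s,a) = R(s) + gamma * sum_s' P(s'|s,a) V(s')
            Q = R[:, None] + gamma * P.dot(V)       # shape (n_states, n_actions)
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:     # stopping criterion
                break
            V = V_new
        return V, Q.argmax(axis=1)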

slide-30
SLIDE 30

Convergence of Value Iteration

[Figure: successive value-iteration backups V1, V2, V3, V4 converging to V*]

Contractions!

slide-31
SLIDE 31

Proof of the DP contraction

slide-32
SLIDE 32

Learning in MDPs

  • Have access to the “real system” but no model

[Diagram: trajectory of states and actions unrolled over time]

Generate experience. Two classes of approaches:

  • 1. Indirect methods
  • 2. Direct methods

This is what life looks like!

slide-33
SLIDE 33

Indirect Methods for Learning in MDPs

  • Use experience data to estimate a model
  • Compute optimal policy w.r.t. the estimated model (certainty-equivalent policy)
  • Exploration-Exploitation Dilemma

Parametric models

The model converges asymptotically provided all state-action pairs are visited infinitely often in the limit; hence the certainty-equivalent policy converges asymptotically to the optimal policy

slide-34
SLIDE 34

Direct Method:

Only updates state-action pairs that are visited...

Q-Learning

s0 a0 r0  s1 a1 r1  s2 a2 r2  s3 a3 r3 …  sk ak rk …    A unit of experience: < sk, ak, rk, sk+1 >

Update: Qnew(sk, ak) = (1 − α) Qold(sk, ak) + α [ rk + γ maxb Qold(sk+1, b) ]

Watkins, 1988.  α is the step-size.  Big table of Q-values?  (A sketch of the update is below.)
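
A minimal sketch of the tabular Q-learning update above, with ε-greedy exploration; the environment interface (env.reset, env.step returning next state, reward, done) is an illustrative assumption.

    import numpy as np

    def q_learning_episode(env, Q, alpha=0.1, gamma=0.9, epsilon=0.1):
        """One episode of tabular Q-learning; Q is the big table of Q-values."""
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection (one simple answer to exploration)
            if np.random.rand() < epsilon:
                a = np.random.randint(Q.shape[1])
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)           # a unit of experience <s, a, r, s'>
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target   # Watkins' update
            s = s_next
        return Q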

slide-35
SLIDE 35
slide-36
SLIDE 36

So far...

  • Q-Learning is the first provably convergent direct adaptive optimal control algorithm
  • Great impact on the field of modern Reinforcement Learning
  • smaller representation than models
  • automatically focuses attention to where it is needed, i.e., no sweeps through state space
  • though does not solve the exploration versus exploitation dilemma
  • epsilon-greedy, optimistic initialization, etc., ...
slide-37
SLIDE 37

Monte Carlo?

Suppose you want to estimate Vπ(s) for some fixed state s. Start at state s, execute the policy for a long trajectory, and compute the empirical discounted return. Do this several times and average the returns across trajectories. How many trajectories? The average is an unbiased estimate whose variance improves with n.
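
A minimal sketch of this estimator; the rollout function, which executes the fixed policy from state s and returns the reward sequence, is an illustrative assumption.

    def mc_value_estimate(rollout, s, n_trajectories=100, gamma=0.9, horizon=1000):
        """Average the empirical discounted return of several long trajectories from s."""
        total = 0.0
        for _ in range(n_trajectories):
            rewards = rollout(s, horizon)           # r0, r1, ... obtained by following the policy
            total += sum((gamma ** k) * r for k, r in enumerate(rewards))
        return total / n_trajectories               # unbiased; variance shrinks as n grows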

slide-38
SLIDE 38

Application: Direct Method

slide-39
SLIDE 39

Dog Training Ground by Kohl & Stone

slide-40
SLIDE 40

Before Training by Kohl & Stone

slide-41
SLIDE 41

After Training by Kohl & Stone

slide-42
SLIDE 42

Application: Indirect Method

slide-43
SLIDE 43

by Andrew Ng and colleagues

slide-44
SLIDE 44

by Andrew Ng and colleagues

slide-45
SLIDE 45

Sparse Sampling

Use a generative model to generate a depth-‘n’ tree with ‘m’ samples for each action in each state generated. This yields a near-optimal action at the root state in time independent of the size of the state space (but exponential in the horizon!)

Kearns, Mansour & Ng

slide-46
SLIDE 46

Classification for RL

  • Use Sparse Sampling to derive a data set of examples of near-optimal actions for a subset of states
  • Pass this data set to a classification algorithm
  • Leverage algorithms and theoretical results on classification for RL

Langford

slide-47
SLIDE 47

Trajectory Trees…

Given a set of policies to evaluate, the number of trajectory trees needed to find a near-optimal policy from the given set depends on the “VC-dim” of the class of policies.

Kearns, Mansour & Ng

slide-48
SLIDE 48

Summary

  • Space of Algorithms:
  • (does not need a model) linear in horizon + polynomial in states
  • (needs a generative model) independent of states + exponential in horizon
  • (needs a generative model) time complexity depends on the complexity of the policy class

slide-49
SLIDE 49

Eligibility Traces (another key idea in RL)

slide-50
SLIDE 50

Eligibility Traces

  • The policy evaluation problem: given a (in general stochastic) policy π, estimate
    Vπ(i) = Eπ{ r0 + γ r1 + γ² r2 + γ³ r3 + … | s0 = i }
    from multiple experience trajectories generated by following policy π repeatedly from state i

A single trajectory:

r0 r1 r2 r3 …. rk rk+1 ….

slide-51
SLIDE 51

TD(λ)

r0 r1 r2 r3 …. rk rk+1 ….

0-step return (e0): r0 + γ V(s1), the temporal-difference target

Vnew(s0) = Vold(s0) + α [ r0 + γ Vold(s1) − Vold(s0) ] = Vold(s0) + α [ e0 − Vold(s0) ]    TD(0)

slide-52
SLIDE 52

TD(λ)

r0 r1 r2 r3 …. rk rk+1 ….

0-step return (e0): r0 + γ V(s1)
1-step return (e1): r0 + γ r1 + γ² V(s2)

Vnew(s0) = Vold(s0) + α [ e1 − Vold(s0) ] = Vold(s0) + α [ r0 + γ r1 + γ² Vold(s2) − Vold(s0) ]

slide-53
SLIDE 53

TD(λ)

r0 r1 r2 r3 …. rk rk+1 ….

e0: r0 + γ V(s1)
e1: r0 + γ r1 + γ² V(s2)
e2: r0 + γ r1 + γ² r2 + γ³ V(s3)
ek-1: r0 + γ r1 + γ² r2 + … + γ^(k-1) rk-1 + γ^k V(sk)
e∞: r0 + γ r1 + γ² r2 + γ³ r3 + … + γ^k rk + γ^(k+1) rk+1 + …

With weights w0, w1, w2, …, wk-1, …, w∞ on the returns:

Vnew(s0) = Vold(s0) + α [ Σk wk ek − Vold(s0) ]

slide-54
SLIDE 54

TD(λ)

r0 r1 r2 r3 …. rk rk+1 ….

e0: r0 + γ V(s1)
e1: r0 + γ r1 + γ² V(s2)
e2: r0 + γ r1 + γ² r2 + γ³ V(s3)
ek-1: r0 + γ r1 + γ² r2 + … + γ^(k-1) rk-1 + γ^k V(sk)

With weights (1−λ), (1−λ)λ, (1−λ)λ², …, (1−λ)λ^(k-1), …:

Vnew(s0) = Vold(s0) + α [ Σk (1−λ) λ^k ek − Vold(s0) ]

0 ≤ λ ≤ 1 interpolates between 1-step TD and Monte-Carlo

slide-55
SLIDE 55

TD(λ)

r0 r1 r2 r3 …. rk rk+1 ….

Temporal differences:
δ0 = r0 + γ V(s1) − V(s0)
δ1 = r1 + γ V(s2) − V(s1)
δ2 = r2 + γ V(s3) − V(s2)
δk-1 = rk-1 + γ V(sk) − V(sk-1)

Vnew(s0) = Vold(s0) + α [ Σk (γλ)^k δk ],  the (γλ)^k weighting is the eligibility trace

w.p.1 convergence (Jaakkola, Jordan & Singh)
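
A minimal sketch of the backward view of TD(λ) with accumulating eligibility traces, for tabular policy evaluation; the environment interface (stepping under the fixed policy being evaluated) is an illustrative assumption.

    import numpy as np

    def td_lambda_episode(env, V, alpha=0.1, gamma=0.9, lam=0.8):
        """Tabular TD(lambda): every state is nudged by the current TD error, weighted by its trace."""
        e = np.zeros_like(V)                        # eligibility traces
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step()            # the environment follows the fixed policy
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # temporal difference
            e[s] += 1.0                             # mark the visited state as eligible
            V += alpha * delta * e                  # update all eligible states at once
            e *= gamma * lam                        # decay the traces
            s = s_next
        return V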

slide-56
SLIDE 56

Bias-Variance Tradeoff

r0 r1 r2 r3 …. rk rk+1 ….

e0: r0 + γ V(s1)
e1: r0 + γ r1 + γ² V(s2)
e2: r0 + γ r1 + γ² r2 + γ³ V(s3)
ek-1: r0 + γ r1 + γ² r2 + … + γ^(k-1) rk-1 + γ^k V(sk)
e∞: r0 + γ r1 + γ² r2 + γ³ r3 + … + γ^k rk + γ^(k+1) rk+1 + …

Longer-horizon returns have increasing variance and decreasing bias.

slide-57
SLIDE 57

TD(λ)

slide-58
SLIDE 58

Bias-Variance Tradeoff

Intuition: start with a large λ and then decrease λ over time.

With a constant step-size, the error asymptotes at a/(1 − b) (an increasing function of λ), while the rate of convergence is b^t (exponential), with b a decreasing function of λ.

Kearns & Singh, 2000

slide-59
SLIDE 59

Near-Optimal Reinforcement Learning in Polynomial Time

(solving the exploration versus exploitation dilemma)

slide-60
SLIDE 60

Setting

  • Unknown MDP M
  • At any step: explore or exploit
  • Finite time analysis
  • Goal: develop an algorithm such that an agent following it will, in time polynomial in the complexity of the MDP, achieve nearly the same payoff per time step as an agent that knew the MDP to begin with.

  • Need to solve exploration versus exploitation
  • Algorithm called E3
slide-61
SLIDE 61

Preliminaries

  • Actual return: (1/T)(R1 + R2 + … + RT)
  • Let T* denote the (unknown) mixing time of the MDP
  • One key insight: even the optimal policy will take time O(T*) to achieve an actual return that is near-optimal
  • E3 has the property that it always compares favorably to the best policy amongst the policies that mix in the time that the algorithm is run.

slide-62
SLIDE 62

The Algorithm (informal)

  • Do “balanced wandering” until some state is known
  • Do forever:
  • Construct the known-state MDP
  • Compute the optimal exploitation policy in the known-state MDP
  • If the return of the above policy is near optimal, execute it
  • Otherwise compute the optimal exploration policy in the known-state MDP and execute it; do balanced wandering from unknown states.

slide-63
SLIDE 63
slide-64
SLIDE 64
slide-65
SLIDE 65

M: true known-state MDP.   M̂: estimated known-state MDP.

slide-66
SLIDE 66

Main Result

  • A new algorithm E3, taking inputs ε and δ, such that for any V* and T* holding in the unknown MDP:
  • Total number of actions and computation time required by E3 are poly(1/ε, 1/δ, T*, N)
  • Performance guarantee: with probability at least (1 − δ), the amortized return of E3 so far will exceed (1 − ε)V*

slide-67
SLIDE 67

Function Approximation and Reinforcement Learning

slide-68
SLIDE 68

General Idea

[Diagram: a function approximator maps an input (s, a) to an output Q(s, a); targets or errors are fed back through gradient-descent methods]

The function approximator could be:

  • a table
  • Backprop Neural Network
  • Radial-Basis-Function Network
  • Tile Coding (CMAC)
  • Nearest Neighbor, Memory Based
  • Decision Tree

slide-69
SLIDE 69

Neural Networks as FAs

Q(s, a) = f(s, a, w)    (estimated value; w is the weight vector)

e.g., gradient-descent Sarsa:

w ← w + α [ rt+1 + γ Q(st+1, at+1) − Q(st, at) ] ∇w f(st, at, w)

where rt+1 + γ Q(st+1, at+1) is the target value and ∇w f(st, at, w) is the standard backprop gradient

slide-70
SLIDE 70

Linear in the Parameters FAs

Each state s is represented by a feature vector φs, and the value estimate is linear in the parameters θ:

V̂(s) = θᵀ φs

Or represent a state-action pair with a feature vector φs,a and approximate action values:

Qπ(s, a) = E{ r1 + γ r2 + γ² r3 + … | st = s, at = a, π }

Q̂(s, a) = θᵀ φs,a
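
A minimal sketch combining the gradient-descent Sarsa update from the previous slide with the linear-in-the-parameters form above; the feature function phi(s, a) is an illustrative assumption.

    import numpy as np

    def linear_sarsa_update(theta, phi, s, a, r, s_next, a_next, alpha=0.01, gamma=0.9):
        """One semi-gradient Sarsa step: for a linear FA the gradient of Q is just phi(s, a)."""
        q = theta.dot(phi(s, a))
        q_next = theta.dot(phi(s_next, a_next))
        td_error = r + gamma * q_next - q           # target value minus current estimate
        return theta + alpha * td_error * phi(s, a)
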
slide-71
SLIDE 71

Sparse Coarse Coding

[Diagram: a fixed, expansive re-representation of the input into many binary features, followed by a linear last layer]

Coarse: large receptive fields.  Sparse: few features present at one time.

slide-72
SLIDE 72
slide-73
SLIDE 73

Shaping Generalization in Coarse Coding

slide-74
SLIDE 74
slide-75
SLIDE 75
slide-76
SLIDE 76

FAs & RL

  • Linear FA (divergence can happen)
  • Nonlinear neural networks (theory is not well developed)
  • Non-parametric, e.g., nearest-neighbor (provably not divergent; bounds on error)
  • Everyone uses their favorite FA… little theoretical guidance yet!
  • Does FA really beat the curse of dimensionality?
  • Probably; with FA, computation seems to scale with the complexity of the solution (crinkliness of the value function) and how hard it is to find it
  • Empirically it works
  • though many folks have a hard time making it so
  • no off-the-shelf FA+RL yet
slide-77
SLIDE 77

by Andrew Ng and colleagues

slide-78
SLIDE 78

Dynamic Channel Assignment in Cellular Telephones

slide-79
SLIDE 79

Dynamic Channel Assignment

Singh & Bertsekas (NIPS)

Channel assignment in cellular telephone systems: the agent decides what (if any) conflict-free channel to assign to a caller.

  • State: current assignments
  • Actions: feasible assignments
  • Reward: 1 per call per sec.

Learned better dynamic assignment policies than the competition.

slide-80
SLIDE 80

Run Cellphone Demo

(http://www.eecs.umich.edu/~baveja/Demo.html)

slide-81
SLIDE 81

After MDPs...

  • Great success with MDPs
  • What next?
  • Rethinking Actions, States, Rewards
  • Options instead of actions
  • POMDPs
slide-82
SLIDE 82

Rethinking Action (Hierarchical RL) Options

(Precup, Sutton, Singh) MAXQ by Dietterich HAMs by Parr & Russell

slide-83
SLIDE 83

Related Work

“Classical” AI

Fikes, Hart & Nilsson(1972) Newell & Simon (1972) Sacerdoti (1974, 1977) Macro-Operators Korf (1985) Minton (1988) Iba (1989) Kibler & Ruby (1992) Qualitative Reasoning Kuipers (1979) de Kleer & Brown (1984) Dejong (1994) Laird et al. (1986) Drescher (1991) Levinson & Fuchs (1994) Say & Selahatin (1996) Brafman & Moshe (1997)

Robotics and Control Engineering

Brooks (1986) Maes (1991) Koza & Rice (1992) Brockett (1993) Grossman et. al (1993) Dorigo & Colombetti (1994) Asada et. al (1996) Uchibe et. al (1996) Huber & Grupen(1997) Kalmar et. al (1997) Mataric(1997) Sastry (1997) Toth et. al (1997)

Reinforcement Learning and MDP Planning

Mahadevan & Connell (1992) Singh (1992) Lin (1993) Dayan & Hinton (1993) Kaelbling(1993) Chrisman (1994) Bradtke & Duff (1995) Ring (1995) Sutton (1995) Thrun & Schwartz (1995) Boutilier et. al (1997) Dietterich(1997) Wiering & Schmidhuber (1997) Precup, Sutton & Singh (1997) McGovern & Sutton (1998) Parr & Russell (1998) Drummond (1998) Hauskrecht et. al (1998) Meuleau et. al (1998) Ryan and Pendrith (1998)

slide-84
SLIDE 84

Abstraction in Learning and Planning

  • A long-standing, key problem in AI!
  • How can we give abstract knowledge a clear semantics? e.g. “I could go to the library”
  • How can different levels of abstraction be related? spatial: states; temporal: time scales
  • How can we handle stochastic, closed-loop, temporally extended courses of action?
  • Use RL/MDPs to provide a theoretical foundation
slide-85
SLIDE 85

Options

A generalization of actions to include courses of action. Options can take a variable number of steps; option execution is assumed to be call-and-return.

An option is a triple o = ⟨I, π, β⟩

  • I ⊆ S is the set of states in which o may be started
  • π : S × A → [0,1] is the policy followed during o
  • β : S → [0,1] is the probability of terminating in each state

Example: docking
  • I: all states in which the charger is in sight
  • π: hand-crafted controller
  • β: terminate when docked or the charger is not visible
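
A minimal data-structure sketch of the triple ⟨I, π, β⟩ above; the callables standing in for the option's policy and termination condition are illustrative assumptions.

    from dataclasses import dataclass
    from typing import Callable, Set

    @dataclass
    class Option:
        initiation: Set[int]                        # I: states in which the option may be started
        policy: Callable[[int], int]                # pi(s): action followed while executing
        termination: Callable[[int], float]         # beta(s): probability of terminating in s

        def can_start(self, s: int) -> bool:
            return s in self.initiation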

slide-86
SLIDE 86

Rooms Example

[Figure: gridworld with 4 rooms connected by hallways; candidate goal locations G marked]

  • 4 rooms, 4 hallways
  • 8 multi-step options (to each room's 2 hallways)
  • 4 unreliable primitive actions: up, down, right, left (fail 33% of the time)
  • Given a goal location, quickly plan the shortest route
  • Goal states are given a terminal value of 1
  • γ = .9, all other rewards zero

slide-87
SLIDE 87

Options define a Semi-Markov Decision Process (SMDP)

[Diagram: state trajectories over time for an MDP (discrete time, homogeneous discount), an SMDP (continuous time, discrete events, interval-dependent discount), and options over an MDP (discrete time, overlaid discrete events, interval-dependent discount)]

A discrete-time SMDP overlaid on an MDP; it can be analyzed at either level.

slide-88
SLIDE 88

MDP + Options = SMDP

Thus all Bellman equations and DP results extend for value functions over options and models of options (cf. SMDP theory). Theorem: For any MDP, and any set of options, the decision process that chooses among the options, executing each to termination, is an SMDP.

slide-89
SLIDE 89

What does the SMDP connection give us?

  • Policies over options: µ : S × O → [0,1]
  • Value functions over options: Vµ(s), Qµ(s,o), VO*(s), QO*(s,o)
  • Learning methods: Bradtke & Duff (1995), Parr (1998)
  • Models of options
  • Planning methods: e.g. value iteration, policy iteration, Dyna...
  • A coherent theory of learning and planning with courses of action at variable time scales, yet at the same level

A theoretical foundation for what we really need! But the most interesting issues are beyond SMDPs...

slide-90
SLIDE 90

Value Functions for Options

Define value functions for options, similar to the MDP case:

Vµ(s) = E{ rt+1 + γ rt+2 + ... | E(µ, s, t) }
Qµ(s, o) = E{ rt+1 + γ rt+2 + ... | E(oµ, s, t) }

Now consider policies µ ∈ Π(O) restricted to choose only from options in O:

VO*(s) = maxµ∈Π(O) Vµ(s)
QO*(s, o) = maxµ∈Π(O) Qµ(s, o)

slide-91
SLIDE 91

Models of Options

Knowing how an option is executed is not enough for reasoning about it, or planning with it. We need information about its consequences. The model of the consequences of starting option o in state s has:

  • a reward part:
    r(o, s) = E{ r1 + γ r2 + ... + γ^(k-1) rk | s0 = s, o taken in s0, lasts k steps }
  • a next-state part:
    p(o, s, s') = E{ γ^k δ(sk, s') | s0 = s, o taken in s0, lasts k steps }
    where δ(sk, s') is 1 if s' = sk is the termination state, 0 otherwise

This form follows from SMDP theory. Such models can be used interchangeably with models of primitive actions in Bellman equations.

slide-92
SLIDE 92

Room Example

[Figure: gridworld with 4 rooms connected by hallways; candidate goal locations G marked]

  • 4 rooms, 4 hallways
  • 8 multi-step options (to each room's 2 hallways)
  • 4 unreliable primitive actions: up, down, right, left (fail 33% of the time)
  • Given a goal location, quickly plan the shortest route
  • Goal states are given a terminal value of 1
  • γ = .9, all other rewards zero

slide-93
SLIDE 93

Example: Synchronous Value Iteration Generalized to Options

Initialize:  V0(s) ← 0  for all s ∈ S

Iterate:  Vk+1(s) ← maxo∈O [ r(o, s) + Σs'∈S p(o, s, s') Vk(s') ]  for all s ∈ S

The algorithm converges to the optimal value function given the options: limk→∞ Vk = VO*

Once VO* is computed, µO* is readily determined.

If O = A, the algorithm reduces to conventional value iteration.
If A ⊆ O, then VO* = V*.
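
A minimal sketch of this iteration, assuming each option's model is given as a reward vector r_o[s] and a discounted next-state matrix p_o[s, s'] (with the discount already folded in, as in the option models above).

    import numpy as np

    def option_value_iteration(option_models, n_states, n_iters=100):
        """option_models: list of (r_o, p_o) pairs; returns the optimal value function over options."""
        V = np.zeros(n_states)                      # V0(s) = 0 for all s
        for _ in range(n_iters):
            # V_{k+1}(s) = max over options o of [ r_o(s) + sum_s' p_o(s, s') V_k(s') ]
            backups = np.stack([r + p.dot(V) for r, p in option_models])
            V = backups.max(axis=0)
        return V

If every option is a primitive action, this reduces to conventional value iteration.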

slide-94
SLIDE 94

Rooms Example

[Figure: value iteration in the rooms gridworld at iterations #0, #1, #2, with cell-to-cell primitive actions vs. room-to-room options; V(goal) = 1]

slide-95
SLIDE 95

Example with both primitive actions and options (goal and subgoal)

[Figure: initial values and iterations #1 through #5]

slide-96
SLIDE 96

What does the SMDP connection give us?

  • Policies over options: µ : S × O → [0,1]
  • Value functions over options: Vµ(s), Qµ(s,o), VO*(s), QO*(s,o)
  • Learning methods: Bradtke & Duff (1995), Parr (1998)
  • Models of options
  • Planning methods: e.g. value iteration, policy iteration, Dyna...
  • A coherent theory of learning and planning with courses of action at variable time scales, yet at the same level

A theoretical foundation for what we really need! But the most interesting issues are beyond SMDPs...

slide-97
SLIDE 97

Advantages of Dual MDP/SMDP View

At the SMDP level: compute value functions and policies over options, with the benefit of increased speed / flexibility.
At the MDP level: learn how to execute an option for achieving a given goal.
Between the MDP and SMDP level: improve over existing options (e.g. by terminating early); learn about the effects of several options in parallel, without executing them to termination.

slide-98
SLIDE 98

Between MDPs and SMDPs

  • Termination Improvement: improving the value function by changing the termination conditions of options
  • Intra-Option Learning: learning the values of options in parallel, without executing them to termination; learning the models of options in parallel, without executing them to termination
  • Tasks and Subgoals: learning the policies inside the options

slide-99
SLIDE 99

Termination Improvement

Idea: we can do better by sometimes interrupting ongoing options, forcing them to terminate before β says to.

Theorem: For any policy over options µ : S × O → [0,1], suppose we interrupt its options one or more times, when Qµ(s, o) < Qµ(s, µ(s)), where s is the state at that time and o is the ongoing option, to obtain µ' : S × O' → [0,1]. Then µ' ≥ µ (it attains more or equal reward everywhere).

Application: suppose we have determined QO* and thus µ = µO*. Then µ' is guaranteed better than µO*, and it is available with no additional computation.

slide-100
SLIDE 100

[Figure: landmarks task; circles show the range (input set) of each run-to-landmark controller, with start S and goal G]

Landmarks Task

Task: navigate from S to G as fast as possible.
  • 4 primitive actions, for taking tiny steps up, down, left, right
  • 7 controllers for going straight to each one of the landmarks, from within a circular region where the landmark is visible
In this task, planning at the level of primitive actions is computationally intractable; we need the controllers.

slide-101
SLIDE 101
slide-102
SLIDE 102

Illustration: Reconnaissance Mission Planning (Problem)

  • Mission: fly over (observe) the most valuable sites and return to base
  • Stochastic weather affects observability (cloudy or clear) of sites
  • Limited fuel
  • Intractable with classical optimal control methods
  • Temporal scales: actions decide which direction to fly now; options decide which site to head for
  • Options compress space and time: reduce steps from ~600 to ~6, reduce the number of states dramatically

QO*(s, o) = r(o, s) + Σs' p(o, s, s') VO*(s'),  where s can be any state (~10^6) but s' ranges over sites only (6)

[Figure: map of sites with rewards 10, 50, 50, 50, 100, 25, 15, 5, 25, 8 and the base; ~100 decision steps; options operate at the time scale of the mean time between weather changes]

slide-103
SLIDE 103

Illustration: Reconnaissance Mission Planning (Results)

  • SMDP planner: assumes options are followed to completion; plans the optimal SMDP solution
  • SMDP planner with re-evaluation of options on each step: plans as if options must be followed to completion, but actually takes them for only one step and re-picks a new option on every step
  • Static planner: assumes the weather will not change; plans the optimal tour among clear sites; re-plans whenever the weather changes

[Figure: expected reward per mission (roughly 30 to 60) for the three planners, under low-fuel and high-fuel conditions]

Temporal abstraction finds a better approximation than the static planner, with little more computation than the SMDP planner.

slide-104
SLIDE 104

Intra-Option Learning Methods for Markov Options

Idea: take advantage of each fragment of experience.

SMDP Q-learning:
  • execute the option to termination, keeping track of the reward along the way
  • at the end, update only the option taken, based on the reward and the value of the state in which the option terminates

Intra-option Q-learning (see the sketch below):
  • after each primitive action, update all the options that could have taken that action, based on the reward and the expected value from the next state on

Proven to converge to correct values, under the same assumptions as 1-step Q-learning.
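
A minimal sketch of the intra-option update for Markov options with deterministic internal policies; the Q table indexed by (state, option id) and the Option objects with policy/termination callables are illustrative assumptions.

    def intra_option_q_update(Q, options, s, a, r, s_next, alpha=0.1, gamma=0.9):
        """After one primitive step <s, a, r, s'>, update every option consistent with that step."""
        best_next = max(Q[s_next].values())          # value of the best option at the next state
        for o_id, o in options.items():
            if o.policy(s) != a:
                continue                             # this option could not have taken action a in s
            beta = o.termination(s_next)
            backup = (1 - beta) * Q[s_next][o_id] + beta * best_next
            Q[s][o_id] += alpha * (r + gamma * backup - Q[s][o_id])
        return Q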

slide-105
SLIDE 105
slide-106
SLIDE 106

Example of Intra-Option Value Learning

Intra-option methods learn correct values without ever taking the options! SMDP methods are not applicable here.

Random start, goal in the right hallway, random actions.

[Figure: option values and the average value of the greedy policy over 1000-6000 episodes; the learned values of both hallway options approach the true values, and the greedy policy approaches the value of the optimal policy]

slide-107
SLIDE 107
slide-108
SLIDE 108

Intra-Option Model Learning

Intra-option methods work much faster than SMDP methods.

Random start state, no goal, pick randomly among all options.

[Figure: reward-prediction and state-prediction errors (max and average) vs. number of options executed (20,000 to 100,000), for the intra-option, SMDP, and SMDP 1/t methods]
slide-109
SLIDE 109

Tasks and Subgoals

It is natural to define options as solutions to subtasks, e.g. treat hallways as subgoals and learn shortest paths.

We have defined subgoals as pairs ⟨G, g⟩:
  • G ⊆ S is the set of states treated as subgoals
  • g : G → ℝ gives their subgoal values (can be both good and bad)

Each subgoal has its own set of value functions, e.g.:

Vg(o, s) = E{ r1 + γ r2 + ... + γ^(k-1) rk + γ^k g(sk) | s0 = s, o, sk ∈ G }
Vg*(s) = maxo Vg(o, s)

Policies inside options can be learned from subgoals, in an intra-option, off-policy manner.

slide-110
SLIDE 110

Between MDPs and SMDPs

  • Termination Improvement: improving the value function by changing the termination conditions of options
  • Intra-Option Learning: learning the values of options in parallel, without executing them to termination; learning the models of options in parallel, without executing them to termination
  • Tasks and Subgoals: learning the policies inside the options

slide-111
SLIDE 111

Summary: Benefits of Options

  • Transfer: solutions to sub-tasks can be saved and reused; domain knowledge can be provided as options and subgoals
  • Potentially much faster learning and planning, by representing action at an appropriate temporal scale
  • Models of options are a form of knowledge representation: expressive, clear, suitable for learning and planning
  • Much more to learn than just one policy, one set of values: a framework for “constructivism”, for finding models of the world that are useful for rapid planning and learning

slide-112
SLIDE 112

POMDPs

[Graphical model: hidden nominal-states st → st+1 → st+2, with actions at, at+1 and observations ot, ot+1, ot+2; T is the transition model, O the observation model]

Belief-states are distributions over the hidden nominal-states.
slide-113
SLIDE 113

POMDPs...

  • n underlying nominal or hidden states
  • b(h) is the belief-state at history h
  • Ta: transition probabilities among hidden states for action a
  • Oao(i, i): the probability of observation o on action a in state i (a diagonal matrix)
  • b(hao) = b(h) Ta Oao / Z = b(h) Bao / Z  (see the sketch below)
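
A minimal sketch of this belief update, assuming T[a] is the n-by-n transition matrix for action a and O[a][:, o] holds the probability of observation o in each landing state.

    import numpy as np

    def belief_update(b, T, O, a, o):
        """Return b(hao): push the belief through T_a, weight by the observation probabilities, normalize."""
        unnormalized = (b @ T[a]) * O[a][:, o]
        Z = unnormalized.sum()                       # probability of observing o after taking a
        return unnormalized / Z
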
slide-114
SLIDE 114

Rethinking State

(Predictive State Representations or PSRs) (TD-Nets)

Initiated by Littman, Sutton & Singh …Singh’s group at Umich …Sutton’s group at UAlberta

slide-115
SLIDE 115

Go to NIPS05PSRTutorial

slide-116
SLIDE 116

Rethinking Reward

(Intrinsically Motivated RL) By Singh, Barto & Chentanez … Singh’s group at Umich … Barto’s group at UMass

slide-117
SLIDE 117

Go to NIPS05IMRLTutorial

slide-118
SLIDE 118

Applications of RL

slide-119
SLIDE 119

List of Applications

  • Robotics
  • Navigation, Robosoccer, walking, juggling, ...
  • Control
  • factory processes, admission control in telecomm,

resource control in multimedia networks, ....

  • Games
  • Backgammon, Chess, Othello, ...
  • Operations Research
  • Warehousing, transportation, scheduling, ...
  • Others
  • Adaptive treatment design, biological modeling, ...
slide-120
SLIDE 120

RL applied to HCI

slide-121
SLIDE 121

Spoken Dialogue Systems

user ASR TTS DB Dialogue strategy

slide-122
SLIDE 122

Sample Dialogue

S1: Welcome to NJFun. How may I help you?
U1: I’d like to find um winetasting in Lambertville in the morning. (ASR output: I’d like to find out wineries the in the Lambertville in the morning.)
S2: Did you say you are interested in Lambertville?
U2: Yes.
S3: Did you say you want to go in the morning?
U3: Yes.
S4: I found a winery near Lambertville that is open in the morning. It is […] Please give me feedback by saying “good”, “so-so”, or “bad”.
U4: Good.

slide-123
SLIDE 123

NJFun

  • Spoken dialogue system providing telephone access to a DB of activities in NJ
  • Want to obtain 3 attributes: activity type (e.g., wine tasting), location (e.g., Lambertville), time (e.g., morning)
  • Failure to bind an attribute: query the DB with don’t-care


slide-126
SLIDE 126

Approximate State Space

N.B. Non-state variables record attribute values; state does not condition on previous attributes!

slide-127
SLIDE 127

Action Space

  • Initiative (when T = 0): open or constrained prompt? open or constrained grammar? N.B. might depend on H, A, …
  • Confirmation (when V = 1): confirm or move on or re-ask? N.B. might depend on C, H, A, …

  • Only allowed “reasonable” actions
  • Results in 42 states with (binary) choices
  • Small state space, large strategy space
slide-128
SLIDE 128

The Experiment

  • Designed 6 specific tasks, each with a web survey
  • Gathered 75 internal subjects
  • Split into training and test, controlling for M/F, native/non-native, experienced/inexperienced
  • 54 training subjects generated 311 dialogues
  • Exploratory training dialogues used to build the MDP
  • Optimal strategy for objective TASK COMPLETION computed and implemented
  • 21 test subjects performed tasks and web surveys for the modified system, generating 124 dialogues
  • Did statistical analyses of performance changes
slide-129
SLIDE 129

Estimating the MDP

[Diagram: a dialogue as alternating system utterances s1, s2, s3, … and user utterances u1, u2, u3, …, produced by actions a1, a2, a3, … with probabilistic outcomes e1, e2, e3, …]

Estimate transition probabilities P(next state | current state & action) and rewards R(current state, action) from the set of exploratory dialogues plus system logs. This models the population of users.
slide-130
SLIDE 130

Reward Function

  • Objective task completion:
  • -1 for an incorrect attribute binding
  • 0, 1, 2, 3 correct attribute bindings
  • Binary version: 1 for 3 correct bindings, else 0
  • Other reward measures: perceived completion, user satisfaction, future use, perceived understanding, user understanding, ease of use
  • Optimized for objective task completion, but predicted improvements in some others

slide-131
SLIDE 131

Main Results

  • Objective task completion: train mean ~ 1.722, test mean ~ 2.176; two-sample t-test p-value ~ 0.0289
  • Binary task completion: train mean ~ 0.515, test mean ~ 0.635; two-sample t-test p-value ~ 0.05
  • Outperformed hand-built policies (“move to the middle”)

slide-132
SLIDE 132

by Hajime Kimura

slide-133
SLIDE 133

by Hajime Kimura

slide-134
SLIDE 134

by Stefan Schaal & Chris Atkeson

slide-135
SLIDE 135

by Stefan Schaal & Chris Atkeson

slide-136
SLIDE 136

by Sebastian Thrun & Colleagues

slide-137
SLIDE 137

Textbook References

  • Reinforcement Learning: An Introduction

by Richard S. Sutton & Andrew G. Barto MIT Press, Cambridge MA, 1998.

  • Neuro-Dynamic Programming

by Dimitri Bertsekas & John Tsitsiklis Athena Scientific, Belmont MA, 1996.

slide-138
SLIDE 138

Myths of RL

  • RL is TD or perhaps Q-learning
  • RL is model-free
  • RL is table lookup
  • RL is slow
  • RL does not work well with function approximation
  • POMDPs are hard for RL to deal with
  • RL is about learning optimal policies
slide-139
SLIDE 139

Twiki pages on RL

  • Myths of RL
  • http://neuromancer.eecs.umich.edu/cgi-bin/twiki/view/Main/MythsofRL
  • Successes of RL
  • http://neuromancer.eecs.umich.edu/cgi-bin/twiki/view/Main/SuccessesOfRL
  • Theory of RL
  • http://neuromancer.eecs.umich.edu/cgi-bin/twiki/view/Main/TheoryOfRL
  • Algorithms of RL
  • http://neuromancer.eecs.umich.edu/cgi-bin/twiki/view/Main/AlgorithmsOfRL
  • Demos of RL
  • http://neuromancer.eecs.umich.edu/cgi-bin/twiki/view/Main/DemosOfRL
slide-140
SLIDE 140

RL Abstractly...

Life is an optimal control problem! Goal: maximize expected payoff over some time horizon.

[Diagram: Agent and Environment; the agent senses observations and payoff, maintains state, and emits actions]

slide-141
SLIDE 141

Satinder Singh EECS Dept., Univ. of Michigan

MDPs…

  • Make the problem precise (& simpler?)
  • Yet keeps many interesting challenges

– Preserves tradeoff between short-term and long-term consequences
– Temporal credit assignment
– Exploration vs. exploitation
– Generalization across states (or learning from small amounts of experience)

slide-142
SLIDE 142

Satinder Singh EECS Dept., Univ. of Michigan

So why rethink state?

slide-143
SLIDE 143

Satinder Singh EECS Dept., Univ. of Michigan

Narrow vs. Broad competence

  • MDPs/POMDPs very successful in OR/engineering/control
  • Still lots of hard work left to do… especially in making RL more “off the shelf”
  • My goal: move towards old-fashioned AI? (build broadly competent, flexible agents)

slide-144
SLIDE 144

Satinder Singh EECS Dept., Univ. of Michigan

Knowledge in AI Systems

  • AI systems tend to be brittle.
  • MDP/POMDP representations, while modeling uncertainty, share that brittleness
  • Relational extensions may help… but many of these approaches are “linguistically inspired”, using notions of objects and relations between objects… which, while very meaningful and natural to humans, may not be suited for computational agents.

Goal: a KR expressed entirely in input-output terms… so that the knowledge learned or given is meaningful, verifiable, and maintainable by the agent without human intervention

slide-145
SLIDE 145

Satinder Singh EECS Dept., Univ. of Michigan

Knowledge/Models for RL/AI

  • Knowledge that is useful for achieving

high reward

– Is tweety a bird?
– Can one sit on the object in front?
– If I pick the phone and dial my home number, what is the chance that my wife picks up?
– Can block A be stacked on top of block B?

  • Answers to questions (usually predictive)
slide-146
SLIDE 146

Satinder Singh EECS Dept., Univ. of Michigan

slide-147
SLIDE 147

Satinder Singh EECS Dept., Univ. of Michigan

slide-148
SLIDE 148

Satinder Singh EECS Dept., Univ. of Michigan

slide-149
SLIDE 149

Satinder Singh EECS Dept., Univ. of Michigan

slide-150
SLIDE 150

Satinder Singh EECS Dept., Univ. of Michigan

slide-151
SLIDE 151

Satinder Singh EECS Dept., Univ. of Michigan

slide-152
SLIDE 152

Satinder Singh EECS Dept., Univ. of Michigan

Rightward movement example

slide-153
SLIDE 153

Satinder Singh EECS Dept., Univ. of Michigan

slide-154
SLIDE 154

Satinder Singh EECS Dept., Univ. of Michigan

slide-155
SLIDE 155

Satinder Singh EECS Dept., Univ. of Michigan

slide-156
SLIDE 156

Satinder Singh EECS Dept., Univ. of Michigan

slide-157
SLIDE 157

Satinder Singh EECS Dept., Univ. of Michigan

Rethink state

  • Think of states as answers to questions (i.e., predictions of outcomes of experiments one can do in the world)

– Wallet’s contents, Michael’s location, presence of objects, …

  • Prior work

– Learning deterministic FSA’s (Rivest & Schapire, 1987); Multiplicity Automata (Beimel et al.)
– Added stochasticity (Jaeger, 1999)
– Added actions (PSR work, 2001)

slide-158
SLIDE 158

Satinder Singh EECS Dept., Univ. of Michigan

Which questions?

  • What is a question (future, test)?
Uncontrolled system: a future is a sequence of observations t = o1 o2 … ok
Controlled system: a future is a sequence of observations for a sequence of actions t = a1 o1 a2 o2 … ak ok
  • What is an (answer) prediction for a (question) future?
Uncontrolled system: p(t) = prob(O1 = o1, …, Ok = ok)
Controlled system: p(t) = prob(O1 = o1, …, Ok = ok | A1 = a1, …, Ak = ak)

We will show that this class of questions contains within it a subset whose answers are sufficient to model the state of interesting dynamical systems. (Discrete-time, discrete-observation, finite-action systems.)

slide-159
SLIDE 159

Satinder Singh EECS Dept., Univ. of Michigan

System Dynamics Vector

[Diagram: the system dynamics vector lists p(t) for every possible future t1, t2, t3, t4, …]

A “System” is a distribution over all futures. This mathematical construct IS the system (not a model): any exact model of the system should be able to generate this vector, for both controlled and uncontrolled systems. There are lots of constraints on the entries of this vector, and there may be 0’s in it.

slide-160
SLIDE 160

Satinder Singh EECS Dept., Univ. of Michigan

System Dynamics Matrix

[Diagram: the system dynamics matrix has a row for every history h1 = φ, h2, h3, …, a column for every test t1, t2, t3, …, and entries p(ti | hj)]

Tests (or experiments):
Uncontrolled system: ti = o1 o2 … ok, hj = o1 o2 … on, and p(ti | hj) = prob(On+1 = o1, …, On+k = ok | o1 o2 … on)
Controlled system: ti = a1 o1 … ak ok, hj = a1 o1 … an on, and p(ti | hj) = prob(On+1 = o1, …, On+k = ok | a1 o1 … an on, An+1 = a1, …, An+k = ak)

Again, this construct IS the system (not a model).

slide-161
SLIDE 161

Satinder Singh EECS Dept., Univ. of Michigan

System Dynamics Matrix

[Diagram: system dynamics matrix; the rows include only those histories that can happen]

Any model must be able to generate this matrix. All rows are determined uniquely by the first row. The linear dimension of a dynamical system is the rank (say N) of its system dynamics matrix (only finite-rank systems are considered here).

slide-162
SLIDE 162

Satinder Singh EECS Dept., Univ. of Michigan

System Dynamics Matrix

[Diagram: system dynamics matrix with a set of columns Q = {q1 q2 … qN} marked as core tests]

For any test t and history h: p(t | h) = p(Q | h)ᵀ mt; note that mt is independent of h! The prediction for any test is a linear combination of the predictions p(Q | h) of the core tests.

slide-163
SLIDE 163

Satinder Singh EECS Dept., Univ. of Michigan

nth-order Markov Models

[Diagram: system dynamics matrix of an nth-order Markov model]

There are at most as many unique rows as possible n-length histories (let’s say K). The model parameters are all length-one tests at the unique histories. Theorem: all K-history Markov models are dynamical systems of linear dimension ≤ K.

slide-164
SLIDE 164

Satinder Singh EECS Dept., Univ. of Michigan

K-history Markov models…

  • Theorem: there exist dynamical systems of linear dimension N that cannot be modeled by any finite-order Markov model
  • Consider a system in which the first observation determines which of two sub-systems is entered…

slide-165
SLIDE 165

Satinder Singh EECS Dept., Univ. of Michigan

POMDPs…

[Graphical model: hidden nominal-states st → st+1 → st+2, with actions at, at+1 and observations ot, ot+1, ot+2; T is the transition model, O the observation model]

Learning POMDP models from data (EM) does not work very well; almost no applications. Belief-states are distributions over the hidden nominal-states.
slide-166
SLIDE 166

Satinder Singh EECS Dept., Univ. of Michigan

POMDPs…

  • n underlying or nominal-states
  • State representation for any history h: the belief-state b(h), a probability distribution over nominal-states
  • Update parameters:
– Transition probabilities Ta (one for every a); observation probabilities Oao (for every a, o); initial belief state b(h0)
  • b(hao) = b(h) Ta Oao / Z = b(h) Bao / Z
  • For t = a1 o1 … ak ok:  p(t | h) = b(h) Ta1 Oa1o1 … Tak Oakok = b(h) Bt

predictions are linear in belief-states

slide-167
SLIDE 167

Satinder Singh EECS Dept., Univ. of Michigan

POMDPs

  • Theorem: Every POMDP with ‘n’ nominal states is a dynamical system of linear dimension ≤ n

[Diagram: the system dynamics matrix of a POMDP has entries p(ti | hj) = b(hj) Bti; each row is a linear combination, with weights b(hj), of the n rows whose belief-states are unit-basis vectors]

slide-168
SLIDE 168

Satinder Singh EECS Dept., Univ. of Michigan

POMDPs

[Diagram: system dynamics matrix of a POMDP with entries b(hj) Bti]

Theorem: there exist dynamical systems of finite linear dimension that cannot be modeled by any finite nominal-state POMDP. Intuition: a POMDP is restricted to positive linear combinations…

slide-169
SLIDE 169

Satinder Singh EECS Dept., Univ. of Michigan

PSRs

[Diagram: system dynamics matrix with the columns of the core tests q1 q2 . . . qN highlighted]

Core tests Q = {q1 q2 . . . qN}.  State representation: p(Q | h) = [p(q1 | h) . . . p(qN | h)]

slide-170
SLIDE 170

Satinder Singh EECS Dept., Univ. of Michigan

PSRs

[Diagram: system dynamics matrix; the entry for test t at history h is p(t | h) = p(Q | h)ᵀ mt, and the row for the extended history hao is obtained from the row for h]

p(Q | h) is a sufficient statistic for history h.

slide-171
SLIDE 171

Satinder Singh EECS Dept., Univ. of Michigan

Updating Linear PSRs

  • Update core test qi on taking action a and observing o in history h (see the sketch below)
  • Note: one only needs parameters for the one-step extensions to the core tests!
  • The m’s can have negative entries!!

model parameters
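
A minimal sketch of a linear-PSR state update under these assumptions: m_ao is the weight vector for the one-step test ao, and the i-th row of m_aoq is the weight vector for the one-step extension a o q_i of core test q_i.

    import numpy as np

    def psr_update(p_Q, m_ao, m_aoq):
        """p_Q: predictions of the N core tests at history h; returns p(Q | hao)."""
        denom = p_Q.dot(m_ao)                        # p(ao | h)
        numer = m_aoq.dot(p_Q)                       # p(a o q_i | h) for each core test q_i
        return numer / denom                         # p(q_i | hao) = p(a o q_i | h) / p(ao | h)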

slide-172
SLIDE 172

Satinder Singh EECS Dept., Univ. of Michigan

Update Parameters…

[Diagram: system dynamics matrix; the update parameters are the weight vectors for all 1-step tests a1o1 … ajoj and for all 1-step extensions of the core tests a1o1q1 … ajojqN]

slide-173
SLIDE 173

Satinder Singh EECS Dept., Univ. of Michigan

Linear PSRs

  • Theorem: Every discrete-time dynamical system of linear dimension ‘n’ is equivalent to a linear PSR with ‘n’ core tests

slide-174
SLIDE 174

Satinder Singh EECS Dept., Univ. of Michigan

Ok, where are we?

  • Defined system dynamics matrix
  • Defined linear PSRs

Result: K-history Markov model < K-nominal-state POMDPs < K-test linear PSRs = dynamical systems of linear-dimension K

(applies to both controlled and uncontrolled systems)

slide-175
SLIDE 175

Satinder Singh EECS Dept., Univ. of Michigan

Discovery & Learning in PSRs

  • Discovery

– Determine core tests given experience data

  • Learning

– Determine update parameters given core tests and experience data

  • Discovery & Learning

– Do both from experience data

slide-176
SLIDE 176

Satinder Singh EECS Dept., Univ. of Michigan

Discovering PSR tests

[Diagram: estimate the matrix of predictions p(tj | hi), growing from length-1 tests and histories to length-2 tests and histories, and so on]

If the rank stops changing, are you done? In practice yes; in theory no!

slide-177
SLIDE 177

Satinder Singh EECS Dept., Univ. of Michigan

Learning PSR models

  • Gradient algorithm
  • Suffix-History Algorithm

[Diagram: the quantities p(Q | h1), p(aoQ | h1), …, p(Q | hn), p(aoQ | hn) can be sampled from data]

slide-178
SLIDE 178

Satinder Singh EECS Dept., Univ. of Michigan

Results on Learning & Discovery

  • Given core tests
slide-179
SLIDE 179

Satinder Singh EECS Dept., Univ. of Michigan

Nonlinear PSRs

  • Suppose we allow non-linear predictions and non-linear updates?
  • Nonlinear core tests X = {x1, x2, …, xw}: a sufficient statistic of history, but smaller in size than the linear dimension of the dynamical system
  • State representation p(X | h) for history h
  • Prediction for test ‘t’: p(t | h) = ft(p(X | h)) for some nonlinear function ‘f’ (independent of ‘t’)
  • Update process
slide-180
SLIDE 180

Satinder Singh EECS Dept., Univ. of Michigan

Beyond Linear PSRs

  • Nonlinear PSRs

– Exponential compression over Linear PSRs and POMDPs in some deterministic systems (Rudary & Singh, NIPS 2003)

  • PSRs for continuous systems
slide-181
SLIDE 181

Satinder Singh EECS Dept., Univ. of Michigan

Predictive Linear Gaussian (PLG)

  • The distribution over the next “N” observations given the current history is Gaussian.
  • The predictive state is the mean and covariance matrix of the Gaussian.
  • The N+1st observation is computed as a linear function of the next N observations.

slide-182
SLIDE 182

Satinder Singh EECS Dept., Univ. of Michigan

PLG vs. LDS

Theorem: every linear dynamical system (LDS) with dimension ‘n’ can be modeled as a PLG with dimension ‘n’ (Kalman filters). A PLG has no hidden variables. We can derive consistent learning algorithms for a PLG.

slide-183
SLIDE 183

Satinder Singh EECS Dept., Univ. of Michigan

PLG Learning vs. EM for LDSs

slide-184
SLIDE 184

Satinder Singh EECS Dept., Univ. of Michigan

Illustration of PLG

slide-185
SLIDE 185

Satinder Singh EECS Dept., Univ. of Michigan

Illustration of PLG

slide-186
SLIDE 186

Satinder Singh EECS Dept., Univ. of Michigan

Illustration of Kernel PLGs

slide-187
SLIDE 187

Satinder Singh EECS Dept., Univ. of Michigan

Summary

  • Knowledge expressed entirely in observable quantities
– is possible
– is no less compact than at least unstructured traditional (latent variable) representations
– may be more efficiently learnable/plannable/maintainable…
  • So far: sufficient representations
  • Next: efficient (perhaps structured) observable representations
slide-188
SLIDE 188

Satinder Singh EECS Dept., Univ. of Michigan

Conclusion

  • MDPs are great!
  • We are making progress in going beyond MDPs (in states, actions & rewards)

  • Lots of work to be done…
slide-189
SLIDE 189

Satinder Singh EECS Dept., Univ. of Michigan

Leftover Slides

slide-190
SLIDE 190

Satinder Singh EECS Dept., Univ. of Michigan

Graphical Model for POMDPs

[Graphical model: hidden states st → st+1 → st+2, with actions at, at+1 and observations ot, ot+1, ot+2]

Learning POMDP models from data (EM) does not work very well; almost no applications. Belief-states are distributions over hidden states.

slide-191
SLIDE 191

Satinder Singh EECS Dept., Univ. of Michigan

Float/Reset

Float: Random walk. Reset: Go right, observe 1 if already there.

slide-192
SLIDE 192

Satinder Singh EECS Dept., Univ. of Michigan

Float/Reset (Linear PSR)

  • F0; R0
  • F0 R0
  • F0 F0 R0
  • F0 F0 F0 R0

p(F0 F0 F0 R0 | h F0) = 0.25 p(R0 | h) − 0.0625 p(F0 R0 | h) + 0.750 p(F0 F0 R0 | h)

slide-193
SLIDE 193

Satinder Singh EECS Dept., Univ. of Michigan

Learning & Discovery in Paint

slide-194
SLIDE 194

Satinder Singh EECS Dept., Univ. of Michigan

Learning & Discovery in Shuttle

slide-195
SLIDE 195

Satinder Singh EECS Dept., Univ. of Michigan

Learning & Discovery in Tiger

slide-196
SLIDE 196

Satinder Singh EECS Dept., Univ. of Michigan

Learning & Discovery in Network

slide-197
SLIDE 197

Satinder Singh EECS Dept., Univ. of Michigan

PSRs - a definition

  • Test
  • Predictions for test t
  • A core set of tests Q={t1,t2,…,tm}
  • State at history h : p(Q|h) = [p(t1|h) … p(tm|h)]
  • There exists Q, such that p(Q|h) is sufficient

statistic for all histories h!

  • => For arbitrary test t, p(t|h) = ft(p(Q|h))

Littman, Sutton, & Singh