Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

By Keith Bush and Joelle Pineau, NIPS 2009. Presented by Chenghui Cai, Duke University, ECE, February 19, 2010.

Outline

◮ Introduction to RL
  ◮ Theory and Philosophy Behind RL
  ◮ Markov Models and RL
  ◮ Moving from Theory Study to Real Applications
◮ Background of This Paper
◮ Methods
◮ Experiments
◮ Conclusions and Discussions

RL

◮ Simple philosophy: the agent is rewarded for good behaviors and punished for bad ones.

◮ General training data format: state (situation), action (decision), reward (label).

◮ Purpose of RL: learn a behavior policy.

◮ Foundation: the optimal Bellman equation,

    V*(s_t) = max_{a_t} [ E(r(s_{t+1}) | s_t, a_t) + γ E(V*(s_{t+1}) | s_t, a_t) ]

◮ The most general learning framework, on the spectrum of label explicitness:

    Supervised Learning (most: explicit labels) · RL (rewards) · Unsupervised Learning (least: no labels)
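As a concrete illustration (not from the paper), the optimal Bellman equation can be solved by value iteration on a small tabular MDP; the transition array P and reward array R below are randomly generated toys.

    import numpy as np

    # Toy tabular MDP (hypothetical): P[a, s, s'] transition probabilities,
    # R[a, s, s'] rewards, discount factor gamma.
    n_states, n_actions, gamma = 3, 2, 0.9
    rng = np.random.default_rng(0)
    P = rng.random((n_actions, n_states, n_states))
    P /= P.sum(axis=2, keepdims=True)          # normalize to valid distributions
    R = rng.random((n_actions, n_states, n_states))

    V = np.zeros(n_states)
    for _ in range(1000):
        # Q[a, s] = E[r(s') | s, a] + gamma * E[V(s') | s, a]
        Q = (P * (R + gamma * V)).sum(axis=2)
        V_new = Q.max(axis=0)                  # maximize over actions a
        if np.max(np.abs(V_new - V)) < 1e-8:   # stop at convergence
            break
        V = V_new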

Markov Models and RL

◮ Markov chains, HMMs, MDPs, and POMDPs (adapted from pomdp.org):

    Do we have control over      Are the states completely observable?
    the state transitions?       YES                              NO
    NO                           Markov Chain                     Hidden Markov Model (HMM)
    YES                          Markov Decision Process (MDP)    Partially Observable MDP (POMDP)

◮ Two ways to learn the behavior policy:
  ◮ Model-based: learn the dynamics, then solve the Markov model;
  ◮ Model-free: learn the policy directly, e.g., RPR or Q-learning.

◮ Applications: robotics, decision making under uncertainty.

Background of This Paper

◮ Barriers to application: (1) the goal (reward) is not well defined; (2) exploration is expensive; (3) the data do not preserve the Markov property.

◮ Solution 1: for many domains, particularly those governed by differential equations, leverage the induced locality (nearest neighborhood, e.g., s(t+1) and s(t)) during function approximation to satisfy the Markov property.

◮ Solution 2: reconstruct the state-spaces of partially observable systems, converting a high-order Markov property into a first-order one while preserving locality.

◮ Example: use manifold embeddings to reconstruct locally Euclidean state-spaces of forced, partially observable systems; the embedding can be found non-parametrically.

Summary of the Method

An offline RL procedure in two phases:

◮ Part 1, modeling phase: identify the appropriate embedding and define the local model.

◮ Part 2, learning phase: leverage the resulting locality and perform RL.

Modeling: Manifold Embeddings for RL 1/2

Purpose: use nonlinear dynamical systems theory to reconstruct complete state observability from incomplete observations via delay embeddings.

◮ Assume a real-valued vector space R^M, actions a, a state dynamics function f, and a deterministic policy a(t) = π(s(t)), where s(t) is the state. Then

    s(t+1) = f(s(t), a(t)) = f(s(t), π(s(t))) = φ(s(t))    (1)

◮ Suppose the system is observed only via a function y, such that

    s̃(t) = y(s(t))    (2)

Modeling: Manifold Embeddings for RL 2/2

◮ Construct a vector s_E(t) such that s_E lies on a subset of R^E that is an embedding of s:

    s_E(t) = [s̃(t), s̃(t−1), ..., s̃(t−(E−1))],  E > 2M    (3)

◮ Because embeddings preserve the connectivity of the original vector space R^M, in the context of RL the mapping ψ with

    s_E(t+1) = ψ(s_E(t))    (4)

may be substituted for f, and the vectors s_E(t) may be substituted for the corresponding vectors s(t).
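A minimal sketch of the delay-embedding construction in Eqs. (3) and (5), assuming NumPy and a 1-D array obs of scalar observations s̃(t); delay_embed and its arguments are my naming, not the paper's.

    import numpy as np

    def delay_embed(obs, E, tau=1):
        """Rows are s_E(t) = [s(t), s(t-tau), ..., s(t-(E-1)tau)] for each valid t."""
        obs = np.asarray(obs)
        start = (E - 1) * tau                          # first index with a full history
        rows = [obs[start - k * tau : len(obs) - k * tau] for k in range(E)]
        return np.stack(rows, axis=1)                  # shape: (len(obs) - start, E)

    # Example: embed a scalar observation series with E = 3 and lag tau = 5.
    obs = np.sin(np.linspace(0, 20, 500))
    S_E = delay_embed(obs, E=3, tau=5)                 # S_E.shape == (490, 3)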

Modeling: Nonparametric Identification of Manifold Embeddings 1/2

Remaining problem: how to choose the embedding dimension E. Solution: a singular value decomposition (SVD) algorithm (a code sketch follows the steps below).

◮ Given a sequence of state observations s̃ of length S̃, choose a sufficiently large fixed embedding dimension Ê.

◮ For each embedding window size T̂_min ∈ {Ê, ..., S̃}:

  1. Define a matrix Ŝ_E of row vectors ŝ_E(t), t ∈ {T̂_min, ..., S̃}, by the rule

       ŝ_E(t) = [s̃(t), s̃(t−τ), ..., s̃(t−(Ê−1)τ)]    (5)

  2. Compute the SVD of the matrix Ŝ_E: Ŝ_E = U Σ W*.

  3. Record the vector of singular values σ(T̂_min) from Σ.
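A sketch of this sweep, reusing delay_embed from above; the rule tying the lag τ to the window T̂_min and dimension Ê is my assumption (the slides leave it implicit), and Ê ≥ 2 is assumed.

    def embedding_spectrum(obs, E_hat, window_sizes):
        """Map each candidate window size T to the singular values of its embedding matrix."""
        spectrum = {}
        for T in window_sizes:
            tau = max(1, T // (E_hat - 1))             # assumed: spread E_hat samples over T
            S_hat = delay_embed(obs, E_hat, tau)
            spectrum[T] = np.linalg.svd(S_hat, compute_uv=False)
        return spectrum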

Modeling: Nonparametric Identification of Manifold Embeddings 2/2

◮ Estimate the embedding parameters, T_min and E, of s by analysis of the second singular value σ_2(T̂_min):

  1. The approximate window size, T_min, of s is the T̂_min value at the first local maximum of the sequence of all σ_2(T̂_min), for T̂_min ∈ {Ê, ..., S̃}.

  2. The approximate embedding dimension, E, is the number of non-trivial singular values of σ(T_min).

◮ Embedding s via the parameters T_min and E yields the matrix S_E of row vectors s_E(t), t ∈ {T_min, ..., S̃}.
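Continuing the sketch, the selection step might look as follows; the "non-trivial" threshold trivial_frac is my assumption, since the paper reads E off the spectrum plot.

    def select_embedding_params(spectrum, trivial_frac=0.05):
        """Pick T_min at the first local maximum of sigma_2, then count non-trivial singular values."""
        Ts = sorted(spectrum)
        sigma2 = np.array([spectrum[T][1] for T in Ts])    # second singular value per window
        peaks = [i for i in range(1, len(Ts) - 1)
                 if sigma2[i - 1] < sigma2[i] >= sigma2[i + 1]]
        T_min = Ts[peaks[0]] if peaks else Ts[int(np.argmax(sigma2))]
        sv = spectrum[T_min]
        E = int(np.sum(sv > trivial_frac * sv[0]))         # assumed threshold for "non-trivial"
        return T_min, E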

Modeling: Generative Local Models from Embeddings 1/2

Purpose: generate a local model that can simulate trajectories of the underlying system, and prepare the observed “state” for RL.

◮ Consider a dataset D of temporally aligned sequences of observations s̃(t), actions a(t), and rewards r(t), t ∈ {1, ..., S̃}.

◮ Applying the spectral embedding method above to D yields a sequence of vectors s_E(t), t ∈ {T_min, ..., S̃}.

◮ A local model M of D is the set of 3-tuples m(t) = {s_E(t), a(t), r(t)}, t ∈ {T_min, ..., S̃}.

◮ Define operations on these tuples (a data-structure sketch follows below): A(m(t)) = a(t), S(m(t)) = s_E(t), and Z(m(t)) = s_z(t), where s_z(t) = [s_E(t), a(t)]; also U(M, a) = M_a, where M_a is the subset of tuples in M containing action a.
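A minimal data-structure sketch of M and the four operations; the class and argument names are mine, and S_E, actions, and rewards are assumed to be pre-aligned on t ∈ {T_min, ..., S̃}.

    import numpy as np

    class LocalModel:
        """The local model M: a list of 3-tuples m(t) = (s_E(t), a(t), r(t))."""
        def __init__(self, S_E, actions, rewards):
            self.tuples = list(zip(np.asarray(S_E), actions, rewards))
        def A(self, m): return m[1]                    # action a(t)
        def S(self, m): return m[0]                    # embedded state s_E(t)
        def Z(self, m): return np.append(m[0], m[1])   # s_z(t) = [s_E(t), a(t)]
        def U(self, a):                                # M_a: tuples containing action a
            return [m for m in self.tuples if m[1] == a]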

Modeling: Generative Local Models from Embeddings 2/2

Consider a state vector x(i) ∈ R^E indexed by simulation time i, and compute its locality, i.e., its nearest neighbor in the model.

◮ The model's nearest neighbor of x(i) when taking action a(i), defined for a discrete action set and for the continuous case, respectively:

    m(t_x(i)) = argmin_{m(t) ∈ U(M, a(i))} ‖S(m(t)) − x(i)‖,  a ∈ A    (6)

    m(t_x(i)) = argmin_{m(t) ∈ M} ‖Z(m(t)) − [x(i), ω a(i)]‖,  a ∈ A    (7)

where ω is a scaling parameter.

◮ The model gradient and numerical integration are defined as

    ∇x(i) = S(m(t_x(i) + 1)) − S(m(t_x(i)))    (8)

    x(i+1) = x(i) + Δi (∇x(i) + η)    (9)

where η is a vector of noise and Δi is the integration step size.
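Continuing the LocalModel sketch above, Eq. (6) for discrete actions and the integration step of Eqs. (8)-(9) might look as follows (function names are mine).

    def nearest_index(model, x, a):
        """Eq. (6): index of the tuple nearest to x among tuples with action a."""
        best_i, best_d = None, float("inf")
        for i, m in enumerate(model.tuples):
            if m[1] == a:
                d = np.linalg.norm(model.S(m) - x)
                if d < best_d:
                    best_i, best_d = i, d
        return best_i

    def simulate_step(model, x, a, dt=1.0, noise_scale=0.0):
        """Eqs. (8)-(9): step x along the local gradient of the stored trajectory."""
        t = nearest_index(model, x, a)
        t_next = min(t + 1, len(model.tuples) - 1)     # successor tuple m(t_x + 1)
        grad = model.S(model.tuples[t_next]) - model.S(model.tuples[t])
        eta = noise_scale * np.random.randn(x.size)    # noise vector η
        return x + dt * (grad + eta)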

Q-learning

Consider the obtained sequences x(i) ∈ R^E, a(i), and r(i). Learn the optimal policy π* that maximizes the expected sum of future rewards, via the optimal action-value function (Q-function) Q*, such that

    Q*(x(i), a(i)) = r(i+1) + γ max_a Q*(x(i+1), a)    (10)

Iteratively construct an approximation Q of Q*:

    δ(i) = r(i+1) + γ max_a Q(x(i+1), a) − Q(x(i), a(i))    (11)

where δ(i) is the temporal-difference error used to improve the approximation,

    Q(x(i), a(i)) ← Q(x(i), a(i)) + α δ(i)    (12)

where α is the learning rate. (A minimal sketch of this update follows below.)
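A minimal sketch of the TD update in Eqs. (11)-(12) over a recorded trajectory; the grid discretization of the continuous embedded state is my simplification, and the paper's actual function approximator may differ.

    from collections import defaultdict
    import numpy as np

    def q_learning(trajectory, actions, gamma=0.95, alpha=0.1, bin_width=0.1):
        """trajectory: list of (x, a, r_next) with x in R^E; returns a sparse Q table."""
        Q = defaultdict(float)
        key = lambda x: tuple(np.round(np.asarray(x) / bin_width).astype(int))
        for (x, a, r_next), (x_next, _, _) in zip(trajectory, trajectory[1:]):
            td_target = r_next + gamma * max(Q[(key(x_next), b)] for b in actions)
            delta = td_target - Q[(key(x), a)]         # TD error δ(i), Eq. (11)
            Q[(key(x), a)] += alpha * delta            # update, Eq. (12)
        return Q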

Mountain Car: parking on the hill 1/2

A second-order nonlinear dynamical system. [Figure of the Mountain Car domain omitted; adapted from library.rl-community.org]
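For reference, a sketch of the classic Mountain Car dynamics in the RL-community formulation; the paper's "parking" variant and its reward are not reproduced here. Under partial observability only the position x is observed, so the velocity must be recovered by the delay embedding.

    import math

    def mountain_car_step(x, v, a):
        """One step of the classic dynamics; action a ∈ {-1, 0, +1}."""
        v = max(-0.07, min(0.07, v + 0.001 * a - 0.0025 * math.cos(3 * x)))
        x = x + v
        if x < -1.2:                                   # inelastic wall on the left
            x, v = -1.2, 0.0
        return min(x, 0.6), v

    def observe(x, v):
        """Partial observability: the agent sees only the position."""
        return x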

Mountain Car: parking on the hill 2/2

[Figure omitted: panels plot singular values vs. T_min (sec), embedded trajectories x(t) vs. x(t−τ), and path-to-goal length vs. training samples.]

Figure 1: Learning experiments on Mountain Car under partial observability. (a) Embedding spectrum and accompanying trajectory (E = 3, T_min = 0.70 sec.) under the random policy. (b) Learning performance as a function of embedding parameters and quantity of training data. (c) Embedding spectrum and accompanying trajectory (E = 3, T_min = 0.70 sec.) for the learned policy.

(adapted from the presented NIPS paper)

Neurostimulation Treatment of Epilepsy 1/2

[Figure omitted: (a) example field potentials under control and 0.5/1/2 Hz stimulation (scale: 200 sec, 1 mV); (b) the neurostimulation embedding spectrum, singular values vs. T_min (s); (c) the neurostimulation model plotted in its first three principal components.]

Figure 2: Graphical summary of the modeling phase of our adaptive neurostimulation study. (a) Sample observations from the fixed-frequency stimulation dataset. Seizures are labeled with horizontal lines. (b) The embedding spectrum of the fixed-frequency stimulation dataset. The large maximum of σ_2 at approximately 100 sec. is an artifact of the periodicity of seizures in the dataset. *Detail of the embedding spectrum for T_min = [0.05, 2.0], depicting a maximum of σ_2 at the time-scale of individual stimulation events. (c) The resultant neurostimulation model constructed by embedding the dataset with parameters (E = 3, T_min = 1.05 sec.). Note: the model has been desampled 5× in the plot.

(adapted from the presented NIPS paper)

Neurostimulation Treatment of Epilepsy 2/2

[Figure omitted: field potential traces across four phases, (a) Control Phase, (b) Policy Phase 1, (c) Recovery Phase, (d) Policy Phase 2, with stimulation events marked (scale: 60 sec, 2 mV).]

Figure 3: Field potential trace of a real seizure suppression experiment using a policy learned from simulation. Seizures are labeled as horizontal lines above the traces. Stimulation events are marked by vertical bars below the traces. (a) A control phase used to determine baseline seizure activity. (b) The initial application of the learned policy. (c) A recovery phase to ensure slice viability after stimulation and to recompute baseline seizure activity. (d) The second application of the learned policy. *10 minutes of trace are omitted while the algorithm was reset.

(adapted from the presented NIPS paper)

Conclusions

◮ An integration of existing methods with a new application.

◮ Many potentially interesting real-world applications ahead.
