SLIDE 1

Approximate information state for partially observed systems

Jayakumar Subramanian and Aditya Mahajan

McGill University

Thanks to Amit Sinha and Raihan Seraj for simulation results.

IEEE Conference on Decision and Control, 11 December 2019

SLIDE 2–7

Many successes of RL in recent years: AlphaGo, arcade games, robotics.

The algorithms are based on a comprehensive theory, but one restricted almost exclusively to systems with perfect state observations.

Applications with a partially observed state: healthcare, autonomous driving, finance (portfolio management), retail and marketing.

Goal: develop a comprehensive theory of approximate DP and RL for partially observed systems.

SLIDE 8

Notion of information state for partially observed systems

SLIDE 9–11

Notion of state in partially observed stochastic dynamical systems

Stochastic system: controlled input Ut, stochastic input Wt, output Yt, with Yt = ft(U1:t, W1:t). The stochastic input is not observed. Let Ht = (Y1:t−1, U1:t−1) denote the history of inputs and outputs until time t.

TRADITIONAL SOLUTION: BELIEF STATES

Step 1: Identify a state {St}t≥0 for predicting the output, assuming that the stochastic inputs are observed.
Step 2: Define a BELIEF STATE Bt ∈ Δ(𝒮): Bt(s) = ℙ(St = s | Ht = ht), s ∈ 𝒮.

Åström, "Optimal control of Markov decision processes with incomplete state information," 1965. Striebel, "Sufficient statistics in the optimal control of stochastic systems," 1965. Baum and Petrie, "Statistical inference for probabilistic functions of finite state Markov chains," 1966. Stratonovich, "Conditional Markov processes," 1960.
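To make Step 2 concrete, here is a minimal sketch of the standard Bayes-filter belief update for a finite POMDP (an illustration, not part of the talk; the matrices P and O and the toy numbers are assumptions):

```python
import numpy as np

def belief_update(b, u, y, P, O):
    """One step of the Bayes filter for a finite POMDP.

    b : current belief, shape (S,)          -- b[s] = P(S_t = s | h_t)
    u : control action taken at time t
    y : observation received at time t+1
    P : transition kernel, shape (U, S, S)  -- P[u, s, s'] = P(S_{t+1}=s' | S_t=s, U_t=u)
    O : observation kernel, shape (S, Y)    -- O[s', y] = P(Y_{t+1}=y | S_{t+1}=s')
    """
    predicted = b @ P[u]                 # P(S_{t+1} = s' | h_t, u)
    unnormalized = predicted * O[:, y]   # multiply by observation likelihood
    return unnormalized / unnormalized.sum()

# Toy 2-state, 2-observation example (hypothetical numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]]])   # single action
O = np.array([[0.8, 0.2], [0.3, 0.7]])
b = np.array([0.5, 0.5])
print(belief_update(b, u=0, y=1, P=P, O=O))  # posterior belief after seeing y = 1
```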

SLIDE 12–13

Partially observed Markov decision processes (POMDPs): pros and cons of the belief-state representation

Pro: the value function is piecewise linear and convex, a property exploited by various efficient algorithms.

Con: when the state-space model is not known analytically (as is the case for black-box models and simulators, as well as some real-world applications such as healthcare), belief states are difficult to construct and difficult to approximate from data.

Smallwood and Sondik, "The optimal control of partially observable Markov processes over a finite horizon," 1973. Cheng, "Algorithms for partially observable Markov decision processes," 1988. Kaelbling, Littman, Cassandra, "Planning and acting in partially observable stochastic domains," 1998. Pineau, Gordon, Thrun, "Point-based value iteration: an anytime algorithm for POMDPs," 2003.
SLIDE 14

Is there another way to model partially observed systems that is more amenable to approximation? Let's go back to first principles.

SLIDE 15–18

Notion of state in partially observed stochastic dynamical systems

Stochastic system: controlled input Ut, stochastic input Wt, output Yt, with Yt = ft(U1:t, W1:t). When the stochastic input is not observed, let Ht = (Y1:t−1, U1:t−1) denote the history of inputs and outputs until time t.

PREDICTING OUTPUTS ALMOST SURELY

H(1)t ∼ H(2)t if, for all future inputs (Ut:T, Wt:T), Y(1)t:T = Y(2)t:T almost surely. Too restrictive . . .

FORECASTING OUTPUTS IN DISTRIBUTION

H(1)t ∼ H(2)t if, for all future CONTROL inputs Ut:T, ℙ(Y(1)t:T | H(1)t, Ut:T) = ℙ(Y(2)t:T | H(2)t, Ut:T).

Grassberger, "Complexity and forecasting in dynamical systems," 1988. Crutchfield and Young, "Inferring statistical complexity," 1989.

SLIDE 19–22

Now let's construct the state space.

FORECASTING OUTPUTS IN DISTRIBUTION

H(1)t ∼ H(2)t if, for all future CONTROL inputs Ut:T, ℙ(Y(1)t:T | H(1)t, Ut:T) = ℙ(Y(2)t:T | H(2)t, Ut:T).

PROPERTIES OF INFORMATION STATE

The info state Zt at time t is a "compression" of past inputs that satisfies the following:

SUFFICIENT TO PREDICT ITSELF: ℙ(Zt+1 | Ht, Ut) = ℙ(Zt+1 | Zt, Ut).

SUFFICIENT TO PREDICT OUTPUT: ℙ(Yt | Ht, Ut) = ℙ(Yt | Zt, Ut).

Identifying such a Zt has the same complexity as identifying the state sufficient for forecasting outputs in the case of perfect observations (which was Step 1 of the belief-state formulation).

KEY QUESTIONS

Can this be used for dynamic programming? What is the right notion of approximations in this framework?

SLIDE 23

An information state for dynamic programming

SLIDE 24–25

Predicting outputs vs. optimizing expected rewards over time

Stochastic system: controlled input Ut, stochastic input Wt, output Yt, reward Rt, with Yt = ft(U1:t, W1:t) and Rt = rt(U1:t, W1:t). Choose Ut = gt(Y1:t−1, U1:t−1) to maximize 𝔼[ Σ_{t=1}^{T} Rt ].

PROPERTIES OF INFORMATION STATE (SUFFICIENT FOR DYNAMIC PROGRAMMING)

The info state Zt at time t is a "compression" of past inputs that satisfies the following:

SUFFICIENT TO PREDICT ITSELF: ℙ(Zt+1 | Ht, Ut) = ℙ(Zt+1 | Zt, Ut).

SUFFICIENT TO ESTIMATE EXPECTED REWARD: 𝔼[Rt | Ht, Ut] = 𝔼[Rt | Zt, Ut].
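To make the objective concrete, a minimal Monte-Carlo sketch of estimating 𝔼[Σt Rt] for a given history-dependent policy (an illustration; the system functions f and r stand in for ft and rt and are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_return(policy, f, r, T):
    """One Monte-Carlo rollout of sum_{t=1}^T R_t under U_t = policy(y's, u's)."""
    ys, us, ws, total = [], [], [], 0.0
    for t in range(1, T + 1):
        u = policy(ys, us)            # U_t = g_t(Y_{1:t-1}, U_{1:t-1})
        w = rng.standard_normal()     # stochastic input W_t (unobserved by the policy)
        us.append(u); ws.append(w)
        ys.append(f(us, ws))          # Y_t = f_t(U_{1:t}, W_{1:t})
        total += r(us, ws)            # R_t = r_t(U_{1:t}, W_{1:t})
    return total

# Averaging rollout_return over many episodes estimates E[sum_t R_t].
```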

SLIDE 26–28

Dynamic programming using the information state

PROPERTIES OF INFORMATION STATE (SUFFICIENT FOR DYNAMIC PROGRAMMING)

The info state Zt at time t is a "compression" of past inputs that satisfies the following:

SUFFICIENT TO PREDICT ITSELF: ℙ(Zt+1 | Ht, Ut) = ℙ(Zt+1 | Zt, Ut).

SUFFICIENT TO ESTIMATE EXPECTED REWARD: 𝔼[Rt | Ht, Ut] = 𝔼[Rt | Zt, Ut].

PRELIMINARY THEOREM

Let {Zt}t≥1 be any information state process. Then there is no loss of optimality in restricting attention to policies of the form Ut = g̃t(Zt). Let {Vt}_{t=1}^{T+1} denote the solution of the following dynamic program: V_{T+1}(z_{T+1}) = 0 and, for t ∈ {T, …, 1},

Qt(zt, ut) = 𝔼[ Rt + Vt+1(Zt+1) | Zt = zt, Ut = ut ],
Vt(zt) = max_{ut ∈ 𝒰} Qt(zt, ut).

A policy {g̃t}_{t=1}^{T}, g̃t: 𝒵t → 𝒰, is optimal if it satisfies g̃t(zt) ∈ arg max_{ut ∈ 𝒰} Qt(zt, ut).

Bohlin (1970); Davis and Varaiya (1972); Kumar and Varaiya (1984).
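A minimal sketch of this dynamic program for a finite information-state space and finite action set (an illustration, not from the talk; for brevity it assumes a time-homogeneous model, with r_bar and nu standing in for 𝔼[Rt | zt, ut] and ℙ(Zt+1 | zt, ut)):

```python
import numpy as np

def info_state_dp(r_bar, nu, T):
    """Finite-horizon DP over a finite info-state space.

    r_bar : shape (Z, U)    -- r_bar[z, u] = E[R_t | Z_t = z, U_t = u]
    nu    : shape (Z, U, Z) -- nu[z, u, z'] = P(Z_{t+1} = z' | Z_t = z, U_t = u)
    Returns value functions V[t] and greedy policies g[t], t = 1, ..., T.
    """
    Z, U = r_bar.shape
    V = np.zeros((T + 2, Z))            # V[T+1] = 0, as in the slide
    g = np.zeros((T + 1, Z), dtype=int)
    for t in range(T, 0, -1):
        Q = r_bar + nu @ V[t + 1]       # Q_t(z, u) = r_bar[z,u] + sum_z' nu[z,u,z'] V_{t+1}(z')
        V[t] = Q.max(axis=1)            # V_t(z) = max_u Q_t(z, u)
        g[t] = Q.argmax(axis=1)         # greedy policy g~_t(z)
    return V, g
```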

SLIDE 29

What about approximations?

SLIDE 30–31

Preliminary: a family of pseudometrics on probability distributions

INTEGRAL PROBABILITY METRIC (IPM)

Let 𝒬 denote the set of probability measures on a measurable space (𝒴, 𝔋). Given a class 𝔊 of real-valued bounded measurable functions on (𝒴, 𝔋), the integral probability metric (IPM) between two probability distributions μ, ν ∈ 𝒬 is given by

d𝔊(μ, ν) = sup_{f ∈ 𝔊} | ∫𝒴 f dμ − ∫𝒴 f dν |.

EXAMPLES

If 𝔊 = {f : ‖f‖∞ ≤ 1}, d𝔊 is the total variation distance. If 𝔊 = {f : |f|L ≤ 1}, d𝔊 is the Wasserstein distance. If 𝔊 = {f : ‖f‖∞ + |f|L ≤ 1}, d𝔊 is the Dudley metric. …

We say a function f has 𝔊-constant K if f/K ∈ 𝔊.

Müller, "Integral probability metrics and their generating classes of functions," 1997.
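For finite distributions, the first two IPMs above have closed forms; a small sketch (an illustration, assuming distributions supported on points of the real line for the Wasserstein case):

```python
import numpy as np

def total_variation(mu, nu):
    # IPM with G = {f : ||f||_inf <= 1} equals the L1 distance sum |mu - nu|
    # (i.e., twice the usual total-variation norm).
    return np.abs(mu - nu).sum()

def wasserstein1(points, mu, nu):
    # IPM with G = {f : Lip(f) <= 1} on the real line:
    # W1 = integral of |CDF_mu - CDF_nu|.
    order = np.argsort(points)
    x = points[order]
    cdf_gap = np.abs(np.cumsum(mu[order]) - np.cumsum(nu[order]))[:-1]
    return float(np.sum(cdf_gap * np.diff(x)))

points = np.array([0.0, 1.0, 2.0])
mu = np.array([0.5, 0.5, 0.0])
nu = np.array([0.0, 0.5, 0.5])
print(total_variation(mu, nu))       # 1.0
print(wasserstein1(points, mu, nu))  # 1.0
```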

SLIDE 32

Approximate information state

(ε, δ)-APPROXIMATE INFORMATION STATE (AIS)

Given a function class 𝔊, a compression {Zt}t≥1 of the history (i.e., Zt = φt(Ht)) is called an {(εt, δt)}t≥1 AIS if there exist a function R̃t(Zt, Ut) and a stochastic kernel νt(Zt+1 | Zt, Ut) such that:

(i) |𝔼[Rt | Ht = ht, Ut = ut] − R̃t(φt(ht), ut)| ≤ εt;

(ii) for any Borel subset A of the information-state space, defining μt(A) = ℙ(Zt+1 ∈ A | Ht = ht, Ut = ut), we have d𝔊(μt, νt(⋅ | φt(ht), ut)) ≤ δt.

SLIDE 33–34

Approximate dynamic programming using AIS

MAIN THEOREM

Given a function class 𝔊, let {Zt}t≥1, where Zt = φt(Ht), be an {(εt, δt)}t≥1 AIS. Recursively define the following functions: V̂_{T+1}(z_{T+1}) = 0 and, for t ∈ {T, …, 1},

V̂t(zt) = max_{ut ∈ 𝒰} { R̃t(zt, ut) + ∫ V̂t+1(zt+1) νt(dzt+1 | zt, ut) }.

Let π = (π1, …, πT) denote the corresponding policy. Then, if the value function V̂t has 𝔊-constant Kt:

for any history ht, |Vt(ht) − V̂t(φt(ht))| ≤ εT + Σ_{s=t}^{T−1} (εs + K_{s+1} δs);

for any history ht, |Vt(ht) − Vπt(ht)| ≤ 2 [ εT + Σ_{s=t}^{T−1} (εs + K_{s+1} δs) ].

SLIDE 35–38

AIS: some remarks

In the definition of AIS, we can replace d𝔊(μt, νt(⋅ | Zt = φt(ht), Ut = ut)) ≤ δt by

Zt+1 = function(Zt, Yt+1, Ut) and d𝔊(ℙ(Yt | Ht = ht, Ut = ut), ℙ(Yt | Zt = φt(ht), Ut = ut)) ≤ δt.

The AIS process {Zt}t≥1 need not be Markov!

Two ways to interpret the results: (i) given the information state space 𝒵, find the best compression φt: ℋt → 𝒵; (ii) given any compression function φt: ℋt → 𝒵t, find the approximation error.

The results naturally extend to infinite horizon.
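The recurrent form Zt+1 = function(Zt, Yt+1, Ut) is what makes the definition implementable with a recurrent network; a minimal sketch of one such update rule (a hypothetical illustration, not the authors' architecture):

```python
import numpy as np

def rnn_ais_update(z, y_next, u, W_z, W_y, W_u, b):
    """One recurrent AIS update: z_{t+1} = tanh(W_z z_t + W_y y_{t+1} + W_u u_t + b).

    The weight matrices W_z, W_y, W_u and bias b are the learned parameters
    of the compression (hypothetical names for illustration).
    """
    return np.tanh(W_z @ z + W_y @ y_next + W_u @ u + b)
```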

SLIDE 39

Some examples

SLIDE 40–41

Example 1: Error bounds on state aggregation

Consider an MDP with state space 𝒳 and per-step reward Rt = r(Xt, Ut). Suppose 𝒳 is quantized to a discrete set 𝒵 using φ: 𝒳 → 𝒵. Let z = φ(x) denote the label of x; then φ⁻¹(z) denotes all states with label z.

{Zt}t≥1 IS AN (ε, δ) AIS, with

ε = sup_{(x,u) ∈ 𝒳×𝒰} |r(x, u) − r(φ(x), u)|, or, equivalently, r(⋅, u) has 𝔊-constant Kr;

δ = sup_{(x,u) ∈ 𝒳×𝒰} d𝔊( ℙ(X+ | X = x, U = u), ℙ(X+ | X ∈ φ⁻¹(φ(x)), U = u) ), or, equivalently, ℙ(X+ | X = ⋅, U = u) has 𝔊-constant Kd.

Bertsekas, "Convergence of discretization procedures in dynamic programming," 1975.
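A small sketch of computing ε and δ for a given labeling φ (an illustration, assuming a finite MDP, uniform-weight cell averages for the aggregated model, and the total-variation-type IPM):

```python
import numpy as np

def aggregation_eps_delta(r, P, phi):
    """epsilon/delta of a state aggregation, using the TV-type IPM.

    r   : shape (X, U)     -- rewards r[x, u]
    P   : shape (X, U, X)  -- P[x, u, x'] = P(X+ = x' | X = x, U = u)
    phi : shape (X,) ints  -- phi[x] = label z of state x
    Aggregated reward/dynamics are cell averages (uniform weights: an assumption).
    """
    X, U = r.shape
    Z = phi.max() + 1
    eps = delta = 0.0
    for z in range(Z):
        cell = np.flatnonzero(phi == z)
        r_cell = r[cell].mean(axis=0)     # aggregated reward for label z, shape (U,)
        P_cell = P[cell].mean(axis=0)     # aggregated next-state dist., shape (U, X)
        for x in cell:
            eps = max(eps, np.abs(r[x] - r_cell).max())
            delta = max(delta, np.abs(P[x] - P_cell).sum(axis=1).max())
    return eps, delta
```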

SLIDE 42–44

Example 2: Approximation bounds for using quantized observations

[Figure: video observation Yt → Vision → Memory → RL agent, with quantized observation Ŷt and compression Zt]

Proposed as a heuristic algorithm; no performance bounds.

{Zt}t≥1 IS AN (ε, δ) AIS, with

ε = sup_{ht, ut} |𝔼[Rt | ht, ut] − R̃t(φt(ht), ut)|,
δ = sup_{ht, ut} d𝔊( ℙ(Ŷt+1 | ht, ut), ℙ(Ŷt+1 | φt(ht), ut) ).

Ha and Schmidhuber, "World Models," 2018.

SLIDE 45–50

Example 3: Approximation bounds for mean-field teams

n agents, with state X^i_t and control U^i_t for agent i.

Empirical mean-field: Mt(x) = (1/n) Σ_{i=1}^{n} δ_{X^i_t}(x).
Statistical mean-field: m̄t(x) = ℙ(X^i_t = x).

Dynamics: ℙ(𝐗t+1 | 𝐗t, 𝐔t) = ∏_{i=1}^{n} P(X^i_{t+1} | X^i_t, U^i_t, Mt).
Per-step reward: R(𝐗t, 𝐔t) = (1/n) Σ_{i=1}^{n} r(X^i_t, U^i_t, Mt).

Info structure: I^i_t = {X^i_t}. Expanded info structure: Ĩ^i_t = {X^i_t, Mt}. Hence J∗ ≤ J̃∗.

(A) r(x, u, m) and P(y | x, u, m) are Lipschitz in m. Then {m̄t}t≥1 is an (ε, δ) AIS for the expanded info structure, where ε, δ ∈ 𝒪(1/√n). Consequently J̃∗ − J̄∗ ≤ K/√n, and therefore J̄∗ ≤ J∗ ≤ J̄∗ + K/√n.
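The empirical mean-field is just the histogram of agent states; a quick sketch (an illustration with hypothetical numbers):

```python
import numpy as np

def empirical_mean_field(states, num_x):
    """M_t(x) = (1/n) * #{i : X^i_t = x} for finite state space {0, ..., num_x - 1}."""
    return np.bincount(states, minlength=num_x) / len(states)

states = np.array([0, 2, 2, 1, 0, 2])    # states of n = 6 agents (hypothetical)
print(empirical_mean_field(states, 3))   # [0.333..., 0.166..., 0.5]
```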

SLIDE 51

Now to reinforcement learning for partially observed systems.

SLIDE 52–54

Reinforcement learning setup

[Figure: observation Yt and previous action Ut−1 feed an AIS encoder (state aggregator) that produces Zt; an AIS decoder outputs R̃t and νt; a critic (value approximator) and an actor (policy approximator) operate on Zt and produce Ut.]

State aggregator: loss ℒ_AIS = αt |R̃t − Rt| + (1 − αt) d𝔊(νt, μt); ξ denotes the parameters of the aggregator, updated using SGD with learning rate ak.

Value approximator (critic): φ denotes the parameters of the Q(z, u) approximator, updated using TD(0) or TD(λ) with learning rate bk.

Policy approximator (actor): θ denotes the parameters of π(u | z), updated using policy gradient with learning rate ck.
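A minimal sketch of the AIS loss for a single transition (an illustration; using the total-variation IPM and a point mass at the realized next information state as a sample-based stand-in for the unavailable μt is an assumption here, not necessarily the talk's exact choice):

```python
import numpy as np

def ais_loss(r_pred, r_obs, nu_pred, z_next, alpha):
    """Sample-based surrogate of L_AIS = alpha*|R~_t - R_t| + (1-alpha)*d_G(nu_t, mu_t).

    nu_pred : predicted distribution over the (finite) next info state, shape (Z,)
    z_next  : realized next info state (index); its point mass stands in for mu_t
              -- a common surrogate, and an assumption here.
    """
    mu_sample = np.zeros_like(nu_pred)
    mu_sample[z_next] = 1.0
    tv = np.abs(nu_pred - mu_sample).sum()   # TV-type IPM between nu_t and the sample
    return alpha * abs(r_pred - r_obs) + (1 - alpha) * tv
```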

SLIDE 55

Reinforcement learning setup: convergence

CONVERGENCE RESULT

Suppose the learning rates satisfy the conditions for three time-scale stochastic approximation, the compatibility condition

∂Q(z, u)/∂φ = (1/π(u | z)) ∂π(u | z)/∂θ

holds, and additional mild technical conditions hold. Then: the state aggregator converges (with some approximation error); the critic converges to the best approximator within the specified family; and the actor converges to a local maximizer within the family of policy approximators.
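For instance, polynomially decaying step sizes can satisfy the three-time-scale conditions; the exponents below are one illustrative choice (an assumption, not from the talk), with the aggregator on the fastest time scale and the actor on the slowest:

```python
def learning_rates(k):
    # Each schedule is square-summable but not summable, and
    # b_k / a_k -> 0 and c_k / b_k -> 0, separating the three time scales.
    a_k = (k + 1) ** -0.6   # state aggregator (fastest)
    b_k = (k + 1) ** -0.8   # critic
    c_k = (k + 1) ** -1.0   # actor (slowest)
    return a_k, b_k, c_k
```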

SLIDE 56

Numerical Results: 5 × 5 Grid Environment

SLIDE 57

5 × 5 Grid Environment

[Plot: performance vs. number of samples (×10⁶); curves: planning solution, RPG, AIS]

SLIDE 58

Numerical Results: Tiger Environment

SLIDE 59

Tiger Environment

[Plot: performance vs. number of samples (×10⁶); curves: planning solution, RPG, AIS]

SLIDE 60

Numerical Results: Cheese Maze Environment

SLIDE 61

Cheese Maze Environment

[Plot: performance vs. number of samples (×10⁶); curves: planning solution, RPG, AIS]

SLIDE 62–72

Summary

(These closing slides recap the main results of the talk: forecasting outputs in distribution; the defining properties of an information state; the (ε, δ)-approximate information state; the main approximation theorem; Examples 1–3 (state aggregation, quantized observations, and mean-field teams); and the numerical results for the grid, tiger, and cheese-maze environments.)

AIS provides a conceptually clean framework for approximate DP and online RL in partially observed systems.