
Non-Stationary Reinforcement Learning

Ruihao Zhu

MIT IDSS

Joint work with Wang Chi Cheung (NUS) and David Simchi-Levi (MIT)

1 / 18


Epidemic Control

A decision maker (DM) iteratively:

  1. Picks a measure to contain the virus.
  2. Sees the corresponding outcome.

Goal: Minimize the total number of infected cases.

Challenges:
  • Uncertainty: the effectiveness of each measure is unknown.
  • Bandit feedback: there is no feedback for un-chosen measures.
  • Non-stationarity: the virus might mutate throughout.

2 / 18


Epidemic Control

The DM’s actions could have long-term impacts.
  • Quarantine/lockdown stems the spread of the virus to elsewhere, but also delays key supplies from getting in.

3 / 18


Model

Model epidemic control as a Markov decision process (MDP) (Nowzari et al. 15, Kiss et al. 17). For each time step t = 1, . . . , T:
  • Observe the current state s_t ∈ {1, 2} and receive a reward, e.g., r(1) = 1 and r(2) = 0.
  • Pick an action a_t ∈ {B, G}, and transition to the next state s_{t+1} ∼ p_t(·|s_t, a_t) (unknown).
(A minimal simulation sketch follows this slide.)

4 / 18
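To make the model concrete, here is a minimal simulation sketch of the two-state, two-action MDP above. The drifting kernel p_t is a hypothetical illustration (the true kernel is unknown to the DM); the states, actions, and rewards follow the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

STATES = [1, 2]              # state 1 pays reward 1, state 2 pays reward 0
REWARD = {1: 1.0, 2: 0.0}

def p_t(t, T):
    """Hypothetical time-varying kernel p_t(.|s, a), drifting linearly over the horizon.
    Returns {(s, a): [Pr(next = 1), Pr(next = 2)]}."""
    drift = t / T  # illustration only; the DM never observes this
    return {
        (1, "B"): [0.9 - 0.3 * drift, 0.1 + 0.3 * drift],
        (1, "G"): [0.7, 0.3],
        (2, "B"): [0.1 + 0.3 * drift, 0.9 - 0.3 * drift],
        (2, "G"): [0.2, 0.8],
    }

def run(policy, T=1000):
    """Roll out a (possibly time-dependent) policy pi_t(s); return the total reward."""
    s, total = 1, 0.0
    for t in range(1, T + 1):
        total += REWARD[s]
        a = policy(t, s)
        s = rng.choice(STATES, p=p_t(t, T)[(s, a)])
    return total

print(run(lambda t, s: "B"))  # total reward of the constant policy "always play B"
```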



Model cont’d

• Task: Design a reward-maximizing policy π. For every time step t, π_t : {1, 2} → {B, G}.
• Dynamic regret (Besbes et al. 15), against the benchmark π* that knows the p_t’s:

  dyn-reg_T = E[ Σ_{t=1}^{T} r(s_t(π*)) ] − E[ Σ_{t=1}^{T} r(s_t(π)) ].

• Variation budget: ‖p_1 − p_2‖ + ‖p_2 − p_3‖ + · · · + ‖p_{T−1} − p_T‖ ≤ B_p. (A small numeric check follows this slide.)

5 / 18
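As a small worked example, the sketch below accumulates the variation budget of the hypothetical drifting kernel p_t from the earlier sketch. The per-step distance used here, the maximum over (s, a) of the L1 distance between next-state distributions, is an assumption for illustration.

```python
import numpy as np

def kernel_distance(p, q):
    """Max over (s, a) of the L1 distance between next-state distributions."""
    return max(np.abs(np.array(p[sa]) - np.array(q[sa])).sum() for sa in p)

T = 1000
B_p = sum(kernel_distance(p_t(t, T), p_t(t + 1, T)) for t in range(1, T))
print(B_p)  # ~0.6: the total movement of the kernel, not its per-step size
```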



Diameter of an MDP

• If the DM leaves state 1, she has to come back to state 1 to collect samples.
• The longer it takes to commute between states, the harder the learning process is.

Definition ((Jaksch et al. 10), informal)

Diameter = max{E[min. time(1 → 2)], E[min. time(2 → 1)]}

• Example: if the best action crosses 1 → 2 with probability 0.8 and 2 → 1 with probability 0.1, then Diameter = max{1/0.8, 1/0.1} = 10. (See the hitting-time sketch below.)

6 / 18
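The example’s arithmetic is just the expected value of a geometric random variable; a short check, assuming each step crosses independently with a fixed probability:

```python
def expected_hitting_time(p_cross):
    """Expected time to reach the other state when each step crosses
    independently with probability p_cross (geometric distribution)."""
    if p_cross <= 0.0:
        return float("inf")  # disconnected states: the diameter explodes
    return 1.0 / p_cross

# Best actions: 1 -> 2 with prob 0.8, and 2 -> 1 with prob 0.1.
print(max(expected_hitting_time(0.8), expected_hitting_time(0.1)))  # 10.0
```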


Existing Works

                          Stationary   Non-stationary
  Multi-armed bandit      OFU*         Forgetting + OFU†
  Reinforcement learning  OFU‡         ? (Forgetting + OFU)

(OFU: optimism in the face of uncertainty.)
* Auer et al. 03   † Besbes et al. 14, Cheung et al. 19   ‡ Jaksch et al. 10, Agrawal and Jia 20

7 / 18


UCB for Stationary RL

  1. Suppose at time t the visit counts are

     N_t(1, B) = 10: 5 × (1, B) → 1, 5 × (1, B) → 2
     N_t(2, B) = 10: 5 × (2, B) → 1, 5 × (2, B) → 2

     so the empirical state transition distributions are p̂_t(·|1, B) = p̂_t(·|2, B) = (0.5, 0.5).

  2. Confidence intervals:

     ‖p̂_t(·|1, B) − p(·|1, B)‖ ≤ c_t(1, B) := C/√10
     ‖p̂_t(·|2, B) − p(·|2, B)‖ ≤ c_t(2, B) := C/√10

(Both steps are sketched in code below.)

8 / 18
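A minimal sketch of steps 1–2 for a single (s, a) pair. The constant C is a hypothetical stand-in; the actual radius in (Jaksch et al. 10) also carries state/action counts and log factors.

```python
import numpy as np

def empirical_kernel_and_radius(next_states, C=1.0):
    """Steps 1-2: empirical next-state distribution and confidence radius
    c_t(s, a) = C / sqrt(N_t(s, a)) from observed transitions of one (s, a)."""
    counts = np.bincount(next_states, minlength=3)[1:]  # counts for states 1 and 2
    n = counts.sum()
    return counts / n, C / np.sqrt(n)

# Ten samples of (1, B): five landed in state 1, five in state 2.
p_hat, c = empirical_kernel_and_radius([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
print(p_hat, c)  # [0.5 0.5], C/sqrt(10) ~ 0.316
```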


UCB for Stationary RL

  3. UCB of reward: find the p̊ that maximizes Pr(visiting state 1) within the confidence interval (see the sketch below).

  4. Execute the optimal policy w.r.t. the UCB until some termination criterion is met.

9 / 18
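For the two-state example, the optimistic choice in step 3 has a closed form: shift as much mass as the radius allows toward the rewarding state. This shortcut is an assumption that only works with two states; general MDPs use extended value iteration instead.

```python
import numpy as np

def optimistic_kernel(p_hat, radius):
    """Step 3 for two states: within the L1 ball of size `radius` around p_hat,
    move probability mass toward state 1 (the rewarding state)."""
    shift = min(radius / 2.0, p_hat[1])  # +shift on one coordinate, -shift on the other
    return np.array([p_hat[0] + shift, p_hat[1] - shift])

print(optimistic_kernel(np.array([0.5, 0.5]), 0.316))  # ~[0.658, 0.342]
```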



UCB for RL cont’d

Regret analysis:
  • LCB of diameter: find the p̊ that maximizes Pr(commuting) within the confidence interval (a larger commuting probability means a shorter travel time, hence a lower bound on the diameter).
  • Regret ∝ LCB × [ Σ_{(s,a)} c_t(s, a) ].
  • Under stationarity, LCB of diameter ≤ Diameter(p).

Theorem

Denote D := Diameter(p); the regret of the UCB algorithm is O(D√T).

  • Summary: UCB of reward + LCB of diameter ⇒ low regret.

10 / 18



SWUCB for RL

Following (Cheung et al. 19):
  • SWUCB for RL: run UCB for RL on the W most recent samples only (see the sketch below).
  • The perils of drift: under non-stationarity, the LCB of the diameter can be ≫ Diameter(p_s) for all s ∈ [T].

11 / 18
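The sliding-window estimate is the same computation as before, restricted to the last W observations. A minimal sketch (the helper name and the scenario are illustrative):

```python
import numpy as np

def sliding_window_estimate(next_states, W, C=1.0):
    """SWUCB building block: empirical kernel and confidence radius from only
    the W most recent transitions of one (s, a) pair."""
    counts = np.bincount(next_states[-W:], minlength=3)[1:]
    n = counts.sum()
    return counts / n, C / np.sqrt(n)

# Old data said (1, B) -> 2; recent data says (1, B) -> 1.
# A window of W = 5 forgets the old regime entirely.
history = [2, 2, 2, 2, 2, 1, 1, 1, 1, 1]
print(sliding_window_estimate(history, W=5))  # ([1.0, 0.0], C/sqrt(5))
```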


Perils of Non-Stationarity in RL

Non-stationarity: the DM faces a time-varying environment.
Bandit feedback: the DM is not seeing everything.

Collected data: {(1, B) → 1, (2, B) → 2}.
Empirical state transition p̂_t: p̂_t(1|1, B) = 1 and p̂_t(2|2, B) = 1, so under the collected data the two states look disconnected, and the estimated diameter explodes! (Checked numerically below.)

12 / 18
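Plugging this degenerate empirical kernel into the hitting-time check from the diameter sketch makes the blow-up explicit:

```python
# Every observed (1, B) transition stayed in state 1, so the empirical
# crossing probability 1 -> 2 is 0 and the estimated diameter is infinite.
print(max(expected_hitting_time(0.0), expected_hitting_time(1.0)))  # inf
```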


Perils of Non-Stationarity in RL

But let’s still check the “LCB” of the diameter:
  • For a window size W, c_t(1, B) and c_t(2, B) can be as small as Θ(1/√W) (Cheung et al. 20).
  • Hence, the “LCB” of the diameter can be as large as Θ(√W).
  • Recall: the diameters of p_1 and p_2 are 1 ≪ Θ(√W).
  • The “LCB” is no longer a valid LCB under non-stationarity.
  • SWUCB incurs Θ(T) dynamic regret.

13 / 18



Confidence Widening

  • This caveat stems from the estimation.
  • We can refine the design principle of UCB.
  • Confidence widening: increase each confidence radius by η (see the sketch below).
  • Even when c_t is close to 0, the widened set contains a kernel with Pr(commuting) ≥ η.
  • New “LCB” ≤ 1/η.

14 / 18
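A sketch of the widening step for the two-state example, with hypothetical numbers. The point is that the widened radius caps the diameter “LCB” at 1/η, no matter how degenerate the data are:

```python
def widened_diameter_lcb(p_hat_cross, radius, eta):
    """Confidence widening: the radius becomes radius + eta, so the optimistic
    crossing probability is at least eta and the diameter "LCB" is at most 1/eta."""
    optimistic_cross = min(1.0, p_hat_cross + radius + eta)
    return 1.0 / optimistic_cross

# Degenerate data: empirical crossing probability 0, tiny radius.
print(widened_diameter_lcb(0.0, 0.01, 0.0))   # 100.0: the unwidened "LCB" is misleading
print(widened_diameter_lcb(0.0, 0.01, 0.05))  # ~16.7: always at most 1/eta = 20
```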


Confidence Widening

Recall: Regret ∝ LCB × [ Σ_{(s,a)} (c_t(s, a) + η) ].

  • If 1/η ≤ Diameter(p_t), then LCB ≤ 1/η ≤ Diameter(p_t).
  • If 1/η ≥ Diameter(p_t), then Pr(commuting) ≥ η for p_t.
  • Compare to p_1 and p_2: an η variation is detected!

15 / 18


The Blessing of More Optimism

Confidence widening ensures that either we enjoy a reasonable upper bound on the LCB, or we consume η of the variation budget.

Theorem

If we choose the optimal W and η w.r.t. B_p, the dynamic regret bound of the SWUCB-CW algorithm is Õ(D_max · B_p^{1/4} · T^{3/4}).

16 / 18


Conclusion

        Stationary   Non-stationary
  MAB   OFU          OFU + Forgetting
  RL    OFU          Extra optimism + Forgetting

  • An unfavorable “phase transition” from MAB (1 state) to RL (≥ 2 states) for SWUCB.
  • Blessing of more optimism: provably low dynamic regret for non-stationary RL.
  • Parameter-free: bandit-over-reinforcement learning (Cheung et al. 20).

17 / 18


Thank You!

rzhu@mit.edu, isecwc@nus.edu.sg, dslevi@mit.edu

18 / 18