
Non-Stationary Reinforcement Learning



  1. Non-Stationary Reinforcement Learning
  Ruihao Zhu (MIT IDSS), joint work with Wang Chi Cheung (NUS) and David Simchi-Levi (MIT).

  2. Epidemic Control
  A decision maker (DM) iteratively:
  1. picks a measure to contain the virus;
  2. sees the corresponding outcome.
  Goal: minimize the total number of infected cases.
  Challenges:
  ◮ Uncertainty: the effectiveness of each measure is unknown.
  ◮ Bandit feedback: there is no feedback for un-chosen measures.
  ◮ Non-stationarity: the virus might mutate throughout.

  3. Epidemic Control
  The DM's action can have a long-term impact.
  ◮ A quarantine or lockdown stems the spread of the virus to other regions, but also delays key supplies from getting in.

  4. Model
  Model epidemic control as a Markov decision process (MDP) (Nowzari et al. 15, Kiss et al. 17).
  For each time step t = 1, ..., T:
  ◮ Observe the current state s_t ∈ {1, 2} and receive a reward, e.g., r(1) = 1 and r(2) = 0.
  ◮ Pick an action a_t ∈ {B, G} and transition to the next state s_{t+1} ~ p_t(·|s_t, a_t) (unknown).
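
  As a concrete illustration of this setup, here is a minimal Python sketch of the two-state, two-action loop. The drifting kernel p_t and the uniformly random policy are illustrative assumptions, not part of the talk.

```python
import numpy as np

# Minimal sketch of the two-state, two-action MDP described above.
# State 1 pays reward 1, state 2 pays 0; actions are {"B", "G"}.
# The transition kernels p_t below are hypothetical placeholders.

rng = np.random.default_rng(0)
T = 1000
states, actions = [1, 2], ["B", "G"]
r = {1: 1.0, 2: 0.0}

def p_t(t, s, a):
    """Time-varying distribution over the next state given (s, a); drifts with t."""
    base = 0.8 if a == "B" else 0.5
    if s == 2:
        base -= 0.3                               # harder to return to state 1 from state 2
    drift = 0.2 * np.sin(2 * np.pi * t / T)       # slow change to mimic non-stationarity
    prob1 = float(np.clip(base + drift, 0.05, 0.95))
    return {1: prob1, 2: 1.0 - prob1}

s = 1
total_reward = 0.0
for t in range(1, T + 1):
    total_reward += r[s]                          # reward depends only on the current state
    a = rng.choice(actions)                       # placeholder policy: uniform random
    dist = p_t(t, s, a)
    s = rng.choice(states, p=[dist[1], dist[2]])  # s_{t+1} ~ p_t(. | s_t, a_t)

print(f"total reward of the random policy over T={T} steps: {total_reward:.0f}")
```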

  5. Model, cont'd
  ◮ Task: design a reward-maximizing policy π, where for every time step t, π_t : {1, 2} → {B, G}.
  ◮ Dynamic regret (Besbes et al. 15):
    dym-reg_T = E[ Σ_{t=1}^T r(s_t(π*)) ] − E[ Σ_{t=1}^T r(s_t(π)) ],
    where the benchmark π* knows the p_t's.
  ◮ Variation budget: ‖p_1 − p_2‖ + ‖p_2 − p_3‖ + ... + ‖p_{T−1} − p_T‖ ≤ B_p.
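
  The variation budget is just a sum of distances between consecutive kernels, so it is straightforward to compute when the kernels are known. A minimal sketch, assuming a hypothetical drifting kernel and taking the norm to be the maximum over state-action pairs of the L1 distance between next-state distributions (the transcript does not pin the norm down):

```python
import numpy as np

# Sketch: computing the variation budget B_p = sum_t || p_t - p_{t+1} ||
# for a known, hypothetical sequence of transition kernels on the toy MDP.

T = 1000
states, actions = [1, 2], ["B", "G"]

def kernel(t):
    """Hypothetical kernel p_t: maps (s, a) to a distribution over next states."""
    q = 0.8 + 0.1 * np.sin(2 * np.pi * t / T)     # slowly drifting parameter
    return {(s, a): np.array([q, 1 - q]) if a == "B" else np.array([0.5, 0.5])
            for s in states for a in actions}

B_p = 0.0
for t in range(1, T):
    p_now, p_next = kernel(t), kernel(t + 1)
    # max over (s, a) of the L1 distance between consecutive next-state distributions
    B_p += max(np.abs(p_now[sa] - p_next[sa]).sum() for sa in p_now)

print(f"variation budget B_p over T={T} steps: {B_p:.3f}")
```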

  6. Diameter of an MDP
  ◮ If the DM leaves state 1, she has to come back to state 1 to collect samples.
  ◮ The longer it takes to commute between states, the harder the learning process.
  Definition ((Jaksch et al. 10), informal):
    Diameter = max{ E[min. time(1 → 2)], E[min. time(2 → 1)] }.
  Example: Diameter = max{1/0.8, 1/0.1} = 10.
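
  For a two-state chain the minimal expected hitting times are geometric, so the diameter of the example can be computed directly. A small sketch reproducing the max{1/0.8, 1/0.1} = 10 number; the remaining escape probabilities are made up for illustration.

```python
# Sketch of the diameter computation on the two-state example.
# q[(s, a)] is the per-step probability that action a moves the chain from s to
# the other state; 0.8 and 0.1 reproduce the example above, the rest is assumed.

q = {
    (1, "B"): 0.8, (1, "G"): 0.6,   # probabilities of leaving state 1
    (2, "B"): 0.1, (2, "G"): 0.05,  # probabilities of leaving state 2
}

def min_expected_hitting_time(s):
    """With two states, the fastest policy repeats the action with the largest
    escape probability, so the hitting time is geometric with mean 1 / q."""
    best_q = max(q[(s, a)] for a in ("B", "G"))
    return 1.0 / best_q

diameter = max(min_expected_hitting_time(1), min_expected_hitting_time(2))
print(f"diameter = {diameter:.1f}")   # max(1/0.8, 1/0.1) = 10.0
```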

  7. Existing Works
                              Stationary    Non-stationary
  Multi-armed bandit          OFU*          Forgetting + OFU†
  Reinforcement learning      OFU‡          ? (Forgetting + OFU)
  * Auer et al. 03   † Besbes et al. 14, Cheung et al. 19   ‡ Jaksch et al. 10, Agrawal and Jia 20

  8. UCB for Stationary RL
  1. Suppose that at time t,
     N_t(1, B) = 10: 5 × (1, B) → 1, 5 × (1, B) → 2,
     N_t(2, B) = 10: 5 × (2, B) → 1, 5 × (2, B) → 2.
  Empirical state transition distributions: p̂_t(·|1, B) = (0.5, 0.5) and p̂_t(·|2, B) = (0.5, 0.5).
  2. Confidence intervals:
     ‖p̂_t(·|1, B) − p(·|1, B)‖ ≤ c_t(1, B) := C/√10,
     ‖p̂_t(·|2, B) − p(·|2, B)‖ ≤ c_t(2, B) := C/√10.
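
  A small sketch of steps 1-2, assuming the counts on the slide and treating the constant C as a free parameter (in UCRL2-style analyses it absorbs log factors and the number of states):

```python
import numpy as np

# Empirical transition estimates and L1 confidence radii from the counts above:
# 10 samples per state-action pair, split 5/5 between the two next states.

counts = {
    (1, "B"): np.array([5, 5]),   # 5 times (1,B) -> 1 and 5 times (1,B) -> 2
    (2, "B"): np.array([5, 5]),   # 5 times (2,B) -> 1 and 5 times (2,B) -> 2
}
C = 2.0                            # placeholder constant

for (s, a), n_sa in counts.items():
    N = n_sa.sum()
    p_hat = n_sa / N                       # empirical next-state distribution
    radius = C / np.sqrt(N)                # c_t(s, a) := C / sqrt(N_t(s, a))
    print(f"({s},{a}): p_hat = {p_hat}, confidence radius = {radius:.3f}")
```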

  9. UCB for Stationary RL
  3. UCB of reward: find the kernel p̊ within the confidence intervals that maximizes Pr(visiting state 1).
  4. Execute the optimal policy w.r.t. the UCB until some termination criterion is met.
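
  In the two-state case, the optimistic kernel in step 3 simply shifts as much probability mass as the confidence ball allows onto the rewarding state 1. The following sketch covers only that special case (a stand-in for the extended value iteration used in UCRL2-style algorithms); the estimate and radius are carried over from the previous sketch.

```python
import numpy as np

# Optimistic choice for the two-state case: within the L1 ball of radius
# c_t(s, a) around the empirical estimate, the distribution that maximizes the
# probability of reaching the rewarding state 1 moves as much mass as allowed
# onto state 1. Estimates and constants below are illustrative.

def optimistic_next_state_dist(p_hat, radius):
    """Return the distribution in {q : ||q - p_hat||_1 <= radius} maximizing q[0],
    where index 0 is state 1 (reward 1) and index 1 is state 2 (reward 0)."""
    shift = min(radius / 2.0, p_hat[1])     # moving x mass costs 2x in L1 distance
    return np.array([p_hat[0] + shift, p_hat[1] - shift])

p_hat = np.array([0.5, 0.5])                # empirical estimate from the 5/5 counts
radius = 2.0 / np.sqrt(10)                  # c_t(s, a) with C = 2, N = 10
p_opt = optimistic_next_state_dist(p_hat, radius)
print(f"optimistic transition toward state 1: {p_opt}")
```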

  10. UCB for RL, cont'd
  Regret analysis:
  ◮ LCB of diameter: find the kernel p̊ within the confidence intervals that maximizes Pr(commuting between states).
  ◮ Regret ∝ LCB × Σ_{(s,a)} c_t(s, a).
  ◮ Under stationarity, the LCB of the diameter ≤ Diameter(p).
  Theorem. Denote D := Diameter(p); the regret of the UCB algorithm is O(D√T).
  ◮ Summary: UCB of reward + LCB of diameter ⇒ low regret.
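
  The LCB of the diameter is computed in the same optimistic spirit, tilting mass toward faster commuting instead of higher reward. A sketch for the two-state case with illustrative estimates and radii:

```python
import numpy as np

# Sketch of the "LCB of diameter" step: within each confidence ball, pick the
# kernel that makes commuting between the two states as fast as possible, then
# compute the diameter of that optimistic kernel. All numbers are placeholders.

p_hat_escape = {            # empirical probability of leaving each state, per action
    (1, "B"): 0.5, (1, "G"): 0.4,
    (2, "B"): 0.5, (2, "G"): 0.3,
}
radius = {sa: 2.0 / np.sqrt(10) for sa in p_hat_escape}   # c_t(s, a)

def lcb_diameter():
    hitting = []
    for s in (1, 2):
        # Most favorable per-step escape probability allowed by the confidence ball.
        best = max(min(1.0, p_hat_escape[(s, a)] + radius[(s, a)] / 2.0)
                   for a in ("B", "G"))
        hitting.append(1.0 / best)          # geometric expected hitting time
    return max(hitting)

print(f"LCB of diameter: {lcb_diameter():.2f}")
```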

  11. SWUCB for RL
  Following (Cheung et al. 19):
  ◮ SWUCB for RL: UCB for RL run on only the W most recent samples.
  ◮ The perils of drift: under non-stationarity, LCB of diameter ≫ Diameter(p_s) for all s ∈ [T].
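
  A minimal sketch of the forgetting mechanism: estimates are rebuilt from only the W most recent transitions, here with a made-up window size and data stream.

```python
from collections import deque
import numpy as np

# Sliding-window estimation: only the W most recent transitions are kept, so
# stale data from an old environment is forgotten. W and the stream are illustrative.

W = 200
window = deque(maxlen=W)        # stores (s, a, next_s) triples, oldest dropped first

def record(s, a, next_s):
    window.append((s, a, next_s))

def windowed_estimate(s, a):
    """Empirical next-state distribution over {1, 2} using only the window."""
    nxt = [ns for (ss, aa, ns) in window if ss == s and aa == a]
    if not nxt:
        return None, 0
    counts = np.array([nxt.count(1), nxt.count(2)], dtype=float)
    return counts / counts.sum(), len(nxt)

# Example: an environment that switches halfway through the stream.
for t in range(400):
    record(1, "B", 1 if t < 200 else 2)
p_hat, n = windowed_estimate(1, "B")
print(f"windowed estimate of p(.|1,B): {p_hat} from {n} samples")
```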

  12. Perils of Non-Stationarity in RL
  Non-stationarity: the DM faces a time-varying environment.
  Bandit feedback: the DM is not seeing everything.
  Collected data: {(1, B) → 1, (2, B) → 2}.
  Empirical state transition p̂_t: the data contains only self-loops, so each state appears absorbing under the estimated kernel.
  The diameter (estimate) explodes!
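
  A small sketch of why the estimate blows up: with only self-loop observations, even a confidence-widened escape probability stays small, so the resulting diameter estimate ends up far above the true diameter of the earlier example. The numbers are illustrative.

```python
# With only the self-loop observations {(1,B) -> 1, (2,B) -> 2}, the empirical
# kernel keeps each state where it is, so even the optimistic (confidence-widened)
# commute probability can be tiny and the diameter estimate explodes.

p_hat_escape = {1: 0.0, 2: 0.0}          # empirical probability of leaving each state
radius = 0.1                              # small radius: many samples, but stale ones

def optimistic_diameter(escape, rad):
    times = []
    for s in (1, 2):
        best = min(1.0, escape[s] + rad / 2.0)   # most favorable escape probability
        times.append(float("inf") if best == 0 else 1.0 / best)
    return max(times)

print(f"diameter estimate from the stale data: {optimistic_diameter(p_hat_escape, radius):.1f}")
# With escape probability at most 0.05 per step, the estimate is at least 20,
# well above the true diameter of 10 in the earlier example.
```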
