Learning Fair Policies in Multiobjective (Deep) Reinforcement Learning with Average and Discounted Rewards




  1. Learning Fair Policies in Multiobjective (Deep) Reinforcement Learning with Average and Discounted Rewards. Umer Siddique, Paul Weng, and Matthieu Zimmer. University of Michigan-Shanghai Jiao Tong University Joint Institute. ICML 2020.

  2. Overview: (1) Motivation and Problem; (2) Theoretical Discussion & Algorithms; (3) Experimental Results; (4) Conclusion.

  3-5. Motivation: Why should we care about fair systems? Figure: Network with a fat-tree topology, from Ruffy et al. (2019). Fair treatment of the system's users is crucial. Existing approaches to this issue include the utilitarian approach and the egalitarian approach.

  6-8. Fairness. Fairness includes efficiency, impartiality, and equity. Fairness is encoded in a Social Welfare Function (SWF). We focus on the generalized Gini social welfare function (GGF).

  9-13. Problem Statement. The GGF is defined as
  $$\mathrm{GGF}_w(v) = \sum_{i=1}^{D} w_i\, v^{\uparrow}_i = w^{\top} v^{\uparrow},$$
  where $v^{\uparrow}$ is the vector $v$ sorted in increasing order and the weights are strictly decreasing, $w_1 > w_2 > \dots > w_D$.
  The fair optimization problem in RL is
  $$\arg\max_{\pi}\; \mathrm{GGF}_w(J(\pi)) \qquad (1)$$
  where $J(\pi) = \mathbb{E}^{\pi}_{P}\!\left[\sum_{t=1}^{\infty} \gamma^{t-1} R_t\right]$ ($\gamma$-discounted rewards) or $J(\pi) = \lim_{h \to \infty} \frac{1}{h}\, \mathbb{E}^{\pi}_{P}\!\left[\sum_{t=1}^{h} R_t\right]$ (average rewards).
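To make the GGF concrete, here is a minimal NumPy sketch (not from the talk) that evaluates GGF_w(v); the weights and objective vectors below are purely illustrative, chosen only to satisfy the strictly decreasing, positive-weight condition.

```python
import numpy as np

def ggf(v, w):
    """Generalized Gini social welfare: dot product of the decreasing
    weight vector w with the components of v sorted in increasing order."""
    return float(np.dot(w, np.sort(v)))

# Illustrative example with D = 3 objectives and weights 1, 1/2, 1/4.
w = np.array([1.0, 0.5, 0.25])          # strictly decreasing, positive
balanced   = np.array([3.0, 3.0, 3.0])
unbalanced = np.array([9.0, 0.0, 0.0])  # same total as `balanced`

print(ggf(balanced, w))    # 5.25
print(ggf(unbalanced, w))  # 2.25 -> GGF prefers the balanced vector
```

Because the largest weight multiplies the smallest component, GGF favors balanced objective vectors while still increasing in every component, which is how it combines the efficiency, impartiality, and equity properties listed above.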

  14-18. Theoretical Discussion. Assumption: MDPs are weakly communicating.
  Sufficiency of stationary Markov policies: a stationary Markov fair-optimal policy exists.
  Possibly state-dependent optimality: with the average reward, fair optimality stays state-independent.
  Contribution on approximation error: approximate the average-optimal policy $\pi^{*}_{1}$ with a $\gamma$-optimal policy $\pi^{*}_{\gamma}$. Theorem:
  $$\mathrm{GGF}_w\!\left(\mu(\pi^{*}_{\gamma})\right) \;\ge\; \mathrm{GGF}_w\!\left(\mu(\pi^{*}_{1})\right) - R\,(1-\gamma)\left(\rho\!\left(\gamma, \sigma(H_{P^{\pi^{*}_{1}}})\right) + \rho\!\left(\gamma, \sigma(H_{P^{\pi^{*}_{\gamma}}})\right)\right),$$
  where $R = \max_{\pi} \lVert R_{\pi} \rVert_{1}$ and $\rho(\gamma, \sigma) = \dfrac{\sigma}{\gamma - (1-\gamma)\,\sigma}$.
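As a quick sanity check on the shape of this bound (not part of the talk), the sketch below evaluates the error term R(1 − γ)(ρ(γ, σ₁) + ρ(γ, σ_γ)) for hypothetical values of R and the two σ terms; it only illustrates that the gap between the γ-optimal and average-optimal GGF values shrinks as γ approaches 1.

```python
def rho(gamma, sigma):
    """rho(gamma, sigma) = sigma / (gamma - (1 - gamma) * sigma);
    requires gamma > sigma / (1 + sigma) so the denominator stays positive."""
    return sigma / (gamma - (1.0 - gamma) * sigma)

def ggf_gap_bound(R, gamma, sigma_1, sigma_gamma):
    """Error term on the right-hand side of the theorem:
    R * (1 - gamma) * (rho(gamma, sigma_1) + rho(gamma, sigma_gamma))."""
    return R * (1.0 - gamma) * (rho(gamma, sigma_1) + rho(gamma, sigma_gamma))

# Hypothetical constants, chosen only to show the trend in gamma.
R, sigma_1, sigma_gamma = 1.0, 2.0, 2.0
for gamma in (0.9, 0.99, 0.999):
    print(gamma, ggf_gap_bound(R, gamma, sigma_1, sigma_gamma))
# The printed bound decreases roughly linearly in (1 - gamma), approaching 0.
```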

  19-20. Value-Based and Policy Gradient Algorithms. GGF-DQN: the Q-network takes values in $\mathbb{R}^{|\mathcal{A}| \times D}$, instead of $\mathbb{R}^{|\mathcal{A}|}$, and is trained with the target
  $$\hat{Q}_{\theta}(s, a) = r + \gamma\, \hat{Q}_{\theta'}(s', a^{*}), \quad \text{where } a^{*} = \operatorname*{arg\,max}_{a' \in \mathcal{A}} \mathrm{GGF}_w\!\left(r + \gamma\, \hat{Q}_{\theta'}(s', a')\right).$$
  To optimize the GGF with policy gradient:
  $$\nabla_{\theta}\, \mathrm{GGF}_w(J(\pi_{\theta})) = \nabla_{J(\pi_{\theta})}\, \mathrm{GGF}_w(J(\pi_{\theta})) \cdot \nabla_{\theta} J(\pi_{\theta}) = w_{\sigma}^{\top} \cdot \nabla_{\theta} J(\pi_{\theta}).$$
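A minimal sketch (assumed shapes and names, not the authors' code) of the two ingredients above: GGF-greedy action selection over a vector-valued Q-network output of shape (|A|, D) for the GGF-DQN target, and the permuted weight vector w_σ that plays the role of the GGF gradient in the policy-gradient update.

```python
import numpy as np

def ggf(v, w):
    """GGF_w(v) = <w, v sorted in increasing order>."""
    return float(np.dot(w, np.sort(v)))

def ggf_greedy_action(q_next, r, gamma, w):
    """Action a* used in the GGF-DQN target.
    q_next: array of shape (|A|, D) from the target network at s'.
    r:      reward vector of shape (D,).
    Returns argmax_{a'} GGF_w(r + gamma * Q_target(s', a'))."""
    scores = [ggf(r + gamma * q_next[a], w) for a in range(q_next.shape[0])]
    return int(np.argmax(scores))

def permuted_weights(j, w):
    """w_sigma: reorder w so the largest weight is paired with the smallest
    component of the objective vector j. Where the sort of j is unique,
    GGF_w is locally linear and its gradient with respect to J is w_sigma."""
    w_sigma = np.empty_like(w)
    w_sigma[np.argsort(j)] = w   # smallest j_i receives w_1, next w_2, ...
    return w_sigma
```

The vector Q-network would then be regressed componentwise toward r + γ Q_{θ'}(s', a*), and w_σᵀ ∇_θ J(π_θ) would be plugged into a standard A2C or PPO update; those training loops are omitted here.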

  21-22. Experimental Results. What is the impact of optimizing GGF instead of the average of the objectives?
  [Figure: Species Conservation, GGF score of A2C, GGF-A2C, PPO, and GGF-PPO.]
  [Figure: Species Conservation, average density of sea otters and abalones under A2C, GGF-A2C, PPO, and GGF-PPO.]

  23-24. Experimental Results. What is the price of fairness? How do these algorithms perform in continuous domains?
  [Figure: Species Conservation, average accumulated density vs. number of steps for PPO and GGF-PPO.]
  [Figure: Network Congestion Control, average accumulated bandwidth vs. number of episodes for PPO and GGF-PPO.]

  25. Experimental Results (Traffic Light Control). What is the effect of γ with respect to GGF-average optimality?
  [Figure: Traffic Light Control, GGF score for PPO and GGF-PPO with γ = 0.99 and with γ close to 1.]
  [Figure: Traffic Light Control, average waiting time in each direction (North, East, West, South) for PPO and GGF-PPO.]

  26-27. Conclusion. Fair optimization in the RL setting. Theoretical discussion with a new bound. Adaptations of DQN, A2C, and PPO to solve this problem. Experimental validation in three domains. Future work: extend to distributed control; consider other fair social welfare functions; directly solve average-reward problems.

  28. References. Ruffy, F., Przystupa, M., and Beschastnikh, I. (2019). Iroko: A framework to prototype reinforcement learning for data center traffic control. In Workshop on ML for Systems at NeurIPS.
