

SLIDE 1

Learning Fair Policies in Multiobjective (Deep) Reinforcement Learning with Average and Discounted Rewards

Umer Siddique, Paul Weng, and Matthieu Zimmer

University of Michigan-Shanghai Jiao Tong University Joint Institute

ICML 2020

U. Siddique, P. Weng, and M. Zimmer

Fair Policies in RL 1 / 11

SLIDE 2

Overview

1. Motivation and Problem
2. Theoretical Discussions & Algorithms
3. Experimental Results
4. Conclusion


SLIDE 5

Motivation: Why should we care about fair systems?

Figure: Network with a fat-tree topology from Ruffy et al. (2019).

Fairness to users is crucial. Existing approaches to tackle this issue include:

- the utilitarian approach
- the egalitarian approach


SLIDE 8

Fairness

Fairness includes:

- Efficiency
- Impartiality
- Equity

Fairness is encoded in a Social Welfare Function (SWF). We focus on the generalized Gini social welfare function (GGF).


SLIDE 13

Problem Statement

The GGF is defined as

$$\mathrm{GGF}_{\mathbf{w}}(\mathbf{v}) = \sum_{i=1}^{D} w_i v_i^{\uparrow} = \begin{bmatrix} w_1 & w_2 & \cdots & w_D \end{bmatrix} \begin{bmatrix} v_1^{\uparrow} \\ v_2^{\uparrow} \\ \vdots \\ v_D^{\uparrow} \end{bmatrix}, \qquad w_1 > w_2 > \cdots > w_D,$$

where $\mathbf{v}^{\uparrow}$ denotes $\mathbf{v}$ sorted in increasing order.

Fair optimization problem in RL:

$$\arg\max_{\pi} \mathrm{GGF}_{\mathbf{w}}(\mathbf{J}(\pi)) \tag{1}$$

where $\mathbf{J}(\pi) = \mathbb{E}^{P_\pi}\!\left[\sum_{t=1}^{\infty} \gamma^{t-1} \mathbf{R}_t\right]$ ($\gamma$-discounted rewards) or $\mathbf{J}(\pi) = \lim_{h\to\infty} \frac{1}{h}\, \mathbb{E}^{P_\pi}\!\left[\sum_{t=1}^{h} \mathbf{R}_t\right]$ (average rewards).

U. Siddique, P. Weng, and M. Zimmer

Fair Policies in RL 5 / 11
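The GGF definition is easy to compute directly: sort the value vector in increasing order and dot it with the decreasing weight vector. A minimal NumPy sketch; the weight choice $w_i = 1/2^i$ (then normalized) is an illustrative assumption, not from the slides.

```python
import numpy as np

def ggf(v, w):
    """Generalized Gini social welfare function:
    dot product of decreasing weights w with v sorted increasingly."""
    return np.dot(w, np.sort(v))

# Illustrative strictly decreasing weights (assumption: w_i = 1/2^i, normalized).
D = 3
w = np.array([1.0 / 2**i for i in range(D)])
w /= w.sum()

# GGF rewards equity: among vectors with the same total,
# the more balanced one gets the higher score.
balanced = ggf(np.array([1.0, 1.0, 1.0]), w)
unbalanced = ggf(np.array([0.0, 0.0, 3.0]), w)
print(balanced, unbalanced)
```

Because the largest weight multiplies the smallest component, improving the worst-off objective always raises the GGF score, which is what makes it a fairness-encoding SWF.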


SLIDE 18

Theoretical Discussion

Assumption: the MDPs are weakly communicating.

Sufficiency of stationary Markov policies: a stationary Markov fair-optimal policy exists.

Possibly state-dependent optimality: with average rewards, fair optimality stays state-independent.

Contribution on approximation error: approximate the average-optimal policy $\pi^*_1$ with the $\gamma$-optimal policy $\pi^*_\gamma$.

Theorem:

$$\mathrm{GGF}_{\mathbf{w}}(\boldsymbol{\mu}(\pi^*_\gamma)) \ge \mathrm{GGF}_{\mathbf{w}}(\boldsymbol{\mu}(\pi^*_1)) - R(1-\gamma)\left[\rho\big(\gamma, \sigma(H^{P_{\pi^*_1}})\big) + \rho\big(\gamma, \sigma(H^{P_{\pi^*_\gamma}})\big)\right]$$

where $R = \max_\pi \|\mathbf{R}^{\pi}\|_1$ and $\rho(\gamma, \sigma) = \dfrac{\sigma}{\gamma - (1-\gamma)\sigma}$.

U. Siddique, P. Weng, and M. Zimmer

Fair Policies in RL 6 / 11
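The useful feature of this bound is that the penalty term $R(1-\gamma)[\rho(\cdot) + \rho(\cdot)]$ vanishes as $\gamma \to 1$, so the $\gamma$-optimal policy becomes a good proxy for the average-optimal one. A small numerical sketch; the values $\sigma = 0.5$ for both $\sigma(\cdot)$ terms and $R = 1$ are hypothetical placeholders, chosen only to show the trend.

```python
def rho(gamma, sigma):
    # rho(gamma, sigma) = sigma / (gamma - (1 - gamma) * sigma)
    return sigma / (gamma - (1.0 - gamma) * sigma)

def penalty(gamma, sigma1, sigma2, R=1.0):
    # Right-hand-side penalty of the theorem: R(1-gamma)[rho(...) + rho(...)]
    return R * (1.0 - gamma) * (rho(gamma, sigma1) + rho(gamma, sigma2))

# Hypothetical sigma = 0.5 and R = 1: the penalty shrinks as gamma -> 1,
# so GGF_w(mu(pi*_gamma)) approaches GGF_w(mu(pi*_1)) from below.
for gamma in (0.9, 0.99, 0.999):
    print(gamma, penalty(gamma, 0.5, 0.5))
```

Note that at $\gamma = 1$, $\rho(1, \sigma) = \sigma$, so the penalty is exactly zero and the bound becomes tight.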


SLIDE 20

Value-Based and Policy Gradient Algorithms

GGF-DQN: the Q network takes values in $\mathbb{R}^{|A| \times D}$ instead of $\mathbb{R}^{|A|}$, and is trained with the target

$$\hat{Q}_\theta(s, a) = \mathbf{r} + \gamma \hat{Q}_{\theta'}(s', a^*), \qquad a^* = \operatorname*{argmax}_{a' \in A} \mathrm{GGF}_{\mathbf{w}}\big(\mathbf{r} + \gamma \hat{Q}_{\theta'}(s', a')\big).$$

To optimize the GGF with policy gradient:

$$\nabla_\theta \mathrm{GGF}_{\mathbf{w}}(\mathbf{J}(\pi_\theta)) = \nabla_{\mathbf{J}(\pi_\theta)} \mathrm{GGF}_{\mathbf{w}}(\mathbf{J}(\pi_\theta)) \cdot \nabla_\theta \mathbf{J}(\pi_\theta) = \mathbf{w}_\sigma^{\top} \nabla_\theta \mathbf{J}(\pi_\theta).$$

U. Siddique, P. Weng, and M. Zimmer

Fair Policies in RL 7 / 11
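The only change the GGF target makes to standard DQN is in how the greedy action is selected: each action's vector-valued Q-estimate is scalarized through GGF before the argmax. A minimal sketch with random tabular values standing in for network outputs; `num_actions`, `D`, and all sampled numbers are illustrative assumptions.

```python
import numpy as np

def ggf(v, w):
    # GGF scalarization: decreasing weights on the increasingly sorted vector.
    return np.dot(w, np.sort(v))

rng = np.random.default_rng(0)
num_actions, D, gamma = 4, 3, 0.99

# Illustrative decreasing weights (assumption), normalized.
w = np.array([1.0 / 2**i for i in range(D)])
w /= w.sum()

r = rng.random(D)                      # vector reward observed at s'
q_next = rng.random((num_actions, D))  # stand-in for Q_{theta'}(s', .) in R^{|A| x D}

# a* = argmax_{a'} GGF_w(r + gamma * Q_{theta'}(s', a'))
a_star = max(range(num_actions), key=lambda a: ggf(r + gamma * q_next[a], w))

# Vector-valued TD target used to regress Q_theta(s, a):
target = r + gamma * q_next[a_star]
print(a_star, target)
```

For the policy-gradient form, the (sub)gradient $\mathbf{w}_\sigma = \nabla_{\mathbf{J}} \mathrm{GGF}_{\mathbf{w}}(\mathbf{J})$ is simply $\mathbf{w}$ rearranged to the ranks of $\mathbf{J}$'s components, e.g. `w[np.argsort(np.argsort(J))]` in NumPy, since the $i$-th weight multiplies whichever objective is $i$-th smallest.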


SLIDE 22

Experimental Results

What is the impact of optimizing the GGF instead of the average of the objectives?

Figure: GGF scores of A2C, GGF-A2C, PPO, and GGF-PPO on Species Conservation.

Figure: Average densities of sea otters and abalones under A2C, GGF-A2C, PPO, and GGF-PPO.


SLIDE 24

Experimental Results

What is the price of fairness? How do these algorithms perform in continuous domains?

Figure: Average accumulated density vs. number of steps for PPO and GGF-PPO on Species Conservation.

Figure: Average accumulated bandwidth vs. number of episodes for PPO and GGF-PPO on Network Congestion Control.

SLIDE 25

Experimental Results (Traffic Light Control)

What is the effect of $\gamma$ with respect to GGF-average optimality?

Figure: GGF scores of PPO and GGF-PPO with $\gamma = 0.99$ and with $\gamma$ close to 1 on Traffic Light Control.

Figure: Average waiting times for the North, East, West, and South directions under PPO and GGF-PPO.


SLIDE 27

Conclusion

- Fair optimization in the RL setting
- Theoretical discussion with a new bound
- Adaptations of DQN, A2C, and PPO to solve this problem
- Experimental validation in 3 domains

Future work:

- Extend to distributed control
- Consider other fair social welfare functions
- Directly solve average-reward problems

SLIDE 28

Ruffy, F., Przystupa, M., and Beschastnikh, I. (2019). Iroko: A framework to prototype reinforcement learning for data center traffic control. In Workshop on ML for Systems at NeurIPS.