

Slide 1 (1/23)

Dynamic spectrum access under partial observations: A restless bandit approach

Nima Akbarzadeh, Aditya Mahajan

McGill University, Electrical and Computer Engineering Department

June 3, 2019

Slide 2 (2/23)

Restless Bandits Example

Slide 3 (3/23)

Channel Scheduling Problem

At each time: which channel and which resource should be used?
Features: time-varying channels; partially observable environment; resource allocation.
Examples: cognitive radio networks; resource-constrained jamming.

Slide 4 (4/23)

Model (Channel)

n finite-state Markov channels, N = {1, …, n}. The state space of channel i is a finite ordered set S^i, i ∈ N.
Markov state process: {S^i_t}_{t≥0}, with transition probability matrix P^i.
Resource: rate, power, bandwidth, etc.; R = {∅, r_1, …, r_k}.
Payoff: ρ^i(s, r), s ∈ S^i, r ∈ R, with ρ^i(s, r) = 0 if r = ∅.

Example: S^i = {s_bad, s_good}, R = {r_low, r_high}, and
ρ^i(s, r) = r_low if r = r_low; r_high if r = r_high and s = s_good; 0 if r = r_high and s = s_bad.
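The two-state example can be written out as a small sketch. The numeric values r_low = 1 and r_high = 2 are illustrative, not from the slides:

```python
# Payoff rho(s, r) for the two-state example: S = {bad, good},
# resources {None, "low", "high"}; None stands for the empty resource ∅.
R_LOW, R_HIGH = 1.0, 2.0  # illustrative rate values

def payoff(s, r):
    """Return rho(s, r) for the two-state channel example."""
    if r is None:          # no resource assigned: zero payoff
        return 0.0
    if r == "low":         # low rate succeeds in either state
        return R_LOW
    # high rate pays off only when the channel is in the good state
    return R_HIGH if s == "good" else 0.0

print(payoff("good", "high"))  # 2.0
print(payoff("bad", "high"))   # 0.0
print(payoff("bad", "low"))    # 1.0
```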


Slide 6 (5/23)

Model (Transmitter)

Two decisions at each time t:
Select L channels, indexed by L_t: A^i_t = 1 if i ∈ L_t and 0 otherwise.
Select resources, denoted by R^i_t: R^i_t = ∅ if i ∉ L_t.

Observation process:
Y^i_t = S^i_t if A^i_t = 1, and Y^i_t = E if A^i_t = 0.

Strategies:
A_t = f_t(Y_{0:t−1}, R_{0:t−1}, A_{0:t−1}),
R_t = g_t(Y_{0:t−1}, R_{0:t−1}, A_{0:t−1}, A_t).


Slide 9 (6/23)

Model (Optimization Problem)

Problem
Given a discount factor β ∈ (0, 1), a set of resources R, and the state space, transition probability, and payoff function (S^i, P^i, ρ^i)_{i∈N} of all channels, choose a communication strategy (f, g) to maximize

J(f, g) = E[ Σ_{t=0}^∞ β^t Σ_{i∈N} ρ^i(S^i_t, R^i_t) A^i_t ].

Slide 10 (7/23)

Literature Review and Approaches

Partially observable Markov decision process (POMDP). POMDP models suffer from the curse of dimensionality: the state-space size is exponential in the number of channels.

Simplified modelling assumptions in prior work:
Two-state Gilbert-Elliott channels.
Multi-state but identical channels.
Fully observable Markov decision process (MDP).

Slide 11 (8/23)

Our contributions

Multi-state, non-identical channels.
Restless bandit approach.
Convert the POMDP into a countable-state MDP.
Finite-state approximation of the MDP.

Slide 12 (9/23)

POMDP (Belief State)

Belief state: Π^i_t(s) = P(S^i_t = s | Y^i_{0:t−1}, R^i_{0:t−1}, A^i_{0:t−1}).

Proposition
Let Π_t denote (Π^1_t, …, Π^n_t). Then, without loss of optimality,
A_t = f_t(Π_t), R_t = g_t(Π_t, A_t).

Recall: f is the channel selection policy and g is the resource selection policy.

Slide 13 (10/23)

Optimal Resource Allocation Strategy

No need for joint optimization of (f, g). Let
ρ̄^i(π) := max_{r∈R} Σ_{s∈S^i} π(s) ρ^i(s, r),
r^{i,*}(π) := argmax_{r∈R} Σ_{s∈S^i} π(s) ρ^i(s, r).

Proposition
Define g^{i,*} : ∆(S^i) × {0, 1} → R by g^{i,*}(π, 0) = ∅ and g^{i,*}(π, 1) = r^{i,*}(π). For any channel selection policy, (g^*, g^*, …) is an optimal resource allocation strategy.
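Given a belief π, resource selection thus reduces to a one-shot maximization. A minimal sketch for the two-state example, with illustrative payoff values:

```python
# rho_bar(pi) = max_r sum_s pi(s) rho(s, r), together with the maximizing
# resource r*(pi). Payoff values are illustrative, not from the slides.
RHO = {  # RHO[r][s]: payoff of resource r when the channel state is s
    "low":  {"bad": 1.0, "good": 1.0},
    "high": {"bad": 0.0, "good": 2.0},
}

def rho_bar_and_rstar(pi):
    """Return (rho_bar(pi), r*(pi)) for a belief pi = {state: prob}."""
    best_r, best_val = None, float("-inf")
    for r, pay in RHO.items():
        val = sum(pi[s] * pay[s] for s in pi)  # expected payoff of r
        if val > best_val:
            best_r, best_val = r, val
    return best_val, best_r

# With belief 0.6 on good: high gives 0.6 * 2 = 1.2 > 1.0 from low.
print(rho_bar_and_rstar({"bad": 0.4, "good": 0.6}))  # (1.2, 'high')
```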

Slide 14 (11/23)

Restless Bandit Model

(1) Each {Π^i_t}_{t≥0}, i ∈ N, is a bandit process.
(2) The transmitter can activate L of these processes.
(3) Belief state evolution:
Π^i_{t+1} = δ_{S^i_t} if process i is activated (A^i_t = 1),
Π^i_{t+1} = Π^i_t · P^i if process i is passive (A^i_t = 0).
(4) Expected reward:
ρ^i_t = ρ̄^i(Π^i_t) if process i is activated (A^i_t = 1),
ρ^i_t = 0 if process i is passive (A^i_t = 0).

Dynamics of process i within time step t:
… → Π^i_t →(via f) A^i_t →(via g^*) R^i_t → Y^i_t → ρ^i_t → Π^i_{t+1} → …
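The belief evolution in (3) can be sketched directly: reset to a delta on the observed state when activated, otherwise diffuse by one step of P. The 3-state transition matrix below is illustrative:

```python
# Belief update for one bandit process (pure-Python row vector times matrix).
P = [[0.5, 0.25, 0.25],   # illustrative transition matrix
     [0.25, 0.5, 0.25],
     [0.25, 0.25, 0.5]]

def step_belief(pi, active, observed_state=None):
    """Return pi_{t+1}: delta on the observation if active, else pi . P."""
    n = len(P)
    if active:  # channel was sensed: belief collapses to the observation
        return [1.0 if s == observed_state else 0.0 for s in range(n)]
    # passive: one diffusion step, pi . P
    return [sum(pi[s] * P[s][j] for s in range(n)) for j in range(n)]

pi = step_belief(None, active=True, observed_state=0)  # [1, 0, 0]
pi = step_belief(pi, active=False)                     # row 0 of P
print(pi)  # [0.5, 0.25, 0.25]
```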


Slide 19 (12/23)

Restless Bandit Solution

The main idea is to decompose the coupled n-channel optimization problem into n independent one-channel problems.

When Whittle indexability is satisfied, one may use a Whittle index policy: the channels with the minimum indices are selected. Index strategies perform close to optimally in many applications reported in the literature.

Goal: we provide an efficient algorithm to check indexability and compute the Whittle index.

Slide 20 (13/23)

Problem Decomposition

Modified per-step reward: (ρ̄^i(π) − λ) a^i, where λ can be viewed as the cost for transmitting over channel i.

Problem
Given channel i ∈ N, the discount factor β ∈ (0, 1), the cost λ ∈ R, and the belief state space, transition probability, reward function tuple (∆(S^i), P^i, ρ^i), choose a policy f^i : ∆(S^i) → {0, 1} to maximize

J^i_λ(f^i) := E[ Σ_{t=0}^∞ β^t (ρ̄^i(Π^i_t) − λ) A^i_t ].

Slide 21 (14/23)

Dynamic Programming (Belief State)

Theorem
Let V^i_λ : ∆(S^i) → R be the unique fixed point of the equation

V^i_λ(π) = max_{a∈{0,1}} Q^i_λ(π, a),
where
Q^i_λ(π, 0) = β V^i_λ(π · P^i),
Q^i_λ(π, 1) = ρ̄^i(π) − λ + β Σ_{s∈S^i} π(s) V^i_λ(δ_s).

Let f^i_λ(π) = 1 if Q^i_λ(π, 1) ≥ Q^i_λ(π, 0), and f^i_λ(π) = 0 otherwise.
Then f^i_λ is optimal for Problem 2.

Challenge: continuous state space!


Slide 23 (15/23)

Information State

Let O^i_t ∈ S^i denote the last observed state of channel i and K^i_t ∈ Z≥0 the time since the last observation. Then we have

(O^i_{t+1}, K^i_{t+1}) = (S^i_t, 0) if A^i_t = 1,
(O^i_{t+1}, K^i_{t+1}) = (O^i_t, K^i_t + 1) if A^i_t = 0.

Lemma
At any time t, Π^i_t = δ_{O^i_t} · (P^i)^{K^i_t} almost surely.

Example:
P^i = [ 0.5  0.25 0.25
        0.25 0.5  0.25
        0.25 0.25 0.5 ]
For instance, k passive steps after last observing state 1, the belief is (1, 0, 0) · (P^i)^k.
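The lemma can be checked numerically: k passive updates Π ← Π · P starting from δ_O give exactly the O-th row of P^k. A quick pure-Python check with the example matrix:

```python
# Verify: starting from delta_o and applying "pi <- pi . P" k times yields
# row o of P^k, i.e. Pi_t = delta_{O_t} . P^{K_t}.
P = [[0.5, 0.25, 0.25],
     [0.25, 0.5, 0.25],
     [0.25, 0.25, 0.5]]

def vec_mat(pi, M):
    n = len(M)
    return [sum(pi[s] * M[s][j] for s in range(n)) for j in range(n)]

def belief_after(o, k):
    """Belief k passive steps after last observing state o."""
    pi = [1.0 if s == o else 0.0 for s in range(len(P))]
    for _ in range(k):
        pi = vec_mat(pi, P)
    return pi

print(belief_after(0, 1))  # row 0 of P: [0.5, 0.25, 0.25]
print(belief_after(0, 2))  # row 0 of P^2
```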

Slide 31 (16/23)

Dynamic Programming (Information State)

It is difficult to solve the dynamic program based on the belief state π^i, since the state space is ∆(S^i). A new dynamic program can be written in terms of the information state (o^i, k^i), whose state space is S^i × Z≥0. Pros and cons: the state space is now countable, but the dynamic program remains computationally infeasible.

Slide 32 (17/23)

Finite-State Approximation

Dynamic Programming (finite state space and computable!)

Given m ∈ N, let N_m := {0, …, m} and let V^i_{λ,m} : S^i × N_m → R denote the unique fixed point of

V^i_{λ,m}(o, k) = max_{a∈{0,1}} Q^i_{λ,m}(o, k, a),
where
Q^i_{λ,m}(o, k, 0) = β V^i_{λ,m}(o, min(k + 1, m)),
Q^i_{λ,m}(o, k, 1) = ρ̄^i(o, k) − λ + β Σ_{s∈S^i} [(P^i)^k]_{o,s} V^i_{λ,m}(s, 0).

Let f^i_{λ,m}(o, k) = 1 if Q^i_{λ,m}(o, k, 1) ≥ Q^i_{λ,m}(o, k, 0), and f^i_{λ,m}(o, k) = 0 otherwise.
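A minimal value-iteration sketch of this finite-state dynamic program for a single channel. The 2-state chain, the payoffs (a single resource, so ρ̄ is just the expected payoff), β, λ, and the truncation m are all illustrative assumptions:

```python
# Value iteration for V_{lambda,m}(o, k) on S x {0, ..., m}:
#   Q(o,k,0) = beta * V(o, min(k+1, m))
#   Q(o,k,1) = rho_bar(o,k) - lam + beta * sum_s [P^k]_{o,s} V(s, 0)
import itertools

P = [[0.8, 0.2], [0.3, 0.7]]   # illustrative 2-state chain (0 = bad, 1 = good)
RHO = [0.0, 1.0]               # payoff of transmitting when the state is s
BETA, LAM, M = 0.9, 0.4, 30    # discount, transmission cost, truncation

# Precompute row o of P^k for every information state (o, k).
ROWS = {}
for o in range(len(P)):
    row = [1.0 if j == o else 0.0 for j in range(len(P))]
    for k in range(M + 1):
        ROWS[o, k] = row
        row = [sum(row[s] * P[s][j] for s in range(len(P)))
               for j in range(len(P))]

def rho_bar(o, k):             # expected payoff at belief delta_o . P^k
    return sum(p * r for p, r in zip(ROWS[o, k], RHO))

def q_values(V, o, k):
    q0 = BETA * V[o, min(k + 1, M)]
    q1 = rho_bar(o, k) - LAM + BETA * sum(
        p * V[s, 0] for s, p in enumerate(ROWS[o, k]))
    return q0, q1

V = {ok: 0.0 for ok in itertools.product(range(len(P)), range(M + 1))}
for _ in range(400):           # iterate to (near) the fixed point
    V = {ok: max(q_values(V, *ok)) for ok in V}

policy = {ok: int(q_values(V, *ok)[1] >= q_values(V, *ok)[0]) for ok in V}
print(policy[1, 0])            # activate when the channel was just seen good
```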

Slide 33 (18/23)

Finite-State Approximation

Approximation Limits
(i) lim_{m→∞} V^i_{λ,m}(o, k) = V^i_λ(o, k) for all (o, k) ∈ S^i × Z≥0.
(ii) Let f^{i,*}_λ(o, k) be any limit point of {f^i_{λ,m}(o, k)}_{m≥1}. Then the policy f^{i,*}_λ is optimal for sub-problem i.

Slide 34 (19/23)

Indexability

Let the passive set for process i be
P^i_λ = { (o, k) ∈ S^i × N_m : f^i_{λ,m}(o, k) = 0 }.
Recall: f^i_{λ,m} is the policy obtained from the dynamic program.

Definition (Indexability)
Process i is indexable if, for any λ_1, λ_2 ∈ R, λ_1 ≤ λ_2 ⇒ P^i_{λ_1} ⊆ P^i_{λ_2}.

Definition (Whittle index)
The Whittle index of information state (o, k) of process i is
w^i(o, k) = inf{ λ ∈ R : (o, k) ∈ P^i_λ }.
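Under indexability, each state is active below its index and passive above it, so the index can be located by bisection on λ. In the sketch below, `is_active` is a hypothetical stand-in for solving the λ-parameterized dynamic program; the toy rule with a known threshold merely exercises the search:

```python
# Bisection for w(o, k) = inf{ lam : (o, k) in P_lam }, assuming the
# passive set grows monotonically with lam (indexability).
def whittle_index(is_active, state, lo=-10.0, hi=10.0, tol=1e-6):
    """Smallest lam at which `state` becomes passive, by bisection."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if is_active(mid, state):   # still active: the index is above mid
            lo = mid
        else:                       # passive: the index is at or below mid
            hi = mid
    return 0.5 * (lo + hi)

# Toy stand-in: state (o, k) stays active while lam < 1 / (1 + k).
def toy_active(lam, state):
    return lam < 1.0 / (1.0 + state[1])

print(round(whittle_index(toy_active, (0, 1)), 3))  # 0.5
```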
Slide 35 (20/23)

Algorithms

Search over λ between a lower bound LB and an upper bound UB. As λ increases, each state switches from active (1) to passive (0); the switching point is that state's Whittle index.

λ       w(1,1)  w(1,2)  w(2,1)  w(2,2)
λ_(1)   1       1       1       1
λ_(2)   0       1       1       1
λ_(3)   0       0       1       1
λ_(4)   0       0       0       1
λ_(5)   0       0       0       0

Procedure: for each state, locate the critical λ at which its action flips from active to passive.

Slide 40 (21/23)

Algorithms

Whittle Index Policy: at each time, obtain the Whittle index corresponding to the current information state of each channel, then transmit over the L channels with the smallest Whittle indices.
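The per-step selection can be sketched as follows. The index tables are hypothetical precomputed values; following the slide, the L channels with the smallest indices are chosen:

```python
# Pick the L channels with the smallest Whittle indices at the current
# information states (the selection rule stated on the slide).
def select_channels(indices, info_states, L):
    """indices[i] maps (o, k) -> Whittle index of channel i."""
    ranked = sorted(range(len(info_states)),
                    key=lambda i: indices[i][info_states[i]])
    return sorted(ranked[:L])   # ids of the L channels with smallest index

# Hypothetical precomputed index tables for 3 channels.
w = [{(0, 0): 0.9, (1, 0): 0.2},
     {(0, 0): 0.7, (1, 0): 0.1},
     {(0, 0): 0.8, (1, 0): 0.3}]
states = [(1, 0), (0, 0), (1, 0)]   # current (o, k) of each channel
print(select_channels(w, states, L=2))  # [0, 2] (indices 0.2 and 0.3)
```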

Slide 41 (22/23)

Conclusion

We studied the dynamic spectrum access problem of transmitting over multiple channels with partially observed channel states. The resource allocation strategy can be computed offline and does not affect the channel selection strategy. To circumvent the curse of dimensionality, we modelled the problem as a restless bandit and used the Whittle index heuristic. By restricting attention to the reachable set of beliefs, the belief-state process is converted into a countable-state process. We developed low-complexity algorithms to check whether each channel is indexable and, if so, to compute the Whittle index of each information state.

Slide 42 (23/23)

Q&A

Thank you!