Dynamic spectrum access under partial observations: A restless bandit approach



  1. Dynamic spectrum access under partial observations: A restless bandit approach. Nima Akbarzadeh, Aditya Mahajan. McGill University, Electrical and Computer Engineering Department. June 3, 2019.

  2. Restless Bandits: Example.

  3. Channel Scheduling Problem. At which time, which channel, and which resource should be used? Features: time-varying channels; partially observable environment; resource allocation. Example applications: cognitive radio networks; resource-constrained jamming.

  4. Model (Channel). There are n finite-state Markov channels, N = {1, ..., n}. The state space of channel i ∈ N is a finite ordered set S^i, with Markov state process {S^i_t}_{t≥0} and transition probability matrix P^i. Resources (rate, power, bandwidth, etc.): R = {∅, r_1, ..., r_k}. Payoff: ρ^i(s, r) for s ∈ S^i, r ∈ R, with ρ^i(s, r) = 0 if r = ∅. Example: S^i = {s_bad, s_good}, R = {r_low, r_high}, and ρ^i(s, r) = r_low if r = r_low; ρ^i(s, r) = r_high if r = r_high and s = s_good; ρ^i(s, r) = 0 if r = r_high and s = s_bad.

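To make the example concrete, here is a minimal Python sketch of such a two-state channel; the transition matrix and the numeric payoff values are illustrative assumptions, since the slide leaves them unspecified.

import numpy as np

states = ["bad", "good"]                # ordered state space S^i
P = np.array([[0.7, 0.3],               # assumed transition probability matrix P^i
              [0.2, 0.8]])
R_LOW, R_HIGH = 1.0, 3.0                # assumed payoff values for r_low and r_high

def payoff(s, r):
    """Payoff ρ^i(s, r) of the example: r_low always pays r_low, r_high pays only
    in the good state, and the empty resource (None stands for ∅) pays nothing."""
    if r is None:
        return 0.0
    if r == "r_low":
        return R_LOW
    return R_HIGH if s == "good" else 0.0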

  5. Model (Transmitter). Two decisions are made at each time t: (i) select L channels, indexed by L_t, with A^i_t = 1 if i ∈ L_t and A^i_t = 0 otherwise; (ii) select resources R^i_t, with R^i_t = ∅ if i ∉ L_t. Observation process: Y^i_t = S^i_t if A^i_t = 1, and Y^i_t = E (an erasure) if A^i_t = 0. Strategies: A_t = f_t(Y_{0:t−1}, R_{0:t−1}, A_{0:t−1}) and R_t = g_t(Y_{0:t−1}, R_{0:t−1}, A_{0:t−1}, A_t).

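A small sketch of one decision epoch of the transmitter under this observation model; the channel states and the selection rule are arbitrary placeholders, and the erasure symbol E is represented by None.

import numpy as np

rng = np.random.default_rng(0)
n, L = 4, 2                                      # n channels, L of them activated
states = np.array([0, 1, 0, 1])                  # hidden channel states S^i_t (assumed)

selected = rng.choice(n, size=L, replace=False)  # L_t, here chosen at random
A = np.zeros(n, dtype=int)
A[selected] = 1                                  # A^i_t = 1 iff i ∈ L_t

# Observation process: Y^i_t = S^i_t if A^i_t = 1, else the erasure symbol E.
Y = [int(states[i]) if A[i] == 1 else None for i in range(n)]

# Resources: R^i_t = ∅ (None) for channels that were not selected.
R = ["r_low" if A[i] == 1 else None for i in range(n)]
print(A, Y, R)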

  6. Model (Optimization Problem). Problem: given a discount factor β ∈ (0, 1), a set of resources R, and the state space, transition probability, and reward function (S^i, P^i, ρ^i) of every channel i ∈ N, choose a communication strategy (f, g) to maximize J(f, g) = E[ Σ_{t=0}^∞ β^t Σ_{i∈N} ρ^i(S^i_t, R^i_t) A^i_t ].
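For intuition, the discounted objective can be estimated by simulation. The sketch below evaluates J for a deliberately naive strategy (always transmit on a single channel with r_high), using the assumed two-state channel from the earlier sketch; all numbers are illustrative.

import numpy as np

P = np.array([[0.7, 0.3], [0.2, 0.8]])   # assumed transition matrix (state 0 = bad, 1 = good)
R_HIGH = 3.0                              # assumed payoff of r_high in the good state
beta, horizon, runs = 0.9, 200, 2000

rng = np.random.default_rng(1)
total = 0.0
for _ in range(runs):
    s = rng.integers(2)                   # random initial state of the channel
    J = 0.0
    for t in range(horizon):
        J += (beta ** t) * (R_HIGH if s == 1 else 0.0)  # ρ(S_t, r_high) with A_t = 1
        s = rng.choice(2, p=P[s])         # Markov channel evolution
    total += J
print("Monte Carlo estimate of J:", total / runs)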

  7. Literature Review and Approaches. Prior work uses partially observable Markov decision process (POMDP) formulations, which suffer from the curse of dimensionality: the state-space size is exponential in the number of channels. To cope, simplified modelling assumptions are made: two-state Gilbert-Elliott channels; multi-state but identical channels; or a fully observable Markov decision process (MDP).

  8. Our contributions: multi-state, non-identical channels; a restless bandit approach; conversion of the POMDP into a countable-state MDP; a finite-state approximation of that MDP.

  9. POMDP (Belief State). Belief state: Π^i_t(s) = P(S^i_t = s | Y^i_{0:t−1}, R^i_{0:t−1}, A^i_{0:t−1}). Proposition: let Π_t denote (Π^1_t, ..., Π^n_t); then, without loss of optimality, A_t = f_t(Π_t) and R_t = g_t(Π_t, A_t). Recall that f is the channel selection policy and g is the resource selection policy.

  10. Optimal Resource Allocation Strategy. There is no need for joint optimization of (f, g). Let ρ̄^i(π) := max_{r ∈ R} Σ_{s ∈ S^i} π(s) ρ^i(s, r) and r^{i,*}(π) := argmax_{r ∈ R} Σ_{s ∈ S^i} π(s) ρ^i(s, r). Proposition: define g^{i,*}: ∆(S^i) × {0, 1} → R by g^{i,*}(π, 0) = ∅ and g^{i,*}(π, 1) = r^{i,*}(π). Then, for any channel selection policy, (g^*, g^*, ...) is an optimal resource allocation strategy.
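A short numpy sketch of this per-step resource choice: the expected payoff of each resource under a belief π, the value ρ̄^i(π), and the maximizer r^{i,*}(π). The payoff values are the assumed two-state example from the earlier sketch.

import numpy as np

resources = ["r_low", "r_high"]
payoff_table = np.array([[1.0, 0.0],      # rows: states (bad, good)
                         [1.0, 3.0]])     # columns follow `resources`

def best_resource(pi):
    """Return (ρ̄^i(π), r^{i,*}(π)) for a belief vector pi over the channel states."""
    expected = pi @ payoff_table          # expected payoff of each resource under π
    j = int(np.argmax(expected))
    return float(expected[j]), resources[j]

print(best_resource(np.array([0.8, 0.2])))  # mostly bad  -> (1.0, 'r_low')
print(best_resource(np.array([0.2, 0.8])))  # mostly good -> (2.4, 'r_high')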

  11. Restless Bandit Model. (1) Each {Π^i_t}_{t≥0}, i ∈ N, is a bandit process. (2) The transmitter can activate L of these processes. (3) Belief state evolution: Π^i_{t+1} = δ_{S^i_t} if process i is activated (A^i_t = 1), and Π^i_{t+1} = Π^i_t · P^i if process i is passive (A^i_t = 0). (4) Expected reward: ρ̄^i_t = ρ̄^i(Π^i_t) if process i is activated (A^i_t = 1), and ρ̄^i_t = 0 if process i is passive (A^i_t = 0). Process dynamics within time t: ... → Π^i_t →(f) A^i_t →(g^*) R^i_t → Y^i_t → ρ^i_t → Π^i_{t+1} → ... [Figure: evolution of the belief state of a three-state channel on the probability simplex with vertices (1, 0, 0), (0, 1, 0), (0, 0, 1).]

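A sketch of the belief-state evolution and the per-step expected reward of one bandit process, following the rule above; the two-state channel and payoff values are the same illustrative assumptions used in the earlier sketches.

import numpy as np

P = np.array([[0.7, 0.3], [0.2, 0.8]])       # assumed transition matrix P^i
payoff_table = np.array([[1.0, 0.0],         # rows: states (bad, good)
                         [1.0, 3.0]])        # columns: (r_low, r_high)

def rho_bar(pi):
    # ρ̄^i(π): expected payoff of the best resource under belief π
    return float(np.max(pi @ payoff_table))

def belief_step(pi, active, observed_state=None):
    """Belief evolution: δ_{S^i_t} if the process is activated, π · P^i if passive."""
    if active:
        new_pi = np.zeros_like(pi)
        new_pi[observed_state] = 1.0         # belief collapses onto the observed state
        return new_pi
    return pi @ P                            # passive: belief diffuses through P^i

pi = np.array([0.5, 0.5])
print(rho_bar(pi))                           # reward earned if the process is activated
print(belief_step(pi, active=False))         # passive update
print(belief_step(pi, active=True, observed_state=1))  # active update after observing "good"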

  12. Restless Bandit Solution. The main idea is to decompose the coupled n-channel optimization problem into n independent single-channel problems. When Whittle indexability is satisfied, one can use the Whittle index policy: each channel is assigned an index computed from its own belief state, and the channels with the smallest indices are selected. Such index policies have been observed to perform close to optimal in many applications in the state of the art. Goal: an efficient algorithm to check indexability and to compute the Whittle index.
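A sketch of the index policy's channel-selection step, assuming one index per channel has already been computed from that channel's belief state; the numeric indices are placeholders, and the selection follows the rule stated on the slide (smallest indices first).

import numpy as np

indices = np.array([0.8, 0.1, 0.5, 0.3])   # hypothetical per-channel Whittle indices
L = 2
selected = np.argsort(indices)[:L]          # the L channels with the smallest indices
print("activate channels:", selected)       # -> [1 3]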

  13. Problem Decomposition. Modified per-step reward: (ρ̄^i(π) − λ) a^i, where λ can be viewed as the cost of transmitting over channel i. Problem: given channel i ∈ N, the discount factor β ∈ (0, 1), the cost λ ∈ ℝ, and the tuple (∆(S^i), P^i, ρ^i) of belief state space, transition probability, and reward function, choose a policy f^i: ∆(S^i) → {0, 1} to maximize J^i_λ(f^i) := E[ Σ_{t=0}^∞ β^t (ρ̄^i(Π^i_t) − λ) A^i_t ].

  14. Dynamic Programming (Belief State). Theorem: let V^i_λ: ∆(S^i) → ℝ be the unique fixed point of V^i_λ(π) = max_{a ∈ {0,1}} Q^i_λ(π, a), where Q^i_λ(π, 0) = β V^i_λ(π · P^i) and Q^i_λ(π, 1) = ρ̄^i(π) − λ + β Σ_{s ∈ S^i} π(s) V^i_λ(δ_s). Let f^i_λ(π) = 1 if Q^i_λ(π, 1) ≥ Q^i_λ(π, 0), and f^i_λ(π) = 0 otherwise. Then f^i_λ is optimal for the decomposed single-channel problem above. Challenge: the belief state space is continuous!

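One way around the continuous belief space, in the spirit of the countable-MDP conversion and finite-state approximation listed among the contributions: after every activation the belief collapses to some δ_s, and k subsequent passive steps map it to δ_s · (P^i)^k, so the reachable beliefs form a countable set that can be truncated at a depth K. The sketch below runs value iteration for Q^i_λ on this truncated set, using the assumed two-state channel; it is an illustrative approximation, not the authors' exact algorithm, and β, λ, and K are arbitrary choices.

import numpy as np

P = np.array([[0.7, 0.3], [0.2, 0.8]])             # assumed transition matrix P^i
payoff_table = np.array([[1.0, 0.0], [1.0, 3.0]])  # assumed payoffs for (r_low, r_high)
beta, lam, K = 0.9, 0.8, 30                        # discount, activation charge, truncation depth

n_states = P.shape[0]
# Reachable beliefs: beliefs[s, k] = δ_s · P^k, for k = 0, ..., K (k = K absorbs deeper passives).
beliefs = np.zeros((n_states, K + 1, n_states))
for s in range(n_states):
    beliefs[s, 0, s] = 1.0
    for k in range(1, K + 1):
        beliefs[s, k] = beliefs[s, k - 1] @ P

rho_bar = np.max(beliefs @ payoff_table, axis=-1)  # ρ̄^i(π) for every reachable belief

V = np.zeros((n_states, K + 1))
for _ in range(500):                               # value iteration on the truncated set
    # Q(π, 0) = β V(π P): one step deeper along the passive chain (capped at depth K).
    Q0 = beta * np.concatenate([V[:, 1:], V[:, K:K + 1]], axis=1)
    # Q(π, 1) = ρ̄(π) − λ + β Σ_s π(s) V(δ_s), where δ_s are the depth-0 beliefs.
    Q1 = rho_bar - lam + beta * (beliefs @ V[:, 0])
    V_new = np.maximum(Q0, Q1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

activate = Q1 >= Q0                                # approximate per-channel policy f^i_λ
print(activate)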
