Algorithms for Product Pricing and Energy Allocation in Energy Harvesting Sensor Networks
MSc(Engg.) Thesis Defense
Sindhu P R
Advisor: Prof. Shalabh Bhatnagar
Department of Computer Science and Automation, Indian Institute of Science
Stochastic System
Figure: block diagram of a stochastic system with control input θ, noise ξ and output E[f(θ, ξ)]
The output f(θ, ξ) depends on the control (or input) θ and a noise element ξ
Objective: Find the input θ which minimizes (or maximizes) E[f(θ, ξ)]
Overview
1 Product Pricing: Problem Formulation, Algorithm, Results
2 Energy Sharing in Sensor Networks: Problem Formulation, Algorithms, Approximation Algorithms, Results
Product Demand and Price
A new product introduced into the market faces demand uncertainty:
- the size of the target population is not known
- the response to promotional campaigns cannot be ascertained
Product price, quality and technology influence demand
A dynamically varying price yields better profit than fixed pricing
Product Pricing Problem
The product is sold over the time interval [0, T]
X(t): cumulative sales (or demand) up to time t
u(X(t)): manufacturing cost per unit of the product
Pr(t): price of the product at time t
Objective
Find Pr(t) which maximizes the expected cumulative profit over [0, T]:
J* = max over Pr(t) of E[ ∫_0^T (Pr(t) − u(X(t))) dX(t) | X(0) = X_0 ]
Bass Diffusion Model
Used to quantify how demand grows with time
Driving forces for product diffusion: mass-media communication and word-of-mouth communication
The change in cumulative sales X(t) of a product is stated as
dX(t)/dt = p(M − X(t)) + (q X(t)/M)(M − X(t))
- M: market potential
- p, q: coefficients of innovation and imitation
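To make the diffusion concrete, here is a minimal sketch that Euler-integrates the Bass equation above; the values M = 10, p = 0.02 and q = 0.5 are taken from the simulation-parameters slide later in this part, while T and the step count are illustrative choices.

```python
import numpy as np

def bass_cumulative(M=10.0, p=0.02, q=0.5, T=100.0, steps=10_000):
    """Euler-integrate dX/dt = p(M - X) + (q X / M)(M - X)."""
    dt = T / steps
    X = np.zeros(steps + 1)
    for k in range(steps):
        X[k + 1] = X[k] + (M - X[k]) * (p + q * X[k] / M) * dt
    return X

X = bass_cumulative()
print(f"cumulative sales at T: {X[-1]:.3f} of market potential M = 10")
```

The trajectory traces the familiar S-curve: innovation (p) drives early adoption, imitation (q) accelerates the middle phase, and growth saturates as X(t) approaches M.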
Growth of Cumulative Demand
The change in X(t) is modelled as a Stochastic Differential Equation (SDE):
dX(t) = (M − X(t)) (p + q X(t)/M) (1 − γ Pr(t)) dt + σ(X(t)) dW(t)
- γ determines the extent of influence of Pr(t) on the product diffusion
- σ(·): diffusion term; {W(t), t ≥ 0}: standard Brownian motion
Approach
Discretize the SDE and solve in a discrete-time setting
Formulate the problem in the setting of simulation optimization
Tune the (discrete) price trajectory to obtain optimal performance
SDE Discretization
T = Nh for some N > 1, 0 < h < 1
Cumulative demand at decision time j is X_j, with X_j ≡ X(jh)
Price set at j is Pr(j) ≡ Pr(jh)
Drift term: b(X_j, Pr(j)) = (M − X_j)(p + q X_j/M)(1 − γ Pr(j))
Single-stage term: g(X_j, Pr(j)) = u(X_j) − Pr(j), the negative of the profit per unit
X_{−1} = 0 and X_0 is assumed to be known
σ(X_j) ≡ σ(X(jh)); σ′(·) is the derivative of σ(·)
SDE Discretization (contd.)
Discretized SDE (Milstein scheme):
X_{j+1} = X_j + b(X_j, Pr(j)) h + σ(X_j) √h Z_{j+1} + (1/2) σ′(X_j) σ(X_j) h (Z_{j+1}^2 − 1)
- For 0 ≤ j ≤ N − 1, the Z_{j+1} are independent N(0, 1) distributed samples
Discretized Objective Function
J_{X_0}(Pr(0), Pr(1), ..., Pr(N−1)) = E[ Σ_{j=0}^{N−1} g(X_j, Pr(j)) (X_j − X_{j−1}) | X_0 ]
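The discretized model is straightforward to simulate. Below is a minimal sketch assuming the parameter values reported on the results slide (M = 10, constant σ = 0.1, u(X_j) = 100 − 0.2 X_j, γ = 0.005, N = 400, h = 0.25); with σ constant, σ′ = 0 and the Milstein correction term drops out.

```python
import numpy as np

M, p, q, gamma = 10.0, 0.02, 0.5, 0.005
N, h = 400, 0.25                       # so T = N*h = 100
u = lambda x: 100.0 - 0.2 * x          # manufacturing cost per unit
sigma = 0.1                            # constant diffusion => sigma' = 0

def b(x, price):
    """Drift b(X_j, Pr(j)) = (M - X_j)(p + q X_j / M)(1 - gamma Pr(j))."""
    return (M - x) * (p + q * x / M) * (1.0 - gamma * price)

def sample_cost(prices, x0=0.0, seed=0):
    """One sample of J = sum_j g(X_j, Pr(j))(X_j - X_{j-1}), with g = u - Pr."""
    rng = np.random.default_rng(seed)
    x_prev, x, cost = 0.0, x0, 0.0     # X_{-1} = 0, X_0 known
    for j in range(N):
        cost += (u(x) - prices[j]) * (x - x_prev)
        z = rng.standard_normal()
        x_prev, x = x, x + b(x, prices[j]) * h + sigma * np.sqrt(h) * z
    return cost

myopic = np.full(N, 140.0)             # the fixed myopic price from the slides
print(sample_cost(myopic))             # negative of the cumulative profit
```

Averaging sample_cost over many independent runs estimates J_{X_0} for a given price vector; the SF algorithm below tunes that vector.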
Optimal Price Vector
Objective: Find P_r* = (P_r*(0), P_r*(1), ..., P_r*(N−1))^T where
{P_r*(0), P_r*(1), ..., P_r*(N−1)} = arg min over {Pr(0), ..., Pr(N−1)} of J_{X_0}(Pr(0), Pr(1), ..., Pr(N−1))
The price set at stage k influences the demand in that stage as well as in future stages
The initial price influences the single-stage costs of all stages
Gradient Search Using Simulations
Figure: gradient search on f(θ); the iterate moves from θ_1 by a step −β∇_θ f
Need to minimize f(θ): find ∇_θ f and update θ in the negative gradient direction
In the pricing problem, f(θ) = J(Pr(0), Pr(1), ..., Pr(N−1))
Smoothed Functional Technique
Smooth ∇J(Pr(0), Pr(1), ..., Pr(N−1)) by convolution with an N-dimensional Gaussian density function G(η), η ∈ R^N
β > 0 is a small constant
Figure: f(x) and its smoothed version f(x) ∗ G(η)
The gradient estimate is
D_β J_{X_0}(P_r) = (1/β) E[ (η/2) (J_{X_0}(P_r + βη) − J_{X_0}(P_r − βη)) | P_r ]
Gradient Estimate
In our simulation-based algorithm, the expectation is replaced by samples obtained from simulation
For large L and small β,
∇J_{X_0}(P_r) = (1/β) (1/L) Σ_{n=1}^{L} (η(n)/2) (J_{X_0}(P_r + βη(n)) − J_{X_0}(P_r − βη(n)))
‖D_β J_{X_0}(P_r) − ∇J_{X_0}(P_r)‖ → 0 as β → 0 and L → ∞
Two-timescale Smoothed Functional (SF) Algorithm
1 Generate η(n)
2 Set P_r^+ = P_r(n) + βη(n) and P_r^- = P_r(n) − βη(n)
3 Simulate the trajectories X^+ and X^-
4 Compute δ_i^+ = g(X_i^+(n), P_r^+(i)(n)) (X_i^+(n) − X_{i−1}^+(n)) and δ_i^- = g(X_i^-(n), P_r^-(i)(n)) (X_i^-(n) − X_{i−1}^-(n)), 0 ≤ i ≤ N − 1
5 Compute δ_j = (η_j(n)/(2β)) Σ_{i=j}^{N−1} (δ_i^+ − δ_i^-), 0 ≤ j ≤ N − 1, and set δ = (δ_0, ..., δ_{N−1})^T
6 Gradient estimate: Y(n+1) = (1 − c(n)) Y(n) + c(n) δ
7 Update price: P_r(n+1) = P_r(n) − α(n) Y(n+1)
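A compact sketch of one way these seven steps fit together, reusing the discretized model above; the perturbation size β = 0.1 and the step-size schedules α(n) and c(n) are assumed choices (the slides do not fix them), with c(n) decaying more slowly than α(n) so that the gradient average runs on the faster timescale.

```python
import numpy as np

M, p, q, gamma, N, h = 10.0, 0.02, 0.5, 0.005, 400, 0.25
u = lambda x: 100.0 - 0.2 * x
rng = np.random.default_rng(1)

def stage_costs(prices):
    """Per-stage costs delta_j = g(X_j, Pr(j))(X_j - X_{j-1}) along one trajectory."""
    x_prev, x = 0.0, 0.0                              # X_{-1} = 0 = X_0 here
    out = np.empty(N)
    for j in range(N):
        out[j] = (u(x) - prices[j]) * (x - x_prev)    # g = u - Pr
        drift = (M - x) * (p + q * x / M) * (1 - gamma * prices[j])
        x_prev, x = x, x + drift * h + 0.1 * np.sqrt(h) * rng.standard_normal()
    return out

beta = 0.1                                            # perturbation size (assumed)
Pr, Y = np.full(N, 140.0), np.zeros(N)
for n in range(500):
    alpha = 1.0 / (n + 100)                           # slow timescale alpha(n)
    c = 1.0 / (n + 100) ** 0.6                        # faster timescale c(n)
    eta = rng.standard_normal(N)                      # step 1
    d_plus = stage_costs(Pr + beta * eta)             # steps 2-4
    d_minus = stage_costs(Pr - beta * eta)
    tail = np.cumsum((d_plus - d_minus)[::-1])[::-1]  # sum_{i=j}^{N-1}(d+_i - d-_i)
    delta = eta * tail / (2 * beta)                   # step 5
    Y = (1 - c) * Y + c * delta                       # step 6: averaged gradient
    Pr = Pr - alpha * Y                               # step 7: price update
print(Pr[:5].round(2))
```

Step 5 mirrors the observation made earlier: the price at stage j affects only the single-stage costs from stage j onwards, so only the tail sum enters its gradient component.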
Performance of SF Algorithm
Figure: evolution of cumulative demand. Figure: optimal price trajectory
Simulation parameters: M = 10, σ = 0.1, u(X_j) = 100 − 0.2 X_j, p = 0.02, q = 0.5 and γ = 0.005; L = 50,000, N = 400, h = 0.25, T = 100
The myopic price is 140
Objective function values: myopic = 472.6 ± 0.9, optimal = 670 ± 6
Summary
We proposed a simulation-based algorithm for pricing a product over the interval [0, T] when demand is uncertain
The algorithm utilised a smoothed functional gradient estimator to find the optimal pricing policy
Future Directions:
- Simulation-based optimization methods for pricing in the presence of competing products
Overview
1 Product Pricing: Problem Formulation, Algorithm, Results
2 Energy Sharing in Sensor Networks: Problem Formulation, Algorithms, Approximation Algorithms, Results
Sensor Network
Figure: group of sensor nodes transmitting to a fusion node
Group of autonomous sensing nodes
Sensed data is transmitted to a fusion node
The fusion node obtains data from multiple nodes and processes it
Conventional Sensor Node
Figure: node operation timeline at t = T_1, T_1+T_2, T_1+T_2+T_3
The node is powered by a non-rechargeable battery
Energy is used for sensing and transmission
The node stops operating once the battery is exhausted
Energy Harvesting Sensor Node
Figure: node operation timeline with periodic energy harvesting
The node replenishes energy by harvesting from the environment
The energy harvested is location- and time-dependent
Techniques are required to prevent energy starvation in nodes
Sensor Mote
Figure: photograph of a sensor mote
Energy Sharing Problem
The system comprises n sensors and a common energy harvesting (EH) source
Each sensor has a finite data buffer in which sensed data is stored
The EH source has a finite buffer to store the harvested energy
Nodes share the energy harvested by the EH source
Aim
To develop energy sharing algorithms which minimize the average delay in transmission of data from the nodes
Approach
Model the problem as an infinite-horizon average-cost Markov decision process (MDP)
Control decisions are based on:
- amount of data in each of the sensors
- energy available in the EH source
Develop Reinforcement Learning (RL) algorithms which
- optimally allocate energy to every sensor
- work without requiring a system model
Reinforcement Learning
Environment: evolves probabilistically over states
Reinforcement: the cost incurred after performing an action in a state
Policy: determines which action to take at every state
Goal: learn a policy which minimizes the long-run average cost
The RL controller learns such a policy by a trial-and-error process
Figure: controller-environment loop: in state s_k the controller picks action T_k, incurs cost c(s_k, T_k), and the environment moves to s_{k+1}
System Model
Figure: n nodes and a common EH source; node i holds q^i_k bits, receives T^i_k energy units, generates X^i_k bits and transmits min(q^i_k, g(T^i_k)) bits; the source holds E_k energy units and harvests Y_k
Slotted, discrete-time model for the environment
D_MAX: data buffer size of each sensor
E_MAX: energy buffer size of the EH source
In slot k,
- E_k: energy buffer level; q^i_k, 1 ≤ i ≤ n: data buffer level of node i
- The EH source harvests Y_k units of energy
- The source provides T^i_k units of energy to node i, 1 ≤ i ≤ n
- Node i generates X^i_k bits of data and transmits g(T^i_k) bits of data, 1 ≤ i ≤ n
System Model (contd.)
Data buffer queue lengths evolve as
q^i_{k+1} = (q^i_k − g(T^i_k))^+ + X^i_k, 1 ≤ i ≤ n, where (q^i_k − g(T^i_k))^+ = max(q^i_k − g(T^i_k), 0)
The energy buffer queue length evolves as
E_{k+1} = (E_k − Σ_{i=1}^{n} T^i_k) + Y_k
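A minimal sketch of one slot of these dynamics, using the two-node experiment's buffer sizes and conversion function g(T) = 2 ln(1 + T) from the results slides; clipping at the finite buffer sizes D_MAX and E_MAX is implicit in the model, and the Poisson arrival means used here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D_MAX, E_MAX = 2, 10.0, 20.0       # two-node experiment's buffer sizes
g = lambda t: 2.0 * np.log1p(t)       # conversion function from the results slide

def step(q, E, T, X, Y):
    """One slot: transmit g(T_i) bits from node i, add arrivals, recharge source."""
    q_next = np.minimum(np.maximum(q - g(T), 0.0) + X, D_MAX)  # (q - g(T))^+ + X
    E_next = min(E - T.sum() + Y, E_MAX)                       # (E - sum_i T_i) + Y
    return q_next, E_next

q, E = np.array([4.0, 7.0]), 12.0     # current buffer levels
T = np.array([3.0, 5.0])              # energy allocation; must satisfy sum(T) <= E
X = rng.poisson(1.0, size=n)          # data arrivals; Poisson as in the experiments
Y = rng.poisson(4.0)                  # harvested energy; the mean 4.0 is illustrative
q, E = step(q, E, T, X, Y)
print(q, E)
```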
Model Assumptions
1 The generated data at time k + 1, X_{k+1} := (X^1_{k+1}, ..., X^n_{k+1}), evolves as a jointly Markov process: X_{k+1} = f_1(X_k, W_k), k ≥ 0
- f_1 is some arbitrary vector-valued function with n components
- {W_k, k ≥ 1} is a noise sequence with probability distribution P(W_k | X_k) depending on X_k
2 The energy arrival process evolves as Y_{k+1} = f_2(Y_k, V_k), k ≥ 0
- f_2 is some scalar-valued function
- {V_k, k ≥ 1} is a noise sequence with probability distribution P(V_k | Y_k) depending on Y_k
MDP Formulation
State: s_k = (q^1_k, q^2_k, ..., q^n_k, E_k, X^1_{k−1}, ..., X^n_{k−1}, Y_{k−1}), s_k ∈ S
Action: T(s_k) = (T^1(s_k), T^2(s_k), ..., T^n(s_k)) ∈ A, specifies the number of energy units to be given to each node at time k
Stationary policy: π = {T, T, ...}
Single-stage cost: c(s_k, T(s_k)) = Σ_{i=1}^{n} (q^i_k − g(T^i(s_k)))^+
Long-run average cost per step: λ_π = lim_{m→∞} E[ (1/m) Σ_{k=0}^{m−1} c(s_k, T(s_k)) ]
Objective
Find a stationary optimal policy π* = (T*, T*, ...) which minimizes the sum of remaining data queue lengths of all buffers. This policy also minimizes the mean delay of data transmission.
Q-learning
The Q-value of a state-action pair (s, a) is Q(s, a); initially, Q(s, a) = 0 ∀s ∈ S, a ∈ A
i_r: reference state
Simulate the MDP for a large number of iterations; at iteration j, for the state-action pair visited during simulation, the Q-value is updated as
Q_{j+1}(s, a) = (1 − α(j)) Q_j(s, a) + α(j) ( c(s, a) + min_{b ∈ A(s′)} Q_j(s′, b) − min_{u ∈ A(i_r)} Q_j(i_r, u) )
where s′ is the state reached from s under action a
The step-size sequence α(j) > 0, ∀j ≥ 0, satisfies Σ_j α(j) = ∞ and Σ_j α^2(j) < ∞
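A minimal sketch of this relative Q-learning update on a toy two-state MDP; the toy dynamics and costs, and the ε-greedy exploration (previewed on the next slide), are illustrative stand-ins for the sensor model.

```python
import random
from collections import defaultdict

random.seed(0)
states, actions = [0, 1], [0, 1]
ref = 0                                    # reference state i_r
Q = defaultdict(float)                     # Q[(s, a)], initialized to 0

def env_step(s, a):                        # toy MDP: action 1 always costs less
    cost = s + (1 - a) * 0.5
    return cost, random.choice(states)

s = 0
for j in range(1, 20001):
    alpha = 1.0 / j                        # sum alpha = inf, sum alpha^2 < inf
    a = random.choice(actions) if random.random() < 0.1 \
        else min(actions, key=lambda b: Q[(s, b)])       # epsilon-greedy
    cost, s_next = env_step(s, a)
    target = cost + min(Q[(s_next, b)] for b in actions) \
                  - min(Q[(ref, u)] for u in actions)    # relative update
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    s = s_next

policy = {s: min(actions, key=lambda a: Q[(s, a)]) for s in states}
print(policy)                              # expect action 1 in both states
```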
Q-Learning: Exploration Mechanisms
ε-greedy method: at state s, in iteration j, select a random action with probability ε and the greedy action with probability 1 − ε
UCB method:
- N_s(j): number of times state s is visited until time j
- N_{s,a}(j): number of times action a is picked in state s up to time j
- The Q-value of state-action pair (s, a) at time j is Q_j(s, a)
- The action for the current state s is chosen using the rule
a′ = arg max_{a ∈ A(s)} ( Q_j(s, a) + β √( ln N_s(j) / N_{s,a}(j) ) )
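A sketch of the UCB selection rule, with the exploration weight (written β on the slide) passed as kappa; trying each untried action once before applying the rule is a standard assumption, since the bonus is undefined at zero counts.

```python
import math
from collections import defaultdict

def ucb_action(s, actions, Q, N_s, N_sa, kappa=1.0):
    """Pick arg max_a Q(s,a) + kappa * sqrt(ln N_s(j) / N_sa(j)), per the slide."""
    for a in actions:                          # sample each untried action once
        if N_sa[(s, a)] == 0:
            return a
    return max(actions, key=lambda a: Q[(s, a)]
               + kappa * math.sqrt(math.log(N_s[s]) / N_sa[(s, a)]))

Q, N_s, N_sa = defaultdict(float), defaultdict(int), defaultdict(int)
s, actions = 0, [0, 1]
for _ in range(5):                             # toy usage: update counts as we go
    a = ucb_action(s, actions, Q, N_s, N_sa)
    N_s[s] += 1
    N_sa[(s, a)] += 1
print(dict(N_sa))
```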
Q-learning: Optimal Policy
The Q-learning algorithm converges to the optimal value function Q*
The optimal action a* for state s is obtained by a* = arg min_{a′ ∈ A(s)} Q*(s, a′)
Optimal policy: T*(s) = a*, ∀s ∈ S
Issues
- Need a lookup table to store and update the Q-value for every (s, a) tuple
- Computationally expensive: with n = 2, |S × A| ≈ 30^6
- Exacerbated with more sensors: with n = 4, |S × A| ≈ 30^10
Approximation Approaches
1 State and action space aggregation
- Cluster states and actions based on the monotonicity property of the value function
- Define a Q-value for each aggregate state-action pair
- Results in cardinality reduction
2 Policy approximation
- Search in the space of all policies
- Aim is to find a near-optimal stationary randomized policy
Property of Value Function
Consider two nodes and a source, and let H*(q_1, q_2, E) be the differential value of state (q_1, q_2, E)
For q_1 < q_L ≤ D_MAX and E_MAX ≥ E_L > E, H* is monotonic:
H*(q_1, q_2, E) ≤ H*(q_L, q_2, E)
H*(q_1, q_2, E) ≥ H*(q_1, q_2, E_L)
The value function of a clustered state is taken to be the average of the value functions of the individual states
The value function of an aggregated state will be close to the value function of the unaggregated state if the difference between values of states in a cluster is small
State and Action Space Aggregation
Figure: data buffer [0, D_MAX] and energy buffer [0, E_MAX], each split into partitions 1, 2, 3, ...
Data buffer partitions: sets d_1, d_2, ..., d_s, where d_i = (x^i_L, x^i_U)
Energy buffer partitions: sets e_1, e_2, ..., e_r, where e_j = (y^j_L, y^j_U)
0 = x^1_L < x^1_U < x^2_L < x^2_U < ... < x^s_L < x^s_U = D_MAX, with x^i_U + 1 = x^{i+1}_L, 1 ≤ i ≤ s − 1
0 = y^1_L < y^1_U < y^2_L < y^2_U < ... < y^r_L < y^r_U = E_MAX, with y^i_U + 1 = y^{i+1}_L, 1 ≤ i ≤ r − 1
Cardinality Reduction
Aggregate state: s′ = (l_1, ..., l_{n+1}, l_{n+2}, ..., l_{2n+1}, l_{2n+2})
- l_i is the data buffer level of the ith node, 1 ≤ i ≤ n
- l_{n+1} is the energy buffer level
- l_i ∈ {1, ..., s} for 1 ≤ i ≤ n and n + 2 ≤ i ≤ 2n + 1
- l_{n+1}, l_{2n+2} ∈ {1, ..., r}
Aggregate action for state s′: t′ = (t_1, ..., t_n), where t_i ∈ {1, ..., l_{n+1}}, 1 ≤ i ≤ n
s ≪ D_MAX, r ≪ E_MAX
|S′ × A′| ≈ 4^10 with four nodes, E_MAX = D_MAX = 30 and four partitions of the data and energy buffers
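A small sketch of how a raw state could be mapped to its aggregate label, assuming contiguous equal-width partitions; the slides only require that consecutive partitions tile the buffer range, so the equal widths here are an illustrative choice.

```python
def partition_index(level, buf_max, num_parts):
    """Return the 1-based partition index of `level` in {0, ..., buf_max},
    assuming num_parts contiguous equal-width partitions."""
    width = (buf_max + 1) / num_parts
    return min(int(level // width) + 1, num_parts)

D_MAX, E_MAX, s_parts, r_parts = 30, 30, 4, 4
q, E = [13, 2, 28, 7], 21                  # raw buffer levels (four nodes + source)
agg = [partition_index(x, D_MAX, s_parts) for x in q] \
      + [partition_index(E, E_MAX, r_parts)]
print(agg)                                  # aggregate data- and energy-buffer labels
```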
Q-learning with State Aggregation
S′: aggregate state space
A′: aggregate action space
Q-value Q(s′, t′): defined for every aggregate state-action tuple
Initially Q(s′, t′) = 0, updated using the rule
Q_{j+1}(s′, a′) = (1 − α(j)) Q_j(s′, a′) + α(j) ( c(s′, a′) + min_{b ∈ A′(v′)} Q_j(v′, b) − min_{u ∈ A′(r′)} Q_j(r′, u) )
where v′ is the next aggregate state and r′ is the reference aggregate state
Policy Approximation
The set of policies is approximated by a parameter vector θ = (θ_1, ..., θ_M)
The class of parameterized random policies is {π_θ, θ ∈ R^M}
π_θ(s′, a′): probability of picking aggregate action a′ in aggregate state s′
Goal: tune θ such that π_θ is near optimal
Search the policy space in an organised manner
Cross-Entropy Method
Parameterized Boltzmann policies:
- φ_{sa} ∈ R^M: feature vector for aggregate state-action tuple (s, a)
- π_θ(s, a) = e^{θ^T φ_{sa}} / Σ_{b ∈ A(s)} e^{θ^T φ_{sb}}, ∀s ∈ S′, ∀a ∈ A′(s)
- θ = (θ_1, ..., θ_M), θ_i ~ N(μ_i, σ_i), 1 ≤ i ≤ M
Approach:
- Generate N parameter vectors (θ^1, ..., θ^N)
- Simulate an MDP trajectory using each parameter vector θ^j
- Compute the average cost λ_j of each policy π_{θ^j}
- Choose trajectory j if λ_j < λ_c
- Update (μ_i, σ_i) using the parameter vectors corresponding to these trajectories
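A minimal sketch of the parameterized Boltzmann policy; the feature table below is a hypothetical stand-in for the feature vectors φ_{sa} of the thesis, which would encode aggregate buffer levels.

```python
import numpy as np

rng = np.random.default_rng(0)
M_dim = 8                                      # parameter dimension M
# Hypothetical features for a toy aggregate space (3 states x 2 actions).
feats = {(s, a): rng.standard_normal(M_dim) for s in range(3) for a in range(2)}

def policy_probs(theta, s, actions):
    """pi_theta(s, a) = exp(theta^T phi_sa) / sum_b exp(theta^T phi_sb)."""
    logits = np.array([theta @ feats[(s, a)] for a in actions])
    logits -= logits.max()                     # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

theta = rng.standard_normal(M_dim)             # one draw; CE samples theta_i ~ N(mu_i, sigma_i)
probs = policy_probs(theta, s=0, actions=[0, 1])
a = rng.choice([0, 1], p=probs)                # sample an aggregate action
print(probs, a)
```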
Cross-Entropy Method (contd.)
A quantile value ρ ∈ (0, 1) is selected
The average cost values are sorted in descending order; let λ_1, ..., λ_N be the sorted order, so λ_1 ≥ ... ≥ λ_N
The average cost λ_c = λ_{⌈(1−ρ)N⌉} is picked as the threshold level
The meta-parameters {(μ^t_i, σ^t_i), 1 ≤ i ≤ M} are updated after iteration t as follows:
μ^{t+1}_i = Σ_{j=1}^{N} I{λ_j ≤ λ_c} θ^j_i / Σ_{j=1}^{N} I{λ_j ≤ λ_c}
(σ^{t+1}_i)^2 = Σ_{j=1}^{N} I{λ_j ≤ λ_c} (θ^j_i − μ^{t+1}_i)^2 / Σ_{j=1}^{N} I{λ_j ≤ λ_c}
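A sketch of a single meta-parameter update, with a toy quadratic standing in for the simulated average cost of π_θ; the population size, dimension M and quantile ρ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N_pol, M_dim, rho = 50, 8, 0.1                 # population size, dim M, quantile rho
mu, sigma = np.zeros(M_dim), np.ones(M_dim)

def average_cost(theta):
    """Placeholder for simulating the MDP under pi_theta; a toy quadratic here."""
    return float(np.sum(theta ** 2))

thetas = rng.normal(mu, sigma, size=(N_pol, M_dim))   # theta^j_i ~ N(mu_i, sigma_i)
costs = np.array([average_cost(th) for th in thetas])
desc = np.sort(costs)[::-1]                           # lambda_1 >= ... >= lambda_N
lam_c = desc[int(np.ceil((1 - rho) * N_pol)) - 1]     # threshold lambda_{ceil((1-rho)N)}
elite = thetas[costs <= lam_c]                        # trajectories with lambda_j <= lambda_c
mu = elite.mean(axis=0)                               # mu_i^{t+1}
sigma = elite.std(axis=0)                             # sigma_i^{t+1} (ddof=0 matches the slide)
print(len(elite), mu.round(2))
```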
Comparison Algorithms Used
1 Greedy method
- Takes as input (q^1_k, ..., q^n_k); the energy available in the source is E_k
- The number of energy units required to transmit q^i_k bits of data is g^{-1}(q^i_k)
- The energy provided is t_k = min( E_k, Σ_{i=1}^{n} g^{-1}(q^i_k) )
2 Combined nodes method
- State: w_k = ( Σ_{i=1}^{n} q^i_k, E_k )
- Action: v_k = the total energy that needs to be distributed between the nodes
- Uses the Q-learning update rule to update Q(w_k, v_k)
- Finds the total optimal energy to be supplied, but not the exact split
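A sketch of the greedy baseline for the two-node setting with g(T) = 2 ln(1 + T), so g^{-1}(q) = e^{q/2} − 1; the slides specify only the total t_k = min(E_k, Σ_i g^{-1}(q^i_k)), so the proportional split used when energy is scarce is an assumption.

```python
import numpy as np

g_inv = lambda q: np.expm1(q / 2.0)          # inverse of g(T) = 2 ln(1 + T)

def greedy_alloc(q, E):
    """Give each node the energy needed to flush its queue, scaled down
    proportionally when the source cannot cover the total demand."""
    need = g_inv(np.asarray(q, dtype=float))
    total = need.sum()
    if total <= E or total == 0.0:
        return need
    return need * (E / total)                # proportional scaling (assumed split)

print(greedy_alloc([4.0, 7.0], E=12.0))
```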
Results - Two nodes and EH source
Figure: normalized average cost vs E[ω_1] for the Greedy, Combined Nodes QL, QL ε-greedy and Q-UCB policies
E_MAX = 20, D_MAX = 10
X_k = A X_{k−1} + ω, where ω = (ω_1, ω_2)^T, E[ω_2] = 1.0 and A = [[0.2, 0.3], [0.3, 0.2]]
Y_k = b Y_{k−1} + ζ, with b = 0.5 and E[ζ] = 20
ζ and ω are Poisson distributed
Conversion function: g(T^i_k) = 2 ln(1 + T^i_k)
Results - Four nodes and EH source
Figure: normalized average cost vs E[X^1] for the Greedy, Combined Nodes QL, Cross Entropy, QL-SA ε-greedy and QL-SA-UCB policies
E_MAX = D_MAX = 30
X^1, X^2, X^3, X^4, Y are Poisson distributed
E[X^2] = E[X^3] = E[X^4] = 1.0, E[Y] = 25; E[X^1] is varied
Six partitions of the energy and data buffers; ε = 0.1
Conversion function: g(T^i_k) = ln(1 + T^i_k)
Summary
We proposed algorithms to manage the energy available through harvesting
To deal with the curse of dimensionality, we also developed approximation algorithms
Future Directions:
- Gradient-based approaches
- The features used in the approximation methods can be tuned
Questions?
References I
- S. Bhatnagar, V. S. Borkar, and K. J. Prabuchandran, "Feature search in the Grassmanian in online reinforcement learning," IEEE Journal of Selected Topics in Signal Processing, vol. 7, pp. 746–758, 2013.
- I. Menache, S. Mannor, and N. Shimkin, "Basis function adaptation in temporal difference reinforcement learning," Annals of Operations Research, vol. 134, no. 1, pp. 215–238, 2005.
- K. J. Prabuchandran, S. K. Meena, and S. Bhatnagar, "Q-learning based energy management policies for a single sensor node with finite buffer," IEEE Wireless Communications Letters, vol. 2, no. 1, pp. 82–85, 2013.
- K. Raman and R. Chatterjee, "Optimal monopolist pricing under demand uncertainty in dynamic markets," Management Science, pp. 144–162, 1995.
- V. Sharma, U. Mukherji, V. Joseph, and S. Gupta, "Optimal energy management policies for energy harvesting sensor nodes," IEEE Transactions on Wireless Communications, vol. 9, no. 4, pp. 1326–1336, 2010.
References II
- S. Chakravarty, S. Padakandla, and S. Bhatnagar, "A simulation-based algorithm for optimal pricing policy under demand uncertainty," International Transactions in Operational Research, 2013. [Online]. Available: http://dx.doi.org/10.1111/itor.12064
- S. Padakandla, K. J. Prabuchandran, and S. Bhatnagar, "Energy sharing