SLIDE 1

Algorithms for Product Pricing and Energy Allocation in Energy Harvesting Sensor Networks

MSc(Engg.) Thesis Defense
Sindhu P R

Advisor: Prof. Shalabh Bhatnagar

Department of Computer Science and Automation, Indian Institute of Science, Bangalore

SLIDE 2

Stochastic System

Figure: A system with control input $\theta$ and noise $\xi$, producing output $E[f(\theta, \xi)]$

The output $f(\theta, \xi)$ depends on the control (or input) $\theta$ and a noise element $\xi$
Objective: Find the input $\theta$ which minimizes (or maximizes) $E[f(\theta, \xi)]$
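The rest of the talk instantiates this template. As a minimal illustration (not from the slides; the toy quadratic $f$ and Gaussian noise are assumed purely for demonstration), $E[f(\theta, \xi)]$ can be estimated by Monte Carlo and searched over a grid of controls:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(theta, xi):
    # Toy objective with minimum near theta = 2 (assumed for illustration only)
    return (theta - 2.0) ** 2 + xi

def expected_f(theta, n_samples=10_000):
    # Monte Carlo estimate of E[f(theta, xi)] with xi ~ N(0, 1)
    return f(theta, rng.normal(size=n_samples)).mean()

thetas = np.linspace(0.0, 4.0, 41)   # crude grid search over the control
print(min(thetas, key=expected_f))   # prints a value near 2.0
```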

SLIDE 4

Overview

1 Product Pricing

   Problem Formulation, Algorithm, Results

2 Energy Sharing in Sensor Networks

   Problem Formulation, Algorithms, Approximation Algorithms, Results

SLIDE 5

Product Demand and Price

A new product introduced into the market faces demand uncertainty:

  • the size of the target population is not known
  • the response to promotional campaigns cannot be ascertained

Product price, quality and technology influence demand
Dynamically varying the price gives better profit compared to fixed pricing

SLIDE 6

Product Pricing Problem

Product sold over the time interval $[0, T]$
$X(t)$: Cumulative sales (or demand) up to time $t$
$u(X(t))$: Manufacturing cost per unit of the product
$Pr(t)$: Price of the product at time $t$

Objective

Find $Pr(t)$ which maximizes the expected cumulative profit over $[0, T]$:

$$J^* = \max_{Pr(t)} E\left[ \int_0^T \big(Pr(t) - u(X(t))\big)\, \dot{X}(t)\, dt \;\Big|\; X(0) = X_0 \right]$$

SLIDE 7

Bass Diffusion Model

Used to quantify how demand grows with time
Driving forces for product diffusion: mass-media communication and word-of-mouth communication
The change in cumulative sales $X(t)$ of a product is stated as

$$\frac{dX(t)}{dt} = p\,(M - X(t)) + \frac{qX(t)}{M}\,(M - X(t))$$

  • M: Market potential
  • p, q: Coefficients of innovation and imitation
SLIDE 9

Growth of Cumulative Demand

The change in $X(t)$ is modelled as a Stochastic Differential Equation (SDE):

$$dX(t) = (M - X(t))\left(p + \frac{qX(t)}{M}\right)\big(1 - \gamma Pr(t)\big)\,dt + \sigma(X(t))\,dW(t)$$

  • $\gamma$ determines the extent of influence of $Pr(t)$ on the product diffusion
  • $\sigma(\cdot)$: diffusion term; $\{W(t), t \ge 0\}$: standard Brownian motion

Approach

Discretize the SDE and solve in a discrete-time setting
Formulate the problem in the setting of simulation optimization
Tune the (discrete) price trajectory to obtain optimum performance

SLIDE 17

SDE Discretization

$T = Nh$ for some $N > 1$, $0 < h < 1$
Cumulative demand at decision time $j$ is $X_j$, with $X_j \equiv X(jh)$
Price set at $j$ is $Pr(j) \equiv Pr(jh)$

Drift term: $b(X_j, Pr(j)) = (M - X_j)\left(p + \frac{qX_j}{M}\right)\big(1 - \gamma Pr(j)\big)$

$g(X_j, Pr(j)) = u(X_j) - Pr(j)$

$X_{-1} = 0$ and $X_0$ is assumed to be known
$\sigma(X_j) \equiv \sigma(jh)$; $\sigma'(\cdot)$ is the derivative of $\sigma(\cdot)$

SLIDE 19

SDE Discretization (contd.)

Discretized SDE

$$X_{j+1} = X_j + b(X_j, Pr(j))\,h + \sigma(X_j)\sqrt{h}\, Z_{j+1} + \tfrac{1}{2}\sigma'(X_j)\sigma(X_j)\, h\, (Z_{j+1}^2 - 1)$$

  • For $0 \le j \le N-1$, the $Z_{j+1}$ are independent $N(0, 1)$ distributed samples

Discretized Objective Function

$$J\big(Pr(0), Pr(1), \ldots, Pr(N-1)\big) = E\left[ \sum_{j=0}^{N-1} g(X_j, Pr(j))\,(X_j - X_{j-1}) \;\Big|\; X_0 \right]$$
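As a rough illustration of this discretization (not from the slides), the sketch below simulates the Milstein recursion and estimates the discretized objective by Monte Carlo. The diffusion form $\sigma(x) = \sigma_0 x$ is an assumption (the slides only give $\sigma = 0.1$), as are the cost convention $g = u - Pr$ from the previous slide and the parameter defaults borrowed from the results slide.

```python
import numpy as np

def simulate_demand(price, M=10.0, p=0.02, q=0.5, gamma=0.005,
                    sigma0=0.1, h=0.25, X0=0.0, rng=None):
    # Milstein discretization of the demand SDE; sigma(x) = sigma0 * x is an
    # assumed form, so sigma'(x) = sigma0.
    rng = rng or np.random.default_rng()
    N = len(price)
    X = np.empty(N + 1)
    X[0] = X0
    for j in range(N):
        b = (M - X[j]) * (p + q * X[j] / M) * (1.0 - gamma * price[j])  # drift b(X_j, Pr(j))
        sig, dsig = sigma0 * X[j], sigma0
        Z = rng.standard_normal()
        X[j + 1] = (X[j] + b * h + sig * np.sqrt(h) * Z
                    + 0.5 * dsig * sig * h * (Z**2 - 1.0))              # Milstein correction
    return X

def J_estimate(price, u=lambda x: 100.0 - 0.2 * x, n_runs=200, **kw):
    # Monte Carlo estimate of E[ sum_j g(X_j, Pr(j)) (X_j - X_{j-1}) | X_0 ],
    # with g(X_j, Pr(j)) = u(X_j) - Pr(j) and X_{-1} = 0 as on the slides.
    total = 0.0
    for _ in range(n_runs):
        X = simulate_demand(price, **kw)
        Xj = X[:-1]                                   # X_0, ..., X_{N-1}
        total += np.sum((u(Xj) - price) * np.diff(Xj, prepend=0.0))
    return total / n_runs
```

For example, `J_estimate(np.full(400, 140.0))` scores the fixed myopic price of 140 mentioned on the results slide.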

SLIDE 20

Optimal Price Vector

Objective: Find $P_r^* = \big(P_r^*(0), P_r^*(1), \ldots, P_r^*(N-1)\big)^\top$ where

$$\{P_r^*(0), P_r^*(1), \ldots, P_r^*(N-1)\} = \arg\min_{\{Pr(0), \ldots, Pr(N-1)\}} J_{X_0}\big(Pr(0), Pr(1), \ldots, Pr(N-1)\big)$$

The price set at stage $k$ influences the demand in that stage as well as in future stages
The initial price influences the single-stage costs of all stages

SLIDE 21

Gradient Search Using Simulations

Figure: Gradient Search

Need to minimize $f(\theta)$
Find $\nabla_\theta f$
Update $\theta$ in the direction of the negative gradient

In the pricing problem, $f(\theta) = J\big(Pr(0), Pr(1), \ldots, Pr(N-1)\big)$

SLIDE 23

Smoothed Functional Technique

Smooth $\nabla J\big(Pr(0), Pr(1), \ldots, Pr(N-1)\big)$ by convolution with an $N$-dimensional Gaussian density function $G_\beta(\eta)$, $\eta \in \mathbb{R}^N$
$\beta > 0$ is a small constant

Figure: Smoothing of $f(x)$ by convolution, $f(x) * G_\beta(\eta)$

The gradient estimate is

$$D^\beta J_{X_0}(P_r) = \frac{1}{\beta}\, E\left[ \frac{\eta}{2}\Big( J_{X_0}(P_r + \beta\eta) - J_{X_0}(P_r - \beta\eta) \Big) \;\Big|\; P_r \right]$$

SLIDE 24

Gradient Estimate

In our simulation-based algorithm, the expectation is replaced by samples obtained from simulation.

For large $L$ and small $\beta$,

$$\nabla J_{X_0}(P_r) \approx \frac{1}{\beta}\, \frac{1}{L} \sum_{n=1}^{L} \frac{\eta(n)}{2} \Big( J_{X_0}(P_r + \beta\eta(n)) - J_{X_0}(P_r - \beta\eta(n)) \Big)$$

$\| D^\beta J_{X_0}(P_r) - \nabla J_{X_0}(P_r) \| \to 0$ as $\beta \to 0$ and $L \to \infty$

SLIDE 25

Two-timescale Smoothed Functional (SF) Algorithm

1 Generate $\eta(n)$
2 Get $P_r^+ = P_r(n) + \beta\eta(n)$ and $P_r^- = P_r(n) - \beta\eta(n)$
3 Compute $X^+$ and $X^-$
4 Compute $\delta_i^+ = g\big(X_i^+(n), P_r^+(i)(n)\big)\big(X_i^+(n) - X_{i-1}^+(n)\big)$ and $\delta_i^- = g\big(X_i^-(n), P_r^-(i)(n)\big)\big(X_i^-(n) - X_{i-1}^-(n)\big)$, $0 \le i \le N-1$
5 Compute $\delta_j = \frac{\eta_j(n)}{2\beta} \sum_{i=j}^{N-1} \big( \delta_i^+ - \delta_i^- \big)$, $0 \le j \le N-1$; $\delta = (\delta_0, \ldots, \delta_{N-1})^\top$
6 Gradient estimate: $Y(n+1) = (1 - c(n))Y(n) + c(n)\delta$
7 Update price: $P_r(n+1) = P_r(n) - \alpha(n)Y(n+1)$
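A compressed sketch of this loop (a simplification, not the thesis algorithm verbatim): it perturbs the whole price vector, uses the total cost difference in place of the per-stage tail sums of step 5, and picks illustrative step sizes with $\alpha(n)$ decaying faster than $c(n)$ to give the two timescales.

```python
import numpy as np

def sf_search(J, N=400, n_iters=200, beta=0.1, seed=0):
    # J: noisy cost estimate for a length-N price vector (e.g. J_estimate above)
    rng = np.random.default_rng(seed)
    Pr = np.full(N, 100.0)                 # initial price trajectory (assumed)
    Y = np.zeros(N)                        # faster-timescale gradient estimate
    for n in range(1, n_iters + 1):
        alpha, c = 0.1 / n, 1.0 / n**0.6   # illustrative two-timescale step sizes
        eta = rng.standard_normal(N)       # step 1: Gaussian perturbation
        diff = J(Pr + beta * eta) - J(Pr - beta * eta)   # steps 2-4 folded into J
        delta = eta * diff / (2.0 * beta)  # simplified step 5 (total cost difference)
        Y = (1.0 - c) * Y + c * delta      # step 6: gradient-estimate recursion
        Pr = Pr - alpha * Y                # step 7: price update
    return Pr
```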

SLIDE 26

Performance of SF Algorithm

Figure: Evolution of cumulative demand. Figure: Optimal price trajectory.

Simulation parameters: $M = 10$, $\sigma = 0.1$, $u(X_j) = 100 - 0.2X_j$, $p = 0.02$, $q = 0.5$ and $\gamma = 0.005$
$L = 50{,}000$, $N = 400$, $h = 0.25$, $T = 100$
Myopic price is 140
Objective function values: myopic $= 472.6 \pm 0.9$, optimal $= 670 \pm 6$

SLIDE 27

Summary

We proposed a simulation-based algorithm for pricing a product over the interval $[0, T]$ when demand is uncertain
The algorithm utilised a smoothed functional gradient estimator to find the optimal pricing policy

Future Directions:

  • Simulation-based optimization methods for pricing in the presence of competing products

SLIDE 28

Overview

1 Product Pricing

   Problem Formulation, Algorithm, Results

2 Energy Sharing in Sensor Networks

   Problem Formulation, Algorithms, Approximation Algorithms, Results

SLIDE 29

Sensor Network

Figure: Sensor nodes and a fusion node

A group of autonomous sensing nodes
Sensed data is transmitted to a fusion node
The fusion node obtains data from multiple nodes and processes it

SLIDE 30

Conventional Sensor Node

Figure: Node timeline at $t = T_1$, $t = T_1 + T_2$, $t = T_1 + T_2 + T_3$

The node is powered by a non-rechargeable battery
Energy is used for sensing and transmission
The node stops operating once the battery is exhausted

SLIDE 31

Energy Harvesting Sensor Node

Figure: Node timeline at $t = T_1$, $t = T_1 + T_2$, $t = T_1 + T_2 + T_3$

The node replenishes energy by harvesting from the environment
The energy harvested is location and time dependent
Techniques are required to prevent energy starvation in nodes

SLIDE 32

Sensor Mote

SLIDE 34

Energy Sharing Problem

The system comprises $n$ sensors and a common energy harvesting (EH) source
Each sensor has a finite data buffer in which sensed data is stored
The EH source has a finite buffer to store the harvested energy
Nodes share the energy harvested by the EH source

Aim

To develop energy sharing algorithms which minimize the average delay in transmission of data from the nodes

SLIDE 35

Approach

Model the problem as an infinite-horizon average cost Markov decision process (MDP)
Control decisions are based on:

  • the amount of data in each of the sensors
  • the energy available in the EH source

Develop Reinforcement Learning (RL) algorithms which

  • optimally allocate energy to every sensor
  • work without requiring a system model
SLIDE 36

Reinforcement Learning

Environment: evolves probabilistically over states
Reinforcement: the cost incurred after performing an action in a state
Policy: determines which action to take at every state
Goal: learn a policy which minimizes the long-run average cost
The RL controller learns the policy which achieves this goal by a trial-and-error process

Figure: Controller-environment loop; in state $s_k$ the controller applies $T_k$, incurs cost $c(s_k, T_k)$ and moves to $s_{k+1}$

SLIDE 37

System Model

Figure: $n$ nodes with data buffer levels $q^i_k$ and arrivals $X^i_k$ draw energy $T^i_k$ from an EH source with buffer level $E_k$ and harvest $Y_k$; node $i$ transmits $\min(q^i_k, g(T^i_k))$ bits

Slotted, discrete-time model for the environment
$D_{MAX}$: Data buffer size of each sensor
$E_{MAX}$: Energy buffer size of the EH source

SLIDE 38

System Model

Figure: the same system diagram as on the previous slide

In slot $k$,

  • $E_k$: Energy buffer level; $q^i_k$, $1 \le i \le n$: Data buffer level of node $i$
  • The EH source harvests $Y_k$ units of energy
  • The source provides $T^i_k$ units of energy to node $i$, $1 \le i \le n$
  • Node $i$ generates $X^i_k$ bits of data and transmits $g(T^i_k)$ bits of data, $1 \le i \le n$

SLIDE 39

System Model (contd.)

Data buffer queue lengths evolve as

$$q^i_{k+1} = \big(q^i_k - g(T^i_k)\big)^+ + X^i_k, \quad 1 \le i \le n$$

where $\big(q^i_k - g(T^i_k)\big)^+ = \max\big(q^i_k - g(T^i_k),\, 0\big)$

The energy buffer queue length evolves as

$$E_{k+1} = \left( E_k - \sum_{i=1}^{n} T^i_k \right) + Y_k$$
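A one-slot sketch of these dynamics follows. Clipping at $D_{MAX}$ and $E_{MAX}$ is an assumption here: the slides state the buffers are finite but write the recursions without the clip; the defaults and conversion function are taken from the two-node results slide.

```python
import numpy as np

def buffer_step(q, E, T, X, Y, D_MAX=10, E_MAX=20,
                g=lambda t: 2.0 * np.log1p(t)):
    # q, T, X: length-n arrays; E, Y: scalars. g converts energy to bits.
    q_next = np.minimum(np.maximum(q - g(T), 0.0) + X, D_MAX)  # (q - g(T))^+ + X
    E_next = min(E - T.sum() + Y, E_MAX)                       # E - sum_i T^i + Y
    return q_next, E_next
```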

SLIDE 40

Model Assumptions

1 The generated data at time $k+1$, $X_{k+1} \triangleq (X^1_{k+1}, \ldots, X^n_{k+1})$, evolves as a jointly Markov process: $X_{k+1} = f^1(X_k, W_k)$, $k \ge 0$

  • $f^1$ is some arbitrary vector-valued function with $n$ components
  • $\{W_k, k \ge 1\}$ is a noise sequence with probability distribution $P(W_k \mid X_k)$ depending on $X_k$

2 The energy arrival process evolves as $Y_{k+1} = f^2(Y_k, V_k)$, $k \ge 0$

  • $f^2$ is some scalar-valued function
  • $\{V_k, k \ge 1\}$ is a noise sequence with probability distribution $P(V_k \mid Y_k)$ depending on $Y_k$
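For concreteness, here is one instance of these assumptions, matching the two-node results slide later in the deck: a linear map plus Poisson noise for the data process and an AR recursion plus Poisson noise for the energy arrivals (the specific Poisson means are illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[0.2, 0.3],
              [0.3, 0.2]])       # coupling matrix from the two-node results slide

def data_step(X_prev, mean_w=(1.0, 1.0)):
    # X_{k+1} = f1(X_k, W_k): linear map plus Poisson noise W_k
    return A @ X_prev + rng.poisson(mean_w)

def energy_step(Y_prev, b=0.5, mean_v=20.0):
    # Y_{k+1} = f2(Y_k, V_k): AR recursion plus Poisson noise V_k
    return b * Y_prev + rng.poisson(mean_v)
```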

SLIDE 42

MDP Formulation

State: $s_k = \big(q^1_k, q^2_k, \ldots, q^n_k, E_k, X^1_{k-1}, \ldots, X^n_{k-1}, Y_{k-1}\big)$, $s_k \in S$

Action: $T(s_k) = \big(T^1(s_k), T^2(s_k), \ldots, T^n(s_k)\big) \in A$ specifies the amount of energy to be given to each node at time $k$

Stationary Policy: $\pi = \{T, T, \ldots\}$

Single-Stage Cost: $c(s_k, T(s_k)) = \sum_{i=1}^{n} \big( q^i_k - g(T^i(s_k)) \big)^+$

Long-run average cost per step: $\lambda_\pi = \lim_{m \to \infty} E\left[ \frac{1}{m} \sum_{k=0}^{m-1} c(s_k, T(s_k)) \right]$

Objective

Find a stationary optimal policy $\pi^* = (T^*, T^*, \ldots)$ which minimizes the sum of remaining data queue lengths of all buffers. This policy also minimizes the mean delay of data transmission.

SLIDE 43

Q-learning

The Q-value of a state-action pair $(s, a)$ is $Q(s, a)$
Initially, $Q(s, a) = 0 \;\forall s \in S, a \in A$; $i_r$: reference state
Simulate the MDP for a large number of iterations
At iteration $j$, for the state-action pair visited during simulation, the Q-value is updated:

$$Q_{j+1}(s, a) = (1 - \alpha(j))Q_j(s, a) + \alpha(j)\left( c(s, a) + \min_{b \in A(s')} Q_j(s', b) - \min_{u \in A(i_r)} Q_j(i_r, u) \right)$$

The step-size sequence $\alpha(j) > 0$, $\forall j \ge 0$, satisfies

$$\sum_{j} \alpha(j) = \infty \quad \text{and} \quad \sum_{j} \alpha^2(j) < \infty.$$
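A dictionary-based sketch of this update (the tabular data structure and helper names are assumptions, not from the slides):

```python
from collections import defaultdict

def q_update(Q, s, a, cost, s_next, actions, i_r, alpha):
    # Relative Q-learning for the average-cost criterion: the value of the
    # reference state i_r is subtracted in every update.
    # Q: defaultdict(float) keyed by (state, action); actions(s) lists A(s).
    best_next = min(Q[(s_next, b)] for b in actions(s_next))
    ref_value = min(Q[(i_r, u)] for u in actions(i_r))
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (cost + best_next - ref_value)

Q = defaultdict(float)   # Q(s, a) = 0 initially, as on the slide
```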

SLIDE 44

Q-Learning: Exploration Mechanisms

$\epsilon$-greedy Method: At state $s$, in iteration $j$, select a random action with probability $\epsilon$ and the greedy action with probability $1 - \epsilon$

UCB Method:

$N_s(j)$: number of times state $s$ is visited up to time $j$
$N_{s,a}(j)$: number of times action $a$ is picked in state $s$ up to time $j$
The Q-value of state-action pair $(s, a)$ at time $j$ is $Q_j(s, a)$
The action for the current state $s$ is chosen using the rule

$$a' = \arg\max_{a \in A(s)} \left( Q_j(s, a) + \beta \sqrt{\frac{\ln N_s(j)}{N_{s,a}(j)}} \right)$$
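A sketch of the UCB rule as written on the slide (note it takes the arg max of $Q$ plus the bonus; with a cost-minimizing $Q$, one common variant instead takes the arg min of $Q$ minus the bonus). Treating untried actions as having infinite score is an assumption:

```python
import math

def ucb_action(Q, s, actions, N_s, N_sa, beta=1.0):
    # N_s: dict state -> visit count; N_sa: dict (state, action) -> pick count
    def score(a):
        if N_sa[(s, a)] == 0:
            return math.inf          # force each action to be tried once (assumed)
        return Q[(s, a)] + beta * math.sqrt(math.log(N_s[s]) / N_sa[(s, a)])
    return max(actions(s), key=score)
```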

SLIDE 46

Q-learning: Optimal Policy

The Q-learning algorithm converges to the optimal value function $Q^*$
The optimal action $a^*$ for state $s$ is obtained by $a^* = \arg\min_{a' \in A(s)} Q^*(s, a')$
Optimal Policy: $T^*(s) = a^*$, $\forall s \in S$

Issues

  • Need a lookup table to store and update the Q-value for every $(s, a)$ tuple
  • Computationally expensive: with $n = 2$, $|S \times A| \approx 30^6$
  • Exacerbated with more sensors: with $n = 4$, $|S \times A| \approx 30^{10}$

SLIDE 47

Approximation Approaches

1 State and action space aggregation

The concept is to cluster states and actions based on the monotonicity property of the value function
Define a Q-value for each aggregate state-action pair
Results in cardinality reduction

2 Policy Approximation

Need to search in the space of all policies
The aim is to find a near-optimal stationary randomized policy

SLIDE 48

Property of Value Function

Consider two nodes and a source
Let $H^*(q_1, q_2, E)$ be the differential value of state $(q_1, q_2, E)$
For $q_1 < q_1^L \le D_{MAX}$ and $E_{MAX} \ge E^L > E$, $H^*$ is monotonic:

$$H^*(q_1, q_2, E) \le H^*(q_1^L, q_2, E)$$
$$H^*(q_1, q_2, E) \ge H^*(q_1, q_2, E^L)$$

The value function of a clustered state is the average of the value functions of the individual states
The value function of the aggregated state will be close to the value function of the unaggregated state if the difference between values of states in a cluster is small

SLIDE 49

State and Action Space Aggregation

Figure: Data and energy buffers of sizes $D_{MAX}$ and $E_{MAX}$ split into partitions

Data buffer partitions: sets $d_1, d_2, \ldots, d_s$, where $d_i = (x^i_L, x^i_U)$
Energy buffer partitions: sets $e_1, e_2, \ldots, e_r$, where $e_j = (y^j_L, y^j_U)$

$0 = x^1_L < x^1_U < x^2_L < x^2_U < \ldots < x^s_L < x^s_U = D_{MAX}$, with $x^i_U + 1 = x^{i+1}_L$, $1 \le i \le s-1$
$0 = y^1_L < y^1_U < y^2_L < y^2_U < \ldots < y^r_L < y^r_U = E_{MAX}$, with $y^i_U + 1 = y^{i+1}_L$, $1 \le i \le r-1$

SLIDE 50

Cardinality Reduction

Aggregate state: $s' = \{l_1, \ldots, l_{n+1}, l_{n+2}, \ldots, l_{2n+1}, l_{2n+2}\}$

  • $l_i$ is the data buffer level in the $i$th node, $1 \le i \le n$
  • $l_{n+1}$ is the energy buffer level
  • $l_i \in \{1, \ldots, s\}$ for $1 \le i \le n$ and $n+2 \le i \le 2n+1$
  • $l_{n+1}, l_{2n+2} \in \{1, \ldots, r\}$

Aggregate action for state $s'$ is $t' = (t_1, \ldots, t_n)$, where $t_i \in \{1, \ldots, l_{n+1}\}$, $1 \le i \le n$

$s \ll D_{MAX}$, $r \ll E_{MAX}$
$|S' \times A'| \approx 4^{10}$ with four nodes, $E_{MAX} = D_{MAX} = 30$ and four partitions of the data and energy buffers
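A small sketch of the level-to-partition mapping that produces the aggregate state (the concrete partition endpoints below are assumptions; the slides only fix the number of partitions):

```python
import bisect

def partition_index(level, upper_bounds):
    # upper_bounds: sorted upper endpoints x^1_U < ... < x^s_U = MAX;
    # returns the 1-based index of the partition containing `level`.
    return bisect.bisect_left(upper_bounds, level) + 1

# Example with D_MAX = 30 and four partitions (endpoints assumed):
data_upper = [7, 15, 22, 30]
print(partition_index(16, data_upper))   # -> 3: level 16 falls in the third partition
```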

SLIDE 51

Q-learning with State Aggregation

$S'$: Aggregate state space
$A'$: Aggregate action space
Q-value $Q(s', t')$: defined for every aggregate state-action tuple
Initially $Q(s', t') = 0$, updated using the rule

$$Q_{j+1}(s', a') = (1 - \alpha(j))Q_j(s', a') + \alpha(j)\left( c(s', a') + \min_{b \in A'(v')} Q_j(v', b) - \min_{u \in A'(r')} Q_j(r', u) \right)$$

SLIDE 52

Policy Approximation

The set of policies is approximated by a parameter vector $\theta = (\theta_1, \ldots, \theta_M)$
The class of parameterized random policies is $\{\pi_\theta, \theta \in \mathbb{R}^M\}$
$\pi_\theta(s', a')$: Probability of picking aggregate action $a'$ in aggregate state $s'$
Goal: Tune $\theta$ such that $\pi_\theta$ is near optimal
Search the policy space in an organised manner

SLIDE 53

Cross-Entropy Method

Parameterized Boltzmann Policies

$\phi_{sa} \in \mathbb{R}^M$: Feature vector for aggregate state-action tuple $(s, a)$

$$\pi_\theta(s, a) = \frac{e^{\theta^\top \phi_{sa}}}{\sum_{b \in A(s)} e^{\theta^\top \phi_{sb}}} \quad \forall s \in S', \forall a \in A'(s)$$

$\theta = (\theta_1, \ldots, \theta_M)$, $\theta_i \sim N(\mu_i, \sigma_i)$, where $1 \le i \le M$

Approach

Generate $N$ parameter vectors $(\theta^1, \ldots, \theta^N)$
Simulate an MDP trajectory using each parameter vector $\theta^j$
Compute the average cost $\lambda_j$ of each policy $\pi_{\theta^j}$
Choose trajectory $j$ if $\lambda_j < \lambda_c$
Update $(\mu_i, \sigma_i)$ using the parameter vectors corresponding to these trajectories

SLIDE 54

Cross-Entropy Method (contd.)

A quantile value $\rho \in (0, 1)$ is selected
The average cost values are sorted in descending order. Let $\lambda_1, \ldots, \lambda_N$ be the sorted order, so $\lambda_1 \ge \ldots \ge \lambda_N$
The average cost $\lambda_c = \lambda_{\lceil (1-\rho)N \rceil}$ is picked as the threshold level
Meta-parameters $\{(\mu^t_i, \sigma^t_i), 1 \le i \le M\}$ are updated after iteration $t$ as follows:

$$\mu^{(t+1)}_i = \frac{\sum_{j=1}^{N} I_{\{\lambda_j \le \lambda_c\}}\, \theta^j_i}{\sum_{j=1}^{N} I_{\{\lambda_j \le \lambda_c\}}}, \qquad \sigma^{2\,(t+1)}_i = \frac{\sum_{j=1}^{N} I_{\{\lambda_j \le \lambda_c\}} \left( \theta^j_i - \mu^{(t+1)}_i \right)^2}{\sum_{j=1}^{N} I_{\{\lambda_j \le \lambda_c\}}}.$$
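One iteration of this update in NumPy (a sketch; `np.quantile` stands in for the explicit descending sort and threshold selection):

```python
import numpy as np

def ce_update(thetas, costs, rho=0.1):
    # thetas: (N, M) sampled parameter vectors; costs: length-N average costs.
    lam_c = np.quantile(costs, rho)        # threshold lambda_c (rho-quantile of cost)
    elite = thetas[costs <= lam_c]         # trajectories with lambda_j <= lambda_c
    return elite.mean(axis=0), elite.std(axis=0)   # new (mu_i, sigma_i) per coordinate
```

New parameter vectors for the next iteration are then drawn as $\theta_i \sim N(\mu_i, \sigma_i)$.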

SLIDE 55

Comparison Algorithms Used

1 Greedy method (a sketch of this baseline follows after this list)

  • Takes as input $(q^1_k, \ldots, q^n_k)$
  • Energy available in the source is $E_k$
  • Number of energy units required to transmit $q^i_k$ bits of data $= g^{-1}(q^i_k)$
  • Energy provided is $t_k = \min\left( E_k,\; \sum_{i=1}^{n} g^{-1}(q^i_k) \right)$

2 Combined nodes method

  • State: $w_k = \left( \sum_{i=1}^{n} q^i_k,\; E_k \right)$
  • Action: $v_k$ = total energy that needs to be distributed between the nodes
  • Uses the Q-learning update rule to update $Q(w_k, v_k)$
  • Finds the total optimal energy to be supplied, but not the exact split
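A sketch of the greedy baseline in item 1. The inverse conversion $g^{-1}$ matches $g(t) = 2\ln(1+t)$ from the two-node results slide; the proportional split when energy is short is an assumption, since the slide only specifies the capped total:

```python
import numpy as np

def greedy_allocate(q, E, g_inv=lambda b: np.expm1(b / 2.0)):
    # g_inv inverts g(t) = 2 ln(1 + t): g^{-1}(b) = e^{b/2} - 1
    need = g_inv(np.asarray(q, dtype=float))   # energy needed to drain each buffer
    total = need.sum()
    if total <= E or total == 0.0:
        return need                            # enough energy: give each node its demand
    return need * (E / total)                  # otherwise ration proportionally (assumed)
```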
SLIDE 56

Results - Two Nodes and EH Source

Figure: Performance comparison of policies (normalized average cost vs. $E[\omega_1]$) for Greedy, Combined Nodes QL, QL $\epsilon$-greedy and Q-UCB

$E_{MAX} = 20$, $D_{MAX} = 10$
$X_k = AX_{k-1} + \omega$, $\omega = (\omega_1, \omega_2)^\top$, $E[\omega_2] = 1.0$, $A = \begin{pmatrix} 0.2 & 0.3 \\ 0.3 & 0.2 \end{pmatrix}$
$Y_k = bY_{k-1} + \xi$, $b = 0.5$, $E[\xi] = 20$
$\xi$, $\omega$ are Poisson distributed
Conversion function: $g(T^i_k) = 2\ln(1 + T^i_k)$

SLIDE 57

Results - Four Nodes and EH Source

Figure: Performance comparison of policies (normalized average cost vs. $E[X^1]$) for Greedy, Combined Nodes QL, Cross Entropy, QL-SA $\epsilon$-greedy and QL-SA-UCB

$E_{MAX} = D_{MAX} = 30$
$X^1, X^2, X^3, X^4, Y$ are Poisson distributed
$E[X^2] = E[X^3] = E[X^4] = 1.0$, $E[Y] = 25$; $E[X^1]$ is varied
Six partitions of the energy and data buffers; $\epsilon = 0.1$
Conversion function: $g(T^i_k) = \ln(1 + T^i_k)$

SLIDE 58

Summary

We proposed algorithms to manage the energy available through harvesting
To deal with the curse of dimensionality, we also developed approximation algorithms

Future Directions:

  • Gradient-based approaches
  • Tuning the features used in the approximation methods
SLIDE 59

Questions?

SLIDE 60

References I

  • S. Bhatnagar, V. S. Borkar, and K. J. Prabuchandran, "Feature search in the Grassmannian in online reinforcement learning," IEEE Journal of Selected Topics in Signal Processing, vol. 7, pp. 746–758, 2013.
  • I. Menache, S. Mannor, and N. Shimkin, "Basis function adaptation in temporal difference reinforcement learning," Annals of Operations Research, vol. 134, no. 1, pp. 215–238, 2005.
  • K. J. Prabuchandran, S. K. Meena, and S. Bhatnagar, "Q-learning based energy management policies for a single sensor node with finite buffer," IEEE Wireless Communications Letters, vol. 2, no. 1, pp. 82–85, 2013.
  • K. Raman and R. Chatterjee, "Optimal monopolist pricing under demand uncertainty in dynamic markets," Management Science, pp. 144–162, 1995.
  • V. Sharma, U. Mukherji, V. Joseph, and S. Gupta, "Optimal energy management policies for energy harvesting sensor nodes," IEEE Transactions on Wireless Communications, vol. 9, no. 4, pp. 1326–1336, 2010.

SLIDE 61

References II

  • S. Chakravarty, S. Padakandla, and S. Bhatnagar, "A simulation-based algorithm for optimal pricing policy under demand uncertainty," International Transactions in Operational Research, 2013. [Online]. Available: http://dx.doi.org/10.1111/itor.12064
  • S. Padakandla, K. J. Prabuchandran, and S. Bhatnagar, "Energy sharing for multiple sensor nodes with finite buffers," IEEE Transactions on Communications, 2014 (under review).

SLIDE 62

Thank You