
ADAPTIVE POWER MANAGEMENT OF ENERGY HARVESTING SENSOR NODES USING REINFORCEMENT LEARNING: A Comparison of Q-Learning and SARSA Algorithms


SLIDE 1

ADAPTIVE POWER MANAGEMENT OF ENERGY HARVESTING SENSOR NODES USING REINFORCEMENT LEARNING

SHASWOT SHRESTHAMALI, MASAAKI KONDO, HIROSHI NAKAMURA

A Comparison of Q-Learning and SARSA Algorithms

THE UNIVERSITY OF TOKYO

Japanese title: A Comparative Evaluation of Reinforcement Learning Strategies for Energy-Harvesting Sensor Nodes with Adaptive Power Control

SWoPP 2017

SLIDE 2

INTRODUCTION

• Use Reinforcement Learning (RL) for power management in Energy Harvesting Sensor Nodes (EHSN)
  ▫ Adaptive control behavior
  ▫ Near-optimal performance
• Comparison between different RL algorithms
  ▫ Q-Learning
  ▫ SARSA

SLIDE 3

ENERGY HARVESTING SENSOR NODE CONCEPT


• CONSTRAINTS
  • The sensor node has to be operating at ALL times
  • The battery cannot be completely depleted
  • The battery cannot be overcharged (exceed 100%)
  • Battery size is finite
  • Charging/discharging rates are finite

[Node block diagram: harvested energy feeds a Power Manager, which supplies the MCU, Sensor, RF Transceiver, Memory, and Mixed Signal Circuits.]
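These constraints can be captured in a simple battery-update routine. The following is a minimal sketch; the function name and all numeric limits are illustrative assumptions, not values taken from the slides.

```python
# Illustrative battery model enforcing the constraints listed above.
BATTERY_CAPACITY_MWH = 20000.0        # finite battery size (assumed value)
MAX_CHARGE_MWH_PER_HOUR = 2000.0      # finite charging rate (assumed value)
MAX_DISCHARGE_MWH_PER_HOUR = 500.0    # finite discharging rate (assumed value)

def step_battery(battery_mwh, harvested_mwh, consumed_mwh):
    """Advance the battery by one hour while respecting the constraints."""
    charge = min(harvested_mwh, MAX_CHARGE_MWH_PER_HOUR)
    discharge = min(consumed_mwh, MAX_DISCHARGE_MWH_PER_HOUR)
    battery_mwh = battery_mwh + charge - discharge
    # Battery cannot be overcharged (capped at 100% of capacity).
    battery_mwh = min(battery_mwh, BATTERY_CAPACITY_MWH)
    # Battery cannot be completely depleted: the node must run at ALL times.
    if battery_mwh <= 0.0:
        raise RuntimeError("battery depleted: node would stop operating")
    return battery_mwh
```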

SLIDE 4

OBJECTIVE: NODE-LEVEL ENERGY NEUTRALITY

[Illustration: Energy Harvested balanced against Energy Consumed.]

• We want to use ALL the energy that is harvested.
• One way of achieving that is by ensuring node-level energy neutrality – the condition in which the amount of energy harvested equals the amount of energy consumed.
• Autonomous perpetual operation can then be achieved.
SLIDE 5

http://www.mdpi.com/sensors/sensors-12-02175/article_deploy/html/images/sensors-12-02175f5-1024.png

CHALLENGES: DIFFERENT SENSORS, MOVING SENSORS, DIFFERENT ENVIRONMENTS

Image sources: Environmental Sensor Networks – P.I. Corke et al.; https://sites.google.com/site/sarmavrudhula/home/research/energy-management-of-wireless-sensor-networks

SLIDE 6

SOLUTION

Preparing heuristic, user-defined contingency solutions for all possible scenarios is impractical.

We want a one-size-fits-all solution: sensor nodes that are capable of
  • autonomously learning optimal strategies, and
  • adapting once they have been deployed in the environment.

SLIDE 7

SOLUTION


➢ Use RL for adaptive control
➢ Use a solar energy harvesting sensor node as a case example

SLIDE 8

Q-Learning Results (ETNET 2017)

[Bar chart: Efficiency (%) and Energy Wasted (%) for three approaches – Naïve (duty cycle proportional to battery level), Kansal (duty cycle fixed for the present day by predicting the next day's total energy), and our RL-based method, which achieves higher efficiency and lower waste.]

Efficiency = Actual Duty Cycle / Achievable Maximum Duty Cycle
Energy Wasted = Total Energy Wasted / Total Energy Harvested

Energy Waste = Energy Harvested − Node Energy − Charging Energy
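These metrics translate directly into code. A minimal sketch, assuming the per-day totals are already available; the function and argument names are mine, not the authors':

```python
def efficiency(actual_duty_cycle, achievable_max_duty_cycle):
    # Efficiency = Actual Duty Cycle / Achievable Maximum Duty Cycle
    return actual_duty_cycle / achievable_max_duty_cycle

def wasted_fraction(total_energy_wasted, total_energy_harvested):
    # Energy Wasted (%) = Total Energy Wasted / Total Energy Harvested
    return total_energy_wasted / total_energy_harvested

def energy_waste(energy_harvested, node_energy, charging_energy):
    # Energy Waste = Energy Harvested - Node Energy - Charging Energy
    return energy_harvested - node_energy - charging_energy
```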

SLIDE 9

Q-Learning (ETNET 2017)

❑ Demonstrated that RL approaches outperform traditional methods.
❑ Limitations
  • State explosion
    ▫ 200 × 5 × 6 states
    ▫ The Q-table becomes too large to train using a random policy
  • Long training times
    ▫ Required 10 years' worth of training data
  • The reward function did not reflect the true objective of energy neutrality.

SLIDE 10

REINFORCEMENT LEARNING

IN A NUTSHELL

SLIDE 11

REINFORCEMENT LEARNING

[Agent–environment loop: the agent (power manager) asks "What action should I take to accumulate the maximum total reward?", observes the battery level, energy harvested, and weather forecast, chooses a duty cycle (action), and receives a reward and a new state from the environment.]

• Type of Machine Learning based on experience rather than instruction
• Map situations (states) into actions – and receive as much reward as possible
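The loop described above can be sketched as follows. The env/agent interfaces (reset, step, choose_duty_cycle, learn) are hypothetical placeholders for illustration, not the actual implementation:

```python
def run_episode(env, agent, hours=24):
    """One day of interaction between the power manager (agent) and its environment."""
    state = env.reset()                          # (battery level, energy harvested, forecast)
    total_reward = 0.0
    for _ in range(hours):
        action = agent.choose_duty_cycle(state)  # pick a duty cycle for this hour
        next_state, reward = env.step(action)    # environment returns reward and new state
        agent.learn(state, action, reward, next_state)
        state = next_state
        total_reward += reward
    return total_reward
```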

SLIDE 12

REINFORCEMENT LEARNING


• IMPORTANT CONCEPTS
  ▫ Q-VALUE
  ▫ ELIGIBILITY TRACES

SLIDE 13

Q-VALUE

• To give a measure of the “goodness” of an action in a particular state, we assign each state-action pair a Q-value: Q(state, action).
• Learned from past (training) experiences.
• Higher Q-value → better choice of action for that state.
• The Q(s, a) value is the expected cumulative reward that you can get starting from state s and taking action a.

[Diagram: from state s_i, actions a_1, a_2, a_3 lead to states s_j, s_k, s_l with rewards r_1, r_2, r_3 and values Q(s_i, a_1), Q(s_i, a_2), Q(s_i, a_3).]

SLIDE 14

Q-VALUE

(Same text as Slide 13; the diagram highlights the single transition from state s_i via action a_2 to state s_j, with reward r_2 and value Q(s_i, a_2).)
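As a concrete illustration of a tabular Q-function, here is a small NumPy sketch; the table sizes are placeholders, not the deck's actual state encoding:

```python
import numpy as np

N_STATES, N_ACTIONS = 100, 5          # placeholder sizes (assumptions)
Q = np.zeros((N_STATES, N_ACTIONS))   # Q(s, a): expected cumulative reward

def best_action(state):
    # A higher Q-value means a better choice of action for that state.
    return int(np.argmax(Q[state]))
```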

SLIDE 15

LEARNING Q-VALUES


TO FIND Q(s_k, a_k)

• Start with arbitrary guesses for Q(s_k, a_k).
• Update Q(s_k, a_k) incrementally towards the target value (bootstrapping).
• General update rule:

NewEstimate ← OldEstimate + StepSize × [Target − OldEstimate]
NewEstimate ← (1 − StepSize) × OldEstimate + StepSize × Target
Q(s_k, a_k) ← (1 − α) × Q(s_k, a_k) + α × Target

Target = ?
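The general update rule is a one-liner in code. A minimal sketch, with the step size as an assumed example value:

```python
def update_estimate(old_estimate, target, step_size=0.1):
    # NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)
    # equivalently: (1 - StepSize) * OldEstimate + StepSize * Target
    return old_estimate + step_size * (target - old_estimate)

# e.g. Q[s, a] = update_estimate(Q[s, a], target)
```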

SLIDE 16

SARSA VS Q-LEARNING


• The agent starts at state s_k and takes some action a_k according to policy π.
• It receives a reward r_k and is transported to a new state s_{k+1}.

SARSA
• The agent considers taking the next action a_{k+1}.
• The Q-value Q(s_k, a_k) is then updated.

Q-LEARNING
• The agent assumes the next action will be the action with the highest Q-value.
• The Q-value Q(s_k, a_k) is then updated.

• An ε-greedy policy is used, i.e. random actions are taken with probability ε to allow exploration.

Q-Learning:  Q(s_k, a_k) ← (1 − α) Q(s_k, a_k) + α [ r_k + γ max_a Q(s_{k+1}, a) ]
SARSA:       Q^π(s_k, a_k) ← (1 − α) Q^π(s_k, a_k) + α [ r_k + γ Q^π(s_{k+1}, a_{k+1}) ]
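Written as code, the two update rules differ only in how the bootstrap target is formed. A minimal sketch for a tabular Q indexed as Q[state, action]; the alpha and gamma defaults are assumptions:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Off-policy target: assume the best next action, max_a' Q(s', a').
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy target: use the next action a' actually chosen by the policy.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
```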

SLIDE 17

SARSA VS Q-LEARNING

General rule:  NewEstimate ← (1 − StepSize) × OldEstimate + StepSize × Target

SARSA:       Q^π(s_k, a_k) ← (1 − α) Q^π(s_k, a_k) + α [ r_k + γ Q^π(s_{k+1}, a_{k+1}) ]
Q-Learning:  Q(s_k, a_k) ← (1 − α) Q(s_k, a_k) + α [ r_k + γ max_a Q(s_{k+1}, a) ]

SLIDE 18

SARSA VS Q-LEARNING

SARSA
• On-policy learning: updates the policy it is using during training.
• The update is carried out by considering the next action to be taken.
• Faster convergence, but requires an initial policy.
• Easier to integrate with function approximation.

Q-Learning
• Off-policy learning: the final learned policy is the same regardless of the training method.
• Assumes the best action will always be taken.
• Takes longer to converge.
• Difficult to integrate with linear function approximation.

                       SARSA             Q-Learning
Choosing next action   ε-greedy policy   ε-greedy policy
Updating Q             ε-greedy policy   Greedy policy
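The two selection rules in the table can be sketched as follows; the epsilon default is an assumed example value:

```python
import random
import numpy as np

def epsilon_greedy_action(Q, state, epsilon=0.1):
    # Explore with probability epsilon, otherwise exploit the best-known action.
    if random.random() < epsilon:
        return random.randrange(Q.shape[1])
    return int(np.argmax(Q[state]))

def greedy_action(Q, state):
    # Always take the action with the highest Q-value (used in Q-learning's update target).
    return int(np.argmax(Q[state]))
```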

SLIDE 19

ELIGIBILITY TRACES


[Diagram: a sequence of 24 state-action pairs, (State 1, Action 1) → (State 2, Action 2) → … → (State 24, Action 24), is followed by a single REWARD; Q(State 1, Action 1) through Q(State 24, Action 24) all need to be updated.]

• In our model, one action is taken every hour and the reward is awarded at the end of 24 hours. A single action cannot justify the reward at the end; a series of 24 state-action pairs is responsible for the reward.
• To update the Q-values of the appropriate state-action pairs, we introduce a memory variable, e(s, a), called the eligibility trace.
• e(s, a) for ALL state-action pairs decays by λ at every time step.
• If the state-action pair (s_k, a_k) is visited, e(s_k, a_k) is incremented by one.
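A minimal sketch of the eligibility-trace bookkeeping described above, assuming the traces e are stored in an array shaped like Q; the lambda default is an assumed example value:

```python
import numpy as np

def update_traces(e, s, a, lam=0.9):
    """Decay all traces, then mark the visited state-action pair."""
    e *= lam          # e(s, a) for ALL pairs decays by lambda every time step
    e[s, a] += 1.0    # the visited pair (s_k, a_k) is incremented by one
    return e
```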
SLIDE 20

SARSA(λ) AND Q(λ)


  • SARSA() – integrate eligibility traces with SARSA algorithm
  • Q() – integrate eligibility traces with Q-Learning algorithm
  • , 0 < 𝜇 < 1, is the strength with which Q-values of early

contributing state-action pairs are updated as a consequence

  • f the final reward.
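Combining the trace with the SARSA update gives a SARSA(λ)-style step. This is a sketch under the decay rule stated on the previous slide, not the authors' exact implementation; the alpha, gamma and lambda defaults are assumptions:

```python
import numpy as np

def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next,
                      alpha=0.1, gamma=0.9, lam=0.9):
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]   # TD error for the SARSA target
    e[s, a] += 1.0                                    # mark the visited state-action pair
    Q += alpha * delta * e    # spread the update over all eligible pairs
    e *= lam                  # decay all traces, as described on the previous slide
    return Q, e
```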
SLIDE 21

ADAPTIVE POWER CONTROL USING REINFORCEMENT LEARNING ALGORITHMS


  • SARSA() – SARSA with eligibility traces
  • SARSA
  • Q() – Q-Learning with eligibility traces
  • Q-Learning
SLIDE 22

STATE DEFINITION

The state has four components: distance from energy neutrality S_dist(t_k), battery level S_batt(t_k), harvested energy S_eharvest(t_k), and weather forecast S_day(t_k).

S_dist(t_k): −20000 mWh, −19000 mWh, …, 0 mWh, …, 19000 mWh, 20000 mWh
S_batt(t_k): Low (< 20%), Mid (20% to 80%), High (> 80%)
S_eharvest(t_k): 0 mWh, 0–100 mWh, 100–500 mWh, 500–1000 mWh, 1000–1500 mWh, 1500–2000 mWh, > 2000 mWh
S_day(t_k): Very little sun, Overcast, Partly Cloudy, Fair, Sunny, Very Sunny

State at t_k = ( S_dist(t_k), S_batt(t_k), S_eharvest(t_k), S_day(t_k) )
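A sketch of how the four components might be discretized in code. The bucket boundaries follow the table above, while the helper names and the rounding of S_dist to 1000 mWh steps are assumptions:

```python
def batt_bucket(battery_pct):
    # Low (< 20%), Mid (20% to 80%), High (> 80%)
    if battery_pct < 20:
        return "Low"
    return "Mid" if battery_pct <= 80 else "High"

def harvest_bucket(harvested_mwh):
    # 0, 0-100, 100-500, 500-1000, 1000-1500, 1500-2000, > 2000 mWh
    for i, upper in enumerate([0, 100, 500, 1000, 1500, 2000]):
        if harvested_mwh <= upper:
            return i
    return 6

def make_state(dist_mwh, battery_pct, harvested_mwh, forecast):
    # State at t_k = (S_dist, S_batt, S_eharvest, S_day)
    s_dist = int(round(dist_mwh, -3))   # assumed 1000 mWh granularity
    return (s_dist, batt_bucket(battery_pct), harvest_bucket(harvested_mwh), forecast)
```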

SLIDE 23

ACTION SPACE


• Choose the duty cycle of the sensor node

A = { a(t_k) ∈ {1, 2, 3, 4, 5} }

ACTION a(t_k)   DUTY CYCLE (%)   ENERGY CONSUMED PER HOUR (mWh)
1               20               100
2               40               200
3               60               300
4               80               400
5               100              500
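The action table as a simple lookup (values copied from the slide; the dictionary and function names are mine):

```python
DUTY_CYCLE_PCT = {1: 20, 2: 40, 3: 60, 4: 80, 5: 100}
ENERGY_PER_HOUR_MWH = {1: 100, 2: 200, 3: 300, 4: 400, 5: 500}

def apply_action(a):
    """Return (duty cycle %, energy consumed per hour in mWh) for action a in {1..5}."""
    return DUTY_CYCLE_PCT[a], ENERGY_PER_HOUR_MWH[a]
```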

SLIDE 24

REWARD FUNCTION


• Awarded at the end of an episode (day).
• Each episode consists of 24 one-hour epochs.
• We want the net energy difference between the initial and final battery levels to be zero.
• Use a reward scheme that depends on the Energy Neutral Performance (ENP) at the end of the episode (t_k = T).
• Energy Neutral Performance can be defined here as
  ▫ |Initial battery level − Final battery level|
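A minimal sketch of an episode-end reward based on ENP. The slides define ENP but not the exact reward shaping, so the mapping from ENP to reward below is an assumption (smaller ENP is rewarded more):

```python
def energy_neutral_performance(initial_battery_mwh, final_battery_mwh):
    # ENP = |Initial battery level - Final battery level|
    return abs(initial_battery_mwh - final_battery_mwh)

def episode_reward(initial_battery_mwh, final_battery_mwh):
    # Assumed shaping: reward increases (towards zero) as ENP approaches zero.
    return -energy_neutral_performance(initial_battery_mwh, final_battery_mwh)
```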

SLIDE 25

TRAINING AND TESTING


• Training: Tokyo, Year 2010
• Testing: Tokyo, Year 2010/2011; Wakkanai, Year 2010/2011
• Wakkanai has a much colder climate than Tokyo and receives much less solar radiation.
• We observe the adaptive behavior of our solution when the deployment location differs from the location it was trained on.

SLIDE 26

RESULTS


SLIDE 27

SARSA VS Q-LEARNING


[Plot: battery percentage over the days of 2011 (Tokyo) for SARSA(λ) (with eligibility traces), SARSA, Q(λ) (with eligibility traces), and Q-Learning. When the battery overflows it has to be reset to the initial level (60%); the SARSA(λ) battery profile does not need to be reset.]

SLIDE 28

ENERGY NEUTRAL OPERATION

[Plot: duty cycle (%) and battery (%) over 24 hours for Day 29, Tokyo 2011, together with the solar energy profile, comparing the proposed method against the optimal policy computed from non-causal data. Our method comes very close to the optimal solution.]

• SARSA(λ) compared with the Optimal Policy
• Optimal Policy
  ▫ Theoretical upper limit
  ▫ Calculated using future information and linear programming techniques
• The battery profiles for SARSA(λ) and the offline (optimal) policy are very similar.

SLIDE 29

SARSA VS Q-LEARNING


• Every day the battery is reset to the initial battery level.
• ENP (as a percentage of the maximum battery capacity, BMAX) is observed at the end of each day of the year.

ENP = Battery at 00:00 − Battery at 23:59
ENP = |60% of BMAX − Battery at 23:59|

SLIDE 30

SARSA VS Q-LEARNING


SLIDE 31

OBSERVATIONS


• SARSA(λ) – BEST PERFORMANCE.
• Q(λ) – WORST PERFORMANCE.
• The “high” learning rate causes Q-values to oscillate with large amplitudes, and the policy cannot converge.
• A lower learning rate gives better performance, but at the expense of longer learning times.
• SARSA methods have generally robust performance compared to Q-Learning.
• Using eligibility traces with SARSA enhances performance.
SLIDE 32

SUMMARY

• Adaptive control is achieved by using SARSA RL methods.
• Results from SARSA RL are near optimal.
• SARSA(λ) outperforms Q-Learning methods.

SLIDE 33

THANK YOU FOR LISTENING

ANY COMMENTS OR QUESTIONS ARE WELCOME

For further details about our work using SARSA(λ), please refer to our paper to be presented at EMSOFT 2017 and published in the ACM TECS journal.

Adaptive Power Management in Solar Energy Harvesting Node using Reinforcement Learning

shaswot@hal.ipc.i.u-tokyo.ac.jp

This work was partially supported by JSPS KAKENHI Grant Number 16K12405.