ADAPTIVE POWER MANAGEMENT OF ENERGY HARVESTING SENSOR NODES USING REINFORCEMENT LEARNING
SHASWOT SHRESTHAMALI, MASAAKI KONDO, HIROSHI NAKAMURA
A Comparison of Q-Learning and SARSA Algorithms
THE UNIVERSITY OF TOKYO
Comparative Evaluation of Reinforcement Learning Strategies for Energy-Harvesting Sensor Nodes with Adaptive Power Control
Use Reinforcement Learning (RL) for power management in Energy Harvesting Sensor Nodes (EHSN)
Adaptive control behavior
Near-optimal performance
Comparison between different RL algorithms
Q-Learning
SARSA
The node must operate at ALL times. Its battery must never be depleted, must never overflow (exceed 100%), and has only a finite capacity.
Node block diagram: RF Transceiver, MCU, Sensor, Power Manager, Memory, and Mixed Signal Circuits, all running on Harvested Energy.
The goal is energy-neutral operation: the node runs only on the energy it has harvested, i.e., the amount of energy harvested equals the amount of energy consumed.
Environmental Sensor Networks – P.I. Corke et al. https://sites.google.com/site/sarmavrudhula/home/research/energy-management-of-wireless-sensor-networks
Preparing heuristic, user-defined contingency solutions for all possible scenarios is impractical.
We want a one-size-fits-all solution: sensor nodes that are capable of adapting their power-management strategies on their own once deployed in the environment.
Chart: Efficiency (%) and Energy Wasted (%) compared for three approaches.
Naïve: duty cycle is proportional to the battery level.
Kansal: fix the duty cycle for the present day by predicting the total energy for the next day.
Our method using RL: higher efficiency, lower waste.
Efficiency (%) = Actual Duty Cycle / Achievable Maximum Duty Cycle
Energy Wasted (%) = Total Energy Wasted / Total Energy Harvested
Energy Waste = Energy Harvested − Node Energy − Charging Energy
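As an illustration only (the function and variable names below are placeholders, not taken from the slides), these metrics could be computed as:

def efficiency_pct(actual_duty_cycle, achievable_max_duty_cycle):
    # Efficiency (%) = actual duty cycle / achievable maximum duty cycle
    return 100.0 * actual_duty_cycle / achievable_max_duty_cycle

def energy_waste_mwh(harvested_mwh, node_energy_mwh, charging_energy_mwh):
    # Energy Waste = Energy Harvested - Node Energy - Charging Energy
    return harvested_mwh - node_energy_mwh - charging_energy_mwh

def energy_wasted_pct(total_wasted_mwh, total_harvested_mwh):
    # Energy Wasted (%) = total energy wasted / total energy harvested
    return 100.0 * total_wasted_mwh / total_harvested_mwh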
IN A NUTSHELL
The Agent (Power Manager) interacts with the Environment. OBSERVATIONS: Battery Level, Energy Harvested, Weather Forecast. ACTION: choose a Duty Cycle. The Environment returns a REWARD and the new state. The agent's question: what action should I take to accumulate the maximum possible total reward? (A sketch of this loop is given below.)
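A minimal sketch of this observe-act-reward loop, assuming hypothetical env and agent objects (observe, step, choose_duty_cycle, and update are placeholder names, not the authors' implementation):

def run_episode(env, agent, horizon=24):
    # One episode: the power manager picks a duty cycle every hour for a day.
    state = env.observe()                        # battery level, harvested energy, weather forecast
    total_reward = 0.0
    for _ in range(horizon):
        action = agent.choose_duty_cycle(state)  # ACTION: choose a duty cycle
        reward, next_state = env.step(action)    # environment returns REWARD and the new state
        agent.update(state, action, reward, next_state)
        total_reward += reward
        state = next_state
    return total_reward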
To quantify the “goodness” of an action in a particular state, we assign each state-action pair a Q-value: Q(state, action). Q-values are learned from (training) experiences. The action with the highest Q-value is the best action for that state. Q(s, a) estimates the total reward that you can get starting from state s and taking action a.
Diagram: from state s_i, actions a1, a2, a3 lead to states s_j, s_k, s_l with rewards r1, r2, r3; the corresponding values are Q(s_i, a1), Q(s_i, a2), Q(s_i, a3).
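A minimal tabular sketch of this idea (the state and action counts are placeholders, not the paper's):

import numpy as np

num_states, num_actions = 1000, 5
Q = np.zeros((num_states, num_actions))   # one Q-value per (state, action) pair

def best_action(state):
    # The best action for a state is the one with the highest Q-value.
    return int(np.argmax(Q[state]))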
TO FIND Q(s_k, a_k)
NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]
NewEstimate ← (1 − StepSize) × OldEstimate + StepSize × Target
Q(s_k, a_k) ← (1 − α) Q(s_k, a_k) + α × Target
SARSA
The update uses the next action a_k+1 that is actually taken, so the policy being followed is the one that is updated.
Q_π(s_k, a_k) ← (1 − α) Q_π(s_k, a_k) + α [r_k + γ Q_π(s_k+1, a_k+1)]
Q-LEARNING
The next action is assumed to be the action with the highest Q-value, regardless of exploration.
Q(s_k, a_k) ← (1 − α) Q(s_k, a_k) + α [r_k + γ max_a Q(s_k+1, a)]
Both follow the same pattern: NewEstimate ← (1 − StepSize) × OldEstimate + StepSize × Target
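The two update rules written as code for a tabular Q stored in a 2-D array indexed by [state, action]; the step size alpha and discount gamma values are illustrative, not the experimental settings:

import numpy as np

alpha, gamma = 0.1, 0.9   # illustrative step size and discount factor

def q_learning_update(Q, s, a, r, s_next):
    # Q-Learning: the target bootstraps from the greedy (maximum) Q-value in the next state.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

def sarsa_update(Q, s, a, r, s_next, a_next):
    # SARSA: the target bootstraps from the Q-value of the action actually taken next.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target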
SARSA
On-policy learning:
updates the policy it is using during training
Update is carried out by considering the next action to be taken
Faster convergence but requires an initial policy.
Easier to integrate with function approximation
Q-Learning
Off-policy learning:
the final learned policy is the same regardless of the exploration policy followed during training
Assumes the best actions will always be taken
Takes longer to converge
Difficult to integrate with linear function approximation
SARSA vs. Q-Learning:
Choosing the next action: SARSA uses an ε-greedy policy; Q-Learning uses an ε-greedy policy.
Updating Q: SARSA uses an ε-greedy policy; Q-Learning uses a greedy policy.
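A sketch of the ε-greedy selection used when choosing the next action (epsilon = 0.1 is an illustrative value, not the setting used in the experiments):

import random
import numpy as np

def epsilon_greedy(Q, state, num_actions=5, epsilon=0.1):
    # With probability epsilon explore a random action; otherwise exploit the best known one.
    if random.random() < epsilon:
        return random.randrange(num_actions)
    return int(np.argmax(Q[state]))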
Diagram: an episode passes through State 1 / Action 1, State 2 / Action 2, ..., State 24 / Action 24, and the REWARD arrives only at the end; Q(State 1, Action 1) through Q(State 24, Action 24) must all be updated.
All of the visited state-action pairs are responsible for the reward. To assign them credit, each state-action pair is given an additional variable, e(s, a), called the eligibility trace. When the reward arrives, all contributing state-action pairs are updated as a consequence.
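A sketch of a single SARSA(λ) step with accumulating eligibility traces, for tabular Q and E arrays of the same shape; the hyperparameter values are illustrative:

import numpy as np

alpha, gamma, lam = 0.1, 0.9, 0.8   # illustrative step size, discount, and trace-decay λ

def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next):
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]   # TD error of the current transition
    E[s, a] += 1.0                                    # mark (s, a) as eligible for credit
    Q += alpha * delta * E                            # all eligible pairs share the update
    E *= gamma * lam                                  # traces decay at every step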
State variables at time t_k: Distance from energy neutrality S_dist(t_k), Battery S_batt(t_k), Harvested Energy S_eharvest(t_k), Weather Forecast S_day(t_k).
S_dist(t_k): 0 mWh, ..., 19000 mWh, 20000 mWh
S_batt(t_k): Low (< 20%), Mid (20% to 80%), High (> 80%)
S_eharvest(t_k): 0 mWh, 0 to 100 mWh, 100 to 500 mWh, 500 to 1000 mWh, 1000 to 1500 mWh, 1500 to 2000 mWh, > 2000 mWh
S_day(t_k): Very little sun, Overcast, Partly Cloudy, Fair, Sunny, Very Sunny
State at t_k = (S_dist(t_k), S_batt(t_k), S_eharvest(t_k), S_day(t_k))
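A sketch of how the four observations could be discretized into such a state tuple; the bin edges follow the table above, while the 1000 mWh spacing for S_dist and the categorical forecast index are assumptions made for illustration:

def discretize_state(dist_mwh, battery_pct, harvested_mwh, forecast_index):
    s_batt = 0 if battery_pct < 20 else (1 if battery_pct <= 80 else 2)  # Low / Mid / High
    harvest_edges = [0, 100, 500, 1000, 1500, 2000]                      # mWh bin boundaries
    s_harvest = sum(harvested_mwh > edge for edge in harvest_edges)      # category 0 .. 6
    s_dist = min(int(dist_mwh // 1000), 20)   # assumed 1000 mWh bins up to 20000 mWh
    return (s_dist, s_batt, s_harvest, forecast_index)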
ACTION
A = a(t_k) ∈ {1, 2, 3, 4, 5}
Action | Duty Cycle (%) | Energy Consumed per Hour (mWh)
1 | 20 | 100
2 | 40 | 200
3 | 60 | 300
4 | 80 | 400
5 | 100 | 500
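The same table as a lookup structure, transcribed directly from the values above:

ACTIONS = {
    1: {"duty_cycle_pct": 20,  "energy_mwh_per_hour": 100},
    2: {"duty_cycle_pct": 40,  "energy_mwh_per_hour": 200},
    3: {"duty_cycle_pct": 60,  "energy_mwh_per_hour": 300},
    4: {"duty_cycle_pct": 80,  "energy_mwh_per_hour": 400},
    5: {"duty_cycle_pct": 100, "energy_mwh_per_hour": 500},
}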
REWARD
We want the difference between the initial and final battery levels to be zero, so the reward is based on the Energy Neutral Performance (ENP) at the end of the episode (t_k = T), defined as
▫ ENP = |Initial battery level − Final battery level|
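A sketch of the end-of-episode reward based on ENP; returning the negative ENP (so that smaller deviations from energy neutrality earn higher reward) is an assumption for illustration, since the slides define ENP itself but not its sign in the reward:

def enp(initial_battery_pct, final_battery_pct):
    # ENP = |initial battery level - final battery level|
    return abs(initial_battery_pct - final_battery_pct)

def end_of_episode_reward(initial_battery_pct, final_battery_pct):
    return -enp(initial_battery_pct, final_battery_pct)   # assumed sign convention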
Training: Tokyo, Year 2010
Testing: Tokyo, Year 2010/2011; Wakkanai, Year 2010/2011
Wakkanai is located far to the north of Tokyo and receives much less solar radiation, which tests the agent when the location of implementation is different from the location of its training.
Figure: battery percentage over the days of 2011 (Tokyo) for SARSA(λ) (with eligibility traces), SARSA, Q(λ) (with eligibility traces), and Q-Learning. For some of the methods the battery overflows and has to be reset to the initial level (60%); the SARSA(λ) battery profile does not need to be reset.
Figure (Day 29, Tokyo 2011): duty cycle (%) and battery (%) over time (hour), together with the solar energy profile, for the Optimal Policy computed using non-causal data and for the Proposed Method. Our method comes very close to the optimal policy.
Optimal Policy
▫ Theoretical upper limit
▫ Calculated using future information and linear programming techniques
Battery profiles for SARSA and the Offline Policy are very similar.
The battery is reset to the initial battery level, which is 60% of the battery capacity (BMAX). ENP is evaluated for every day of the year:
ENP = Battery at 00:00 − Battery at 23:59
ENP = |60% of BMAX − Battery at 23:59|
Q-Learning oscillates with large amplitudes and its policy cannot converge.
Eligibility traces shorten learning times.
SARSA performs better than Q-Learning.
Results from SARSA RL are near optimal.
ANY COMMENTS OR QUESTIONS ARE WELCOME
For further details about our work using SARSA(λ), please refer to our paper to be presented at EMSOFT 2017 and published in the ACM TECS journal.
Adaptive Power Management in Solar Energy Harvesting Node using Reinforcement Learning
This work was partially supported by JSPS KAKENHI Grant Number 16K12405.