
ADAPTIVE POWER MANAGEMENT OF ENERGY HARVESTING SENSOR NODES USING REINFORCEMENT LEARNING: A Comparison of Q-Learning and SARSA Algorithms


SLIDE 1

ADAPTIVE POWER MANAGEMENT OF ENERGY HARVESTING SENSOR NODES USING REINFORCEMENT LEARNING

SHASWOT SHRESTHAMALI, MASAAKI KONDO, HIROSHI NAKAMURA

A Comparison of Q-Learning and SARSA Algorithms

THE UNIVERSITY OF TOKYO

Japanese title: A Comparative Evaluation of Reinforcement Learning Strategies for Energy-Harvesting Sensor Nodes with Adaptive Power Control

SWoPP 2017

SLIDE 2

INTRODUCTION

• Use Reinforcement Learning (RL) for power management in Energy Harvesting Sensor Nodes (EHSN)
  ▫ Adaptive control behavior
  ▫ Near-optimal performance
• Comparison between different RL algorithms
  ▫ Q-Learning
  ▫ SARSA

SLIDE 3

ENERGY HARVESTING SENSOR NODE CONCEPT


• CONSTRAINTS
  • The sensor node has to be operating at ALL times
  • The battery cannot be completely depleted
  • The battery cannot be overcharged (exceed 100%)
  • Battery size is finite
  • Charging/discharging rates are finite

[Node block diagram: harvested energy feeds a Power Manager, which supplies the MCU, Sensor, RF Transceiver, Memory, and Mixed Signal Circuits.]
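These constraints can be captured in a simple battery-update routine. The following is a minimal sketch; the function name and all numeric limits are illustrative assumptions, not values taken from the slides.

```python
# Illustrative battery model enforcing the constraints listed above.
BATTERY_CAPACITY_MWH = 20000.0        # finite battery size (assumed value)
MAX_CHARGE_MWH_PER_HOUR = 2000.0      # finite charging rate (assumed value)
MAX_DISCHARGE_MWH_PER_HOUR = 500.0    # finite discharging rate (assumed value)

def step_battery(battery_mwh, harvested_mwh, consumed_mwh):
    """Advance the battery by one hour while respecting the constraints."""
    charge = min(harvested_mwh, MAX_CHARGE_MWH_PER_HOUR)
    discharge = min(consumed_mwh, MAX_DISCHARGE_MWH_PER_HOUR)
    battery_mwh = battery_mwh + charge - discharge
    # Battery cannot be overcharged (capped at 100% of capacity).
    battery_mwh = min(battery_mwh, BATTERY_CAPACITY_MWH)
    # Battery cannot be completely depleted: the node must run at ALL times.
    if battery_mwh <= 0.0:
        raise RuntimeError("battery depleted: node would stop operating")
    return battery_mwh
```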

SLIDE 4

OBJECTIVE: NODE-LEVEL ENERGY NEUTRALITY

[Illustration: Energy Harvested balanced against Energy Consumed.]

• We want to use ALL the energy that is harvested.
• One way of achieving that is by ensuring node-level energy neutrality – the condition in which the amount of energy harvested equals the amount of energy consumed.
• Autonomous perpetual operation can then be achieved.
SLIDE 5

http://www.mdpi.com/sensors/sensors-12-02175/article_deploy/html/images/sensors-12-02175f5-1024.png

CHALLENGES: DIFFERENT SENSORS, MOVING SENSORS, DIFFERENT ENVIRONMENTS

Image sources: Environmental Sensor Networks – P.I. Corke et al.; https://sites.google.com/site/sarmavrudhula/home/research/energy-management-of-wireless-sensor-networks

SLIDE 6

SOLUTION

Preparing heuristic, user-defined contingency solutions for all possible scenarios is impractical.

We want a one-size-fits-all solution: sensor nodes that are capable of
  • autonomously learning optimal strategies, and
  • adapting once they have been deployed in the environment.

SLIDE 7

SOLUTION


➢ Use RL for adaptive control
➢ Use a solar energy harvesting sensor node as a case example

SLIDE 8

Q-Learning Results (ETNET 2017)

[Bar chart: Efficiency (%) and Energy Wasted (%) for three approaches – Naïve (duty cycle proportional to battery level), Kansal (duty cycle fixed for the present day by predicting the next day's total energy), and our RL-based method, which achieves higher efficiency and lower waste.]

Efficiency = Actual Duty Cycle / Achievable Maximum Duty Cycle
Energy Wasted = Total Energy Wasted / Total Energy Harvested

Energy Waste = Energy Harvested − Node Energy − Charging Energy
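These metrics translate directly into code. A minimal sketch, assuming the per-day totals are already available; the function and argument names are mine, not the authors':

```python
def efficiency(actual_duty_cycle, achievable_max_duty_cycle):
    # Efficiency = Actual Duty Cycle / Achievable Maximum Duty Cycle
    return actual_duty_cycle / achievable_max_duty_cycle

def wasted_fraction(total_energy_wasted, total_energy_harvested):
    # Energy Wasted (%) = Total Energy Wasted / Total Energy Harvested
    return total_energy_wasted / total_energy_harvested

def energy_waste(energy_harvested, node_energy, charging_energy):
    # Energy Waste = Energy Harvested - Node Energy - Charging Energy
    return energy_harvested - node_energy - charging_energy
```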

SLIDE 9

Q-Learning (ETNET 2017)

❑ Demonstrated that RL approaches outperform traditional methods.
❑ Limitations
  • State explosion
    ▫ 200 × 5 × 6 states
    ▫ The Q-table becomes too large to train using a random policy
  • Long training times
    ▫ Required 10 years' worth of training data
  • The reward function did not reflect the true objective of energy neutrality.

SLIDE 10

REINFORCEMENT LEARNING

IN A NUTSHELL

SLIDE 11

REINFORCEMENT LEARNING

[Agent–environment loop: the agent (power manager) asks "What action should I take to accumulate the maximum total reward?", observes the battery level, energy harvested, and weather forecast, chooses a duty cycle (action), and receives a reward and a new state from the environment.]

• Type of Machine Learning based on experience rather than instruction
• Map situations (states) into actions – and receive as much reward as possible
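The loop described above can be sketched as follows. The env/agent interfaces (reset, step, choose_duty_cycle, learn) are hypothetical placeholders for illustration, not the actual implementation:

```python
def run_episode(env, agent, hours=24):
    """One day of interaction between the power manager (agent) and its environment."""
    state = env.reset()                          # (battery level, energy harvested, forecast)
    total_reward = 0.0
    for _ in range(hours):
        action = agent.choose_duty_cycle(state)  # pick a duty cycle for this hour
        next_state, reward = env.step(action)    # environment returns reward and new state
        agent.learn(state, action, reward, next_state)
        state = next_state
        total_reward += reward
    return total_reward
```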

SLIDE 12

REINFORCEMENT LEARNING


• IMPORTANT CONCEPTS
  ▫ Q-VALUE
  ▫ ELIGIBILITY TRACES

SLIDE 13

Q-VALUE

• To give a measure of the “goodness” of an action in a particular state, we assign each state-action pair a Q-value: Q(state, action).
• Learned from past (training) experiences.
• Higher Q-value → better choice of action for that state.
• The Q(s, a) value is the expected cumulative reward that you can get starting from state s and taking action a.

[Diagram: from state s_i, actions a_1, a_2, a_3 lead to states s_j, s_k, s_l with rewards r_1, r_2, r_3 and values Q(s_i, a_1), Q(s_i, a_2), Q(s_i, a_3).]

SLIDE 14

Q-VALUE

(Same text as Slide 13; the diagram highlights the single transition from state s_i via action a_2 to state s_j, with reward r_2 and value Q(s_i, a_2).)
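As a concrete illustration of a tabular Q-function, here is a small NumPy sketch; the table sizes are placeholders, not the deck's actual state encoding:

```python
import numpy as np

N_STATES, N_ACTIONS = 100, 5          # placeholder sizes (assumptions)
Q = np.zeros((N_STATES, N_ACTIONS))   # Q(s, a): expected cumulative reward

def best_action(state):
    # A higher Q-value means a better choice of action for that state.
    return int(np.argmax(Q[state]))
```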

SLIDE 15

LEARNING Q-VALUES


TO FIND Q(s_k, a_k)

• Start with arbitrary guesses for Q(s_k, a_k).
• Update Q(s_k, a_k) incrementally towards the target value (bootstrapping).
• General update rule:

NewEstimate ← OldEstimate + StepSize × [Target − OldEstimate]
NewEstimate ← (1 − StepSize) × OldEstimate + StepSize × Target
Q(s_k, a_k) ← (1 − α) × Q(s_k, a_k) + α × Target

Target = ?
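The general update rule is a one-liner in code. A minimal sketch, with the step size as an assumed example value:

```python
def update_estimate(old_estimate, target, step_size=0.1):
    # NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)
    # equivalently: (1 - StepSize) * OldEstimate + StepSize * Target
    return old_estimate + step_size * (target - old_estimate)

# e.g. Q[s, a] = update_estimate(Q[s, a], target)
```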

SLIDE 16

SARSA VS Q-LEARNING


• The agent starts at state s_k and takes some action a_k according to policy π.
• It receives a reward r_k and is transported to a new state s_{k+1}.

SARSA
• The agent considers taking the next action a_{k+1}.
• The Q-value Q(s_k, a_k) is then updated.

Q-LEARNING
• The agent assumes the next action will be the action with the highest Q-value.
• The Q-value Q(s_k, a_k) is then updated.

• An ε-greedy policy is used, i.e. random actions are taken with probability ε to allow exploration.

Q-Learning:  Q(s_k, a_k) ← (1 − α) Q(s_k, a_k) + α [ r_k + γ max_a Q(s_{k+1}, a) ]
SARSA:       Q^π(s_k, a_k) ← (1 − α) Q^π(s_k, a_k) + α [ r_k + γ Q^π(s_{k+1}, a_{k+1}) ]
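Written as code, the two update rules differ only in how the bootstrap target is formed. A minimal sketch for a tabular Q indexed as Q[state, action]; the alpha and gamma defaults are assumptions:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Off-policy target: assume the best next action, max_a' Q(s', a').
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy target: use the next action a' actually chosen by the policy.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
```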

SLIDE 17

SARSA VS Q-LEARNING

General rule:  NewEstimate ← (1 − StepSize) × OldEstimate + StepSize × Target

SARSA:       Q^π(s_k, a_k) ← (1 − α) Q^π(s_k, a_k) + α [ r_k + γ Q^π(s_{k+1}, a_{k+1}) ]
Q-Learning:  Q(s_k, a_k) ← (1 − α) Q(s_k, a_k) + α [ r_k + γ max_a Q(s_{k+1}, a) ]

SLIDE 18

SARSA VS Q-LEARNING

SARSA
• On-policy learning: updates the policy it is using during training.
• The update is carried out by considering the next action to be taken.
• Faster convergence, but requires an initial policy.
• Easier to integrate with function approximation.

Q-Learning
• Off-policy learning: the final learned policy is the same regardless of the training method.
• Assumes the best action will always be taken.
• Takes longer to converge.
• Difficult to integrate with linear function approximation.

                       SARSA             Q-Learning
Choosing next action   ε-greedy policy   ε-greedy policy
Updating Q             ε-greedy policy   Greedy policy
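The two selection rules in the table can be sketched as follows; the epsilon default is an assumed example value:

```python
import random
import numpy as np

def epsilon_greedy_action(Q, state, epsilon=0.1):
    # Explore with probability epsilon, otherwise exploit the best-known action.
    if random.random() < epsilon:
        return random.randrange(Q.shape[1])
    return int(np.argmax(Q[state]))

def greedy_action(Q, state):
    # Always take the action with the highest Q-value (used in Q-learning's update target).
    return int(np.argmax(Q[state]))
```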

SLIDE 19

ELIGIBILITY TRACES


[Diagram: a sequence of 24 state-action pairs, (State 1, Action 1) → (State 2, Action 2) → … → (State 24, Action 24), is followed by a single REWARD; Q(State 1, Action 1) through Q(State 24, Action 24) all need to be updated.]

• In our model, one action is taken every hour and the reward is awarded at the end of 24 hours. A single action cannot justify the reward at the end; a series of 24 state-action pairs is responsible for the reward.
• To update the Q-values of the appropriate state-action pairs, we introduce a memory variable, e(s, a), called the eligibility trace.
• e(s, a) for ALL state-action pairs decays by λ at every time step.
• If the state-action pair (s_k, a_k) is visited, e(s_k, a_k) is incremented by one.
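A minimal sketch of the eligibility-trace bookkeeping described above, assuming the traces e are stored in an array shaped like Q; the lambda default is an assumed example value:

```python
import numpy as np

def update_traces(e, s, a, lam=0.9):
    """Decay all traces, then mark the visited state-action pair."""
    e *= lam          # e(s, a) for ALL pairs decays by lambda every time step
    e[s, a] += 1.0    # the visited pair (s_k, a_k) is incremented by one
    return e
```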
SLIDE 20

SARSA(λ) AND Q(λ)


  • SARSA() – integrate eligibility traces with SARSA algorithm
  • Q() – integrate eligibility traces with Q-Learning algorithm
  • , 0 < 𝜇 < 1, is the strength with which Q-values of early

contributing state-action pairs are updated as a consequence

  • f the final reward.
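Combining the trace with the SARSA update gives a SARSA(λ)-style step. This is a sketch under the decay rule stated on the previous slide, not the authors' exact implementation; the alpha, gamma and lambda defaults are assumptions:

```python
import numpy as np

def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next,
                      alpha=0.1, gamma=0.9, lam=0.9):
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]   # TD error for the SARSA target
    e[s, a] += 1.0                                    # mark the visited state-action pair
    Q += alpha * delta * e    # spread the update over all eligible pairs
    e *= lam                  # decay all traces, as described on the previous slide
    return Q, e
```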
SLIDE 21

ADAPTIVE POWER CONTROL USING REINFORCEMENT LEARNING ALGORITHMS


  • SARSA() – SARSA with eligibility traces
  • SARSA
  • Q() – Q-Learning with eligibility traces
  • Q-Learning
SLIDE 22

STATE DEFINITION

The state has four components: distance from energy neutrality S_dist(t_k), battery level S_batt(t_k), harvested energy S_eharvest(t_k), and weather forecast S_day(t_k).

S_dist(t_k): −20000 mWh, −19000 mWh, …, 0 mWh, …, 19000 mWh, 20000 mWh
S_batt(t_k): Low (< 20%), Mid (20% to 80%), High (> 80%)
S_eharvest(t_k): 0 mWh, 0–100 mWh, 100–500 mWh, 500–1000 mWh, 1000–1500 mWh, 1500–2000 mWh, > 2000 mWh
S_day(t_k): Very little sun, Overcast, Partly Cloudy, Fair, Sunny, Very Sunny

State at t_k = ( S_dist(t_k), S_batt(t_k), S_eharvest(t_k), S_day(t_k) )
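A sketch of how the four components might be discretized in code. The bucket boundaries follow the table above, while the helper names and the rounding of S_dist to 1000 mWh steps are assumptions:

```python
def batt_bucket(battery_pct):
    # Low (< 20%), Mid (20% to 80%), High (> 80%)
    if battery_pct < 20:
        return "Low"
    return "Mid" if battery_pct <= 80 else "High"

def harvest_bucket(harvested_mwh):
    # 0, 0-100, 100-500, 500-1000, 1000-1500, 1500-2000, > 2000 mWh
    for i, upper in enumerate([0, 100, 500, 1000, 1500, 2000]):
        if harvested_mwh <= upper:
            return i
    return 6

def make_state(dist_mwh, battery_pct, harvested_mwh, forecast):
    # State at t_k = (S_dist, S_batt, S_eharvest, S_day)
    s_dist = int(round(dist_mwh, -3))   # assumed 1000 mWh granularity
    return (s_dist, batt_bucket(battery_pct), harvest_bucket(harvested_mwh), forecast)
```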

SLIDE 23

ACTION SPACE


• Choose the duty cycle of the sensor node

A = { a(t_k) ∈ {1, 2, 3, 4, 5} }

ACTION a(t_k)   DUTY CYCLE (%)   ENERGY CONSUMED PER HOUR (mWh)
1               20               100
2               40               200
3               60               300
4               80               400
5               100              500
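The action table as a simple lookup (values copied from the slide; the dictionary and function names are mine):

```python
DUTY_CYCLE_PCT = {1: 20, 2: 40, 3: 60, 4: 80, 5: 100}
ENERGY_PER_HOUR_MWH = {1: 100, 2: 200, 3: 300, 4: 400, 5: 500}

def apply_action(a):
    """Return (duty cycle %, energy consumed per hour in mWh) for action a in {1..5}."""
    return DUTY_CYCLE_PCT[a], ENERGY_PER_HOUR_MWH[a]
```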

SLIDE 24

REWARD FUNCTION


• Awarded at the end of an episode (day).
• Each episode consists of 24 one-hour epochs.
• We want the net energy difference between the initial and final battery levels to be zero.
• Use a reward scheme that depends on the Energy Neutral Performance (ENP) at the end of the episode (t_k = T).
• Energy Neutral Performance can be defined here as
  ▫ |Initial battery level − Final battery level|
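A minimal sketch of an episode-end reward based on ENP. The slides define ENP but not the exact reward shaping, so the mapping from ENP to reward below is an assumption (smaller ENP is rewarded more):

```python
def energy_neutral_performance(initial_battery_mwh, final_battery_mwh):
    # ENP = |Initial battery level - Final battery level|
    return abs(initial_battery_mwh - final_battery_mwh)

def episode_reward(initial_battery_mwh, final_battery_mwh):
    # Assumed shaping: reward increases (towards zero) as ENP approaches zero.
    return -energy_neutral_performance(initial_battery_mwh, final_battery_mwh)
```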

SLIDE 25

TRAINING AND TESTING


• Training: Tokyo, Year 2010
• Testing: Tokyo, Year 2010/2011; Wakkanai, Year 2010/2011
• Wakkanai has a much colder climate than Tokyo and receives much less solar radiation.
• We observe the adaptive behavior of our solution when the deployment location differs from the location it was trained on.

SLIDE 26

RESULTS


SLIDE 27

SARSA VS Q-LEARNING


[Plot: battery percentage over the days of 2011 (Tokyo) for SARSA(λ) (with eligibility traces), SARSA, Q(λ) (with eligibility traces), and Q-Learning. When the battery overflows it has to be reset to the initial level (60%); the SARSA(λ) battery profile does not need to be reset.]

SLIDE 28

ENERGY NEUTRAL OPERATION

[Plot: duty cycle (%) and battery (%) over 24 hours for Day 29, Tokyo 2011, together with the solar energy profile, comparing the proposed method against the optimal policy computed from non-causal data. Our method comes very close to the optimal solution.]

• SARSA(λ) compared with the Optimal Policy
• Optimal Policy
  ▫ Theoretical upper limit
  ▫ Calculated using future information and linear programming techniques
• The battery profiles for SARSA(λ) and the offline (optimal) policy are very similar.

SLIDE 29

SARSA VS Q-LEARNING


• Every day the battery is reset to the initial battery level.
• ENP (as a percentage of the maximum battery capacity, BMAX) is observed at the end of each day of the year.

ENP = Battery at 00:00 − Battery at 23:59
ENP = |60% of BMAX − Battery at 23:59|

SLIDE 30

SARSA VS Q-LEARNING


SLIDE 31

OBSERVATIONS


• SARSA(λ) – BEST PERFORMANCE.
• Q(λ) – WORST PERFORMANCE.
• The “high” learning rate causes Q-values to oscillate with large amplitudes, and the policy cannot converge.
• A lower learning rate gives better performance, but at the expense of longer learning times.
• SARSA methods have generally robust performance compared to Q-Learning.
• Using eligibility traces with SARSA enhances performance.
SLIDE 32

SUMMARY

• Adaptive control is achieved by using SARSA RL methods.
• Results from SARSA RL are near optimal.
• SARSA(λ) outperforms Q-Learning methods.

SLIDE 33

THANK YOU FOR LISTENING

ANY COMMENTS OR QUESTIONS ARE WELCOME

For further details about our work using SARSA(λ), please refer to our paper to be presented at EMSOFT 2017 and published in the ACM TECS journal.

Adaptive Power Management in Solar Energy Harvesting Node using Reinforcement Learning

shaswot@hal.ipc.i.u-tokyo.ac.jp

This work was partially supported by JSPS KAKENHI Grant Number 16K12405.