
SLIDE 1

Adaptive Power Management for Energy Harvesting Sensor Nodes using Reinforcement Learning

Shaswot Shresthamali, Masaaki Kondo, Hiroshi Nakamura
The University of Tokyo

SLIDE 2

CONTEXT

SLIDE 3

Energy Harvesting Sensor Nodes

Theoretically capable of perpetual operation


Sensor Node + Battery + Energy Harvesting Module (capable of varying the duty cycle)

SLIDE 4

Challenge I

Say your battery is at 75% and there is plenty of sunshine. Do you

  • use the solar power to charge your battery only, or
  • use the solar power to charge your battery and drive the sensor node? If so, in what proportion?


[Plot: energy harvested and battery level vs. time (hours 1–24) under three different policies (Policy 1, Policy 2, Policy 3).]

SLIDE 5

Challenge II


Image sources: Environmental Sensor Networks (P. I. Corke et al.)

MOVING SENSORS | DIFFERENT ENVIRONMENTS | DIFFERENT SENSORS

SLIDE 6

Challenge III

BILLIONS AND TRILLIONS OF NODES


SLIDE 7

Challenges II and III

When dealing with TRILLIONS of sensor nodes, customizing each node is impractical, if not impossible.

  • Nodes should OPTIMIZE themselves.
  • Nodes should ADAPT to their changing environments.


ENERGY HARVESTING NODES NEED TO BE

ADAPTABLE and SELF-CALIBRATING

SLIDE 8

What this presentation is about

To demonstrate how to overcome these challenges by using Reinforcement Learning (RL)

  • Brief introduction to Reinforcement Learning
  • Our approach using RL
  • How this strategy compares to other methods
  • How this strategy adapts to changing environments

SLIDE 9

OBJECTIVES

SLIDE 10

Objectives

Energy Neutral Operation (ENO)

  • Energy consumed = Energy harvested

Maximize Performance

  • Maximize Duty Cycle

Minimize Battery Downtime

  • Battery should never drop to zero

Minimize Energy Waste

  • Battery should not overcharge

Energy Waste = Energy Harvested − Energy Consumed by Node − Energy to Charge Battery
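For concreteness, the bookkeeping behind this definition with invented example numbers:

```python
# Invented example values (mWh over one time slot), only to illustrate
# the energy-waste bookkeeping defined above.
energy_harvested = 500
node_energy = 300        # energy consumed by the sensor node
charging_energy = 150    # energy stored into the battery

energy_waste = energy_harvested - node_energy - charging_energy
print(energy_waste)      # -> 50 mWh wasted (e.g., battery already full)
```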

SLIDE 11

SYSTEM MODEL

SLIDE 12

System Model


[System diagram: the solar panel harvests solar energy to charge the battery (assumed ideal) and power the sensor node; the Adaptive Power Manager observes the battery reserve level and the energy being harvested, and sets the node's duty cycle.]

SLIDE 13

REINFORCEMENT LEARNING (RL)

A BRIEF INTRODUCTION

SLIDE 14

What is RL

A type of machine learning that learns by interacting with the environment

  • Suited to sequential decision-making tasks
  • Maps situations (states) into actions so as to receive as much reward as possible
  • Based on an iterative process of trial and error (search and memory), similar to how humans learn

SLIDE 15

Why Reinforcement Learning

By using RL, it is possible

  • to optimize nodes with raw high-level data and minimal human input, and
  • to adapt to changes in the environment parameters.

SLIDE 16

Reinforcement Learning


[Agent–environment diagram: the agent (power manager) observes the battery level and energy harvested, takes an action by choosing a duty cycle, and receives a reward and a new state from the environment. The agent's question: what action should I take to accumulate the maximum total reward?]

SLIDE 17

Reinforcement Learning

The question is:

WHICH ACTION TO TAKE WHEN YOU ARE IN A GIVEN STATE?

EXAMPLE: Lots of sunlight | Battery at 60%. Do you

  • drive the sensor node at full strength without recharging, or
  • drive the sensor node at half strength with partial charging?

SLIDE 18

Q-Value

Assign every state–action pair a Q-value, Q(s, a)


[Diagram: from state X, actions 1, 2, and 3 have Q-values Q(X,1), Q(X,2), and Q(X,3).]

Q(s, a) means: if the agent

  • starts from state s, and
  • takes action a,

then Q(s, a) is the total reward it can expect in the best-case scenario.

The higher the Q-value, the better the action for that particular state.
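As a concrete illustration (the Q-values below are invented), choosing the best action in a state is just an argmax over that state's Q-values:

```python
# Invented Q-values for state X and three candidate actions
Q_X = {1: 0.4, 2: 1.7, 3: 0.9}

best_action = max(Q_X, key=Q_X.get)  # -> action 2: highest Q-value wins
```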

SLIDE 19

Q-Learning

Challenge → determining the Q-values for all state–action pairs. The Q-table contains the Q-values of all possible state–action pairs. This is accomplished by the Q-learning algorithm:

  • Q-values are learned by interacting with the environment.
  • It is an iterative process.
  • It uses a bootstrapping approach.

SLIDE 20

Q-Learning Algorithm

  • Start with arbitrary estimates for the Q-values.
  • Use these estimates to decide on actions.
  • Update the Q-table using the rewards received.
  • Repeat until the Q-values sufficiently converge (see the sketch below).
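The slides leave the update rule itself implicit; below is a minimal sketch of tabular Q-learning in Python. The `env` object, its `reset`/`step` interface, and the hyperparameter values are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

# Sizes taken from the state/action spaces described later in the talk:
# 1000 states (200 battery levels x 5 harvest levels), 10 duty-cycle actions.
N_STATES, N_ACTIONS = 1000, 10
alpha, gamma, epsilon = 0.1, 0.99, 0.1   # illustrative hyperparameters

Q = np.zeros((N_STATES, N_ACTIONS))      # arbitrary initial estimates

def run_episode(env):
    """One episode of tabular Q-learning against a generic environment
    exposing reset() -> state and step(a) -> (next_state, reward, done)."""
    state, done = env.reset(), False
    while not done:
        # epsilon-greedy choice (see slide 32): mostly exploit, sometimes explore
        if np.random.rand() < epsilon:
            action = np.random.randint(N_ACTIONS)
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done = env.step(action)
        # Bootstrapped update toward reward + discounted best future estimate
        target = reward + gamma * np.max(Q[next_state]) * (not done)
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state
```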

SLIDE 21

EXPERIMENTS ON ADAPTIVE POWER MANAGEMENT USING Q-LEARNING

SLIDE 22

State Space

State is defined by:

  • amount of battery remaining (200 possible levels)
  • amount of energy harvested (5 possible levels)

Total possible states: 200 × 5 = 1000
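As an illustration of how such a state could be indexed into the Q-table; the uniform battery discretization below is an assumption, not taken from the slides:

```python
def encode_state(battery_frac, harvest_level):
    """Map raw observations to one of the 1000 state indices.

    battery_frac:  battery charge in [0.0, 1.0], binned into 200 levels
                   (uniform binning assumed here).
    harvest_level: harvested-energy bin in {0, 1, 2, 3, 4}.
    """
    battery_bin = min(int(battery_frac * 200), 199)
    return battery_bin * 5 + harvest_level

s = encode_state(0.75, 3)   # battery at 75%, harvest level 3 -> state 753
```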

SLIDE 23

Action Space

Action: choose the duty cycle of the sensor node

a(t_k) ∈ {10%, 20%, 30%, …, 100%}

Corresponding node power consumption: 10% → 50 mW, 50% → 250 mW, 100% → 500 mW
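The three data points suggest a linear mapping of 50 mW per 10% of duty cycle; a small sketch under that assumption:

```python
def node_power_mw(duty_cycle_pct):
    """Node power draw for a duty-cycle action, assuming the linear
    mapping implied by the slide: 10% -> 50 mW, ..., 100% -> 500 mW."""
    assert duty_cycle_pct in range(10, 101, 10), "10 discrete actions only"
    return 5 * duty_cycle_pct

ACTIONS_MW = [node_power_mw(d) for d in range(10, 101, 10)]
# -> [50, 100, 150, ..., 500] mW for the ten available duty cycles
```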

SLIDE 24

Reward Function


The reward depends on:

  • Distance from energy neutrality at time t_k:

    Δe_neutral(t_k) = e_harvest(t_k) − e_node(t_k)

  • Amount of battery remaining
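The slides do not give the exact functional form of the reward; below is a plausible sketch that combines the two factors above, with the weights and the linear battery term as assumptions:

```python
def reward(e_harvest, e_node, battery_frac, w_eno=1.0, w_batt=1.0):
    """Illustrative reward, not the authors' exact formula: penalize the
    distance from energy neutrality and reward a healthy battery level."""
    delta_e_neutral = e_harvest - e_node     # deviation from ENO at t_k
    return -w_eno * abs(delta_e_neutral) + w_batt * battery_frac
```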
SLIDE 25

RESULTS

SLIDE 26

Training and Testing

Training: Tokyo (2000–2009). Testing: Tokyo (2010).

  • Comparison with other methods
  • Adaptation to diurnal and seasonal variations
  • Greedy and ε-greedy implementations

SLIDE 27

Comparison with other methods


[Bar chart: Efficiency (%) and Energy Wasted (%) for the Naïve scheme, Kansal's scheme, and our method using RL. Our method achieves higher efficiency and lower waste.]

Naïve: duty cycle is proportional to the battery level. Kansal: fixes the duty cycle for the present day by predicting the total energy for the next day.

Efficiency = Actual Duty Cycle / Achievable Maximum Duty Cycle
Energy Wasted (%) = Total Energy Wasted / Total Energy Harvested
Energy Waste = Energy Harvested − Node Energy − Charging Energy

SLIDE 28

ADAPTATION TO SEASONAL CHANGES

SLIDE 29

Performance in Summer


High Duty Cycle even during the night

SLIDE 30

Performance in Winter


[Plot: duty cycle (%), harvested energy (%), and battery (%) over epochs 1360–1480. Lower duty cycle during the night.]

SLIDE 31

ADAPTATION TO CHANGE IN LOCATION

SLIDE 32

Implementation: ε-greedy approach

Perfect Q-convergence takes too long. Instead, use an ε-greedy approach with a non-converged Q-table. The ε-greedy approach (sketched below):

  • Take the best-known action by default.
  • Take a random action with probability ε.
  • Increasing ε → more exploration
  • Decreasing ε → more exploitation
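A minimal sketch of the selection rule, reusing the Q-table shape from the earlier Q-learning sketch:

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon=0.1):
    """With probability epsilon pick a random action (exploration);
    otherwise pick the best-known action (exploitation)."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[state]))

# Keeping epsilon > 0 online lets the node keep exploring, which is what
# allows it to adapt when the climate or location changes.
```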

SLIDE 33

Adaptation to change in climate


  • Wakkanai (very little sunshine)
  • Compare a greedy approach (offline) with an ε-greedy approach (online).
  • Training: 2000–2009, Tokyo
  • Testing: 2010, Wakkanai
SLIDE 34

Adaptation to change in location

[Plot: average duty cycle (%) per year, 2010–2015, for Wakkanai Offline vs. Wakkanai Online. Annotations give the total number of times the battery was completely exhausted: 14 with the greedy (non-adaptive) approach and 8 with the ε-greedy (adaptive) approach.]

With the ε-greedy implementation, the agent adapts to the environment and minimizes instances of battery exhaustion.

SLIDE 35

With and Without Forecast Information

SLIDE 36

CONCLUSION

SLIDE 37

CONCLUSION

  • The proposed system is able to meet the objectives of
      • energy neutrality
      • maximizing performance
  • It exceeds the performance of other schemes
  • It is capable of adaptation
  • Inclusion of weather forecast information results in smarter operation

SLIDE 38

THANK YOU FOR LISTENING

ANY COMMENTS OR QUESTIONS ARE WELCOME