
SLIDE 1

Adaptive Power Management for Energy Harvesting Sensor Nodes using Reinforcement Learning

Shaswot Shresthamali, Masaaki Kondo, Hiroshi Nakamura
The University of Tokyo

SLIDE 2

CONTEXT

SLIDE 3

Energy Harvesting Sensor Nodes

Theoretically capable of perpetual operation


Sensor Node + Battery + Energy Harvesting Module (capable of varying the duty cycle)

SLIDE 4

Challenge I

Say your battery is at 75% and there is plenty of sunshine. Do you

  • use the solar power to charge your battery only, or
  • use the solar power to charge your battery and drive the sensor node? If so, in what proportion?


[Plot: energy harvested and battery level vs. time (hours 1–24) under three different policies (Policy 1, Policy 2, Policy 3).]

SLIDE 5

Challenge II


Image sources: Environmental Sensor Networks (P. I. Corke et al.)

MOVING SENSORS | DIFFERENT ENVIRONMENTS | DIFFERENT SENSORS

SLIDE 6

Challenge III

BILLIONS AND TRILLIONS OF NODES


SLIDE 7

Challenges II and III

When dealing with TRILLIONS of sensor nodes, customizing each node is impractical, if not impossible.

  • Nodes should OPTIMIZE themselves.
  • Nodes should ADAPT to their changing environments.


ENERGY HARVESTING NODES NEED TO BE

ADAPTABLE and SELF-CALIBRATING

SLIDE 8

What this presentation is about

To demonstrate how to overcome these challenges by using Reinforcement Learning (RL)

  • Brief introduction to Reinforcement Learning
  • Our approach using RL
  • How this strategy compares to other methods
  • How this strategy adapts to changing environments

SLIDE 9

OBJECTIVES

SLIDE 10

Objectives

Energy Neutral Operation (ENO)

  • Energy consumed = Energy harvested

Maximize Performance

  • Maximize Duty Cycle

Minimize Battery Downtime

  • Battery should never drop to zero

Minimize Energy Waste

  • Battery should not overcharge

Energy Waste = Energy Harvested − Energy Consumed by Node − Energy to Charge Battery
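For concreteness, the bookkeeping behind this definition with invented example numbers:

```python
# Invented example values (mWh over one time slot), only to illustrate
# the energy-waste bookkeeping defined above.
energy_harvested = 500
node_energy = 300        # energy consumed by the sensor node
charging_energy = 150    # energy stored into the battery

energy_waste = energy_harvested - node_energy - charging_energy
print(energy_waste)      # -> 50 mWh wasted (e.g., battery already full)
```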

SLIDE 11

SYSTEM MODEL

SLIDE 12

System Model


[System diagram: the solar panel harvests solar energy to charge the battery (assumed ideal) and power the sensor node; the Adaptive Power Manager observes the battery reserve level and the energy being harvested, and sets the node's duty cycle.]

SLIDE 13

REINFORCEMENT LEARNING (RL)

A BRIEF INTRODUCTION

SLIDE 14

What is RL

A type of machine learning that learns by interacting with the environment

  • Suited to sequential decision-making tasks
  • Maps situations (states) into actions so as to receive as much reward as possible
  • Based on an iterative process of trial and error (search and memory), similar to how humans learn

SLIDE 15

Why Reinforcement Learning

By using RL, it is possible

  • to optimize nodes with raw high-level data and minimal human input, and
  • to adapt to changes in the environment parameters.

SLIDE 16

Reinforcement Learning


[Agent–environment diagram: the agent (power manager) observes the battery level and energy harvested, takes an action by choosing a duty cycle, and receives a reward and a new state from the environment. The agent's question: what action should I take to accumulate the maximum total reward?]

SLIDE 17

Reinforcement Learning

The question is:

WHICH ACTION TO TAKE WHEN YOU ARE IN A GIVEN STATE?

EXAMPLE: Lots of sunlight | Battery at 60%. Do you

  • drive the sensor node at full strength without recharging, or
  • drive the sensor node at half strength with partial charging?

SLIDE 18

Q-Value

Assign every state–action pair a Q-value, Q(s, a)


[Diagram: from state X, actions 1, 2, and 3 have Q-values Q(X,1), Q(X,2), and Q(X,3).]

Q(s, a) means: if the agent

  • starts from state s, and
  • takes action a,

then Q(s, a) is the total reward it can expect in the best-case scenario.

The higher the Q-value, the better the action for that particular state.
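As a concrete illustration (the Q-values below are invented), choosing the best action in a state is just an argmax over that state's Q-values:

```python
# Invented Q-values for state X and three candidate actions
Q_X = {1: 0.4, 2: 1.7, 3: 0.9}

best_action = max(Q_X, key=Q_X.get)  # -> action 2: highest Q-value wins
```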

SLIDE 19

Q-Learning

Challenge → determining the Q-values for all state–action pairs. The Q-table contains the Q-values of all possible state–action pairs. This is accomplished by the Q-learning algorithm:

  • Q-values are learned by interacting with the environment.
  • It is an iterative process.
  • It uses a bootstrapping approach.

SLIDE 20

Q-Learning Algorithm

  • Start with arbitrary estimates for the Q-values.
  • Use these estimates to decide on actions.
  • Update the Q-table using the rewards received.
  • Repeat until the Q-values sufficiently converge (see the sketch below).
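The slides leave the update rule itself implicit; below is a minimal sketch of tabular Q-learning in Python. The `env` object, its `reset`/`step` interface, and the hyperparameter values are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

# Sizes taken from the state/action spaces described later in the talk:
# 1000 states (200 battery levels x 5 harvest levels), 10 duty-cycle actions.
N_STATES, N_ACTIONS = 1000, 10
alpha, gamma, epsilon = 0.1, 0.99, 0.1   # illustrative hyperparameters

Q = np.zeros((N_STATES, N_ACTIONS))      # arbitrary initial estimates

def run_episode(env):
    """One episode of tabular Q-learning against a generic environment
    exposing reset() -> state and step(a) -> (next_state, reward, done)."""
    state, done = env.reset(), False
    while not done:
        # epsilon-greedy choice (see slide 32): mostly exploit, sometimes explore
        if np.random.rand() < epsilon:
            action = np.random.randint(N_ACTIONS)
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done = env.step(action)
        # Bootstrapped update toward reward + discounted best future estimate
        target = reward + gamma * np.max(Q[next_state]) * (not done)
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state
```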

SLIDE 21

EXPERIMENTS ON ADAPTIVE POWER MANAGEMENT USING Q-LEARNING

SLIDE 22

State Space

State is defined by:

  • amount of battery remaining (200 possible levels)
  • amount of energy harvested (5 possible levels)

Total possible states: 200 × 5 = 1000
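As an illustration of how such a state could be indexed into the Q-table; the uniform battery discretization below is an assumption, not taken from the slides:

```python
def encode_state(battery_frac, harvest_level):
    """Map raw observations to one of the 1000 state indices.

    battery_frac:  battery charge in [0.0, 1.0], binned into 200 levels
                   (uniform binning assumed here).
    harvest_level: harvested-energy bin in {0, 1, 2, 3, 4}.
    """
    battery_bin = min(int(battery_frac * 200), 199)
    return battery_bin * 5 + harvest_level

s = encode_state(0.75, 3)   # battery at 75%, harvest level 3 -> state 753
```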

SLIDE 23

Action Space

Action: choose the duty cycle of the sensor node

a(t_k) ∈ {10%, 20%, 30%, …, 100%}

Corresponding node power consumption: 10% → 50 mW, 50% → 250 mW, 100% → 500 mW
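The three data points suggest a linear mapping of 50 mW per 10% of duty cycle; a small sketch under that assumption:

```python
def node_power_mw(duty_cycle_pct):
    """Node power draw for a duty-cycle action, assuming the linear
    mapping implied by the slide: 10% -> 50 mW, ..., 100% -> 500 mW."""
    assert duty_cycle_pct in range(10, 101, 10), "10 discrete actions only"
    return 5 * duty_cycle_pct

ACTIONS_MW = [node_power_mw(d) for d in range(10, 101, 10)]
# -> [50, 100, 150, ..., 500] mW for the ten available duty cycles
```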

SLIDE 24

Reward Function


The reward depends on:

  • Distance from energy neutrality at time t_k:

    Δe_neutral(t_k) = e_harvest(t_k) − e_node(t_k)

  • Amount of battery remaining
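The slides do not give the exact functional form of the reward; below is a plausible sketch that combines the two factors above, with the weights and the linear battery term as assumptions:

```python
def reward(e_harvest, e_node, battery_frac, w_eno=1.0, w_batt=1.0):
    """Illustrative reward, not the authors' exact formula: penalize the
    distance from energy neutrality and reward a healthy battery level."""
    delta_e_neutral = e_harvest - e_node     # deviation from ENO at t_k
    return -w_eno * abs(delta_e_neutral) + w_batt * battery_frac
```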
SLIDE 25

RESULTS

SLIDE 26

Training and Testing

Training: Tokyo (2000–2009). Testing: Tokyo (2010).

  • Comparison with other methods
  • Adaptation to diurnal and seasonal variations
  • Greedy and ε-greedy implementations

SLIDE 27

Comparison with other methods


[Bar chart: Efficiency (%) and Energy Wasted (%) for the Naïve scheme, Kansal's scheme, and our method using RL. Our method achieves higher efficiency and lower waste.]

Naïve: duty cycle is proportional to the battery level. Kansal: fixes the duty cycle for the present day by predicting the total energy for the next day.

Efficiency = Actual Duty Cycle / Achievable Maximum Duty Cycle
Energy Wasted (%) = Total Energy Wasted / Total Energy Harvested
Energy Waste = Energy Harvested − Node Energy − Charging Energy

SLIDE 28

ADAPTATION TO SEASONAL CHANGES

SLIDE 29

Performance in Summer


High Duty Cycle even during the night

SLIDE 30

Performance in Winter


[Plot: duty cycle (%), harvested energy (%), and battery (%) over epochs 1360–1480. Lower duty cycle during the night.]

SLIDE 31

ADAPTATION TO CHANGE IN LOCATION

SLIDE 32

Implementation: ε-greedy approach

Perfect Q-convergence takes too long. Instead, use an ε-greedy approach with a non-converged Q-table. The ε-greedy approach (sketched below):

  • Take the best-known action by default.
  • Take a random action with probability ε.
  • Increasing ε → more exploration
  • Decreasing ε → more exploitation
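A minimal sketch of the selection rule, reusing the Q-table shape from the earlier Q-learning sketch:

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon=0.1):
    """With probability epsilon pick a random action (exploration);
    otherwise pick the best-known action (exploitation)."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[state]))

# Keeping epsilon > 0 online lets the node keep exploring, which is what
# allows it to adapt when the climate or location changes.
```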

SLIDE 33

Adaptation to change in climate


  • Wakkanai (very little sunshine)
  • Compare a greedy approach (offline) with an ε-greedy approach (online).
  • Training: 2000–2009, Tokyo
  • Testing: 2010, Wakkanai
SLIDE 34

Adaptation to change in location

[Plot: average duty cycle (%) per year, 2010–2015, for Wakkanai Offline vs. Wakkanai Online. Annotations give the total number of times the battery was completely exhausted: 14 with the greedy (non-adaptive) approach and 8 with the ε-greedy (adaptive) approach.]

With the ε-greedy implementation, the agent adapts to the environment and minimizes instances of battery exhaustion.

SLIDE 35

With and Without Forecast Information

SLIDE 36

CONCLUSION

SLIDE 37

CONCLUSION

  • The proposed system is able to meet the objectives of
      • energy neutrality
      • maximizing performance
  • It exceeds the performance of other schemes
  • It is capable of adaptation
  • Inclusion of weather forecast information results in smarter operation

SLIDE 38

THANK YOU FOR LISTENING

ANY COMMENTS OR QUESTIONS ARE WELCOME