Architecting Energy Efficient Computing Platforms Rajesh Gupta, UC - - PowerPoint PPT Presentation




slide-1
SLIDE 1

Architecting Energy Efficient Computing Platforms

Rajesh Gupta, UC San Diego http://mesl.ucsd.edu

Science of Power Management, April 9, 2009

slide-2
SLIDE 2

Credits: Energy Related Projects & Teams

  • Completed Efforts
  • Power Aware Distributed Systems (PADS)
  • Mani Srivastava, UCLA
  • Cristiano Pereira, Arun Kejariwal.
  • Formal Methods in Power Management
  • Sandy Irani, UC Irvine
  • Sandeep Shukla, Virginia Tech
  • Ravindra Jejurikar, Dinesh Ramanathan, Zhen Ma
  • Ongoing
  • System level Power Management
  • Yuvraj Agrawal, Zhong Yi Jin, Packet Digital, MSR (Ranveer Chandra, Victor Bahl)

  • GreenLight: Coherent Coprocessing for Energy Efficient Computing
  • Joel Coburn, Arup De, Gerald Clark, M. Florea, ….Tom DeFanti
  • Launching: Non-Volatile Data Intensive Supercomputing NV-DISC
  • Arup De, Steve Swanson
slide-3
SLIDE 3

Outline

  • Energy and Computing
  • Three Observations
  • Approach and Lessons Learnt
  • Architectural Design for Low Power
  • Algorithm Design for Power Management
  • Cross-layer optimization and awareness
  • For aggressive duty-cycling
  • Takeaways

slide-4
SLIDE 4

Energy Efficiency is at the front & center of all forms of computing

Current architectural offerings range from 300µW to 30mW per (reasonable) MIPS.

[Figure: die photo, 360µm x 300µm, with labels: Photodiode, Pad to CCR, Vdd Pad, GND Pad/LFSR, Power-on Reset, Charge Pump]

Stationary Devices (W), Mobile Devices (mW), Sensor Devices (µW)

slide-5
SLIDE 5

Our Famous Scaling Curves

[Figure: log-scale trend (1 to 1,000,000,000) over 1950-2010 for processors 4004, 8086, 286, 386, 486DX, Pentium, P2, P3, P4, Itanium 2 (Madison); avg. increase of 57%/year]

[Figure: Trend of minimum transistor switching energy, with high and low trend lines. Y axis: min transistor switching energy, kTs (1 to 1,000,000). X axis: year of first product shipment (1995-2035). Source: Michael Frank, U Florida; ½CV² gate energy calculated from ITRS ’99 geometry/voltage data]

slide-6
SLIDE 6

Physicists and Computer Scientists Have Been Here Before

Confirmed physical theories define limits

  • Relativity: speed of light: latencies, bandwidth
  • Quantum: uncertainty: information capacity
  • Quantum: energy, reversibility: processing rate, energy/op

Newton, Einstein:

  • Energy and mass are the same thing in different units
  • Energy and matter cannot exceed the speed of light (SOL). If they do, there exists a frame of reference (FOR) in which causality is violated

Thermodynamics relates heat, temperature and work

  • Entropy = heat/temperature = log (#states)

Feynman, von Neumann, Shannon, Landauer

  • Entropy = amount of unknown or incompressible information in a physical system
  • Information loss equates to heat generation
  • Minimum energy per op same as min energy per bit
  • Energy lost to heat, S·T = kT ln 2 per bit lost, about 18 meV at 300K

Minimum Vdd of 48mV (with 30mV swing) verified by several groups. Realistically approaching 200mV.
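The kT ln 2 figure can be checked numerically. The sketch below is only a unit conversion using the SI values of the Boltzmann constant and the electron-volt; the function name is illustrative and not from the talk.

```python
import math

K_BOLTZMANN = 1.380649e-23      # J/K (exact SI value)
JOULES_PER_EV = 1.602176634e-19  # J per electron-volt (exact SI value)

def landauer_limit_joules(temperature_kelvin: float) -> float:
    """Minimum heat dissipated per erased bit: k * T * ln 2."""
    return K_BOLTZMANN * temperature_kelvin * math.log(2)

limit_j = landauer_limit_joules(300.0)
limit_mev = limit_j / JOULES_PER_EV * 1000.0  # convert J -> meV
print(f"kT ln 2 at 300 K: {limit_j:.3e} J = {limit_mev:.1f} meV per bit")
```

At room temperature this gives roughly 18 meV per bit, which is the scale the slide refers to.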

slide-7
SLIDE 7

Our Work: Know or Find Limits, Architectural Design to Reach Limits

Hardware:

  • What is the right choice and combination of components? Processors, Radios, Storage, Networking. [Mobisys 07-08, NSDI 09]

Power System States and Transitions

  • What is the right choice of power states and methods to move among these? Dynamic power management, Speed Scaling. [TCAS-I 09, TOA 07, TCOMP 06, TCAD 06]

Software

  • How to manage power-related decisions across abstraction layers (more in software than hardware)? Metadata methods, reflection, introspection. [TVLSI 06, IPDPS 05]

slide-8
SLIDE 8

Three Important Observations

  • O1. Hardware is increasingly heterogeneous

Component efficiency rated against absolute performance delivered

[Figure: Idle Power (mW) and Energy/Bit (nJ/bit) for Zigbee (0.25Mbps), BT (1.1Mbps), and 802.11 (11Mbps)]

  • Medium range, High power (400mW-1W), Higher bit-rate (54Mbps)
  • Short range, low power (20mW-100mW), lower bit rate (2Mbps)
  • Long Range, very low power (<10mW), voice only

slide-9
SLIDE 9

Three Important Observations

  • O2. Tremendous dynamic variation in power use

6-10x variation in power from active to sleep modes, even more in radios

[Figure: radio link energy model: transmit processing and transmit amplifier send a packet over distance d to receive processing; labels: 50 nJ/bit, 100 pJ/bit/m]

Desktop PC:

  • Active State: >140W
  • Idle State: 100W
  • Sleep state: 1.2W
  • Hibernate: 1W

  • O3. Abstraction stack has a real (high) cost for energy.
slide-10
SLIDE 10

Improving Energy Efficiency: Three Approaches

Reduce distance (O1)

  • Physical, logical

Minimize wasted work (O2)

  • Shutdown, slowdown, procrastinate

Specialized heterogeneous processing (O3)

  • In a generalized execution environment

Apply these lessons to build better architectures, power management algorithms.

slide-11
SLIDE 11

Introduce & Exploit Heterogeneity

Exploit the wide range of power consumption

Duty-cycle higher power consumers …in lieu of low power alternatives when possible

To do this well, three things must happen

Subsystems must be “functionally similar”

Radios – fundamentally send bits across the air

Subsystems must be “heterogeneous”

Operate in different power performance regimes

Subsystems must “collaborate”

Solves the Receiver Side Problem (RSP)

slide-12
SLIDE 12

Architectural Collaboration

Duty cycle the more power-consuming resource using the other

  • Sleep-talking Processors
  • Paging Radios

[Figure: WGN block diagram and architecture. Labels: Power, Wi-Fi Radio (Prism 802.11b Radio), IP2022, DPAC, PIC18F452, SPI, External Memory Interface, Serial Interface, Sensor Node Processor, Application Processor, Wireless Sensor Node, Other Devices, Supported interface]

Effectively expand the power states at a system level

  • E.g. consider a system with Bluetooth and Wi-Fi radios
  1. Use a low power radio to wake up higher power radio
  2. Build a radio-switching hierarchy

[Figure: Bluetooth and Wi-Fi radio states (WiFi Active, WiFi PSM, BT Active, BT Sniff) with power labels 264 mW, 990 mW, 81 mW, 5.8 mW]

slide-13
SLIDE 13

Collaborate and Coordinate

[Diagram: Computation Subsystem (Dynamic Voltage/Freq. Scaling; Power-aware Task Scheduling) and Communication Subsystem (Modulation, Code Rate; EE packet scheduling), coordinated through Middleware at the OS/Middleware/Application level]

DAC 2003

slide-14
SLIDE 14
Collaborating Radios

  • 50% energy reduction with CoolSpots
  • VoIP with Cell2Notify can reduce power 1.7-6.4x over WiFi and better than Cellular radios!

[Figure: CoolSpots radio switching (Switch: Wi-Fi -> BT)]

[Figure: Lifetime (Hours of Usage) for Beth, John, and James, Using WiFi vs. Using Cell2Notify; improvement labels 70%, 230%, 540%]

[Figure: Call logs (e.g. John, Beth): Duration of Calls (Minutes) vs. Hour of the Day]

[Figure: Power Consumption (Watts) for Verizon V620 (1xEVDO), SE-GC83 (GPRS/EDGE), and Netgear WAG511 (Wi-Fi)]

slide-15
SLIDE 15

Collaborating Processors

Problem: Power State Design Runs Into Use Models

  • Hosts (PCs) are either Awake (Active) or Sleep (Inactive)
  • Power consumed when Awake = 100X power in Sleep!
  • Network: Assumes hosts are always “Connected” (Awake)
  • Users want machines with the availability of an active machine and the power of a sleeping machine.

[Diagram: Host PC (host processor, RAM, peripherals, etc.; operating system, including networking stack; apps; Somniloquy daemon) and network interface hardware with a secondary processor (embedded CPU, RAM, flash; embedded OS, including networking stack; wakeup filters; appln. stubs)]

slide-16
SLIDE 16

Prototypes

  • USB Interface (Wake up Host + Status + Debug)
  • USB Interface (power + USBNet)
  • 100Mbps Ethernet Interface
  • Processor
  • SD Storage

slide-17
SLIDE 17

Network, Application Level Reachability

Respond to “ping”, ARP queries, maintain DHCP

Maintain availability across the entire protocol stack

  • E.g. ARP (layer 2), ICMP (layer 3), SSH (Application layer)

Desktop going to Sleep: 4 seconds
Desktop resuming from Sleep: 5 seconds

slide-18
SLIDE 18

Web downloads

200MB flash storage, download when PC is asleep

Wake up PC and upload to PC when needed


92% less energy than using the host PC for download

slide-19
SLIDE 19

Desktops: Power Savings

Using Somniloquy:

– Power drops from >100W to <5W
– Assuming a 45 hour work week: 620kWh saved per year
– US $56 savings, 378 kg CO2

Dell Optiplex 745 power consumption and transitions between states

State                     Power
Normal Idle State         102.1W
Lowest CPU frequency      97.4W
Disable Multiple cores    93.1W
“Base Power”              93.1W
Suspend state (S3)        1.2W
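The yearly savings figure can be reproduced with simple arithmetic. The sketch below assumes (this is not stated on the slide) that the desktop would otherwise sit in its normal idle state of about 102.1W outside the 45-hour work week, and that with Somniloquy it draws about 5W during those hours. The function name and the per-kWh rates are illustrative; the rates are merely what the slide's totals imply.

```python
HOURS_PER_WEEK = 7 * 24

def annual_savings_kwh(awake_watts: float, somniloquy_watts: float,
                       work_hours_per_week: float = 45.0,
                       weeks_per_year: int = 52) -> float:
    """Energy saved per year by sleeping during all non-work hours."""
    off_hours = (HOURS_PER_WEEK - work_hours_per_week) * weeks_per_year
    return (awake_watts - somniloquy_watts) * off_hours / 1000.0

saved_kwh = annual_savings_kwh(awake_watts=102.1, somniloquy_watts=5.0)

# Rates implied by the slide's totals (US $56 and 378 kg CO2 for 620 kWh);
# these are derived here, not quoted from the talk.
implied_usd_per_kwh = 56.0 / 620.0
implied_kg_co2_per_kwh = 378.0 / 620.0

print(f"Saved per year: {saved_kwh:.0f} kWh")
print(f"Implied price: ${implied_usd_per_kwh:.3f}/kWh, "
      f"implied emissions: {implied_kg_co2_per_kwh:.2f} kg CO2/kWh")
```

Under these assumptions the result lands within about 1 kWh of the 620kWh quoted on the slide.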

slide-20
SLIDE 20

Laptops: Extends Battery Lifetime

Using Somniloquy:

– Power drops from >11W to 1W
– Battery life increases from <6 hours to >60 hours
– Provides functionality of the “Baseline” state
– Power consumption similar to “Sleep” state

slide-21
SLIDE 21

Improving Energy Efficiency

Reduce distance (O1)

  • Physical, logical

Minimize wasted work (O2)

  • Shutdown, slowdown, procrastinate

Specialized heterogeneous processing (O3)

  • In a generalized execution environment

Apply these lessons to build better architectures, power management algorithms.

slide-22
SLIDE 22

Algorithmically, there are basically two ways to save power

[Diagram: Power Manager observing the Service Requestor, Queue, and Service Provider (observation, request) and issuing commands (on, off)]

[Diagram: Variable Power-Speed System with FIFO Input Buffer, Workload Filter, and Power-Speed Control Knob]

slide-23
SLIDE 23

Algorithmically, there are basically two ways to save power

Shutdown through choice of right system & device states

  • Multiple sleep states
  • Also known as Dynamic Power Management (DPM)

Slowdown through choice of right system & device states

  • Multiple active states
  • Also known as Dynamic Voltage/Frequency Scaling (DVS)

DPM + DVS

  • Choice between amount of slowdown and shutdown

[Diagram: Power Manager observing the Service Requestor, Queue, and Service Provider (observation, request) and issuing commands (on, off)]

[Diagram: Variable Power-Speed System with FIFO Input Buffer, Workload Filter, and Power-Speed Control Knob]

  • Competitive and Adversarial Approaches using Probabilistic Model Checking
  • Machine Learning Techniques
  • Convex Optimization for Thermally Efficient Chip Design

slide-24
SLIDE 24

Our Work In This Context

  • Quantitative bounds on the quality of DPM algorithms based on Competitive Analysis [TCAD 01]
  • DPM strategies for devices with both multiple active and multiple sleep states [TCAD 02]
  • Critical speed when using DPM + DVS [SODA 03, TECS 02]
  • Optimized slowdown methods under various timing scenarios [TCOM 06, TCAD 06, DAC 05-06, ECRTS 04-05]
  • Model the system as a game between a DPM algorithm and a non-deterministic adversary to verify competitive ratio [TVLSI 05]
  • Parameterized job scheduling problems [DCOSS 08, INFOCOM 09]

slide-25
SLIDE 25

Multi-state DPM: Lower Envelope

For each state i, plot:

$$\mathrm{Energy}_i(t) = \alpha_i t + \beta_i$$

[Figure: Energy vs. Time lines for State 1 through State 4, with times t1, t2, t3 marked]

LEA can be deterministic or probabilistic

PLEA is e/(e-1) competitive.

$$T_i = \arg\min_{T} \left[ \int_{0}^{T} p(t)\,[\alpha_{i-1} t + \beta_{i-1}]\,dt + \int_{T}^{\infty} p(t)\,[\alpha_{i-1} T + \alpha_i (t - T) + \beta_i]\,dt \right]$$
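A minimal sketch of the deterministic lower-envelope idea, assuming each state i costs α_i·t + β_i for an idle period of length t (α_i is the power drawn in the state, β_i a fixed transition energy). The state values and function names are hypothetical and chosen only to make the breakpoints easy to verify by hand; this is not the authors' implementation.

```python
from typing import List, Tuple

State = Tuple[float, float]  # (alpha: power in state, beta: fixed transition energy)

def envelope_state(states: List[State], t: float) -> int:
    """Index of the state with the lowest energy alpha*t + beta at idle time t.
    Ties are broken in favour of the lower-power state."""
    return min(range(len(states)),
               key=lambda i: (states[i][0] * t + states[i][1], states[i][0]))

def transition_times(states: List[State]) -> List[Tuple[float, int]]:
    """Breakpoints (time, state index) of the lower envelope of the energy lines."""
    current = envelope_state(states, 0.0)
    schedule = [(0.0, current)]
    while True:
        a_cur, b_cur = states[current]
        best = None
        for j, (a_j, b_j) in enumerate(states):
            if a_j < a_cur:  # only a lower-power state can take over later
                t_cross = (b_j - b_cur) / (a_cur - a_j)
                if (best is None or t_cross < best[0]
                        or (t_cross == best[0] and a_j < states[best[1]][0])):
                    best = (t_cross, j)
        if best is None:
            return schedule
        schedule.append(best)
        current = best[1]

# Hypothetical four-state device: deeper states draw less power but cost more to leave.
states = [(4.0, 0.0), (2.0, 2.0), (1.0, 6.0), (0.0, 16.0)]
for time, idx in transition_times(states):
    print(f"from t={time}: stay in state {idx}")
```

The algorithm follows the envelope as the idle period lengthens, moving to the next state at each breakpoint. The probabilistic variant would instead choose thresholds from the idle-time distribution p(t).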

slide-26
SLIDE 26

Lessons from Slowdown, Shutdown

Slowdown eventually reaches a limit w.r.t. work done, quality, timing

Shutdown keeps giving if

  • There is heterogeneity: large difference between “on” and “off” power
  • Keep finding opportunities to duty-cycle actions by using higher level semantics.

[Figure: timeline alternating Blocked (“Off”, Tblock) and Active (“On”, Tactive) periods]

Ideal improvement = 1 + Tblock/Tactive

Need to reach higher layers for shutdown power/energy awareness.
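The ideal-improvement ratio follows from comparing an always-on system, which spends energy P·(Tactive + Tblock), with one that is fully off while blocked, which spends P·Tactive. The sketch below shows that ratio and, as an added assumption not on the slide, a variant with a nonzero "off" power to illustrate why a large on/off gap matters. Transition costs are ignored and all names are illustrative.

```python
def ideal_improvement(t_block: float, t_active: float) -> float:
    """Energy ratio always-on / duty-cycled when 'off' draws no power."""
    return 1.0 + t_block / t_active

def improvement_with_off_power(t_block: float, t_active: float,
                               p_on: float, p_off: float) -> float:
    """Same ratio when the 'off' state still draws p_off (assumed extension)."""
    always_on = p_on * (t_active + t_block)
    duty_cycled = p_on * t_active + p_off * t_block
    return always_on / duty_cycled

# Hypothetical workload: active 1 time unit out of every 10.
print(ideal_improvement(t_block=9.0, t_active=1.0))
print(improvement_with_off_power(9.0, 1.0, p_on=100.0, p_off=1.0))
print(improvement_with_off_power(9.0, 1.0, p_on=100.0, p_off=50.0))
```

As the "off" power approaches the "on" power, the benefit of shutdown shrinks toward 1x regardless of how long the system is blocked.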

slide-27
SLIDE 27

What does it mean to be ‘aware’?

  • That the application and the services know about energy, power
  • File system, memory management, process scheduling
  • Make each of them energy aware

How does one make software “aware”?

  • Use “reflectivity” in software to build adaptive software
  • Ability to reason about and act upon itself (OS, MW)

slide-28
SLIDE 28

Example: Program Phases & Power Control

1. Characterize application offline

  • Divide an application into phases of execution
  • A group of program intervals executing similar code
  • Each phase has similar demand on resources, energy use
  • Similar code, similar resource demands (memory, IPC)

2. Annotate source code

  • Phase signatures

3. Enable OS (and hardware) to recognize signature

  • Smart hardware and/or online learning techniques

4. Dynamically tune the power manager

  • As application moves from one phase to another.
slide-29
SLIDE 29

Matching Signatures at Runtime

  • Use performance counters:
  • Can be programmed to generate an interrupt on specified counts
  • ISR provides matching with the meta data and mode changes
  • Every S*10,000 loop branches try a match
  • Phase matching can also be done in hardware
  • Notify power manager to trigger proper action (memory bank shutdowns)

slide-30
SLIDE 30

Results – Normalized to NAP

Average among bzip, mpeg, ghostscript and ADPCM


slide-31
SLIDE 31

Results - overheads

# of phases   # instructions   Overhead
5             2,580            0.7%
10            4,500            1%
20            8,280            2%
30            12,060           3%

  • Approx. 350K instructions for every 10,000 loop branch instructions
  • Number of instructions executed by the match algorithm at every 10,000 loop branches to match a partial signature (500 instructions per phase)
  • Size overhead: 4 bytes per inter-arrival estimate per bank/phase. 4 x 16 x 10 = 640 bytes assuming 16 banks and 10 phases.
  • The signatures take 1280 bytes for 10 phases. Total of about 2KB of metadata
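The overhead percentages and the metadata size can be rechecked from the numbers on this slide. The sketch below assumes the overhead is the matcher's instruction count divided by the roughly 350K instructions executed per 10,000 loop branches. Function and constant names are illustrative.

```python
INSTRUCTIONS_PER_WINDOW = 350_000  # approx. instructions per 10,000 loop branches (slide)

# Matcher cost in instructions, keyed by number of phases (values from the slide).
MATCH_COST = {5: 2_580, 10: 4_500, 20: 8_280, 30: 12_060}

def matching_overhead(match_instructions: int,
                      window: int = INSTRUCTIONS_PER_WINDOW) -> float:
    """Fraction of executed instructions spent in signature matching."""
    return match_instructions / window

def metadata_bytes(banks: int = 16, phases: int = 10,
                   bytes_per_estimate: int = 4,
                   signature_bytes: int = 1280) -> int:
    """Inter-arrival estimates plus signatures (1280 bytes is the 10-phase figure)."""
    return bytes_per_estimate * banks * phases + signature_bytes

for phases, cost in MATCH_COST.items():
    print(f"{phases:>2} phases: {matching_overhead(cost):.2%} overhead")
print(f"metadata: {metadata_bytes()} bytes")
```

The computed fractions fall slightly above the whole-number percentages in the table, and the metadata total comes to 1920 bytes, consistent with the "about 2KB" figure.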


slide-32
SLIDE 32

Takeaways

Algorithmically we look for the right combination of slowdown and shutdown strategies

  • Driven by increasingly real, accurate and timely sensor data that push the available slack to thermal limits

Architecturally we look for the right organization of components for maximal duty cycling

Future increases in energy efficiency lie in architectures that enable aggressive duty cycling

  • By continually reaching to the higher levels of decision making, capturing intent.

“Future lies in system architectures built for aggressive duty-cycling”

slide-33
SLIDE 33

Power Management in Mixed Use Buildings

  • 500 occupants, 750 machines (nom.)
  • Detailed instrumentation to measure macro and micro-scale power use
  • 39 sensor pods, 156 radios, 70 circuits

  • Subsystems: Air Conditioning, Lighting, …