Models and Metrics for Energy-Efficient Computer Systems (PowerPoint PPT Presentation)


SLIDE 1

Models and Metrics for Energy-Efficient Computer Systems

Suzanne Rivoire May 22, 2007 Ph.D. Defense EE Department, Stanford University

SLIDE 2

Power and Energy Concerns

Processors: power density

[Borkar, Intel]

SLIDE 3

Power and Energy Concerns (2)

Personal computers

  • Mobile devices: battery life/usability
  • Desktops: electricity costs, noise

Servers and data centers

  • Power and cooling costs
  • Reliability
  • Density/scalability
  • Pollution
  • Load on utilities
SLIDE 4

Underlying Questions

Metrics: What are we aiming for?

  • Compare energy efficiency
  • Identify / motivate new designs

Models: How do we get there?

  • Understand how high-level properties affect power
  • Improve power-aware scheduling policies / usage
SLIDE 5

Talk Overview

Metrics: JouleSort benchmark

  • First complete, full-system energy-efficiency benchmark
  • Design of winning system

Models: Mantis approach

  • Generates family of high-level full-system models
  • Generic, accurate, portable
SLIDE 6

JouleSort energy-efficiency benchmark

JouleSort benchmark specification

  • Workload, metric, guidelines
  • Rationale and pitfalls

Energy-efficient system design: 2007 “winner”

  • 3.5× better than previous best
  • Insights for future designs

[S. Rivoire, M. A. Shah, P. Ranganathan, C. Kozyrakis, “JouleSort: A Balanced Energy-Efficiency Benchmark,” SIGMOD 2007.]
SLIDE 7

Why a benchmark?

Track progress, compare systems, spur innovation

Limitations of current benchmarks/metrics:

  • Under-specified or “under construction”
  • Limited to a particular component or domain
SLIDE 8

Benchmark design goals

  • Holistic and balanced: exercises all core components
  • Inclusive and representative: meaningful and implementable on many different machines
  • History-proof: meaningful comparisons between scores from different years
SLIDE 9

Benchmark specification overview

  • Workload
  • Metric
  • Rules
SLIDE 10

Workload: External sort

  • Sort randomly permuted 100-byte records with 10-byte keys
  • From file on non-volatile store to file on non-volatile store (“external” storage)
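The record format can be sketched in a few lines. This is a hedged, in-memory illustration only: a real JouleSort run sorts file-to-file on non-volatile storage, and the record generator below is a simple stand-in for the official input generator, not its actual algorithm.

```python
import random

REC_SIZE, KEY_SIZE = 100, 10  # 100-byte records, 10-byte keys

def make_records(n, seed=0):
    """Generate n randomly keyed records (illustrative stand-in only)."""
    rng = random.Random(seed)
    return [bytes(rng.randrange(256) for _ in range(KEY_SIZE)) +
            bytes(REC_SIZE - KEY_SIZE)  # zero-filled payload for illustration
            for _ in range(n)]

def sort_records(records):
    """Sort by the 10-byte key; Python bytes compare lexicographically."""
    return sorted(records, key=lambda r: r[:KEY_SIZE])

records = sort_records(make_records(1000))
```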

SLIDE 11

External sort workload

Simple and balanced

  • Exercises all core components: CPU, memory, disk, I/O, OS, filesystem
  • End-to-end measure of improvement

Inclusive of variety of systems

  • PDAs, laptops, desktops, supercomputers

Representative of sequential I/O tasks

Technology trend bellwether

  • Supercomputers to clusters, GPU?
SLIDE 12

Existing sort benchmarks

Sort benchmarks used since 1985

Pure performance

  • MinuteSort: How many records sorted in 1 min?
  • Terabyte: How much time to sort 1 TB?

Price-performance

  • PennySort: How many records sorted for $0.01?
  • Performance-Price: MinuteSort / $

More info at http://research.microsoft.com/barc/SortBenchmark/
SLIDE 13

JouleSort metric choices

How to weigh power and performance?

  • Equally (energy)? Energy (Joules) = Power (Watts) × Time (sec.)
  • Privilege performance (energy-delay product)?

What to fix and what to compare?

  • Fix energy budget and compare records sorted?
  • Fix number of records and compare energy?
  • Fix time budget and compare records/Joule?
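The energy relationship above can be made concrete with a small sketch; the numbers are illustrative only, not from any benchmark run.

```python
def records_per_joule(records_sorted, avg_power_w, elapsed_s):
    """JouleSort-style score: Energy (J) = Power (W) x Time (s)."""
    energy_joules = avg_power_w * elapsed_s
    return records_sorted / energy_joules

# Illustrative only: sorting 10**8 records at an average of 100 W
# over 1000 s consumes 100 kJ, i.e. 1000 sorted records per Joule.
score = records_per_joule(10**8, 100.0, 1000.0)
```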

SLIDE 14

Problem with Fixed Time Budget

[Chart: SortedRecs/Joule vs. records sorted, 10^5 to 10^10]

  • 1-pass sort < 10 sec
  • O(N lg N) complexity skews SortedRecs/Joule
SLIDE 15

Final metric: Fixed input size

  • 3 classes: 10 GB, 100 GB, 1 TB
  • Winner: minimum energy
  • Report (records sorted / Joule)
  • Inter-class comparisons imperfect
  • Adjust classes as technology improves
SLIDE 16

Energy measurement setup

[Diagram: the sorting system draws wall AC power through a power meter; a separate monitoring system collects power readings over a serial cable and sort timing over the network]
SLIDE 17

Talk Overview

Metrics: JouleSort benchmark

  • First complete, full-system energy-efficiency benchmark
  • Design of winning system

Models: Mantis approach

  • Generates family of high-level full-system models
  • Generic, accurate, portable
SLIDE 18

Representative systems

  System                     Disks   SRecs   CPU %   Pwr (W)   SRecs/J
  GPUTeraSort (estimated)      9     59 GB    n/a      290      ~3200
  Laptop                       1     10 GB     1%       22      ~3400
  Commodity fileserver        12     10 GB    >90%     406      ~3800
  Low-end server               2     10 GB    26%      140      ~1200
  Blade                        1      5 GB    11%       90       ~300
SLIDE 21

Energy-Efficient Components: Processor

CoolSort's mobile CPU delivers 75% of the fileserver CPU's sort performance at 52% of its power:

  • Fileserver CPU: sort BW 313 MB/s at 65 W (peak)
  • CoolSort CPU: sort BW 236 MB/s at 34 W (peak)
SLIDE 22

Energy-Efficient Components: Disks

The mobile disk delivers 50% of the desktop disk's sequential performance at 15% of its power:

Fileserver: Seagate Barracuda

  • Seq. BW: 80 MB/s
  • 13 W

Our winner: Hitachi Travelstar

  • Seq. BW: 40 MB/s
  • 2 W
SLIDE 23

CoolSort Design

  • Asus motherboard: mobile CPU + 2 PCI-e slots
  • RocketRAID disk controllers
  • 13 Hitachi Travelstar 160 GB disks
SLIDE 24

[Chart: SortedRecs/Joule and SortedRecs/sec vs. number of disks used, 2 to 13, with GPUTeraSort shown for reference]

Maximizing performance

  • Balanced sort: enough disks to fully utilize CPU
  • Disks running near peak BW
SLIDE 25

CoolSort: The 100 GB winner

  • 11,300 records sorted per Joule
  • 3.5× more efficient than GPUTeraSort
  • Average sorting power: 100 W
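As a back-of-the-envelope consistency check on the figures above (the 100 GB class is 10^9 100-byte records):

```python
records = 10**9               # 100 GB class: 10^9 100-byte records
score_recs_per_joule = 11_300 # CoolSort's reported score
avg_power_w = 100.0           # average sorting power from the slide

energy_j = records / score_recs_per_joule  # ~88.5 kJ for the whole sort
runtime_s = energy_j / avg_power_w         # ~885 s, roughly 15 minutes
```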

SLIDE 26

Insights for future designs

Low-hanging fruit: use low-power HW

  • Best power-performance trade-off
  • Still need to fully utilize resources
  • Challenge: adequate interfaces and “glue” to bring laptop components into servers

Scaledown efficiency

  • Limited dynamic range
  • For fixed HW: peak efficiency = peak performance
  • How can we design machines that perform equally well in different benchmark classes?
SLIDE 27

Benchmark limitations

Tests energy efficiency at high utilization, but most servers are under-utilized

  • How efficient is the system at 50% utilization? 20%?

Doesn’t measure building power/cooling

  • Real goal: TCOSort
  • JouleSort and PennySort give pieces of the answer
SLIDE 28

JouleSort Conclusions

Need for an energy-efficiency benchmark

JouleSort specification

  • Simple, representative, full-system benchmark
  • Workload, metric, measurement rules

CoolSort system

  • 3.5× better than 2006 estimated winner
  • Mobile components, server-class interfaces

Part of the sort benchmark suite

  • joulesort.stanford.edu
SLIDE 29

Talk Overview

Metrics: JouleSort benchmark

  • First complete, full-system energy-efficiency benchmark
  • Design of winning system

Models: Mantis approach

  • Generates family of high-level full-system models
  • Generic, accurate, portable
SLIDE 30

Who needs power models?

Component and system designers

How do design decisions affect power?

Users

How do my usage patterns affect power?

Data center schedulers

How will workload distribution decisions affect power?

SLIDE 31

Power modeling goals

Goal: online, full-system power models

Model requirements:

  • Non-intrusive and low-overhead
  • Easy to develop and use
  • Fast enough for online use
  • Reasonably accurate (within 10%)
  • Inexpensive
  • Generic and portable
SLIDE 32

Power modeling approaches

Detailed component models

  • Simulation-based
  • Hardware metric-based

High-level full-system models
SLIDE 33

Detailed models: Simulation-based

  • Inexpensive, arbitrarily accurate
  • Not full-system
  • Slow (not real-time)
  • Not portable

Input (current state, architecture, circuit parameters) → simulation → output: predicted power (component)
SLIDE 34

Detailed models: Metric-based

  • Highly accurate
  • Not full-system
  • Complex, require specialized knowledge
  • Not portable

Input (design info, HW counters) → equation → output: predicted power (component)

[Contreras and Martonosi, ISLPED 2005] [Isci and Martonosi, MICRO 2003]
SLIDE 35

High-level metrics (Mantis)

  • How accurate? How portable?
  • Tradeoff between model parameters/complexity and accuracy?

Input (common utilization metrics) → equation → output: predicted power (system)
SLIDE 36

Power Modeling

Run one-time calibration scheme (possibly at vendor)

  • Stress individual components: CPU, memory, disk
  • Outputs: time-stamped performance metrics & AC power measurements

Fit model parameters to calibration data

Use model to predict power

  • Inputs: performance metrics at each time t
  • Output: estimate of AC power at each time t
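The fitting step above can be sketched for the simplest model form, linear in CPU utilization. This is a minimal least-squares sketch with made-up calibration samples; Mantis fits its richer, multi-input models in the same spirit.

```python
def fit_linear(utils, watts):
    """Least-squares fit of P = c0 + c1*u to calibration samples."""
    n = len(utils)
    mean_u = sum(utils) / n
    mean_p = sum(watts) / n
    c1 = (sum((u - mean_u) * (p - mean_p) for u, p in zip(utils, watts))
          / sum((u - mean_u) ** 2 for u in utils))
    c0 = mean_p - c1 * mean_u
    return c0, c1

# Made-up calibration data: CPU load stepped 0..100%, AC power measured.
utils = [0, 25, 50, 75, 100]
watts = [60, 70, 80, 90, 100]
c0, c1 = fit_linear(utils, watts)  # c0 = 60.0 W idle, c1 = 0.4 W per util point

predicted = c0 + c1 * 50           # model's estimate at 50% utilization
```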

SLIDE 37

Models studied

Constant power (the null model): P = C0

CPU utilization-based models:

  • Input: CPU util. % → equation → output: predicted power (system)
SLIDE 38

CPU utilization-based models

Linear in CPU utilization:

P = C0 + C1·u

Empirical power model [Fan et al, ISCA 2007]:

P = C0 + C1·u + C2·u^r
SLIDE 39

CPU + disk utilization

Input:

  • CPU util. %
  • Disk util. %

Equation → output: predicted power (system)

P = C0 + C1·uCPU + C2·udisk

[Heath et al, PPoPP 2005]
SLIDE 40

CPU + disk util. + performance ctrs

Input:

  • CPU util. %
  • Disk util. %
  • CPU performance counters

Equation → output: predicted power (system)

P = C0 + C1·uCPU + C2·udisk + Σi Ci·pi

[D. Economou, S. Rivoire, C. Kozyrakis, P. Ranganathan, MoBS 2006]
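The model form above is just an affine function of its inputs; a sketch with hypothetical coefficients (real coefficients come out of the calibration fit):

```python
def predict_power(coeffs, u_cpu, u_disk, counter_rates):
    """P = C0 + C1*u_cpu + C2*u_disk + sum_i(Ci * p_i)."""
    c0, c1, c2, *cs = coeffs
    assert len(cs) == len(counter_rates)
    return (c0 + c1 * u_cpu + c2 * u_disk
            + sum(c * p for c, p in zip(cs, counter_rates)))

# Hypothetical coefficients: constant, CPU %, disk %, two counter rates.
coeffs = [55.0, 0.3, 0.1, 8.0, 4.0]
watts = predict_power(coeffs, u_cpu=80.0, u_disk=20.0,
                      counter_rates=[0.5, 0.25])
```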
SLIDE 41

CPU performance counters

Configurable processor registers that count microarchitectural events (requires OS modification). Counters used in this study:

  • Memory bus transactions
  • Unhalted CPU clock cycles
  • Instructions retired / ILP
  • Last-level cache references
  • Floating-point instructions
SLIDE 42

Evaluation methodology

Run calibration suite and develop models

  • On a variety of machines

Run benchmarks, collecting metrics and AC power

Compare predicted power from metrics with measured AC power
SLIDE 43

Evaluation machines

CoolSort with 1 and 13 disks

  • Highest and lowest frequencies

2005-era AMD laptop

  • Highest and lowest frequencies

2005-era Itanium server

2008-era Xeon server with 32 GB FB-DIMM

Variety in component balance, processor, domain, dynamic range
SLIDE 44

Evaluation benchmarks

SPECcpu int and fp

  • Laptop: gcc and gromacs only

SPECjbb

Stream

I/O-intensive programs

  • ClamAV
  • Nsort (CoolSort-13 only)
  • SPECweb (Itanium only)

SLIDE 48

Overall mean % error

  • Performance counter model is most accurate across the board.
  • Any model is more accurate than none, and more detail/complexity is better than less.
  • Simple linear CPU-util. model gets within 10%… with some exceptions.
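The reported figure is a mean percentage error against measured AC power; a sketch of that computation with made-up samples (the sample values are illustrative, not from the study):

```python
def mean_pct_error(predicted, measured):
    """Mean absolute percentage error of predicted vs. measured power."""
    errors = [abs(p - m) / m * 100.0 for p, m in zip(predicted, measured)]
    return sum(errors) / len(errors)

# Made-up predicted/measured AC watts over four sample intervals:
err = mean_pct_error([95, 105, 150, 210], [100, 100, 160, 200])  # 5.3125 %
```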

SLIDE 50

Best case for empirical CPU model

(Xeon server)

Useful to model shared resources and bottlenecks

SLIDE 53

Best case for performance counters

(Xeon server and CoolSort-13)

  • Necessary when dynamic memory power is high
  • Useful to tell how CPU is being utilized

slide-54
SLIDE 54

Modeling conclusions

Generic approach to power modeling yields accurate results

  • Simple models overall have < 10% error
  • Same parameters work across very different machines
  • More information → better models

Linear CPU-util. model is not enough for…

  • Machines and workloads that are not CPU-dominated
  • CPUs with shared resource bottlenecks
  • Aggressively power-optimized CPUs
  • …all of which reflect hardware trends.
SLIDE 55

Future work

Beyond CPU, memory, and disk

  • GPUs
  • Network (not a factor today)

Model complexity

  • Combine exponential CPU model w/ perfctrs?
  • Cooling?
SLIDE 56

Overall Summary

Models and metrics are needed to improve energy efficiency

Metrics:

  • JouleSort energy-efficiency benchmark specification
  • Winning JouleSort machine

Models:

  • Simple, portable high-level modeling technique
  • Trade-offs between accuracy and simplicity
SLIDE 57

Acknowledgments

  • Advisor: Christos Kozyrakis
  • Mentor: Partha Ranganathan
  • Committee: Kunle Olukotun & Dwight Nishimura
  • Collaborators: Mehul Shah, Dimitris Economou, Justin Meza
  • Assistance: Jacob Leverich, HP Labs, Charlie Orgish, Teresa Lynn
  • Defense food: Jayanth and Amin
  • Architecture grad students
  • Grant Gavranovic, Kelley Rivoire, friends & family