Portable, Scalable, per-Core Power Estimation

Sally A. McKee
Chalmers University of Technology

Why Care about Power?

• Packaging/cooling
• Operating costs
• Performance
• Reliability
• Battery lifetime
• Device lifetime
• Ergonomics

What Can We Do with Power Info?

• Optimize thread allocation
• Manage workloads for
  • Power constraints
  • Temperature constraints
  • Data locality
• Budget power per core, process, or thread
• Adapt frequencies for performance requirements
• Resize/turn off structures

More Observations

• Energy efficiency essential at all scales
• Component power consumption difficult to measure
  • Processors share same power plane
  • External meters give total node power
  • Meter per node impossible for large-scale systems
  • Embedded measurement devices financially infeasible
  • Even invasive hardware still suffers inaccuracy
• But dynamic power estimation possible using performance monitoring counters (PMCs)

Approach

• Analytic models based on PMCs
  • Gather performance data from microbenchmarks
  • Collect power measurements
  • Categorize counters
  • Choose counters most strongly correlated with power
• Advantages
  • Easy
  • Portable
  • Dynamic
  • Application-independent
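The counter-selection step above can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not say which correlation measure was used, so Pearson correlation is assumed here, strength is taken as the absolute value of the coefficient, and the counter names, category name, and data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)

def pick_counters(samples, power, categories):
    """Per category, keep the counter most strongly correlated with power."""
    chosen = {}
    for category, names in categories.items():
        # Assumption: strength is |correlation|, so negative trends also count.
        chosen[category] = max(
            names, key=lambda name: abs(pearson(samples[name], power))
        )
    return chosen

# Hypothetical microbenchmark data: per-cycle event rates and measured watts.
power = [20.0, 22.0, 25.0, 29.0]
samples = {
    "counter_a": [0.10, 0.20, 0.35, 0.55],  # rises with power
    "counter_b": [0.30, 0.10, 0.40, 0.20],  # unrelated to power
}
categories = {"memory": ["counter_a", "counter_b"]}
print(pick_counters(samples, power, categories))
```

Running one such selection per category yields one model input per category.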

Approach (cont.)

• Microbenchmarks stress PMCs
• Four categories sufficient:
  • FP ops
  • Memory
  • Stalls
  • Instructions retired
• Future applications also described by model

[Figure: blowup of an AMD Phenom 9500 core showing the 128-bit FPU, L1 data cache, load/store, execution, fetch/decode/branch, L1 instruction cache, and 512kB L2 cache. Source: www.amd.com]

Initial Setup

• Measurement: pfmon, Watts Up Pro meter
• Benchmarks: SPEC 2006, SPEC OMP, NAS (gcc 4.2 -O3 [-OpenMP])

Forming the Model

• Counters with highest correlation become model inputs
• Counters e_i normalized to cycle count to give r_i
• Piece-wise linear model for per-core power
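As a small worked example of the normalization step, each accumulated counter value e_i is divided by the cycle count of the same interval to give a rate r_i. The function name and the sample counts below are hypothetical.

```python
def normalize(event_counts, cycles):
    """Return r_i = e_i / cycles for each accumulated counter value e_i."""
    if cycles <= 0:
        raise ValueError("cycle count must be positive")
    return [e / cycles for e in event_counts]

# Hypothetical accumulated counts for one sampling interval of 3,000,000 cycles.
rates = normalize([1_200, 2_400_000, 300_000, 600_000], cycles=3_000_000)
print(rates)
```

Working with rates instead of raw counts keeps the model inputs comparable across sampling intervals of different lengths.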

Forming the Model: AMD Phenom

• Function behavior differs for very low values of L2 counter
• All except FP correlate positively with power
• Including temperature increases accuracy

e1: L2_CACHE_MISS
e2: RETIRED_UOPS
e3: RETIRED_MMX_AND_FP_INSTRUCTIONS
e4: DISPATCH_STALLS

Model Validation

• Comparison of estimated and measured power
  • At wall socket
  • At ATX rails
  • On motherboard
• Three benchmark suites (45 benchmarks)
  • Single- and multi-threaded
  • Floating point and integer
• Six platforms (2-8 cores from Intel/AMD)


Median Estimation Errors

Quad-core platforms

Benchmark    AMD Phenom 9500    Intel Q6600    Intel Core i7
SPEC 2006    3.51 %             1.05 %         1.61 %
NAS          3.73 %             3.90 %         2.55 %
SPEC OMP     4.36 %             3.53 %         3.35 %

Dual-core and 8-core platforms

Benchmark    Intel Core Duo    Intel E5430    AMD Opteron 8212
SPEC 2006    4.01 %            2.76 %         4.80 %
NAS          4.52 %            1.59 %         3.11 %
SPEC OMP     5.16 %            1.59 %         4.14 %

Estimation Results: Intel Q6600

[Figure: estimation results for NAS, SPEC OMP, and SPEC 2006]

• Best: 0.2% (lbm)
• Worst: 8.4% (cg)
• 98% of estimations < 10% error
• 85% of estimations < 5% error
• Overall: SPEC 2006 2.4%, NAS 3.5%, SPEC-OMP 2.0%
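Summary figures of this kind (best, worst, median, share of estimations under 5% and 10%) can be computed from pairs of estimated and measured power. The sketch below assumes error is the absolute difference relative to measured power, which the slides do not state, and the data pairs are invented for illustration.

```python
def percent_error(estimated, measured):
    """Absolute estimation error as a percentage of measured power (assumed definition)."""
    return abs(estimated - measured) / measured * 100.0

def summarize(errors):
    """Best, worst, and median error, plus the share of errors under 5% and 10%."""
    ordered = sorted(errors)
    n = len(ordered)
    mid = n // 2
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    return {
        "best": ordered[0],
        "worst": ordered[-1],
        "median": median,
        "under_5": sum(e < 5 for e in ordered) / n * 100,
        "under_10": sum(e < 10 for e in ordered) / n * 100,
    }

# Hypothetical (estimated watts, measured watts) pairs, one per benchmark.
pairs = [(101.0, 100.0), (48.0, 50.0), (212.0, 200.0), (88.0, 100.0)]
summary = summarize([percent_error(est, meas) for est, meas in pairs])
print(summary)
```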

Estimation Results: Intel E5430 (8-Core)

[Figure: estimation results for NAS, SPEC OMP, and SPEC 2006]

• Best: 0.3% (ua)
• Worst: 7.0% (hmmer)
• 98% of estimations < 10% error
• 85% of estimations < 5% error
• Overall: SPEC 2006 3.5%, NAS 3.9%, SPEC-OMP 2.8%

Standard Deviation of Error: E5430

[Figure: standard deviation of error (% SD) per NAS benchmark: bt, cg, ep, ft, lu, lu-hp, mg, sp, ua]

Standard Deviation of Error: E5430

[Figure: standard deviation of error (% SD) per SPEC OMP benchmark: ammp, applu, apsi, art, fma3d, gafort, mgrid, quake, swim, wupwise]

Estimation Results: AMD Phenom 9500

[Figure: estimation results for NAS, SPEC OMP, and SPEC 2006]

• Best: 0.9% (libquantum)
• Worst: 9.3% (xalancbmk)
• 92% of estimations < 10% error
• 73% of estimations < 5% error
• Overall: SPEC 2006 4.5%, NAS 3.5%, SPEC-OMP 5.2%

Estimation Results: AMD Opteron 8212

[Figure: estimation results for NAS, SPEC OMP, and SPEC 2006]

• Best: 1.0% (cactusADM)
• Worst: 10.6% (leslie3d)
• 92% of estimations < 10% error
• 73% of estimations < 5% error
• Overall: SPEC 2006 4.5%, NAS 3.5%, SPEC-OMP 5.2%

Estimation Results: Intel Core i7

[Figure: estimation results for NAS, SPEC OMP, and SPEC 2006]

Factors Affecting Model Accuracy

• Availability of representative PMCs
• PMCs available for simultaneous sampling
• Sampling rate of power measurement
• Accuracy of thermal sensors

These look pretty good, but what are we missing? Could we do better w/ a different meter?

Power Measurement Infrastructures

• Wall outlet (Watts Up Pro)
  • Least intrusive
  • Low sampling rate
• PSU output on the ATX power rails
  • Moderately intrusive
  • Requires custom hardware
• Processor socket
  • Most intrusive
  • Requires soldering on motherboard

Comparative Power Measurement Setup

• Power measured at three points simultaneously
• Test machine used to collect samples different from target Core i7
• Custom sense hardware placed inside target machine cabinet

PSU Output Measurement

[Figure]

Measurement at PSU Output

[Figure]

Measurement at Processor Socket

• V_CPU = core voltage
• IMON = voltage proportional to regulator current output
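Given those two signals, socket power follows from P = V × I once the IMON voltage is converted back to a current. The sketch below assumes a simple linear scale factor; the parameter `amps_per_volt` and the sample readings are hypothetical, since the slides do not give the regulator's actual conversion.

```python
def socket_power(v_cpu, v_imon, amps_per_volt):
    """Power in watts: core voltage times the current encoded by IMON.

    amps_per_volt is a hypothetical, regulator-specific scale factor.
    """
    current = v_imon * amps_per_volt
    return v_cpu * current

# Hypothetical readings: 1.2 V core voltage, 0.5 V on IMON, 100 A per volt.
watts = socket_power(v_cpu=1.2, v_imon=0.5, amps_per_volt=100.0)
print(round(watts, 2))
```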

Estimation Results (PSU Output)

[Figure: estimation results for SPEC OMP, NAS, and SPEC 2006]

Estimation Results (Socket)

[Figure: estimation results for SPEC OMP, NAS, and SPEC 2006]

Comparative Results: SPEC OMP/Core i7

[Figure: panels for PSU (ATX rails), wall socket, and processor socket (motherboard)]

Power Measurement Experiments

• Sampling frequency (samples per second)
  • At wall outlet: 1
  • At ATX power rails and on MB: 50000
• Measurements averaged over 50 samples
• Test workload: 32x32 matmul in infinite loop
• Theoretical measurement sensitivity
  • Current measurement at ATX rails: 2 mA
  • CPU voltage measurement on motherboard: 47.2 µV
  • CPU current measurement on motherboard: 7 mA
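To make the arithmetic concrete: at 50000 samples per second, averaging over 50 samples leaves 1000 averaged readings per second. The sketch below assumes non-overlapping blocks (the slides do not say whether a moving average was used) and runs on synthetic data.

```python
def block_average(samples, block_size=50):
    """Average consecutive, non-overlapping blocks; a trailing partial block is dropped."""
    full = len(samples) - len(samples) % block_size
    return [
        sum(samples[i:i + block_size]) / block_size
        for i in range(0, full, block_size)
    ]

SAMPLE_RATE_HZ = 50_000
# Synthetic one-second trace alternating between 10 W and 11 W.
one_second = [10.0 + (i % 2) for i in range(SAMPLE_RATE_HZ)]
averaged = block_average(one_second)
print(len(averaged))
```

Averaging trades time resolution for lower noise, yet still leaves far more readings per second than the one-sample-per-second wall meter.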

Power Measurement Results

[Figure: idle power, then activating 1-4 cores]

CPU versus Memory-Bound Applications

[Figure; surviving label: memory]

DVFS + Throttling

[Figure]

Power Measurement Results – Efficiency

[Figure]

So What?

• Our models work pretty well
• More accurate measurement → more accurate models
• All measurement methods incur some error
• Intel Shady Brook uses similar approach to implement “digital power meter”

So we must be doing something right!

Live Power Management

• Proof-of-concept
• Goal
  • Schedule tasks under strict power budget
  • Minimal overhead
• Methodology
  • User-level meta scheduler
  • DVFS + process suspension to maintain power envelope
  • Two sample policies for process selection

Live Power Management

• Three categories of benchmarks
  • CPU bound
  • Memory bound
  • Mixed
• Power envelope set to 95%, 90%, 85%
• Results for both with/without DVFS
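A minimal sketch of budget-driven process selection follows. It is not the talk's scheduler: the slides do not spell out the selection policies, so this assumes a greedy rule that favors instructions per watt, ignores DVFS entirely, and uses hypothetical task names and numbers.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    watts: float          # estimated per-core power from the model
    instr_per_sec: float  # retired-instruction rate from the PMCs

def select_tasks(tasks, budget_watts):
    """Greedily admit tasks by instructions per watt; suspend whatever does not fit."""
    running, suspended = [], []
    used = 0.0
    by_efficiency = sorted(
        tasks, key=lambda t: t.instr_per_sec / t.watts, reverse=True
    )
    for task in by_efficiency:
        if used + task.watts <= budget_watts:
            running.append(task.name)
            used += task.watts
        else:
            suspended.append(task.name)
    return running, suspended

# Hypothetical workload of four tasks on a quad-core machine.
tasks = [
    Task("a", watts=20.0, instr_per_sec=4.0e9),
    Task("b", watts=25.0, instr_per_sec=2.5e9),
    Task("c", watts=15.0, instr_per_sec=3.6e9),
    Task("d", watts=22.0, instr_per_sec=3.3e9),
]
peak = sum(t.watts for t in tasks)  # power with every task running
running, suspended = select_tasks(tasks, budget_watts=0.90 * peak)
print(running, suspended)
```

A real meta-scheduler would repeat this decision as fresh estimates arrive and would try lowering frequency before suspending a process.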

Workloads with Different Intensities

• CPU bound
  • ep, gamess, namd, povray
  • calculix, ep, gamess, gromacs, h264ref, namd, perlbench, povray
• Moderate
  • art, lu, wupwise, xalancbmk
  • bwaves, cactusADM, fma3d, gcc, leslie3d, sp, ua, xalancbmk
• Memory bound
  • astar, mcf, milc, soplex
  • applu, astar, lbm, mcf, milc, omnetpp, soplex, swim

Meta-Scheduler Results: Intel Q6600

[Figure labels: Max Instructions/Watt; Per-core Fair; 90% Power Envelope, Moderate Computational Intensity; 95% Power Envelope, CPU-bound Workload]

Meta-Scheduler Results: AMD Phenom

[Figure labels: Max Instructions/Watt; Per-core Fair; 90% Power Envelope, Moderate Computational Intensity; 95% Power Envelope, CPU-bound Workload]

Performance Results: Intel Q6600

[Figure: panels for CPU-bound, memory-bound, and moderate workloads]

Performance Results: AMD Phenom

[Figure: panels for CPU-bound, memory-bound, and moderate workloads]

Performance Results: Intel E5430

[Figure: panels for CPU-bound, memory-bound, and moderate workloads]

Model Overhead

• Linux kernel implementation shows negligible computation overhead

Benchmark      Baseline (s)    Model (s)
ep.A serial    35.68           35.57
ep.A OMP        4.84            4.89
ep.A MPI        4.77            4.72
cg.A serial     5.82            5.83
cg.A OMP        1.95            1.95
cg.A MPI        2.19            2.20

Runtimes for NAS benchmarks on AMD 8212 with power estimation every 10 ms
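The relative overhead implied by these runtimes can be worked out directly. The snippet below uses only the numbers from the table; the function name is an illustrative choice.

```python
# (baseline seconds, seconds with the power model enabled), from the table above.
runtimes = {
    "ep.A serial": (35.68, 35.57),
    "ep.A OMP": (4.84, 4.89),
    "ep.A MPI": (4.77, 4.72),
    "cg.A serial": (5.82, 5.83),
    "cg.A OMP": (1.95, 1.95),
    "cg.A MPI": (2.19, 2.20),
}

def overhead_percent(baseline, model):
    """Runtime change relative to the baseline, in percent (negative means faster)."""
    return (model - baseline) / baseline * 100.0

overheads = {name: overhead_percent(b, m) for name, (b, m) in runtimes.items()}
for name, value in overheads.items():
    print(f"{name:12s} {value:+.2f} %")
```

The differences stay within roughly 1% in either direction, and two runs are marginally faster with the model enabled, so the spread may be no larger than ordinary run-to-run variation.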

Conclusions

• Per-core power estimation model
  • Reasonably accurate: 1.09%-5.16% error
  • Portable: Validated for six platforms
  • Scalable: Dual core to 8-core platforms
• Validation via three measurement approaches
• Evaluation using live power management
  • Useful for maintaining power budget
  • Efficient enough for online usage in kernel scheduler
  • Possible to implement virtual power sensors in VMs

What Next?

• Continue power estimation/resource management
  • GPU power models (w/ Melissa Smith, Clemson ECE)
  • Memory power models
  • Kernel/VM exploitation of power info
• Revisit memory controller/system design for CMPs
• Continue w/ compiler-based FT for exascale

And depending on available collaborators . . .

• Address new problems wrt green wireless
• Autotune applications for performance/power
• ? (I’m open to collaboration ideas)

Questions?

mckee@chalmers.se

Illustration: XEEMU Power Model

• Writing simulators is difficult and time-consuming
• Would like early (fast) power model; can incorporate detailed framework later
• Given a cycle-accurate simulator and hardware, we can use our power estimation for:
  • Quick functioning power simulator
  • Faster simulations

XEEMU

• XEEMU: validated Intel XScale cycle-accurate simulator
• Two reasons for choosing this simulator
  • Illustrate power model on an embedded platform
  • XEEMU power model has low error of 3.0%
• We replace XEEMU power model with our estimation model
  • Assume power numbers from XEEMU to be close to real hardware (validated low error)
• Evaluate accuracy on the CSiBE benchmark suite

XEEMU Estimation Results

Four counters: dl1.misses, sim_memory_dep, sim_num_insn, fpu access

• Best: png_encode 0.3%
• Worst: compiler 7.2%
• Overall error: 1.3%

Speeding Simulations

[Figure: speedup of simulations using our model over using Sim-Panalyzer]

Win-win:

• Power model now exists where none previously had
• Simulations not slowed by model
• Caveat: 1.3% extra error in power estimation (on top of intrinsic simulator error)

Questions? Ideas?

mckee@chalmers.se

Estimation Model Breakdown

\[
P_{core} =
\begin{cases}
F_1(g_1(r_1), \ldots, g_n(r_n), T), & \text{if } r_i < N \\
F_2(g_1(r_1), \ldots, g_n(r_n), T), & \text{else}
\end{cases}
\]

• n = number of counters used in model
• r_i = e_i / cycles, where e_i = accumulated PMC value
• N = breakpoint value
• T = T_current − T_idle
• g_i = optional transformation

Each piece F_j is linear in the transformed rates and the temperature term:

\[
F_j = p_0 + p_1 \cdot g_1(r_1) + \ldots + p_n \cdot g_n(r_n) + p_{n+1} \cdot T, \quad j \in \{1, 2\}
\]
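The piecewise model can be written out as a short function: one linear piece is evaluated when a selected counter rate lies below a breakpoint, and another otherwise. This is a sketch under stated assumptions: the coefficients, breakpoint, and rates below are invented, the strict "less than" comparison is assumed, and the optional transformations default to the identity.

```python
def linear_piece(coeffs, transformed, delta_t):
    """Evaluate p_0 + p_1*g_1(r_1) + ... + p_n*g_n(r_n) + p_{n+1}*T."""
    if len(coeffs) != len(transformed) + 2:
        raise ValueError("need n + 2 coefficients for n counters")
    p0, *weights, p_temp = coeffs
    return p0 + sum(p * g for p, g in zip(weights, transformed)) + p_temp * delta_t

def estimate_core_power(rates, delta_t, breakpoint_n, low_coeffs, high_coeffs,
                        transforms=None, selector=0):
    """Pick the linear piece by comparing rates[selector] with the breakpoint."""
    if transforms is None:
        transforms = [lambda r: r] * len(rates)  # identity transformation
    transformed = [g(r) for g, r in zip(transforms, rates)]
    # Assumption: the first piece applies strictly below the breakpoint.
    coeffs = low_coeffs if rates[selector] < breakpoint_n else high_coeffs
    return linear_piece(coeffs, transformed, delta_t)

# Hypothetical two-counter model; delta_t is T_current - T_idle in degrees.
low_coeffs = [10.0, 100.0, 5.0, 0.1]
high_coeffs = [12.0, 50.0, 5.0, 0.1]
below = estimate_core_power([0.0004, 0.8], 10.0, 0.001, low_coeffs, high_coeffs)
above = estimate_core_power([0.01, 0.8], 10.0, 0.001, low_coeffs, high_coeffs)
print(round(below, 2), round(above, 2))
```

In practice the coefficients of each piece would come from fitting against measured power on the microbenchmarks.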

Estimation Results: Intel E5430

[Figure: estimation results for NAS, SPEC OMP, and SPEC 2006]