 
Portable, Scalable, per-Core Power Estimation
Sally A. McKee
Chalmers University of Technology

Why Care about Power?
• Packaging/cooling
• Operating costs
• Performance
• Reliability
• Battery lifetime
• Device lifetime
• Ergonomics
Slide 2

What Can We Do with Power Info?
• Optimize thread allocation
• Manage workloads for
  • Power constraints
  • Temperature constraints
  • Data locality
• Budget power per core, process, or thread
• Adapt frequencies for performance requirements
• Resize/turn off structures
Slide 3

More Observations
• Energy efficiency essential at all scales
• Component power consumption difficult to measure
  • Processors share same power plane
  • External meters give total node power
  • Meter per node impossible for large-scale systems
  • Embedded measurement devices financially infeasible
  • Even invasive hardware still suffers inaccuracy
• But dynamic power estimation possible using performance monitoring counters (PMCs)
Slide 4

Approach
• Analytic models based on PMCs
  • Gather performance data from microbenchmarks
  • Collect power measurements
  • Categorize counters
  • Choose counters most strongly correlated with power
• Advantages
  • Easy
  • Portable
  • Dynamic
  • Application-independent
Slide 5

Approach (cont.)
• Microbenchmarks stress PMCs
• Four categories sufficient:
  • FP ops
  • Memory
  • Stalls
  • Instructions retired
• Future applications also described by model
[Figure: Blowup of AMD Phenom 9500 core, showing 128-bit FPU, Load/Store, L1 Data Cache, 512kB L2 Cache, Execution, Fetch/Decode/Branch, L1 Instr Cache. Source: www.amd.com]
Slide 6

Initial Setup
Measurement: pfmon, Watts Up Pro meter
Benchmarks: SPEC 2006, SPEC OMP, NAS (gcc 4.2 -O3 [-OpenMP])
Slide 7

Forming the Model
• Counters with highest correlation become model inputs
• Counters e_i normalized to cycle count to give r_i
• Piece-wise linear model for per-core power
Slide 8

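The model-forming steps on Slide 8 can be sketched roughly as follows, assuming NumPy. The function names, the single threshold on one normalized counter, and the choice of ordinary least squares are illustrative assumptions; the slides state only that counters are normalized to cycle count and that the per-core model is piece-wise linear.

```python
import numpy as np

def normalize(counts, cycles):
    """Turn raw event counts e_i into rates r_i = e_i / cycles, one row per sample."""
    counts = np.asarray(counts, dtype=float)
    cycles = np.asarray(cycles, dtype=float)
    return counts / cycles[:, None]

def _design(rates):
    # Prepend a column of ones so each linear piece has an intercept.
    return np.hstack([np.ones((rates.shape[0], 1)), rates])

def fit_piecewise(rates, power, split_col=0, threshold=1e-3):
    """Fit one least-squares linear model on each side of a threshold.

    split_col and threshold are hypothetical; they stand in for whatever
    condition separates the two regimes of the real model.
    """
    X = _design(rates)
    low = rates[:, split_col] < threshold
    coefs = {}
    for name, mask in (("low", low), ("high", ~low)):
        coefs[name], *_ = np.linalg.lstsq(X[mask], power[mask], rcond=None)
    return coefs

def predict(coefs, rates, split_col=0, threshold=1e-3):
    """Estimate per-core power by applying the piece that matches each sample."""
    X = _design(rates)
    low = rates[:, split_col] < threshold
    return np.where(low, X @ coefs["low"], X @ coefs["high"])
```

In this sketch, `rates` has one column per selected counter and `power` holds the measured power for each sample; both regimes need more samples than coefficients for the fit to be meaningful.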
Forming the Model: AMD Phenom
• Function behavior differs for very low values of L2 counter
• All except FP correlate positively with power
• Including temperature increases accuracy
e_1: L2_CACHE_MISS
e_2: RETIRED_UOPS
e_3: RETIRED_MMX_AND_FP_INSTRUCTIONS
e_4: DISPATCH_STALLS
Slide 9

Model Validation
• Comparison of estimated and measured power
  • At wall socket
  • At ATX rails
  • On motherboard
• Three benchmark suites (45 benchmarks)
  • Single- and multi-threaded
  • Floating point and integer
• Six platforms (2-8 cores from Intel/AMD)
Slide 10

Maximum Estimation Errors

Quad Core
Benchmark   AMD Phenom 9500   Intel Q6600   Intel Core i7
SPEC2006    3.51 %            1.05 %        1.61 %
NAS         4.52 %            1.59 %        3.11 %
SPEC OMP    5.16 %            1.59 %        4.14 %

Dual Core / 8-Core
Benchmark   Intel Core Duo    Intel E5430   AMD Opteron 8212
SPEC2006    4.01 %            2.76 %        4.80 %
NAS         3.73 %            3.90 %        2.55 %
SPEC OMP    4.36 %            3.53 %        3.35 %

Median Errors
Slide 11

Median Estimation Errors

Quad Core
Benchmark   AMD Phenom 9500   Intel Q6600   Intel Core i7
SPEC2006    3.51 %            1.61 %        1.05 %
NAS         4.52 %            1.59 %        3.11 %
SPEC OMP    1.59 %            4.14 %        5.16 %

Dual Core / 8-Core
Benchmark   Intel Core Duo    Intel E5430   AMD Opteron 8212
SPEC2006    4.01 %            2.76 %        4.80 %
NAS         3.73 %            3.90 %        2.55 %
SPEC OMP    4.36 %            3.53 %        3.35 %

Median Errors
Slide 12

Estimation Results: Intel Q6600
[Figure panels: NAS, SPEC OMP, SPEC 2006]
Slide 13

Estimation Results: Intel Q6600
[Figure panels: NAS, SPEC OMP, SPEC 2006]
Slide 14

Estimation Results: Intel Q6600
Best: 0.2% (lbm)
Worst: 8.4% (cg)
98% of estimations < 10% error
85% of estimations < 5% error
Overall: SPEC 2006 2.4%, NAS 3.5%, SPEC-OMP 2.0%
Slide 15

Estimation Results: Intel E5430
[Figure panels: NAS, SPEC OMP, SPEC 2006]
Slide 16

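The summary figures on the results slides (best, worst, share of estimations under 5% and 10% error) could be computed along these lines. This is a sketch under the assumption that error means absolute difference relative to measured power; the slides do not spell out the definition, and the function names are illustrative.

```python
import statistics

def percent_errors(estimated, measured):
    """Absolute estimation error as a percentage of measured power (assumed definition)."""
    return [abs(e - m) / m * 100.0 for e, m in zip(estimated, measured)]

def summarize(errors):
    """Collect the kinds of statistics quoted on the results slides."""
    n = len(errors)
    return {
        "best": min(errors),
        "worst": max(errors),
        "median": statistics.median(errors),
        "frac_below_5": sum(e < 5.0 for e in errors) / n,
        "frac_below_10": sum(e < 10.0 for e in errors) / n,
    }
```

Each input list holds one value per benchmark (or per sample), in watts.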
Estimation Results: Intel E5430
[Figure panels: NAS, SPEC OMP, SPEC 2006]
Slide 17

Estimation Results: Intel E5430 8-Core
Best: 0.3% (ua)
Worst: 7.0% (hmmer)
98% of estimations < 10% error
85% of estimations < 5% error
Overall: SPEC 2006 3.5%, NAS 3.9%, SPEC-OMP 2.8%
Slide 18

Standard Deviation of Error: E5430
[Figure: bar chart of % SD (y-axis, 0 to 10) for the NAS benchmarks bt, cg, ep, ft, lu, lu-hp, mg, sp, ua]
Slide 19

Standard Deviation of Error: E5430
[Figure: bar chart of % SD (y-axis, 0 to 10) per SPEC OMP benchmark]
Slide 20

Estimation Results: AMD Phenom 9500
[Figure panels: NAS, SPEC OMP, SPEC 2006]
Slide 21

Estimation Results: AMD Phenom 9500
[Figure panels: NAS, SPEC OMP, SPEC 2006]
Slide 22

Estimation Results: AMD Phenom 9500
Best: 0.9% (libquantum)
Worst: 9.3% (xalancbmk)
92% of estimations < 10% error
73% of estimations < 5% error
Overall: SPEC 2006 4.5%, NAS 3.5%, SPEC-OMP 5.2%
Slide 23

Estimation Results: AMD Opteron 8212
[Figure panels: NAS, SPEC OMP, SPEC 2006]
Slide 24

Estimation Results: AMD Opteron 8212
[Figure panels: NAS, SPEC OMP, SPEC 2006]
Slide 25

Estimation Results: AMD Opteron 8212
Best: 1.0% (cactusADM)
Worst: 10.6% (leslie3d)
92% of estimations < 10% error
73% of estimations < 5% error
Overall: SPEC 2006 4.5%, NAS 3.5%, SPEC-OMP 5.2%
Slide 26

Estimation Results: Intel Core i7
[Figure panels: NAS, SPEC OMP, SPEC 2006]
Slide 27

Factors Affecting Model Accuracy
• Availability of representative PMCs
• PMCs available for simultaneous sampling
• Sampling rate of power measurement
• Accuracy of thermal sensors
These look pretty good, but what are we missing?
Could we do better with a different meter?
Slide 28

Power Measurement Infrastructures
• Wall outlet (Watts Up Pro)
  • Least intrusive
  • Low sampling rate
• PSU output on the ATX power rails
  • Moderately intrusive
  • Requires custom hardware
• Processor socket
  • Most intrusive
  • Requires soldering on motherboard
Slide 29

Comparative Power Measurement Setup
• Power measured at three points simultaneously
• Test machine used to collect samples different from target Core i7
• Custom sense hardware placed inside target machine cabinet
Slide 30

PSU Output Measurement
[Figure]
Slide 31

Measurement at PSU Output
[Figure]
Slide 32

Measurement at Processor Socket
• V_CPU = Core voltage
• IMON = Voltage proportional to regulator current output
Slide 33

Estimation Results (PSU Output)
[Figure panels: NAS, SPEC OMP, SPEC 2006]
Slide 34

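The two socket signals on Slide 33 combine into a power reading roughly as sketched below. The calibration constant `amps_per_volt` is a hypothetical parameter: the slides say only that IMON is proportional to the regulator's output current, not what the proportionality factor is.

```python
def socket_power(v_cpu, v_imon, amps_per_volt):
    """Instantaneous CPU power in watts from the two socket signals.

    v_cpu:         measured core voltage, in volts
    v_imon:        IMON voltage, in volts (proportional to regulator output current)
    amps_per_volt: hypothetical calibration constant mapping IMON volts to amps
    """
    current = v_imon * amps_per_volt  # recover regulator output current
    return v_cpu * current            # P = V * I
```

For example, under an assumed calibration of 100 A/V, `socket_power(1.2, 0.5, 100.0)` corresponds to 1.2 V at 50 A.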
Estimation Results (Socket)
[Figure panels: NAS, SPEC OMP, SPEC 2006]
Slide 35

Comparative Results: SPEC OMP/Core i7
[Figure panels: Wall Socket; PSU (ATX Rails); Processor Socket (Motherboard)]
Slide 36

Power Measurement Experiments
• Sampling frequency (samples per second)
  • At wall outlet: 1
  • At ATX power rails and on MB: 50000
  • Measurements averaged over 50 samples
• Test workload: 32x32 matmul in infinite loop
• Theoretical measurement sensitivity
  • Current measurement at ATX rails: 2 mA
  • CPU voltage measurement on motherboard: 47.2 uV
  • CPU current measurement on motherboard: 7 mA
Slide 37

Power Measurement Results
[Figure; annotations: idle power, activating 1-4 cores]
Slide 38

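The averaging step on Slide 37 can be illustrated as follows. The sketch assumes non-overlapping blocks of 50 raw samples, which would turn the 50000 samples per second taken at the ATX rails and on the motherboard into 1000 averaged readings per second; the slides do not say whether the windows overlap, and the names here are illustrative.

```python
RAW_RATE_HZ = 50_000   # samples per second at the ATX rails and on the motherboard
BLOCK = 50             # raw samples per averaged reading
EFFECTIVE_RATE_HZ = RAW_RATE_HZ // BLOCK  # averaged readings per second

def block_average(samples, block=BLOCK):
    """Average consecutive, non-overlapping groups of `block` samples.

    Any incomplete group at the end of the input is dropped.
    """
    usable = len(samples) - len(samples) % block
    return [sum(samples[i:i + block]) / block for i in range(0, usable, block)]
```

Averaging trades time resolution for lower noise, which still leaves far more readings per second than the single sample per second available at the wall outlet.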
CPU versus Memory-Bound Applications
[Figure; visible label: memory]
Slide 39

DVFS + Throttling
[Figure]
Slide 40

Power Measurement Results – Efficiency
[Figure]
Slide 41

So What?
• Our models work pretty well
• More accurate measurement → more accurate models
• All measurement methods incur some error
• Intel Shady Brook uses similar approach to implement “digital power meter”
So we must be doing something right!
Slide 42

Live Power Management
• Proof-of-concept
• Goal
  • Schedule tasks under strict power budget
  • Minimal overhead
• Methodology
  • User-level meta scheduler
  • DVFS + process suspension to maintain power envelope
  • Two sample policies for process selection
Slide 43

Live Power Management
• Three categories of benchmarks
  • CPU bound
  • Memory bound
  • Mixed
• Power envelope set to 95%, 90%, 85%
• Results both with and without DVFS
Slide 44

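One step of a user-level meta scheduler like the one outlined on Slide 43 might look like the sketch below. Everything specific here is an assumption made for illustration: the `Task` fields, the crude model in which one DVFS step scales both power and throughput by a fixed factor, and the two selection functions, which are guesses rather than the policies actually evaluated.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    est_power_w: float     # power estimate attributed to this task's core
    instr_per_sec: float   # throughput, used to rank efficiency
    suspended: bool = False

def total_power(tasks):
    """Sum the estimated power of all tasks that are still running."""
    return sum(t.est_power_w for t in tasks if not t.suspended)

def pick_lowest_efficiency(tasks):
    """Hypothetical policy: choose the running task with the fewest instructions per watt."""
    running = [t for t in tasks if not t.suspended]
    return min(running, key=lambda t: t.instr_per_sec / t.est_power_w)

def pick_highest_power(tasks):
    """Hypothetical policy: choose the running task with the largest power estimate."""
    running = [t for t in tasks if not t.suspended]
    return max(running, key=lambda t: t.est_power_w)

def enforce_budget(tasks, budget_w, freq_steps_left,
                   dvfs_scale=0.9, pick=pick_lowest_efficiency):
    """Lower frequency first, then suspend tasks, until power fits the budget.

    Returns the number of DVFS steps still available afterwards.
    """
    while total_power(tasks) > budget_w and freq_steps_left > 0:
        for t in tasks:
            t.est_power_w *= dvfs_scale    # assumed effect of one DVFS step
            t.instr_per_sec *= dvfs_scale
        freq_steps_left -= 1
    while total_power(tasks) > budget_w and any(not t.suspended for t in tasks):
        pick(tasks).suspended = True
    return freq_steps_left
```

Passing `freq_steps_left=0` models the runs without DVFS, where process suspension is the only way to stay inside the envelope.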
Workloads with Different Intensities
• CPU bound
  • ep, gamess, namd, povray
  • calculix, ep, gamess, gromacs, h264ref, namd, perlbench, povray
• Moderate
  • art, lu, wupwise, xalancbmk
  • bwaves, cactusADM, fma3d, gcc, leslie3d, sp, ua, xalancbmk
• Memory bound
  • astar, mcf, milc, soplex
  • applu, astar, lbm, mcf, milc, omnetpp, soplex, swim
Slide 45

Meta-Scheduler Results: Intel Q6600
[Figure labels: Max Instructions/Watt; Per-core Fair; 90% Power Envelope, Moderate Computational Intensity; 95% Power Envelope, CPU-bound Workload]
Slide 46

Meta-Scheduler Results: AMD Phenom
[Figure labels: Max Instructions/Watt; Per-core Fair; 90% Power Envelope, Moderate Computational Intensity; 95% Power Envelope, CPU-bound Workload]
Slide 47

Performance Results: Intel Q6600
[Figure panels: CPU-Bound, Memory-Bound, Moderate]
Slide 48

Performance Results: AMD Phenom
[Figure panels: CPU-Bound, Memory-Bound, Moderate]
Slide 49

Performance Results: AMD Phenom
[Figure panels: CPU-Bound, Memory-Bound, Moderate]
Slide 50