Energy Efficiency Metrics and Cray XE6 Application Performance - - PowerPoint PPT Presentation




slide-1
SLIDE 1

Energy Efficiency Metrics and Cray XE6 Application Performance

September 8, 2011 Slide 1 Cray Proprietary

Wilfried Oed

Principal Engineer

slide-2
SLIDE 2
  • What made this machine so unique ?
  • Some answers
  • Novel vector architecture
  • Packaging
  • Cooling
  • Fastest scalar machine !!!
  • High productivity for users
  • Autovectorizing compiler
  • Performance analysis tool
  • Simple OS

and no one cared about the power consumption


slide-3
SLIDE 3
  • An improvement by a factor of 150 thousand in 30 years – and still no end in sight !
  • Cray XE6 is ~ 600 MF / W
  • Cray XK6 is ~ 1200 MF / W
  • So where’s the problem ?
  • Price performance has improved even more dramatically
  • Computing has become ubiquitous
  • The combined systems of the current Green500 require 340 MW
  • That’s up 50 MW from previous list
  • Largest system @ 10 MW
  • Supercomputing and HPC are vital tools for science
  • An interesting article – especially the focus on software


Power Consumption for Cray Systems

                             1978        1988          1998        2008
                             Cray-1      Cray Y-MP 8   Cray T3E    Cray XT5
number processors / cores    1           8             1,024       150,152
power consumption (kW)       140         200           220         6,500
Rmax (PF)                    1.50E-07    2.10E-06      8.92E-04    1.06E+00
Flop / Watt                  ~ 0.001 MF  ~ 0.01 MF     ~ 4 MF      ~ 150 MF
Efficiency improvement       1           10            ~ 4,000     ~ 150,000

Andrew Jones, Vice-President of HPC Services and Consulting, Numerical Algorithms Group
http://www.hpcwire.com/hpcwire/2011-08-29/exascale:_power_is_not_the_problem_.html
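The Flop / Watt and efficiency-improvement rows of the table follow directly from the Rmax and power rows. A minimal Python sketch of that arithmetic, using only the table's figures (the function and variable names are illustrative, not from the slides):

```python
# Rmax (in PF) and power (in kW) per system, copied from the table.
SYSTEMS = {
    "Cray-1": (1.50e-07, 140),
    "Cray Y-MP 8": (2.10e-06, 200),
    "Cray T3E": (8.92e-04, 220),
    "Cray XT5": (1.06e+00, 6500),
}


def mflops_per_watt(rmax_pf, power_kw):
    """Convert Rmax in PF and power in kW to MFLOPS per watt."""
    flops = rmax_pf * 1e15   # PF -> flop/s
    watts = power_kw * 1e3   # kW -> W
    return flops / watts / 1e6


efficiency = {name: mflops_per_watt(*values) for name, values in SYSTEMS.items()}
baseline = efficiency["Cray-1"]

for name, value in efficiency.items():
    # Improvement is expressed relative to the Cray-1.
    print(f"{name:12s} {value:10.4f} MF/W   improvement x{value / baseline:,.0f}")
```

The results land close to the rounded values on the slide: roughly 163 MF/W for the Cray XT5, and an improvement of about 150 thousand over the Cray-1.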

slide-4
SLIDE 4


Gemini Interconnect High Radix YARC Router with adaptive Routing

XE6 Node Characteristics

Number of Cores                   24 (Magny Cours)
Peak Performance MC-12 (2.2)      211 Gflop/s
Memory Size                       32 GB or 64 GB per node
Memory Bandwidth (Peak)           83.5 GB/sec


XK6 Compute Node Characteristics

AMD Series 6200 (Interlagos)
NVIDIA Tesla X2090
Host Memory                       16 or 32 GB 1600 MHz DDR3
NVIDIA Tesla X2090 Memory         6 GB GDDR5 capacity
Gemini High Speed Interconnect
Upgradeable to future GPUs


slide-5
SLIDE 5

[Figure: Average # Processors in Top 10, per list from Jun 1993 to Jun 2011; vertical axis from 20,000 to 200,000]

Supercomputing is about managing scalability

exponential increase with advent of multi-core chips

currently selling systems with > 100 000 cores

One million cores expected within the decade


  • A scalable architecture requires BOTH hardware and software

Jitter elimination => OS & Interconnect
Latency hiding => Interconnect
Programming environment
Hybrid programming => MPI / OpenMP

slide-6
SLIDE 6

Science Area  Code       Contact        Cores    Total Perf                 Notes                 Scaling
Materials     DCA++      Schulthess     150,144  1.3 PF*                    Gordon Bell Winner    Weak
Materials     LSMS/WL    ORNL           149,580  1.05 PF                    64 bit                Weak
Seismology    SPECFEM3D  UCSD           149,784  165 TF                     Gordon Bell Finalist  Weak
Weather       WRF        Michalakes     150,000  50 TF                      Size of Data          Strong
Climate       POP        Jones          18,000   20 sim yrs / CPU day       Size of Data          Strong
Combustion    S3D        Chen           144,000  83 TF                                            Weak
Fusion        GTC        UC Irvine      102,000  20 billion Particles / sec Code Limit            Weak
Materials     LS3DF      Lin-Wang Wang  147,456  442 TF                     Gordon Bell Winner    Weak

Eight Application World Records Set in First Week (Nov. 2008)!


slide-7
SLIDE 7
  • Power Usage Effectiveness (PUE)
      - Reflects how well a system is being cooled
          A poorly designed system can still have a wonderful PUE if cooling is efficient
      - Need to define the components that account for “power usage”
  • MFLOPS per Watt
      - Reflected in the Green500
      - Emphasizes pure floating-point (HPL)
  • Time to Solution (sustained performance) per Watt
      - Supercomputers are there to solve big problems (aka Grand Challenges)
          An extremely high degree of parallelism is required
          Besides floating-point, real applications have to deal with communication, organization, load balance
      - Power consumption [kWh] = Nproc * Pproc * Tmax
          Tmax: time allowed to finish the problem (in hours)
          Nproc: number of processors (cores) utilized to finish within Tmax
          Pproc: power utilized per processor (core) (in kW)
      - This metric is problem oriented and can be applied across various architectures
          Can also be based on power per node for comparing vastly different architectures (e.g. Cray XK6 using hybrid CPU / GPU nodes)
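The relation Nproc * Pproc * Tmax can be evaluated with a few lines of Python. This is a minimal sketch; the function name and the sample figures (24,576 cores, 12 W per core, 300 seconds) are hypothetical and chosen only to show the unit handling, not measured values:

```python
def energy_to_solution_kwh(n_proc, p_proc_watts, t_max_seconds):
    """Energy in kWh for n_proc units drawing p_proc_watts each over t_max_seconds.

    A "unit" may be a core or a node, as long as the power figure
    refers to the same unit.
    """
    total_watts = n_proc * p_proc_watts
    hours = t_max_seconds / 3600.0
    return total_watts * hours / 1000.0  # Wh -> kWh


# Hypothetical run: 24,576 cores at 12 W per core, finishing in 300 seconds.
example_kwh = energy_to_solution_kwh(24_576, 12.0, 300.0)
print(f"{example_kwh:.2f} kWh")
```

Passing a node count and a per-node power instead of per-core values gives the per-node variant mentioned for hybrid CPU / GPU systems.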


slide-8
SLIDE 8

[Figure: power consumption (kWh) and total execution time (seconds) plotted against the number of processors for two processors A and B. Annotations: TA, NA, TB, NB, Tmax, PA (kWh), TA * PA, PB (kWh), TB * PB]

  • The lower power processor has the same power on a per core basis
  • Despite being a lower power processor and having similar scalability, the higher core count required makes it less efficient regardless of the desired solution time


Note: this is an arbitrary example for demonstrating certain effects; it is based on neither actual systems nor actual applications

slide-9
SLIDE 9
  • The lower power processor always requires less power on a per core basis
  • At low core counts (higher time to solution) the lower powered processor is more energy efficient, as only a few additional cores are required

[Figure: power consumption (kWh) and total execution time (seconds) plotted against the number of processors for two processors A and B. Annotations: TA, NA, TB, NB, Tmax, PA (kWh), TA * PA, PB (kWh), TB * PB]

Note: this is an arbitrary example for demonstrating certain effects; it is based on neither actual systems nor actual applications

slide-10
SLIDE 10

[Figure: power consumption (kWh) and total execution time (seconds) plotted against the number of processors for two processors A and B. Annotations: TA, NA, TB, NB, Tmax, PA (kWh), TA * PA, PB (kWh), TB * PB]

  • The lower power processor always requires less power on a per core basis
  • At higher core counts (lower time to solution) the lower powered processor is less energy efficient, as far more cores are required


Note: this is an arbitrary example for demonstrating certain effects; it is based on neither actual systems nor actual applications
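The crossover described on the last two slides can be reproduced with a small model. The following Python sketch is, like the slides' own example, arbitrary. It assumes a simple scaling model (run time = serial part + parallel work / cores), and both processors and all of their numbers are hypothetical. It picks the smallest core count that meets Tmax and then applies Nproc * Pproc * Tmax:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Processor:
    name: str
    watts_per_core: int          # hypothetical power per core
    serial_seconds: int          # part of the run that does not scale
    parallel_core_seconds: int   # work that divides evenly across cores


def cores_needed(proc, t_max):
    """Smallest core count that finishes within t_max seconds."""
    budget = t_max - proc.serial_seconds
    if budget <= 0:
        raise ValueError(f"{proc.name} cannot finish within {t_max} s")
    return -(-proc.parallel_core_seconds // budget)  # integer ceiling division


def energy_kwh(proc, t_max):
    """Nproc * Pproc * Tmax, converted from watt-seconds to kWh."""
    return cores_needed(proc, t_max) * proc.watts_per_core * t_max / 3.6e6


# Hypothetical processors: B draws half the power per core but is 1.5x slower.
A = Processor("A (higher power)", 10, 100, 1_000_000)
B = Processor("B (lower power)", 5, 150, 1_500_000)

for t_max in (2000, 200):
    for proc in (A, B):
        print(f"Tmax={t_max:5d} s  {proc.name:17s} "
              f"cores={cores_needed(proc, t_max):6d}  "
              f"energy={energy_kwh(proc, t_max):.2f} kWh")
```

With a relaxed Tmax of 2000 seconds, B needs only moderately more cores and uses less energy. With a tight Tmax of 200 seconds, B needs three times as many cores as A and ends up less efficient.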

slide-11
SLIDE 11

Science Area                          Code       Nodes  Cores
Combustion                            Senga      844    20,256
Materials and MD                      CASTEP     1,024  24,576
fluid flow / lattice-boltzmann method HemeLB     1,024  24,576
Materials                             CRYSTAL    1,024  24,576
Quantum Monte Carlo                   CASINO     664    15,936
MD                                    DL_POLY_4  683    16,392
Chemistry                             Sparkle    683    16,392


  • A set of scientific applications running on a regular basis at high core counts at EPCC

slide-12
SLIDE 12
  • Despite huge progress let's not rest
  • The biggest innovations will have to come from technology
  • Remember: the goal for EXAflop is 20 MW or 50 GF / W
  • Which may be questionable => keynote: Jens Wiebe
  • Reclaim energy => driving towards PUE < 1
  • Heating your office is not the answer
  • Throttling CPU performance if higher Tmax can be tolerated
  • Current processors have the ability to operate at different clock speeds already
  • But beware, your overall power consumption may end up being higher
  • Applying the metrics
  • Required is the ability to measure performance on an application level
  • James H. Laros III, Kevin T. Pedretti, Suzanne M. Kelly, John P. Vandyke, Kurt B. Ferreira, Courtenay T. Vaughan, Mark Swan. Topics on Measuring Real Power Usage on High Performance Computing Platforms, IEEE International

  • Energy aware scheduling
  • TUNE your application (a truck has good mileage only if fully loaded)
  • Scalability is a decisive factor on time to solution and consequently on power efficiency
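The warning that throttling may end up costing more energy can be illustrated with a toy model. This Python sketch rests on assumptions that are not from the slides: node power is a fixed static part plus a dynamic part that scales with the cube of the clock frequency, run time scales inversely with frequency, and all wattages, times, and the 2.2 GHz nominal clock are hypothetical:

```python
def node_power_watts(freq_ghz, static_watts=50.0,
                     dynamic_watts_at_nominal=100.0, nominal_ghz=2.2):
    """Assumed power model: static part plus dynamic part scaling with f^3."""
    return static_watts + dynamic_watts_at_nominal * (freq_ghz / nominal_ghz) ** 3


def run_time_seconds(freq_ghz, seconds_at_nominal=1000.0, nominal_ghz=2.2):
    """Assumed run-time model: purely compute bound, so time scales as 1/f."""
    return seconds_at_nominal * nominal_ghz / freq_ghz


def energy_joules(freq_ghz):
    """Energy to solution at a given clock: power times run time."""
    return node_power_watts(freq_ghz) * run_time_seconds(freq_ghz)


for freq in (2.2, 1.4, 0.8):
    print(f"{freq:.1f} GHz: {run_time_seconds(freq):7.1f} s, "
          f"{node_power_watts(freq):6.1f} W, "
          f"{energy_joules(freq) / 1000:6.1f} kJ")
```

Under these assumptions, moderate throttling (1.4 GHz) saves energy, while aggressive throttling (0.8 GHz) uses more energy than running at full speed, because the static power is paid over a much longer run. Real systems need application-level measurement to locate this point.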


slide-13
SLIDE 13