Two short talks on current topics in Computer Science

Two short talks on current topics in Computer Science
Jan Prins
Department of Computer Science, University of North Carolina at Chapel Hill
1. Runtime Methods to Improve Energy Efficiency in Supercomputing Applications
2. Computational Methods in Transcriptome Analysis

Runtime Methods to Improve Energy Efficiency in HPC Applications
Sridutt Bhalachandra (1), Robert Fowler (2), Stephen Olivier (3), Allan Porterfield (2), and Jan Prins (1)
(1) Department of Computer Science, University of North Carolina at Chapel Hill
(2) Renaissance Computing Institute, Chapel Hill
(3) Sandia National Laboratories
May 8, 2018

computing performance: 120 years of exponential growth! Runtime methods to improve energy efficiency 2

What is driving performance growth?
- Moore's "law": transistor density doubles every 18-24 months
- Dennard scaling: as transistors shrink, total power for the same area remains the same and maximum operating frequency increases
- but look what has happened over the past two decades

The end of Dennard scaling and faster transistors
Consequences:
- additional transistors require additional area
- power and heat increase commensurately
- parallel computing is the only route to scaling performance
  - multicore processors
  - multiprocessor nodes
  - interconnection networks

Scalable parallel computing
Message Passing Interface (MPI) is used to coordinate computation and communication among all processor cores.

Current largest parallel computer: Sunway Taihulight
- 40,960 nodes
- 10,649,600 cores (256+4 per node) at 1.45GHz
- 20PB storage
- $273 million
- Top500 #1: 93.01 PFLOPS @ 15.4MW
Source: http://www.nsccwx.cn/wxcyw
1 PetaFLOPS (PFLOPS) = 10^15 Floating Point Operations Per Second
1 MegaWatt (MW) can roughly power 1000 homes

Exascale (10^18 FLOPS) power requirements

System/Site      Performance (PFLOPS)   Power (MW)   Energy Efficiency (GFLOPS/W)
Exascale         1000                   20           50
Taihulight       93                     15           6
Tianhe 2         34                     18           2
Piz Daint        20                     2            9
TSUBAME 3.0      2                      0.14         14
kukai            0.46                   0.03         14
AIST AI Cloud    0.96                   0.08         13

5x - 10x improvement in energy efficiency required
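The efficiency column is performance divided by power. A minimal Python sketch using the rounded values from the table (the function names are illustrative and not from the talk):

```python
def gflops_per_watt(pflops: float, megawatts: float) -> float:
    """Energy efficiency in GFLOPS/W from performance (PFLOPS) and power (MW)."""
    gflops = pflops * 1e6    # 1 PFLOPS = 10^6 GFLOPS
    watts = megawatts * 1e6  # 1 MW = 10^6 W
    return gflops / watts


def required_improvement(current_eff: float, target_eff: float = 50.0) -> float:
    """Factor by which efficiency must grow to reach the exascale target."""
    return target_eff / current_eff


exascale = gflops_per_watt(1000, 20)  # the 50 GFLOPS/W target
taihulight = gflops_per_watt(93, 15)  # about 6.2 GFLOPS/W

print(exascale)
print(round(required_improvement(taihulight), 1))  # roughly 8x
```

Because the table inputs are rounded, recomputed values can differ slightly from the listed ones (for example, 20 PFLOPS at 2 MW gives 10 rather than the listed 9).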

Breakdown of power use in a large parallel computer
Source: "Use Case: Quantifying the Energy Efficiency of a Computing System", Hsu et al.

Opportunity to save energy
"Race to the end" in parallel regions:
- each processor core operates on data in its node
- each processor maximizes speed while staying within thermal limit
- all processors spinwait on lock at end of the region
- last processor to arrive releases the lock

Computational workload imbalance
- could be inherent in application
- could be due to system heterogeneity
- exacerbated by the race to the end
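To make the cost of imbalance concrete, consider one parallel phase in which every core spin-waits for the slowest one. The sketch below uses made-up per-core compute times; all numbers and function names are hypothetical:

```python
def idle_times(compute_times):
    """Per-core spin-wait time when all cores wait for the slowest one."""
    critical_path = max(compute_times)
    return [critical_path - t for t in compute_times]


def idle_fraction(compute_times):
    """Fraction of total core-time in the phase that is spent waiting."""
    total = max(compute_times) * len(compute_times)
    return sum(idle_times(compute_times)) / total


# Hypothetical compute times (seconds) for four cores in one phase.
phase = [1.0, 0.5, 0.75, 0.25]
print(idle_times(phase))     # [0.0, 0.5, 0.25, 0.75]
print(idle_fraction(phase))  # 0.375
```

In this example more than a third of the core-time is spent spinning at full speed, which is the energy that slowing the non-critical cores aims to recover.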

Saving energy by mitigating workload imbalance
Challenges:
- each core is set to operate at a suitable frequency based on previous phase observation
- the frequency can change at every phase

Fine grained power control
Dynamic Duty Cycle Modulation (DDCM): T-states
- Actual clock rate is not changed, DVFS and TurboBoost still operational
- Modulation range constant across architecture: 100% to 6.25%
- IA32_CLOCK_MODULATION MSR
DVFS, core specific (Haswell): P-states
- Can slow only non-critical cores
- Operational range machine-dependent even for the same architecture
- acpi_cpufreq kernel module
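The DDCM range of 100% down to 6.25% suggests steps of 1/16. Assuming 16 evenly spaced levels (an assumption; the slide gives only the endpoints), a requested duty cycle can be rounded up to the nearest available level so that a core is never throttled more than requested. The function names are illustrative:

```python
import math

STEP = 1 / 16  # 6.25%; assumes 16 evenly spaced duty-cycle levels


def ddcm_level(requested: float) -> int:
    """Smallest level n in 1..16 whose duty cycle n/16 is >= requested."""
    n = math.ceil(requested / STEP)
    return min(16, max(1, n))


def duty_cycle(level: int) -> float:
    """Duty cycle as a fraction of full speed for a given level."""
    return level * STEP


print(duty_cycle(ddcm_level(0.40)))  # 0.4375
print(duty_cycle(ddcm_level(0.02)))  # 0.0625
```

Rounding up rather than to the nearest level is a conservative design choice: it trades a little energy for a lower risk of turning a non-critical core into the new critical path.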

Runtime control policy
Core-specific control
- match a core's effective duty cycle to its workload

$$\text{Duty cycle} = \frac{\text{Time core in active state}}{\text{Total time (clock cycles)}}$$

* Change core active time using DDCM, or clock cycles using DVFS

$$\text{Work} = \frac{\text{Compute time}}{\text{Compute time} + \text{Idle time}} \quad \text{(constant frequency)}$$

$$\text{Effective Work} = \frac{\text{Compute time}}{\text{Compute time} + \text{Idle time}} \times \frac{\text{Max frequency}}{\text{Current frequency}}$$
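The two work measures can be written down directly. This sketch follows the formulas as they read on the slide: work is the compute fraction of a phase, and effective work scales that fraction by the ratio of maximum to current frequency. The function names and sample timings are hypothetical:

```python
def work(compute_time: float, idle_time: float) -> float:
    """Fraction of a phase spent computing, at constant frequency."""
    return compute_time / (compute_time + idle_time)


def effective_work(compute_time, idle_time, max_freq, current_freq):
    """Work fraction scaled by the ratio of maximum to current frequency."""
    return work(compute_time, idle_time) * (max_freq / current_freq)


# Hypothetical phase: 0.75 s computing and 0.25 s waiting, on a core
# running at its maximum frequency of 2300 MHz.
print(work(0.75, 0.25))                        # 0.75
print(effective_work(0.75, 0.25, 2300, 2300))  # 0.75
```

When the core runs at its maximum frequency the two measures coincide, which is why only the frequency ratio separates them.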

Runtime policy
- Assumes similar behavior across successive phases
- Policy calculation local to core, no communication
Combined policy (Power_DVFS < Power_DDCM)
- Use DVFS policy until lowest frequency reached
- Thereafter, use DDCM policy
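One way to picture the combined policy is as a function from a target speed fraction to a frequency and a duty-cycle level. The sketch below is not the ACR implementation: the frequency list, the 16-level duty scale, and the function name are all assumptions made for illustration.

```python
import math


def combined_policy(target, freqs_mhz):
    """Pick (frequency in MHz, duty level out of 16) for a target speed fraction.

    DVFS is tried first; DDCM is used only once the lowest frequency
    cannot slow the core far enough.
    """
    target = min(target, 1.0)
    f_max, f_min = max(freqs_mhz), min(freqs_mhz)
    if target >= f_min / f_max:
        # DVFS alone suffices: slowest frequency that still meets the target.
        return min(f for f in freqs_mhz if f / f_max >= target), 16
    # Lowest frequency reached: cover the remaining slowdown with DDCM.
    residual = target / (f_min / f_max)
    level = max(1, math.ceil(residual * 16))
    return f_min, level


freqs = [1200, 1600, 2000, 2400]  # hypothetical P-state frequencies (MHz)
print(combined_policy(0.8, freqs))   # (2000, 16)
print(combined_policy(0.25, freqs))  # (1200, 8)
```

Trying DVFS first reflects the ordering stated on the slide, where DVFS draws less power than DDCM for a comparable slowdown.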

Adaptive Core-specific Runtime (ACR)
ACR = Runtime Policy + User Options
1. Can monitor performance degradation at the end of every phase
   - Rudimentary method to detect phase change
2. Can induce minimum phase length limit
   - Useful in skipping start-up phases
3. Support for user-annotations
   - However, not used in current experimentation
* Runtime is transparent, eliminating the need for code changes to MPI applications

Experimental Setup
Mini-apps & Applications
- Unstructured grids: MiniFE, HPCCG, AMG
- Structured grids: MiniGhost
- Mesh Refinement: MiniAMR
- Hydrodynamics: CloverLeaf
  - mini-apps representative of key production HPC applications
- Dislocation Dynamics: ParaDis
System
- 32 Haswell node partition (Sandia Shepard) = 1024 cores
  - Dell M420: two 16-core Xeon E5-2698v3 at 2.3GHz, 128GB
  - RHEL6.8, Slurm 2.3.3-1.18chaos and Linux 3.17.8 kernel
  - MPICH 3.2
Results are the average of 12 runs taken at stable temperatures (to promote reproducibility)

ParaDis results

ParaDis critical path on 24 nodes (768 cores) - Default
[Plot: compute time (s) and average frequency (MHz) per phase, phases 0 to 1200]
- Bimodal distribution of critical path times: < 1.0s and > 1.0s
- Successive phases are similar, with only occasional jumps
- Average critical path frequency (Default) = 2507.4MHz

ParaDis critical path on 24 nodes (768 cores) - DVFS
[Plot: compute time (s) and average frequency (MHz) per phase, phases 0 to 1200]
- Average critical path frequency (DVFS) = 2467.3MHz

ParaDis critical path on 24 nodes (768 cores) - DDCM
[Plot: compute time (s) and average frequency (MHz) per phase, phases 0 to 1200]
- Very low frequency on non-critical cores for prolonged periods reduces variation and increases available thermal headroom for critical cores
- Average critical path frequency (DDCM) = 2784.8MHz

Mitigating workload imbalance: average results across all experiments

Policy     %Power reduced   %Energy saved   %Time increase   Temp decrease (C)
DDCM       19.3             15.1            5.3              3.2
DVFS       20.5             20.2            0.5              3.3
Combined   24.9             22.6            2.9              4.2

ACR demonstrates that dynamic control of power at runtime is possible
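The three percentage columns are linked by energy = power x time. As a rough consistency check, the sketch below recomputes the energy saving from the reported power reduction and time increase. The function name is illustrative, and because the table holds averages across experiments, exact agreement is not expected:

```python
def energy_saved_pct(power_reduced_pct: float, time_increase_pct: float) -> float:
    """Energy saving implied by a power reduction and a slowdown (E = P * t)."""
    power_ratio = 1 - power_reduced_pct / 100
    time_ratio = 1 + time_increase_pct / 100
    return (1 - power_ratio * time_ratio) * 100


for policy, power, time_inc in [("DDCM", 19.3, 5.3),
                                ("DVFS", 20.5, 0.5),
                                ("Combined", 24.9, 2.9)]:
    print(policy, round(energy_saved_pct(power, time_inc), 1))
```

The recomputed savings come out close to the reported 15.1, 20.2 and 22.6 percent, which shows why DVFS, with almost no slowdown, converts nearly all of its power reduction into energy savings while DDCM does not.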
