Two short talks on current topics in Computer Science - Jan Prins



slide-1
SLIDE 1

Two short talks on current topics in Computer Science

Jan Prins

Department of Computer Science, University of North Carolina at Chapel Hill

1 Runtime Methods to Improve Energy Efficiency in Supercomputing Applications

2 Computational Methods in Transcriptome Analysis

slide-2
SLIDE 2

Runtime Methods to Improve Energy Efficiency in HPC Applications

Sridutt Bhalachandra (1), Robert Fowler (2), Stephen Olivier (3), Allan Porterfield (2), and Jan Prins (1)

(1) Department of Computer Science, University of North Carolina at Chapel Hill
(2) Renaissance Computing Institute, Chapel Hill
(3) Sandia National Laboratories

May 8, 2018

slide-3
SLIDE 3

computing performance: 120 years of exponential growth!

Runtime methods to improve energy efficiency 2

slide-6
SLIDE 6

What is driving performance growth?

Moore’s “law” - transistor density doubles every 18-24 months

Dennard scaling - total power remains the same and maximum operating frequency increases

but look what has happened over the past two decades


slide-7
SLIDE 7

The end of Dennard scaling and faster transistors

Consequences

− additional transistors require additional area
− power and heat increase commensurately
− parallel computing is the only route to scaling performance
  − multicore processors
  − multiprocessor nodes
  − interconnection networks


slide-8
SLIDE 8

Scalable parallel computing

Message Passing Interface (MPI) is used to coordinate computation and communication among all processor cores


slide-9
SLIDE 9

Current largest parallel computer

Source: http://www.nsccwx.cn/wxcyw

Sunway Taihulight

− 40,960 nodes
− 10,649,600 cores (256+4 per node) at 1.45GHz
− 20PB storage
− $273 million
− Top500 #1: 93.01 PFLOPS @ 15.4MW

1 PetaFLOPS (PFLOPS) = 10^15 Floating Point Operations Per Second
1 MegaWatt (MW) can roughly power 1000 homes


slide-13
SLIDE 13

Exascale (10^18 FLOPS) power requirements

System/Site      Performance (PFLOPS)   Power (MW)   Energy Efficiency (GFLOPS/W)
Exascale         1000                   20           50
Taihulight       93                     15           6
Tianhe 2         34                     18           2
Piz Daint        20                     2            9
TSUBAME 3.0      2                      0.14         14
kukai            0.46                   0.03         14
AIST AI Cloud    0.96                   0.08         13

5x - 10x improvement in energy efficiency required
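
The efficiency column and the required improvement can be recomputed from the performance and power columns. The sketch below does this for a few rows of the table. The function and constant names are illustrative, and because the table's columns are rounded, the recomputed values differ slightly from the printed ones (for example, Piz Daint comes out at 10 rather than 9 GFLOPS/W).

```python
# Rows taken from the table above: (system, performance in PFLOPS, power in MW).
SYSTEMS = [
    ("Taihulight", 93.0, 15.0),
    ("Tianhe 2", 34.0, 18.0),
    ("Piz Daint", 20.0, 2.0),
]

# Exascale target row from the same table.
EXASCALE_PFLOPS = 1000.0
EXASCALE_POWER_MW = 20.0


def gflops_per_watt(pflops, megawatts):
    """Energy efficiency in GFLOPS/W.

    1 PFLOPS = 1e6 GFLOPS and 1 MW = 1e6 W, so the scale factors cancel
    and the result equals PFLOPS divided by MW.
    """
    return (pflops * 1e6) / (megawatts * 1e6)


def improvement_needed(pflops, megawatts):
    """Factor by which a system's efficiency must grow to reach the target."""
    target = gflops_per_watt(EXASCALE_PFLOPS, EXASCALE_POWER_MW)
    return target / gflops_per_watt(pflops, megawatts)


if __name__ == "__main__":
    for name, pflops, mw in SYSTEMS:
        print(f"{name}: {gflops_per_watt(pflops, mw):.1f} GFLOPS/W, "
              f"needs {improvement_needed(pflops, mw):.1f}x")
```

With these rounded inputs, Taihulight needs roughly an 8x improvement and Piz Daint roughly 5x to reach 50 GFLOPS/W.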


slide-14
SLIDE 14

breakdown of power use in a large parallel computer

Source: Use Case: Quantifying the Energy Efficiency of a Computing System - Hsu et al.

slide-15
SLIDE 15

Opportunity to save energy

“Race to the end” in parallel regions

− each processor core operates on data in its node
− each processor maximizes speed while staying within thermal limit
− all processors spin-wait on lock at end of the region
− last processor to arrive releases the lock
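
To make the cost of this pattern concrete, the sketch below computes how long each core spin-waits in one parallel region and what fraction of total core time is spent waiting. The per-core compute times are made-up illustrative values, and the function names are hypothetical. This is a simple model of the behaviour described above, not a measurement.

```python
def spin_wait_times(compute_times):
    """Time each core spends spin-waiting at the end of a parallel region.

    The region ends when the slowest core arrives and releases the lock,
    so every other core waits for the difference.
    """
    region_end = max(compute_times)
    return [region_end - t for t in compute_times]


def wasted_fraction(compute_times):
    """Fraction of total core time (cores x region length) spent spin-waiting."""
    total_core_time = len(compute_times) * max(compute_times)
    return sum(spin_wait_times(compute_times)) / total_core_time


if __name__ == "__main__":
    # Hypothetical compute times in seconds for four cores in one region.
    times = [1.0, 0.8, 0.5, 0.7]
    print([round(w, 2) for w in spin_wait_times(times)])
    print(f"{wasted_fraction(times):.0%} of core time is spent waiting")
```

In this model the waiting cores still run at full speed, which is the energy that a core-specific slowdown tries to recover.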


slide-16
SLIDE 16

Computational workload imbalance

− could be inherent in application
− could be due to system heterogeneity
− exacerbated by the race to the end


slide-18
SLIDE 18

Saving energy by mitigating workload imbalance

Challenges

− each core is set to operate at a suitable frequency based on previous phase observation
− the frequency can change at every phase


slide-20
SLIDE 20

Fine grained power control

Dynamic Duty Cycle Modulation (DDCM) – T-states

− Actual clock rate is not changed, DVFS and TurboBoost still operational
− Modulation range constant across architecture - 100% to 6.25%
− IA32_CLOCK_MODULATION MSR

DVFS - core specific (Haswell) – P-states

− Can slow only non-critical cores
− Operational range machine-dependent even for the same architecture
− acpi_cpufreq kernel module
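
As a rough illustration of the 100% to 6.25% range, the sketch below encodes and decodes a duty-cycle level for IA32_CLOCK_MODULATION. It assumes the extended layout (enable flag in bit 4, a 4-bit level in bits 3:0, each step worth 6.25%) and MSR address 0x19A as described in the Intel Software Developer's Manual. The function names are hypothetical, and the layout should be checked against the manual for the target CPU. Writing the register on a real system needs privileged MSR access, which is deliberately not shown.

```python
IA32_CLOCK_MODULATION = 0x19A  # MSR address (assumed from the Intel SDM)
ENABLE_BIT = 1 << 4            # on-demand clock modulation enable flag
LEVEL_MASK = 0xF               # bits 3:0 hold the level with extended modulation
STEP_PERCENT = 6.25            # one level = 1/16 of full duty cycle


def encode_ddcm(level):
    """Return the register value for a duty-cycle level.

    level 16 means 100% (modulation disabled); levels 1..15 mean level/16.
    """
    if not 1 <= level <= 16:
        raise ValueError("level must be between 1 and 16")
    if level == 16:
        return 0  # enable flag cleared: core runs unmodulated
    return ENABLE_BIT | level


def decode_ddcm(value):
    """Return the duty cycle in percent encoded by a register value."""
    if not value & ENABLE_BIT:
        return 100.0
    level = value & LEVEL_MASK
    if level == 0:
        raise ValueError("level 0 with the enable flag set is reserved")
    return level * STEP_PERCENT


if __name__ == "__main__":
    for level in (16, 8, 1):
        value = encode_ddcm(level)
        print(f"level {level:2d} -> {value:#04x} -> {decode_ddcm(value)}%")
```

Core-specific DVFS, by contrast, is normally driven through the kernel's cpufreq interface rather than by writing this register.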


slide-21
SLIDE 21

Runtime control policy

Core-specific control − match a core’s effective duty cycle to its workload

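$$\text{Duty cycle} = \frac{\text{Time core in active state}}{\text{Total time (clock cycles)}}$$

∗ Change core active time using DDCM or clock cycles using DVFS

$$\text{Work} = \frac{\text{Compute time}}{\text{Compute time} + \text{Idle time}} \quad \text{(constant frequency)}$$

$$\text{Effective Work} = \frac{\text{Compute time}}{\text{Compute time} + \text{Idle time}} \times \frac{\text{Max frequency}}{\text{Current frequency}}$$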


slide-23
SLIDE 23

Runtime policy

Assumes similar behavior across successive phases
Policy calculation local to core, no communication

Combined policy (Power_DVFS < Power_DDCM)

− Use DVFS policy until lowest frequency reached
− Thereafter, use DDCM policy
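
One way to picture the combined policy is the sketch below. It is an illustration under stated assumptions, not the ACR implementation. It assumes the input is the fraction of a phase the core would be busy at maximum frequency, that the target speed is that fraction times the maximum frequency, and that DDCM offers 16 duty-cycle levels. The function name and the frequency list are hypothetical.

```python
import math


def combined_policy(effective_work, freqs_mhz, ddcm_levels=16):
    """Pick (frequency_mhz, duty_level) for a core's next phase.

    effective_work: assumed fraction (0, 1] of the phase the core would be
        busy at maximum frequency, observed in the previous phase.
    freqs_mhz: available DVFS frequencies for this core.
    Returns the chosen frequency and a DDCM level, where ddcm_levels means
    a full (100%) duty cycle.
    """
    if not 0.0 < effective_work <= 1.0:
        raise ValueError("effective_work must be in (0, 1]")
    freqs = sorted(freqs_mhz)
    f_min, f_max = freqs[0], freqs[-1]
    wanted = effective_work * f_max  # speed that would just fill the phase

    if wanted >= f_min:
        # DVFS step: slowest available frequency that still covers the work.
        chosen = next(f for f in freqs if f >= wanted)
        return chosen, ddcm_levels

    # Lowest frequency reached: fall back to DDCM on top of it,
    # rounding up so the core is never slower than the work requires.
    level = max(1, math.ceil(wanted / f_min * ddcm_levels))
    return f_min, level


if __name__ == "__main__":
    freqs = list(range(1200, 2301, 100))  # illustrative 1.2 to 2.3 GHz range
    for work in (1.0, 0.8, 0.25):
        print(work, combined_policy(work, freqs))
```

Rounding up in both steps is a conservative choice: it gives up some savings in exchange for not pushing a non-critical core onto the critical path.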


slide-24
SLIDE 24

Adaptive Core-specific Runtime (ACR)

ACR = Runtime Policy + User Options

1 Can monitor performance degradation at the end of every phase
  − Rudimentary method to detect phase change
2 Can impose a minimum phase length limit
  − Useful in skipping start-up phases
3 Support for user annotations
  − However, not used in current experimentation

∗ Runtime is transparent, eliminating the need for code changes to MPI applications


slide-25
SLIDE 25

Experimental Setup

Mini-apps & Applications

Unstructured grids – MiniFE, HPCCG, AMG
Structured grids – MiniGhost
Mesh Refinement – MiniAMR
Hydrodynamics – CloverLeaf

− mini-apps representative of key production HPC applications

Dislocation Dynamics – ParaDis

System

32 Haswell node partition (Sandia Shepard) = 1024 cores

− Dell M420: two 16-core Xeon E5-2698v3 at 2.3GHz, 128GB
− RHEL6.8, Slurm 2.3.3-1.18chaos and Linux 3.17.8 kernel
− MPICH 3.2

Results are the average of 12 runs taken at stable temperatures (to promote reproducibility)


slide-26
SLIDE 26

ParaDis results

slide-29
SLIDE 29

ParaDis critical path on 24 nodes (768 cores) - Default

[Figure: Phase Compute Time (s) and Average Frequency (MHz) on the critical path]

− Bimodal distribution of critical path times: < 1.0s and > 1.0s
− Successive phases are similar, with only occasional jumps
− Average critical path frequency (Default) = 2507.4MHz


slide-30
SLIDE 30

ParaDis critical path on 24 nodes (768 cores) - DVFS

[Figure: Phase Compute Time (s) and Average Frequency (MHz) on the critical path]

Average critical path frequency (DVFS) = 2467.3MHz


slide-31
SLIDE 31

ParaDis critical path on 24 nodes (768 cores) - DDCM

[Figure: Phase Compute Time (s) and Average Frequency (MHz) on the critical path]

− Very low frequency on non-critical cores for prolonged periods reduces variation, and increases available thermal headroom for critical cores
− Average critical path frequency (DDCM) = 2784.8MHz


slide-34
SLIDE 34

Mitigating workload imbalance

average results across all experiments

Policy     %Power reduced   %Energy saved   %Time increase   Temp decrease (C)
DDCM       19.3             15.1            5.3              3.2
DVFS       20.5             20.2            0.5              3.3
Combined   24.9             22.6            2.9              4.2
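
The three percentage columns are related: energy is average power times run time, so the energy saved should be close to what the power reduction and time increase imply. The sketch below checks this for each row. The function name is illustrative, and since the table reports averages across experiments, the agreement is expected to be approximate only.

```python
# Rows from the table above:
# policy -> (% power reduced, % energy saved, % time increase)
RESULTS = {
    "DDCM": (19.3, 15.1, 5.3),
    "DVFS": (20.5, 20.2, 0.5),
    "Combined": (24.9, 22.6, 2.9),
}


def estimated_energy_saved(power_reduced_pct, time_increase_pct):
    """Energy saving (%) implied by a power reduction and a slowdown.

    Energy = average power x time, so the new energy relative to the
    baseline is (1 - power reduction) x (1 + time increase).
    """
    power_ratio = 1.0 - power_reduced_pct / 100.0
    time_ratio = 1.0 + time_increase_pct / 100.0
    return (1.0 - power_ratio * time_ratio) * 100.0


if __name__ == "__main__":
    for policy, (power, energy, time) in RESULTS.items():
        estimate = estimated_energy_saved(power, time)
        print(f"{policy}: reported {energy:.1f}%, implied {estimate:.1f}%")
```

The implied savings land within about 0.1 to 0.2 percentage points of the reported values, which also shows why DVFS keeps almost all of its power reduction as energy savings: its time increase is only 0.5%.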

ACR demonstrates that dynamic control of power at runtime is possible

At Exascale, runtimes such as ACR will allow

− more work to be run at one time by using less power
− individual applications to run faster by allowing a higher thermal headroom on critical cores

Energy optimization can also be performance optimization


slide-35
SLIDE 35

Saving energy in memory-bound applications

Many HPC applications are memory-bound
Memory operations are seldom visible to OS/runtime

− Power wasted in CPU while waiting on memory

Approach

− sample table of request occupancy in memory subsystem
− use DVFS to slow non-critical cores

Published work

1 Improving Energy Efficiency in Memory-constrained Applications Using Core-specific Power Control (E2SC 2017)


slide-36
SLIDE 36

Acknowledgements

Ph.D. research of Sridutt Bhalachandra (Argonne National Labs Exascale Group)

Published work

1 An Adaptive Core-specific Runtime for Energy Efficiency (IPDPS 2017)
2 Using Dynamic Duty Cycle Modulation to improve energy efficiency in High Performance Computing (HPPAC 2015)
