Analysis of Applications on a High Performance–Low Energy Computer


SLIDE 1

Analysis of Applications on a High Performance–Low Energy Computer

Collaborative Research Center 912: HAEC − Highly Adaptive Energy-Efficient Computing

Florina M. Ciorba, Thomas Ilsche, Elke Franz, Stefan Pfennig, Christian Scheunert, Ulf Markwardt, Joseph Schuchart, Daniel Hackenberg, Robert Schöne, Andreas Knüpfer, Wolfgang E. Nagel, Eduard A. Jorswieck, and Matthias S. Müller

7th Workshop on UnConventional High Performance Computing, August 26, 2014, Porto

SLIDE 2

Talk Outline

¨ Motivation
¨ Modeling Applications
¨ Modeling a High Performance–Low Energy Computer
¨ Mapping Applications to Systems
¨ Modeling Communication
¨ Simulation Results
¨ Summary and Future Work

SLIDE 3

The Challenge

Given a parallel application and a high performance-low energy computer, how can the computer execute the application as fast as possible while consuming the least amount of energy?

SLIDE 4

Our Approach

¨ Simulation and analysis workflow

[Figure: simulation and analysis workflow around haec_sim. Recoverable labels: parallel application (source code); application configuration; instrumented execution (test systems, production systems); recorded app. trace and mapping (existing system); architecture abstraction models (topology, performance/energy of computation and communication); software abstraction models (mapping, runtime environment, energy-aware software); process models; HAEC Box parameters (latency, bandwidth, errors); energy/utility function; simulation input; haec_sim simulation; simulation output: simulated app. trace and mapping (HAEC Box); trace visualization and analysis; analysis and evaluation of input; analysis and evaluation of simulation. Feedback paths: desired tracing features (tracing granularity, performance counters, etc.), desired energy measurements (accuracy, sampling rate, measurement scope, etc.), desired simulation goals.]

SLIDE 5

Our Simulation Framework & State of the Art

State of the art

¨ TDS or use traces in some fashion
¨ TDS + Execution-Based Simulation (EBS, replay) (xSim, BigSim, MPI-NetSim, OMNEST, PSINS, SILAS, MPI-SIM)
  ¤ Offer scalability
  ¤ Avoid the need to model complex interconnection networks
¨ Sequential TDS (DIMEMAS, HeSSE, LogGOPSim, TaskSim, Tsim)
¨ Parallel TDS (xSim, BigSim, OMNEST, PSINS, SILAS, SIMCaN)
¨ Non-hybrid communication network (xSim, BigSim, DIMEMAS, LogGOPSim, SILAS, TaskSim, Tsim)
¨ Hybrid communication network (HeSSE, OMNEST)
¨ No focus on energy measurements
¨ Focus on I/O architectures: SIMCaN
¨ Application OR system performance modeling OR network modeling

Our framework

¨ Trace-driven simulation (TDS)
¨ No execution-based simulation (replay)
  ¤ Offers increased accuracy
  ¤ Increases modeling complexity for the hybrid interconnection networks
¨ Parallel TDS
¨ Hybrid (& dynamic) communication network
¨ Trace format contains energy measurements (performance metrics)
¨ Application AND system performance AND energy consumption modeling

SLIDE 6

Modeling Applications

¨ Performance, scalability, and energy
¨ NPB lu.C.81 on 6 Taurus nodes with node-level energy counters (1 Sa/s)

[Figure: trace of the lu.C.81 run with energy counters; annotated time span of about 30 s]

SLIDE 7

Modeling Applications

¨ lu.C.81 on Taurus
  ¤ Accumulated exclusive time: 69.9% communication, 30.1% computation
  ¤ Very high number of point-to-point (unicast) messages (11,639,408)

[Figures: communication matrix; process graph]

SLIDE 8

Modeling a High Performance – Low Energy Computer

[Figure: HAEC Box. Circles: compute nodes; blue lines: optical links; green lines: wireless links]

Wireless Interconnections

  • On-chip/on-package antenna fields
  • 8x8 or 16x16 Butler matrices
  • Analog/digital beam steering and interference suppression
  • 200 GHz channel / bandwidth / operating range
  • 100 Gbit/s @ 200 GHz / Z direction
  • 10 μs latency
  • 1D mesh topology (at the moment)

Optical Interconnections

  • Adaptive analog/digital circuits for E/O transceiver
  • Embedded polymer waveguides
  • Packaging technologies (e.g., 3D stacking of Si/III-V hybrids)
  • Optical switch (MOEMS) for reconfigurable networks
  • 250 Gbit/s via 10 optical channels / XY direction
  • 1 μs latency
  • 2D mesh topology

SLIDE 9

Mapping Applications onto HAEC Box

Static mapping of lu.C.81 onto the 3×3×3 HAEC Box

Mapping     IePLC        IaNLC       IeNLC        AVG IeNLC   MIN IeNLC   MAX IeNLC
xyz         11,639,408   0           11,639,408   228,223     161,658     242,490
block xyz   11,639,408   4,364,778   7,274,630    173,205     80,829      242,488
random      11,639,408   646,633     10,992,775   99,934      80,829      242,488

IePLC – inter-process logical communication
IaNLC – intra-node logical (local) communication
IeNLC – inter-node logical communication

[Figures: communication under the xyz, block xyz, and random mappings]
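
The effect of a static mapping on intra-node traffic can be illustrated with a minimal sketch. It rests on assumptions that the slide does not spell out: lu.C.81 is taken to arrange its 81 processes in a 9×9 grid with equally heavy nearest-neighbor exchanges, "xyz" is taken to place consecutive ranks on consecutive nodes (round-robin over the 27 nodes), and "block xyz" is taken to place three consecutive ranks on each node. All function names are hypothetical.

```python
from itertools import product

GRID = 9                          # assumed 9 x 9 process grid for lu.C.81
NODES = 27                        # 3 x 3 x 3 HAEC Box
PER_NODE = GRID * GRID // NODES   # 3 processes per node


def neighbor_pairs(grid=GRID):
    """Return all horizontally and vertically adjacent rank pairs."""
    pairs = []
    for row, col in product(range(grid), repeat=2):
        rank = row * grid + col
        if col + 1 < grid:
            pairs.append((rank, rank + 1))
        if row + 1 < grid:
            pairs.append((rank, rank + grid))
    return pairs


def xyz_mapping(rank):
    """Assumed 'xyz' mapping: round-robin over the nodes."""
    return rank % NODES


def block_mapping(rank):
    """Assumed 'block xyz' mapping: consecutive ranks share a node."""
    return rank // PER_NODE


def intra_node_fraction(mapping):
    """Fraction of neighbor pairs whose two ranks land on the same node."""
    pairs = neighbor_pairs()
    local = sum(1 for a, b in pairs if mapping(a) == mapping(b))
    return local / len(pairs)


if __name__ == "__main__":
    print("xyz  :", intra_node_fraction(xyz_mapping))
    print("block:", intra_node_fraction(block_mapping))
```

Under these assumptions the block mapping keeps 54 of 144 neighbor pairs inside a node (37.5%), the same ratio as 4,364,778 / 11,639,408 in the table, while the round-robin mapping keeps none.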

SLIDE 10

Modeling Communication for Parallel Applications running on the HAEC Box

¤ Message passing: point-to-point
¤ Links: homogeneous
¤ Topology: 3D mesh
¤ Path selection: single path, XYZ
¤ Routing: dimension order routing
¤ Network coding: practical network coding
¤ Assumptions: error-free transmission, with acknowledgements

[Figure: the application communication model (e.g., MPI) comprises blocking and non-blocking communication, each point-to-point, collective, or remote memory access. The HAEC communication model comprises links, topology, path selection, optical communication, network coding, performance, and energy.]
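
Dimension order routing with XYZ path selection on a 3D mesh can be sketched as follows. This is an illustration only: the coordinate convention (a node is a tuple `(x, y, z)` with each coordinate in 0..2) and the function name `xyz_route` are assumptions, not part of haec_sim.

```python
def xyz_route(src, dst):
    """Return the node sequence from src to dst, resolving X, then Y, then Z.

    Each dimension is fully corrected before the next one is touched,
    which yields a single deterministic path per (src, dst) pair.
    """
    path = [tuple(src)]
    current = list(src)
    for dim in range(3):
        step = 1 if dst[dim] > current[dim] else -1
        while current[dim] != dst[dim]:
            current[dim] += step
            path.append(tuple(current))
    return path


if __name__ == "__main__":
    route = xyz_route((0, 0, 0), (2, 2, 2))
    print(route)
    print("hops:", len(route) - 1)
```

On a 3×3×3 mesh the longest such route runs between opposite corners and takes 6 hops; a route from a node to itself takes 0 hops (intra-node communication).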

SLIDE 11

Multicast: Routing vs Network Coding

Multicast: S wants to transmit both messages m1 and m2 to E and F
Topology: butterfly

Routing (RT): two timeslots for transmitting m1 and m2 over C-D to both E and F

Network coding (NC): one timeslot for transmitting m1 and m2 over C-D to E and F
→ Reduces delay and energy costs, increases throughput

[Figure: three butterfly networks with nodes S, A, B, C, D, E, F. With routing, link C-D carries m1 in one timeslot and m2 in another; with network coding, C-D carries the combined message m12 once.]
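
The butterfly example can be made concrete with a small sketch. It assumes that the combined message m12 is the bitwise XOR of m1 and m2, which is the textbook choice for this topology; the slide itself only names m12. The helper name `xor_bytes` is hypothetical.

```python
def xor_bytes(a, b):
    """Combine two equally long messages byte by byte with XOR."""
    if len(a) != len(b):
        raise ValueError("messages must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


if __name__ == "__main__":
    m1, m2 = b"haec", b"box!"
    m12 = xor_bytes(m1, m2)          # sent once over C-D

    # E hears m1 directly (via A) and m12 (via D), so it can recover m2.
    recovered_at_e = xor_bytes(m1, m12)
    # F hears m2 directly (via B) and m12 (via D), so it can recover m1.
    recovered_at_f = xor_bytes(m2, m12)
    print(recovered_at_e, recovered_at_f)
```

Because XOR is its own inverse, each sink needs only the one message it already received plus m12, which is why a single timeslot on C-D suffices.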

SLIDE 12

Unicast: Routing vs Network Coding

Unicast: S wants to transmit a message (as data packets) to C
Topology: linear array
Unreliable links: failures or attacks

Routing (RT): a data packet lost over A-B has to be resent

Network coding (NC): further linearly independent combinations are sufficient

[Figure: linear array S, A, B, C. With routing, the data packets p1, p2, p3, ... are forwarded hop by hop and a lost p2 is retransmitted. With network coding, the nodes forward linear combinations such as p1 + p2, p1 + 4 p2, and p3 + 2 p4.]
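
Why any two linearly independent combinations are enough can be shown with a minimal decoding sketch for two packets. The arithmetic is done modulo the prime 257 purely for illustration; the field used by practical network coding in haec_sim is not stated on the slide, and `decode_two` is a hypothetical helper.

```python
P = 257  # assumed prime modulus, chosen only to keep the example small


def decode_two(coeffs1, value1, coeffs2, value2, p=P):
    """Recover (p1, p2) from two coded symbols.

    coeffs1 = (a, b) means value1 = a*p1 + b*p2 (mod p); likewise for
    coeffs2 = (c, d). Solves the 2x2 system with Cramer's rule.
    """
    (a, b), (c, d) = coeffs1, coeffs2
    det = (a * d - b * c) % p
    if det == 0:
        raise ValueError("combinations are linearly dependent")
    inv = pow(det, -1, p)  # modular inverse (Python 3.8+)
    p1 = ((d * value1 - b * value2) * inv) % p
    p2 = ((a * value2 - c * value1) * inv) % p
    return p1, p2


if __name__ == "__main__":
    p1, p2 = 42, 200
    y1 = (p1 + p2) % P          # combination p1 + p2
    y2 = (p1 + 4 * p2) % P      # combination p1 + 4 p2
    print(decode_two((1, 1), y1, (1, 4), y2))
```

If one coded packet is lost, the receiver does not need that specific packet again; any further independent combination completes the system.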

SLIDE 13

Modeling Communication Delays

[Figure: delay model for a packet travelling between nodes. Each node is drawn as a stack of application, memory, processor, and network; the channel between nodes performs encoding, transmission, and decoding. Delays annotated along the data-packet path: d_mpi, d_s|d_i|d_a, d_out, l + s_p/b, d_in, d_r|d_i|d_a, d_mpi, with d_out, l + s_p/b, d_in grouped as d_{h,p}. The acknowledgment path is annotated d_out, l + s_a/b, d_in, grouped as d_{h,a}.]

d_s: process a data packet of size s_p by the sender
d_r: process a data packet of size s_p by the receiver
d_i: process a data packet by an intermediate node
d_a: process an acknowledgment of size s_a
d_{h,p}: send a data packet over one hop
d_{h,a}: send an acknowledgment over one hop
d_out: write out to channel
d_in: read in from channel
d_mpi: write out to/read in from network buffer
l: latency for channel coding

SLIDE 14

Modeling Transfer Times

¨ Transfer time tt(x) for sending x > 0 packets over h ∈ [0,6] hops without errors or acknowledgments

Assumption: d_{h,p} ≥ d_s and d_{h,p} ≥ d_r

$$tt(x) = 2 \cdot d_{mpi} + d_s + (h + x - 1) \cdot d_{h,p} + (h - 1) \cdot d_i + d_r \quad \forall\, h > 0 \qquad (1)$$

$$tt(x) = 2 \cdot d_{mpi} \quad \text{if } h = 0 \text{ (intra-node communication)}$$

¨ Complete transfer time T(np) for sending np packets over h ∈ [0,6] hops without errors, with acknowledgments (only the final ACK/generation needs to be considered)

$$T(n_p) = tt(s_w) \cdot n_w + tt(n_r) + h \cdot (n_w + \lceil n_r / s_w \rceil) \cdot (d_{h,a} + d_a) \qquad (2)$$
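
Equations (1) and (2) translate directly into code. The sketch below is not taken from haec_sim. It assumes that sw is the number of packets per generation (window), that nw is the number of full generations, and that nr is the number of remaining packets, so that np = nw · sw + nr; the slide does not define these symbols. It also treats tt(nr) as zero when nr = 0, since tt is only defined for x > 0. The delay values in the usage example are arbitrary placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delays:
    """Per-operation delays, all in the same time unit."""
    d_mpi: float   # write to / read from network buffer
    d_s: float     # sender processing of a data packet
    d_r: float     # receiver processing of a data packet
    d_i: float     # intermediate-node processing
    d_hp: float    # data packet over one hop
    d_ha: float    # acknowledgment over one hop
    d_a: float     # processing of an acknowledgment


def transfer_time(x, h, d):
    """Equation (1): time to send x > 0 packets over h hops."""
    if x <= 0:
        raise ValueError("x must be positive")
    if h == 0:
        return 2 * d.d_mpi  # intra-node communication
    return (2 * d.d_mpi + d.d_s + (h + x - 1) * d.d_hp
            + (h - 1) * d.d_i + d.d_r)


def complete_transfer_time(n_p, s_w, h, d):
    """Equation (2), with n_w and n_r derived from n_p and s_w (assumed)."""
    n_w, n_r = divmod(n_p, s_w)
    total = transfer_time(s_w, h, d) * n_w
    if n_r:
        total += transfer_time(n_r, h, d)
    acks = n_w + (1 if n_r else 0)   # ceil(n_r / s_w) is 0 or 1
    return total + h * acks * (d.d_ha + d.d_a)


if __name__ == "__main__":
    d = Delays(d_mpi=1, d_s=2, d_r=3, d_i=4, d_hp=10, d_ha=5, d_a=1)
    print(transfer_time(5, 3, d))
    print(complete_transfer_time(10, 4, 3, d))
```

The (h + x − 1) factor reflects pipelining: the first packet pays for all h hops, and each further packet adds only one per-hop delay, which is where the assumption that d_{h,p} dominates d_s and d_r is needed.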

SLIDE 15

Modeling Communication for Parallel Applications running on the HAEC Box

XYZ path selection for lu.C.81 communication over the physical links of the 3×3×3 HAEC Box

Mapping     IePLC        IePPC        IaNPC       IeNPC        AVG IeNPC   MIN IeNPC   MAX IeNPC
xyz         11,639,408   16,004,186   0           16,004,186   333,420     121,242     484,976
block xyz   11,639,408   14,549,260   4,364,778   10,184,482   212,176     80,829      484,976
random      11,639,408   31,280,908   646,633     30,364,275   567,301     161,657     1,050,780

IePLC – inter-process logical communication
IePPC – inter-process physical communication
IaNPC – intra-node physical (local) communication
IeNPC – inter-node physical communication

[Figures: physical link usage under the xyz mapping, block xyz mapping, and random mapping]

SLIDE 16

Modeling the Performance of Communication in Parallel Applications on the HAEC Box

¨ Simulation parameters (haec_sim)
  ¤ latency 1 μs
  ¤ bandwidth 250 Gbit/s
  ¤ packet size 288 bytes
  ¤ delay per packet per hop 1,209.216 ns
  ¤ delay per ACK per hop 1,200.192 ns
  ¤ sender delay 200 ns or 203.125 ns
  ¤ receiver delay 200 ns or 215.625 ns

[Figure: trace of lu.C.81 on Taurus (69.9% communication, 30.1% computation) compared with the simulated traces of lu.C.81 on the HAEC Box (xyz mapping) under dimension order routing and practical network coding; regions are annotated "faster" and "slower"; time labels 41.793 s and 23.7-24.1 s.]
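
The per-hop delays among the simulation parameters can be reproduced from the latency, bandwidth, and packet size if the delay diagram is read as d_{h,p} = d_out + l + s_p/b + d_in. The sketch below is a back-calculation under that reading, not the authors' stated derivation: the combined 200 ns for d_out + d_in and the 6-byte acknowledgment size are inferred so that the listed values come out, and `per_hop_delay_ns` is a hypothetical helper.

```python
def per_hop_delay_ns(size_bytes, latency_ns=1000.0,
                     bandwidth_gbit_s=250.0, io_overhead_ns=200.0):
    """Per-hop delay in ns: I/O overhead + latency + serialization time.

    io_overhead_ns stands for d_out + d_in and is an inferred value.
    1 Gbit/s equals 1 bit/ns, so bits / (Gbit/s) yields nanoseconds.
    """
    serialization_ns = size_bytes * 8 / bandwidth_gbit_s
    return io_overhead_ns + latency_ns + serialization_ns


if __name__ == "__main__":
    print("data packet (288 bytes):", per_hop_delay_ns(288), "ns")
    print("ACK (assumed 6 bytes):  ", per_hop_delay_ns(6), "ns")
```

With these inputs a 288-byte packet needs 9.216 ns on the wire, so latency dominates the per-hop cost by two orders of magnitude.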

SLIDE 17

Modeling the Performance of Communication in Parallel Applications on the HAEC Box

¨ Simulation parameters (haec_sim)
  ¤ latency 1 μs
  ¤ bandwidth 250 Gbit/s
  ¤ packet size 288 bytes
  ¤ delay per packet per hop 1,209.216 ns
  ¤ delay per ACK per hop 1,200.192 ns
  ¤ sender delay 200 ns or 203.125 ns
  ¤ receiver delay 200 ns or 215.625 ns

[Figure: trace of lu.C.81 on Taurus (69.9% communication, 30.1% computation) compared with the simulated traces of lu.C.81 on the HAEC Box (random mapping) under dimension order routing and practical network coding; regions are annotated "slower"; time labels 41.793 s and 23.7-24.1 s.]

SLIDE 18

Modeling the Performance of Communication in Parallel Applications on the HAEC Box

¨ Simulation parameters (haec_sim)
  ¤ latency 1 μs
  ¤ bandwidth 250 Gbit/s
  ¤ packet size 288 bytes
  ¤ delay per packet per hop 1,209.216 ns
  ¤ delay per ACK per hop 1,200.192 ns
  ¤ sender delay 200 ns or 203.125 ns
  ¤ receiver delay 200 ns or 215.625 ns

[Figure: trace of lu.C.81 on Taurus (69.9% communication, 30.1% computation) compared with the simulated traces of lu.C.81 on the HAEC Box (block xyz mapping) under dimension order routing and practical network coding; regions are annotated "faster" and "slower" and compared with the other mappings ("<xyz, random" and ">xyz, <random"); time labels 41.793 s and 23.7-24.1 s.]

SLIDE 19

Summary

¨ HAEC Box: an unconventional architecture that shares important concerns with HPC systems
  ¤ Performance and energy (computation + communication)
¨ Two communication models
  ¤ Dimension order routing
  ¤ Practical network coding (novel for HPC applications)
¨ Simulation-based performance analysis using a trace-driven simulator (haec_sim)

SLIDE 20

Future Work

¨ Model more applications (HPC and beyond)
  ¤ Point-to-point communication
  ¤ Collective communication
  ¤ Combinations thereof
¨ Develop energy consumption models
  ¤ Computation and communication operations
¨ Develop optimal mapping strategies
  ¤ Communication- and topology-aware
¨ Extend the communication models
  ¤ Point-to-point: with errors/attacks
  ¤ Collective: without and with errors/attacks
  ¤ Heterogeneous links (dynamic latency, bandwidth, path selection, topology)
¨ Simulation
  ¤ Implement local resource managers (nodes, links) to enable contention modeling
  ¤ Implement runtime process migration (after optimal initial mapping)

SLIDE 21

Thank you

HAEC website: http://tu-dresden.de/sfb912