Application Characteristics and Performance on a Cray XE6


Application Characteristics and Performance on a Cray XE6. Courtenay T. Vaughan, Sandia National Laboratories. Cray User Group, May 2011.

SLIDE 1

Application Characteristics and Performance on a Cray XE6

Courtenay T. Vaughan

Sandia National Laboratories
Cray User Group, May 2011

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

SLIDE 2

Cielo

Cray XE6 with 6654 compute nodes
dual-socket oct-core AMD Magny-Cours nodes
clocked at 2.4 GHz
32 GB of 1.333 GHz DDR3 memory per node
3D torus with Gemini interconnect
have a large machine and smaller machines
were configured briefly as XT6 with the same nodes and SeaStar interconnect

SLIDE 3

XT5

Cray XT5 with 160 compute nodes
dual socket with 6 core AMD Istanbul processors
2.4 GHz processors
32 GB of 800 MHz DDR2 Memory per node

6 x 4 x 8 3D torus with SeaStar 2.2

SLIDE 4

XE6 node

Image courtesy of Cray, Inc.

SLIDE 5

CTH

Three-dimensional shock hydrodynamics code
Ran in flat mesh mode - no AMR (Adaptive Mesh Refinement)
Several points in each timestep where each processor sends a few large messages to up to six neighbors
Messages are aggregated from several variables per cell

Code is mostly FORTRAN with a little C
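
The exchange pattern described on this slide (up to six face neighbors, with several variables per cell aggregated into one message) can be sketched as follows. This is a minimal illustration in Python, not CTH's actual Fortran/C code: the function names, the non-periodic processor grid, and the flat-list field layout are all assumptions made for the example.

```python
def face_neighbors(coord, grid):
    """Return coordinates of processors sharing a face with `coord`.

    Assumes a non-periodic 3D processor grid (hypothetical layout),
    so boundary processors have fewer than six neighbors.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            candidate = list(coord)
            candidate[axis] += step
            if 0 <= candidate[axis] < grid[axis]:
                neighbors.append(tuple(candidate))
    return neighbors


def pack_face(fields, face_indices):
    """Aggregate the face values of every variable into one flat buffer.

    `fields` is a list of per-variable value lists; `face_indices` selects
    the cells on the face being sent. One buffer means one large message
    instead of one small message per variable.
    """
    buffer = []
    for field in fields:
        buffer.extend(field[i] for i in face_indices)
    return buffer


def unpack_face(buffer, n_fields, n_face):
    """Split a received buffer back into one list of face values per variable."""
    return [buffer[k * n_face:(k + 1) * n_face] for k in range(n_fields)]
```

An interior processor gets six neighbors and a corner processor gets three, which is why the slide says "up to six". Packing all variables into one buffer trades a copy for fewer, larger messages.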

SLIDE 6

CTH Problems

explosively formed Shaped-Charge problem with 4 materials, high explosives, and 90 x 216 x 90 cells/processor in weak scaling mode

– Messages aggregate 40 variables per cell and average 5.2 MB

impact Meso-Scale problem with 11 materials and 80 x 80 x 275 cells/processor in weak scaling mode

– Messages aggregate 75 variables per cell and average 10.4 MB
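
As a rough cross-check of the reported message sizes, the average face message can be estimated from the cells per processor and the variables per cell. The sketch below rests on assumptions the slides do not state: 8-byte values, one layer of cells per face, an interior processor with all six neighbors, and an optional `pad` of extra cells on each side of every dimension. The function name is hypothetical.

```python
def average_face_message_bytes(cells, variables_per_cell,
                               bytes_per_value=8, pad=0):
    """Estimate the mean size of the six face messages of one processor.

    cells: (nx, ny, nz) cells per processor.
    pad: assumed extra cells on each side of every dimension (0 = none).
    bytes_per_value: assumed 8 (double precision); not given in the slides.
    """
    nx, ny, nz = (n + 2 * pad for n in cells)
    # Cell counts of the faces normal to x, y, and z; each occurs twice.
    face_cells = [ny * nz, nx * nz, nx * ny]
    mean_cells = 2 * sum(face_cells) / 6
    return mean_cells * variables_per_cell * bytes_per_value


shaped = average_face_message_bytes((90, 216, 90), 40)
meso = average_face_message_bytes((80, 80, 275), 75)
print(f"shaped-charge: {shaped / 1e6:.2f} MB, meso-scale: {meso / 1e6:.2f} MB")
```

Under these assumptions the estimate is about 5.0 MB and 10.1 MB with `pad=0`, and about 5.2 MB and 10.4 MB with `pad=1`. Both are close to the reported averages, though the agreement alone does not confirm how CTH actually sizes its messages.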

SLIDE 7

Shaped Charge Problem

SLIDE 8

CTH Communication matrices on 64 cores

Shaped-Charge Meso-Scale

SLIDE 9

CTH Communication traces from one timestep on 64 cores

Shaped-Charge Meso-Scale

SLIDE 10

PRONTO

Structural mechanics code with contact algorithm
Communication for structural mechanics portion consists of boundary exchanges for single variables from static decomposition
Contact algorithm based on dynamic secondary decomposition which changes during calculation and requires communication from and back to the primary decomposition
Code is FORTRAN 90 with C for contact communication

SLIDE 11

PRONTO Problems

Walls problem

– Two sets of two brick walls colliding
– Each processor has 320 bricks, each of which has 128 elements
– All communication related to contact

Can Crush problem

– Cylinder crushed by block
– Communication both for finite element and contact algorithms
– More balanced problem

SLIDE 12

Walls Problem

SLIDE 13

Can Crush Problem

SLIDE 14

PRONTO Communication matrices on 64 cores

Walls Can Crush

SLIDE 15

PRONTO Communication traces on 64 cores

Walls Can Crush

SLIDE 16

CTH on XT5, XT6, and XE6

[Plot: Time versus Number of Cores (1 to 1024); time axis ticks from 500 to 3000; series: sc XT5, sc XT5 -S4, sc XT6, sc XE6, meso XT5, meso XT5 -S4, meso XT6, meso XE6]

SLIDE 17

PRONTO on XT5, XT6, and XE6

[Plot: Time (seconds) versus Number of Cores (16 to 256); time axis from 0.0 to 2.5; series: walls XT5, walls XT5 -S4, walls XT6 -SN2, walls XT6, walls XE6, can XT5, can XT5 -S4, can XE6]

SLIDE 18

Average message traffic on 256 cores

[Bar chart: Number/minute of messages versus message Size; size bins: < 16B, 16B - 256B, 256B - 4KB, 4KB - 64KB, 64KB - 1MB, 1MB - 16MB, plus total KB/sec; count axis from 10000 to 70000, with two off-scale bars labeled 13e4 and 19e4; series: XT5 - CTH - shaped, XT5 - CTH - meso, XT5 - P3D - walls, XT5 - P3D - can crush, XE6 - CTH - shaped, XE6 - CTH - meso, XE6 - P3D - walls, XE6 - P3D - can crush]
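
The message-size bins on this chart (< 16B, 16B - 256B, 256B - 4KB, 4KB - 64KB, 64KB - 1MB, 1MB - 16MB) can be reproduced from a list of message sizes with a small histogram routine. This is only a sketch: it assumes KB means 1024 bytes and that each bin includes its lower edge, neither of which the slide states, and it is not the profiling tool used to produce the chart.

```python
from bisect import bisect_right
from collections import Counter

KB = 1024          # assumption: binary kilobytes
MB = 1024 * KB

# Exclusive upper edge of each bin, in bytes, paired with the chart's labels.
EDGES = [16, 256, 4 * KB, 64 * KB, MB, 16 * MB]
LABELS = ["< 16B", "16B - 256B", "256B - 4KB",
          "4KB - 64KB", "64KB - 1MB", "1MB - 16MB"]


def bucket_label(size_bytes):
    """Return the chart bin that a message of `size_bytes` falls into."""
    index = bisect_right(EDGES, size_bytes)
    if index >= len(LABELS):
        raise ValueError(f"{size_bytes} bytes is beyond the largest bin")
    return LABELS[index]


def size_histogram(sizes):
    """Count messages per bin for an iterable of sizes in bytes."""
    return Counter(bucket_label(size) for size in sizes)
```

With these edges, a CTH message of roughly 5.2 MB falls in the 1MB - 16MB bin, while small contact messages land in the leftmost bins.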

SLIDE 19

Summary of Results

Large portion of performance difference for both codes related to memory contention on XT5 when using 6 cores per NUMA region
CTH has large network bandwidth requirements and shows some performance improvement moving to the XE6
PRONTO can send lots of small messages and shows more performance improvement moving to the XE6

SLIDE 20

Future Work

Extend results to larger number of processors
Develop mini-app for CTH to see if we can take advantage of the message injection rate of the Gemini interconnect