Exploiting Extreme Processor Counts on the Cray XT4 with High-Resolution Seismic Wave Propagation Experiments



SLIDE 1

7th May 2009, CUG 2009, Atlanta

Exploiting Extreme Processor Counts on the Cray XT4 with High-Resolution Seismic Wave Propagation Experiments

Mike Ashworth (1), Mario Chavez (2,3) and Eduardo Cabrera (4)

1 STFC Daresbury Laboratory, Warrington WA4 4AD, UK
2 Institute of Engineering, UNAM, C.U., 04510, Mexico DF, Mexico
3 Laboratoire de Géologie CNRS-ENS, 24 Rue Lhomond, Paris, France
4 DGSCA, UNAM, C.U., 04510, Mexico DF, Mexico

SLIDE 2

Outline

Introduction to seismic wave code
Benchmark cases
Optimization
Performance profiling
Benchmark results

SLIDE 3

Large subduction earthquakes

On 19th Sep 1985 a large Ms 8.1 subduction earthquake occurred on the Mexican Pacific coast, with an epicentre about 340 km from Mexico City. The losses were about 30,000 deaths and 7 billion US dollars.

On 12th May 2008 the Ms 7.9 Sichuan, China, earthquake caused about 70,000 deaths and losses of about 80 billion US dollars.

There is therefore seismological, engineering and socio-economic interest in modelling these types of events, particularly because of the scarcity of observational instrumental data for them.

SLIDE 4

[Figure: the inner rectangle is the rupture area of the 19/09/1985 Ms 8.1 earthquake, on the surface projection of the 500 x 600 x 124 km earth crust volume used for the 3DFD discretization. Labels: Ms 8.1; (SS), (SC), (SD); profile P-P´.]

SLIDE 5

[Figure: map (2400 km x 1600 km) with the earthquake rupture, a slip scale in cm, depth in km, rupture times of 30 s, 60 s, 90 s and 120 s, and Strike = 229 deg. Station labels: HHC, BJI, TIY, TIA, XAN, LZH, CD2 (Chengdu), GYA, KMI.]

Geologic structure model (values in the order listed on the slide)

Thickness (km): 4, 2, 2, 10, 12, 9, 9
Vp (km/s): 4.51, 5.39, 5.90, 6.0, 6.28, 6.55, 6.9, 8.0
Vs (km/s): 2.43, 2.90, 3.15, 3.25, 3.75, 3.80, 3.95, 4.7
ρ (Ton/m3): 2.49, 2.55, 2.62, 2.72, 2.82, 2.98, 3.3, 3.4
Crust labels: Felsic Crust; Felsic / Int. Crust; Int. / Mafic Crust

Locations of: a) the epicenter (red dot) of the 12 05 2008 Sichuan Ms 7.9; b) its rupture area and its kinematic slip; c) 9 seismographic stations sites (black dots) of the China Seismographic Network; d) the surficial projection of the 2400 x 1600 x 300 km3 volume used to discretize the region of interest; f) the geologic structure adopted for the volume

SLIDE 6

Sichuan earthquake 12th May 2008

SLIDE 7

Seismic wave modelling

Realistic 3D modelling of the seismic wave propagation for these types of earthquakes should include volumes of the earth crust of hundreds of kilometers

3D finite difference modeling of realistic-earth-size seismic wave propagation problems has been successful, but very computationally demanding

SLIDE 8

fd3d earthquake simulation code

Seismic wave propagation
3D velocity-stress equations
Structured grid
Explicit scheme
  • 2nd order accurate in time
  • 4th order accurate in space
Regular grid partitioning
Halo exchange
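
The link between a 4th-order spatial stencil, regular partitioning and halo exchange can be sketched in one dimension. This is only an illustration under stated assumptions: it uses the standard centred 5-point first-derivative stencil (the slides do not say which 4th-order stencil fd3d uses), the names `ddx4`, `serial_derivative` and `partitioned_derivative` are hypothetical, and the halo is filled by slicing a global list where an MPI code would receive the values from neighbouring ranks.

```python
# Hypothetical 1D illustration; fd3d itself is a 3D code.
HALO = 2  # a 5-point, 4th-order stencil reaches two points to each side


def ddx4(f, i, h):
    """4th-order centred first derivative of f at index i with spacing h."""
    return (-f[i + 2] + 8.0 * f[i + 1] - 8.0 * f[i - 1] + f[i - 2]) / (12.0 * h)


def serial_derivative(f, h):
    """Derivative at every point that has HALO neighbours on both sides."""
    return [ddx4(f, i, h) for i in range(HALO, len(f) - HALO)]


def partitioned_derivative(f, h, nparts):
    """Split the interior into equal blocks, attach halos, differentiate each block."""
    n_interior = len(f) - 2 * HALO
    assert n_interior % nparts == 0, "sketch assumes equal-sized blocks"
    block = n_interior // nparts
    result = []
    for p in range(nparts):
        start = HALO + p * block  # global index of the first owned point
        # Owned points plus HALO points on each side. In an MPI code the
        # halo values would arrive from the neighbouring ranks.
        local = f[start - HALO:start + block + HALO]
        result.extend(ddx4(local, i, h) for i in range(HALO, HALO + block))
    return result


h = 0.5
field = [(i * h) ** 3 for i in range(20)]  # f(x) = x^3, so f'(x) = 3 x^2
serial = serial_derivative(field, h)
split = partitioned_derivative(field, h, nparts=4)
```

Each block only needs two extra points per side to reproduce the serial result, which is why the halo width follows from the stencil width.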

[Figure: the model volume with the hypocenter, Caleta and Mexico City marked, profile P-P´ and grid indices i, j, k. Dimension labels: 500 km, 600 km, 180 km, 140 km, 124 km.]

SLIDE 9

fd3d output: synthetic seismograms

[Figure: the model volume (axes X, Y, Z) with the hypocenter, Caleta and Mexico City marked, and synthetic seismograms plotting velocity (m/s) against time (0 to 250 s) at both sites.]

Vmax values shown on the panels:
  • Mexico City: X 0.01575, Y 0.02055, Z 0.03251
  • Caleta: X 0.13127, Y 0.07261, Z 0.05664

SLIDE 10

[Figure: four views of the model volume (hypocenter, Caleta, Mexico City, profile P-P´, axes X, Y, Z, grid indices i, j, k, ghost cells), one per stage below.]

The Problem
Partition
Communication (A) (Geometry and Physical Properties)
Communication (B) (Among Cells)

SLIDE 11

[Figure: the same four views of the model volume as on the previous slide, numbered in sequence; grid indices i, j, k and a ghost cell are marked.]

1 Problem
2 Partition
3 Communication (A) (Geometry and Physical Properties)
4 Communication (B) (Among Cells)

SLIDE 12

The benchmark cases

Size of domain is 500 x 260 x 124 km

Series of models:
  • 500m resolution: 1000 x 520 x 248 grid
  • 250m resolution
  • 125m resolution
  • 62.5m resolution
  • 31.25m resolution: 16000 x 8320 x 3968 grid
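
The grid sizes for the whole series follow from dividing the domain extent by the resolution. The sketch below reproduces the two grids quoted on the slide and fills in the intermediate ones. The intermediate shapes are computed here rather than taken from the slides, and the function names are hypothetical.

```python
DOMAIN_KM = (500, 260, 124)                         # domain size from the slide
RESOLUTIONS_M = (500.0, 250.0, 125.0, 62.5, 31.25)  # model series from the slide


def grid_shape(domain_km, resolution_m):
    """Number of grid cells along each axis for a given resolution."""
    return tuple(round(length_km * 1000.0 / resolution_m) for length_km in domain_km)


def grid_points(shape):
    """Total number of grid points in the volume."""
    nx, ny, nz = shape
    return nx * ny * nz


shapes = {res: grid_shape(DOMAIN_KM, res) for res in RESOLUTIONS_M}
for res, shape in shapes.items():
    print(f"{res:>7} m  {shape[0]} x {shape[1]} x {shape[2]}  {grid_points(shape):,} points")
```

Each halving of the resolution multiplies the number of grid points by 8, so the finest model has 4096 times as many points as the coarsest.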

SLIDE 13

HECToR vs. Jaguar ‘Pf’

HECToR dual-core
Core
  – 2.8 GHz clock frequency
  – SSE SIMD FPU (2 flops/cycle = 5.6 GF peak)
Cache Hierarchy
  – L1 Dcache/Icache: 64k/core
  – L2 D/I cache: 1M/core
  – SW Prefetch and loads to L1
  – Evictions and HW prefetch to L2
Memory
  – 6 GB/node = 4 GB + 2 GB
  – Dual Channel DDR2
  – 10 GB/s peak @ 667 MHz

Jaguar ‘Pf’ quad-core
Core
  – 2.3 GHz clock frequency
  – SSE SIMD FPU (4 flops/cycle = 9.2 GF peak)
Cache Hierarchy
  – L1 Dcache/Icache: 64k/core
  – L2 D/I cache: 512 KB/core
  – L3 Shared cache 2 MB/Socket
  – SW Prefetch and loads to L1, L2, L3
  – Evictions and HW prefetch to L1, L2, L3
Memory
  – 16 GB/node symmetric
  – Dual Channel DDR2
  – 12 GB/s peak @ 800 MHz

from Jason Beech-Brandt, Cray

SLIDE 14

Optimizations

Opt 1: change MPI_Sendrecv to MPI_Irecv, MPI_Isend, preposting receives before buffer copies

Opt 2: Opt 1 + BC code: replace loops involving array syntax by a triply-nested loop so that the order of memory accesses is explicit

Opt 3: Opt 2 + two subroutines were being called 320 million times: push loop into subroutines
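
The idea behind Opt 3 can be sketched as follows. This is a language-neutral illustration in Python, not the fd3d code: the kernel arithmetic and the names `update_point`, `update_all_per_point` and `update_all_pushed` are hypothetical. The point is only that moving the loop inside the routine replaces one call per grid point with one call per array while leaving the result unchanged.

```python
def update_point(value, coeff):
    """Per-point kernel: the arithmetic is cheap, so call overhead dominates."""
    return value + coeff * value


def update_all_per_point(values, coeff):
    """Original pattern: the caller loops and calls the kernel once per point."""
    return [update_point(v, coeff) for v in values]


def update_all_pushed(values, coeff):
    """Opt 3 pattern: a single call, with the loop running inside the routine."""
    out = [0.0] * len(values)
    for i, v in enumerate(values):
        out[i] = v + coeff * v
    return out


grid = [float(i) for i in range(1000)]
before = update_all_per_point(grid, 0.5)
after = update_all_pushed(grid, 0.5)
calls_before = len(grid)  # one kernel call per grid point
calls_after = 1           # one call for the whole array
```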

SLIDE 15

SLIDE 16

SLIDE 17

Optimizations on HECToR

[Figure: performance (Ggridpoints-steps/sec) against number of processor cores (2048 to 8192) for the Original, Opt 1, Opt 2 and Opt 3 versions. Annotations: 10%, 15%.]

SLIDE 18

Vectorization

PGI compiler with -O3 -fastsse

836, Generated 3 alternate loops for the inner loop
     Generated vector sse code for inner loop
     Generated 8 prefetch instructions for this loop
     Generated vector sse code for inner loop
     Generated 8 prefetch instructions for this loop
     Generated vector sse code for inner loop
     Generated 8 prefetch instructions for this loop
     Generated vector sse code for inner loop
     Generated 8 prefetch instructions for this loop

SLIDE 19

Craypat HWPC

USER
  Time%                       100.0%
  Time                        2442.711073 secs
  Imb.Time                    -- secs
  Imb.Time%                   --
  Calls                       0.0 /sec         2.0 calls
  DATA_CACHE_MISSES           26.767M/sec      64384203739 misses
  PAPI_TOT_INS                1273.576M/sec    3063435776248 instr
  PAPI_L1_DCA                 724.935M/sec     1743745514390 refs
  PAPI_FP_OPS                 557.064M/sec     1339951513747 ops
  User time (approx)          2405.380 secs    6735065200928 cycles   98.5%Time
  Average Time per Call       1221.355536 sec
  CrayPat Overhead : Time     0.0%
  HW FP Ops / User time       557.064M/sec     1339951513747 ops      9.9%peak(DP)
  HW FP Ops / WCT             548.551M/sec
  HW FP Ops / Inst            43.7%
  Computational intensity     0.20 ops/cycle   0.77 ops/ref
  Instr per cycle             0.45 inst/cycle
  MIPS                        2608284.49M/sec
  MFLOPS (aggregate)          1140867.64M/sec
  Instructions per LD & ST    56.9% refs       1.76 inst/ref
  D1 cache hit,miss ratios    96.3% hits       3.7% misses
  D1 cache utilization (M)    27.08 refs/miss  3.385 avg uses
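
The derived figures in this report can be recomputed from the raw counters, which is a useful check when reading CrayPat output. The sketch below does this. It assumes the profile was taken on HECToR, so the 5.6 GF per-core peak from the hardware slide applies (the implied 2.8 GHz clock is consistent with that). The core count is an inference from the aggregate rates, not a figure stated on the slide.

```python
# Raw counters and times copied from the CrayPat USER summary.
user_time_s = 2405.380
cycles = 6735065200928
instructions = 3063435776248   # PAPI_TOT_INS
l1_refs = 1743745514390        # PAPI_L1_DCA
fp_ops = 1339951513747         # PAPI_FP_OPS
l1_misses = 64384203739        # DATA_CACHE_MISSES
aggregate_mflops = 1140867.64  # MFLOPS (aggregate)
peak_flops_per_core = 5.6e9    # assumption: HECToR dual-core peak

clock_ghz = cycles / user_time_s / 1e9
mflops_per_core = fp_ops / user_time_s / 1e6
percent_of_peak = 100.0 * fp_ops / user_time_s / peak_flops_per_core
fp_per_instruction_pct = 100.0 * fp_ops / instructions
ops_per_cycle = fp_ops / cycles
ops_per_ref = fp_ops / l1_refs
instr_per_cycle = instructions / cycles
instr_per_ref = instructions / l1_refs
refs_per_miss = l1_refs / l1_misses
miss_ratio_pct = 100.0 * l1_misses / l1_refs
implied_cores = aggregate_mflops / mflops_per_core  # inferred, not on the slide

print(f"clock            {clock_ghz:.2f} GHz")
print(f"MFLOPS per core  {mflops_per_core:.3f}")
print(f"% of peak        {percent_of_peak:.1f}")
print(f"ops/cycle        {ops_per_cycle:.2f}")
print(f"ops/ref          {ops_per_ref:.2f}")
print(f"inst/cycle       {instr_per_cycle:.2f}")
print(f"refs/miss        {refs_per_miss:.2f}")
print(f"implied cores    {implied_cores:.0f}")
```

Under these assumptions the code sustains roughly a tenth of peak per core, at about 0.2 floating point operations per cycle.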

SLIDE 20

62.5m resolution

[Figure: performance (Ggridpoints-steps/sec) against number of processor cores (4096 to 16384) for Cray XT4 HECToR and Cray XT4 jaguar. Annotations: HECToR faster, 11% and 6%.]

SLIDE 21

Dual-core vs Quad-core

Headline Linpack performance per core is faster: QC 7.0, DC 4.8 Gflop/s/core (x1.45)

HECToR Allocation Unit is a notional processor running Linpack at 1 Gflop/s for 1 hour

Gflop/s/core = AUs per core hour

Unless your app scales as well as Linpack (x1.45) your Allocation Units will buy less app time
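
The accounting argument can be made concrete with a little arithmetic. The sketch below uses only the two Linpack figures from the slide. The function names and the example per-core application speedups are hypothetical, chosen to show what happens when an application gains less per core than Linpack does.

```python
LINPACK_GFLOPS_PER_CORE = {"dual-core": 4.8, "quad-core": 7.0}  # from the slide


def aus_charged(system, cores, hours):
    """AUs charged for a job, using Gflop/s/core = AUs per core hour."""
    return LINPACK_GFLOPS_PER_CORE[system] * cores * hours


def work_per_au_ratio(app_speedup_per_core):
    """Application work bought per AU on quad-core relative to dual-core.

    app_speedup_per_core is a hypothetical per-core speedup of the
    application when moving from the dual-core to the quad-core system.
    """
    linpack_ratio = (LINPACK_GFLOPS_PER_CORE["quad-core"]
                     / LINPACK_GFLOPS_PER_CORE["dual-core"])
    return app_speedup_per_core / linpack_ratio


linpack_ratio = 7.0 / 4.8               # about 1.46; the slide quotes x1.45
no_gain = work_per_au_ratio(1.0)        # app no faster per core: about 0.69
break_even = work_per_au_ratio(linpack_ratio)  # app matches Linpack: 1.0

print(f"Linpack ratio QC/DC: {linpack_ratio:.3f}")
print(f"Work per AU if the app is no faster per core: {no_gain:.2f}")
print(f"AUs for 1024 dual-core cores for 1 hour: {aus_charged('dual-core', 1024, 1):.1f}")
```

An application whose per-core speed does not change would get about 31% less work per AU after the upgrade.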

SLIDE 22

different resolutions on HECToR

[Figure: performance (Ggridpoints-steps/sec) against number of processor cores (1024 to 8192) for the 62.5m, 125m and 250m models. Annotation: craypat MPI 10.4%, 16.5%, 14.5%.]

SLIDE 23

different resolutions on jaguar

[Figure: performance (Ggridpoints-steps/sec) against number of processor cores (4096 to 24576) for the 31.25m, 62.5m, 125m and 250m models.]

SLIDE 24

31.25m resolution from jaguar to jaguar ‘Pf’

[Figure: performance (Ggridpoints-steps/sec) against number of processor cores (8192 to 73728) for Cray XT5 jaguarpf and Cray XT4 jaguar.]

SLIDE 25

Conclusions

We have carried out optimization and performance profiling of the seismic wave propagation code

We have run the code on dual-core and quad-core systems on up to 65536 cores

Performance continues to scale to around 65536 cores though there are some aspects which need further investigation

There are issues with the performance per core in moving from dual-core to quad-core with this code (and other codes of this type)

SLIDE 26

Acknowledgements

This research used resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-ASC05-00OR22725.

The authors also acknowledge support from the Scientific Computing Advanced Training (SCAT) project through Europe Aid contract II-0537-FC-FA. http://www.scat-alfa.eu

We are grateful to John Levesque of Cray Inc. for performing benchmark runs on the Jaguar Petaflop system.

SLIDE 27

If you have been … … thank you for listening

Mike Ashworth http://www.cse.scitech.ac.uk/