
Exploiting Extreme Processor Counts on the Cray XT4 with High-Resolution Seismic Wave Propagation Experiments



  1. Exploiting Extreme Processor Counts on the Cray XT4 with High-Resolution Seismic Wave Propagation Experiments. Mike Ashworth 1, Mario Chavez 2,3 and Eduardo Cabrera 4. 1 STFC Daresbury Laboratory, Warrington WA4 4AD, UK. 2 Institute of Engineering, UNAM, C.U., 04510, Mexico DF, Mexico. 3 Laboratoire de Géologie CNRS-ENS, 24 Rue Lhomond, Paris, France. 4 DGSCA, UNAM, C.U., 04510, Mexico DF, Mexico. 7th May 2009, CUG 2009, Atlanta

  2. Outline • Introduction to seismic wave code • Benchmark cases • Optimization • Performance profiling • Benchmark results

  3. Large subduction earthquakes. On 19th Sep 1985 a large Ms 8.1 subduction earthquake occurred on the Mexican Pacific coast, with its epicentre about 340 km from Mexico City. The losses were about 30,000 deaths and 7 billion US dollars. On 12th May 2008 the Ms 7.9 Sichuan, China, earthquake caused about 70,000 deaths and losses of 80 billion US dollars. There is therefore seismological, engineering and socio-economic interest in modelling these types of events, particularly because of the scarcity of observational instrumental data for them.

  4. Figure: map of the model region (labelled distances 500 km, 600 km, 180 km and 140 km; Ms 8.1 event marked). The inner rectangle is the rupture area of the 19/09/1985 Ms 8.1 earthquake on the surface projection of the 500 x 600 x 124 km earth crust volume 3DFD discretization.

  5. Figure: rupture plane (Strike=229 deg) with kinematic slip (cm) and depth (km); map of the 2400 km x 1600 km region; crust legend (Felsic Crust, Felsic / Int. Crust, Int. / Mafic Crust).
     Geologic structure model:
     Thickness (km)   Vp (km/s)   Vs (km/s)   ρ (Ton/m3)
     4                4.51        2.43        2.49
     2                5.39        2.90        2.55
     2                5.90        3.15        2.62
     10               6.0         3.25        2.72
     12               6.28        3.75        2.82
     9                6.55        3.80        2.98
     9                6.9         3.95        3.3
     -                8.0         4.7         3.4
     Caption: Locations of: a) the epicenter (red dot) of the 12 05 2008 Sichuan Ms 7.9; b) its rupture area and its kinematic slip; c) 9 seismographic station sites (black dots) of the China Seismographic Network; d) the surficial projection of the 2400 x 1600 x 300 km3 volume used to discretize the region of interest; f) the geologic structure adopted for the volume.

  6. Sichuan earthquake 12th May 2008

  7. Seismic wave modelling. Realistic 3D modelling of the seismic wave propagation for these types of earthquakes should include volumes of the earth crust of hundreds of kilometers. 3D finite difference modeling of realistic-earth-size seismic wave propagation problems has been successful, but very computationally demanding.

  8. fd3d earthquake simulation code. Seismic wave propagation. 3D velocity-stress equations. Structured grid. Explicit scheme: • 2nd order accurate in time • 4th order accurate in space. Regular grid partitioning. Halo exchange. Figure: the 500 km x 600 km x 124 km model volume showing Mexico City, the hypocenter and Caleta.
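The scheme described on this slide can be sketched in one dimension. This is only an illustration, not the fd3d code: it assumes a staggered grid with the standard 4th-order weights 9/8 and -1/24 and a leapfrog (2nd-order) time update, none of which the slides state explicitly. The names `step`, `rho` and `modulus` are hypothetical.

```python
# Standard 4th-order staggered-grid weights (an assumption; the slides do not
# state the coefficients used by fd3d).
C1 = 9.0 / 8.0
C2 = -1.0 / 24.0


def step(v, s, rho, modulus, dx, dt):
    """Advance a 1D velocity-stress system by one leapfrog step, in place.

    v[i] is the particle velocity at x_i and s[i] is the stress at
    x_{i+1/2}. Only interior points are updated; the two outermost points
    on each side are left for boundary conditions or halo data.
    """
    n = len(v)
    # Velocity update: rho * dv/dt = ds/dx, derivative evaluated at x_i.
    for i in range(2, n - 2):
        ds_dx = (C1 * (s[i] - s[i - 1]) + C2 * (s[i + 1] - s[i - 2])) / dx
        v[i] += dt / rho * ds_dx
    # Stress update: ds/dt = modulus * dv/dx, derivative at x_{i+1/2}.
    for i in range(1, n - 2):
        dv_dx = (C1 * (v[i + 1] - v[i]) + C2 * (v[i + 2] - v[i - 1])) / dx
        s[i] += dt * modulus * dv_dx
```

The four-point stencil reaches two cells to either side, which is why a partitioned version of such a scheme needs ghost cells from neighbouring subdomains.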

  9. fd3d output: synthetic seismograms. Figure: velocity (m/s) against time (0 to 250 s) at two sites in the 500 km x 600 km x 124 km model volume. Caleta: X Vmax=0.13127, Z Vmax=0.05664, Y Vmax=0.07261. Mexico City: X Vmax=0.01575, Z Vmax=0.03251, Y Vmax=0.02055.

  10. Partition The Problem. Figure: the model volume split into subdomains, with panels labelled Communication (A) (Geometry and Physical Properties) and Communication (B) (Among Cells), and a Ghost Cell marked at index (i, j, k).

  11. Figure: the four stages shown on the model volume. 1 Problem. 2 Partition. 3 Communication (A) (Geometry and Physical Properties). 4 Communication (B) (Among Cells). A Ghost Cell is marked at index (i, j, k).
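The partition and ghost-cell steps on these slides can be mimicked without MPI. The sketch below is a serial, one-dimensional stand-in, assuming equal-sized blocks and a halo width chosen by the caller; `partition` and `exchange_halos` are hypothetical names, and in a real run the copies would be messages between neighbouring processes.

```python
def partition(field, nparts, halo):
    """Split a 1D field into nparts equal blocks, each padded with
    `halo` ghost cells on both sides (initialised to zero)."""
    n = len(field)
    assert n % nparts == 0, "sketch assumes the grid divides evenly"
    size = n // nparts
    assert size >= halo, "each block must be at least one halo wide"
    parts = []
    for p in range(nparts):
        interior = field[p * size:(p + 1) * size]
        parts.append([0.0] * halo + list(interior) + [0.0] * halo)
    return parts


def exchange_halos(parts, halo):
    """Fill every block's ghost cells from its neighbours' edge cells.

    Only interior cells are read and only ghost cells are written, so the
    order in which blocks are visited does not matter.
    """
    for p, local in enumerate(parts):
        if p > 0:
            left = parts[p - 1]
            local[:halo] = left[-2 * halo:-halo]  # left neighbour's last interior cells
        if p < len(parts) - 1:
            right = parts[p + 1]
            local[-halo:] = right[halo:2 * halo]  # right neighbour's first interior cells
```

Blocks at the ends of the domain keep their outer ghost cells unchanged, which is where a physical boundary condition would be applied.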

  12. The benchmark cases. Size of domain is 500 x 260 x 124 km. Series of models: • 500 m resolution, 1000 x 520 x 248 grid • 250 m resolution … • 125 m resolution … • 62.5 m resolution … • 31.25 m resolution, 16000 x 8320 x 3968 grid
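The two grid sizes that the slide states follow from dividing the domain extents by the resolution. A minimal sketch of that arithmetic, assuming the grid is simply extent divided by spacing in each direction (`grid_shape` and `grid_points` are hypothetical helper names):

```python
DOMAIN_KM = (500.0, 260.0, 124.0)  # domain size given on the slide


def grid_shape(resolution_m, domain_km=DOMAIN_KM):
    """Number of grid cells in each direction for a given spacing in metres."""
    return tuple(round(extent * 1000.0 / resolution_m) for extent in domain_km)


def grid_points(resolution_m):
    """Total number of grid points for a given spacing in metres."""
    nx, ny, nz = grid_shape(resolution_m)
    return nx * ny * nz


print(grid_shape(500.0))   # coarsest case on the slide
print(grid_shape(31.25))   # finest case on the slide
```

Each halving of the resolution multiplies the number of grid points by eight, so the finest case has 4096 times as many points as the coarsest.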

  13. HECToR vs. Jaguar ‘Pf’ (from Jason Beech-Brandt, Cray)
      HECToR dual-core
      Core: 2.8 GHz clock frequency; SSE SIMD FPU (2 flops/cycle = 5.6 GF peak)
      Cache hierarchy: L1 Dcache/Icache 64k/core; L2 D/I cache 1M/core; SW prefetch and loads to L1; evictions and HW prefetch to L2
      Memory: 6 GB/node = 4 GB + 2 GB; dual channel DDR2; 10 GB/s peak @ 667 MHz
      Jaguar ‘Pf’ quad-core
      Core: 2.3 GHz clock frequency; SSE SIMD FPU (4 flops/cycle = 9.2 GF peak)
      Cache hierarchy: L1 Dcache/Icache 64k/core; L2 D/I cache 512 KB/core; L3 shared cache 2 MB/socket; SW prefetch and loads to L1, L2, L3; evictions and HW prefetch to L1, L2, L3
      Memory: 16 GB/node symmetric; dual channel DDR2; 12 GB/s peak @ 800 MHz
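The peak figures quoted on this slide are the clock frequency multiplied by the flops per cycle. A small sketch of that arithmetic follows; the per-socket helper additionally assumes one socket holds two cores on HECToR and four on Jaguar, as the "dual-core" and "quad-core" labels suggest. Both function names are hypothetical.

```python
def peak_gflops_per_core(clock_ghz, flops_per_cycle):
    """Peak floating-point rate of one core in GF (Gflop/s)."""
    return clock_ghz * flops_per_cycle


def peak_gflops_per_socket(clock_ghz, flops_per_cycle, cores):
    """Peak rate of one socket, assuming all cores run at the same clock."""
    return cores * peak_gflops_per_core(clock_ghz, flops_per_cycle)


hector_core = peak_gflops_per_core(2.8, 2)   # dual-core part
jaguar_core = peak_gflops_per_core(2.3, 4)   # quad-core part
print(round(hector_core, 1), round(jaguar_core, 1))
```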

  14. Optimizations. Opt 1: change MPI_Sendrecv to MPI_Irecv and MPI_Isend, pre-posting the receives before the buffer copies. Opt 2: Opt 1 + in the BC code, replace loops involving array syntax by a triply-nested loop so that the order of memory accesses is explicit. Opt 3: Opt 2 + two subroutines were being called 320 million times; push the loop into the subroutines.
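Opt 2 and Opt 3 can be illustrated without reference to the real fd3d routines. The sketch below is hypothetical throughout: it stores a 3D field in a flat list with the first index varying fastest (mimicking Fortran array order, which is an assumption about the code), and uses a made-up damping update. The first version calls a routine once per grid point; the second pushes the triply-nested loop into the routine so that a single call sweeps memory in a fixed, contiguous order.

```python
def damp_point(value):
    """Per-point routine: one call per grid point (the pattern before Opt 3)."""
    return 0.5 * value


def damp_field_per_point(field, nx, ny, nz):
    """Caller owns the loops and pays for one routine call per grid point."""
    for k in range(nz):
        for j in range(ny):
            for i in range(nx):
                idx = i + nx * (j + ny * k)
                field[idx] = damp_point(field[idx])


def damp_field_in_routine(field, nx, ny, nz):
    """Loop pushed into the routine: one call, explicit triply-nested loop,
    innermost index walking contiguous memory."""
    for k in range(nz):
        for j in range(ny):
            base = nx * (j + ny * k)
            for i in range(nx):
                field[base + i] *= 0.5
```

Both versions produce the same field; the second removes the per-point call overhead and makes the memory access order explicit to the compiler.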

  15. Optimizations on HECToR. Figure: performance (Ggridpoints-steps/sec, axis 0 to 30) against number of processor cores (0 to 8192) for the Original, Opt 1, Opt 2 and Opt 3 versions, with annotations of 10% and 15%.
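The unit on the vertical axis suggests how the performance figure is formed: grid points times time steps, divided by elapsed time, in units of 10^9. The sketch below assumes that reading; the function name and the numbers in the usage line are illustrative only and are not results from the talk.

```python
def ggridpoint_steps_per_sec(nx, ny, nz, steps, seconds):
    """Throughput in units of 1e9 grid-point updates per second."""
    return nx * ny * nz * steps / seconds / 1e9


# Hypothetical run: the 500 m grid advanced 100 steps in 10 seconds.
rate = ggridpoint_steps_per_sec(1000, 520, 248, 100, 10.0)
```

A metric of this form is independent of problem size, which makes runs at different resolutions and core counts directly comparable.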

  16. Vectorization. PGI compiler with -O3 -fastsse. Compiler messages:
      836, Generated 3 alternate loops for the inner loop
      Generated vector sse code for inner loop
      Generated 8 prefetch instructions for this loop
      Generated vector sse code for inner loop
      Generated 8 prefetch instructions for this loop
      Generated vector sse code for inner loop
      Generated 8 prefetch instructions for this loop
      Generated vector sse code for inner loop
      Generated 8 prefetch instructions for this loop
