
Large Scale Dark Matter Simulation - PowerPoint PPT Presentation



  1. Large Scale Dark Matter Simulation. Tomoaki Ishiyama, University of Tsukuba (HPCI Strategic Field Program 5)

  2. Standard cosmology: the cold dark matter (CDM) model
● Dark matter dominates gravitational structure formation in the Universe; there is five times as much of it as baryonic matter.
● Structure grows from primordial small density fluctuations.
● Smaller dark matter structures gravitationally collapse first, then merge into larger structures (dark matter halos).
● Baryonic matter condenses in the potential wells of dark matter halos and forms galaxies.
(Figure: large scale structure, dark matter halo, galaxy)

  3. Large scale dark matter simulation
● High resolution observations
● Wide field -> large, high resolution simulations
● 4096^3 - 8192^3 particles in 800-1600 Mpc boxes
● Preparations for other large scale simulations
  ● Galaxy formation
  ● Star formation

  4. Parallel scaling
● Parallel calculation is not trivial: even with a fast network, communication overhead matters
● Gravity is a long-range force -> long communication between nodes (CPU + memory)
● Cosmological N-body simulation codes can run efficiently on systems with a few hundred nodes. But... (e.g. Gadget-2, Springel 2005)

  5. Systems
● PC cluster (~100 cores)
● CfCA Cray-XT4 (~700 nodes, ~3000 cores)
● CfCA Cray-XC30 (~1060 nodes, ~25440 cores)
● K computer (~80000 nodes, ~0.66M cores)
How can we get good efficiency on systems with ~1M cores?
Here we use the K computer, at the time the world's third fastest (and most power-consuming) system.

  6. Well-known N-body algorithms
● Tree (Barnes and Hut 1986)
  ● Solves the forces from distant particles by multipole expansions: O(N log N)
● Particle-Mesh (PM; e.g. Hockney and Eastwood 1981)
  ● Calculates the mass density on a coarse-grained mesh (see the sketch below)
  ● Solves the Poisson equation by FFT (ρ ➙ Φ)
  ● Advantage: very fast, and the periodic boundary is naturally satisfied
  ● Disadvantage: resolution is restricted by the mesh size
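As an illustration of the PM density-assignment step, here is a minimal cloud-in-cell (CIC) sketch. The mesh size NG, box size BOX, and the Particle layout are assumptions for the example, not the data structures of the actual code.

```cpp
// Minimal cloud-in-cell (CIC) density assignment onto a periodic mesh.
#include <algorithm>
#include <cmath>
#include <vector>

constexpr int    NG  = 64;     // mesh cells per dimension (assumed)
constexpr double BOX = 100.0;  // box size (assumed)

struct Particle { double x[3]; double mass; };

void assign_density_cic(const std::vector<Particle>& parts,
                        std::vector<double>& rho)   // rho has NG*NG*NG cells
{
    const double cell = BOX / NG;
    std::fill(rho.begin(), rho.end(), 0.0);

    for (const Particle& p : parts) {
        int    i[3];  // index of the lower cell
        double d[3];  // fractional offset inside that cell
        for (int k = 0; k < 3; ++k) {
            double s = p.x[k] / cell - 0.5;          // cell-centred coordinates
            double f = std::floor(s);
            i[k] = ((int)f % NG + NG) % NG;          // periodic wrap
            d[k] = s - f;
        }
        // Distribute the mass to the 8 surrounding cells with CIC weights.
        for (int a = 0; a < 2; ++a)
          for (int b = 0; b < 2; ++b)
            for (int c = 0; c < 2; ++c) {
                double w = (a ? d[0] : 1.0 - d[0]) *
                           (b ? d[1] : 1.0 - d[1]) *
                           (c ? d[2] : 1.0 - d[2]);
                int ix = (i[0] + a) % NG, iy = (i[1] + b) % NG, iz = (i[2] + c) % NG;
                rho[(ix * NG + iy) * NG + iz] += w * p.mass;
            }
    }
    // Convert mass per cell to density.
    for (double& r : rho) r /= cell * cell * cell;
}
```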

  7. TreePM N-body solver
● Tree for short-range forces
● PM (particle mesh) for long-range forces (see the force-split sketch below)
● O(N^2) reduced to O(N log N)
● Suitable for periodic boundaries
● Became popular in the 21st century:
  ● GOTPM (Dubinski+ 2004)
  ● GADGET-2 (Springel 2005)
  ● GreeM (Ishiyama+ 2009)
  ● HACC (Habib+ 2012)
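The core of a TreePM solver is splitting the 1/r^2 force into a short-range part (handled by the tree, cut off beyond a few splitting lengths) and a long-range part (handled by the PM mesh). The erfc-based splitting below is the standard choice used by e.g. GADGET-2; the cutoff function actually used in GreeM differs in detail, so this is a sketch of the general technique only.

```cpp
// Short-range part of the TreePM force split (Plummer softening, G = 1).
#include <cmath>

// Factor multiplying the Newtonian force m/r^2; r_s is the splitting scale
// (of order the PM mesh spacing).  It tends to 1 for r << r_s and to 0 for
// r >> r_s, where the PM part takes over.
double shortrange_factor(double r, double r_s)
{
    const double u = r / (2.0 * r_s);
    const double two_over_sqrtpi = 2.0 / std::sqrt(std::acos(-1.0));
    return std::erfc(u) + two_over_sqrtpi * u * std::exp(-u * u);
}

// Pairwise short-range acceleration on particle i from particle j.
void add_shortrange_accel(const double xi[3], const double xj[3],
                          double mj, double r_s, double eps2, double ai[3])
{
    double dx[3], r2 = eps2;                 // Plummer softening eps^2
    for (int k = 0; k < 3; ++k) { dx[k] = xj[k] - xi[k]; r2 += dx[k] * dx[k]; }
    const double r     = std::sqrt(r2);
    const double mrinv3 = mj * shortrange_factor(r, r_s) / (r2 * r);
    for (int k = 0; k < 3; ++k) ai[k] += mrinv3 * dx[k];
}
```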

  8. Parallel calculation procedure
1. 3D domain decomposition, redistribute particles (short and long communication)
2. Calculate long-range forces (PM) (long communication)
3. Calculate short-range forces (Tree) (short communication)
4. Time integration (second-order leapfrog; see the sketch below)
5. Return to 1
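Step 4 uses a second-order leapfrog integrator. Here is a minimal kick-drift-kick sketch; the Body layout and the accelerations() callback are illustrative assumptions, not the data structures of the actual code.

```cpp
// One second-order (kick-drift-kick) leapfrog step.
#include <functional>
#include <vector>

struct Body { double x[3], v[3], a[3]; };

void leapfrog_step(std::vector<Body>& bodies, double dt,
                   const std::function<void(std::vector<Body>&)>& accelerations)
{
    // Kick: advance velocities by half a step with the current accelerations.
    for (Body& b : bodies)
        for (int k = 0; k < 3; ++k) b.v[k] += 0.5 * dt * b.a[k];

    // Drift: advance positions by a full step.
    for (Body& b : bodies)
        for (int k = 0; k < 3; ++k) b.x[k] += dt * b.v[k];

    // Recompute accelerations (PM + tree in the real code).
    accelerations(bodies);

    // Kick: second half-step for the velocities.
    for (Body& b : bodies)
        for (int k = 0; k < 3; ++k) b.v[k] += 0.5 * dt * b.a[k];
}
```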

  9. Technical issues
● Dense structures form everywhere via the gravitational instability -> dynamic domain update with a good load balancer
● Gravity is a long-range force -> hierarchical communication
● O(N log N) is still expensive -> SIMD (GRAPEs, GPUs)

  10. Optimization 1: Dynamic domain update
● Domain geometry: space filling curve (∆, GADGET-2) vs. multisection (○, our code)
  ● Multisection enables each node to easily know where to perform short communications (see the sketch below)
● Balancing criterion: equal #particles (ⅹ), equal #interactions (∆), equal calculation time (○, our code)
(Figure: domain shapes of our code and of GADGET-2)
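The following is a minimal sketch of a 3D multisection layout: the box is cut into NX x NY x NZ rectangular domains along x, y and z (48 x 54 x 32 on the full K computer), so a node can locate the domains it must exchange short-range data with from the cut positions alone. The boundary arrays and the helper functions are illustrative assumptions.

```cpp
// 3D multisection layout: planar cuts along x, y and z.
#include <vector>

struct Multisection {
    int nx, ny, nz;                       // domains per dimension
    std::vector<double> xb, yb, zb;       // cut positions, sizes nx+1, ny+1, nz+1

    // Map an MPI rank to its (ix, iy, iz) grid position.
    void rank_to_grid(int rank, int& ix, int& iy, int& iz) const {
        ix = rank / (ny * nz);
        iy = (rank / nz) % ny;
        iz = rank % nz;
    }

    // Which domain owns a given position?  (Linear scan for clarity;
    // the boundaries are sorted, so a binary search would also work.)
    int owner(const double x[3]) const {
        auto find = [](const std::vector<double>& b, double v) {
            int i = 0;
            while (i + 2 < (int)b.size() && v >= b[i + 1]) ++i;
            return i;
        };
        int ix = find(xb, x[0]), iy = find(yb, x[1]), iz = find(zb, x[2]);
        return (ix * ny + iy) * nz + iz;
    }
};
```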

  11. Load balancer
● A sampling method determines the domain geometry (see the sketch below)
  ● Each node samples particles and shares them with all nodes
  ● The sampling rate R ~ 10^-3 to 10^-5 depends on the local calculation cost -> realizes a near-ideal load balance
● The new geometry is adjusted so that all domains contain the same number of sampled particles
● A linearly weighted moving average over the last 5 steps is used
(calculation cost -> #sampled particles -> new geometry -> new calculation cost)
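A minimal sketch of the sampling idea along one axis: each domain emits samples at a rate proportional to its measured calculation time, the samples are gathered on all nodes (e.g. with MPI_Allgatherv), and the new cuts are placed so that every domain holds the same number of samples. Function names and the single-axis restriction are assumptions made for the example.

```cpp
// Sampling-based load balancing along one axis.
#include <algorithm>
#include <random>
#include <vector>

// The heavier the local cost, the more samples this domain emits, so equal
// sample counts per domain translate into roughly equal work per domain.
std::vector<double> sample_positions(const std::vector<double>& x,
                                     double local_cost, double cost_per_sample)
{
    std::mt19937 rng(12345);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    const double rate = std::min(1.0, local_cost / cost_per_sample / x.size());
    std::vector<double> s;
    for (double xi : x)
        if (uni(rng) < rate) s.push_back(xi);
    return s;
}

// Given all gathered samples, place ndom-1 cuts so that each of the ndom
// domains receives the same number of samples.
std::vector<double> new_boundaries(std::vector<double> samples, int ndom,
                                   double xmin, double xmax)
{
    std::sort(samples.begin(), samples.end());
    std::vector<double> b(ndom + 1, xmin);
    b.back() = xmax;
    for (int i = 1; i < ndom && !samples.empty(); ++i)
        b[i] = samples[samples.size() * i / ndom];
    return b;
}
```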

  12. Optimization 2: Parallel PM
1. Density assignment
2. Local density -> slab density (long communication)
3. FFT (density -> potential) (long communication; FFTW, Fujitsu FFT; see the sketch below)
4. Slab potential -> local potential (long communication)
5. Force interpolation
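Step 3 runs on a slab-decomposed mesh. The sketch below uses the fftw3-mpi interface, which distributes the mesh over ranks as slabs along the first dimension; the mesh size NG is an assumption, and the Green's-function multiplication and backward transform are omitted.

```cpp
// Slab-decomposed real-to-complex FFT of the PM density with fftw3-mpi.
#include <fftw3-mpi.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    const ptrdiff_t NG = 512;                       // PM mesh size (assumed)
    ptrdiff_t local_n0, local_0_start;
    ptrdiff_t alloc = fftw_mpi_local_size_3d(NG, NG, NG / 2 + 1,
                                             MPI_COMM_WORLD,
                                             &local_n0, &local_0_start);

    // Real-space slab (padded in the last dimension) and its transform.
    double*       rho  = fftw_alloc_real(2 * alloc);
    fftw_complex* rhok = fftw_alloc_complex(alloc);

    fftw_plan fwd = fftw_mpi_plan_dft_r2c_3d(NG, NG, NG, rho, rhok,
                                             MPI_COMM_WORLD, FFTW_ESTIMATE);

    // ... fill rho with the slab density received from the 3D domains ...
    fftw_execute(fwd);
    // ... multiply rhok by the Green's function, transform back, and return
    //     the slab potential to the 3D domains ...

    fftw_destroy_plan(fwd);
    fftw_free(rho);
    fftw_free(rhok);
    MPI_Finalize();
    return 0;
}
```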

  13. Optimization 2: Numerical challenge
● If we simulate ~10000^3 particles with a ~5000^3 PM mesh on the full system of the K computer...
  ● The FFT is performed by only ~5000 nodes (FFTW), since only the slab mesh layout is supported
  ● Particles are distributed over a 3D domain (48 x 54 x 32 = 82944 nodes)
● Each FFT process has to receive density meshes from ~5000 nodes
  ● MPI_Isend / MPI_Irecv would not work
  ● MPI_Alltoallv is easy to implement, but...

  14. Optimization 2: Novel communication algorithm
● Hierarchical communication (see the sketch below):
  MPI_Alltoallv(..., MPI_COMM_WORLD) ->
  MPI_Alltoallv(..., COMM_SMALLA2A) + MPI_Reduce(..., COMM_REDUCE)
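A minimal sketch of such a two-level scheme: MPI_COMM_WORLD is split into small groups; contributions are first combined inside each group with MPI_Reduce (COMM_REDUCE), and only the group leaders exchange the combined data with a much smaller MPI_Alltoallv (COMM_SMALLA2A). The group size, buffer sizes, and even element counts are illustrative assumptions, not the parameters of the actual code.

```cpp
// Hierarchical communication: group-local reduce, then a small all-to-all.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int group_size = 16;                      // assumed
    const int group      = rank / group_size;       // which group
    const int local      = rank % group_size;       // position inside it

    // COMM_REDUCE: all ranks of one group; COMM_SMALLA2A: the group leaders.
    MPI_Comm comm_reduce, comm_smalla2a;
    MPI_Comm_split(MPI_COMM_WORLD, group, local, &comm_reduce);
    MPI_Comm_split(MPI_COMM_WORLD, local == 0 ? 0 : MPI_UNDEFINED,
                   group, &comm_smalla2a);

    const int nmesh = 1024;                         // local chunk size (assumed)
    std::vector<double> dens(nmesh, 1.0), summed(nmesh, 0.0);

    // Step 1: combine the densities of the group onto the leader.
    MPI_Reduce(dens.data(), summed.data(), nmesh, MPI_DOUBLE, MPI_SUM,
               0, comm_reduce);

    // Step 2: only the leaders take part in the (much smaller) all-to-all.
    if (comm_smalla2a != MPI_COMM_NULL) {
        int nlead;
        MPI_Comm_size(comm_smalla2a, &nlead);
        std::vector<int> counts(nlead, nmesh / nlead), displs(nlead, 0);
        for (int i = 1; i < nlead; ++i) displs[i] = displs[i - 1] + counts[i - 1];
        std::vector<double> recv(nmesh, 0.0);
        MPI_Alltoallv(summed.data(), counts.data(), displs.data(), MPI_DOUBLE,
                      recv.data(),   counts.data(), displs.data(), MPI_DOUBLE,
                      comm_smalla2a);
        MPI_Comm_free(&comm_smalla2a);
    }
    MPI_Comm_free(&comm_reduce);
    MPI_Finalize();
    return 0;
}
```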

  15. Optimization 3: Gravity kernel
This implementation was done by Keigo Nitadori as an extension of the Phantom-GRAPE library (Nitadori+ 2006, Tanikawa+ 2012, 2013).
1. One if-branch is removed
2. Optimized for SIMD hardware with FMA
● Theoretical limit is 12 Gflops/core (16 Gflops for DGEMM)
● 11.65 Gflops on a simple kernel benchmark
  ● 97% of the theoretical limit (73% of the peak)
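The slide does not spell out which branch was removed. As a generic illustration of a branch-free, SIMD-friendly inner loop, here is a softened direct-summation kernel: with Plummer softening eps2 > 0, the i == j self term contributes zero force, so no conditional is needed inside the loop and the compiler can vectorize it with FMA instructions. This is a sketch, not the Phantom-GRAPE kernel itself.

```cpp
// Branch-free softened gravity kernel (structure-of-arrays layout, G = 1).
#include <cmath>
#include <cstddef>

void gravity_kernel(std::size_t n,
                    const double* x, const double* y, const double* z,
                    const double* m, double eps2,
                    double* ax, double* ay, double* az)
{
    for (std::size_t i = 0; i < n; ++i) {
        double axi = 0.0, ayi = 0.0, azi = 0.0;
        for (std::size_t j = 0; j < n; ++j) {          // no i != j branch
            const double dx = x[j] - x[i];
            const double dy = y[j] - y[i];
            const double dz = z[j] - z[i];
            const double r2 = dx * dx + dy * dy + dz * dz + eps2;
            const double rinv   = 1.0 / std::sqrt(r2);
            const double mrinv3 = m[j] * rinv * rinv * rinv;
            axi += mrinv3 * dx;                        // fused multiply-adds
            ayi += mrinv3 * dy;
            azi += mrinv3 * dz;
        }
        ax[i] = axi; ay[i] = ayi; az[i] = azi;
    }
}
```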

  16. Other optimizations
● K computer (torus network): assigning nodes so that the physical layout fits the network topology
● MPI + OpenMP hybrid: reduces network congestion
● Overlapping communication with calculation: hides communication (see the sketch below)
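A minimal sketch of overlapping communication with computation: non-blocking sends and receives for the boundary data are posted first, the purely local work is done while the messages are in flight, and the boundary work is finished after MPI_Waitall. The neighbour ranks and buffer names are illustrative assumptions.

```cpp
// Overlap: post non-blocking exchanges, compute local work, then wait.
#include <mpi.h>
#include <vector>

void exchange_and_compute(MPI_Comm comm, int left, int right,
                          std::vector<double>& send_left,
                          std::vector<double>& send_right,
                          std::vector<double>& recv_left,
                          std::vector<double>& recv_right)
{
    MPI_Request req[4];

    // Post non-blocking receives and sends for the two neighbours first.
    MPI_Irecv(recv_left.data(),  (int)recv_left.size(),  MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(recv_right.data(), (int)recv_right.size(), MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(send_right.data(), (int)send_right.size(), MPI_DOUBLE, right, 0, comm, &req[2]);
    MPI_Isend(send_left.data(),  (int)send_left.size(),  MPI_DOUBLE, left,  1, comm, &req[3]);

    // ... compute forces that need only local particles here (hides latency) ...

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    // ... now compute the remaining forces that involve the received data ...
}
```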

  17. All shuffle (e.g. for restart)
● MPI_Alltoallv(..., MPI_COMM_WORLD) is easy to write, but if p >~ 10000...
  ● Communication cost O(p log p)
  ● Is the communication buffer sufficient?
● Instead, split MPI_COMM_WORLD (see the sketch below):
  MPI_Alltoallv(..., COMM_X) + MPI_Alltoallv(..., COMM_Y) + MPI_Alltoallv(..., COMM_Z)
  ● O(3 p^{1/3} log p^{1/3}) = O(p^{1/3} log p)
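A minimal sketch of the three-stage shuffle: the process grid is split into x-, y- and z-communicators, and one global all-to-all is replaced by three much smaller MPI_Alltoallv calls (first along x, then y, then z). Only the communicator construction is shown; forwarding the data between the stages is omitted. The 48 x 54 x 32 grid is taken from the slides, the rest is an assumption for the example.

```cpp
// Split MPI_COMM_WORLD into COMM_X, COMM_Y, COMM_Z for the three-stage shuffle.
#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int NX = 48, NY = 54, NZ = 32;      // process grid of the full system
    const int ix = rank / (NY * NZ);
    const int iy = (rank / NZ) % NY;
    const int iz = rank % NZ;

    // COMM_X: ranks sharing (iy, iz); COMM_Y: sharing (ix, iz); COMM_Z: sharing (ix, iy).
    MPI_Comm comm_x, comm_y, comm_z;
    MPI_Comm_split(MPI_COMM_WORLD, iy * NZ + iz, ix, &comm_x);
    MPI_Comm_split(MPI_COMM_WORLD, ix * NZ + iz, iy, &comm_y);
    MPI_Comm_split(MPI_COMM_WORLD, ix * NY + iy, iz, &comm_z);

    // Stage 1: MPI_Alltoallv(..., comm_x) moves data to the correct x-slab.
    // Stage 2: MPI_Alltoallv(..., comm_y) moves it to the correct y-column.
    // Stage 3: MPI_Alltoallv(..., comm_z) delivers it to the final owner.

    MPI_Comm_free(&comm_x);
    MPI_Comm_free(&comm_y);
    MPI_Comm_free(&comm_z);
    MPI_Finalize();
    return 0;
}
```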

  18. Performance results on the K computer
● Scalability (2048^3 - 10240^3 particles): excellent strong scaling; the 10240^3 simulation scales well from 24576 to 82944 (full) nodes of the K computer
  (Figure: seconds per step vs. number of nodes for 2048^3, 4096^3, 8192^3 and 10240^3 particles)
● Performance (12600^3 particles): the average performance on the full system (82944 = 48 x 54 x 32 nodes) is ~5.8 Pflops, ~55% of the peak speed
● Ishiyama, Nitadori, and Makino 2012 (arXiv:1211.4406), SC12 Gordon Bell Prize Winner

  19. Comparison with another code
● Our code (Ishiyama et al. 2012)
  ● 5.8 Pflops (55% of peak) on the K computer (10.6 Pflops)
  ● 2.5x10^-11 sec / substep / particle
● HACC (Habib et al. 2012)
  ● 14 Pflops (70% of peak) on Sequoia (20 Pflops)
  ● 6.0x10^-11 sec / substep / particle
● Our code can perform cosmological simulations ~5 times faster than the other optimized code

  20. Ongoing simulations
● Semi-analytic galaxy formation models
  ● 4096^3 particles in an 800 Mpc box
  ● 8192^3 particles in a 1600 Mpc box
● Cosmology
  ● 4096^3 particles in 3-6 Gpc boxes, with and without non-Gaussianity
● Dark matter detection
  ● 4096^3 - 10240^3 particles, ~1 kpc resolution

  21. Simulation setup: a 16,777,216,000 particle simulation. Visualization: Takaaki Takeda (4D2U, National Astronomical Observatory of Japan)

  22. Summary
● We developed the fastest TreePM code in the world:
  ● Dynamic domain update with a time-based load balancer
  ● Split MPI_COMM_WORLD
  ● Highly tuned gravity kernel for short-range forces
● The average performance on the full system of the K computer is ~5.8 Pflops, which corresponds to ~55% of the peak speed
● Our code can perform cosmological simulations about five times faster than the other optimized code -> larger, higher resolution simulations
