Large Scale Dark Matter Simulation - PowerPoint PPT Presentation



SLIDE 1

Large Scale Dark Matter Simulation

Tomoaki Ishiyama

University of Tsukuba (HPCI Strategic Field Program 5)

SLIDE 2

Standard cosmology

Cold dark matter (CDM) model

  • Dark matter dominates gravitational structure formation in the Universe.
  • Five times as abundant as baryonic matter
  • Primordial small density fluctuations
  • Smaller dark matter structures gravitationally collapse first, then merge into larger structures (dark matter halos)
  • Baryonic matter condenses in the potential wells of dark matter halos and forms galaxies

[Figure: dark matter halo, galaxy, large scale structure]

SLIDE 3

Large scale dark matter simulation

  • High resolution observations
  • Wide field
  • → large, high-resolution simulations are needed
  • 4096³-8192³ particles in 800-1600 Mpc boxes
  • Preparations for other large scale simulations
  • Galaxy formation
  • Star formation
SLIDE 4

Parallel Scaling

  • Parallel calculation is not trivial
  • communication overhead
  • Gravity: long-range force
  • long communication
  • Cosmological N-body simulation codes can run efficiently on systems with a few hundred nodes. But...

[Figure: cluster schematic - nodes (CPU + memory) connected by a fast network; Gadget-2 (Springel 2005)]

SLIDE 5

Here we use the K computer

World's third fastest / most power-consuming system

How can we get good efficiency on ~1M-core systems?

[Figure: PC cluster (~100 cores); CfCA Cray-XT4 (~700 nodes, ~3000 cores); CfCA Cray-XC30 (~1060 nodes, ~25440 cores); K computer (~80000 nodes, ~0.66M cores)]

SLIDE 6

Well-known N-body algorithms

  • Tree (Barnes and Hut, 1986)
  • Solves the forces from distant particles by multipole expansions
  • O(N log N)
  • Particle-Mesh (PM: e.g. Hockney and Eastwood, 1981)
  • Calculates the mass density on a coarse-grained mesh
  • Solves the Poisson equation by FFT (ρ → Φ; see the Fourier-space form below)
  • Advantage: very fast, and the periodic boundary is naturally satisfied
  • Disadvantage: resolution is restricted by the mesh size
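
In Fourier space the Poisson equation becomes algebraic, which is what makes the FFT-based PM step so cheap (k is the wavenumber):

    \nabla^2 \Phi = 4\pi G \rho
    \quad\xrightarrow{\ \mathrm{FFT}\ }\quad
    \hat{\Phi}(\mathbf{k}) = -\frac{4\pi G\,\hat{\rho}(\mathbf{k})}{k^2}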
SLIDE 7

TreePM N-body solver

  • Tree for short-range forces
  • PM (particle mesh) for long-range forces (a common force-splitting form is sketched after this list)
  • O(N²) to O(N log N)
  • Suitable for periodic boundaries
  • Became popular in the 21st century
  • GOTPM (Dubinski+ 2004)
  • GADGET-2 (Springel 2005)
  • GreeM (Ishiyama+ 2009)
  • HACC (Habib+ 2012)
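
One common form of the splitting, shown as a hedged illustration (GADGET-2 uses this error-function pair; GreeM's exact cutoff may differ), divides the 1/r potential at a scale r_s of a few mesh cells:

    \frac{1}{r}
    = \underbrace{\frac{\operatorname{erfc}(r/2r_s)}{r}}_{\text{Tree: short range}}
    + \underbrace{\frac{\operatorname{erf}(r/2r_s)}{r}}_{\text{PM: long range}}

The erfc term decays rapidly beyond a few r_s, so the tree walk can stop at a finite cutoff radius; the erf term is smooth and well resolved by the coarse PM mesh.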
SLIDE 8

Parallel calculation procedure

  • 1. 3D domain decomposition, re-distribute particles
  • short and long communication
  • 2. Calculate long-range forces (PM)
  • long communication
  • 3. Calculate short-range forces (Tree)
  • short communication
  • 4. Time integration (second-order leapfrog)
  • 5. Return to 1 (a minimal loop sketch follows)
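
A minimal sketch of this loop in MPI C; every function other than the MPI calls is a hypothetical placeholder for the stages named above, not the actual GreeM API:

    #include <mpi.h>

    /* Hypothetical stubs for the solver stages (placeholders, not GreeM). */
    static void decompose_domains_and_exchange(void) { /* step 1 */ }
    static void calc_long_range_pm(void)              { /* step 2 */ }
    static void calc_short_range_tree(void)           { /* step 3 */ }
    static void kick_drift_kick(double dt)            { (void)dt; /* step 4 */ }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        const double dt = 0.01;                  /* assumed fixed timestep */
        for (int step = 0; step < 100; step++) {
            decompose_domains_and_exchange();    /* 1. short + long communication */
            calc_long_range_pm();                /* 2. long communication (FFT)   */
            calc_short_range_tree();             /* 3. short communication        */
            kick_drift_kick(dt);                 /* 4. second-order leapfrog      */
        }                                        /* 5. return to 1                */
        MPI_Finalize();
        return 0;
    }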
SLIDE 9

Technical issues

  • Dense structures form everywhere via the gravitational instability
  • Dynamic domain update with a good load balancer
  • Gravity is a long-range force
  • Hierarchical communication
  • O(N log N) is still expensive
  • SIMD
  • (GRAPEs, GPU)
SLIDE 10

Optimization 1

Dynamic domain update

  • ∆ space filling curve
  • ○ multisection
  • Enables each node to know easily where to perform short communications
  • ✕ equal #particles
  • ∆ equal #interactions
  • ○ equal calculation time

[Figure: domain geometries of our code (multisection) vs. GADGET-2 (space-filling curve)]

SLIDE 11

Load balancer

  • Sampling method to determine the geometry
  • Each node samples particles and shares them with all nodes
  • The sampling frequency depends on the local calculation cost
  • → realizes a near-ideal load balance
  • The new geometry is adjusted so that all domains have the same number of sampled particles
  • Linear weighted moving average over the last 5 steps
  • Sampling rate R ~ 10⁻³-10⁻⁵ (a sampling sketch follows the figure note)

[Figure: calculation cost → #sampled particles → new geometry → new calculation cost]
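
A minimal sketch in C of the cost-weighted sampling, under stated assumptions: the array layout, RNG, and helper name are illustrative, since the slides fix only the idea and R ~ 10⁻³-10⁻⁵:

    #include <stdlib.h>

    /* Sketch: each node draws samples at a rate proportional to its local
     * calculation cost, so that domains holding equal numbers of samples
     * correspond to equal calculation time. */
    int sample_particles(const double (*pos)[3], int n_local,
                         double t_local, double t_mean, double R,
                         double (*samples)[3])
    {
        double rate = R * (t_local / t_mean);  /* costlier node => more samples */
        int n_samp = 0;
        for (int i = 0; i < n_local; i++) {
            if (rand() / (double)RAND_MAX < rate) {
                samples[n_samp][0] = pos[i][0];
                samples[n_samp][1] = pos[i][1];
                samples[n_samp][2] = pos[i][2];
                n_samp++;
            }
        }
        return n_samp;  /* MPI_Allgatherv the samples next; the multisection
                           boundaries are then moved so every domain holds the
                           same number of sampled particles */
    }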

SLIDE 12

Optimization 2

Parallel PM

1. Density assignment
2. Local density → slab density (long communication)
3. FFT (density → potential) (long communication; FFTW, Fujitsu FFT)
4. Slab potential → local potential (long communication)
5. Force interpolation (a minimal FFTW-MPI sketch of step 3 follows)
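
A minimal sketch of the slab-decomposed forward FFT of step 3 using FFTW's MPI interface; the mesh size N is a placeholder, the Green's-function multiplication is omitted, and the production code can also use the Fujitsu FFT:

    #include <mpi.h>
    #include <fftw3-mpi.h>

    /* Sketch: transform the slab-decomposed density to Fourier space.
     * Deconvolution and the multiplication by the Green's function
     * (-4*pi*G/k^2, per the Poisson relation on slide 6) are omitted. */
    void pm_forward_fft(ptrdiff_t N, MPI_Comm comm)
    {
        ptrdiff_t local_n0, local_0_start;
        fftw_mpi_init();
        ptrdiff_t alloc = fftw_mpi_local_size_3d(N, N, N / 2 + 1, comm,
                                                 &local_n0, &local_0_start);
        double       *rho   = fftw_alloc_real(2 * alloc);  /* local slab */
        fftw_complex *rho_k = fftw_alloc_complex(alloc);

        fftw_plan fwd = fftw_mpi_plan_dft_r2c_3d(N, N, N, rho, rho_k,
                                                 comm, FFTW_ESTIMATE);
        /* ... fill rho[] with the slab density received in step 2 ... */
        fftw_execute(fwd);   /* rho -> rho(k): the long communication */
        fftw_destroy_plan(fwd);
        fftw_free(rho);
        fftw_free(rho_k);
    }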
SLIDE 13

Optimization 2

Numerical challenge

  • If we simulate ~10000³ particles with a ~5000³ PM mesh on the full system of the K computer...
  • the FFT is performed by only ~5000 nodes (FFTW), since only the slab mesh layout is supported
  • Particles are distributed in a 3D domain (48x54x32 = 82944 nodes)
  • MPI_Isend, MPI_Irecv would not work
  • MPI_Alltoallv is easy to implement, but...

Each FFT process has to receive density meshes from ~5000 nodes

SLIDE 14

Optimization 2

Novel communication algorithm

  • Hierarchical communication (a communicator-splitting sketch follows)
  • MPI_Alltoallv(..., MPI_COMM_WORLD) →
  • MPI_Alltoallv(..., COMM_SMALLA2A)
  • + MPI_Reduce(..., COMM_REDUCE)
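
A sketch of how the two communicators could be built with MPI_Comm_split; the names COMM_SMALLA2A and COMM_REDUCE come from the slide, but the grouping key shown is an illustrative assumption:

    #include <mpi.h>

    /* Sketch: one world-wide Alltoallv becomes a small all-to-all inside
     * groups of `group_size` ranks plus a reduction across groups. */
    void make_hierarchical_comms(int group_size,
                                 MPI_Comm *comm_smalla2a, MPI_Comm *comm_reduce)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* Ranks in the same block of group_size do the small all-to-all;  */
        MPI_Comm_split(MPI_COMM_WORLD, rank / group_size, rank, comm_smalla2a);
        /* ranks holding the same position in each block reduce together.  */
        MPI_Comm_split(MPI_COMM_WORLD, rank % group_size, rank, comm_reduce);
        /* Then: MPI_Alltoallv(..., *comm_smalla2a) followed by
         *       MPI_Reduce(...,   *comm_reduce). */
    }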
SLIDE 15

SLIDE 16

Optimization 3

Gravity kernel

This implementation was done by Keigo Nitadori as an extension of the Phantom-GRAPE library (Nitadori+ 2006, Tanikawa+ 2012, 2013).

  • 1. One if-branch is removed (an illustrative branch-free kernel is sketched below)
  • 2. Optimized for SIMD hardware with FMA
  • Theoretical limit is 12 Gflops/core (16 Gflops for DGEMM)
  • 11.65 Gflops on a simple kernel benchmark
  • 97% of the theoretical limit (73% of the peak)
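
The actual kernel is described in the cited papers; as a hedged scalar illustration, Plummer softening keeps the denominator positive, so the i = j self-term contributes exactly zero and no branch is needed, leaving a chain of FMA-friendly multiply-adds for the SIMD units:

    #include <math.h>

    /* Illustrative scalar particle-particle kernel.  Softening eps2 keeps
     * r2 > 0, so the usual `if (i == j) continue;` branch is not needed:
     * the self term has dx = dy = dz = 0 and adds exactly zero. */
    void gravity_kernel(int n, const double (*pos)[3], const double *m,
                        double eps2, double (*acc)[3])
    {
        for (int i = 0; i < n; i++) {
            double ax = 0.0, ay = 0.0, az = 0.0;
            for (int j = 0; j < n; j++) {
                double dx = pos[j][0] - pos[i][0];
                double dy = pos[j][1] - pos[i][1];
                double dz = pos[j][2] - pos[i][2];
                double r2 = dx*dx + dy*dy + dz*dz + eps2;
                double rinv = 1.0 / sqrt(r2);     /* rsqrt on SIMD units */
                double mr3 = m[j] * rinv * rinv * rinv;
                ax += mr3 * dx;
                ay += mr3 * dy;
                az += mr3 * dz;
            }
            acc[i][0] = ax; acc[i][1] = ay; acc[i][2] = az;
        }
    }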
SLIDE 17

Other optimizations

  • Assigning nodes so that the physical layout fits the network topology
  • MPI + OpenMP hybrid
  • Reduces network congestion
  • Overlapping communication with calculation
  • Hidden communication

[Figure: K computer (torus network)]

SLIDE 18

All shuffle

  • e.g., restart
  • "Alltoallv" is easy. But if p ≳ 10000...
  • Communication cost O(p log p)
  • Is the communication buffer sufficient?
  • Split MPI_COMM_WORLD (a splitting sketch follows)

MPI_Alltoallv(..., MPI_COMM_WORLD): O(p log p)

MPI_Alltoallv(..., COMM_X) + MPI_Alltoallv(..., COMM_Y) + MPI_Alltoallv(..., COMM_Z): O(3 p^(1/3) log p^(1/3)) = O(p^(1/3) log p)
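
A sketch of building COMM_X/COMM_Y/COMM_Z from a 3D process grid with MPI_Cart_sub; the grid dimensions are placeholders, and whether the real code uses Cartesian communicators or plain MPI_Comm_split is an assumption:

    #include <mpi.h>

    /* Sketch: build per-axis communicators for a 3-stage all-shuffle on a
     * dims[0] x dims[1] x dims[2] = p process grid. */
    void make_axis_comms(int dims[3], MPI_Comm *comm_x,
                         MPI_Comm *comm_y, MPI_Comm *comm_z)
    {
        MPI_Comm cart;
        int periods[3] = {0, 0, 0};
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &cart);

        int keep_x[3] = {1, 0, 0};   /* ranks sharing y,z coordinates */
        int keep_y[3] = {0, 1, 0};
        int keep_z[3] = {0, 0, 1};
        MPI_Cart_sub(cart, keep_x, comm_x);
        MPI_Cart_sub(cart, keep_y, comm_y);
        MPI_Cart_sub(cart, keep_z, comm_z);
        /* Shuffle: MPI_Alltoallv along comm_x, repack, along comm_y,
         * repack, along comm_z; each stage involves only ~p^(1/3) ranks. */
    }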

SLIDE 19

Performance results on K computer

  • Scalability (2048³-10240³)
  • Excellent strong scaling
  • The 10240³ simulation scales well from 24576 to 82944 (full) nodes of the K computer
  • Performance (12600³)
  • The average performance on the full system (82944 = 48x54x32 nodes) is ~5.8 Pflops, ~55% of the peak speed

[Figure: wall-clock time per step (sec/step) vs. number of nodes for 2048³, 4096³, 8192³, and 10240³ particles]

Ishiyama, Nitadori, and Makino, 2012 (arXiv:1211.4406), SC12 Gordon Bell Prize Winner

SLIDE 20

Comparison with other codes

  • Our code (Ishiyama et al. 2012)
  • 5.8 Pflops (55% of peak) on K computer (10.6 Pflops)
  • 2.5×10⁻¹¹ sec / substep / particle
  • HACC (Habib et al. 2012)
  • 14 Pflops (70% of peak) on Sequoia (20 Pflops)
  • 6.0×10⁻¹¹ sec / substep / particle

Our code can perform cosmological simulations ~5 times faster than the other optimized code

SLIDE 21

Ongoing simulations

  • Semi-analytic galaxy formation model
  • 4096³, 800 Mpc box
  • 8192³, 1600 Mpc
  • Cosmology
  • 4096³, 3-6 Gpc, with and without non-Gaussianity
  • Dark matter detection
  • 4096³-10240³, ~1 kpc
SLIDE 22

SLIDE 23

Simulation setup

Visualization: Takaaki Takeda (4D2U, National Astronomical Observatory of Japan)

16,777,216,000-particle simulation

SLIDE 24

Summary

  • We developed the fastest TreePM code in the world:
  • Dynamic domain update with a time-based load balancer
  • Split MPI_COMM_WORLD
  • Highly tuned gravity kernel for short-range forces
  • The average performance on the full system of the K computer is ~5.8 Pflops, which corresponds to ~55% of the peak speed
  • Our code can perform cosmological simulations about five times faster than the other optimized code
  • → larger, higher resolution simulations