ChaNGa: The Charm N-Body GrAvity Solver

Filippo Gioachin, Pritish Jetley, Celso Mendes, Laxmikant Kale (University of Illinois at Urbana-Champaign)
Thomas Quinn (University of Washington)
Parallel Programming Laboratory @ UIUC 04/23/07
Outline
- Motivations
- Algorithm overview
- Scalability
- Load balancer
- Multistepping
Motivations
- Need for simulations of the evolution of the universe
- Current parallel codes:
– PKDGRAV
– Gadget
- Scalability problems:
– load imbalance
– expensive domain decomposition
– limited to about 128 processors
ChaNGa: main characteristics
- Simulator of cosmological interactions
– Newtonian gravity
– periodic boundary conditions
– multiple timestepping
- Particle based (Lagrangian)
– high resolution where needed
– based on tree structures
- Implemented in Charm++
– work divided among chares called TreePieces
– processor-level optimization using a Charm++ group called CacheManager
Space decomposition
[Diagram: space divided among TreePiece 1, TreePiece 2, TreePiece 3, ...]
Basic algorithm ...
- Newtonian gravity interaction
– Each particle is influenced by all others: O(n²) algorithm
- Barnes-Hut approximation: O(n log n)
– influence from distant particles combined into a center of mass (see the sketch below)
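A minimal serial sketch of the Barnes-Hut acceptance test, assuming a simple octree with per-node centers of mass (illustrative code, not ChaNGa's implementation; the gravitational constant and softening are omitted):

```cpp
#include <cmath>
#include <utility>
#include <vector>

struct Vec3 { double x = 0, y = 0, z = 0; };

struct Node {
    Vec3 com;                      // center of mass of everything below this node
    double mass = 0;               // total mass below this node
    double size = 0;               // side length of the node's bounding box
    Node* child[8] = {};           // all nullptr at a leaf
    std::vector<std::pair<Vec3, double>> particles;  // position/mass pairs, leaves only
};

// Acceleration on a test position p, for an opening angle theta:
// smaller theta means more accuracy and more work.
Vec3 gravity(const Node* n, const Vec3& p, double theta) {
    Vec3 acc;
    if (!n || n->mass == 0) return acc;
    double dx = n->com.x - p.x, dy = n->com.y - p.y, dz = n->com.z - p.z;
    double d = std::sqrt(dx*dx + dy*dy + dz*dz);
    if (d > 0 && n->size / d < theta) {
        // Far enough away: the whole subtree acts as one point mass,
        // which is what makes the algorithm O(n log n) instead of O(n^2).
        double f = n->mass / (d * d * d);
        acc.x = f * dx; acc.y = f * dy; acc.z = f * dz;
    } else if (!n->child[0]) {
        // Leaf that must be opened: sum over its particles directly.
        for (const auto& q : n->particles) {
            double qx = q.first.x - p.x, qy = q.first.y - p.y, qz = q.first.z - p.z;
            double r = std::sqrt(qx*qx + qy*qy + qz*qz);
            if (r == 0) continue;                    // skip self-interaction
            double f = q.second / (r * r * r);
            acc.x += f * qx; acc.y += f * qy; acc.z += f * qz;
        }
    } else {
        // Too close: open the node and recurse on its children.
        for (const Node* c : n->child) {
            Vec3 a = gravity(c, p, theta);
            acc.x += a.x; acc.y += a.y; acc.z += a.z;
        }
    }
    return acc;
}
```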
... in parallel
- Remote data
– needs to be fetched from other processors
- Data reuse
– the same data is needed by more than one particle
Overall algorithm
[Diagram: on Processor 1, TreePieces A, B, and C each run from "start computation" to "end computation": first a prefetch visit of the tree, then local work at low priority overlapped with remote and global work at high priority. On a miss during the remote walk, the TreePiece requests the node from the processor-level CacheManager: if the data is present it is returned immediately (YES); otherwise (NO) the request is buffered, the node is fetched from the owning TreePiece on another processor (Processor 2 ... Processor n), which replies with the requested data, and the waiting TreePiece is resumed via a callback.]
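A minimal sketch of that request/callback pattern, with illustrative names (ChaNGa's actual CacheManager is a Charm++ group; this plain C++ sketch shows only the miss-buffering logic):

```cpp
#include <functional>
#include <unordered_map>
#include <vector>

struct NodeData;                      // cached tree-node payload
using NodeKey  = unsigned long long;  // global key identifying a tree node
using Callback = std::function<void(const NodeData*)>;

class CacheManager {
    std::unordered_map<NodeKey, const NodeData*> cache_;          // nodes already fetched
    std::unordered_map<NodeKey, std::vector<Callback>> pending_;  // requests in flight
public:
    // Called by a TreePiece during its remote walk. Returns the node if
    // present; otherwise buffers the callback and fetches the node from
    // its owner exactly once, no matter how many TreePieces on this
    // processor ask for it.
    const NodeData* request(NodeKey key, Callback cb) {
        auto hit = cache_.find(key);
        if (hit != cache_.end()) return hit->second;    // YES: return immediately
        bool firstRequest = pending_.find(key) == pending_.end();
        pending_[key].push_back(std::move(cb));         // buffer the requester
        if (firstRequest) sendFetch(key);               // NO: one message to the owner
        return nullptr;                                 // caller suspends this walk
    }

    // Called when the owning processor replies with the data.
    void receive(NodeKey key, const NodeData* data) {
        cache_[key] = data;
        for (auto& cb : pending_[key]) cb(data);        // resume all buffered walks
        pending_.erase(key);
    }
private:
    void sendFetch(NodeKey key);  // send a request message to the remote owner
};
```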
Systems
System     | Location   | Procs  | Procs per node | CPU                 | Memory per node | Network
Tungsten   | NCSA       | 2,560  | 2              | Xeon 3.2 GHz        | 3 GB            | Myrinet
Cray XT3   | Pittsburgh | 4,136  | 2              | Opteron 2.6 GHz     | 2 GB            | Torus
BlueGene/L | IBM-Watson | 40,000 | 2              | PowerPC 440 700 MHz | 512 MB          | Torus
Scaling: comparison
[Plot: lambs 3M dataset on Tungsten]
Scaling: IBM BlueGene/L
Scaling: Cray XT3
Load balancing with OrbLB
[Timeline plot: lambs 5M on 1,024 BlueGene/L processors; x-axis: time, y-axis: processors; white is good]
Scaling with load balancing
[Plot: number of processors × execution time per iteration (s)]
Multistepping
- Particles with higher accelerations require smaller integration timesteps to be integrated accurately.
- Compute the particles with the highest accelerations every step, and particles with lower accelerations only every few steps (see the sketch below).
- Steps therefore differ in terms of load.
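A hedged sketch of one common rung-based multistepping scheme; the dt ∝ √(ε/|a|) criterion and all names here are illustrative assumptions, not necessarily ChaNGa's exact rule:

```cpp
#include <cmath>

// Pick the rung for a particle: rung r is integrated with timestep
// dtMax / 2^r, so higher acceleration => deeper rung => smaller step.
int chooseRung(double accelMag, double softening, double dtMax, int maxRung) {
    double dt = std::sqrt(softening / accelMag);   // illustrative criterion
    int rung = 0;
    while (rung < maxRung && dtMax / (1 << rung) > dt) ++rung;
    return rung;
}

// A big step is split into 2^maxRung substeps; rung r is active (has its
// forces recomputed) every 2^(maxRung - r) substeps, so only the
// highest-acceleration particles are integrated on every substep.
bool isActive(int rung, int substep, int maxRung) {
    return substep % (1 << (maxRung - rung)) == 0;
}
```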
ChaNGa scalability - multistepping
[Plot: dwarf 5M dataset on Tungsten]
Future work
- Adding new physics
– Smoothed Particle Hydrodynamics
- More load balancing / scalability work
– reducing the overhead of communication
– load balancing without increasing communication volume
– multiphase load balancing for multistepping
– other phases of the computation
Questions?
Thank you
Decomposition types
- OCT
– Contiguous cubic volume of space to each TreePiece
- SFC – Morton and Peano-Hilbert
– Space-Filling Curve imposes a total ordering of the particles (see the Morton-key sketch after this list)
– a segment of this line goes to each TreePiece
- ORB
– space divided by Orthogonal Recursive Bisection on the number of particles
– contiguous non-cubic volume of space to each TreePiece
– due to the shapes of the decomposition, requires more computation to produce correct results
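As an illustration of the SFC option, a standard 21-bit-per-axis Morton (Z-order) key built by bit interleaving; quantizing the positions to integers inside the simulation box is assumed to have happened already:

```cpp
#include <cstdint>

// Spread the low 21 bits of v so there are two zero bits between each bit
// (standard magic-number sequence for 3D Morton keys).
uint64_t spreadBits(uint64_t v) {
    v &= 0x1FFFFF;                                 // keep 21 bits (3*21 = 63 key bits)
    v = (v | (v << 32)) & 0x1F00000000FFFFULL;
    v = (v | (v << 16)) & 0x1F0000FF0000FFULL;
    v = (v | (v << 8))  & 0x100F00F00F00F00FULL;
    v = (v | (v << 4))  & 0x10C30C30C30C30C3ULL;
    v = (v | (v << 2))  & 0x1249249249249249ULL;
    return v;
}

// Interleaving x, y, z bits gives a total ordering of particles along the
// Z-order curve; each TreePiece then takes one contiguous key segment.
uint64_t mortonKey(uint32_t x, uint32_t y, uint32_t z) {
    return (spreadBits(z) << 2) | (spreadBits(y) << 1) | spreadBits(x);
}
```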
Serial performance
Execution time on Tungsten (in seconds), lambs datasets:

Simulator       | 30,000 | 300,000 | 1,000,000 | 3,000,000
PKDGRAV         | 0.8    | 12.0    | 48.5      | 170.0
ChaNGa          | 0.8    | 13.2    | 53.6      | 180.6
Time difference | 0.00%  | 9.09%   | 9.51%     | 5.87%
CacheManager importance
1 million lambs dataset on HPCx:

Number of processors             | 4      | 8      | 16     | 32     | 64
Messages (thousands), no cache   | 48,723 | 59,115 | 59,116 | 68,937 | 78,086
Messages (thousands), with cache | 72     | 115    | 169    | 265    | 397
Time (s), no cache               | 730.7  | 453.9  | 289.1  | 67.4   | 42.1
Time (s), with cache             | 39.0   | 20.4   | 11.3   | 6.0    | 3.3
Speedup                          | 18.74  | 22.25  | 25.58  | 11.23  | 12.76
Prefetching
1) Explicit
- before the force computation, data is requested for preload
2) Implicit in the cache
- the computation is performed with tree walks
- after visiting a node, its children will likely be visited
- while fetching remote nodes, the cache therefore prefetches some of their children (see the sketch below)
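A minimal sketch of the implicit-prefetch idea, under the assumption that the node's owner packs part of the subtree into the same reply; the names and the packing strategy are illustrative:

```cpp
#include <vector>

struct Node {
    Node* child[8] = {};   // all nullptr at a leaf
    // ... payload (center of mass, mass, bounding box) ...
};

// Collect the requested node plus its descendants down to prefetchDepth
// levels, to be serialized into one reply message: a walk that just
// visited this node will likely ask for the children next.
void packSubtree(const Node* n, int prefetchDepth, std::vector<const Node*>& out) {
    if (!n) return;
    out.push_back(n);
    if (prefetchDepth == 0) return;
    for (const Node* c : n->child)
        packSubtree(c, prefetchDepth - 1, out);
}
```

The prefetch depth trades message count against memory, which is exactly the trade-off shown in the plot on the next slide.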
Cache implicit prefetching
[Plot: execution time (in seconds) and memory consumption (in MB) as a function of the cache prefetch depth (1-3); lambs dataset on 64 processors of Tungsten]
Charm++ Overview
- Work decomposed into objects called chares
- Message-driven execution
- Mapping of objects to processors is transparent to the user
- Automatic load balancing
- Communication optimization
[Diagram: user view (chares) vs. system view (processors P1, P2, P3)]
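A minimal Charm++ sketch of the chare model (an illustrative "hello" module, not ChaNGa code): entry methods run when messages arrive, and the runtime owns the object-to-processor mapping.

```cpp
// Interface file (hello.ci), compiled by charmc into hello.decl.h/def.h:
//   mainmodule hello {
//     mainchare Main { entry Main(CkArgMsg* m); };
//     array [1D] Worker {
//       entry Worker();
//       entry void compute(int step);
//     };
//   };

#include "hello.decl.h"

class Main : public CBase_Main {
public:
    Main(CkArgMsg* m) {
        delete m;
        // Create 64 chares; the runtime maps them to processors and may
        // later migrate them for load balance, invisibly to this code.
        CProxy_Worker workers = CProxy_Worker::ckNew(64);
        workers.compute(0);  // asynchronous broadcast: one message per chare
        // (Termination, e.g. a reduction back to Main followed by CkExit(),
        //  is omitted for brevity.)
    }
};

class Worker : public CBase_Worker {
public:
    Worker() {}
    Worker(CkMigrateMessage*) {}  // required for migratable array elements
    void compute(int step) {
        // Message driven: this entry method runs when its message arrives.
        CkPrintf("chare %d running step %d on PE %d\n", thisIndex, step, CkMyPe());
    }
};

#include "hello.def.h"
```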
Tree decomposition
[Diagram: the global tree split among TreePiece 1, TreePiece 2, and TreePiece 3; node legend: Exclusive, Shared, Remote]
Scalability comparison (old result)
[Plot: dwarf 5M comparison on Tungsten; flat lines: perfect scaling, diagonal lines: no scaling]
ChaNGa scalability (old results)
[Plot: results on BlueGene/L; flat lines: perfect scaling, diagonal lines: no scaling]
Interaction list
[Diagram: a node X and TreePiece A]
Interaction lists
- Opening criteria for a node X: depending on the cut-off, node X is undecided, accepted, or opened
[Diagram: the three possible outcomes of the opening test on node X]
Interaction list
[Diagram: node X with its interaction list and check list; the children of X inherit X's check list]

- Double simultaneous walk over two copies of the tree:
1) one walk performs the force computation
2) the other walk exploits the observation that the children of a node can reuse its interaction and check lists (see the sketch below)
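A hedged sketch of the check-list reuse; the names and the helpers `children` and `applyForces` are assumptions, while the accepted / opened / undecided classification follows the previous slide:

```cpp
#include <cstddef>
#include <vector>

struct Node;                                   // tree node: size, center of mass, children
std::vector<Node*> children(Node* n);          // assumed helper: a node's children
void applyForces(Node* target,                 // assumed helper: accepted nodes act on
                 const std::vector<Node*>& accepted);  // all particles below target

enum class Test { Accept, Open, Undecided };
Test openingCriterion(Node* source, Node* target);  // geometric opening test

// Walk the local tree; each target node inherits its parent's check list,
// so an opening test resolved at this level is never repeated by the
// particles underneath it.
void walk(Node* target, std::vector<Node*> checkList) {
    std::vector<Node*> interactionList;  // nodes accepted at this level
    std::vector<Node*> undecided;        // nodes the children must re-test
    for (std::size_t i = 0; i < checkList.size(); ++i) {
        Node* s = checkList[i];
        switch (openingCriterion(s, target)) {
        case Test::Accept:                       // far enough for everything below target
            interactionList.push_back(s);
            break;
        case Test::Open:                         // too close: test its children instead
            for (Node* c : children(s)) checkList.push_back(c);
            break;
        case Test::Undecided:                    // let the children decide
            undecided.push_back(s);
            break;
        }
    }
    applyForces(target, interactionList);
    for (Node* c : children(target))
        walk(c, undecided);                      // children start from the undecided set
}
```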
Interaction list: results
Number of checks of the opening criteria (in millions):

                 | lambs 1M | dwarf 5M
Original code    | 120      | 1,108
Interaction list | 66       | 440

- 10% average performance improvement

[Plot: relative time vs. number of processors (32, 64, 128, 256, 512) for the original code and interaction lists; dwarf 5M on HPCx]
Load balancer
[Plot: dwarf 5M dataset on BlueGene/L; improvement between 15% and 35%; flat lines are good, rising lines are bad]