ChaNGa: The Charm N-Body GrAvity Solver

Filippo Gioachin¹ Pritish Jetley¹ Celso Mendes¹ Laxmikant Kale¹ Thomas Quinn²

¹ University of Illinois at Urbana-Champaign ² University of Washington

Parallel Programming Laboratory @ UIUC, 04/23/07

Outline

  • Motivations
  • Algorithm overview
  • Scalability
  • Load balancer
  • Multistepping

Motivations

  • Need for simulations of the evolution of the universe
  • Current parallel codes:
    – PKDGRAV
    – Gadget
  • Scalability problems:
    – load imbalance
    – expensive domain decomposition
    – limited to 128 processors


ChaNGa: main characteristics

  • Simulator of cosmological interaction
    – Newtonian gravity
    – Periodic boundary conditions
    – Multiple timestepping
  • Particle based (Lagrangian)
    – high resolution where needed
    – based on tree structures
  • Implemented in Charm++
    – work divided among chares called TreePieces
    – processor-level optimization using a Charm++ group called CacheManager


Space decomposition

[Figure: particles in space partitioned among TreePiece 1, TreePiece 2, TreePiece 3, ...]


Basic algorithm ...

  • Newtonian gravity interaction
    – Each particle is influenced by all others: O(n²) algorithm
  • Barnes-Hut approximation: O(n log n)
    – Influence from distant particles combined into a center of mass (see the sketch after this list)
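A minimal C++ sketch of the Barnes-Hut step (illustrative only, not ChaNGa's code; the node layout, the opening angle `theta`, and all names are assumptions): a node that subtends a small enough angle at the particle is accepted as a single center of mass, otherwise it is opened and its children visited.

#include <cmath>
#include <vector>

struct Vec3 { double x = 0, y = 0, z = 0; };

struct TreeNode {
    Vec3 com;                          // center of mass of the node
    double mass = 0.0;                 // total mass of contained particles
    double size = 0.0;                 // side length of the bounding box
    std::vector<TreeNode*> children;   // empty for a leaf
};

static double dist(const Vec3& a, const Vec3& b) {
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx*dx + dy*dy + dz*dz);
}

// Accumulate into `acc` the acceleration at `pos` due to `node`.
// `theta` is the opening angle (values around 0.7 are common).
void treeForce(const TreeNode* node, const Vec3& pos, double theta, Vec3& acc) {
    double d = dist(node->com, pos);
    if (d == 0.0) return;                              // skip self-interaction
    if (node->children.empty() || node->size / d < theta) {
        double f = node->mass / (d * d * d);           // G = 1, no softening here
        acc.x += f * (node->com.x - pos.x);
        acc.y += f * (node->com.y - pos.y);
        acc.z += f * (node->com.z - pos.z);
    } else {
        for (const TreeNode* c : node->children)       // open the node: recurse
            treeForce(c, pos, theta, acc);
    }
}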


... in parallel

  • Remote data
    – needs to be fetched from other processors
  • Data reuse
    – the same data is needed by more than one particle


Overall algorithm

[Diagram: per-iteration control flow. On each processor, TreePieces A, B, C first run a high-priority prefetch walk of the tree, then perform remote and global work at high priority and local work at low priority, from the start to the end of the computation. A miss during remote work sends a node request to the processor's CacheManager: if the node is present, it is returned immediately (YES: return); otherwise (NO: fetch) the request is buffered and the node is fetched from the owning TreePiece on another processor, whose reply triggers a callback that resumes the waiting walk. A sketch of this request pattern follows.]
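The hit/miss path in the diagram suggests a simple software-cache pattern. Below is a hypothetical C++ sketch of it (not ChaNGa's actual CacheManager API; `NodeData`, `fetchRemote`, and all names are assumptions): hits run the requester's callback at once; misses buffer callbacks and issue a single remote fetch per key.

#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

struct NodeData { /* multipole moments, bounding box, ... */ };
using NodeKey  = std::uint64_t;
using Callback = std::function<void(const NodeData&)>;

class CacheSketch {
    std::unordered_map<NodeKey, NodeData> cache_;                  // resolved nodes
    std::unordered_map<NodeKey, std::vector<Callback>> pending_;   // buffered misses
public:
    // Request a node: on a hit the callback runs immediately; on a miss it
    // is buffered, and one remote fetch per key is issued.
    void request(NodeKey key, Callback cb) {
        auto it = cache_.find(key);
        if (it != cache_.end()) { cb(it->second); return; }        // YES: return
        bool firstMiss = pending_.find(key) == pending_.end();
        pending_[key].push_back(std::move(cb));
        if (firstMiss) fetchRemote(key);                           // NO: fetch once
    }
    // Invoked when the owning processor replies with the requested data.
    void receive(NodeKey key, const NodeData& data) {
        cache_[key] = data;
        for (auto& cb : pending_[key]) cb(cache_[key]);            // resume waiters
        pending_.erase(key);
    }
private:
    void fetchRemote(NodeKey /*key*/) { /* send a request message to the owner */ }
};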


Systems

System      Location    Procs   Procs/node  CPU               Memory/node  Network
Tungsten    NCSA        2,560   2           Xeon 3.2 GHz      3 GB         Myrinet
Cray XT3    Pittsburgh  4,136   2           Opteron 2.6 GHz   2 GB         Torus
BlueGene/L  IBM-Watson  40,000  2           Power440 700 MHz  512 MB       Torus


Scaling: comparison

[Plot: lambs 3M dataset on Tungsten]


Scaling: IBM BlueGene/L


Scaling: Cray XT3


Load balancing with OrbLB

[Timeline plot: lambs 5M on 1,024 BlueGene/L processors; x-axis: processors, y-axis: time; white is good]


Scaling with load balancing

[Plot: Number of Processors × Execution Time per Iteration (s)]


Multistepping

  • Particles with higher accelerations require smaller timesteps to be integrated accurately.
  • Compute particles with the highest accelerations every step, and particles with lower accelerations every few steps.
  • Steps therefore differ in load. (A sketch of the rung assignment follows.)
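A hedged C++ sketch of the multistepping idea (illustrative, not ChaNGa's exact code; the timestep criterion dt ∝ sqrt(eps/|a|) and all names are assumptions): each particle's acceleration maps to a "rung", and rung r is integrated every 2^r-th fraction of the base timestep.

#include <cmath>

// Map a particle's acceleration magnitude to a rung: rung 0 takes the base
// timestep dtMax; rung r takes dtMax / 2^r. `eta` is an accuracy parameter,
// `eps` the gravitational softening.
int rungFromAcceleration(double aMag, double dtMax, double eta, double eps) {
    double dt = eta * std::sqrt(eps / aMag);       // desired timestep
    int rung = 0;
    double step = dtMax;
    while (step > dt) { step *= 0.5; ++rung; }     // halve until small enough
    return rung;
}

// On substep s of 2^maxRung substeps, a particle on rung r is integrated
// when s is a multiple of 2^(maxRung - r): the highest rung moves every substep.
bool isActive(int rung, int maxRung, int substep) {
    return substep % (1 << (maxRung - rung)) == 0;
}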

ChaNGa scalability - multistepping

[Plot: dwarf 5M dataset on Tungsten]


ChaNGa scalability - multistepping


Future work

  • Adding new physics
    – Smoothed Particle Hydrodynamics
  • Further load balancing / scalability work
    – Reducing the overhead of communication
    – Load balancing without increasing communication volume
    – Multiphase load balancing for multistepping
    – Other phases of the computation


Questions?

Thank you


Decomposition types

  • OCT
    – Contiguous cubic volume of space to each TreePiece
  • SFC (Morton and Peano-Hilbert)
    – Space Filling Curve imposes a total ordering of the particles
    – Segment of this line to each TreePiece (see the sketch after this list)
  • ORB
    – Space divided by Orthogonal Recursive Bisection on the number of particles
    – Contiguous non-cubic volume of space to each TreePiece
    – Due to the shapes of the decomposition, requires more computation to produce correct results
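As an illustration of SFC decomposition (a sketch, not ChaNGa's code), the Morton variant interleaves the bits of the quantized coordinates into a key; sorting particles by key gives the total ordering, and contiguous key segments become TreePieces. Peano-Hilbert ordering differs only in the key function. The bit-spreading constants below are the standard 21-bit 3D Morton masks.

#include <cstdint>

// Interleave the low 21 bits of x, y, z into a 63-bit Morton key.
std::uint64_t mortonKey(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
    auto spread = [](std::uint64_t v) {            // insert two 0 bits between bits
        v &= 0x1fffff;                             // keep 21 bits
        v = (v | v << 32) & 0x1f00000000ffffULL;
        v = (v | v << 16) & 0x1f0000ff0000ffULL;
        v = (v | v << 8)  & 0x100f00f00f00f00fULL;
        v = (v | v << 4)  & 0x10c30c30c30c30c3ULL;
        v = (v | v << 2)  & 0x1249249249249249ULL;
        return v;
    };
    return spread(x) | (spread(y) << 1) | (spread(z) << 2);
}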


Serial performance

Execution time on Tungsten (in seconds), lambs datasets:

Simulator        30,000   300,000   1,000,000   3,000,000
PKDGRAV          0.8      12.0      48.5        170.0
ChaNGa           0.8      13.2      53.6        180.6
Time difference  0.00%    9.09%     9.51%       5.87%


CacheManager importance

1 million lambs dataset on HPCx

Number of messages (in thousands):
Number of Processors   4        8        16       32       64
No Cache               48,723   59,115   59,116   68,937   78,086
With Cache             72       115      169      265      397

Time (seconds):
No Cache               730.7    453.9    289.1    67.4     42.1
With Cache             39.0     20.4     11.3     6.0      3.3
Speedup                18.74    22.25    25.58    11.23    12.76


Prefetching

1) Explicit

  • before force computation, data is requested for preload

2) Implicit in the cache

  • computation is performed with tree walks
  • after visiting a node, its children will likely be visited
  • while fetching remote nodes, the cache prefetches some of their children (see the sketch after this list)
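A minimal sketch of implicit prefetching, framed on the owning side (illustrative; `packReply` and the node layout are assumptions, not ChaNGa's API): when a processor answers a node request, it packs not just the node but its descendants down to a configurable depth, since the requester's walk will likely visit them next.

#include <vector>

struct Node {
    int key;                          // identifier of this tree node
    std::vector<Node*> children;      // empty for a leaf
};

// Collect `root` and its descendants down to `prefetchDepth` levels
// into one reply buffer.
void packReply(const Node* root, int prefetchDepth, std::vector<const Node*>& out) {
    if (!root) return;
    out.push_back(root);              // the node that was actually requested
    if (prefetchDepth == 0) return;
    for (const Node* c : root->children)
        packReply(c, prefetchDepth - 1, out);   // speculative extras
}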


Cache implicit prefetching

[Plot: execution time (in seconds) and memory consumption (in MB) vs. cache prefetch depth; lambs dataset on 64 processors of Tungsten]


Charm++ Overview

  • work decomposed into objects called chares
  • message-driven execution

[Diagram: user view (communicating chares) vs. system view (chares mapped onto processors P1, P2, P3)]

  • mapping of objects to processors is transparent to the user
  • automatic load balancing
  • communication optimization
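A minimal Charm++-style sketch of this model (illustrative only; ChaNGa's actual interfaces are richer, and the module and method names here are assumptions): work lives in chare array elements, methods are invoked through proxies as asynchronous messages, and the runtime chooses the mapping to processors.

// sim.ci -- Charm++ interface file (sketch)
mainmodule sim {
  mainchare Main {
    entry Main(CkArgMsg* m);
  };
  array [1D] TreePiece {
    entry TreePiece();
    entry void computeForces();
  };
};

// sim.C -- implementation side (sketch)
#include "sim.decl.h"

class Main : public CBase_Main {
public:
  Main(CkArgMsg* m) {
    delete m;
    // Create 64 TreePiece elements; the runtime places them on processors.
    CProxy_TreePiece pieces = CProxy_TreePiece::ckNew(64);
    pieces.computeForces();   // asynchronous broadcast to every element
    // A real program would CkExit() once a reduction reports completion.
  }
};

class TreePiece : public CBase_TreePiece {
public:
  TreePiece() {}
  TreePiece(CkMigrateMessage*) {}   // needed so elements can migrate
  void computeForces() { /* walk the local tree, request remote nodes */ }
};

#include "sim.def.h"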

Tree decomposition

[Diagram: a global tree whose nodes are distributed among TreePiece 1, TreePiece 2, and TreePiece 3; each piece sees nodes of three kinds]

  • Exclusive
  • Shared
  • Remote

Scalability comparison (old result)

[Plot: dwarf 5M comparison on Tungsten; flat line: perfect scaling, diagonal: no scaling]


ChaNGa scalability (old results)

[Plot: results on BlueGene/L; flat line: perfect scaling, diagonal: no scaling]


Interaction list

[Diagram: node X and the particles of TreePiece A]

Interaction lists

Node X

  • Opening criteria (see the sketch below):
    – node X is accepted (far enough: its moments are used directly)
    – node X is opened (too close: its children are visited)
    – node X is undecided (cut-off: the test must be repeated further down the walk)
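A hedged C++ sketch of this three-way test (illustrative; the criterion and the distance bounds `dMin`/`dMax` are assumptions, not ChaNGa's exact code): relative to a whole bucket of particles, a node can be far enough for every particle, too close for every particle, or mixed.

enum class OpenResult { Accepted, Opened, Undecided };

// `nodeSize`: side length of node X; `dMin`/`dMax`: minimum and maximum
// distance from the bucket's particles to X; `theta`: opening angle.
OpenResult testNode(double nodeSize, double dMin, double dMax, double theta) {
    if (nodeSize / dMin < theta)  return OpenResult::Accepted;   // far for all
    if (nodeSize / dMax >= theta) return OpenResult::Opened;     // close for all
    return OpenResult::Undecided;   // mixed: defer the decision to the children
}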


Interaction list

[Diagram: node X and the children of X, each carrying an interaction list and a check list]

  • Double simultaneous walk in two copies of the tree:
    1) one walk performs the force computation
    2) the other walk exploits the observation above: the children of X start from the parent's interaction and check lists, so only the undecided nodes are re-tested


Interaction list: results

Number of checks for opening criteria (in millions):

                   lambs 1M   dwarf 5M
Original code      120        1,108
Interaction lists  66         440

  • 10% average performance improvement

[Plot: relative time, Original vs. Interaction lists, on 32 to 512 processors; dwarf 5M on HPCx]


Load balancer

[Plot: dwarf 5M dataset on BlueGene/L; improvement between 15% and 35% (flat lines: good, rising lines: bad)]


ChaNGa scalability

[Plot: flat line: perfect scaling, diagonal: no scaling]