Parameter Tuning of a Hybrid Treecode-FMM on GPUs

SLIDE 1

Parameter Tuning of a Hybrid Treecode-FMM on GPUs

Rio Yokota, Lorena Barba Department of Mechanical Engineering, Boston University

Saturday, June 4, 2011

SLIDE 2

Previous Calculations

Nagasaki University DEGIMA cluster
GTX295 x 380 = 760 GPUs
Theoretical peak: 700 TFlops
Power consumption: 100 kW

Astrophysics (treecode): 100 TFlops, N = 3x10^9 in 6 sec (Nitadori & Hamada)
Turbulence (FMM): 40 TFlops, N = 3x10^9 in 20 sec (Yokota & Barba)

SLIDE 3

Treecode & FMM

SLIDE 4

Multipole expansion

[Figure: sources j clustered around the expansion center x*, and a target i]

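! Direct summation: O(N^2)
! (targets i and sources j are taken to be distinct, well-separated sets)
do i = 1,N
  ff = 0
  do j = 1,N
    ff = ff+1/( x( i )-x( j ) )
  end do
  f( i ) = ff
end do

! Multipole moments about the expansion center xs: O(pN)
do k = 1,p
  gg = 0
  do j = 1,N
    gg = gg+( x( j )-xs )**( k-1 )
  end do
  g( k ) = gg
end do

! Evaluation of the expansion at each target: O(pN)
do i = 1,N
  ff = 0
  do k = 1,p
    ff = ff+( x( i )-xs )**( -k )*g( k )
  end do
  f( i ) = ff
end do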

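Factoring out the distance to the expansion center $x_*$,

$$\frac{1}{x_i - x_j} = \frac{1}{x_i - x_*}\,\frac{1}{1 - \dfrac{x_j - x_*}{x_i - x_*}}$$

and Taylor expansion (truncated after $p$ terms, valid for $|t| < 1$)

$$\frac{1}{1 - t} \approx \sum_{k=0}^{p-1} t^k$$

gives

$$\sum_{j=1}^{N} \frac{1}{x_i - x_j} \approx \sum_{j=1}^{N} \frac{1}{x_i - x_*} \sum_{k=0}^{p-1} \left(\frac{x_j - x_*}{x_i - x_*}\right)^{k} = \sum_{k=0}^{p-1} (x_i - x_*)^{-k-1} \left[\sum_{j=1}^{N} (x_j - x_*)^{k}\right]$$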
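The expansion on this slide can be checked numerically. The following is a minimal Python sketch, not the authors' code: it transcribes the direct and multipole loops for the 1/(x_i - x_j) kernel and compares them for sources clustered around x* = 0 and targets placed well away from the cluster. The function names, particle positions, and the values of p are assumptions chosen for illustration.

```python
import random


def direct_sum(targets, sources):
    """O(N^2) reference: f_i = sum_j 1 / (x_i - x_j)."""
    return [sum(1.0 / (xi - xj) for xj in sources) for xi in targets]


def multipole_sum(targets, sources, xs, p):
    """p-term expansion about the center xs (hypothetical helper name)."""
    # Moments g[k] = sum_j (x_j - xs)^k for k = 0 .. p-1, computed once.
    g = [sum((xj - xs) ** k for xj in sources) for k in range(p)]
    # Evaluation: f_i ~ sum_k (x_i - xs)^(-k-1) * g[k].
    return [sum((xi - xs) ** (-k - 1) * g[k] for k in range(p))
            for xi in targets]


random.seed(0)
# Assumed geometry: sources within 0.5 of the center, targets at distance >= 2.
sources = [random.uniform(-0.5, 0.5) for _ in range(100)]
targets = [random.uniform(2.0, 3.0) for _ in range(50)]

exact = direct_sum(targets, sources)
for p in (2, 4, 8, 16):
    approx = multipole_sum(targets, sources, 0.0, p)
    rel_err = max(abs(a - e) / abs(e) for a, e in zip(approx, exact))
    print(f"p = {p:2d}   max relative error = {rel_err:.3e}")
```

With this geometry the ratio |x_j - x*| / |x_i - x*| never exceeds 0.25, so the printed error should fall quickly as p grows.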

SLIDE 5

Error Control

Treecode / FMM

Error: O(θ^p)
Complexity: O(p^3 θ^-3)
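The θ^p error scaling follows from the remainder of the truncated geometric series: 1/(1 - t) minus its first p terms equals t^p / (1 - t). A minimal Python sketch of that identity follows. It assumes θ bounds the ratio |x_j - x*| / |x_i - x*|, and the function name and sample values are illustrative only.

```python
def truncation_error(t, p):
    """Absolute error of the p-term geometric series for 1 / (1 - t)."""
    exact = 1.0 / (1.0 - t)
    partial = sum(t ** k for k in range(p))
    return abs(exact - partial)


# Compare the measured error against the closed-form remainder t^p / (1 - t).
for theta in (0.3, 0.5, 0.7):
    for p in (4, 8, 12):
        measured = truncation_error(theta, p)
        predicted = theta ** p / (1.0 - theta)
        print(f"theta = {theta}  p = {p:2d}  "
              f"measured = {measured:.3e}  predicted = {predicted:.3e}")
```

The same target error can therefore be reached either by raising p at fixed θ or by lowering θ at fixed p.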

SLIDE 6

Error Optimization

Optimize p
Optimize θ
Better on GPUs?

SLIDE 7

Stack based hybrid treewalk

[Figure: stack-based treewalk over target and source cells, showing successive push and pop operations on the stack]
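A stack-based walk over pairs of target and source cells can be sketched as below. This is a minimal 1D Python illustration under stated assumptions, not the implementation from the talk: the `Cell` class, the binary tree, the acceptance test (r_target + r_source < θ · distance), and the rule of splitting the larger cell are all choices made for this example. Well-separated pairs are collected in `far`, and leaf pairs that fail the test are collected in `near` for direct summation.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Cell:
    lo: float
    hi: float
    idx: list                                  # particle indices in this cell
    children: list = field(default_factory=list)

    @property
    def center(self):
        return 0.5 * (self.lo + self.hi)

    @property
    def radius(self):
        return 0.5 * (self.hi - self.lo)


def build(x, idx, lo, hi, ncrit):
    """Recursively bisect [lo, hi) until a cell holds at most ncrit particles."""
    cell = Cell(lo, hi, idx)
    if len(idx) > ncrit:
        mid = 0.5 * (lo + hi)
        left = [i for i in idx if x[i] < mid]
        right = [i for i in idx if x[i] >= mid]
        for part, a, b in ((left, lo, mid), (right, mid, hi)):
            if part:
                cell.children.append(build(x, part, a, b, ncrit))
    return cell


def treewalk(target_root, source_root, theta):
    """Pop a (target, source) pair, then accept it or push its sub-pairs."""
    far, near = [], []
    stack = [(target_root, source_root)]
    while stack:
        t, s = stack.pop()
        dist = abs(t.center - s.center)
        if t.radius + s.radius < theta * dist:       # well separated
            far.append((t, s))
        elif not t.children and not s.children:      # two leaves: direct sum
            near.append((t, s))
        elif s.children and (s.radius >= t.radius or not t.children):
            stack.extend((t, c) for c in s.children)  # split the source cell
        else:
            stack.extend((c, s) for c in t.children)  # split the target cell
    return far, near


random.seed(0)
x = [random.random() for _ in range(1000)]
root = build(x, list(range(len(x))), 0.0, 1.0, ncrit=16)
far, near = treewalk(root, root, theta=0.5)
print(len(far), "far pairs,", len(near), "near pairs")
```

Each pair in `far` could then be evaluated cell-to-cell (FMM style) or cell-to-particle (treecode style). How that choice is made per pair is not specified in this transcript, so the sketch stops at building the two lists.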

SLIDE 8

Parameter study

Error: O(θ^p)
Complexity: O(p^3 θ^-3)
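The trade-off between the two scalings can be explored with a toy model. The Python sketch below assumes the error is exactly θ^p and the cost is exactly p^3 θ^-3, with all constants set to 1 and p treated as a real number. These are simplifying assumptions for illustration, not measured behaviour, and the function names are hypothetical. For a fixed target error it scans θ and reports the cheapest combination.

```python
import math


def cost_at_fixed_error(theta, eps):
    """Return (p, cost) such that theta**p == eps under the toy model."""
    p = math.log(eps) / math.log(theta)   # solve theta^p = eps for p
    return p, p ** 3 / theta ** 3         # modelled cost p^3 * theta^-3


def best_theta(eps, n=901):
    """Grid search over theta in [0.05, 0.95] for the lowest modelled cost."""
    thetas = [0.05 + 0.001 * k for k in range(n)]
    return min(thetas, key=lambda th: cost_at_fixed_error(th, eps)[1])


for eps in (1e-3, 1e-6, 1e-9):
    theta = best_theta(eps)
    p, cost = cost_at_fixed_error(theta, eps)
    print(f"eps = {eps:.0e}  theta = {theta:.3f}  p = {p:.1f}  cost = {cost:.1f}")
```

In this idealized model the best θ is 1/e for every target error, because the cost reduces to (ln ε)^3 / (θ ln θ)^3. Real kernels have hardware-dependent constants in front of both terms, which is why the optimum has to be found by measurement.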

SLIDE 9

Optimum parameters

SLIDE 10

Parallel calculation

Nagasaki University DEGIMA: 760 GTX295 GPUs, peak 0.7 PFlops
Tokyo Institute of Technology TSUBAME 2.0: 4224 M2050 GPUs, peak 2.4 PFlops

SLIDE 11

Strong Scaling

[Figure: strong scaling for N = 10^8, one panel each for DEGIMA and TSUBAME 2.0. Horizontal axis: Nprocs, 1 to 512; vertical axis: time x Nprocs [s], ticks from 50 to 400. Breakdown: tree construction, mpisendp2p, mpisendm2l, P2Pkernel, P2Mkernel, M2Mkernel, M2Lkernel, L2Lkernel, L2Pkernel]

SLIDE 12

Weak Scaling

[Figure: weak scaling on TSUBAME 2.0. Horizontal axis: Number of processes (GPUs), 4 to 2048; vertical axis: Time (sec), 0 to 30. Breakdown: Local evaluation, FMM evaluation, MPI communication, GPU communication, Tree construction]

0.5 PFlops (TSUBAME 2.0)

SLIDE 13

Summary & Outlook

1. The stack-based treewalk enables a simple but effective hybridization of treecodes and FMMs

2. The optimum p and θ are different on CPUs and GPUs, but this difference is small

3. More tests need to be performed in the higher accuracy range and the overall performance must be compared to other treecodes and FMMs
