An Introductory Exascale Feasibility Study for FFTs and Multigrid



SLIDE 1

An Introductory Exascale Feasibility Study for FFTs and Multigrid

Hormozd Gahvari William Gropp

University of Illinois at Urbana-Champaign

April 22, 2010

Gahvari and Gropp (University of Illinois) Introductory Exascale Feasibility Study

SLIDE 2

Outline

1. Exascale basics
2. Studying application feasibility
3. FFT study
4. Multigrid study
5. Conclusions and directions for future work

SLIDE 3

Exascale Basics

  • Exascale means $10^{18}$ operations per second
  • Exascale machines expected to have between 100 million and 1 billion cores
  • Use of new technologies and perhaps novel architectures also expected
  • Big impact on applications anticipated

SLIDE 4

Studying Application Feasibility

  • Main challenge: specific design and machine parameters are far from known, so we cannot simply plug numbers into performance models
  • Instead, treat machine parameters like latency and bandwidth as variables and see what range of values for them would be feasible, i.e., what kind of machine would need to be built to enable exascale performance?
  • Model on the following “hypothetical exascale machine”:
      – $2^{28} \approx 268.4$ million cores
      – Time per flop $t_c = 10^{-10}$ seconds
      – Peak performance: 2.68 EFLOPS
  • Also vary problem sizes
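
The peak figure follows directly from the core count and the time per flop. The sketch below is a minimal check, assuming each core retires exactly one flop every $t_c$ seconds; the names `CORES`, `T_C`, and `peak_flops` are illustrative and not from the slides.

```python
# Hypothetical exascale machine; the two constants are the values quoted on the slide.
CORES = 2 ** 28   # number of cores
T_C = 1e-10       # seconds per floating-point operation


def peak_flops(cores: int = CORES, t_c: float = T_C) -> float:
    """Peak rate in flop/s, assuming every core retires one flop every t_c seconds."""
    return cores / t_c


if __name__ == "__main__":
    print(f"cores = {CORES:,}")
    print(f"peak  = {peak_flops() / 1e18:.2f} EFLOPS")
```

With these values the core count is 268,435,456 and the peak is about 2.68 EFLOPS.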

SLIDE 5

Studying Application Feasibility

Use LogP performance model to model performance. Parameters are:

  • L – latency for communicating on one link
  • o – software overhead incurred in communication
  • g – gap between messages
  • P – number of processors

We use LogP rather than a more detailed model because:

1. A model that assumes more details about the architecture restricts the results to a certain class of machines
2. We are looking for bounds, not specific predictions (which we cannot make for a machine that has yet to be built!). LogP, which ignores complicating factors like congestion, can give us a good starting point

For each application, model performance and see the region in parameter space in which exascale performance is achieved

SLIDE 6

FFT Feasibility Study

  • Scalability challenge: requires collective communication
  • Past work has managed the cost by using either optimized collective communication routines or aggressive overlap of communication and computation
  • Is the communication cost still manageable at exascale?

SLIDE 7

FFT Feasibility Study – Problem Setup

We consider a 3D FFT on a cubic domain of $N = n^3$ points. Two ways of partitioning, slabs and pencils:

Slabs:
  • 2D then 1D local FFTs (2 rounds)
  • One round of communication
  • Min. computation time: decades

Pencils:
  • 1D local FFTs (3 rounds)
  • Two rounds of communication
  • Min. computation time: milliseconds

We consider only the pencil decomposition. Assume $P = p \times p$.
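
One way to read the "minimum computation time" entries is as the compute time of the smallest problem that gives every core work. That reading is an assumption, not something the slides state. A slab decomposition can use at most $n$ processors, so filling $P = 2^{28}$ cores needs $n = P$; a pencil decomposition can use $n^2$, so $n = \sqrt{P}$ suffices. The sketch below evaluates $t_c (N/P) \log_2 N$ for both cases; the function name and its `split_dims` parameter are hypothetical.

```python
import math

CORES = 2 ** 28   # hypothetical machine size from the slides
T_C = 1e-10       # seconds per flop from the slides


def min_compute_time(cores: int, split_dims: int, t_c: float = T_C) -> float:
    """Compute time for the smallest n^3 FFT that gives every core work.

    split_dims = 1 models slabs (at most n processors),
    split_dims = 2 models pencils (at most n^2 processors).
    Assumes cores is an exact power so that n is an integer.
    """
    n = round(cores ** (1.0 / split_dims))
    assert n ** split_dims == cores, "cores must be an exact power"
    N = n ** 3
    return t_c * (N / cores) * math.log2(N)


if __name__ == "__main__":
    year = 365.25 * 24 * 3600
    print(f"slabs:   {min_compute_time(CORES, 1) / year:.1f} years")
    print(f"pencils: {min_compute_time(CORES, 2) * 1e3:.3f} ms")
```

Under these assumptions the slab case comes out at roughly 19 years and the pencil case at well under a millisecond, which matches the contrast the slide draws.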

SLIDE 8

FFT Feasibility Study – Performance Models

No overlap model:

$$T = t_c \frac{N}{P} \log_2 N + 2(p - 1)(L + o) + 2(p - 2)g$$

Latency is treated as the cost to send an entire message, so no latency hiding is done here.

Overlap model: pipeline computation and communication using the LogGP model, which extends LogP with an inverse bandwidth term (G = gap between units of data). Assuming computation and communication of one $n \times \frac{n}{p}$ sheet at a time, we get a $\left(\frac{n}{p} + 1\right)$-stage pipeline.

[Figure: pipeline diagram of computation and communication stages, only 3 stages shown for simplicity]
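
The no-overlap model $T = t_c \frac{N}{P} \log_2 N + 2(p - 1)(L + o) + 2(p - 2)g$ is simple enough to evaluate directly. The sketch below is a minimal version with $P = p \times p = 2^{28}$. The feasibility test in `is_exascale` (that $N \log_2 N$ flops are delivered at $10^{18}$ flop/s or better) is an assumed criterion, since the slides do not spell out the exact threshold; all function names are illustrative.

```python
import math

T_C = 1e-10        # seconds per flop (from the slides)
P_SIDE = 2 ** 14   # p, so that P = p * p = 2**28 cores


def fft_time_no_overlap(N, L, o, g, p=P_SIDE, t_c=T_C):
    """T = t_c (N/P) log2 N + 2(p-1)(L+o) + 2(p-2)g, all times in seconds."""
    P = p * p
    compute = t_c * (N / P) * math.log2(N)
    communicate = 2 * (p - 1) * (L + o) + 2 * (p - 2) * g
    return compute + communicate


def is_exascale(N, L, o, g, target=1e18):
    """Assumed criterion: N log2 N flops delivered at >= target flop/s."""
    flops = N * math.log2(N)
    return flops / fft_time_no_overlap(N, L, o, g) >= target


if __name__ == "__main__":
    N = 2 ** 42  # a 16K cube
    print(is_exascale(N, L=0.0, o=0.0, g=0.0))
    print(is_exascale(N, L=1e-6, o=0.0, g=0.0))
```

For the 16K cube, zero communication cost passes the test, while a 1 microsecond latency alone fails it: the $2(p - 1)L$ term is then hundreds of times larger than the compute term.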

SLIDE 9

FFT Feasibility Study – Results, No Overlap Case

Graph on left shows feasibility regions in L and g for two different problems under two different situations, one where software overhead was zero (dotted line) and the other where it was 1 ns (solid line). Graph on right shows feasibility regions for several problem sizes for the “real-world” case:

[Figure, left: Feasibility Region for 3D FFT; axes L and g; curves for 16K cube (Ideal, RealWorld) and 64K cube (Ideal, RealWorld)]

[Figure, right: Feasibility Contours for 3D FFT; axes L and g; contours for $N = 10^{13}, 10^{14}, 10^{15}, 10^{16}, 10^{17}, 10^{18}, 10^{19}$]

These graphs show that latency and gap have to be small unless problem is large

SLIDE 10

FFT Feasibility Study – Results, Overlap Case

Overlap enables us to hide latency effectively, but will require GB/s bandwidth to do so. Gap constraint is also more restrictive:

[Figure: four panels titled "Feasibility Region for 3D FFT with Overlap", one each for $N = 10^{13}$, $10^{15}$, $10^{17}$, and $10^{19}$; axes L, g, and G]

SLIDE 11

FFT Feasibility Study – Results, Problem Sizes

Since computation grows superlinearly, a natural question to ask is how big the problem size can grow until it takes too long. It can get pretty big. Here are problem sizes at which FFT computation at the rate of one EFLOP takes at least...

Time        No. Elements
1 ms        $5.87 \times 10^{13}$
1 s         $4.84 \times 10^{16}$
1 minute    $2.63 \times 10^{18}$
1 hour      $1.44 \times 10^{20}$
1 day       $3.24 \times 10^{21}$
1 week      $2.19 \times 10^{22}$
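
The tabulated sizes can be reproduced approximately by solving $N \log_2 N = t \cdot R$ for $N$. The sketch below assumes $R$ is the peak rate of the hypothetical machine ($2^{28} / 10^{-10} \approx 2.68 \times 10^{18}$ flop/s) and that an FFT costs $N \log_2 N$ flops. Both are inferred from the numbers rather than stated on the slide, and `max_fft_size` is an illustrative name.

```python
import math

PEAK = 2 ** 28 / 1e-10   # flop/s of the hypothetical machine (about 2.68e18)


def max_fft_size(seconds: float, rate: float = PEAK) -> float:
    """Solve N * log2(N) = seconds * rate for N by bisection."""
    budget = seconds * rate
    lo, hi = 2.0, 1e40
    for _ in range(200):
        mid = math.sqrt(lo * hi)   # geometric midpoint, since N spans many decades
        if mid * math.log2(mid) < budget:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    for label, t in [("1 ms", 1e-3), ("1 s", 1.0), ("1 minute", 60.0),
                     ("1 hour", 3600.0), ("1 day", 86400.0), ("1 week", 604800.0)]:
        print(f"{label:>8}: {max_fft_size(t):.2e}")
```

The geometric midpoint keeps the bisection well conditioned across the roughly ten orders of magnitude the table covers.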

SLIDE 12

FFT Feasibility Study – Results, Interconnect

Another question to ask: given the collective communication, what is the effect of the interconnect? We can give a performance upper bound as the time required to move the problem data across the bisection bandwidth of the network (twice, since there are two communication rounds). If we treat individual link bandwidth as a variable, we can find a lower bound for it that corresponds to the upper bound being exascale:

Interconnect   Bisection BW   Link BW, $N = 2^{42}$        Link BW, $N = 2^{59}$
2D Mesh        $\sqrt{P}$     $1.72 \times 10^{4}$ GB/s    $1.23 \times 10^{4}$ GB/s
2D Torus       $2\sqrt{P}$    $8.63 \times 10^{3}$ GB/s    $6.14 \times 10^{3}$ GB/s
3D Mesh        $P^{2/3}$      680 GB/s                     484 GB/s
3D Torus       $2P^{2/3}$     340 GB/s                     242 GB/s
Fat-tree       $P/2$          1.05 GB/s                    0.75 GB/s
Hypercube      $P/2$          1.05 GB/s                    0.75 GB/s

SLIDE 13

Multigrid Feasibility Study

  • Scalability challenge: while communication cost is constant, the computation/communication ratio decreases as grids get coarser
  • When there are fewer points than processors, some processors will sit idle unless special measures are taken
  • Under what circumstances will such steps be necessary?

SLIDE 14

Multigrid Feasibility Study – Problem Setup

  • Consider using geometric multigrid applied in V-cycles to perform a nearest-neighbor computation such as solution of the Laplace equation
  • Consider both 2D and 3D versions of the computation, with processors arranged in the appropriate mesh network
  • Assume the points are distributed evenly among the processors, with an ideal point-to-processor mapping
  • Assume Jacobi smoothing

SLIDE 15

Multigrid Feasibility Study – Performance Model

Use the LogP model as with the FFT, but with a slight modification: treat L as a per-link latency. Once there are fewer points than processors, communication will cross more links, and we want to capture this.

Other model assumptions:
  • There are N points, arranged in a d-dimensional grid
  • Each processor communicates with k neighbors ((k + 1)-point stencil)
  • Number of points decreases by a constant factor c in each dimension after coarsening
  • We model one V-cycle

SLIDE 16

Multigrid Feasibility Study – Performance Model

Break model into components:

  • smooth(n, l) – run smoother on n points, with neighbors l links away
  • coarsen(l) – perform one step of coarsening. Neighbors before coarsening are l links away; this is the distance of communication
  • prolong(l) – perform one step of prolongation. Neighbors after prolongation are l links away; this is the distance of communication

Treat direct solve as smoother application and recurse as far as possible for simplicity

SLIDE 17

Multigrid Feasibility Study – Performance Model

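Smoothing time is:

$$T_s = \sum_{i=0}^{\lfloor \log_{c^d} \frac{N}{P} \rfloor} \mathrm{smooth}\left(\frac{N}{c^{di} P}, L\right) + \sum_{i=\lfloor \log_{c^d} \frac{N}{P} \rfloor + 1}^{\lfloor \log_{c^d} N \rfloor} \mathrm{smooth}\left(1, c^{\,i - \lfloor \log_{c^d} \frac{N}{P} \rfloor} L\right)$$

Coarsening time is:

$$T_c = \sum_{i=0}^{\lfloor \log_{c^d} \frac{N}{P} \rfloor} \mathrm{coarsen}(L) + \sum_{i=\lfloor \log_{c^d} \frac{N}{P} \rfloor + 1}^{\lfloor \log_{c^d} N \rfloor - 1} \mathrm{coarsen}\left(c^{\,i - \lfloor \log_{c^d} \frac{N}{P} \rfloor} L\right)$$

Prolongation time is:

$$T_p = \sum_{i=0}^{\lfloor \log_{c^d} \frac{N}{P} \rfloor} \mathrm{prolong}(L) + \sum_{i=\lfloor \log_{c^d} \frac{N}{P} \rfloor + 1}^{\lfloor \log_{c^d} N \rfloor - 1} \mathrm{prolong}\left(c^{\,i - \lfloor \log_{c^d} \frac{N}{P} \rfloor} L\right)$$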

SLIDE 18

Multigrid Feasibility Study – Performance Model

Applying the LogP model gives us, for each component:

$$\mathrm{smooth}(n, l) = (k + 1) n t_c + k(l + o) + (k - 1)g$$

$$\mathrm{coarsen}(l) = k(l + o) + (k - 1)g$$

$$\mathrm{prolong}(l) = k(l + o) + (k - 1)g$$

For our results, we assume five smoother steps before coarsening and five smoother steps after prolongation back to that grid, with the number of grid points reduced by a factor of 2 in each dimension at each coarsening step.
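
The component costs and the level sums can be combined into a single V-cycle estimate. The sketch below is one possible reading, not the authors' code: it assumes the total is $2 \cdot 5 \cdot T_s + T_c + T_p$ (five smoothing sweeps on the way down and five on the way up), it requires $N \ge P$, and it computes the floor-logarithm bounds with integer arithmetic. The names `floor_log`, `transfer`, and `vcycle_time` are hypothetical.

```python
def floor_log(num: int, den: int, base: int) -> int:
    """Largest m >= 0 with base**m * den <= num (integer arithmetic, needs num >= den)."""
    m = 0
    while base ** (m + 1) * den <= num:
        m += 1
    return m


def smooth(n, l, k, o, g, t_c):
    """smooth(n, l) = (k+1) n t_c + k(l+o) + (k-1)g."""
    return (k + 1) * n * t_c + k * (l + o) + (k - 1) * g


def transfer(l, k, o, g):
    """coarsen(l) and prolong(l) share the cost k(l+o) + (k-1)g."""
    return k * (l + o) + (k - 1) * g


def vcycle_time(N, P, L, o, g, d=2, k=4, c=2, t_c=1e-10, steps=5):
    """Assumed total for one V-cycle: 2*steps*T_s + T_c + T_p."""
    assert N >= P, "this sketch only covers fine grids with at least one point per processor"
    b = c ** d
    m = floor_log(N, P, b)     # last level with at least one point per processor
    top = floor_log(N, 1, b)   # coarsest level
    # Levels 0..m: every processor holds N / (b**i * P) points, neighbors one link away.
    t_s = sum(smooth(N / (b ** i * P), L, k, o, g, t_c) for i in range(m + 1))
    # Levels m+1..top: one point per active processor, neighbors c**(i-m) links away.
    t_s += sum(smooth(1, c ** (i - m) * L, k, o, g, t_c) for i in range(m + 1, top + 1))
    # Coarsening and prolongation have identical cost in this model.
    t_cp = (m + 1) * transfer(L, k, o, g)
    t_cp += sum(transfer(c ** (i - m) * L, k, o, g) for i in range(m + 1, top))
    return 2 * steps * t_s + 2 * t_cp


if __name__ == "__main__":
    print(vcycle_time(N=10 ** 13, P=2 ** 28, L=1e-9, o=1e-9, g=1e-9))
```

The second sum in each term is where the coarse-grid latency penalty enters: the link distance doubles (for $c = 2$) at every level below the point where each processor holds a single grid point.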

SLIDE 19

Multigrid Feasibility Study – Results

We use performance models to come up with feasibility regions for two cases, 2D 5-point stencil and 3D 7-point stencil. Solid lines are for performance model as described earlier; dotted lines are when latency does not get bigger on coarse enough grids:

[Figure, left: Feasibility Contours for Multigrid, d = 2, k = 4; axes L and g; contours for $N = 10^{11}, 10^{12}, 10^{13}, 10^{14}, 10^{15}, 10^{16}, 10^{17}$]

[Figure, right: Feasibility Contours for Multigrid, d = 3, k = 6; axes L and g; contours for $N = 10^{11}, 10^{12}, 10^{13}, 10^{14}, 10^{15}, 10^{16}, 10^{17}$]

We see that a coarse grid penalty for latency makes it a big concern

SLIDE 20

Multigrid Feasibility Study – Results, Problem Size

Since multigrid is only linear time, problem size is not as much of a concern as with the FFT. Here are problem sizes at which multigrid computation at the rate of one EFLOP takes at least...

Time        No. Elements
1 ms        $2.68 \times 10^{14}$
1 s         $2.68 \times 10^{17}$
1 minute    $1.61 \times 10^{19}$
1 hour      $9.66 \times 10^{20}$
1 day       $2.32 \times 10^{22}$
1 week      $1.62 \times 10^{23}$

SLIDE 21

Multigrid Feasibility Study – Results, Problem Size

A slowdown in computation on coarse grids could render exascale performance impossible for small enough problems, e.g. if machine peak is achievable only using hardware such as vector units. Model by adjusting computation rate tc accordingly. For vector unit of length 64, assuming varying latency:

[Figure, left: Feasibility Contours for Multigrid, d = 2, k = 4; axes L and g; contours for $N = 10^{11}, 10^{12}, 10^{13}, 10^{14}, 10^{15}, 10^{16}, 10^{17}$]

[Figure, right: Feasibility Contours for Multigrid, d = 3, k = 6; axes L and g; contours for $N = 10^{11}, 10^{12}, 10^{13}, 10^{14}, 10^{15}, 10^{16}, 10^{17}$]

Solid lines mean tc varies as well, while dotted lines have tc fixed. No solid line means that exascale performance is impossible

SLIDE 22

Conclusions and Directions for Future Work

There are substantial constraints to be satisfied to enable exascale performance for FFT and multigrid:

  • Latencies in the nanosecond to microsecond range for smaller FFT problems, or perhaps even smaller for multigrid
  • FFT needs bandwidth on the order of GB/s per process, and a mesh interconnect cannot provide enough bisection bandwidth
  • There is still room for the problem size to grow with the higher processor count, however

Two main thrusts for future work:

1. Continue the analysis presented here for other algorithms and applications, to see which ones are suited to exascale systems
2. Build more depth, looking in more detail than was done here: network contention modeling for FFT and data movement techniques to handle coarse grids in multigrid
