SLIDE 1

Lixia Liu, Zhiyuan Li
Purdue University, USA
PPOPP 2010, January 2010

Work supported in part by NSF through grants ST-HEC-0444285, CPA-0702245 and CPA-0811587, and by a Google Fellowship.

SLIDE 2

• Multicore architecture
  ▪ Multiple cores per chip
  ▪ Modest on-chip caches
  ▪ Memory bandwidth issue
    • Increasing gap between CPU speed and off-chip memory bandwidth
    • Increasing bandwidth consumption by aggressive hardware prefetching
• Software
  ▪ Many optimizations increase the memory bandwidth requirement
    • Parallelization, software prefetching, ILP
  ▪ Some optimizations reduce the memory bandwidth requirement
    • Array contraction, index compression
  ▪ Loop transformations to improve data locality
    • Loop tiling, loop fusion and others
    • Restricted by data/control dependences

SLIDE 3

Loop tiling is used to increase data locality. Example program: a PDE iterative solver.

The base implementation:

do t = 1, itmax
  call update(a, n, f)
  ! Compute residual and perform convergence test
  error = residual(a, n)
  if (error .le. tol) then
    exit
  endif
end do
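For concreteness, a minimal sketch of what the update step might look like for a 5-point Jacobi sweep on an n-by-n grid. This is an illustrative assumption, not the authors' code: the scratch array anew and the convention that f already carries the h**2 mesh factor are both choices made here.

    subroutine update(a, n, f)
      implicit none
      integer, intent(in)    :: n
      real(8), intent(inout) :: a(n, n)
      real(8), intent(in)    :: f(n, n)
      real(8) :: anew(n, n)
      integer :: i, j
      ! Synchronous Jacobi sweep: every new value is computed only from
      ! values of the previous time step, hence the scratch array.
      do j = 2, n - 1
        do i = 2, n - 1
          anew(i, j) = 0.25d0 * (a(i-1, j) + a(i+1, j) &
                               + a(i, j-1) + a(i, j+1) - f(i, j))
        end do
      end do
      a(2:n-1, 2:n-1) = anew(2:n-1, 2:n-1)
    end subroutine update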

SLIDE 4

Tiling is skewed to satisfy data dependences.

After tiling, parallelism only exists within a tile, due to data dependences between tiles.

SLIDE 5

The tiled version with speculated execution:

do t = 1, itmax/M + 1
  ! Save the old result into a buffer as checkpoint
  oldbuf(1:n, 1:n) = a(1:n, 1:n)
  ! Execute a chunk of M iterations after tiling
  call update_tile(a, n, f, M)
  ! Compute residual and perform convergence test
  error = residual(a, n)
  if (error .le. tol) then
    call recovery(oldbuf, a, n, f)
    exit
  end if
end do

Questions:
1. How to select the chunk size?
2. Is the recovery overhead necessary?
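The slides do not spell out recovery. One plausible sketch, assuming it rolls back to the checkpoint and re-executes one iteration at a time with a per-iteration convergence test, so the loop stops at the exact iteration where the base algorithm would have exited (tol is hard-coded here for brevity; update and residual are the routines used elsewhere in the deck):

    subroutine recovery(oldbuf, a, n, f)
      implicit none
      integer, intent(in)    :: n
      real(8), intent(in)    :: oldbuf(n, n)
      real(8), intent(inout) :: a(n, n)
      real(8), intent(in)    :: f(n, n)
      real(8), external      :: residual
      real(8), parameter     :: tol = 1.0d-6  ! assumed; in practice passed in
      real(8) :: error
      ! Roll back to the checkpoint taken before the speculative chunk ...
      a = oldbuf
      ! ... then redo the iterations one at a time, testing after each one.
      do
        call update(a, n, f)
        error = residual(a, n)
        if (error .le. tol) exit
      end do
    end subroutine recovery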

SLIDE 6

Mitigate the memory bandwidth problem:

Apply data locality optimizations to challenging cases

Relax restrictions imposed by data/control dependences

SLIDE 7

Basic idea: allow the use of old neighboring values in the computation, while still converging (as sketched below).

Originally proposed to reduce communication cost and synchronization overhead.

Convergence rate of asynchronous algorithms [1]:
  ▪ May slow down the convergence rate

Our contribution is to use the asynchronous model to improve parallelism and locality simultaneously:
  ▪ Relax dependences
  ▪ Monotone exit condition

[1] Frommer, A. and Szyld, D. B. Asynchronous two-stage iterative methods. Numer. Math. 69, 2, Dec. 1994.
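To make the relaxation concrete, a minimal sketch of an asynchronous sweep; the routine name update_async and the OpenMP parallelization are assumptions for illustration. The grid is updated in place, so a neighboring value read by one thread may be old or already updated in the current sweep; the asynchronous model admits both, which is what frees the computation to be reordered, tiled, and parallelized:

    subroutine update_async(a, n, f)
      implicit none
      integer, intent(in)    :: n
      real(8), intent(inout) :: a(n, n)
      real(8), intent(in)    :: f(n, n)
      integer :: i, j
      ! In-place update: a(i-1,j) etc. may hold a stale value or one
      ! written earlier in this sweep; no second array, no sweep barrier.
      !$omp parallel do private(i)
      do j = 2, n - 1
        do i = 2, n - 1
          a(i, j) = 0.25d0 * (a(i-1, j) + a(i+1, j) &
                            + a(i, j-1) + a(i, j+1) - f(i, j))
        end do
      end do
      !$omp end parallel do
    end subroutine update_async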
SLIDE 8

The tiled version without recovery:

do t = 1, itmax/M + 1
  ! Execute a chunk of M iterations after tiling
  call update_tile(a, n, f, M)
  ! Compute residual and perform convergence test
  error = residual(a, n)
  if (error .le. tol) then
    exit
  end if
end do

With a monotone exit condition, the exit test stays satisfied once it is met, so iterating past the convergence point still yields a converged result; the checkpoint and recovery of the speculative version are therefore unnecessary.

SLIDE 9

Achieve parallelism across the grid
  ▪ Not just within a tile

Apply loop tiling to improve data locality
  ▪ Requires a partition of the time steps into chunks

Eliminate the recovery overhead

SLIDE 10

Chunk size: the number of iterations executed speculatively in the tiled code.

Ideal if we could predict the exact number of iterations needed to converge; however, it is unknown until convergence happens.

Too large a chunk, and we pay overshooting overhead; too small, and we get poor data reuse and poor data locality.

SLIDE 11

Poor solutions:
  ▪ Use a constant chunk size (randomly picked)
  ▪ Estimate based on the theoretical convergence rate

A better solution: adaptive chunk size
  ▪ Use the latest convergence progress to predict how many more iterations are required to converge, where r_i is the residual error of the i-th round of tiled code (a sketch follows below).
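The prediction formula itself is not shown in this extraction. A minimal sketch under a geometric-convergence assumption (the function next_chunk and its arguments are hypothetical names): estimate the per-iteration residual reduction factor from the last chunk, solve for the number of iterations still needed to reach the tolerance, and clamp the result to a lower bound.

    function next_chunk(r_prev, r_cur, m_cur, tol, m_min) result(m_next)
      implicit none
      real(8), intent(in) :: r_prev, r_cur  ! residuals r_{i-1} and r_i
      integer, intent(in) :: m_cur          ! iterations in the last chunk
      real(8), intent(in) :: tol            ! convergence threshold
      integer, intent(in) :: m_min          ! lower bound on the chunk size
      integer :: m_next
      real(8) :: rho
      ! Assume the residual shrinks geometrically: r_i = r_{i-1} * rho**m_cur.
      rho = (r_cur / r_prev) ** (1.0d0 / dble(m_cur))
      if (rho .ge. 1.0d0) then
        m_next = m_cur   ! no measurable progress; keep the previous size
      else
        ! Solve r_cur * rho**m <= tol for m and round up.
        m_next = max(m_min, ceiling(log(tol / r_cur) / log(rho)))
      end if
    end function next_chunk

A lower bound of 1 or 8 here would correspond to the adaptive-1 and adaptive-8 variants evaluated on slide 18.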

SLIDE 12

Platforms for experiments:
  Intel Q6600, AMD8350 Barcelona, Intel E5530 Nehalem

Evaluated numerical methods: Jacobi, GS, SOR

Performance results:
  Synchronous model vs. asynchronous model with the best chunk size
  Original code vs. loop tiling
  Impact of the chunk size
  Adaptive chunk selection vs. the ideal chunk size

SLIDE 13

Peak bandwidth of our platforms:

Machine | Model   | Cores | L1            | L2            | L3           | BW (GB/s) | SBW (GB/s)
A       | AMD8350 | 4x4   | 64KB private  | 512KB private | 4x2MB shared | 21.6      | 18.9
B       | Q6600   | 1x4   | 32KB private  | 2x4MB shared  | N/A          | 8.5       | 4.79
C       | E5530   | 2x4   | 256KB private | 1MB private   | 2x8MB shared | 51        | 31.5

SLIDE 14

Machine A

[Figure: speedup vs. chunk size on Machine A for the tiled, tiled-norec, async-base, and async-tiled versions]

Performance varies with the chunk size! The async-tiled version is the best!

Machine       | Kernel | parallel | tiled | tiled-norec | async-base | async-tiled
A (16 cores)  | Jacobi | 5.95     | 16.76 | 27.24       | 5.47       | 39.11

SLIDE 15

Machine B

[Figure: speedup vs. chunk size on Machine B for the tiled, tiled-norec, async-base, and async-tiled versions]

Poor performance without tiling (async-base and parallel)!

Machine      | Kernel | parallel | tiled | tiled-norec | async-base | async-tiled
B (4 cores)  | Jacobi | 1.01     | 2.55  | 3.44        | 1.01       | 3.67

SLIDE 16

Machine C

[Figure: speedup vs. chunk size (20 to 160) on Machine C for the tiled, tiled-norec, async-base, and async-tiled versions]

Machine      | Kernel | parallel | tiled | tiled-norec | async-base | async-tiled
C (8 cores)  | Jacobi | 3.73     | 8.53  | 12.69       | 3.76       | 13.39

SLIDE 17

Machine | Kernel | parallel | tiled | tiled-norec | async-base | async-tiled
A       | GS     | 5.49     | 12.76 | 22.02       | 26.19      | 30.09
B       | GS     | 0.68     | 5.69  | 9.25        | 4.90       | 14.72
C       | GS     | 3.54     | 8.20  | 11.86       | 11.00      | 19.56
A       | SOR    | 4.50     | 11.99 | 21.25       | 29.08      | 31.42
B       | SOR    | 0.65     | 5.24  | 8.54        | 7.34       | 14.87
C       | SOR    | 3.84     | 7.53  | 11.51       | 11.68      | 19.10

  • The asynchronous tiled version performs better than the synchronous tiled version (even without the recovery cost)
  • The asynchronous baseline suffers more on machine B due to less available memory bandwidth

SLIDE 18

adaptive-1: the lower bound of the chunk size is 1
adaptive-8: the lower bound of the chunk size is 8

[Figure: speedup vs. initial chunk size (20 to 100) for Jacobi and GS, comparing the async-tiled, adaptive-1, and adaptive-8 versions]

SLIDE 19
• Showed how to benefit from the asynchronous model by relaxing data and control dependences
  ▪ Improves parallelism and data locality (via loop tiling) at the same time
• An adaptive method to determine the chunk size
  ▪ Needed because the iteration count is usually unknown in practice
• Good performance enhancement when tested on three well-known numerical kernels on three different multicore systems

SLIDE 20

Thank you!
