Computation vs. Memory Systems: Pinning Down Accelerator Bottlenecks - PowerPoint PPT Presentation
Computation vs. Memory Systems: Pinning Down Accelerator Bottlenecks
Martha Kim and Stephen Edwards
Columbia University, Department of Computer Science
June 19, 2010, AMAS-BT Workshop
Larus' virtuous cycle: improved processor performance enables fuller-featured software, larger development teams, and HLLs & abstractions, which yield slower programs, which in turn demand further increases in power efficiency and performance. The power wall now blocks that last step of the cycle.
Efficiency of Specialized Hardware
[Figure: performance and power plotted across the specialization spectrum: general-purpose processor, ASIP (application-specific instruction processor), FPGA (field-programmable gate array), standard cell ASIC, and full custom ASIC; the step-to-step gains shown range from 3-10x up to 40-500x.]
Accelerator System
General-purpose core(s) surrounded by many, many special-purpose accelerators that are powered on only when their function is needed. The potential benefits are available only if applications actually make use of the accelerators.
[Figure: cores P0 and P1 and accelerators A0-A7 connected via shared communication/memory.]
Talk Outline
- Overall Vision
- Accelerator System Model
- Methodology Overview
- Methodology in Practice
  - Image Rotation
  - JPEG
- Conclusion
System-Level Vision
The programmer targets standard accelerator libraries (e.g., Java); the presence of high-level types supported by accelerators determines the boundary of accelerated code. Functions define the boundaries of acceleration.
Each accelerator must do two things:
1. Compute
2. Communicate externally
[Figure: CPU with accelerators foo, bar, and baz, each performing computation and external communication.]
Computation Model
[Figure: dynamic invocation tree over main(), foo(), bar(), bar(), baz(), mapped onto a CPU and accelerators foo, bar, and baz.]
1. Dynamic function invocations are mapped to the corresponding accelerator
2. Invocations without an accelerator are executed on the same core as their parent
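The two mapping rules above can be sketched as a recursive assignment over the invocation tree. This is a minimal illustration under our own assumptions: the struct layout, the hypothetical accelerator table, and the function names are ours, not the tool's.

```c
#include <string.h>

/* A node in the dynamic invocation tree (illustrative shape). */
struct inv {
    const char *fn;        /* function name           */
    const char *unit;      /* assigned execution unit */
    struct inv *child[4];
    int nchild;
};

/* Hypothetical accelerator table: functions that have hardware. */
static const char *accel[] = { "foo", "bar" };

static const char *find_accel(const char *fn)
{
    for (unsigned i = 0; i < sizeof accel / sizeof *accel; ++i)
        if (strcmp(accel[i], fn) == 0)
            return accel[i];
    return 0;
}

/* Rule 1: an invocation with an accelerator runs on it.
 * Rule 2: otherwise it runs on the same unit as its parent. */
static void assign(struct inv *n, const char *parent_unit)
{
    const char *a = find_accel(n->fn);
    n->unit = a ? a : parent_unit;
    for (int i = 0; i < n->nchild; ++i)
        assign(n->child[i], n->unit);
}
```

For example, with accelerators for foo and bar only, a baz() invoked from main() stays on the CPU with its parent, while a foo() invocation moves to the foo accelerator.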
Communication Model
Invocations communicate via load/store dependencies.
[Figure: the same invocation tree, annotated with stores (st A, st B) and the matching loads (ld A, ld B) that create dependencies between invocations.]
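One way to realize load/store dependence tracking is a last-writer table: each store records its invocation as the current owner of the address, and each load charges a transfer from that owner to the loading invocation. A minimal sketch with illustrative sizes; the names and the one-byte-per-address granularity are our simplifications, not the talk's implementation.

```c
#define MAX_ADDR 64
#define MAX_INV  8

/* Last invocation to store to each address (-1 = never written). */
static int last_writer[MAX_ADDR];
/* bytes[i][j]: bytes flowing from invocation i to invocation j. */
static unsigned bytes[MAX_INV][MAX_INV];

static void init(void)
{
    for (int a = 0; a < MAX_ADDR; ++a)
        last_writer[a] = -1;
}

/* A store makes this invocation the producer for the address. */
static void on_store(int inv, int addr)
{
    last_writer[addr] = inv;
}

/* A load creates a dependence on the last writer of the address. */
static void on_load(int inv, int addr)
{
    int w = last_writer[addr];
    if (w >= 0 && w != inv)
        bytes[w][inv] += 1;   /* one byte per tracked address */
}
```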
Methodology Pt 1: Examine Application
Pintool ➔ decorated call graph
[Figure: call graph over main(), foo(), bar(), bar(), baz(), decorated per invocation with instruction counts (i, j, k, l, m) and data transfers (a, b, c, d, e, f).]
Pintool Functionality
- The tool instruments calls, returns, loads, and stores
- Four logs are generated, all keyed off of a unique invocation identifier:
  - Function: invocation ID ➔ function name (-tfunction, -bfunction)
  - Subcalls: invocation ID ➔ list of subcall IDs (-tsubcalls, -bsubcalls)
  - Instruction Count: invocation ID ➔ dynamic instruction count (-ticount, -bicount)
  - Data Transfers: invocation ID ➔ invocation ID, bytes (-txfers, -bxfers, -xfer-chunk)
- At present, significant overhead relative to native (approximately 2000x on the short-running applications presented here)
- The majority of the overhead is attributable to hash lookups (for tracking data transfers) and logfile writing
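Because all four logs share the invocation ID as key, they can be joined offline. As an illustration only (the record layouts and the icount_of helper are our assumptions, not the pintool's actual format), the total dynamic instruction count attributed to one function name could be recovered by joining the -tfunction and -ticount logs:

```c
#include <string.h>

/* Illustrative record shapes for two of the four logs, both keyed
 * by the unique invocation ID (field names are our own). */
struct fn_rec     { unsigned id; const char *name; };     /* -tfunction */
struct icount_rec { unsigned id; unsigned long icount; }; /* -ticount   */

/* Join the function log with the instruction-count log and total
 * the dynamic instructions attributed to one function name. */
static unsigned long icount_of(const char *fn,
                               const struct fn_rec *f, int nf,
                               const struct icount_rec *c, int nc)
{
    unsigned long total = 0;
    for (int i = 0; i < nf; ++i)
        if (strcmp(f[i].name, fn) == 0)
            for (int j = 0; j < nc; ++j)
                if (c[j].id == f[i].id)
                    total += c[j].icount;
    return total;
}
```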
Methodology Pt 2: Evaluate Execution
- n accelerators
- Program runtime is a function of computation (volumes and rates) and communication (volumes and rates)
- Enables evaluation of:
  - acceleration potential in the limit
  - sensitivity of that potential to hardware parameters, program inputs, etc.
[Figure: the decorated call graph from Part 1.]
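As a concrete, deliberately simplified reading of "runtime as a function of computation and communication rates": if each region's compute time is its instruction count divided by its unit's rate, and each transfer's time is its bytes divided by the link bandwidth, a fully serialized estimate is just the sum. The structs and the serialization assumption are ours, for illustration; sweeping the rates and bandwidths in such a model is what exposes whether computation or communication bounds the achievable speedup.

```c
/* Toy runtime model: serialized compute regions plus transfers. */
struct region { double instructions, rate;      };
struct xfer   { double bytes,        bandwidth; };

static double runtime(const struct region *r, int nr,
                      const struct xfer *x, int nx)
{
    double t = 0.0;
    for (int i = 0; i < nr; ++i)
        t += r[i].instructions / r[i].rate;   /* compute time     */
    for (int i = 0; i < nx; ++i)
        t += x[i].bytes / x[i].bandwidth;     /* transfer time    */
    return t;
}
```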
Why is gprof not sufficient?
Its runtimes are machine- and algorithm-dependent, and it does not capture data transfers between invocations.
Methodology In Practice
Two case studies: image rotation and JPEG decode.
Simple Example: Image Rotation

#define PIX(x,y) raster[(x) + (y)*wd]

unsigned wd, ht, maxval, *raster;

int main(int argc, char **argv)
{
  if (argc != 2 || (argv[1][0] != 'r' && argv[1][0] != 'i')) {
    printf("USAGE: rotate [ir]\n"), exit(0);
  }
  read_ppm();
  if (argv[1][0] == 'r') rec_rot(0, 0, wd);
  else iter_rot();
  write_ppm();
  return 0;
}
Image Rotation: Recursive Algorithm

void rec_rot(int x, int y, int s)
{
  int i, j;
  s >>= 1;
  for (i = 0; i < s; ++i)
    for (j = 0; j < s; ++j) {
      int rgb = PIX(x+i, y+j);
      PIX(x+i,   y+j  ) = PIX(x+i,   y+j+s);
      PIX(x+i,   y+j+s) = PIX(x+i+s, y+j+s);
      PIX(x+i+s, y+j+s) = PIX(x+i+s, y+j  );
      PIX(x+i+s, y+j  ) = rgb;
    }
  if (s <= 1) return;
  rec_rot(x,   y+s, s);
  rec_rot(x+s, y+s, s);
  rec_rot(x+s, y,   s);
  rec_rot(x,   y,   s);
}
Image Rotation: Iterative Algorithm

void iter_rot()
{
  int x, y, s;
  s = wd >> 1;
  for (y = 0; y < s; ++y)
    for (x = 0; x < s; ++x) {
      int rgb = PIX(x, y);
      PIX(x,      y     ) = PIX(y,      ht-x-1);
      PIX(y,      ht-x-1) = PIX(wd-x-1, ht-y-1);
      PIX(wd-x-1, ht-y-1) = PIX(wd-y-1, x     );
      PIX(wd-y-1, x     ) = rgb;
    }
}
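The in-place swap chain is easy to misread, so it is worth stating its net effect: for a square image with even side N, every pixel ends up as new(x, y) = old(y, N-1-x), i.e., a quarter-turn rotation. A self-contained check on a hypothetical 4x4 raster of distinct values (check_rotation is our helper, not part of the original program):

```c
#include <string.h>

#define N 4
#define PIX(x,y) raster[(x) + (y)*wd]

static unsigned wd = N, ht = N, raster[N*N];

/* iter_rot as on the slide: one 4-way pixel cycle per quadrant
 * position rotates the square image in place. */
static void iter_rot(void)
{
    unsigned x, y, s = wd >> 1;
    for (y = 0; y < s; ++y)
        for (x = 0; x < s; ++x) {
            unsigned rgb = PIX(x, y);
            PIX(x, y)           = PIX(y, ht-x-1);
            PIX(y, ht-x-1)      = PIX(wd-x-1, ht-y-1);
            PIX(wd-x-1, ht-y-1) = PIX(wd-y-1, x);
            PIX(wd-y-1, x)      = rgb;
        }
}

/* Verify the net effect: new(x,y) == old(y, N-1-x) everywhere. */
int check_rotation(void)
{
    unsigned i, x, y, before[N*N];
    for (i = 0; i < N*N; ++i) raster[i] = i;   /* distinct values */
    memcpy(before, raster, sizeof before);
    iter_rot();
    for (y = 0; y < N; ++y)
        for (x = 0; x < N; ++x)
            if (raster[x + y*N] != before[y + (N-1-x)*N])
                return 0;
    return 1;
}
```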
Experiment 1: Multiple Input Sizes
- The relative ratio of computation to communication is nearly constant across algorithms and input sizes
- Useful for a high-level survey of the input space
- Given this result, the analysis proceeds with a single input image size (128x128)
Experiment 2: Hotspots (Iterative Rotate)
Computational and communication hotspots become visible.
[Table: communication volume (bytes) among read_ppm, iter_rot, write_ppm, and other; computation (dynamic instructions): read_ppm 804340, iter_rot 114820, write_ppm 738584, other 1902.]
Experiment 2: Hotspots (Recursive Rotate)
[Table: communication volume (bytes) among read_ppm, rec_rot, write_ppm, and other, with per-function computation in dynamic instructions.]
Experiment 2: Hotspots (Recursive Rotate)
[Figure: hotspot breakdown for the recursive rotate.]
Experiment 4: Specialize Local Communication
- Specializing local communication in addition to computation dramatically improves potential speedups
- Suggests specialized data handling and movement is a way to contend with the memory wall
JPEG Decode Initial Results
Accelerator Architectures for JPEG Decode
- Not much gain from accelerating each of the top four functions individually
- If you had to select one function, it would be the #2 function in terms of computation, the IDCT
- Accelerating two functions improves the gains, but we see little value in a conjoined accelerator design rather than two individual accelerators
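The single-vs-multiple accelerator trade-off above follows the usual Amdahl-style arithmetic: accelerating functions that cover fractions f_i of runtime by factors s_i bounds the overall speedup by 1 / ((1 - Σf_i) + Σ(f_i/s_i)). A small sketch of that arithmetic (our illustration, not a formula from the talk):

```c
/* Amdahl-style estimate: f[i] is the runtime fraction covered by
 * function i, s[i] its accelerator speedup; the uncovered fraction
 * runs at the original rate. */
static double overall_speedup(const double *f, const double *s, int n)
{
    double covered = 0.0, scaled = 0.0;
    for (int i = 0; i < n; ++i) {
        covered += f[i];
        scaled  += f[i] / s[i];
    }
    return 1.0 / ((1.0 - covered) + scaled);
}
```

For example, a 10x accelerator on a function covering a quarter of runtime yields only about 1.29x overall, while covering two such functions yields about 1.82x, which matches the slide's observation that individual accelerators gain little.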
Future Work and Conclusions
- Validate the model against hardware (SVM library)
- Improve the efficiency of the pintool (move to an online version, not dependent on logs)
- Examine larger, real applications
- Future power and performance needs point in the direction of hardware specialization
- Understanding the interplay between an application and abstract architectural parameters can yield important design insights and performance targets
Thank You
martha@cs.columbia.edu
BACKUP
Experiment 3: The Memory Wall