Computation vs. Memory Systems: Pinning Down Accelerator Bottlenecks


SLIDE 1

Computation vs. Memory Systems: Pinning Down Accelerator Bottlenecks

Martha Kim and Stephen Edwards, Columbia University Department of Computer Science. June 19, 2010, AMAS-BT Workshop.

SLIDE 2

Larus' virtuous cycle: improved processor performance enables fuller-featured software, larger development teams, and high-level languages (HLLs) and abstractions; these yield slower programs, which in turn demand improved processor performance. The power wall now threatens the increases in power efficiency and performance that sustain this cycle.

SLIDE 6

Efficiency of Specialized Hardware

A spectrum of implementation platforms, in increasing order of specialization: general-purpose processor, ASIP (application-specific instruction processor), FPGA (field-programmable gate array), standard-cell ASIC, and full-custom ASIC. The chart annotates the performance and power gains between points on this spectrum, with ranges such as 3-10x, 6-8x, 10-40x, 10-350x, and 40-500x.

SLIDE 7

Accelerator System

General-purpose core(s) surrounded by many, many special-purpose accelerators, each powered on only when its function is needed. The potential benefits are available only if applications actually make use of the accelerators.

Diagram: processors P0 and P1 and accelerators A0 through A7 connected by shared communication/memory.


SLIDE 9

Talk Outline

  • Overall Vision
  • Accelerator System Model
  • Methodology Overview
  • Methodology in Practice
    • Image Rotation
    • JPEG
  • Conclusion

SLIDE 10

System-Level Vision

The programmer targets standard accelerated libraries (e.g., Java); the presence of high-level types supported by accelerators determines the boundary of accelerated code. Functions define the boundaries of acceleration.

SLIDE 11

Each accelerator must do two things

  • 1. Compute
  • 2. Communicate externally

Diagram: a CPU with accelerators foo, bar, and baz; computation happens within each accelerator, and communication flows between them and the CPU.


SLIDE 15

Computation Model

Dynamic invocation tree: main() calls foo(), which calls bar(), bar(), and baz().

  • 1. Dynamic function invocations are mapped to the corresponding accelerator (here, accelerators for foo, bar, and baz alongside the CPU).

SLIDE 21

Computation Model

Dynamic invocation tree: main() calls foo(), which calls bar(), bar(), and baz().

  • 1. Dynamic function invocations are mapped to the corresponding accelerator.
  • 2. Invocations without an accelerator are executed on the same core as their parent (e.g., with no bar accelerator, the bar() invocations run wherever foo() runs).

SLIDE 26

Communication Model

Dynamic invocation tree: main() calls foo(), which calls bar(), bar(), and baz().

Invocations communicate via load/store dependencies: one invocation stores A and B (st A, st B), and a later invocation loads them (ld A, ld B), creating a data transfer between the two.


SLIDE 32

Methodology Part 1: Examine the Application

A Pintool produces a decorated call graph over the dynamic invocation tree (main() calls foo(); foo() calls bar(), bar(), and baz()): each invocation is annotated with a computation weight (labeled i through m on the slide) and each communicating pair with a data-transfer weight (labeled a through f).

SLIDE 35

Pintool Functionality

  • The tool instruments calls, returns, loads, and stores.
  • Four logs are generated, all keyed off a unique invocation identifier:
    • Function: invocation ID -> function name (-tfunction, -bfunction)
    • Subcalls: invocation ID -> list of subcall IDs (-tsubcalls, -bsubcalls)
    • Instruction Count: invocation ID -> dynamic instruction count (-ticount, -bicount)
    • Data Transfers: invocation ID -> invocation ID, bytes (-txfers, -bxfers, -xfer-chunk)
  • At present, overhead is significant relative to native execution (approximately 2000x on the short-running applications presented here).
  • The majority of the overhead is attributable to hash lookups (for tracking data transfers) and logfile writing.

SLIDE 36

Methodology Part 2: Evaluate Execution

Given n accelerators, program runtime is modeled as a function of computation volumes and rates and communication volumes and rates. Together with the decorated call graph, this enables evaluation of:

  • acceleration potential in the limit
  • sensitivity of that potential to hardware parameters, program inputs, etc.

SLIDE 37

Why is gprof not sufficient?

  • Its runtimes are machine- and algorithm-dependent.
  • It does not capture data transfers between invocations.


SLIDE 39

Methodology in Practice

Two case studies: image rotation and JPEG decode.

SLIDE 40

Simple Example: Image Rotation

#define PIX(x,y) raster[(x) + (y)*wd]
unsigned wd, ht, maxval, *raster;

int main(int argc, char** argv) {
  if (argc != 2 || (argv[1][0] != 'r' && argv[1][0] != 'i')) {
    printf("USAGE: rotate [ir]\n"), exit(0);
  }
  read_ppm();
  if (argv[1][0] == 'r') rec_rot(0, 0, wd);
  else iter_rot();
  write_ppm();
  return 0;
}

SLIDE 41

Image Rotation: Recursive Algorithm

void rec_rot(int x, int y, int s) {
  int i, j;
  s >>= 1;
  for (i = 0; i < s; ++i)
    for (j = 0; j < s; ++j) {
      int rgb = PIX(x+i, y+j);
      PIX(x+i, y+j)     = PIX(x+i, y+j+s);
      PIX(x+i, y+j+s)   = PIX(x+i+s, y+j+s);
      PIX(x+i+s, y+j+s) = PIX(x+i+s, y+j);
      PIX(x+i+s, y+j)   = rgb;
    }
  if (s <= 1) return;
  rec_rot(x, y+s, s);
  rec_rot(x+s, y+s, s);
  rec_rot(x+s, y, s);
  rec_rot(x, y, s);
}

SLIDE 42

Image Rotation: Iterative Algorithm

void iter_rot() {
  int x, y, s;
  s = wd >> 1;
  for (y = 0; y < s; ++y)
    for (x = 0; x < s; ++x) {
      int rgb = PIX(x, y);
      PIX(x, y)           = PIX(y, ht-x-1);
      PIX(y, ht-x-1)      = PIX(wd-x-1, ht-y-1);
      PIX(wd-x-1, ht-y-1) = PIX(wd-y-1, x);
      PIX(wd-y-1, x)      = rgb;
    }
}

SLIDE 43

Experiment 1: Multiple Input Sizes

  • The relative ratio of computation to communication is nearly constant across algorithms and input sizes.
  • This is useful for a high-level survey of the input space.
  • Given this result, the remaining analysis uses a single input image size (128x128).


SLIDE 45

Experiment 2: Hotspots (Iterative Rotate)

Computational and communication hotspots become visible.

Communication volume (bytes), among read_ppm, iter_rot, write_ppm, and other:
  read_ppm: 607413, 49172
  iter_rot: 16387, 4101
  write_ppm: 5, 16384, 689163, 49155
  other: 4, 18, 491

Computation (dynamic instructions): 804340, 114820, 738584, 1902.


SLIDE 49

Experiment 2: Hotspots (Recursive Rotate)

Communication volume (bytes), among read_ppm, rec_rot, write_ppm, and other:
  read_ppm: 607413, 49172
  rec_rot: 27306, 216888, 4
  write_ppm: 5, 16384, 689163, 49155
  other: 4, 18, 491

Computation (dynamic instructions): 804340, 114820, 738584, 1902.

SLIDE 50

Experiment 2: Hotspots (Recursive Rotate)

SLIDE 51

Experiment 4: Specialize Local Communication

  • Specializing local communication in addition to computation dramatically improves the potential speedups.
  • This suggests that specialized data handling and movement is a way to contend with the memory wall.

SLIDE 52

JPEG Decode Initial Results

SLIDE 53

Accelerator Architectures for JPEG Decode

  • There is not much gain in accelerating each of the top four functions individually.
  • If you had to select one function, it would be the #2 function in terms of computation, the IDCT.
  • Accelerating two functions improves the gains, but we see little value in a conjoined two-function accelerator rather than two individual ones.


SLIDE 55

Future Work and Conclusions

  • Validate the model against hardware (SVM library).
  • Improve the efficiency of the Pintool (move to an online version that does not depend on logs).
  • Examine larger, real applications.
  • Future power and performance needs point in the direction of hardware specialization.
  • Understanding the interplay between an application and abstract architectural parameters can yield important design insights and performance targets.

SLIDE 56

Thank You

martha@cs.columbia.edu

SLIDE 57

BACKUP

SLIDE 58

Experiment 3: The Memory Wall