Computation vs. Memory Systems: Pinning Down Accelerator Bottlenecks - PowerPoint PPT Presentation
Computation vs. Memory Systems: Pinning Down Accelerator Bottlenecks
Martha Kim and Stephen Edwards
Columbia University, Department of Computer Science
June 19, 2010, AMAS-BT Workshop
Larus' virtuous cycle: improved processor performance enables fuller-featured software, larger development teams, and HLLs & abstractions, which yield slower programs, which in turn demand further increases in power efficiency and performance. The power wall now blocks that last step of the cycle.
Efficiency of Specialized Hardware
[Figure: performance and power plotted across the specialization spectrum: general-purpose processor, ASIP (application-specific instruction processor), FPGA (field-programmable gate array), standard cell ASIC, and full custom ASIC; the step-to-step gains shown range from 3-10x up to 40-500x.]
Accelerator System
General-purpose core(s) surrounded by many, many special-purpose accelerators that are powered on only when their function is needed. The potential benefits are available only if applications actually make use of the accelerators.
[Figure: cores P0 and P1 and accelerators A0-A7 connected via shared communication/memory.]
Talk Outline
- Overall Vision
- Accelerator System Model
- Methodology Overview
- Methodology in Practice
  - Image Rotation
  - JPEG
- Conclusion
System-Level Vision
The programmer targets standard accelerator libraries (e.g., Java); the presence of high-level types supported by accelerators determines the boundary of accelerated code. Functions define the boundaries of acceleration.
Each accelerator must do two things:
1. Compute
2. Communicate externally
[Figure: CPU with accelerators foo, bar, and baz, each performing computation and external communication.]
Computation Model
[Figure: dynamic invocation tree over main(), foo(), bar(), bar(), baz(), mapped onto a CPU and accelerators foo, bar, and baz.]
1. Dynamic function invocations are mapped to the corresponding accelerator
2. Invocations without an accelerator are executed on the same core as their parent
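The two mapping rules above can be sketched as a recursive assignment over the invocation tree. This is a minimal illustration under our own assumptions: the struct layout, the hypothetical accelerator table, and the function names are ours, not the tool's.

```c
#include <string.h>

/* A node in the dynamic invocation tree (illustrative shape). */
struct inv {
    const char *fn;        /* function name           */
    const char *unit;      /* assigned execution unit */
    struct inv *child[4];
    int nchild;
};

/* Hypothetical accelerator table: functions that have hardware. */
static const char *accel[] = { "foo", "bar" };

static const char *find_accel(const char *fn)
{
    for (unsigned i = 0; i < sizeof accel / sizeof *accel; ++i)
        if (strcmp(accel[i], fn) == 0)
            return accel[i];
    return 0;
}

/* Rule 1: an invocation with an accelerator runs on it.
 * Rule 2: otherwise it runs on the same unit as its parent. */
static void assign(struct inv *n, const char *parent_unit)
{
    const char *a = find_accel(n->fn);
    n->unit = a ? a : parent_unit;
    for (int i = 0; i < n->nchild; ++i)
        assign(n->child[i], n->unit);
}
```

For example, with accelerators for foo and bar only, a baz() invoked from main() stays on the CPU with its parent, while a foo() invocation moves to the foo accelerator.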
Communication Model
Invocations communicate via load/store dependencies.
[Figure: the same invocation tree, annotated with stores (st A, st B) and the matching loads (ld A, ld B) that create dependencies between invocations.]
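One way to realize load/store dependence tracking is a last-writer table: each store records its invocation as the current owner of the address, and each load charges a transfer from that owner to the loading invocation. A minimal sketch with illustrative sizes; the names and the one-byte-per-address granularity are our simplifications, not the talk's implementation.

```c
#define MAX_ADDR 64
#define MAX_INV  8

/* Last invocation to store to each address (-1 = never written). */
static int last_writer[MAX_ADDR];
/* bytes[i][j]: bytes flowing from invocation i to invocation j. */
static unsigned bytes[MAX_INV][MAX_INV];

static void init(void)
{
    for (int a = 0; a < MAX_ADDR; ++a)
        last_writer[a] = -1;
}

/* A store makes this invocation the producer for the address. */
static void on_store(int inv, int addr)
{
    last_writer[addr] = inv;
}

/* A load creates a dependence on the last writer of the address. */
static void on_load(int inv, int addr)
{
    int w = last_writer[addr];
    if (w >= 0 && w != inv)
        bytes[w][inv] += 1;   /* one byte per tracked address */
}
```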
Methodology Pt 1: Examine Application
Pintool ➔ decorated call graph
[Figure: call graph over main(), foo(), bar(), bar(), baz(), decorated per invocation with instruction counts (i, j, k, l, m) and data transfers (a, b, c, d, e, f).]
Pintool Functionality
- The tool instruments calls, returns, loads, and stores
- Four logs are generated, all keyed off of a unique invocation identifier:
  - Function: invocation ID ➔ function name (-tfunction, -bfunction)
  - Subcalls: invocation ID ➔ list of subcall IDs (-tsubcalls, -bsubcalls)
  - Instruction Count: invocation ID ➔ dynamic instruction count (-ticount, -bicount)
  - Data Transfers: invocation ID ➔ invocation ID, bytes (-txfers, -bxfers, -xfer-chunk)
- At present, significant overhead relative to native (approximately 2000x on the short-running applications presented here)
- The majority of the overhead is attributable to hash lookups (for tracking data transfers) and logfile writing
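Because all four logs share the invocation ID as key, they can be joined offline. As an illustration only (the record layouts and the icount_of helper are our assumptions, not the pintool's actual format), the total dynamic instruction count attributed to one function name could be recovered by joining the -tfunction and -ticount logs:

```c
#include <string.h>

/* Illustrative record shapes for two of the four logs, both keyed
 * by the unique invocation ID (field names are our own). */
struct fn_rec     { unsigned id; const char *name; };     /* -tfunction */
struct icount_rec { unsigned id; unsigned long icount; }; /* -ticount   */

/* Join the function log with the instruction-count log and total
 * the dynamic instructions attributed to one function name. */
static unsigned long icount_of(const char *fn,
                               const struct fn_rec *f, int nf,
                               const struct icount_rec *c, int nc)
{
    unsigned long total = 0;
    for (int i = 0; i < nf; ++i)
        if (strcmp(f[i].name, fn) == 0)
            for (int j = 0; j < nc; ++j)
                if (c[j].id == f[i].id)
                    total += c[j].icount;
    return total;
}
```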
Methodology Pt 2: Evaluate Execution
- n accelerators
- Program runtime is a function of computation (volumes and rates) and communication (volumes and rates)
- Enables evaluation of:
  - acceleration potential in the limit
  - sensitivity of that potential to hardware parameters, program inputs, etc.
[Figure: the decorated call graph from Part 1.]
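As a concrete, deliberately simplified reading of "runtime as a function of computation and communication rates": if each region's compute time is its instruction count divided by its unit's rate, and each transfer's time is its bytes divided by the link bandwidth, a fully serialized estimate is just the sum. The structs and the serialization assumption are ours, for illustration; sweeping the rates and bandwidths in such a model is what exposes whether computation or communication bounds the achievable speedup.

```c
/* Toy runtime model: serialized compute regions plus transfers. */
struct region { double instructions, rate;      };
struct xfer   { double bytes,        bandwidth; };

static double runtime(const struct region *r, int nr,
                      const struct xfer *x, int nx)
{
    double t = 0.0;
    for (int i = 0; i < nr; ++i)
        t += r[i].instructions / r[i].rate;   /* compute time     */
    for (int i = 0; i < nx; ++i)
        t += x[i].bytes / x[i].bandwidth;     /* transfer time    */
    return t;
}
```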
Why is gprof not sufficient?
Its runtimes are machine- and algorithm-dependent, and it does not capture data transfers between invocations.
Methodology In Practice
Two case studies: image rotation and JPEG decode.
Simple Example: Image Rotation

#define PIX(x,y) raster[(x) + (y)*wd]

unsigned wd, ht, maxval, *raster;

int main(int argc, char **argv)
{
  if (argc != 2 || (argv[1][0] != 'r' && argv[1][0] != 'i')) {
    printf("USAGE: rotate [ir]\n"), exit(0);
  }
  read_ppm();
  if (argv[1][0] == 'r') rec_rot(0, 0, wd);
  else iter_rot();
  write_ppm();
  return 0;
}
Image Rotation: Recursive Algorithm

void rec_rot(int x, int y, int s)
{
  int i, j;
  s >>= 1;
  for (i = 0; i < s; ++i)
    for (j = 0; j < s; ++j) {
      int rgb = PIX(x+i, y+j);
      PIX(x+i,   y+j  ) = PIX(x+i,   y+j+s);
      PIX(x+i,   y+j+s) = PIX(x+i+s, y+j+s);
      PIX(x+i+s, y+j+s) = PIX(x+i+s, y+j  );
      PIX(x+i+s, y+j  ) = rgb;
    }
  if (s <= 1) return;
  rec_rot(x,   y+s, s);
  rec_rot(x+s, y+s, s);
  rec_rot(x+s, y,   s);
  rec_rot(x,   y,   s);
}
Image Rotation: Iterative Algorithm

void iter_rot()
{
  int x, y, s;
  s = wd >> 1;
  for (y = 0; y < s; ++y)
    for (x = 0; x < s; ++x) {
      int rgb = PIX(x, y);
      PIX(x,      y     ) = PIX(y,      ht-x-1);
      PIX(y,      ht-x-1) = PIX(wd-x-1, ht-y-1);
      PIX(wd-x-1, ht-y-1) = PIX(wd-y-1, x     );
      PIX(wd-y-1, x     ) = rgb;
    }
}
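The in-place swap chain is easy to misread, so it is worth stating its net effect: for a square image with even side N, every pixel ends up as new(x, y) = old(y, N-1-x), i.e., a quarter-turn rotation. A self-contained check on a hypothetical 4x4 raster of distinct values (check_rotation is our helper, not part of the original program):

```c
#include <string.h>

#define N 4
#define PIX(x,y) raster[(x) + (y)*wd]

static unsigned wd = N, ht = N, raster[N*N];

/* iter_rot as on the slide: one 4-way pixel cycle per quadrant
 * position rotates the square image in place. */
static void iter_rot(void)
{
    unsigned x, y, s = wd >> 1;
    for (y = 0; y < s; ++y)
        for (x = 0; x < s; ++x) {
            unsigned rgb = PIX(x, y);
            PIX(x, y)           = PIX(y, ht-x-1);
            PIX(y, ht-x-1)      = PIX(wd-x-1, ht-y-1);
            PIX(wd-x-1, ht-y-1) = PIX(wd-y-1, x);
            PIX(wd-y-1, x)      = rgb;
        }
}

/* Verify the net effect: new(x,y) == old(y, N-1-x) everywhere. */
int check_rotation(void)
{
    unsigned i, x, y, before[N*N];
    for (i = 0; i < N*N; ++i) raster[i] = i;   /* distinct values */
    memcpy(before, raster, sizeof before);
    iter_rot();
    for (y = 0; y < N; ++y)
        for (x = 0; x < N; ++x)
            if (raster[x + y*N] != before[y + (N-1-x)*N])
                return 0;
    return 1;
}
```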
Experiment 1: Multiple Input Sizes
- The relative ratio of computation to communication is nearly constant across algorithms and input sizes
- Useful for a high-level survey of the input space
- Given this result, the analysis proceeds with a single input image size (128x128)
Experiment 2: Hotspots (Iterative Rotate)
Computational and communication hotspots become visible.
[Table: communication volume (bytes) among read_ppm, iter_rot, write_ppm, and other; computation (dynamic instructions): read_ppm 804340, iter_rot 114820, write_ppm 738584, other 1902.]
Experiment 2: Hotspots (Recursive Rotate)
[Table: communication volume (bytes) among read_ppm, rec_rot, write_ppm, and other, with per-function computation in dynamic instructions.]
Experiment 2: Hotspots (Recursive Rotate)
[Figure: hotspot breakdown for the recursive rotate.]
Experiment 4: Specialize Local Communication
- Specializing local communication in addition to computation dramatically improves potential speedups
- Suggests specialized data handling and movement is a way to contend with the memory wall
JPEG Decode Initial Results
Accelerator Architectures for JPEG Decode
- Not much gain from accelerating each of the top four functions individually
- If you had to select one function, it would be the #2 function in terms of computation, the IDCT
- Accelerating two functions improves the gains, but we see little value in a conjoined accelerator design rather than two individual accelerators
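The single-vs-multiple accelerator trade-off above follows the usual Amdahl-style arithmetic: accelerating functions that cover fractions f_i of runtime by factors s_i bounds the overall speedup by 1 / ((1 - Σf_i) + Σ(f_i/s_i)). A small sketch of that arithmetic (our illustration, not a formula from the talk):

```c
/* Amdahl-style estimate: f[i] is the runtime fraction covered by
 * function i, s[i] its accelerator speedup; the uncovered fraction
 * runs at the original rate. */
static double overall_speedup(const double *f, const double *s, int n)
{
    double covered = 0.0, scaled = 0.0;
    for (int i = 0; i < n; ++i) {
        covered += f[i];
        scaled  += f[i] / s[i];
    }
    return 1.0 / ((1.0 - covered) + scaled);
}
```

For example, a 10x accelerator on a function covering a quarter of runtime yields only about 1.29x overall, while covering two such functions yields about 1.82x, which matches the slide's observation that individual accelerators gain little.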
Future Work and Conclusions
- Validate the model against hardware (SVM library)
- Improve the efficiency of the pintool (move to an online version, not dependent on logs)
- Examine larger, real applications
- Future power and performance needs point in the direction of hardware specialization
- Understanding the interplay between an application and abstract architectural parameters can yield important design insights and performance targets
Thank You
martha@cs.columbia.edu
BACKUP
Experiment 3: The Memory Wall