High performance data processing with Halide


  1. High performance data processing with Halide Roel Jordans

  2. High performance computing

  3. Architecture Trends

  4. Processing platform architectures
     [diagram: a platform with CPUs and a GPU, each behind caches or local memory, all sharing main memory; latency and energy cost increase with distance from the processor]

  5. Programming challenges
     ● Data movement is costly
       ● Requires early planning of data distribution
       ● May require explicit moves from CPU to GPU memory
     ● Traditional programming languages (like C) are not designed for this kind of architecture
       ● Require library support
       ● Highly customized code
       ● Vendor specific, low portability

  6. An example: 3x3 box filter
     [diagram: two-stage pipeline, a horizontal blur followed by a vertical blur]
     • A simple, two-stage imaging pipeline: 3x3 blur.
     • Basic function: a summation over a 3x3 area:
       by(x, y) = sum over i, j in {-1, 0, 1} of in(x+i, y+j)
     • We leave out the averaging step.

  7. Blur: Inlined implementation (C code)

     int in[W*H];
     int by[W*H];
     for (int y = 1; y < (H-1); y++) {
       for (int x = 1; x < (W-1); x++) {
         by[x + y*W] = in[(x-1) + (y-1)*W] + in[(x-1) + y*W] + in[(x-1) + (y+1)*W]
                     + in[ x    + (y-1)*W] + in[ x    + y*W] + in[ x    + (y+1)*W]
                     + in[(x+1) + (y-1)*W] + in[(x+1) + y*W] + in[(x+1) + (y+1)*W];
       }
     }

     - 9 loads per output pixel
     - 8 additions per output pixel
     - Minimal memory footprint
     - Completely parallelizable (independent pixels)
     - Unnecessary recomputation

  8. Blur: Stored implementation (C code)

     int in[W*H];
     int bx[W*H];
     int by[W*H];
     for (int y = 0; y < H; y++) {
       for (int x = 1; x < (W-1); x++) {
         bx[x + y*W] = in[(x-1) + y*W] + in[x + y*W] + in[(x+1) + y*W];
       }
     }
     for (int y = 1; y < (H-1); y++) {
       for (int x = 1; x < (W-1); x++) {
         by[x + y*W] = bx[x + (y-1)*W] + bx[x + y*W] + bx[x + (y+1)*W];
       }
     }

     - 6 loads, 1 store per output pixel
     - 4 additions per output pixel
     - Very low locality (big buffer)
     - No recomputation
     - Still parallelizable

  9. Blur: Fused pipeline (C code)

     int in[W*H];
     int bx[W*H];
     int by[W*H];
     for (int y = 0; y < 2; y++) {               // prologue: the first two bx rows
       for (int x = 1; x < (W-1); x++) {
         bx[x + y*W] = in[(x-1) + y*W] + in[x + y*W] + in[(x+1) + y*W];
       }
     }
     for (int y = 1; y < (H-1); y++) {
       for (int x = 1; x < (W-1); x++) {
         bx[x + (y+1)*W] = in[(x-1) + (y+1)*W] + in[x + (y+1)*W] + in[(x+1) + (y+1)*W];
         by[x + y*W] = bx[x + (y-1)*W] + bx[x + y*W] + bx[x + (y+1)*W];
       }
     }

     - 6 loads, 1 store per output pixel
     - 4 additions per output pixel
     - Not directly parallelizable
     - High locality (producer and consumer moved together)
     - No recomputation

  10. Blur: Folded storage (C code)

     int in[W*H];
     int bx[W*3];                                 // only 3 rows of bx live at a time
     int by[W*H];
     for (int y = 0; y < 2; y++) {
       for (int x = 1; x < (W-1); x++) {
         bx[x + y*W] = in[(x-1) + y*W] + in[x + y*W] + in[(x+1) + y*W];
       }
     }
     for (int y = 1; y < (H-1); y++) {
       for (int x = 1; x < (W-1); x++) {
         // rows of bx are reused modulo 3
         bx[x + ((y+1)%3)*W] = in[(x-1) + (y+1)*W] + in[x + (y+1)*W] + in[(x+1) + (y+1)*W];
         by[x + y*W] = bx[x + ((y-1)%3)*W] + bx[x + (y%3)*W] + bx[x + ((y+1)%3)*W];
       }
     }

     - Same results as the last slide, but:
     - With a smaller intermediate buffer (W*3 instead of W*H)

  11. Data mapping freedom
     ● Two extremes:
       ● Inline everything → lots of computation
       ● Store everything → lots of memory required
     ● Many options in between
       ● Can be tuned to match the memory hierarchy!
       ● Can result in really complex loop structures

  12. Next level optimizations
     C allows us to specifically program the execution order.
     Many optimizations:
     • Loop fusion, storage folding, tiling, multi-threading, vectorization, ...
     • Mostly obscure functionality
     • Most are architecture specific
     • Requires rewriting and debugging to optimize
     • Exploration of optimizations is increasingly difficult

  13. Blur optimized

     Halide (complete implementation):

     Func blur_3x3(Func in) {
       Func bx, by;
       Var x, y, xi, yi;

       // The algorithm - no storage or order
       bx(x, y) = in(x-1, y) + in(x, y) + in(x+1, y);
       by(x, y) = bx(x, y-1) + bx(x, y) + bx(x, y+1);

       // The schedule - defines order, locality; implies storage
       by.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
       bx.compute_at(by, x).vectorize(x, 8);

       return by;
     }

     C (partial implementation):

     #pragma omp parallel for
     for (int yTile = 0; yTile < out.height(); yTile += 32) {
       __m128i a, b, c, sum, avg;
       __m128i tmp[(128/8) * (32+2)];
       for (int xTile = 0; xTile < out.width(); xTile += 128) {
         __m128i *tmpPtr = tmp;
         for (int y = 0; y < 32+2; y++) {
           const uint16_t *inPtr = &(in(xTile, yTile+y));
           for (int x = 0; x < 128; x += 8) {
             a = _mm_load_si128((const __m128i *)(inPtr));
             b = _mm_loadu_si128((const __m128i *)(inPtr+1));
             c = _mm_loadu_si128((const __m128i *)(inPtr+2));
             sum = _mm_add_epi16(_mm_add_epi16(a, b), c);
             avg = _mm_mulhi_epi16(sum, one_third);
             _mm_store_si128(tmpPtr++, avg);
             inPtr += 8;
           }
         }
         tmpPtr = tmp;
         for (int y = 0; y < 32; y++) {
           __m128i *outPtr = (__m128i *)(&(out(xTile, yTile+y)));
           for (int x = 0; x < 128; x += 8) {
             a = _mm_load_si128(tmpPtr + (2*128)/8);
             b = _mm_load_si128(tmpPtr + 128/8);
             c = _mm_load_si128(tmpPtr++);
             sum = _mm_add_epi16(_mm_add_epi16(a, b), c);
             avg = _mm_mulhi_epi16(sum, one_third);
             _mm_store_si128(outPtr++, avg);
           }
         }
       }
     }

  14. Halide?
     ● A domain specific language (DSL) targeting image processing pipelines
     ● Embedded in C++ → uses a pre-existing compiler for most of the heavy lifting (see the sketch below)
     ● Available for many target architectures (x86, ARM, CUDA, ...)
     ● Support from industry: Google, Adobe
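To make the "embedded in C++" point concrete, here is a minimal sketch of a complete Halide program, assuming a working Halide installation. The gradient function is an illustrative placeholder, not taken from these slides; recent Halide releases pass the output sizes to realize() as a braced list, whereas the slides use the older realize(w, h) form.

     #include "Halide.h"
     using namespace Halide;

     int main() {
         Func gradient;           // a pipeline stage, defined functionally
         Var x, y;                // pure loop variables, no fixed order yet
         gradient(x, y) = x + y;  // the algorithm: what each output pixel is

         // The regular C++ compiler only builds a pipeline description;
         // Halide JIT-compiles and runs the loop nest at this call.
         Buffer<int> out = gradient.realize({800, 600});
         return 0;
     }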

  15. Halide!
     ● Main idea
       ● Decouple the algorithm definition from the optimization schedule
         → Apply optimizations without complicating the code
     ● Result
       ● Easier and faster design space exploration
       ● Improved readability and portability
         → For a new architecture we should only change the schedule

  16. Blur: Halide
     [diagram: two-stage pipeline, a horizontal blur followed by a vertical blur]

     Func in, bx, by;
     Var x, y;
     bx(x,y) = in(x-1,y) + in(x,y) + in(x+1,y);
     by(x,y) = bx(x,y-1) + bx(x,y) + bx(x,y+1);
     by.realize(10,10); // build and execute the loop nest over a 10x10 area

  17. Blur: Halide

     Func bx, by, in;
     Var x, y;
     bx(x,y) = in(x-1,y) + in(x,y) + in(x+1,y);
     by(x,y) = bx(x,y-1) + bx(x,y) + bx(x,y+1);

     Note that in the body, there is no notion of:
     • time (execution order)
     • space (buffer assignment, image size, memory allocation)
     • hardware (because there is no time and space)
     This gives:
     • a very clear, concise and readable algorithm
     • we have not chosen any optimization strategy yet
       • e.g. we can use this same starting point on any target architecture
       • (in C, even a naïve implementation would already require scheduling decisions)

  18. Scheduling Halide

     Func bx, by, in;
     Var x, y;
     bx(x,y) = in(x-1,y) + in(x,y) + in(x+1,y);
     by(x,y) = bx(x,y-1) + bx(x,y) + bx(x,y+1);
     by.realize(10,10);

     • Internally, Halide converts this functional representation to a C-like loop nest.
     • By default, if nothing else is done, everything is inlined.
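As an illustration, here is a hand-written C-like sketch of what the default schedule conceptually generates (not actual Halide output; assume in(x, y) returns the input pixel and boundary handling is left out). Inlining bx means its expression is recomputed inside by:

     int by[10][10];
     for (int y = 0; y < 10; y++) {
         for (int x = 0; x < 10; x++) {
             by[y][x] = (in(x-1, y-1) + in(x, y-1) + in(x+1, y-1))   // bx(x, y-1)
                      + (in(x-1, y  ) + in(x, y  ) + in(x+1, y  ))   // bx(x, y)
                      + (in(x-1, y+1) + in(x, y+1) + in(x+1, y+1));  // bx(x, y+1)
         }
     }

This is exactly the inlined implementation of slide 7: minimal memory, but 9 loads and 8 additions per output pixel.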

  19. Scheduling Halide

     Func bx, by, in;
     Var x, y;
     bx(x,y) = in(x-1,y) + in(x,y) + in(x+1,y);
     by(x,y) = bx(x,y-1) + bx(x,y) + bx(x,y+1);
     bx.compute_root();
     by.realize(10,10);

     compute_root(): compute and store all outputs of a (producer) function before starting computation of the next.
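A C-like sketch of the loop nest this schedule implies (again hand-written and simplified; the exact bounds Halide infers are approximated here):

     int bx[12][10];                   // all of bx that by needs: rows -1..10
     for (int y = -1; y <= 10; y++)
         for (int x = 0; x < 10; x++)
             bx[y+1][x] = in(x-1, y) + in(x, y) + in(x+1, y);

     int by[10][10];                   // consumer runs only after the producer is done
     for (int y = 0; y < 10; y++)
         for (int x = 0; x < 10; x++)
             by[y][x] = bx[y][x] + bx[y+1][x] + bx[y+2][x];

This corresponds to the stored implementation of slide 8: no recomputation, but the whole intermediate buffer is live at once.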

  20. Scheduling Halide

     Func bx, by, in;
     Var x, y;
     bx(x,y) = in(x-1,y) + in(x,y) + in(x+1,y);
     by(x,y) = bx(x,y-1) + bx(x,y) + bx(x,y+1);
     bx.compute_at(by, y);
     by.realize(10,10);

     • compute_root() is actually a special case of compute_at().
     • compute_at(by, y) means: "Whenever stage by starts an iteration of the y loop, first calculate the pixels of stage bx that will be consumed."
     • In other words: computation of bx is fused at the loop over y of by.
     • This is not completely equivalent to our initial fused version (see the sketch below).
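A simplified C-like sketch of the resulting loop nest (hand-written; assume in(x, y) provides the input). Because each y iteration computes all three bx rows it consumes, overlapping rows are recomputed across iterations, which is why this is not exactly the hand-fused version of slide 9:

     int by[10][10];
     for (int y = 0; y < 10; y++) {
         int bx[3][10];                      // only the rows needed by by(., y)
         for (int yy = y - 1; yy <= y + 1; yy++)
             for (int x = 0; x < 10; x++)
                 bx[yy - (y-1)][x] = in(x-1, yy) + in(x, yy) + in(x+1, yy);
         for (int x = 0; x < 10; x++)
             by[y][x] = bx[0][x] + bx[1][x] + bx[2][x];
     }

Locality is high (bx fits in three rows), but each bx row is computed up to three times.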

  21. Scheduling Halide

     Func bx, by, in;
     Var x, y;
     bx(x,y) = in(x-1,y) + in(x,y) + in(x+1,y);
     by(x,y) = bx(x,y-1) + bx(x,y) + bx(x,y+1);
     bx.compute_at(by, y);
     bx.store_root();
     by.realize(10,10);

     • For this, we can separate computation from storage using store_at() and store_root().
     • bx.store_root() means: "Allocate the buffer for bx outside the loop nest."
     • Halide automatically applies storage folding as well!
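A simplified C-like sketch of what compute_at(by, y) plus store_root() implies (hand-written; the modulo-3 folding mirrors what Halide derives automatically). The buffer now survives across y iterations, so each iteration only produces the single new row it is missing:

     int bx[3][10];                          // folded: 3 rows instead of all of bx
     for (int yy = -1; yy <= 0; yy++)        // warm-up: the first two rows
         for (int x = 0; x < 10; x++)
             bx[(yy + 3) % 3][x] = in(x-1, yy) + in(x, yy) + in(x+1, yy);

     int by[10][10];
     for (int y = 0; y < 10; y++) {
         for (int x = 0; x < 10; x++)        // produce only the new row y+1
             bx[(y + 1) % 3][x] = in(x-1, y+1) + in(x, y+1) + in(x+1, y+1);
         for (int x = 0; x < 10; x++)
             by[y][x] = bx[(y - 1 + 3) % 3][x] + bx[y % 3][x] + bx[(y + 1) % 3][x];
     }

This recovers the fused, folded implementation of slide 10: no recomputation, high locality, and only a 3-row intermediate buffer.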

  22. Many more scheduling options
     • We looked at syntax which interleaves computation between stages.
     • There is also syntax which changes the order of computation within a single stage:
       • Reorder loop variables → by.reorder(y, x);
       • Split a loop variable into inner and outer → by.split(x, xout, xin, 4);
     • Tiling is just a combination of the above:
       • by.split(x, xout, xin, 4);
       • by.split(y, yout, yin, 4);
       • by.reorder(xin, yin, xout, yout);
     • Because this is so common, syntactic sugar (a "shortcut") is offered:
       • by.tile(x, y, xout, yout, xin, yin, 4, 4);
     • Execute loop iterations in parallel using multi-threading:
       • by.parallel(x); // executes each x iteration simultaneously in threads
     • Turn a loop into a (series of) vector operation(s):
       • by.vectorize(xin); // the loop over xin, which has 4 iterations, is vectorized
       • by.vectorize(x, 4); // shortcut: split x into outer and inner of 4, then vectorize inner
     (A sketch combining these options follows below.)
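As a sketch of how these options combine (the schedule below is illustrative, not from the slides; parallelizing the outer tile loop yout is my choice of example), a 4x4 tiled, vectorized, multi-threaded schedule for by could look like:

     Var xout, xin, yout, yin;
     by.split(x, xout, xin, 4)          // x -> xout, xin (xin has 4 iterations)
       .split(y, yout, yin, 4)          // y -> yout, yin
       .reorder(xin, yin, xout, yout)   // together with the splits: a 4x4 tiling
       .vectorize(xin)                  // the 4-wide inner loop becomes vector ops
       .parallel(yout);                 // rows of tiles run on separate threads

     // Equivalent shortcut for the first three scheduling calls:
     //   by.tile(x, y, xout, yout, xin, yin, 4, 4);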
