SLIDE 1

High performance data processing with Halide

Roel Jordans

SLIDE 2


High performance computing

SLIDE 3

Architecture Trends


SLIDE 4


Processing platform architectures

(Diagram: CPUs with caches and a GPU with local memory, all connected to main memory; latency and energy increase with distance from the compute units.)

SLIDE 5

Programming challenges


  • Data movement is costly
  • Requires early planning of data distribution
  • May require explicit moves from CPU to GPU memory (see the sketch after this list)
  • Traditional programming languages (like C) are not designed for this kind of architecture
  • Require library support
  • Highly customized code
  • Vendor specific, low portability
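To make the "explicit moves" point concrete, here is a minimal sketch in C++ with the CUDA runtime (not from the slides; the buffer names and the omitted kernel are hypothetical):

#include <cuda_runtime.h>
#include <cstddef>

// Hedged sketch (hypothetical names): every move between CPU and GPU
// memory must be written out, and planned, by the programmer.
void process_on_gpu(float *host_buf, size_t n) {
  float *dev_buf = nullptr;
  cudaMalloc(&dev_buf, n * sizeof(float));        // allocate GPU memory
  cudaMemcpy(dev_buf, host_buf, n * sizeof(float),
             cudaMemcpyHostToDevice);             // explicit CPU -> GPU move
  // ... launch a kernel that processes dev_buf here ...
  cudaMemcpy(host_buf, dev_buf, n * sizeof(float),
             cudaMemcpyDeviceToHost);             // explicit GPU -> CPU move
  cudaFree(dev_buf);
}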
SLIDE 6

An example: 3x3 box filter


(Diagram: horizontal blur stage followed by vertical blur stage.)

  • A simple, two-stage imaging pipeline: a 3x3 blur.
  • Basic function: a summation over a 3x3 area (written out below).
  • We leave out the averaging step.
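Written out (reconstructed from the C code on the next slide, since the formula itself did not survive as text), the summation per output pixel is:

$$by(x, y) = \sum_{j=-1}^{1}\sum_{i=-1}^{1} in(x+i,\, y+j)$$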
SLIDE 7

Blur: Inlined implementation


C code

int in[W*H];
int by[W*H];
for (int y = 1; y < (H-1); y++) {
  for (int x = 1; x < (W-1); x++) {
    by[x + y*W] = in[(x-1) + (y-1)*W] + in[(x-1) + y*W] + in[(x-1) + (y+1)*W]
                + in[x     + (y-1)*W] + in[x     + y*W] + in[x     + (y+1)*W]
                + in[(x+1) + (y-1)*W] + in[(x+1) + y*W] + in[(x+1) + (y+1)*W];
  }
}

  • 9 loads per output pixel
  • 8 additions per output pixel
  • Minimal memory footprint
  • Completely parallelizable (independent pixels)
  • Unnecessary recomputation
SLIDE 8

Blur: Stored implementation


C code

int in[W*H];
int bx[W*H];
int by[W*H];
for (int y = 0; y < H; y++) {
  for (int x = 1; x < (W-1); x++) {
    bx[x + y*W] = in[x-1 + y*W] + in[x + y*W] + in[x+1 + y*W];
  }
}
for (int y = 1; y < (H-1); y++) {
  for (int x = 1; x < (W-1); x++) {
    by[x + y*W] = bx[x + (y-1)*W] + bx[x + y*W] + bx[x + (y+1)*W];
  }
}

  • 6 loads, 1 store per output pixel
  • 4 additions per output pixel
  • Very low locality (big buffer)
  • No recomputation
  • Still parallelizable
SLIDE 9

Blur: Fused pipeline


C code

int in[W*H];
int bx[W*H];
int by[W*H];
for (int y = 0; y < 2; y++) {
  for (int x = 1; x < (W-1); x++) {
    bx[x + y*W] = in[x-1 + y*W] + in[x + y*W] + in[x+1 + y*W];
  }
}
for (int y = 1; y < (H-1); y++) {
  for (int x = 1; x < (W-1); x++) {
    bx[x + (y+1)*W] = in[x-1 + (y+1)*W] + in[x + (y+1)*W] + in[x+1 + (y+1)*W];
    by[x + y*W] = bx[x + (y-1)*W] + bx[x + y*W] + bx[x + (y+1)*W];
  }
}

  • 6 loads, 1 store per output pixel
  • 4 additions per output pixel
  • Not directly parallelizable
  • High locality (producer and consumer moved together)
  • No recomputation
SLIDE 10

Blur: Folded storage


C code

int in[W*H];
int bx[W*3];   // folded: only 3 rows of intermediate storage
int by[W*H];
for (int y = 0; y < 2; y++) {
  for (int x = 1; x < (W-1); x++) {
    bx[x + y*W] = in[x-1 + y*W] + in[x + y*W] + in[x+1 + y*W];
  }
}
for (int y = 1; y < (H-1); y++) {
  for (int x = 1; x < (W-1); x++) {
    bx[x + ((y+1)%3)*W] = in[x-1 + (y+1)*W] + in[x + (y+1)*W] + in[x+1 + (y+1)*W];
    by[x + y*W] = bx[x + ((y-1)%3)*W] + bx[x + (y%3)*W] + bx[x + ((y+1)%3)*W];
  }
}

  • Same results as the last slide, but with a smaller intermediate buffer (W*3 instead of W*H)

SLIDE 11

Data mapping freedom


  • Two extremes:
  • Inline everything → lots of computation
  • Store everything → lots of memory required
  • Many options in between
  • Can be tuned to match the memory hierarchy!
  • Can result in really complex loop structures
SLIDE 12

Next level optimizations


C allows us to specifically program the execution order.

Many optimizations:

  • Loop fusion, storage folding, tiling, multi-threading, vectorization, ...
  • Most obscure the functionality of the code
  • Most are architecture specific
  • Requires rewriting and debugging to optimize
  • Exploration of optimizations is increasingly difficult
SLIDE 13

Blur optimized


#pragma omp parallel for
for (int yTile = 0; yTile < out.height(); yTile += 32) {
  __m128i a, b, c, sum, avg;
  __m128i tmp[(128/8) * (32 + 2)];
  for (int xTile = 0; xTile < out.width(); xTile += 128) {
    __m128i *tmpPtr = tmp;
    for (int y = 0; y < 32+2; y++) {
      const uint16_t *inPtr = &(in(xTile, yTile+y));
      for (int x = 0; x < 128; x += 8) {
        a = _mm_load_si128((const __m128i*)(inPtr));
        b = _mm_loadu_si128((const __m128i*)(inPtr+1));
        c = _mm_loadu_si128((const __m128i*)(inPtr+2));
        sum = _mm_add_epi16(_mm_add_epi16(a, b), c);
        avg = _mm_mulhi_epi16(sum, one_third);
        _mm_store_si128(tmpPtr++, avg);
        inPtr += 8;
      }
    }
    tmpPtr = tmp;
    for (int y = 0; y < 32; y++) {
      __m128i *outPtr = (__m128i *)(&(out(xTile, yTile+y)));
      for (int x = 0; x < 128; x += 8) {
        a = _mm_load_si128(tmpPtr+(2*128)/8);
        b = _mm_load_si128(tmpPtr+128/8);
        c = _mm_load_si128(tmpPtr++);
        sum = _mm_add_epi16(_mm_add_epi16(a, b), c);
        avg = _mm_mulhi_epi16(sum, one_third);
        _mm_store_si128(outPtr++, avg);
      }
    }
  }
}

Func blur_3x3(Func in) {
  Func bx, by;
  Var x, y, xi, yi;

  // The algorithm - no storage or order
  bx(x, y) = in(x-1, y) + in(x, y) + in(x+1, y);
  by(x, y) = bx(x, y-1) + bx(x, y) + bx(x, y+1);

  // The schedule - defines order, locality; implies storage
  by.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
  bx.compute_at(by, x).vectorize(x, 8);

  return by;
}

C (partial implementation)

Halide (complete implementation)

SLIDE 14

Halide?


  • A domain-specific language (DSL) targeting image processing pipelines
  • Embedded in C++ → uses a pre-existing compiler for most of the heavy lifting
  • Available for many target architectures (x86, ARM, CUDA, …)
  • Support from industry: Google, Adobe
SLIDE 15

Halide!


  • Main idea
  • Decouple the algorithm definition from the optimization schedule
  → Apply optimizations without complicating the code

  • Result
  • Easier and faster design space exploration
  • Improved readability and portability
  → For a new architecture we should only change the schedule

SLIDE 16

Blur: Halide


Func in, bx, by;
Var x, y;

bx(x, y) = in(x-1, y) + in(x, y) + in(x+1, y);
by(x, y) = bx(x, y-1) + bx(x, y) + bx(x, y+1);

by.realize(10, 10);  // build and execute the loop nest

  • realize(10, 10) executes the pipeline over a 10x10 area.
SLIDE 17

Blur: Halide


Func bx, by, in;
Var x, y;

bx(x, y) = in(x-1, y) + in(x, y) + in(x+1, y);
by(x, y) = bx(x, y-1) + bx(x, y) + bx(x, y+1);

Note that in the body, there is no notion of:

  • time (execution order)
  • space (buffer assignment, image size, memory allocation)
  • hardware (because there is no time and space)

  • This is a very clear, concise, and readable algorithm.
  • We have not chosen any optimization strategy yet.
  • E.g. we can use this same starting point on any target architecture.
  • (In C, a naïve implementation would already require scheduling decisions.)
SLIDE 18

Scheduling


Halide

Func bx, by, in;
Var x, y;

bx(x, y) = in(x-1, y) + in(x, y) + in(x+1, y);
by(x, y) = bx(x, y-1) + bx(x, y) + bx(x, y+1);

by.realize(10, 10);

  • Internally, Halide converts this functional representation to a C-like loop nest.
  • By default, if nothing else is done, everything is inlined (a sketch of the resulting loop nest follows).
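A rough sketch (not from the slides) of that default loop nest for by.realize(10, 10), in the style of the earlier C examples (in(x, y) stands for an input read; boundary handling omitted):

// Fully inlined schedule: bx is re-evaluated inside by at every pixel,
// just like the "inlined implementation" C version earlier.
for (int y = 0; y < 10; y++) {
  for (int x = 0; x < 10; x++) {
    by[x + y*W] = (in(x-1, y-1) + in(x, y-1) + in(x+1, y-1))   // bx(x, y-1)
                + (in(x-1, y  ) + in(x, y  ) + in(x+1, y  ))   // bx(x, y)
                + (in(x-1, y+1) + in(x, y+1) + in(x+1, y+1));  // bx(x, y+1)
  }
}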
SLIDE 19

Scheduling


Halide

Func bx, by, in;
Var x, y;

bx(x, y) = in(x-1, y) + in(x, y) + in(x+1, y);
by(x, y) = bx(x, y-1) + bx(x, y) + bx(x, y+1);

bx.compute_root();
by.realize(10, 10);

compute_root(): compute and store all outputs of a (producer) function before starting computation of the next.
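Conceptually (a sketch, not from the slides) this produces two separate loop nests: all of bx is computed into a full-size buffer first, and only then does by run:

// compute_root(): the producer finishes completely before the consumer
// starts. bx needs rows -1..10, stored here at buffer row y+1.
for (int y = -1; y <= 10; y++)
  for (int x = 0; x < 10; x++)
    bx[x + (y+1)*W] = in(x-1, y) + in(x, y) + in(x+1, y);

for (int y = 0; y < 10; y++)
  for (int x = 0; x < 10; x++)
    by[x + y*W] = bx[x + y*W] + bx[x + (y+1)*W] + bx[x + (y+2)*W];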

SLIDE 20

Scheduling


Halide

Func bx, by, in;
Var x, y;

bx(x, y) = in(x-1, y) + in(x, y) + in(x+1, y);
by(x, y) = bx(x, y-1) + bx(x, y) + bx(x, y+1);

bx.compute_at(by, y);
by.realize(10, 10);

  • compute_root() is actually a special case of compute_at().
  • compute_at(by, y) means: “Whenever stage by starts an iteration of the y loop, first calculate the pixels of stage bx that will be consumed.”
  • In other words: the computation of bx is fused into the loop over y of by.

This is not completely equivalent to our initial fused version (see the sketch below).
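A sketch (not from the slides) of the loop nest this yields. Because the small bx buffer is refilled from scratch in every y iteration, two of its three rows are recomputed each time, which is exactly where it differs from the hand-fused version:

// compute_at(by, y): the three bx rows consumed by output row y are
// computed at the start of each y iteration (with recomputation).
for (int y = 0; y < 10; y++) {
  int bx[3*W];                               // fresh buffer per iteration
  for (int dy = -1; dy <= 1; dy++)
    for (int x = 0; x < 10; x++)
      bx[x + (dy+1)*W] = in(x-1, y+dy) + in(x, y+dy) + in(x+1, y+dy);
  for (int x = 0; x < 10; x++)
    by[x + y*W] = bx[x] + bx[x + W] + bx[x + 2*W];
}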

SLIDE 21

Scheduling


Halide

Func bx, by, in;
Var x, y;

bx(x, y) = in(x-1, y) + in(x, y) + in(x+1, y);
by(x, y) = bx(x, y-1) + bx(x, y) + bx(x, y+1);

bx.compute_at(by, y);
bx.store_root();
by.realize(10, 10);

  • For this, we can separate computation from storage using store_at() and store_root().
  • bx.store_root() means: “Allocate the buffer for bx outside the loop nest.”

Halide automatically applies storage folding as well!
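Sketched out (not from the slides), the bx buffer now persists across y iterations, so each iteration computes only its one new row, and storage folding shrinks the buffer to the three live rows, just like the "folded storage" C version:

// compute_at(by, y) + store_root(): persistent, folded 3-row buffer;
// each y iteration computes only row y+1 of bx.
int bx[3*W];
/* warm-up: compute bx rows -1 and 0 before the loop (omitted) */
for (int y = 0; y < 10; y++) {
  for (int x = 0; x < 10; x++)               // only the new row y+1
    bx[x + ((y+1)%3)*W] = in(x-1, y+1) + in(x, y+1) + in(x+1, y+1);
  for (int x = 0; x < 10; x++)
    by[x + y*W] = bx[x + ((y+2)%3)*W]        // row y-1 (== y+2 mod 3)
                + bx[x + ((y  )%3)*W]        // row y
                + bx[x + ((y+1)%3)*W];       // row y+1
}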

SLIDE 22

Many more scheduling options


  • We looked at syntax which interleaves computation between stages.
  • There is also syntax which changes the order of computation within a single stage:
  • Reorder loop variables → by.reorder(y, x);
  • Split loop variables into inner and outer → by.split(x, xout, xin, 4);
  • Tiling is just a combination of the above:
  • by.split(x, xout, xin, 4);
  • by.split(y, yout, yin, 4);
  • by.reorder(xin, yin, xout, yout);
  • Because this is so common, syntactic sugar (a “shortcut”) is offered:
  • by.tile(x, y, xout, xin, yout, yin, 4, 4);
  • Execute loop iterations in parallel using multi-threading:
  • by.parallel(x); // executes each x iteration simultaneously in threads
  • Turn a loop into a (series of) vector operation(s):
  • by.vectorize(xin); // loop over xin, which has 4 iterations, is vectorized
  • by.vectorize(x, 4); // shortcut: split x into out and in of 4, then vectorize in
SLIDE 23

Many scheduling options


Func gradient;
Var x, y, xout, xin, yout, yin;

// 1-line algorithm definition:
gradient(x, y) = x + y;

// this is equivalent to
// gradient.tile(x, y, xout, xin, yout, yin, 2, 2):
gradient.split(x, xout, xin, 2);           // 1st
gradient.split(y, yout, yin, 2);           // 2nd
gradient.reorder(xin, yin, xout, yout);    // 3rd
gradient.vectorize(xin);
gradient.parallel(yout).parallel(xout);


SLIDE 24

Larger program: Local Laplacian filters


  • Reference: 300 lines of C++
  • Adobe (C++): 1500 lines, expert-optimized, multi-threaded, SIMD; 10x faster; 3 months of work
  • Intern (Halide): 60 lines; 2x faster (vs. expert); 1 day
  • GPU version: 9x faster (vs. expert)

  • 99 different stages
  • many different stencils
  • large data-dependent resampling
SLIDE 25

Auto-scheduling


Halide now includes an auto-scheduler:

  • User provides an estimate of the problem size
  • e.g. by.estimate(x, 0, 1920).estimate(y, 0, 1024);
  • Compiler attempts to automatically generate an optimization schedule for the pipeline:
  • Tiling
  • Fusion
  • Vectorization
  • Parallelization
  • User can inspect the schedule and optimize it further (a usage sketch follows below)
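A minimal usage sketch (an assumption, not from the slides: the auto-scheduler API has changed across Halide releases, and newer versions use set_estimate() and apply_autoscheduler() instead of the estimate()/auto_schedule() calls shown here):

// Hedged sketch of driving the auto-scheduler; names vary by release.
Func bx, by, in;
Var x, y;
bx(x, y) = in(x-1, y) + in(x, y) + in(x+1, y);
by(x, y) = bx(x, y-1) + bx(x, y) + bx(x, y+1);

by.estimate(x, 0, 1920).estimate(y, 0, 1024);   // problem-size hints

Pipeline p(by);
p.auto_schedule(get_target_from_environment()); // generate a schedule
p.realize({1920, 1024});                        // run with that schedule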
SLIDE 26

Limitations


  • As mentioned, Halide is domain-specific to image processing. It can be less suitable for other workloads because:
  • Not Turing-complete (no full recursion)
  • Only iterates over rectangular domains
  • Scheduling model only covers typical image processing optimizations
  • But this is the point of domain-specific languages: if we aim to cover everything, we will get something flexible like C again!

SLIDE 27

Observations


  • With Halide, the algorithm definition is more clear and concise than with C.
  • It is kept separate from the optimization strategy.
  • Transformations that would normally take a lot of effort are done in just a few separate scheduling statements:
  • Saves time
  • Guaranteed correctness
  • Automatic handling of edge conditions and storage folding
  • With Halide, we can easily port the algorithm to a different architecture:
  • As long as a Halide back-end exists for that architecture
  • Code is hardware-independent
  • For good performance on the new architecture → re-write the schedule