SLIDE 1

Revec: Program Rejuvenation through Revectorization

Charith Mendis*, Ajay Jain*, Paras Jain, Saman Amarasinghe


* equal contribution

SLIDE 2

Parallelism in Processors

[Diagram: a four-core processor; each core has pipeline stages (IF, ID, ME, WB), INT/FP/Vector functional units, and vector registers, sharing a multi-level cache hierarchy backed by memory fetch.]

  • Thread level: multi-cores, hyperthreads
  • Instruction level: superscalar, pipelined
  • Data level: SIMD units


SLIDE 3

Exploiting SIMD parallelism

Option 1: Use compiler auto-vectorization.

    for (int i = 0; i < N; i++) {
        av[i] = sqrt(bv[i]);
    }

Compiled targeting SSE2 (128-bit):

    sqrtpd  80(%rdx,%rax), %xmm0
    sqrtpd  96(%rdx,%rax), %xmm1
    vmovdqu %xmm0, 40(%rdi,%rax)
    vmovdqu %xmm1, 56(%rdi,%rax)

Portable: yes. Complete: not guaranteed (auto-vectorizers miss many loops).

Option 2: Hand-vectorization using compiler intrinsics.

    for (int i = 0; i < N/4; i += 4) {
        av[i]   = _mm_sqrt_pd(bv[i]);
        av[i+2] = _mm_sqrt_pd(bv[i+2]);
    }

Portable: no. Complete: yes.

SLIDE 4

Exploiting SIMD parallelism

Option 1: Use compiler auto-vectorization.

    for (int i = 0; i < N; i++) {
        av[i] = sqrt(bv[i]);
    }

Recompiled targeting AVX-512 (512-bit), the same source yields wider code:

    vsqrtpd 80(%rdx,%rax), %zmm0
    vsqrtpd 144(%rdx,%rax), %zmm1
    vmovupd %zmm0, 40(%rdi,%rax)
    vmovupd %zmm1, 104(%rdi,%rax)

Portable: yes. Complete: not guaranteed.

Option 2: Hand-vectorization using compiler intrinsics.

    for (int i = 0; i < N/4; i += 4) {
        av[i]   = _mm_sqrt_pd(bv[i]);
        av[i+2] = _mm_sqrt_pd(bv[i+2]);
    }

Portable: no; the intrinsics stay fixed to 128 bits even on AVX-512 hardware. Complete: yes.

SLIDE 5

Intel Vector-ISA Generations

32-bit scalar only
1997: 64-bit vector (MMX)
2000: 128-bit vector (SSE2)
2011: 256-bit vector (AVX2)
2016: 512-bit vector (AVX-512)

Two trends: an increase in bit-width, and growing diversity in the instruction set.

SLIDE 6

Exploiting SIMD parallelism

Auto-vectorization is portable but not guaranteed to be complete; hand-vectorization with intrinsics is complete but not portable. Revec closes the gap: it retargets hand-vectorized code, making it both portable and complete.

SLIDE 7


Naive implementation auto-vectorized by compiler

    for (i = 1; i < H - 1; ++i)
        for (j = 1; j < W - 1; ++j)
            dst[i][j] = 1/9 * (in[i-1][j-1] + in[i-1][j] + in[i-1][j+1] +
                               in[i][j-1]   + in[i][j]   + in[i][j+1]   +
                               in[i+1][j-1] + in[i+1][j] + in[i+1][j+1]);

[Figure: a 3x3 kernel of 1/9 weights slides over the input image to produce the output image.]

Unoptimized MeanFilter3x3

SLIDE 8

The naive MeanFilter3x3 from slide 7, auto-vectorized targeting SSE2 (128-bit):

    for (i = 1; i < H - 1; ++i)
        for (j = 1; j < W - 1; j += 8)
            dst[i][j:j+7] = 1/9 * (in[i-1][j-1:j+6] + in[i-1][j:j+7] + in[i-1][j+1:j+8] +
                                   in[i][j-1:j+6]   + in[i][j:j+7]   + in[i][j+1:j+8]   +
                                   in[i+1][j-1:j+6] + in[i+1][j:j+7] + in[i+1][j+1:j+8]);

SLIDE 9

Recompiled targeting AVX-512 (512-bit), the auto-vectorizer emits wider code with no source changes:

    for (i = 1; i < H - 1; ++i)
        for (j = 1; j < W - 1; j += 32)
            dst[i][j:j+31] = 1/9 * (in[i-1][j-1:j+30] + in[i-1][j:j+31] + in[i-1][j+1:j+32] +
                                    in[i][j-1:j+30]   + in[i][j:j+31]   + in[i][j+1:j+32]   +
                                    in[i+1][j-1:j+30] + in[i+1][j:j+31] + in[i+1][j+1:j+32]);

SLIDE 10

Optimized Implementation hand-vectorized by programmer


    #define A 8
    #define F ((1 << 16) / 9)
    __m128i div9 = _mm_set_epi16(F, F, F, F, F, F, F, F);
    uint16_t colsum[3 * W];
    __m128i *buf1 = (__m128i *)&colsum[0 * W];
    __m128i *buf2 = (__m128i *)&colsum[1 * W];
    __m128i *buf3 = (__m128i *)&colsum[2 * W];
    // code to compute column sums for the first two rows into buf1, buf2
    for (i = 2; i < H; ++i) {
        for (j = 1; j < W - 1; j += A) {
            __m128i a0 = _mm_loadu_si128(in[i][j-1]);
            __m128i a1 = _mm_loadu_si128(in[i][j]);
            __m128i a2 = _mm_loadu_si128(in[i][j+1]);
            buf3[j/A] = _mm_add_epi16(a0, _mm_add_epi16(a1, a2));
            dst[i-1][j] = _mm_mulhi_epu16(div9,
                _mm_add_epi16(buf1[j/A], _mm_add_epi16(buf2[j/A], buf3[j/A])));
        }
        // swap column-sum buffers for the next iteration
        __m128i *temp = buf1; buf1 = buf2; buf2 = buf3; buf3 = temp;
    }

Optimized MeanFilter3x3


  • Rotating column-sum buffers
  • Hand-vectorized using compiler intrinsics (_mm_loadu_si128, _mm_add_epi16), targeting a 128-bit SSE2 machine

SLIDE 11

Hand-vectorization inhibits performance portability


[Chart: speedups (1x-8x) of scalar auto-vectorized and hand-vectorized SSE2 code, measured on SSE2 and AVX-512 machines.]

  • Auto-vectorizers do not handle already-vectorized code
  • Hand-vectorized code does not utilize features of newer vector ISAs

SLIDE 12

  • Revectorization: retarget hand-vectorized codes to use new vector instruction sets when available
  • We built a compiler pass, Revec, to perform revectorization

SLIDE 13

Revec reinstates performance portability for hand-vectorized code

[Chart: speedups (1x-9x) on SSE2 and AVX-512 machines for scalar auto-vectorized, hand-vectorized SSE2, and revectorized SSE2 code. On the AVX-512 machine, revectorized code gains 1.31x over the hand-vectorized SSE2 baseline.]

SLIDE 14

Revec

  • Revectorizes code transparently: implemented as a regular compiler transformation in LLVM
  • Based on SLP vectorization
  • Enables performance portability for hand-vectorized code

SLIDE 15

SLP Vectorization

Pack:

    a[0] = b[0] + c[0];
    a[1] = b[1] + c[1];

Vectorize:

    a[0:1] = b[0:1] + c[0:1];

Isomorphic and independent statements can be vectorized.
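As a concrete sketch (illustrative code, not from the slides), the scalar pair above corresponds to a single 128-bit SSE2 operation:

    #include <emmintrin.h>  // SSE2 intrinsics
    #include <iostream>

    // Two isomorphic, independent scalar statements...
    void scalar(double* a, const double* b, const double* c) {
        a[0] = b[0] + c[0];
        a[1] = b[1] + c[1];
    }

    // ...packed by SLP into a single 128-bit add.
    void slp(double* a, const double* b, const double* c) {
        __m128d vb = _mm_loadu_pd(b);           // pack b[0], b[1]
        __m128d vc = _mm_loadu_pd(c);           // pack c[0], c[1]
        _mm_storeu_pd(a, _mm_add_pd(vb, vc));   // a[0:1] = b[0:1] + c[0:1]
    }

    int main() {
        double a[2], b[2] = {1.0, 2.0}, c[2] = {3.0, 4.0};
        slp(a, b, c);
        std::cout << a[0] << ' ' << a[1] << '\n';  // prints: 4 6
    }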

SLIDE 16

Revec: Revectorization

Pack:

    a[0:1] = b[0:1] + c[0:1];
    a[2:3] = b[2:3] + c[2:3];

Revectorize:

    a[0:3] = b[0:3] + c[0:3];

Isomorphic and independent vectorized statements can be revectorized. However, SLP needs to be adapted to handle vector shuffles and opaque intrinsics.
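A minimal sketch of the effect at the intrinsics level (illustrative, hand-written here rather than Revec output): two adjacent 128-bit adds merge into one 256-bit AVX2 add.

    #include <immintrin.h>  // SSE2 + AVX2; compile with -mavx2
    #include <cstdint>

    // Two adjacent 128-bit adds over a[0:3] and a[4:7]...
    void sse2_pair(int32_t* a, const int32_t* b, const int32_t* c) {
        __m128i lo = _mm_add_epi32(_mm_loadu_si128((const __m128i*)b),
                                   _mm_loadu_si128((const __m128i*)c));
        __m128i hi = _mm_add_epi32(_mm_loadu_si128((const __m128i*)(b + 4)),
                                   _mm_loadu_si128((const __m128i*)(c + 4)));
        _mm_storeu_si128((__m128i*)a, lo);
        _mm_storeu_si128((__m128i*)(a + 4), hi);
    }

    // ...revectorized into one 256-bit add over a[0:7].
    void avx2_merged(int32_t* a, const int32_t* b, const int32_t* c) {
        __m256i v = _mm256_add_epi32(_mm256_loadu_si256((const __m256i*)b),
                                     _mm256_loadu_si256((const __m256i*)c));
        _mm256_storeu_si256((__m256i*)a, v);
    }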

SLIDE 17

Revec revectorizes LLVM IR.

Hand-vectorized code (128-bit):

    __m128i zeros = _mm_set_epi16(0,0,0,0,0,0,0,0);
    __m128i cons  = _mm_set_epi32(127,127,127,127);
    for (int i = 0; i < H * W; i += 8) {
        __m128i inval   = _mm_loadu_si128(in[i]);
        __m128i lo      = _mm_unpacklo_epi16(inval, zeros);
        __m128i hi      = _mm_unpackhi_epi16(inval, zeros);
        __m128i lo_plus = _mm_add_epi32(lo, cons);
        __m128i hi_plus = _mm_add_epi32(hi, cons);
        __m128i final   = _mm_packus_epi32(lo_plus, hi_plus);
        _mm_storeu_si128(out[i], final);
    }

LLVM IR: vector IR containing vector shuffles and opaque intrinsics:

    %1 = load <8 x i16>, <8 x i16>* %in
    %2 = shufflevector <8 x i16> %1, const_vec_1, mask_1
    %3 = shufflevector <8 x i16> %1, const_vec_2, mask_2
    %4 = bitcast <8 x i16> %2 to <4 x i32>
    %5 = add <4 x i32> %4, <127, 127, 127, 127>
    %6 = bitcast <8 x i16> %3 to <4 x i32>
    %7 = add <4 x i32> %6, <127, 127, 127, 127>
    %8 = call <8 x i16> @llvm.x86.sse41.packusdw(%5, %7)
    store <8 x i16> %8, <8 x i16>* %out

SLIDE 18

The same hand-vectorized example as dataflow:

[Diagram: the 16-bit input lanes i7..i0 are unpacked into two 32-bit vectors, <127, 127, 127, 127> is added to each (producing r7..r0), and the results are packed back into 16-bit output lanes o7..o0.]

SLIDE 19

Loop Unrolling

Targeting a 256-bit vector machine, Revec first unrolls the loop once, producing two isomorphic copies of the loop body:

    %1 = load <8 x i16>, <8 x i16>* %in
    %2 = shufflevector <8 x i16> %1, const_vec_1, mask_1
    %3 = shufflevector <8 x i16> %1, const_vec_2, mask_2
    %4 = bitcast <8 x i16> %2 to <4 x i32>
    %5 = add <4 x i32> %4, <127, 127, 127, 127>
    %6 = bitcast <8 x i16> %3 to <4 x i32>
    %7 = add <4 x i32> %6, <127, 127, 127, 127>
    %8 = call <8 x i16> @llvm.x86.sse41.packusdw(%5, %7)
    store <8 x i16> %8, <8 x i16>* %out
    %9 = load <8 x i16>, <8 x i16>* %in_1
    %10 = shufflevector <8 x i16> %9, const_vec_3, mask_3
    %11 = shufflevector <8 x i16> %9, const_vec_4, mask_4
    %12 = bitcast <8 x i16> %10 to <4 x i32>
    %13 = add <4 x i32> %12, <127, 127, 127, 127>
    %14 = bitcast <8 x i16> %11 to <4 x i32>
    %15 = add <4 x i32> %14, <127, 127, 127, 127>
    %16 = call <8 x i16> @llvm.x86.sse41.packusdw(%13, %15)
    store <8 x i16> %16, <8 x i16>* %out_1

[Diagram: two independent copies of the dataflow, reading from Input and Input + 128 and writing to Output and Output + 128.]

SLIDE 20

Revec Transformation

1. Start at root packs (adjacent stores, reduction roots)
2. Revectorize the pack
3. Recursively traverse the use-def chain
4. Stop when packs are not revectorizable

Example root pack:

    a[0:1] = b[0:1] + c[0:1];
    a[2:3] = b[2:3] + c[2:3];

Based on loop-aware SLP in GCC [Rosen, et al. 2007]. A toy sketch of the recursive traversal follows.
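This is a self-contained toy of the pack-driven traversal (hypothetical data structures; Revec's real pass operates on LLVM IR and performs far more legality checking):

    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    // Toy IR node: an opcode plus operand nodes.
    struct Node {
        std::string op;
        std::vector<Node*> operands;
    };
    using Pack = std::vector<Node*>;  // candidate group of instructions

    // All members must share opcode and arity to be widened together.
    static bool isomorphic(const Pack& p) {
        for (const Node* n : p)
            if (n->op != p[0]->op ||
                n->operands.size() != p[0]->operands.size())
                return false;
        return true;
    }

    // Steps 2-4: widen this pack, recursing into its operand packs first.
    static bool revectorize(const Pack& p) {
        if (!isomorphic(p)) return false;                  // step 4: stop
        for (std::size_t i = 0; i < p[0]->operands.size(); ++i) {
            Pack operand_pack;
            for (const Node* n : p)
                operand_pack.push_back(n->operands[i]);
            if (!revectorize(operand_pack)) return false;  // step 3: use-def chain
        }
        std::cout << "widen pack of " << p.size() << " x " << p[0]->op
                  << '\n';                                 // step 2: revectorize
        return true;
    }

    int main() {
        // Step 1: seed with a root pack of two adjacent stores.
        Node loadA{"load", {}}, loadB{"load", {}};
        Node addA{"add", {&loadA}}, addB{"add", {&loadB}};
        Node stA{"store", {&addA}}, stB{"store", {&addB}};
        revectorize({&stA, &stB});
    }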

SLIDE 21

Revec Transformation

Working example: the unrolled IR and dataflow diagram from slide 19.

SLIDE 22

Revec Transformation

Revec starts at the root pack of the two adjacent stores and emits a widened store whose operand is not yet known:

    store <16 x i16> ??, <16 x i16>* %out

Walking up the use-def chain reaches the pack of the two packusdw calls: opaque intrinsics.

SLIDE 23

Finding Intrinsic Equivalences

The pack of opaque intrinsics:

    %8  = call <8 x i16> @llvm.x86.sse41.packusdw(%5, %7)
    %16 = call <8 x i16> @llvm.x86.sse41.packusdw(%13, %15)

Does there exist a matching 256-bit intrinsic?

SLIDE 24

Finding Intrinsic Equivalences

Populate a dictionary of equivalences through enumerative search (offline):

  • K:1 equivalences (K narrow intrinsic calls map to one wide intrinsic)
  • Candidates are validated by random and corner-case testing (a toy sketch follows)

At compile time, finding the matching wide intrinsic is a dictionary look-up.
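A toy sketch of the offline validation (illustrative only; it uses 128-bit and 256-bit integer adds as a stand-in for packusdw, whose real equivalence also requires the operand reshuffling covered later):

    #include <immintrin.h>   // SSE2/AVX2; compile with -mavx2
    #include <cstdint>
    #include <cstring>
    #include <iostream>
    #include <random>

    // Do two 128-bit adds agree with one 256-bit add on concatenated lanes?
    static bool agrees(const int32_t a[8], const int32_t b[8]) {
        int32_t narrow[8], wide[8];
        __m128i lo = _mm_add_epi32(_mm_loadu_si128((const __m128i*)a),
                                   _mm_loadu_si128((const __m128i*)b));
        __m128i hi = _mm_add_epi32(_mm_loadu_si128((const __m128i*)(a + 4)),
                                   _mm_loadu_si128((const __m128i*)(b + 4)));
        _mm_storeu_si128((__m128i*)narrow, lo);
        _mm_storeu_si128((__m128i*)(narrow + 4), hi);
        __m256i w = _mm256_add_epi32(_mm256_loadu_si256((const __m256i*)a),
                                     _mm256_loadu_si256((const __m256i*)b));
        _mm256_storeu_si256((__m256i*)wide, w);
        return std::memcmp(narrow, wide, sizeof narrow) == 0;
    }

    int main() {
        std::mt19937 rng(42);
        std::uniform_int_distribution<int32_t> dist(INT32_MIN, INT32_MAX);
        int32_t a[8], b[8];
        bool ok = true;
        for (int t = 0; t < 100000 && ok; ++t) {    // random testing
            for (int i = 0; i < 8; ++i) { a[i] = dist(rng); b[i] = dist(rng); }
            ok = agrees(a, b);
        }
        // Corner cases (all zeros, INT32_MIN/INT32_MAX lanes, ...) go here too.
        std::cout << (ok ? "equivalence candidate holds" : "mismatch") << '\n';
    }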

SLIDE 25

Finding Intrinsic Equivalences

For the packusdw pack, the dictionary yields a matching 256-bit intrinsic:

    call <16 x i16> @llvm.x86.avx2.packusdw(??, ??)
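The compile-time side can be pictured as a simple table look-up; a sketch with a hypothetical key format (the opcode names of the packed calls):

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // Hypothetical dictionary: the opcodes of a pack of narrow calls map
    // to one wide intrinsic (a 2:1 entry shown; entries are illustrative).
    using Key = std::vector<std::string>;
    static const std::map<Key, std::string> equivalences = {
        {{"llvm.x86.sse41.packusdw", "llvm.x86.sse41.packusdw"},
         "llvm.x86.avx2.packusdw"},
    };

    int main() {
        Key pack = {"llvm.x86.sse41.packusdw", "llvm.x86.sse41.packusdw"};
        auto it = equivalences.find(pack);
        std::cout << (it == equivalences.end() ? "no match" : it->second)
                  << '\n';  // prints: llvm.x86.avx2.packusdw
    }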

SLIDE 26

Revec Transformation

Revectorized code so far:

    store <16 x i16> ??, <16 x i16>* %out

SLIDE 27

Revec Transformation

Substituting the matched AVX2 intrinsic:

    %8a = call <16 x i16> @llvm.x86.avx2.packusdw(??, ??)
    store <16 x i16> %8a, <16 x i16>* %out

SLIDE 28

Revec Transformation

(Build-up frame: the revectorized code is unchanged from slide 27 while the traversal moves up the use-def chain.)

SLIDE 29

Revec Transformation

(Build-up frame: traversal continues; revectorized code unchanged from slide 27.)

SLIDE 30

Revec Transformation

The add and bitcast packs are widened next; their operands, the vector shuffles, remain to be resolved:

    %4a = bitcast <16 x i16> ?? to <8 x i32>
    %5a = add <8 x i32> %4a, <127,127,127,127,127,127,127,127>
    %6a = bitcast <16 x i16> ?? to <8 x i32>
    %7a = add <8 x i32> %6a, <127,127,127,127,127,127,127,127>
    %8a = call <16 x i16> @llvm.x86.avx2.packusdw(%5a, %7a)
    store <16 x i16> %8a, <16 x i16>* %out

SLIDE 31

Revec Transformation

(Build-up frame: revectorized code unchanged from slide 30; the vector shuffle packs are handled next.)

SLIDE 32

Shuffle Merge Rules

    shufflevector operand_1, operand_2, mask

The mask selects values from operand_1 or operand_2 to be in the final output. When merging two shuffles,

    shufflevector operand_1, operand_2, mask_1
    shufflevector operand_3, operand_4, mask_2

the merge rule depends on the masks and operand types. Four different vector shuffle patterns are handled.
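For reference, shufflevector's selection semantics can be emulated in a few lines (a sketch; LLVM's instruction operates on typed IR vectors, not std::vector):

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Emulates shufflevector selection: mask element e picks op1[e] when
    // e < N (N = op1 lane count), otherwise op2[e - N].
    std::vector<int> shuffle(const std::vector<int>& op1,
                             const std::vector<int>& op2,
                             const std::vector<int>& mask) {
        std::vector<int> out;
        for (int e : mask)
            out.push_back(static_cast<std::size_t>(e) < op1.size()
                              ? op1[e]
                              : op2[e - op1.size()]);
        return out;
    }

    int main() {
        std::vector<int> a = {10, 11, 12, 13}, b = {20, 21, 22, 23};
        for (int v : shuffle(a, b, {0, 4, 1, 5}))  // interleave low halves
            std::cout << v << ' ';                  // prints: 10 20 11 21
        std::cout << '\n';
    }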

SLIDE 33

Shuffle Patterns

Pattern A: Sequential subvector extracts. The shuffles extract consecutive subvectors of the same operand, so the merge is the operand itself:

    %s1 = shufflevector %a, ..., (mask selecting elements 0..N-1)
    %s2 = shufflevector %a, ..., (mask selecting elements N..2N-1)
    %S  = %a

Pattern B: Permutations of identical operands:

    %s1 = shufflevector %a, %b, m1
    %s2 = shufflevector %a, %b, m2
    %S  = shufflevector %a, %b, concat(m1, m2)

Pattern C: Mergeable constant vector operand (%b1, %b2 constants, merged into %b):

    %s1 = shufflevector %a1, %b1, m1
    %s2 = shufflevector %a1, %b2, m2
    %S  = shufflevector %a1, %b, concat(m1, m2)

Pattern D: Lane widening: vertically pack isomorphic operands:

    %s1 = shufflevector %a1, %b1, m1
    %s2 = shufflevector %a2, %b2, m2
    %S  = shufflevector pack(a1, a2), pack(b1, b2), reindex(m1, m2)

SLIDE 34

Shuffle Merge Rules

    shufflevector <8 x i16> %1, <0, 0, 0, 0, 0, 0, 0, 0>, <0, 8, 1, 9, 2, 10, 3, 11>
    shufflevector <8 x i16> %9, <0, 0, 0, 0, 0, 0, 0, 0>, <0, 8, 1, 9, 2, 10, 3, 11>

operand_1: different; operand_2: the same 128-bit null vector; mask: the same.

Generic lane widening revectorizes the pair into a single shuffle of the packed operands {%1, %9}, a 256-bit null vector, and a computed mask:

    shufflevector <16 x i16> {%1, %9}, <0, 0, 0, 0, ..., 0, 0>,
        <0, 16, 1, 17, 2, 18, 3, 19, 8, 24, 9, 25, 10, 26, 11, 27>

A sketch of the mask computation follows.
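A small sketch of the mask computation for lane widening (Pattern D): indices that select from the first operands stay within the packed first operand, while indices that select from the second operands are re-based past it. The helper name reindex is ours, matching the slide's notation:

    #include <iostream>
    #include <vector>

    // Inputs: masks m1, m2 of two W-wide shuffles s1 = shuffle(a1, b1, m1)
    // and s2 = shuffle(a2, b2, m2). Output: the mask of the widened shuffle
    // S = shuffle(pack(a1, a2), pack(b1, b2), reindex(m1, m2)).
    std::vector<int> reindex(const std::vector<int>& m1,
                             const std::vector<int>& m2, int W) {
        std::vector<int> out;
        for (int e : m1)                     // lanes taken from a1 or b1
            out.push_back(e < W ? e          // a1 occupies lanes [0, W)
                                : e + W);    // b1 moves to lanes [2W, 3W)
        for (int e : m2)                     // lanes taken from a2 or b2
            out.push_back(e < W ? e + W      // a2 occupies lanes [W, 2W)
                                : e + 2*W);  // b2 moves to lanes [3W, 4W)
        return out;
    }

    int main() {
        // The slide's example: both shuffles use mask <0,8,1,9,2,10,3,11>, W = 8.
        std::vector<int> m = {0, 8, 1, 9, 2, 10, 3, 11};
        for (int e : reindex(m, m, 8)) std::cout << e << ' ';
        std::cout << '\n';  // 0 16 1 17 2 18 3 19 8 24 9 25 10 26 11 27
    }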

SLIDE 35

Revec Transformation

(Build-up frame: revectorized code unchanged from slide 30; the shuffle packs are merged next by generic lane widening.)

SLIDE 36

Revec Transformation

Merging the shuffle packs adds the widened shuffles; only the loads remain:

    %2a = shufflevector <16 x i16> ??, const_vec_n1, mask_n1
    %3a = shufflevector <16 x i16> ??, const_vec_n2, mask_n2
    %4a = bitcast <16 x i16> %2a to <8 x i32>
    %5a = add <8 x i32> %4a, <127,127,127,127,127,127,127,127>
    %6a = bitcast <16 x i16> %3a to <8 x i32>
    %7a = add <8 x i32> %6a, <127,127,127,127,127,127,127,127>
    %8a = call <16 x i16> @llvm.x86.avx2.packusdw(%5a, %7a)
    store <16 x i16> %8a, <16 x i16>* %out

SLIDE 37

Revec Transformation

Finally, the pack of adjacent loads merges into a single wide load. Revectorization is complete:

    %1a = load <16 x i16>, <16 x i16>* %in
    %2a = shufflevector <16 x i16> %1a, const_vec_n1, mask_n1
    %3a = shufflevector <16 x i16> %1a, const_vec_n2, mask_n2
    %4a = bitcast <16 x i16> %2a to <8 x i32>
    %5a = add <8 x i32> %4a, <127,127,127,127,127,127,127,127>
    %6a = bitcast <16 x i16> %3a to <8 x i32>
    %7a = add <8 x i32> %6a, <127,127,127,127,127,127,127,127>
    %8a = call <16 x i16> @llvm.x86.avx2.packusdw(%5a, %7a)
    store <16 x i16> %8a, <16 x i16>* %out

SLIDE 38

Evaluation

  • SIMD-scan (databases)
  • Image processing
  • Video compression
  • High-performance kernels

SLIDE 39

SIMD-Scan

  • Variable-width integer packing and unpacking using SIMD (a scalar sketch of unpacking follows)
  • FastPFor's SSE implementation [Lemire, et al.]
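For intuition, a scalar sketch of the unpacking operation that SIMD-Scan vectorizes (illustrative, not FastPFor's code): b-bit integers packed back-to-back into a 32-bit word stream are extracted one at a time.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Extract `count` b-bit integers (b = 1..16) packed back-to-back in a
    // 32-bit word stream. Assumes one word of zero padding at the end so
    // the 64-bit window never reads past the buffer.
    std::vector<uint32_t> unpack(const std::vector<uint32_t>& packed,
                                 unsigned b, std::size_t count) {
        std::vector<uint32_t> out;
        std::uint64_t bitpos = 0;
        for (std::size_t i = 0; i < count; ++i, bitpos += b) {
            std::uint64_t window = packed[bitpos / 32] |
                (static_cast<std::uint64_t>(packed[bitpos / 32 + 1]) << 32);
            out.push_back(static_cast<uint32_t>(
                (window >> (bitpos % 32)) & ((1u << b) - 1)));
        }
        return out;
    }

    int main() {
        // Eight 4-bit values 1..8 packed into one word, plus a padding word.
        std::vector<uint32_t> packed = {0x87654321u, 0u};
        for (uint32_t v : unpack(packed, 4, 8)) std::cout << v << ' ';
        std::cout << '\n';  // prints: 1 2 3 4 5 6 7 8
    }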

Revectorization speedups over SSE:

[Chart: speedups for 1-4, 5-8, 9-12, and 13-16 bit widths when retargeting to AVX2 and AVX-512; the plotted values are 1.42x, 1.55x, 1.49x, and 1.57x for one target and 1.15x, 1.24x, 1.15x, and 1.27x for the other.]

SLIDE 40

x265

  • Implements the H.265 video codec
  • The best software-based video encoder for H.265 (http://x265.org)

Revectorization speedups over SSE:

[Chart: speedups for the IDCT, DCT, and dequant 8x8/16x16/32x32 kernels when retargeting to AVX2 and AVX-512; values range from 0.94x to 1.88x.]

SLIDE 41

Simd image processing library

  • 216 intrinsic-rich image processing and stencil operations
  • Extensively hand-tuned
  • Revectorized code is compared against hand-vectorized AVX2 and AVX-512 implementations

Speedups over SSE kernels:

[Charts: speedups of revectorized and hand-vectorized code for AVX2 and AVX-512 targets, on kernels including yuvtohue, alphablending, absgradsatsum, and bgratogray.]
SLIDE 42

Conclusion


  • Revectorization allows retargeting of hand-vectorized code to newer vector ISAs
  • The Revec compiler transformation rejuvenates performance transparently
  • It helps preserve the performance portability of hand-vectorized implementations

SLIDE 43

This work was supported by: [sponsor logos]

Try Revec and VectorBench (benchmark suite): >200 hand-vectorized kernels extracted from popular repos.
https://www.nextgenvec.org