Revec: Program Rejuvenation through Revectorization
Charith Mendis*, Ajay Jain*, Paras Jain, Saman Amarasinghe
1
* equal contribution
Parallelism in Processors
[Figure: multi-core processor diagram: pipeline stages (IF, ID, INT/FP/Vector execute, MEM, WB), memory hierarchy (L1-L4), and vector registers]
Thread level: multi-cores, hyperthreads
Instruction level: superscalar, pipelining
Data level: SIMD units
2
3
    // av, bv: __m128d arrays holding N doubles (2 per vector)
    for (int i = 0; i < N/2; i += 2) {
        av[i]   = _mm_sqrt_pd(bv[i]);
        av[i+1] = _mm_sqrt_pd(bv[i+1]);
    }
Hand-vectorization using compiler intrinsics
Complete, but portable?
Use compiler auto-vectorization.
    for (int i = 0; i < N; i++) { av[i] = sqrt(bv[i]); }

    sqrtpd  80(%rdx,%rax), %xmm0
    sqrtpd  96(%rdx,%rax), %xmm1
    vmovdqu %xmm0, 40(%rdi,%rax)
    vmovdqu %xmm1, 56(%rdi,%rax)
SSE2 (128 bit)
    vsqrtpd 80(%rdx,%rax), %zmm0
    vsqrtpd 144(%rdx,%rax), %zmm1
    vmovupd %zmm0, 40(%rdi,%rax)
    vmovupd %zmm1, 104(%rdi,%rax)
AVX-512 (512 bit)
Portable, but complete?
4
Hand-vectorization using compiler intrinsics (code above): complete, but not portable, since the intrinsics are fixed to 128 bits.
5
32-bit scalar
1997: 64-bit vector (MMX)
2000: 128-bit vector (SSE2)
2011: 256-bit vector (AVX2)
2016: 512-bit vector (AVX-512)
Increase in bit-width; diversity in the instruction set.
6
Auto-vectorization is portable but not complete; hand-vectorization using compiler intrinsics is complete but not portable.
Revec
7
    for (i = 1; i < H - 1; ++i)
      for (j = 1; j < W - 1; ++j)
        dst[i][j] = 1/9 * (in[i-1][j-1] + in[i-1][j] + in[i-1][j+1] +
                           in[i  ][j-1] + in[i  ][j] + in[i  ][j+1] +
                           in[i+1][j-1] + in[i+1][j] + in[i+1][j+1]);
[Figure: 3x3 mean filter, all kernel weights 1/9, applied to an input image to produce an output image]
Unoptimized MeanFilter3x3
8
Auto-vectorized SSE2 - 128-bit
    for (i = 1; i < H - 1; ++i)
      for (j = 1; j < W - 1; j += 8)
        dst[i][j:j+7] = 1/9 * (in[i-1][j-1:j+6] + in[i-1][j:j+7] + in[i-1][j+1:j+8] +
                               in[i  ][j-1:j+6] + in[i  ][j:j+7] + in[i  ][j+1:j+8] +
                               in[i+1][j-1:j+6] + in[i+1][j:j+7] + in[i+1][j+1:j+8]);
9
Auto-vectorized AVX-512 - 512-bit
    for (i = 1; i < H - 1; ++i)
      for (j = 1; j < W - 1; j += 32)
        dst[i][j:j+31] = 1/9 * (in[i-1][j-1:j+30] + in[i-1][j:j+31] + in[i-1][j+1:j+32] +
                                in[i  ][j-1:j+30] + in[i  ][j:j+31] + in[i  ][j+1:j+32] +
                                in[i+1][j-1:j+30] + in[i+1][j:j+31] + in[i+1][j+1:j+32]);
10
    #define A 8
    #define F ((1 << 16) / 9)
    __m128i div9 = _mm_set1_epi16(F);          // fixed-point 1/9
    uint16_t colsum[3 * W];                    // rolling column sums, 3 rows
    __m128i *buf1 = (__m128i *)&colsum[0 * W];
    __m128i *buf2 = (__m128i *)&colsum[1 * W];
    __m128i *buf3 = (__m128i *)&colsum[2 * W];
    // code to compute column sums for the first two rows into buf1, buf2
    for (i = 2; i < H; ++i) {
        for (j = 1; j < W - 1; j += A) {
            __m128i a0 = _mm_loadu_si128((__m128i *)&in[i][j-1]);
            __m128i a1 = _mm_loadu_si128((__m128i *)&in[i][j]);
            __m128i a2 = _mm_loadu_si128((__m128i *)&in[i][j+1]);
            buf3[j/A] = _mm_add_epi16(a0, _mm_add_epi16(a1, a2));
            // window centered on row i-1 is now complete: sum the 3 row buffers
            __m128i sum = _mm_add_epi16(buf1[j/A], _mm_add_epi16(buf2[j/A], buf3[j/A]));
            _mm_storeu_si128((__m128i *)&dst[i-1][j], _mm_mulhi_epu16(div9, sum));
        }
        // rotate column-sum buffers for the next iteration
        __m128i *temp = buf1; buf1 = buf2; buf2 = buf3; buf3 = temp;
    }
Optimized MeanFilter3x3
targeting a 128-bit SSE2 machine, using intrinsics such as _mm_loadu_si128 and _mm_add_epi16
11
[Chart: speedups over scalar of auto-vectorized and hand-vectorized SSE2 kernels, on SSE2 and AVX-512 machines]
Already-vectorized code cannot utilize the features of newer vector ISAs.
12
13
[Chart: speedups over scalar on SSE2 and AVX-512 machines; scalar auto-vectorized, hand-vectorized SSE2, and revectorized SSE2 kernels]
Revec: 1.31x speedup, implemented as a regular compiler transformation in LLVM that revectorizes existing vector code.
14
Vectorize:
    a[0] = b[0] + c[0];
    a[1] = b[1] + c[1];
  =>
    a[0:1] = b[0:1] + c[0:1];
Isomorphic and independent statements can be vectorized.
15
Pack
Revectorize:
    a[0:1] = b[0:1] + c[0:1];
    a[2:3] = b[2:3] + c[2:3];
  =>
    a[0:3] = b[0:3] + c[0:3];
Isomorphic and independent vectorized statements can be revectorized.
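For instance, in intrinsics form (a minimal sketch assuming an AVX2 target; revec_add and its operand-passing convention are illustrative, not Revec output):

    #include <immintrin.h>

    // Hypothetical illustration: two independent, isomorphic 128-bit adds
    // are merged into a single 256-bit add by packing their operands.
    __m256i revec_add(__m128i a0, __m128i b0, __m128i a1, __m128i b1) {
        // Before: x0 = _mm_add_epi16(a0, b0); x1 = _mm_add_epi16(a1, b1);
        __m256i a = _mm256_set_m128i(a1, a0);  // pack: low lane a0, high lane a1
        __m256i b = _mm256_set_m128i(b1, b0);
        return _mm256_add_epi16(a, b);         // one AVX2 add replaces two SSE adds
    }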
16
However, packing needs to be adapted to handle vector shuffles and opaque intrinsics.
    %1 = load <8 x i16>, <8 x i16>* %in
    %2 = shufflevector <8 x i16> %1, const_vec_1, mask_1
    %3 = shufflevector <8 x i16> %1, const_vec_2, mask_2
    %4 = bitcast <8 x i16> %2 to <4 x i32>
    %5 = add <4 x i32> %4, <127, 127, 127, 127>
    %6 = bitcast <8 x i16> %3 to <4 x i32>
    %7 = add <4 x i32> %6, <127, 127, 127, 127>
    %8 = call <8 x i16> @llvm.x86.sse41.packusdw(%5, %7)
    store <8 x i16> %8, <8 x i16>* %out
17
Vector IR: vector shuffles, opaque intrinsics
Revec revectorizes LLVM IR
    __m128i zeros = _mm_set_epi16(0,0,0,0,0,0,0,0);
    __m128i cons  = _mm_set_epi32(127,127,127,127);
    for (int i = 0; i < H * W; i += 8) {
        __m128i inval   = _mm_loadu_si128((__m128i *)&in[i]);
        __m128i lo      = _mm_unpacklo_epi16(inval, zeros);
        __m128i hi      = _mm_unpackhi_epi16(inval, zeros);
        __m128i lo_plus = _mm_add_epi32(lo, cons);
        __m128i hi_plus = _mm_add_epi32(hi, cons);
        __m128i final   = _mm_packus_epi32(lo_plus, hi_plus);
        _mm_storeu_si128((__m128i *)&out[i], final);
    }
Hand-vectorized code (128-bit) and its LLVM IR (above)
18
[Figure: dataflow of the 128-bit IR above: the 8 x i16 input (i0..i7) is split by the two shuffles into 4 x i32 halves, 127 is added to each lane, and packusdw packs the results back into the 8 x i16 output (r0..r7)]
19
    %1 = load <8 x i16>, <8 x i16>* %in
    %2 = shufflevector <8 x i16> %1, const_vec_1, mask_1
    %3 = shufflevector <8 x i16> %1, const_vec_2, mask_2
    %4 = bitcast <8 x i16> %2 to <4 x i32>
    %5 = add <4 x i32> %4, <127, 127, 127, 127>
    %6 = bitcast <8 x i16> %3 to <4 x i32>
    %7 = add <4 x i32> %6, <127, 127, 127, 127>
    %8 = call <8 x i16> @llvm.x86.sse41.packusdw(%5, %7)
    store <8 x i16> %8, <8 x i16>* %out
    %9 = load <8 x i16>, <8 x i16>* %in_1
    %10 = shufflevector <8 x i16> %9, const_vec_3, mask_3
    %11 = shufflevector <8 x i16> %9, const_vec_4, mask_4
    %12 = bitcast <8 x i16> %10 to <4 x i32>
    %13 = add <4 x i32> %12, <127, 127, 127, 127>
    %14 = bitcast <8 x i16> %11 to <4 x i32>
    %15 = add <4 x i32> %14, <127, 127, 127, 127>
    %16 = call <8 x i16> @llvm.x86.sse41.packusdw(%13, %15)
    store <8 x i16> %16, <8 x i16>* %out_1
[Figure: two copies of the dataflow, reading Input and Input + 128 and writing Output and Output + 128]
Targeting a 256-bit vector machine: unroll once.
20
Packing starts from seeds (adjacent stores, reduction roots).
Pack:
    a[0:1] = b[0:1] + c[0:1];
    a[2:3] = b[2:3] + c[2:3];
Loop-aware SLP in GCC [Rosen, et al. 2007]
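A rough sketch of the seed-finding step (a simplified model with hypothetical types; real SLP packers inspect IR instructions and alias analysis rather than raw addresses):

    #include <cstdint>
    #include <utility>
    #include <vector>

    // Simplified model of a vector store: its address and width in bytes.
    struct Store { uint64_t addr; uint32_t widthBytes; };

    // Two stores seed a pack if the second starts exactly where the first ends.
    std::vector<std::pair<size_t, size_t>>
    adjacentStoreSeeds(const std::vector<Store> &stores) {
        std::vector<std::pair<size_t, size_t>> seeds;
        for (size_t i = 0; i < stores.size(); ++i)
            for (size_t j = 0; j < stores.size(); ++j)
                if (i != j && stores[i].widthBytes == stores[j].widthBytes &&
                    stores[j].addr == stores[i].addr + stores[i].widthBytes)
                    seeds.emplace_back(i, j);
        return seeds;
    }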
21
[Figure: packs formed over the two-iteration dataflow and IR]
22
Opaque Intrinsics
Revectorized code
store <16 x i16> ??, <16 x i16>* %out
23
Does there exist a matching 256-bit intrinsic?
    %8  = call <8 x i16> @llvm.x86.sse41.packusdw(%5, %7)
    %16 = call <8 x i16> @llvm.x86.sse41.packusdw(%13, %15)
Opaque Intrinsics
24
Opaque Intrinsics
Populate a dictionary of equivalences through enumerative search (offline)
At compile time: a dictionary look-up.
Does there exist a matching 256-bit intrinsic?
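A minimal sketch of that two-phase scheme (the two table entries are real LLVM intrinsic names, but kWideningDictionary and lookupWideIntrinsic are illustrative, not Revec's actual data structures):

    #include <map>
    #include <string>

    // Offline: enumerative search over intrinsic pairs, validated on random
    // inputs, populates a table of narrow -> double-width equivalences.
    static const std::map<std::string, std::string> kWideningDictionary = {
        {"llvm.x86.sse41.packusdw", "llvm.x86.avx2.packusdw"},
        {"llvm.x86.sse2.pmulh.w",   "llvm.x86.avx2.pmulh.w"},
    };

    // Compile time: answering "is there a matching 256-bit intrinsic?"
    // reduces to a dictionary lookup.
    const std::string *lookupWideIntrinsic(const std::string &narrowName) {
        auto it = kWideningDictionary.find(narrowName);
        return it == kWideningDictionary.end() ? nullptr : &it->second;
    }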
25
Opaque intrinsics: does there exist a matching 256-bit intrinsic? Yes:
call <16 x i16> @llvm.x86.avx2.packusdw(??,??)
26
Revectorized code (so far):
store <16 x i16> ??, <16 x i16>* %out
27
    %8a = call <16 x i16> @llvm.x86.avx2.packusdw(??, ??)
    store <16 x i16> %8a, <16 x i16>* %out
Revectorized code
28
[Figure: working backwards from the packed 256-bit store toward the operands of the two adds]
29
    %4a = bitcast <16 x i16> ?? to <8 x i32>
    %5a = add <8 x i32> %4a, <127,127,127,127,127,127,127,127>
    %6a = bitcast <16 x i16> ?? to <8 x i32>
    %7a = add <8 x i32> %6a, <127,127,127,127,127,127,127,127>
    %8a = call <16 x i16> @llvm.x86.avx2.packusdw(%5a, %7a)
    store <16 x i16> %8a, <16 x i16>* %out
30
Vector Shuffles
Revectorized code
31
Vector Shuffles
32
shufflevector operand_1, operand_2, mask
Depending on the mask and the operand types, the rules for merging vector shuffles change.
    shufflevector operand_1, operand_2, mask_1
    shufflevector operand_3, operand_4, mask_2
The mask selects which values from operand_1 or operand_2 appear in the final output. There are four different vector shuffle patterns to handle when merging.
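To make the semantics concrete, a small scalar model of shufflevector (illustrative sketch; lanes are modeled as ints):

    #include <vector>

    // Result lane k takes element mask[k] from the concatenation of the two
    // operands: indices [0, VL) select from op1, indices [VL, 2*VL) from op2.
    std::vector<int> shufflevector(const std::vector<int> &op1,
                                   const std::vector<int> &op2,
                                   const std::vector<int> &mask) {
        const int VL = static_cast<int>(op1.size());
        std::vector<int> out;
        for (int idx : mask)
            out.push_back(idx < VL ? op1[idx] : op2[idx - VL]);
        return out;
    }
    // Example: shufflevector({a0,a1,a2,a3}, {b0,b1,b2,b3}, {0,4,1,5})
    //          interleaves the low halves: {a0,b0,a1,b1}.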
33
Pattern A: Sequential subvector extracts
    %s1 = shufflevector %a, _, <0, 1>
    %s2 = shufflevector %a, _, <2, 3>
    %S  = %a            (the extracts cover %a sequentially, so the merged shuffle is %a itself)

Pattern B: Permutations of identical operands
    %s1 = shufflevector %a, %b, m1
    %s2 = shufflevector %a, %b, m2
    %S  = shufflevector %a, %b, concat(m1, m2)

Pattern C: Mergeable constant vector operand
    %s1 = shufflevector %a1, %b1, m1    (%b1, %b2 are constant vectors)
    %s2 = shufflevector %a1, %b2, m2
    %S  = shufflevector pack(%a1, %a1), merge(%b1, %b2), concat(m1, m2)

Pattern D: Lane widening: vertically pack isomorphic operands
    %s1 = shufflevector %a1, %b1, m1
    %s2 = shufflevector %a2, %b2, m2
    %S  = shufflevector pack(%a1, %a2), pack(%b1, %b2), reindex(m1, m2)
    shufflevector <8 x i16> %1, <0, 0, 0, 0, 0, 0, 0, 0>, <0, 8, 1, 9, 2, 10, 3, 11>
    shufflevector <8 x i16> %9, <0, 0, 0, 0, 0, 0, 0, 0>, <0, 8, 1, 9, 2, 10, 3, 11>
Both shuffles are 128-bit, the constant operand is a null vector, and the masks are identical.
Generic Lane Widening
Revectorize:
    shufflevector <16 x i16> {%1, %9}, <0, 0, 0, 0, ..., 0, 0>, <0,16,1,17,2,18,3,19,8,24,9,25,10,26,11,27>
({%1, %9}: the packed operands; a 256-bit null vector; a computed mask)
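The computed mask can be derived mechanically. A sketch of the reindexing for lane widening (hypothetical helper; assumes both original shuffles take VL-wide operands that are packed pairwise):

    #include <vector>

    // Lane widening: operands a1,a2 are packed into one 2*VL vector (a1 in
    // the low half, a2 in the high half), likewise b1,b2. Mask indices are
    // remapped: operand-a lanes land in [0, 2*VL), operand-b in [2*VL, 4*VL).
    std::vector<int> reindexMask(const std::vector<int> &m1,
                                 const std::vector<int> &m2, int VL) {
        std::vector<int> wide;
        auto remap = [&](int idx, int half) {
            return idx < VL ? idx + half * VL                  // from operand a
                            : (idx - VL) + 2 * VL + half * VL; // from operand b
        };
        for (int idx : m1) wide.push_back(remap(idx, 0));  // low-half shuffle
        for (int idx : m2) wide.push_back(remap(idx, 1));  // high-half shuffle
        return wide;
    }
    // For m1 = m2 = {0,8,1,9,2,10,3,11} and VL = 8, this yields
    // {0,16,1,17,2,18,3,19,8,24,9,25,10,26,11,27} -- the computed mask above.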
34
35
    %2a = shufflevector <16 x i16> ??, const_vec_n1, mask_n1
    %3a = shufflevector <16 x i16> ??, const_vec_n2, mask_n2
    %4a = bitcast <16 x i16> %2a to <8 x i32>
    %5a = add <8 x i32> %4a, <127,127,127,127,127,127,127,127>
    %6a = bitcast <16 x i16> %3a to <8 x i32>
    %7a = add <8 x i32> %6a, <127,127,127,127,127,127,127,127>
    %8a = call <16 x i16> @llvm.x86.avx2.packusdw(%5a, %7a)
    store <16 x i16> %8a, <16 x i16>* %out
36
Revectorized code
    %1a = load <16 x i16>, <16 x i16>* %in
    %2a = shufflevector <16 x i16> %1a, const_vec_n1, mask_n1
    %3a = shufflevector <16 x i16> %1a, const_vec_n2, mask_n2
    %4a = bitcast <16 x i16> %2a to <8 x i32>
    %5a = add <8 x i32> %4a, <127,127,127,127,127,127,127,127>
    %6a = bitcast <16 x i16> %3a to <8 x i32>
    %7a = add <8 x i32> %6a, <127,127,127,127,127,127,127,127>
    %8a = call <16 x i16> @llvm.x86.avx2.packusdw(%5a, %7a)
    store <16 x i16> %8a, <16 x i16>* %out
37
Revectorization complete
38
Benchmarks: SIMD-scan (databases), image processing, video compression, high-performance kernels.
39
Revectorization speedups over SSE (SIMD-scan, by element bit-width):

    Bit-width     AVX2    AVX-512
    1-4 bits      1.15    1.42
    5-8 bits      1.24    1.55
    9-12 bits     1.15    1.49
    13-16 bits    1.27    1.57
40
Revectorization speedups over SSE (video compression kernels):

    Kernel           AVX2    AVX-512
    IDCT             1.39    1.74
    DCT              1.45    1.8
    dequant 8x8      1.69    1.88
    dequant 16x16    1       0.94
    dequant 32x32    1       1
41
[Chart: speedups over SSE kernels for image-processing benchmarks (yuvtohue, alphablending, absgradsatsum, bgratogray), comparing revectorized vs. hand-vectorized implementations on AVX2 and AVX-512]
42
43
This work was supported by: [sponsor logos]
Try Revec and VectorBench, our benchmark suite of >200 hand-vectorized kernels extracted from popular repositories: https://www.nextgenvec.org