How fast goes the light?
Euro LLVM 2015
Arnaud de Grandmaison

Scope
• Speed of light: the fastest implementation of a function on a given CPU (Cortex-A57)
• The function under test is a typical image processing kernel:
  • Color space conversion from RGB to YIQ (see http://en.wikipedia.org/wiki/YIQ)
That’s the most basic computation out there, so we’d better get it right…

RGB2YIQ in C, with 16-bit integer coefficients

    /* restrict: no aliasing between In and Out */
    void rgb2yiq(uint8_t *restrict In, uint8_t *restrict Out, unsigned N) {
      for (unsigned pixel = 0; pixel < N; pixel++) {
        uint8_t r = *In++, g = *In++, b = *In++;
        /* Matrix x vector; HALF_LSB provides the rounding */
        uint8_t y = ((YR * r) + (YG * g) + (YB * b) + HALF_LSB) >> S;
        int8_t  i = ((IR * r) + (IG * g) + (IB * b) + HALF_LSB) >> S;
        int8_t  q = ((QR * r) + (QG * g) + (QB * b) + HALF_LSB) >> S;
        *Out++ = y, *Out++ = i, *Out++ = q;
      }
    }
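The slides never show the definitions of `S`, `HALF_LSB` or the nine coefficients. As a minimal sketch, the version below fills them in with assumed values: `S = 16` and `HALF_LSB = 1 << 15` match the `#8, lsl #12` and `lsr #16` operands of the first-shot listing, and the coefficients are the immediates that listing materializes. The function name `rgb2yiq_sketch` and the `const` qualifier are illustrative additions. Right-shifting a negative `int` is implementation-defined in C; the usual arithmetic shift is assumed, as in the slide's code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, not given on the slides: read back from the
   immediates and shift amounts of the first-shot assembly listing. */
enum {
    S = 16,
    HALF_LSB = 1 << (S - 1),
    YR = 19595, YG = 38470,  YB = 7471,    /* 0x4c8b, 0x9646, 0x1d2f        */
    IR = 32767, IG = -15119, IB = -17648,  /* (1<<15)-1, ~0x3b0e, ~0x44ef   */
    QR = 13282, QG = -32767, QB = 19485    /* 0x33e2, -((1<<15)-1), 0x4c1d  */
};

/* Converts n interleaved RGB pixels to interleaved YIQ pixels. */
void rgb2yiq_sketch(const uint8_t *restrict in, uint8_t *restrict out,
                    unsigned n)
{
    assert(n == 0 || (in && out));
    for (unsigned pixel = 0; pixel < n; pixel++) {
        int r = *in++, g = *in++, b = *in++;
        /* One matrix row per output channel, rounded by HALF_LSB. */
        uint8_t y = (uint8_t)((YR * r + YG * g + YB * b + HALF_LSB) >> S);
        int8_t  i = (int8_t)((IR * r + IG * g + IB * b + HALF_LSB) >> S);
        int8_t  q = (int8_t)((QR * r + QG * g + QB * b + HALF_LSB) >> S);
        *out++ = y;
        *out++ = (uint8_t)i;
        *out++ = (uint8_t)q;
    }
}
```

With these assumed coefficients, white (255, 255, 255) maps to (255, 0, 0) and pure red (255, 0, 0) maps to (76, 127, 52).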

Expectations
• 9 or 10 coefficient loads
• 9 multiply-accumulates
• Vectorization

A first shot…
• 7 coefficients
• Immediate half LSB
• 2 strength reduced coefficients
• No multiply-accumulate, no vectorization!

    rgb2yiq_ref:
        cbz   w2, .LBB0_3
        movz  w8, #0x4c8b              // 7 coefficients
        movz  w9, #0x9646
        movz  w10, #0x1d2f
        movn  w11, #0x3b0e
        movn  w12, #0x44ef
        movz  w13, #0x33e2
        movz  w14, #0x4c1d
    .LBB0_2:
        ldrb  w15, [x0]
        ldrb  w16, [x0, #1]
        mul   w18, w15, w8
        mul   w3, w16, w9
        ldrb  w17, [x0, #2]
        lsl   w5, w15, #15             // strength reduced coefficient
        sub   w5, w5, w15
        mul   w15, w15, w13
        mul   w4, w17, w10
        add   w18, w18, w3
        mul   w3, w16, w11
        sub   w16, w16, w16, lsl #15   // strength reduced coefficient
        add   w15, w15, w16
        add   w16, w18, w4
        add   w3, w5, w3
        mul   w5, w17, w12
        add   w16, w16, #8, lsl #12    // immediate half LSB
        mul   w17, w17, w14
        lsr   w16, w16, #16
        add   w18, w3, w5
        add   w15, w15, w17
        add   w17, w18, #8, lsl #12
        add   w15, w15, #8, lsl #12
        lsr   w17, w17, #16
        lsr   w15, w15, #16
        strb  w16, [x1]
        strb  w17, [x1, #1]
        strb  w15, [x1, #2]
        sub   w2, w2, #1
        add   x0, x0, #3
        add   x1, x1, #3
        cbnz  w2, .LBB0_2
    .LBB0_3:
        ret
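The two strength-reduced coefficients are the ones with magnitude (1 << 15) - 1: the compiler replaces the multiply with a shift and a subtract. A small sketch of that identity in C follows; the helper names are hypothetical, and the check runs over every possible 8-bit input.

```c
#include <assert.h>
#include <stdint.h>

/* x * 32767 as (x << 15) - x, mirroring
   "lsl w5, w15, #15" followed by "sub w5, w5, w15". */
static int32_t times_32767(uint8_t x)
{
    return ((int32_t)x << 15) - (int32_t)x;
}

/* x * -32767 as x - (x << 15), mirroring
   "sub w16, w16, w16, lsl #15". */
static int32_t times_minus_32767(uint8_t x)
{
    return (int32_t)x - ((int32_t)x << 15);
}

/* Exhaustive check over every 8-bit input value. */
static void check_strength_reduction(void)
{
    for (int x = 0; x <= 255; x++) {
        assert(times_32767((uint8_t)x) == x * 32767);
        assert(times_minus_32767((uint8_t)x) == x * -32767);
    }
}
```

This saves a multiplier slot and a coefficient register for those two terms, but it also splits the computation into plain `mul` and `add` instructions, which is consistent with the "no multiply-accumulate" observation above.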

Performances (reference)

                                           Time    Code size    Data size (bytes)
    First shot (reference)                 1.0     1.0          0

RGB2YIQ v2: fight the compiler!

    /* Place coefficients in memory */
    int Coeffs[3][3] = {{YR, YG, YB}, {IR, IG, IB}, {QR, QG, QB}};
    int Half_LSB = HALF_LSB;

    void rgb2yiq(uint8_t *restrict In, uint8_t *restrict Out, unsigned N) {
      /* Local copies: make sure they do not alias with In or Out,
         and are hoisted out of the loop */
      int yr = Coeffs[0][0], yg = Coeffs[0][1], yb = Coeffs[0][2];
      int ir = Coeffs[1][0], ig = Coeffs[1][1], ib = Coeffs[1][2];
      int qr = Coeffs[2][0], qg = Coeffs[2][1], qb = Coeffs[2][2];
      int half_lsb = Half_LSB;
      for (unsigned pixel = 0; pixel < N; pixel++) {
        uint8_t r = *In++, g = *In++, b = *In++;
        uint8_t y = ((yr * r) + (yg * g) + (yb * b) + half_lsb) >> S;
        int8_t  i = ((ir * r) + (ig * g) + (ib * b) + half_lsb) >> S;
        int8_t  q = ((qr * r) + (qg * g) + (qb * b) + half_lsb) >> S;
        *Out++ = y, *Out++ = i, *Out++ = q;
      }
    }

Second try…
• 9 coefficients + half LSB
• 3 MACs!

    rgb2yiq:
        stp   x20, x19, [sp, #-16]!
        cbz   w2, .LBB0_3
        adrp  x16, Coeffs
        add   x16, x16, :lo12:Coeffs
        adrp  x17, Half_LSB
        ldp   w8, w9, [x16]            // 9 coefficients + half LSB
        ldp   w10, w11, [x16, #8]
        ldp   w12, w13, [x16, #16]
        ldp   w14, w15, [x16, #24]
        ldr   w16, [x16, #32]
        ldr   w17, [x17, :lo12:Half_LSB]
    .LBB0_2:
        ldrb  w18, [x0]
        ldrb  w3, [x0, #1]
        mul   w5, w3, w9
        madd  w7, w18, w8, w17         // MAC
        ldrb  w4, [x0, #2]
        mul   w19, w3, w12
        mul   w3, w3, w15
        mul   w6, w4, w10
        add   w5, w7, w5
        madd  w7, w18, w11, w17        // MAC
        madd  w18, w18, w14, w17       // MAC
        add   w7, w7, w19
        mul   w19, w4, w13
        add   w18, w18, w3
        mul   w3, w4, w16
        sub   w2, w2, #1
        add   x0, x0, #3
        add   w4, w5, w6
        add   w5, w7, w19
        add   w18, w18, w3
        lsr   w3, w4, #16
        strb  w3, [x1]
        lsr   w4, w5, #16
        lsr   w18, w18, #16
        strb  w4, [x1, #1]
        strb  w18, [x1, #2]
        add   x1, x1, #3
        cbnz  w2, .LBB0_2
    .LBB0_3:
        ldp   x20, x19, [sp], #16
        ret

Performances (lower is better)

                                           Time    Code size    Data size (bytes)
    First shot (reference)                 1.0     1.0          0
    Second try                             1.03    1.0          40

Let’s ignore the compiler…

    rgb2yiq:
        cbz   w2, .LBB0_3
        // Load coefficients
        adrp  x16, Coeffs
        add   x16, x16, :lo12:Coeffs
        adrp  x17, Half_LSB
        ldp   w8, w9, [x16]
        ldp   w10, w11, [x16, #8]
        ldp   w12, w13, [x16, #16]
        ldp   w14, w15, [x16, #24]
        ldr   w16, [x16, #32]
        ldr   w17, [x17, :lo12:Half_LSB]
    .LBB0_2:
        ldrb  w3, [x0]
        ldrb  w4, [x0, #1]
        ldrb  w5, [x0, #2]
        // Multiply-add
        madd  w6, w3, w8, w17
        madd  w6, w4, w9, w6
        madd  w6, w5, w10, w6
        madd  w7, w3, w11, w17
        madd  w7, w4, w12, w7
        madd  w7, w5, w13, w7
        madd  w18, w3, w14, w17
        madd  w18, w4, w15, w18
        madd  w18, w5, w16, w18
        // Shift
        lsr   w6, w6, #16
        lsr   w7, w7, #16
        lsr   w18, w18, #16
        strb  w6, [x1]
        strb  w7, [x1, #1]
        strb  w18, [x1, #2]
        add   x0, x0, #3
        add   x1, x1, #3
        sub   w2, w2, #1
        cbnz  w2, .LBB0_2
    .LBB0_3:
        ret
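Each output channel in the hand-written loop is a chain of three `madd` instructions, with the first one seeded by the half LSB held in `w17`, so the rounding costs no extra add. A minimal C sketch of one such chain follows; the function name and argument order are hypothetical. It uses an unsigned shift to mirror `lsr` and keeps only the low byte, as `strb` does.

```c
#include <stdint.h>

/* One output channel as three chained multiply-accumulates.
   c0..c2 are one row of the coefficient matrix (16-bit fixed point). */
static uint8_t mac3_channel(int32_t c0, int32_t c1, int32_t c2,
                            uint8_t r, uint8_t g, uint8_t b,
                            int32_t half_lsb)
{
    int32_t acc = c0 * r + half_lsb;   /* madd w6, w3, w8, w17  */
    acc = c1 * g + acc;                /* madd w6, w4, w9, w6   */
    acc = c2 * b + acc;                /* madd w6, w5, w10, w6  */
    /* lsr #16 is a logical shift; strb then stores the low 8 bits. */
    return (uint8_t)((uint32_t)acc >> 16);
}
```

For signed channels the stored byte is the two's complement encoding of the result: with the assumed row (32767, -15119, -17648) and a pure green pixel, the byte is 197, which reads back as -59 when interpreted as `int8_t`.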

Performances (lower is better)

                                           Time    Code size    Data size (bytes)
    First shot (reference)                 1.0     1.0          0
    Second try                             1.03    1.0          40
    Hand written straight asm (scalar)     0.94    0.80         40

Performances (lower is better)

                                           Time    Code size    Data size (bytes)
    First shot (reference)                 1.0     1.0          0
    Second try                             1.03    1.0          40
    Hand written straight asm (scalar)     0.94    0.80         40
    Hand written scheduled asm (scalar)    0.79    0.80         40

What about vectorization?

1. Load 8 pixels from memory to NEON registers: ld3 {v0, v1, v2}, [x0], #24

    memory: ... r0 g0 b0 r1 g1 b1 r2 g2 …
    v0: r7 r6 r5 r4 r3 r2 r1 r0
    v1: g7 g6 g5 g4 g3 g2 g1 g0
    v2: b7 b6 b5 b4 b3 b2 b1 b0

2. Expand to 32 bits (uxtl, uxtl2)

    v0: r3 r2 r1 r0
    v1: g3 g2 g1 g0
    v2: b3 b2 b1 b0
    v3: r7 r6 r5 r4
    v4: g7 g6 g5 g4
    v5: b7 b6 b5 b4

What about vectorization? (cont.)

3. Bunch of mul / mla with the coefficients

4. Rounding shift right of the y, i, q results, narrowing to 16 bits (rshrn, rshrn2)

    v0: y7 y6 y5 y4 y3 y2 y1 y0
    v1: i7 i6 i5 i4 i3 i2 i1 i0
    v2: q7 q6 q5 q4 q3 q2 q1 q0

5. Extract and compact the 8 LSBs from the y, i, q results (xtn)

    v0: y7 y6 y5 y4 y3 y2 y1 y0
    v1: i7 i6 i5 i4 i3 i2 i1 i0
    v2: q7 q6 q5 q4 q3 q2 q1 q0

6. And store with st3 {v0, v1, v2}, [x1], #24

    memory: ... y0 i0 q0 y1 i1 q1 y2 i2 …
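The six steps can be modelled in portable C to check the data flow before writing any NEON code. The sketch below is such a scalar model, not the hand-written vector assembly from the talk: the function name and the `coeffs` parameter are hypothetical, and each loop stands in for the instruction named in its comment. Because `rshrn` rounds by itself, the model adds `1 << 15` at the narrowing step instead of carrying a separate half LSB term.

```c
#include <stdint.h>

enum { LANES = 8 };  /* pixels handled per vector iteration */

/* Scalar model of one 8-pixel iteration. coeffs[row][col] is the 3x3
   matrix in 16-bit fixed point; in and out hold 8 interleaved pixels. */
void rgb2yiq_block8_model(const uint8_t in[24], uint8_t out[24],
                          const int32_t coeffs[3][3])
{
    int32_t chan[3][LANES];

    /* Steps 1-2: ld3 de-interleaves r, g, b; uxtl/uxtl2 widen the lanes. */
    for (int lane = 0; lane < LANES; lane++)
        for (int c = 0; c < 3; c++)
            chan[c][lane] = in[3 * lane + c];

    for (int row = 0; row < 3; row++) {
        for (int lane = 0; lane < LANES; lane++) {
            /* Step 3: one mul followed by two mla per output channel. */
            int32_t acc = coeffs[row][0] * chan[0][lane];
            acc += coeffs[row][1] * chan[1][lane];
            acc += coeffs[row][2] * chan[2][lane];

            /* Step 4: rshrn #16 adds 1 << 15, shifts right by 16 and
               keeps the low 16 bits of each lane. */
            uint16_t narrowed =
                (uint16_t)(((uint32_t)acc + (1u << 15)) >> 16);

            /* Step 5: xtn keeps the low 8 bits.
               Step 6: st3 re-interleaves y, i, q on the way to memory. */
            out[3 * lane + row] = (uint8_t)narrowed;
        }
    }
}
```

Working on unsigned values at the narrowing step keeps the model free of implementation-defined shifts; the stored byte for a negative i or q is its two's complement encoding.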

Performances (lower is better)

                                           Time    Code size    Data size (bytes)
    First shot (reference)                 1.0     1.0          0
    Second try                             1.03    1.0          40
    Hand written straight asm (scalar)     0.94    0.80         40
    Hand written scheduled asm (scalar)    0.79    0.80         40
    Hand written asm (vector)              0.49    1.88         48

Thank you!
