welcome

INFOMAGR Advanced Graphics. Jacco Bikker, February - April 2016.


  1. INFOMAGR – Advanced Graphics. Jacco Bikker, February – April 2016. Welcome!
     $I(x, x') = g(x, x') \left[ \epsilon(x, x') + \int_S \rho(x, x', x'') \, I(x', x'') \, dx'' \right]$

  2. Today's Agenda:
     - Introduction
     - SIMD Concepts
     - C++ / SSE & AVX
     - C# / RyuJIT

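  3. Advanced Graphics – SIMD Recap. Introduction: S.I.M.D.
     Single Instruction Multiple Data: applying the same instruction to several input elements. In other words: if we are going to apply the same sequence of instructions to a large input set, SIMD allows us to do this in parallel (and thus faster). SIMD is a form of data-level parallelism: one instruction stream operates on many data elements. It is sometimes loosely called instruction level parallelism, although that term usually refers to a core executing several independent instructions at once.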

  4. Introduction: Hardware – VLIW
     Vector instructions (diagram: four scalar copies of A combined into one vector function A4):

         Vector4 a = { 1, PI, e, 4 };
         Vector4 b = { 4, 4, 4, 4 };
         Vector4 c = a * b;

     Concept:
     - function A4 consists of instructions operating on 4 items;
     - executing A4 requires the same number of instructions as executing A on a single item;
     - the throughput of A4 is therefore four times higher.
     The '4' in the above is known as the vector width. Modern processors support 4-wide vectors (Pentium 3 and up), 8-wide (i3/i5/i7), 16-wide (Larrabee / Xeon Phi) and 32-wide (NVidia and AMD GPUs).
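The Vector4 snippet on this slide can be made concrete with a small portable sketch. The type name follows the slide; its layout, the `operator[]` accessors and the loop-based multiply are assumptions made for illustration, not the course's actual implementation. Whether the loop becomes a single SIMD multiply depends on the compiler and its settings.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 4-wide vector type: four floats stored side by side.
struct Vector4
{
    float v[4];

    float& operator[]( std::size_t i ) { assert( i < 4 ); return v[i]; }
    const float& operator[]( std::size_t i ) const { assert( i < 4 ); return v[i]; }
};

// Component-wise multiply: the same operation is applied to all four lanes.
// The lanes are independent, which is what allows a 4-wide instruction.
inline Vector4 operator*( const Vector4& a, const Vector4& b )
{
    Vector4 c{};
    for (std::size_t i = 0; i < 4; i++) c[i] = a[i] * b[i];
    return c;
}
```

With `Vector4 a{ { 1.0f, 2.0f, 3.0f, 4.0f } }` and `Vector4 b{ { 4.0f, 4.0f, 4.0f, 4.0f } }`, the expression `a * b` yields four products from one call.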

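  5. Introduction: SIMD Using Integers
     An integer is a 32-bit value, which means that it stores 4 bytes:

         unsigned char a[] = { 1, 2, 3, 4 };
         uint a4 = (1 << 24) + (2 << 16) + (3 << 8) + 4;   // uint: shorthand for unsigned int

     In C++ we can directly exploit this:

         union { unsigned char a[4]; uint a4; };
         a4 = (1 << 24) + (2 << 16) + (3 << 8) + 4;
         a[0]++; a[1]++; a[2]++; a[3]++;   // four separate byte increments...
         a4 += 0x01010101;                 // ...or the same four increments in one 32-bit addition

     Reading a union member other than the one last written is not guaranteed by the C++ standard, but it is widely supported in practice.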

  6. Introduction: SIMD Using Integers
     An integer is a 32-bit value, which means that it stores 4 bytes:

         byte[] a = { 1, 2, 3, 4 };
         uint a4 = (1 << 24) + (2 << 16) + (3 << 8) + 4;

     C# also allows this, although it is a bit of a hack:

         using System.Runtime.InteropServices;

         [StructLayout(LayoutKind.Explicit)]
         struct byte_array
         {
             [FieldOffset(0)] public byte a;
             [FieldOffset(1)] public byte b;
             [FieldOffset(2)] public byte c;
             [FieldOffset(3)] public byte d;
             [FieldOffset(0)] public uint abcd;   // overlaps a, b, c and d
         }

  7. Introduction: uint = unsigned char[4]
     Pinging google.com yields: 74.125.136.101. Each value is an unsigned 8-bit value (0..255). Combining them in one 32-bit integer:

         101 + 256 * 136 + 256 * 256 * 125 + 256 * 256 * 256 * 74 = 1249740901

     Browse to: http://1249740901 (works!)

     Evil use of this: we can specify a user name when visiting a website, but any username will be accepted by google. Like this: http://advgr2016@google.com
     Or: http://www.ing.nl@1249740901
     Replace the IP address used here by your own site which contains a copy of the ing.nl site to obtain passwords, and send the link to a 'friend'.
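The octet arithmetic from the ping example can be checked with a short sketch. The helper names `pack_ipv4` and `octet` are hypothetical and not part of any library; the sketch only assumes that the first octet ends up in the most significant byte, as in the sum on the slide.

```cpp
#include <cassert>
#include <cstdint>

// Pack four octets (most significant first) into one 32-bit value.
constexpr std::uint32_t pack_ipv4( std::uint8_t a, std::uint8_t b, std::uint8_t c, std::uint8_t d )
{
    return (std::uint32_t( a ) << 24) | (std::uint32_t( b ) << 16) |
           (std::uint32_t( c ) << 8) | std::uint32_t( d );
}

// Inverse: extract octet i, where i = 0 is the most significant byte.
inline std::uint8_t octet( std::uint32_t ip, int i )
{
    assert( i >= 0 && i < 4 );
    return std::uint8_t( (ip >> (24 - 8 * i)) & 0xFF );
}

// Compile-time check against the value computed on the slide.
static_assert( pack_ipv4( 74, 125, 136, 101 ) == 1249740901u, "matches the slide" );
```

Shifting by 24, 16 and 8 bits is the same as multiplying by 256 * 256 * 256, 256 * 256 and 256.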

  8. Introduction: Other Examples
     Rapid string comparison. Byte by byte:

         char a[] = "optimization skills rule";
         char b[] = "optimization is so nice!";
         bool equal = true;
         int l = strlen( a );
         for ( int i = 0; i < l; i++ )
         {
             if (a[i] != b[i])
             {
                 equal = false;
                 break;
             }
         }

     Four bytes at a time:

         char a[] = "optimization skills rule";
         char b[] = "optimization is so nice!";
         bool equal = true;
         int q = strlen( a ) / 4;   // both strings are 24 characters, so 6 whole ints
         for ( int i = 0; i < q; i++ )
         {
             if (((int*)a)[i] != ((int*)b)[i])
             {
                 equal = false;
                 break;
             }
         }

     This version only works as-is when the length is a multiple of four; any remaining bytes must be compared separately. Likewise, we can copy byte arrays faster.
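The four-bytes-at-a-time comparison can also be written without the pointer cast. The sketch below is one possible variant, not the version from the slides: it swaps the `(int*)` cast for `std::memcpy` into a local 32-bit value, which sidesteps alignment and aliasing concerns, and it handles lengths that are not a multiple of four. The function name `equal_by_words` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Compare n bytes of a and b, four bytes per iteration where possible.
inline bool equal_by_words( const char* a, const char* b, std::size_t n )
{
    assert( a != nullptr && b != nullptr );
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        std::uint32_t wa, wb;
        std::memcpy( &wa, a + i, 4 );   // compilers typically emit a plain 32-bit load
        std::memcpy( &wb, b + i, 4 );
        if (wa != wb) return false;
    }
    // Tail: the last n % 4 bytes are compared one by one.
    for (; i < n; i++) if (a[i] != b[i]) return false;
    return true;
}
```

For the two strings on the slide, `equal_by_words( a, b, 24 )` returns false in the fourth word, where "skil" and "is s" differ.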

  9. Introduction: SIMD using 32-bit values - Limitations
     Mapping four chars to an int value has a number of limitations:

         { 100, 100, 100, 100 } + { 1, 1, 1, 200 } = { 101, 101, 102, 44 }
         { 100, 100, 100, 100 } * { 2, 2, 2, 2 } = { … }
         { 100, 100, 100, 200 } * 2 = { 200, 200, 201, 144 }

     In the first example 100 + 200 = 300 does not fit in a byte: 44 stays in place and the carry spills into the neighbouring stream, which becomes 102 instead of 101.
     In general:
     - Streams are not separated (prone to overflow into next stream);
     - Limited to small unsigned integer values;
     - Hard to do multiplication / division.
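The overflow examples above can be reproduced with a few lines of code. This is a minimal sketch; `pack4` and `lane` are hypothetical helper names, and the first value is placed in the most significant byte to match the way the slide writes the four streams.

```cpp
#include <cassert>
#include <cstdint>

// Pack four byte-sized streams into one 32-bit word, first value on top.
constexpr std::uint32_t pack4( unsigned a, unsigned b, unsigned c, unsigned d )
{
    return ((a & 255u) << 24) | ((b & 255u) << 16) | ((c & 255u) << 8) | (d & 255u);
}

// Read stream i back, where i = 0 is the most significant byte.
constexpr unsigned lane( std::uint32_t v, int i )
{
    return (v >> (24 - 8 * i)) & 255u;
}

// { 100, 100, 100, 100 } + { 1, 1, 1, 200 }: the carry out of the last
// stream corrupts its neighbour (102 instead of 101).
static_assert( pack4( 100, 100, 100, 100 ) + pack4( 1, 1, 1, 200 ) == pack4( 101, 101, 102, 44 ),
               "addition overflows into the next stream" );

// { 100, 100, 100, 200 } * 2: same effect with multiplication.
static_assert( pack4( 100, 100, 100, 200 ) * 2 == pack4( 200, 200, 201, 144 ),
               "multiplication overflows into the next stream" );
```

Both checks are evaluated at compile time, so the file compiling at all confirms the numbers on the slide.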

  10. Introduction: SIMD using 32-bit values - Limitations
      Ideally, we would like to see:
      - Isolated streams;
      - Support for more data types (char, short, uint, int, float, double);
      - An easy to use approach.
      Meet SSE!

  11. Introduction: Vector Processors - Early systems
      - The Solomon project (1960): one CPU feeding a number of ALUs with the same instruction, but different data. This way, a single algorithm is executed in parallel for a large dataset.
      - ILLIAC IV (1962): design: 1 GFLOPS using 256 ALUs. Actual implementation: 1974, 64 ALUs, ~100 MFLOPS. Fastest machine in the world for massively parallel tasks.
      - Cray-1 (1976): regular processor, but using vector registers of 64x64 bits. Reached 240 MFLOPS.
      - MMX (1997, P2) and SSE (1999, P3): vector registers and instructions added to a regular x86 processor.

  12. Introduction: SIMD / SSE
      SSE was first introduced with the Pentium-3 processor in 1999, and adds a set of 128-bit registers, as well as instructions to operate on these registers.

          32-bit:  { char, char, char, char } = int
          128-bit: { float, float, float, float } = __m128
                   { int, int, int, int } = __m128i

      Apart from storing 4 floats or ints, the registers can also store two 64-bit values, eight 16-bit values or sixteen 8-bit values.
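As a first taste of the `__m128` type, here is a hedged sketch that adds four floats with one SSE instruction. The intrinsics `_mm_loadu_ps`, `_mm_add_ps` and `_mm_storeu_ps` come from `<xmmintrin.h>`; the function name `add4`, the `SIMD_RECAP_HAVE_SSE` macro and the scalar fallback for non-x86 targets are assumptions added for this example.

```cpp
#include <cassert>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>   // SSE intrinsics: __m128, _mm_add_ps, ...
#define SIMD_RECAP_HAVE_SSE 1
#else
#define SIMD_RECAP_HAVE_SSE 0
#endif

// out[i] = a[i] + b[i] for i = 0..3.
inline void add4( const float* a, const float* b, float* out )
{
    assert( a != nullptr && b != nullptr && out != nullptr );
#if SIMD_RECAP_HAVE_SSE
    const __m128 va = _mm_loadu_ps( a );          // unaligned load of four floats
    const __m128 vb = _mm_loadu_ps( b );
    _mm_storeu_ps( out, _mm_add_ps( va, vb ) );   // one instruction, four additions
#else
    for (int i = 0; i < 4; i++) out[i] = a[i] + b[i];   // scalar fallback
#endif
}
```

Feeding it the earlier overflow example, { 100, 100, 100, 100 } + { 1, 1, 1, 200 }, gives { 101, 101, 101, 300 }: each lane is a full 32-bit float and nothing spills into its neighbour.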

  13. Introduction: SIMD / SSE
      Problems when working with 32-bit integers:
      - Streams are not separated (prone to overflow into next stream);
      - Limited to small unsigned integer values;
      - Hard to do multiplication / division.
      Ideal situation:
      - Isolated streams;
      - Support for more data types (char, short, uint, int, float, double);
      - An easy to use approach.
      SSE offers these benefits, except for one (guess which).

  14. Today's Agenda:
      - Introduction
      - SIMD Concepts
      - C++ / SSE & AVX
      - C# / RyuJIT

  15. Concepts: Streams
      Consider the following scalar code:

          Vector3 D = Vector3.Normalize( T - P );

      This is quite high-level. What the processor needs to do is:

          Vector3 tmp = T - P;
          float length = sqrt( tmp.x * tmp.x + tmp.y * tmp.y + tmp.z * tmp.z );
          D = tmp / length;

  16. Concepts: Streams
      Consider the following scalar code:

          Vector3 D = Vector3.Normalize( T - P );

      This is quite high-level. What the processor needs to do is:

          float tmp_x = T.x - P.x;
          float tmp_y = T.y - P.y;
          float tmp_z = T.z - P.z;
          float sqlen = tmp_x * tmp_x + tmp_y * tmp_y + tmp_z * tmp_z;
          float length = sqrt( sqlen );
          D.x = tmp_x / length;
          D.y = tmp_y / length;
          D.z = tmp_z / length;

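  17. Concepts: Streams
      Consider the following scalar code:

          Vector3 D = Vector3.Normalize( T - P );

      Using vector instructions:

          __m128 A = T - P                // 75%
          float B = sqrt( dot( A, A ) )   // 75%
          __m128 C = { B, B, B }          // 75%, overhead
          __m128 D = A / C                // 75%

      A Vector3 fills only three of the four lanes of a 128-bit register, so each instruction reaches at most 75% utilization, and broadcasting the length into C is pure overhead.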

  18. Concepts: Streams
      Consider the following scalar code:

          Vector3 D = Vector3.Normalize( T - P );

      The slide shows the same instruction sequence four times side by side, once for each of the lanes 0, 1, 2 and 3:

          A = T.X - P.X
          B = T.Y - P.Y
          C = T.Z - P.Z
          D = A * A
          E = B * B
          F = C * C
          F += E
          F += D
          G = sqrt( F )
          D.X = A / G
          D.Y = B / G
          D.Z = C / G

  19. Concepts: Streams
      Optimal utilization of SIMD hardware is achieved when we run the same algorithm four times in parallel. This way, the approach also scales naturally to 8-wide, 16-wide and 32-wide SIMD.

  20. Concepts: Streams – Data Organization

          A1 = T1.X - P1.X    A2 = T2.X - P2.X    A3 = T3.X - P3.X    A4 = T4.X - P4.X
          B1 = T1.Y - P1.Y    B2 = T2.Y - P2.Y    B3 = T3.Y - P3.Y    B4 = T4.Y - P4.Y
          C1 = T1.Z - P1.Z    C2 = T2.Z - P2.Z    C3 = T3.Z - P3.Z    C4 = T4.Z - P4.Z
          D1 = A1 * A1        D2 = A2 * A2        D3 = A3 * A3        D4 = A4 * A4
          E1 = B1 * B1        E2 = B2 * B2        E3 = B3 * B3        E4 = B4 * B4
          F1 = C1 * C1        F2 = C2 * C2        F3 = C3 * C3        F4 = C4 * C4
          F1 += E1            F2 += E2            F3 += E3            F4 += E4
          F1 += D1            F2 += D2            F3 += D3            F4 += D4
          G1 = sqrt( F1 )     G2 = sqrt( F2 )     G3 = sqrt( F3 )     G4 = sqrt( F4 )
          D1.X = A1 / G1      D2.X = A2 / G2      D3.X = A3 / G3      D4.X = A4 / G4
          D1.Y = B1 / G1      D2.Y = B2 / G2      D3.Y = B3 / G3      D4.Y = B4 / G4
          D1.Z = C1 / G1      D2.Z = C2 / G2      D3.Z = C3 / G3      D4.Z = C4 / G4
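The four-stream listing above maps naturally onto a structure-of-arrays layout, where all X components sit together, then all Y components, then all Z components. The sketch below is portable C++ rather than SSE intrinsics; the names `Vec3x4` and `normalize4` are hypothetical. Each statement in the loop body corresponds to one or more rows of the listing, and because lanes never interact, a compiler or hand-written SSE code can execute each row as one 4-wide instruction.

```cpp
#include <cassert>
#include <cmath>

// Structure of arrays: lane i of x, y and z together form vector i.
struct Vec3x4
{
    float x[4], y[4], z[4];
};

// D = normalize( T - P ) for four independent (T, P) pairs at once.
// Precondition: T and P differ in every lane, otherwise the length is zero.
inline Vec3x4 normalize4( const Vec3x4& T, const Vec3x4& P )
{
    Vec3x4 D;
    for (int i = 0; i < 4; i++)
    {
        const float a = T.x[i] - P.x[i];
        const float b = T.y[i] - P.y[i];
        const float c = T.z[i] - P.z[i];
        const float g = std::sqrt( a * a + b * b + c * c );
        assert( g > 0.0f );
        D.x[i] = a / g;
        D.y[i] = b / g;
        D.z[i] = c / g;
    }
    return D;
}
```

The design choice here is the data layout: with x, y and z stored per lane, no register lane is wasted on padding, and widening to 8 or 16 lanes only changes the array size and the loop bound.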
