INFOMAGR – Advanced Graphics
Jacco Bikker - February – April 2016

Welcome!

$$ I(x, x') = g(x, x') \left[ \epsilon(x, x') + \int_S \rho(x, x', x'') \, I(x', x'') \, dx'' \right] $$
Today's Agenda:

• Introduction
• SIMD Concepts
• C++ / SSE & AVX
• C# / RyuJIT
Advanced Graphics – SIMD Recap

Introduction: S.I.M.D.

Single Instruction Multiple Data: applying the same instruction to several input elements.

In other words: if we are going to apply the same sequence of instructions to a large input set, SIMD allows us to do this in parallel (and thus: faster).

SIMD is a form of data-level parallelism. It is sometimes loosely called instruction-level parallelism, although that term usually refers to a processor executing several independent instructions at once.
Introduction: Hardware – VLIW

Vector instructions:

    Vector4 a = { 1, PI, e, 4 };
    Vector4 b = { 4, 4, 4, 4 };
    Vector4 c = a * b;

[Figure: four scalar operations A combined into one vector operation A4]

Concept:

• function A4 consists of instructions operating on 4 items;
• executing A4 requires the same number of instructions as executing A on a single item;
• the throughput of A4 is therefore four times higher.

The "4" in the above is known as the vector width. Modern processors support 4-wide vectors (Pentium 3 and up), 8-wide (i3/i5/i7), 16-wide (Larrabee / Xeon Phi) and 32-wide and wider (NVidia and AMD GPUs).
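The Vector4 multiplication above can be sketched with actual SSE intrinsics. This is a minimal example, not the lecture's own code: the function name mul4 and the use of std::array are assumptions made for illustration, while _mm_loadu_ps, _mm_mul_ps and _mm_storeu_ps are standard intrinsics from <xmmintrin.h>.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 / x64 only)
#include <array>

// Multiply two 4-float vectors lane by lane.
// mul4 is a hypothetical helper name used only in this sketch.
std::array<float, 4> mul4(const std::array<float, 4>& a,
                          const std::array<float, 4>& b)
{
    __m128 a4 = _mm_loadu_ps(a.data());   // load 4 floats (unaligned load)
    __m128 b4 = _mm_loadu_ps(b.data());
    __m128 c4 = _mm_mul_ps(a4, b4);       // one instruction, four products
    std::array<float, 4> c;
    _mm_storeu_ps(c.data(), c4);          // write the 4 results back
    return c;
}
```

The single _mm_mul_ps call plays the role of A4: one instruction where the scalar version needs four multiplications.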
Introduction: SIMD Using Integers

An integer is a 32-bit value, which means that it stores 4 bytes:

    char a[] = { 1, 2, 3, 4 };
    uint a4 = (1 << 24) + (2 << 16) + (3 << 8) + 4;

In C++ we can directly exploit this:

    union { char a[4]; uint a4; };
    a4 = (1 << 24) + (2 << 16) + (3 << 8) + 4;
    a[0]++; a[1]++; a[2]++; a[3]++;   // four separate increments
    a4 += 0x01010101;                 // the same four increments in one addition
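A hedged variant of the same trick without the union: reading a union member other than the one last written is supported by the common compilers but not guaranteed by the C++ standard, so this sketch uses std::memcpy instead. The function name incrementAll is an assumption made for illustration.

```cpp
#include <cstdint>
#include <cstring>

// Add 1 to each of the four bytes using a single 32-bit addition.
// Assumes no byte holds 255; otherwise the carry spills into its neighbour.
// incrementAll is a hypothetical name used only in this sketch.
void incrementAll(unsigned char bytes[4])
{
    std::uint32_t packed;
    std::memcpy(&packed, bytes, 4);   // well-defined alternative to the union
    packed += 0x01010101u;            // +1 in every byte lane
    std::memcpy(bytes, &packed, 4);
}
```

Because every lane receives the same increment, the result does not depend on the byte order of the machine.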
Introduction: SIMD Using Integers

C# also allows overlaying four bytes on one 32-bit integer, although it is a bit of a hack:

    [StructLayout(LayoutKind.Explicit)]
    struct byte_array
    {
        [FieldOffset(0)] public byte a;
        [FieldOffset(1)] public byte b;
        [FieldOffset(2)] public byte c;
        [FieldOffset(3)] public byte d;
        [FieldOffset(0)] public uint abcd;
    }
Introduction: uint = unsigned char[4]

Pinging google.com yields: 74.125.136.101

Each value is an unsigned 8-bit value (0..255). Combining them in one 32-bit integer:

    101 +
    256 * 136 +
    256 * 256 * 125 +
    256 * 256 * 256 * 74 = 1249740901.

Browse to: http://1249740901 (works!)

Evil use of this:

We can specify a user name when visiting a website, but any user name will be accepted by google. Like this:

    http://advgr2016@google.com

Or:

    http://www.ing.nl@1249740901

Replace the IP address used here by your own site which contains a copy of the ing.nl site to obtain passwords, and send the link to a "friend".
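The packing arithmetic on this slide (multiplying by 256 is the same as shifting left by 8 bits) can be sketched as follows. The function names packOctets and octet are assumptions made for illustration.

```cpp
#include <cstdint>

// Pack four 8-bit values into one 32-bit integer, first value in the top byte.
// packOctets and octet are hypothetical names used only in this sketch.
constexpr std::uint32_t packOctets(std::uint8_t a, std::uint8_t b,
                                   std::uint8_t c, std::uint8_t d)
{
    return (std::uint32_t(a) << 24) | (std::uint32_t(b) << 16) |
           (std::uint32_t(c) << 8)  |  std::uint32_t(d);
}

// Recover value 'index' (0 = first value) from the packed integer.
constexpr std::uint8_t octet(std::uint32_t packed, int index)
{
    return std::uint8_t(packed >> (24 - 8 * index));
}
```

With the four values from the slide, packOctets(74, 125, 136, 101) evaluates to 1249740901.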
Introduction: Other Examples

Rapid string comparison. Byte by byte:

    char a[] = "optimization skills rule";
    char b[] = "optimization is so nice!";
    bool equal = true;
    int l = strlen( a );
    for ( int i = 0; i < l; i++ )
    {
        if (a[i] != b[i])
        {
            equal = false;
            break;
        }
    }

Four bytes at a time (both strings are 24 characters long, so the length is a multiple of four):

    char a[] = "optimization skills rule";
    char b[] = "optimization is so nice!";
    bool equal = true;
    int q = strlen( a ) / 4;
    for ( int i = 0; i < q; i++ )
    {
        if (((int*)a)[i] != ((int*)b)[i])
        {
            equal = false;
            break;
        }
    }

Likewise, we can copy byte arrays faster.
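A more defensive sketch of the four-bytes-at-a-time comparison, for strings whose length is not a multiple of four. It swaps the pointer cast for std::memcpy to avoid alignment and aliasing issues. The function name equalStrings is an assumption made for illustration.

```cpp
#include <cstdint>
#include <cstring>

// Compare two strings four bytes at a time, then finish the tail per byte.
// equalStrings is a hypothetical name used only in this sketch.
bool equalStrings(const char* a, const char* b)
{
    const std::size_t len = std::strlen(a);
    if (std::strlen(b) != len) return false;
    std::size_t i = 0;
    for (; i + 4 <= len; i += 4)
    {
        std::uint32_t wa, wb;
        std::memcpy(&wa, a + i, 4);   // avoids unaligned / aliasing casts
        std::memcpy(&wb, b + i, 4);
        if (wa != wb) return false;
    }
    for (; i < len; i++)              // remaining 0..3 characters
        if (a[i] != b[i]) return false;
    return true;
}
```

Compilers typically turn the fixed-size memcpy calls into plain 32-bit loads, so the loop should keep the benefit of comparing four characters per iteration.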
Introduction: SIMD using 32-bit values - Limitations

Mapping four chars to an int value has a number of limitations:

    { 100, 100, 100, 100 } + { 1, 1, 1, 200 } = { 101, 101, 102, 44 }
    { 100, 100, 100, 100 } * { 2, 2, 2, 2 }   = { … }
    { 100, 100, 100, 200 } * 2                = { 200, 200, 201, 144 }

In general:

• Streams are not separated (prone to overflow into next stream);
• Limited to small unsigned integer values;
• Hard to do multiplication / division.
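The overflow in the first and third example can be reproduced with a small sketch. It assumes the last element of each set sits in the lowest byte, as in the packing shown earlier. The names Bytes4, pack, unpack and addPacked are assumptions made for illustration.

```cpp
#include <array>
#include <cstdint>

using Bytes4 = std::array<std::uint8_t, 4>;   // hypothetical alias for this sketch

// Pack { b0, b1, b2, b3 } with the last element in the lowest byte.
std::uint32_t pack(const Bytes4& b)
{
    return (std::uint32_t(b[0]) << 24) | (std::uint32_t(b[1]) << 16) |
           (std::uint32_t(b[2]) << 8)  |  std::uint32_t(b[3]);
}

// Split a packed value back into its four bytes.
Bytes4 unpack(std::uint32_t v)
{
    Bytes4 out = {{ std::uint8_t(v >> 24), std::uint8_t(v >> 16),
                    std::uint8_t(v >> 8),  std::uint8_t(v) }};
    return out;
}

// One 32-bit addition standing in for four byte additions.
// A sum above 255 in one byte carries into the byte to its left.
Bytes4 addPacked(const Bytes4& a, const Bytes4& b)
{
    return unpack(pack(a) + pack(b));
}
```

Adding { 1, 1, 1, 200 } to { 100, 100, 100, 100 } gives 300 in the last stream, which wraps to 44 and pushes a carry into the third stream, turning its 101 into 102.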
Introduction: SIMD using 32-bit values - Limitations

Ideally, we would like to see:

• Isolated streams
• Support for more data types (char, short, uint, int, float, double)
• An easy to use approach

Meet SSE!
Introduction: Vector Processors - Early systems

The Solomon project (1960)
One CPU feeding a number of ALUs with the same instruction, but different data. This way, a single algorithm is executed in parallel for a large dataset.

ILLIAC IV (1962)
Design: 1 GFLOPS using 256 ALUs. Actual implementation: 1974, 64 ALUs, ~100 MFLOPS. Fastest machine in the world for massively parallel tasks.

Cray-1 (1976)
Regular processor, but using vector registers of 64x64 bits. Reached 240 MFLOPS.

MMX (1997, P2) and SSE (1999, P3)
Vector registers and instructions added to a regular x86 processor.
Introduction: SIMD / SSE

SSE was first introduced with the Pentium-3 processor in 1999, and adds a set of 128-bit registers, as well as instructions to operate on these registers.

    32-bit:  { char, char, char, char }     = int
    128-bit: { float, float, float, float } = __m128
             { int, int, int, int }         = __m128i

Apart from storing 4 floats or ints, the registers can also store two 64-bit values, eight 16-bit values or sixteen 8-bit values (the integer and double-precision operations on these registers arrived with SSE2).
Introduction: SIMD / SSE

Problems when working with 32-bit integers:

• Streams are not separated (prone to overflow into next stream);
• Limited to small unsigned integer values;
• Hard to do multiplication / division.

Ideal situation:

• Isolated streams
• Support for more data types (char, short, uint, int, float, double)
• An easy to use approach

SSE offers these benefits, except for one (guess which :) ).
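To illustrate what "isolated streams" means in practice, here is a minimal sketch using the SSE2 intrinsic _mm_add_epi8, which adds sixteen 8-bit lanes independently. The names Bytes16 and addBytes are assumptions made for illustration.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (x86 / x64 only)
#include <array>
#include <cstdint>

using Bytes16 = std::array<std::uint8_t, 16>;   // hypothetical alias for this sketch

// Add sixteen pairs of bytes with one instruction. Each lane wraps around
// on its own and never carries into the neighbouring lane.
Bytes16 addBytes(const Bytes16& a, const Bytes16& b)
{
    __m128i a16 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a.data()));
    __m128i b16 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b.data()));
    __m128i sum = _mm_add_epi8(a16, b16);
    Bytes16 out;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), sum);
    return out;
}
```

Adding 200 to 100 in one lane still wraps to 44, but the lanes next to it are unaffected, unlike the packed 32-bit integer version.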
Today's Agenda:

• Introduction
• SIMD Concepts
• C++ / SSE & AVX
• C# / RyuJIT
Concepts: Streams

Consider the following scalar code:

    Vector3 D = Vector3.Normalize( T - P );

This is quite high-level. What the processor needs to do is:

    Vector3 tmp = T - P;
    float length = sqrt( tmp.x * tmp.x + tmp.y * tmp.y + tmp.z * tmp.z );
    D = tmp / length;
Concepts: Streams

Written out per component, Vector3 D = Vector3.Normalize( T - P ) becomes:

    float tmp_x = T.x - P.x;
    float tmp_y = T.y - P.y;
    float tmp_z = T.z - P.z;
    float sqlen = tmp_x * tmp_x + tmp_y * tmp_y + tmp_z * tmp_z;
    float length = sqrt( sqlen );
    D.x = tmp_x / length;
    D.y = tmp_y / length;
    D.z = tmp_z / length;
Concepts: Streams

Using vector instructions, Vector3 D = Vector3.Normalize( T - P ) becomes:

    __m128 A = T - P                 // 75%
    float  B = sqrt( dot( A, A ) )   // 75%
    __m128 C = { B, B, B }           // 75%, overhead
    __m128 D = A / C                 // 75%

The percentages indicate utilization: a Vector3 fills only three of the four lanes of a 128-bit register.
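A minimal SSE sketch of this one-vector-per-register approach. The struct Vec3 and the function name normalizedDifference are assumptions made for illustration (stand-ins for the slide's Vector3). The dot product is summed in scalar code here to keep the example within plain SSE.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 / x64 only)
#include <cmath>

struct Vec3 { float x, y, z; };   // hypothetical stand-in for the slide's Vector3

// Normalize( T - P ) with one vector per register; the fourth lane is unused.
Vec3 normalizedDifference(const Vec3& T, const Vec3& P)
{
    __m128 t = _mm_set_ps(0.0f, T.z, T.y, T.x);     // lanes: x, y, z, (unused)
    __m128 p = _mm_set_ps(0.0f, P.z, P.y, P.x);
    __m128 a = _mm_sub_ps(t, p);                    // A = T - P
    __m128 sq = _mm_mul_ps(a, a);                   // x*x, y*y, z*z, 0

    float s[4];
    _mm_storeu_ps(s, sq);
    float length = std::sqrt(s[0] + s[1] + s[2]);   // horizontal sum: scalar overhead

    __m128 d = _mm_div_ps(a, _mm_set1_ps(length));  // broadcast length, then divide
    float out[4];
    _mm_storeu_ps(out, d);
    Vec3 result = { out[0], out[1], out[2] };
    return result;
}
```

Only three of the four lanes carry useful data, and the horizontal sum plus the broadcast are extra work that the scalar version does not need.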
Concepts: Streams

Running Vector3 D = Vector3.Normalize( T - P ) as four scalar streams:

    Stream 0          Stream 1          Stream 2          Stream 3
    A = T.X - P.X     A = T.X - P.X     A = T.X - P.X     A = T.X - P.X
    B = T.Y - P.Y     B = T.Y - P.Y     B = T.Y - P.Y     B = T.Y - P.Y
    C = T.Z - P.Z     C = T.Z - P.Z     C = T.Z - P.Z     C = T.Z - P.Z
    D = A * A         D = A * A         D = A * A         D = A * A
    E = B * B         E = B * B         E = B * B         E = B * B
    F = C * C         F = C * C         F = C * C         F = C * C
    F += E            F += E            F += E            F += E
    F += D            F += D            F += D            F += D
    G = sqrt( F )     G = sqrt( F )     G = sqrt( F )     G = sqrt( F )
    D.X = A / G       D.X = A / G       D.X = A / G       D.X = A / G
    D.Y = B / G       D.Y = B / G       D.Y = B / G       D.Y = B / G
    D.Z = C / G       D.Z = C / G       D.Z = C / G       D.Z = C / G
Concepts: Streams

Optimal utilization of SIMD hardware is achieved when we run the same algorithm four times in parallel. This way, the approach also scales naturally to 8-wide, 16-wide and 32-wide SIMD.
Concepts: Streams – Data Organization

    A1 = T1.X - P1.X    A2 = T2.X - P2.X    A3 = T3.X - P3.X    A4 = T4.X - P4.X
    B1 = T1.Y - P1.Y    B2 = T2.Y - P2.Y    B3 = T3.Y - P3.Y    B4 = T4.Y - P4.Y
    C1 = T1.Z - P1.Z    C2 = T2.Z - P2.Z    C3 = T3.Z - P3.Z    C4 = T4.Z - P4.Z
    D1 = A1 * A1        D2 = A2 * A2        D3 = A3 * A3        D4 = A4 * A4
    E1 = B1 * B1        E2 = B2 * B2        E3 = B3 * B3        E4 = B4 * B4
    F1 = C1 * C1        F2 = C2 * C2        F3 = C3 * C3        F4 = C4 * C4
    F1 += E1            F2 += E2            F3 += E3            F4 += E4
    F1 += D1            F2 += D2            F3 += D3            F4 += D4
    G1 = sqrt( F1 )     G2 = sqrt( F2 )     G3 = sqrt( F3 )     G4 = sqrt( F4 )
    D1.X = A1 / G1      D2.X = A2 / G2      D3.X = A3 / G3      D4.X = A4 / G4
    D1.Y = B1 / G1      D2.Y = B2 / G2      D3.Y = B3 / G3      D4.Y = B4 / G4
    D1.Z = C1 / G1      D2.Z = C2 / G2      D3.Z = C3 / G3      D4.Z = C4 / G4
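Each row above maps onto one SSE instruction when the data of the four streams is stored per component. A minimal sketch, assuming a hypothetical Vec3x4 layout in which x[i], y[i] and z[i] together form vector i. The function name normalizedDifference4 is likewise an assumption made for illustration.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 / x64 only)

// Four vectors stored per component (hypothetical layout for this sketch):
// x[i], y[i], z[i] together form vector i.
struct Vec3x4 { float x[4], y[4], z[4]; };

// Normalize( T - P ) for four independent streams at once.
Vec3x4 normalizedDifference4(const Vec3x4& T, const Vec3x4& P)
{
    __m128 A = _mm_sub_ps(_mm_loadu_ps(T.x), _mm_loadu_ps(P.x));
    __m128 B = _mm_sub_ps(_mm_loadu_ps(T.y), _mm_loadu_ps(P.y));
    __m128 C = _mm_sub_ps(_mm_loadu_ps(T.z), _mm_loadu_ps(P.z));
    __m128 D = _mm_mul_ps(A, A);
    __m128 E = _mm_mul_ps(B, B);
    __m128 F = _mm_mul_ps(C, C);
    F = _mm_add_ps(F, E);
    F = _mm_add_ps(F, D);                 // four squared lengths
    __m128 G = _mm_sqrt_ps(F);            // four lengths
    Vec3x4 result;
    _mm_storeu_ps(result.x, _mm_div_ps(A, G));
    _mm_storeu_ps(result.y, _mm_div_ps(B, G));
    _mm_storeu_ps(result.z, _mm_div_ps(C, G));
    return result;
}
```

All four lanes do useful work in every instruction, and no horizontal sum or broadcast is needed.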