$$ I(x, x') = g(x, x') \left[ \epsilon(x, x') + \int_S \rho(x, x', x'') \, I(x', x'') \, dx'' \right] $$
INFOMAGR - Advanced Graphics
Jacco Bikker - February - April 2016
Today's Agenda: Introduction, Concepts, C++/SSE, C#/RyuJIT, Practical
Introduction
Advanced Graphics - SIMD Recap
S.I.M.D.
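Single Instruction Multiple Data: applying the same instruction to several input elements at once. In other words: if we are going to apply the same sequence of instructions to a large input set, SIMD lets us process several elements in parallel (and thus faster). SIMD is a form of data-level parallelism within a single instruction stream. It is sometimes loosely called instruction level parallelism, although that term normally refers to a processor executing several different instructions at the same time.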
Hardware - VLIW
Vector instructions:
    Vector4 a = { 1, PI, e, 4 };
    Vector4 b = { 4, 4, 4, 4 };
    Vector4 c = a * b;   // four multiplications in one instruction
Concept: instead of executing instruction A on a single item four times (A A A A), execute one vector instruction A4 on four items at once.
The "4" in the above is known as the vector width. Modern processors support 4-wide vectors (Pentium 3 and up), 8-wide (i3/i5/i7), 16-wide (Larrabee / Xeon Phi) and 32-wide or 64-wide (NVIDIA and AMD GPUs, respectively).
SIMD Using Integers
An integer is a 32-bit value, which means that it stores 4 bytes:
    unsigned char a[] = { 1, 2, 3, 4 };
    uint a4 = (1 << 24) + (2 << 16) + (3 << 8) + 4;
In C++ we can directly exploit this:
    union { unsigned char a[4]; uint a4; };
    a4 = (1 << 24) + (2 << 16) + (3 << 8) + 4;
    a[0]++; a[1]++; a[2]++; a[3]++;   // four separate byte increments
    a4 += 0x01010101;                 // the same four increments in a single instruction
C# also allows this, although it is a bit of a hack:
    using System.Runtime.InteropServices;
    [StructLayout(LayoutKind.Explicit)]
    struct byte_array
    {
        [FieldOffset(0)] public byte a;
        [FieldOffset(1)] public byte b;
        [FieldOffset(2)] public byte c;
        [FieldOffset(3)] public byte d;
        [FieldOffset(0)] public uint abcd;   // overlaps a, b, c and d
    }
uint = unsigned char[4]
Pinging google.com yields: 74.125.136.101. Each value is an unsigned 8-bit value (0..255). Combining them in one 32-bit integer:
    101 + 256 * 136 + 256 * 256 * 125 + 256 * 256 * 256 * 74 = 1249740901
Browse to: http://1249740901 (works!)
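The packing of four 8-bit values into one 32-bit integer can be sketched in C++ as follows. This is a minimal illustration; `pack_ipv4` and `octet` are hypothetical helper names, not part of any library.

```cpp
#include <cassert>
#include <cstdint>

// Pack four octets (most significant first) into one 32-bit value.
// pack_ipv4 is a hypothetical helper name used for illustration only.
constexpr std::uint32_t pack_ipv4( std::uint8_t a, std::uint8_t b, std::uint8_t c, std::uint8_t d )
{
    return (std::uint32_t( a ) << 24) | (std::uint32_t( b ) << 16) |
           (std::uint32_t( c ) << 8) | std::uint32_t( d );
}

// Extract octet 0 (most significant) .. 3 (least significant) again.
constexpr std::uint8_t octet( std::uint32_t packed, int index )
{
    return std::uint8_t( (packed >> (8 * (3 - index))) & 0xFF );
}
```

Calling `pack_ipv4( 74, 125, 136, 101 )` reproduces the value 1249740901 computed above.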
Evil use of this: We can specify a user name when visiting a website, but any user name will be accepted by Google. Like this: http://advgr2016@google.com Or: http://www.ing.nl@1249740901 An attacker could replace the IP address used here by that of a site hosting a copy of the ing.nl site, and send the link to a "friend" to obtain passwords.
Other Examples
Rapid string comparison:
    char a[] = "optimization skills rule";
    char b[] = "optimization is so nice!";
    bool equal = true;
    int l = strlen( a );
    for ( int i = 0; i < l; i++ )
    {
        if (a[i] != b[i]) { equal = false; break; }
    }
Comparing four characters at a time (both strings are 24 characters long, a multiple of 4):
    char a[] = "optimization skills rule";
    char b[] = "optimization is so nice!";
    bool equal = true;
    int q = strlen( a ) / 4;
    for ( int i = 0; i < q; i++ )
    {
        if (((int*)a)[i] != ((int*)b)[i]) { equal = false; break; }
    }
Likewise, we can copy byte arrays faster.
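The four-bytes-at-a-time comparison can also be written without the pointer casts. The sketch below is one possible variant, not the code from the slides: `equal_bytes4` is a hypothetical helper, and it uses `memcpy` for the 32-bit loads so that it does not depend on alignment, and it handles lengths that are not a multiple of four.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Compare two buffers of 'len' bytes, four bytes at a time.
// equal_bytes4 is a hypothetical helper name used for illustration only.
bool equal_bytes4( const char* a, const char* b, std::size_t len )
{
    std::size_t i = 0;
    for ( ; i + 4 <= len; i += 4 )
    {
        std::uint32_t wa, wb;
        std::memcpy( &wa, a + i, 4 ); // load 4 bytes of a as one 32-bit word
        std::memcpy( &wb, b + i, 4 ); // load 4 bytes of b as one 32-bit word
        if (wa != wb) return false;
    }
    // compare the remaining 0..3 bytes one by one
    for ( ; i < len; i++ ) if (a[i] != b[i]) return false;
    return true;
}
```

For the two 24-character strings above, this needs at most six 32-bit comparisons instead of 24 byte comparisons.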
SIMD using 32-bit values - Limitations
Mapping four chars to an int value has a number of limitations:
    { 100, 100, 100, 100 } + { 1, 1, 1, 200 } = { 101, 101, 102, 44 }
    { 100, 100, 100, 100 } * { 2, 2, 2, 2 } = { ... }
    { 100, 100, 100, 200 } * 2 = { 200, 200, 201, 144 }
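The overflow in the examples above can be reproduced with a few lines of C++. This is a small sketch with hypothetical helper names (`pack4`, `add4_packed`, `add4_per_byte`); it assumes the leftmost value of each group is stored in the most significant byte.

```cpp
#include <cassert>
#include <cstdint>

// Pack { b0, b1, b2, b3 } with b0 in the most significant byte (each value < 256).
constexpr std::uint32_t pack4( std::uint32_t b0, std::uint32_t b1, std::uint32_t b2, std::uint32_t b3 )
{
    return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3;
}

// One 32-bit addition: a carry out of a byte leaks into its left neighbour.
inline std::uint32_t add4_packed( std::uint32_t x, std::uint32_t y ) { return x + y; }

// Reference: add each byte separately, wrapping around at 256 within its own byte.
inline std::uint32_t add4_per_byte( std::uint32_t x, std::uint32_t y )
{
    std::uint32_t result = 0;
    for ( int shift = 0; shift < 32; shift += 8 )
    {
        const std::uint32_t sum = ((x >> shift) & 0xFF) + ((y >> shift) & 0xFF);
        result |= (sum & 0xFF) << shift;
    }
    return result;
}
```

The packed addition yields { 101, 101, 102, 44 }, while the per-byte reference yields { 101, 101, 101, 44 }: the difference is the carry that crossed from the last byte into its neighbour.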
In general:
Ideally, we would like to see:
Meet SSE!
Vector Processors - Early systems
The Solomon project (1960): One CPU feeding a number of ALUs with the same instruction, but different data.
ILLIAC IV (1962): Design: 1 GFLOPS using 256 ALUs. Actual implementation: 1974, 64 ALUs, ~100 MFLOPS. Fastest machine in the world for massively parallel tasks.
Cray-1 (1976): Regular processor, but using vector registers of 64x64 bits. Reached 240 MFLOPS.
MMX (1997, P2) and SSE (1999, P3): Vector registers and instructions added to a regular x86 processor.
SIMD / SSE
SSE was first introduced with the Pentium-3 processor in 1999, and adds a set of 128-bit registers, as well as instructions to operate on these registers.
    32-bit:  { char, char, char, char } = int
    128-bit: { float, float, float, float } = __m128
             { int, int, int, int } = __m128i
Apart from storing 4 floats or ints, the registers can also store two 64-bit values, eight 16-bit values or sixteen 8-bit values.
Problems when working with 32-bit integers:
Ideal situation:
SSE offers these benefits, except for one (guess which).
Concepts
Streams
Consider the following scalar code:
    Vector3 D = Vector3.Normalize( T - P );
This is quite high-level. What the processor needs to do is:
    Vector3 tmp = T - P;
    float length = sqrt( tmp.x * tmp.x + tmp.y * tmp.y + tmp.z * tmp.z );
    D = tmp / length;
Written out per component, this becomes:
    float tmp_x = T.x - P.x;
    float tmp_y = T.y - P.y;
    float tmp_z = T.z - P.z;
    float sqlen = tmp_x * tmp_x + tmp_y * tmp_y + tmp_z * tmp_z;
    float length = sqrt( sqlen );
    D.x = tmp_x / length;
    D.y = tmp_y / length;
    D.z = tmp_z / length;
Using vector instructions for the same code (the comments show how much of the 4-wide vector does useful work):
    __m128 A = T - P                  // 75%
    float B = sqrt( dot( A, A ) )     // 75%
    __m128 C = { B, B, B }            // 75%, overhead
    __m128 D = A / C                  // 75%
Alternatively, the same scalar instruction stream can be executed for four independent inputs (streams 0, 1, 2 and 3):
    A = T.X - P.X
    B = T.Y - P.Y
    C = T.Z - P.Z
    D = A * A
    E = B * B
    F = C * C
    F += E
    F += D
    G = sqrt( F )
    D.X = A / G
    D.Y = B / G
    D.Z = C / G
Optimal utilization of SIMD hardware is achieved when we run the same algorithm four times in parallel. This way, the approach also scales naturally to 8-wide, 16-wide and 32-wide SIMD.
Streams - Data Organization
The four streams, with i = 1, 2, 3, 4 (each stream reads its own vectors Ti and Pi):
    Ai = Ti.X - Pi.X
    Bi = Ti.Y - Pi.Y
    Ci = Ti.Z - Pi.Z
    Di = Ai * Ai
    Ei = Bi * Bi
    Fi = Ci * Ci
    Fi += Ei
    Fi += Di
    Gi = sqrt( Fi )
    Di.X = Ai / Gi
    Di.Y = Bi / Gi
    Di.Z = Ci / Gi
Reorganized so that each component is read from its own array (again for i = 1, 2, 3, 4):
    Ai = TXi - PXi
    Bi = TYi - PYi
    Ci = TZi - PZi
    Di = Ai * Ai
    Ei = Bi * Bi
    Fi = Ci * Ci
    Fi += Ei
    Fi += Di
    Gi = sqrt( Fi )
    DXi = Ai / Gi
    DYi = Bi / Gi
    DZi = Ci / Gi
Input:
    TX = { T1.x, T2.x, T3.x, T4.x };
    PX = { P1.x, P2.x, P3.x, P4.x };
    TY = { T1.y, T2.y, T3.y, T4.y };
    PY = { P1.y, P2.y, P3.y, P4.y };
    TZ = { T1.z, T2.z, T3.z, T4.z };
    PZ = { P1.z, P2.z, P3.z, P4.z };
Consider the following data structure:
    struct Particle { float x, y, z; int mass; };
    Particle particle[512];
Stored as a structure of arrays instead, with each array also accessible four elements at a time:
    union { float x[512]; __m128 x4[128]; };
    union { float y[512]; __m128 y4[128]; };
    union { float z[512]; __m128 z4[128]; };
    union { int mass[512]; __m128i mass4[128]; };
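A structure-of-arrays layout for the particles can be sketched as follows. This is an illustration under stated assumptions, not the course code: `ParticlesSoA` and `translate` are hypothetical names, and the sketch relies on reading a union through a different member than the one last written, which MSVC, GCC and Clang support as an extension.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE: __m128, _mm_set1_ps, _mm_add_ps

constexpr int N = 512;

// Structure of arrays: each component has its own array, and the union lets us
// address the same memory per float or per group of four floats.
struct ParticlesSoA
{
    union { float x[N]; __m128 x4[N / 4]; };
    union { float y[N]; __m128 y4[N / 4]; };
    union { float z[N]; __m128 z4[N / 4]; };
};

// Move all particles by (dx, dy, dz), four particles per instruction.
void translate( ParticlesSoA& p, float dx, float dy, float dz )
{
    const __m128 dx4 = _mm_set1_ps( dx );
    const __m128 dy4 = _mm_set1_ps( dy );
    const __m128 dz4 = _mm_set1_ps( dz );
    for ( int i = 0; i < N / 4; i++ )
    {
        p.x4[i] = _mm_add_ps( p.x4[i], dx4 );
        p.y4[i] = _mm_add_ps( p.y4[i], dy4 );
        p.z4[i] = _mm_add_ps( p.z4[i], dz4 );
    }
}
```

With the original array-of-structures layout, the x values of four consecutive particles are 16 bytes apart and cannot be loaded as one vector; with this layout they are adjacent in memory.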
Streams - Ray Tracing
Leveraging SIMD for ray tracing:
Option 3 is the least intrusive:
    class Ray4
    {
    public:
        __m128 ox4, oy4, oz4;
        __m128 dx4, dy4, dz4;
        __m128 t4;
    };
Scalar ray/triangle intersection code:
    vec3 e1 = tri.V2 - tri.V1;
    vec3 e2 = tri.V3 - tri.V1;
    vec3 P = cross( D, e2 );
    float det = dot( e1, P );
    if (det > -EPS && det < EPS) return NOHIT;
    float inv_det = 1 / det;
    vec3 T = O - tri.V1;
    float u = dot( T, P ) * inv_det;
    if (u < 0 || u > 1) return NOHIT;
    vec3 Q = cross( T, e1 );
    float v = dot( D, Q ) * inv_det;
    if (v < 0 || u + v > 1) return NOHIT;
    float t = dot( e2, Q ) * inv_det;
    if (t > EPSILON) { *out = t; return HIT; }
    return NOHIT;
Streams - Flow Divergence
Like other instructions, comparisons between vectors yield a vector of booleans.
    __m128 mask = _mm_cmpeq_ps( v1, v2 );
The mask contains a bitfield: 32 x '1' for each TRUE, 32 x '0' for each FALSE. The mask can be converted to a 4-bit integer using _mm_movemask_ps:
    int result = _mm_movemask_ps( mask );
Now we can use regular conditionals:
    if (result == 0) { /* false for all streams */ }
    if (result == 15) { /* true for all streams */ }
    if (result < 15) { /* not true for all streams */ }
    if (result > 0) { /* not false for all streams */ }
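The comparison and movemask steps can be tried in isolation with a small self-contained function. This is a minimal sketch; `lanes_less_than` is a hypothetical helper name, and it uses a less-than comparison rather than the equality test shown above.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE: _mm_loadu_ps, _mm_cmplt_ps, _mm_movemask_ps

// Returns a 4-bit integer: bit i is set when a[i] < b[i].
// lanes_less_than is a hypothetical helper name used for illustration only.
int lanes_less_than( const float a[4], const float b[4] )
{
    const __m128 a4 = _mm_loadu_ps( a );        // unaligned load: lane i holds a[i]
    const __m128 b4 = _mm_loadu_ps( b );
    const __m128 mask = _mm_cmplt_ps( a4, b4 ); // all ones in each lane where a < b
    return _mm_movemask_ps( mask );             // collects the top bit of each lane
}
```

A result of 0 means the comparison failed for all four streams, 15 means it succeeded for all of them, and any other value indicates divergence between the streams.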
Streams - Masking
More powerful than "any", "all" or "none" via movemask is masking.
    if (det > -EPS && det < EPS) return NOHIT;
Translated to SSE:
    __m128 mask1 = _mm_cmple_ps( det4, MINUSEPS4 );
    __m128 mask2 = _mm_cmpge_ps( det4, EPSILON4 );
    __m128 det4mask = _mm_or_ps( mask1, mask2 );   // set for rays that are still alive
    if (_mm_movemask_ps( det4mask ) == 0) return NOHIT; // all rays missed
Note that even if only one ray survives, we continue executing the algorithm. A few lines later we have another check:
    if (u < 0 || u > 1) return NOHIT;
Like last time, we translate
    if (u < 0 || u > 1) return NOHIT;
to
    mask1 = _mm_cmpge_ps( u4, ZERO4 );
    mask2 = _mm_cmple_ps( u4, ONE4 );
    umask = _mm_and_ps( mask1, mask2 );
Some rays may have "died" in the previous conditional statement, so we include the mask produced by that condition:
    combinedmask = _mm_and_ps( det4mask, umask );
    if (_mm_movemask_ps( combinedmask ) == 0) return NOHIT;
Particularly interesting is the last conditional:
    if (t > EPSILON) { *out = t; return HIT; }
For four rays, we only want to change the distance we return for those rays that are still "alive". For this, we use a blend operation:
    __m128 t4_out = _mm_blendv_ps( t4_in, t4, finalmask );
Note that the variable blend _mm_blendv_ps (SSE4.1) is needed here, since it takes the mask as a vector; _mm_blend_ps expects a compile-time constant instead.
The beauty here is that, to the processor, this is not conditional code.
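The masked update can be sketched without SSE4.1 by combining and, andnot and or, which gives the same per-lane result for masks produced by SSE comparisons. The names `select4` and `update_distance` are hypothetical, and this is an illustration of the idea rather than the code used in the course.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE: _mm_and_ps, _mm_andnot_ps, _mm_or_ps, _mm_cmpgt_ps

// Per-lane select: where mask is all ones take 'b', elsewhere keep 'a'.
// select4 is a hypothetical helper name used for illustration only.
inline __m128 select4( __m128 a, __m128 b, __m128 mask )
{
    // _mm_andnot_ps( mask, a ) computes (NOT mask) AND a
    return _mm_or_ps( _mm_and_ps( mask, b ), _mm_andnot_ps( mask, a ) );
}

// Overwrite t_out[i] with t_new[i] only for lanes where t_new[i] > eps.
void update_distance( float t_out[4], const float t_new[4], float eps )
{
    const __m128 old4 = _mm_loadu_ps( t_out );
    const __m128 new4 = _mm_loadu_ps( t_new );
    const __m128 mask = _mm_cmpgt_ps( new4, _mm_set1_ps( eps ) );
    _mm_storeu_ps( t_out, select4( old4, new4, mask ) );
}
```

There is no branch in `update_distance`: all four lanes execute the same instructions, and the mask alone decides which lanes receive the new distance.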
An interesting question when working with masking is: should we unconditionally continue, even if the number of surviving rays might be zero? We are balancing the cost of executing conditional code against the cost of sometimes executing code for zero rays. In practice, you will have to try to see which is quickest.
Streams - Summary
Practical use of SSE / AVX:
These concepts apply to SSE, AVX and SIMD in C#.
C++/SSE
Basic SSE
Any PC since the Pentium 3 will support SSE (even Atom processors). Every 64-bit x86 processor supports at least SSE2, and in practice it is usually safe to assume SSE4. Basic operations:
    __m128 a4 = _mm_set_ps( 1.0f, 2.0f, 3.14159f, 1.41f ); // the last argument ends up in element 0
    __m128 b4 = _mm_set_ps1( 2.0f ); // broadcast
    __m128 c4 = _mm_add_ps( a4, b4 );
    __m128 d4 = _mm_div_ps( a4, b4 );
    __m128 e4 = _mm_sqrt_ps( a4 );
Example: normalizing four vectors:
    __m128 x4 = _mm_set_ps( A.x, B.x, C.x, D.x );
    __m128 y4 = _mm_set_ps( A.y, B.y, C.y, D.y );
    __m128 z4 = _mm_set_ps( A.z, B.z, C.z, D.z );
    __m128 sqX4 = _mm_mul_ps( x4, x4 );
    __m128 sqY4 = _mm_mul_ps( y4, y4 );
    __m128 sqZ4 = _mm_mul_ps( z4, z4 );
    __m128 sqlen4 = _mm_add_ps( _mm_add_ps( sqX4, sqY4 ), sqZ4 );
    __m128 len4 = _mm_sqrt_ps( sqlen4 );
    x4 = _mm_div_ps( x4, len4 );
    y4 = _mm_div_ps( y4, len4 );
    z4 = _mm_div_ps( z4, len4 );
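The normalization example can be wrapped into a small self-contained function for experimentation. This sketch assumes the four vectors are already stored as separate x, y and z arrays, and `normalize4` is a hypothetical name; zero-length vectors are not handled.

```cpp
#include <cassert>
#include <cmath>
#include <xmmintrin.h> // SSE: loads, stores, mul, add, sqrt, div

// Normalize four 3D vectors stored as separate x, y and z arrays (SoA layout).
// normalize4 is a hypothetical helper name; zero-length input is not handled.
void normalize4( float x[4], float y[4], float z[4] )
{
    const __m128 x4 = _mm_loadu_ps( x );
    const __m128 y4 = _mm_loadu_ps( y );
    const __m128 z4 = _mm_loadu_ps( z );
    // squared length of all four vectors at once
    const __m128 sqlen4 = _mm_add_ps( _mm_add_ps( _mm_mul_ps( x4, x4 ), _mm_mul_ps( y4, y4 ) ),
                                      _mm_mul_ps( z4, z4 ) );
    const __m128 len4 = _mm_sqrt_ps( sqlen4 );
    _mm_storeu_ps( x, _mm_div_ps( x4, len4 ) );
    _mm_storeu_ps( y, _mm_div_ps( y4, len4 ) );
    _mm_storeu_ps( z, _mm_div_ps( z4, len4 ) );
}
```

Every instruction in the function processes four vectors, so none of the vector lanes is wasted.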
Intermediate SSE
SSE includes powerful functions that help avoid conditional code, as well as specialized arithmetic functions.
    __m128 min4 = _mm_min_ps( a4, b4 );
    __m128 max4 = _mm_max_ps( a4, b4 );
    __m128 one_over_sq4 = _mm_rsqrt_ps( a4 );  // approximate reciprocal square root
    __m128i int4 = _mm_cvtps_epi32( a4 );      // convert to integer (rounds to nearest)
    __m128 f4 = _mm_cvtepi32_ps( int4 );       // convert to float
Advanced SSE
Comparisons and masking.
    __m128 mask4a = _mm_cmple_ps( a4, b4 );   // less or equal
    __m128 mask4b = _mm_cmpgt_ps( a4, b4 );   // greater than
    __m128 mask4c = _mm_cmpneq_ps( a4, b4 );  // not equal
    __m128 mask4d = _mm_cmpeq_ps( a4, b4 );   // equal
    __m128 combined = _mm_and_ps( mask4a, mask4b );
    __m128 inverted = _mm_andnot_ps( mask4a, mask4b );  // (NOT mask4a) AND mask4b
    __m128 either = _mm_or_ps( mask4a, mask4b );
    __m128 blended = _mm_blendv_ps( a4, b4, mask4a );   // b4 where the mask is set, a4 elsewhere (SSE4.1)
A good source of additional information is MSDN: https://msdn.microsoft.com/en-us/library/bb892950(v=vs.90).aspx
AVX
Recent CPUs support 8-wide SIMD through AVX. Simply replace __m128 with __m256, and replace the _mm prefix of each function with _mm256:
    __m256 a8 = _mm256_set1_ps( 0 );
Alignment
SSE and AVX data must be properly aligned: __m128 must be aligned to 16 bytes; __m256 must be aligned to 32 bytes. Visual Studio will do this for you for variables on the stack. When allocating buffers dynamically, you need to request aligned memory yourself:
    __m128* data = (__m128*)_aligned_malloc( 1024 * sizeof( __m128 ), 16 );
Memory obtained this way must be released with _aligned_free.
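Outside Visual Studio, the same can be done with _mm_malloc and _mm_free, which common compilers make available through the SSE header. The sketch below assumes this; `alloc_vec4`, `free_vec4` and `is_aligned16` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h> // SSE; also makes _mm_malloc / _mm_free available

// Allocate 'count' __m128 values on a 16-byte boundary.
// alloc_vec4 and free_vec4 are hypothetical helper names.
__m128* alloc_vec4( std::size_t count )
{
    return static_cast<__m128*>( _mm_malloc( count * sizeof( __m128 ), 16 ) );
}

// Memory from _mm_malloc must be released with _mm_free, not free or delete.
void free_vec4( __m128* p ) { _mm_free( p ); }

// True when the address is a multiple of 16.
bool is_aligned16( const void* p )
{
    return (reinterpret_cast<std::uintptr_t>( p ) & 15) == 0;
}
```

For AVX data the same approach applies with an alignment of 32 instead of 16.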
Debugging
The Visual Studio debugger considers __m128 and __m256 to be basic types. In the debugger you can inspect them as arrays of floats, ints, shorts, bytes etc.
C#/RyuJIT
RyuJIT
Needful things for Windows 7 / VS2013:
Note: many websites discuss older versions of RyuJIT. You do not need to set environment variables.
System.Numerics.Vectors
    namespace System.Numerics
    {
        public struct Vector3 : IEquatable<Vector3>, IFormattable
        {
            public float X;
            public float Y;
            public float Z;
            public Vector3(float value);
            public Vector3(Vector2 value, float z);
            public Vector3(float x, float y, float z);
            public static Vector3 operator -(Vector3 value);
            public static Vector3 operator -(Vector3 left, Vector3 right);
            public static bool operator !=(Vector3 left, Vector3 right);
            public static Vector3 operator *(float left, Vector3 right);
            public static Vector3 operator *(Vector3 left, float right);
            public static Vector3 operator *(Vector3 left, Vector3 right);
            public static Vector3 operator /(Vector3 value1, float value2);
            public static Vector3 operator /(Vector3 left, Vector3 right);
            public static Vector3 operator +(Vector3 left, Vector3 right);
            public static bool operator ==(Vector3 left, Vector3 right);
            ...
        }
    }
SIMD Data - Structure of Arrays
    class Ray4
    {
        public Vector<float> OX4, OY4, OZ4;
        public Vector<float> DX4, DY4, DZ4;
        public Vector<float> t4, u4, v4;
        public Vector<float> NX4, NY4, NZ4;
        public Vector<int> objIdx;
    }
    Ray4 ray;
Vectorization
    public Ray Generate( Random rng, int x, int y )
    {
        float r0 = (float)rng.NextDouble();
        float r1 = (float)rng.NextDouble();
        float r2 = (float)rng.NextDouble() - 0.5f;
        float r3 = (float)rng.NextDouble() - 0.5f;
        // calculate sub-pixel ray target position on screen plane
        float u = ((float)x + r0) / (float)screenWidth;
        float v = ((float)y + r1) / (float)screenHeight;
        Vector3 T = p1 + u * (p2 - p1) + v * (p3 - p1);
        // calculate position on aperture
        Vector3 P = pos + lensSize * (r2 * right + r3 * up);
        // calculate ray direction
        Vector3 D = Vector3.Normalize( T - P );
        // return new primary ray
        return new Ray( P, D, 1e34f );
    }
public Ray4 Generate4( Random rng, int x, int y ) { ... }
    public Ray4 Generate4( Random rng, int x, int y )
    {
        // float r0 = (float)rng.NextDouble();
        // float r1 = (float)rng.NextDouble();
        // float r2 = (float)rng.NextDouble() - 0.5f;
        // float r3 = (float)rng.NextDouble() - 0.5f;
        // note: four values per array assumes that Vector<float>.Count is 4
        float [] r0 = { (float)rng.NextDouble(), (float)rng.NextDouble(), (float)rng.NextDouble(), (float)rng.NextDouble() };
        Vector<float> r0_4 = new Vector<float>( r0 );
        float [] r1 = { (float)rng.NextDouble(), (float)rng.NextDouble(), (float)rng.NextDouble(), (float)rng.NextDouble() };
        Vector<float> r1_4 = new Vector<float>( r1 );
        float [] r2 = { (float)rng.NextDouble() - 0.5f, (float)rng.NextDouble() - 0.5f, (float)rng.NextDouble() - 0.5f, (float)rng.NextDouble() - 0.5f };
        Vector<float> r2_4 = new Vector<float>( r2 );
        float [] r3 = { (float)rng.NextDouble() - 0.5f, (float)rng.NextDouble() - 0.5f, (float)rng.NextDouble() - 0.5f, (float)rng.NextDouble() - 0.5f };
        Vector<float> r3_4 = new Vector<float>( r3 );
        ...
    }
    public Ray4 Generate4( Random rng, int x, int y )
    {
        ...
        // calculate sub-pixel ray target position on screen plane
        // float u = ((float)x + r0) / (float)screenWidth;
        // float v = ((float)y + r1) / (float)screenHeight;
        float [] values = { x, x + 1, x + 2, x + 3 };
        Vector<float> x4 = new Vector<float>( values );
        Vector<float> y4 = new Vector<float>( y );
        Vector<float> u4 = (x4 + r0_4) / screenWidth4;
        Vector<float> v4 = (y4 + r1_4) / screenHeight4;
        ...
    }
    public Ray4 Generate4( Random rng, int x, int y )
    {
        ...
        // Vector3 T = p1 + u * (p2 - p1) + v * (p3 - p1);
        Vector<float> Tx4 = p1x4 + u4 * (p2x4 - p1x4) + v4 * (p3x4 - p1x4);
        Vector<float> Ty4 = p1y4 + u4 * (p2y4 - p1y4) + v4 * (p3y4 - p1y4);
        Vector<float> Tz4 = p1z4 + u4 * (p2z4 - p1z4) + v4 * (p3z4 - p1z4);
        ...
    }
    public Ray4 Generate4( Random rng, int x, int y )
    {
        ...
        // Vector3 P = pos + lensSize * (r2 * right + r3 * up);
        Vector<float> Px4 = posx4 + lensSize4 * (r2_4 * rightx4 + r3_4 * upx4);
        Vector<float> Py4 = posy4 + lensSize4 * (r2_4 * righty4 + r3_4 * upy4);
        Vector<float> Pz4 = posz4 + lensSize4 * (r2_4 * rightz4 + r3_4 * upz4);
        ...
    }
    public Ray4 Generate4( Random rng, int x, int y )
    {
        ...
        // Vector3 D = Vector3.Normalize( T - P );
        // (named Dx4, Dy4, Dz4 to avoid a clash with the pixel coordinates x4 and y4)
        Vector<float> Dx4 = Tx4 - Px4;
        Vector<float> Dy4 = Ty4 - Py4;
        Vector<float> Dz4 = Tz4 - Pz4;
        Vector<float> len4 = Vector.SquareRoot<float>( Dx4 * Dx4 + Dy4 * Dy4 + Dz4 * Dz4 );
        Dx4 /= len4; Dy4 /= len4; Dz4 /= len4;
        // return new primary rays
        Ray4 r4 = new Ray4();
        r4.OX4 = Px4; r4.OY4 = Py4; r4.OZ4 = Pz4;
        r4.DX4 = Dx4; r4.DY4 = Dy4; r4.DZ4 = Dz4;
        r4.t4 = new Vector<float>( 1e34f );
        return r4;
    }
Digest:
if we previously used pos.x, we now use pos.x4 (not pos.xyzw!), if we previously used constant PI, we now use { PI, PI, PI, PI }.
Theoretical improvement is 4x. However:
Practical
Ray/AABB Intersection
Intersection of a ray and an AABB can be efficiently calculated using the slab test*:
If t_min < t_max and t_max > 0, the ray intersects the AABB.
*: Kay and Kajiya, Ray tracing complex scenes. In: Proceedings of ACM SIGGRAPH 1986, pages 269-278.
Scalar code (3D):
    bool intersection( box b, ray r )
    {
        float tx1 = (b.min.x - r.O.x) * r.rD.x;
        float tx2 = (b.max.x - r.O.x) * r.rD.x;
        float tmin = min(tx1, tx2);
        float tmax = max(tx1, tx2);
        float ty1 = (b.min.y - r.O.y) * r.rD.y;
        float ty2 = (b.max.y - r.O.y) * r.rD.y;
        tmin = max(tmin, min(ty1, ty2));
        tmax = min(tmax, max(ty1, ty2));
        float tz1 = (b.min.z - r.O.z) * r.rD.z;
        float tz2 = (b.max.z - r.O.z) * r.rD.z;
        tmin = max(tmin, min(tz1, tz2));
        tmax = min(tmax, max(tz1, tz2));
        return tmax >= tmin && tmax >= 0;
    }
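A compilable variant of the scalar slab test, with the data types spelled out, could look like this. The types `Vec3`, `Box` and `Ray` and the helper `make_ray` are assumptions made for this sketch; `rD` holds the reciprocal of the ray direction, as in the code above.

```cpp
#include <algorithm>
#include <cassert>

// Minimal types assumed for this sketch.
struct Vec3 { float x, y, z; };
struct Box { Vec3 min, max; };
struct Ray { Vec3 O, rD; }; // rD: per-component reciprocal of the direction (1 / D)

// Slab test: intersect the ray with the three pairs of axis-aligned planes.
bool intersects( const Box& b, const Ray& r )
{
    const float tx1 = (b.min.x - r.O.x) * r.rD.x, tx2 = (b.max.x - r.O.x) * r.rD.x;
    float tmin = std::min( tx1, tx2 ), tmax = std::max( tx1, tx2 );
    const float ty1 = (b.min.y - r.O.y) * r.rD.y, ty2 = (b.max.y - r.O.y) * r.rD.y;
    tmin = std::max( tmin, std::min( ty1, ty2 ) );
    tmax = std::min( tmax, std::max( ty1, ty2 ) );
    const float tz1 = (b.min.z - r.O.z) * r.rD.z, tz2 = (b.max.z - r.O.z) * r.rD.z;
    tmin = std::max( tmin, std::min( tz1, tz2 ) );
    tmax = std::min( tmax, std::max( tz1, tz2 ) );
    return tmax >= tmin && tmax >= 0;
}

// Build a ray from an origin and a direction without zero components.
Ray make_ray( Vec3 O, Vec3 D )
{
    return Ray{ O, Vec3{ 1.0f / D.x, 1.0f / D.y, 1.0f / D.z } };
}
```

Storing the reciprocal direction once per ray replaces six divisions per box test by six multiplications.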
Vector code:
    bool intersection( box b, ray r )
    {
        __m128 t1 = _mm_mul_ps( _mm_sub_ps( node->bmin4, O4 ), rD4 );
        __m128 t2 = _mm_mul_ps( _mm_sub_ps( node->bmax4, O4 ), rD4 );
        __m128 vmax4 = _mm_max_ps( t1, t2 ), vmin4 = _mm_min_ps( t1, t2 );
        float* vmax = (float*)&vmax4, *vmin = (float*)&vmin4;
        float tmax = min(vmax[0], min(vmax[1], vmax[2]));
        float tmin = max(vmin[0], max(vmin[1], vmin[2]));
        return tmax >= tmin && tmax >= 0;
    }
To load the bounds as vectors, the BVH node layout is changed in steps. Original layout:
    struct BVHNode { AABB bounds; int leftFirst; int count; };
Interleaved, so that each bound is followed by one integer:
    struct BVHNode { float3 bmin; int leftFirst; float3 bmax; int count; };
With unions, so that each 16-byte half can be read as a single __m128:
    struct BVHNode
    {
        union { struct { float3 bmin; int leftFirst; }; __m128 bmin4; };
        union { struct { float3 bmax; int count; }; __m128 bmax4; };
    };
The fourth element of bmin4 and bmax4 holds leftFirst and count, which is why the vector code only uses elements 0..2.
Check here for an even faster version: http://www.flipcode.com/archives/SSE_RayBox_Intersection_Test.shtml
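For experimentation, the vector version of the slab test can be written as a self-contained function. This is a sketch that follows the vector code above but takes its inputs as parameters; `load3` and `intersects_sse` are hypothetical names, and the fourth lane is treated as padding.

```cpp
#include <algorithm>
#include <cassert>
#include <xmmintrin.h> // SSE: set, sub, mul, min, max, store

// Put x, y and z in lanes 0, 1 and 2; lane 3 is padding and is ignored below.
// load3 is a hypothetical helper name used for illustration only.
inline __m128 load3( float x, float y, float z ) { return _mm_set_ps( 0.0f, z, y, x ); }

// Slab test with the x, y and z slabs handled in one SSE register.
bool intersects_sse( __m128 bmin4, __m128 bmax4, __m128 O4, __m128 rD4 )
{
    const __m128 t1 = _mm_mul_ps( _mm_sub_ps( bmin4, O4 ), rD4 );
    const __m128 t2 = _mm_mul_ps( _mm_sub_ps( bmax4, O4 ), rD4 );
    alignas( 16 ) float vmax[4], vmin[4];
    _mm_store_ps( vmax, _mm_max_ps( t1, t2 ) ); // far plane distance per axis
    _mm_store_ps( vmin, _mm_min_ps( t1, t2 ) ); // near plane distance per axis
    const float tmax = std::min( vmax[0], std::min( vmax[1], vmax[2] ) );
    const float tmin = std::max( vmin[0], std::max( vmin[1], vmin[2] ) );
    return tmax >= tmin && tmax >= 0;
}
```

Only the subtraction, multiplication, min and max run as vector instructions; the final reduction over the three axes is still scalar.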
Next lecture: "Light Transport"