Welcome! , = (, ) , + , , - - PowerPoint PPT Presentation

β–Ά
welcome
SMART_READER_LITE
LIVE PREVIEW

Welcome! , = (, ) , + , , - - PowerPoint PPT Presentation

INFOMAGR Advanced Graphics Jacco Bikker - February April 2016 Welcome! , = (, ) , + , , , Todays Agenda:


slide-1
SLIDE 1

𝑱 π’š, π’šβ€² = 𝒉(π’š, π’šβ€²) 𝝑 π’š, π’šβ€² +

𝑻

𝝇 π’š, π’šβ€², π’šβ€²β€² 𝑱 π’šβ€², π’šβ€²β€² π’†π’šβ€²β€²

INFOMAGR – Advanced Graphics

Jacco Bikker - February – April 2016

Welcome!

slide-2
SLIDE 2

Today’s Agenda:

  • Introduction
  • SIMD Concepts
  • C++ / SSE & AVX
  • C# / RyuJIT
slide-3
SLIDE 3

Introduction

Advanced Graphics – SIMD Recap 3

S.I.M.D.

Single Instruction Multiple Data: Applying the same instruction to several input elements. In other words: if we are going to apply the same sequence of instructions to a large input set, this allows us to do this in parallel (and thus: faster). SIMD is also known as instruction level parallelism.

slide-4
SLIDE 4

Introduction

Advanced Graphics – SIMD Recap 4

Hardware – VLIW

Vector instructions:

Vector4 a = { 1, PI, e, 4 }; Vector4 b = { 4, 4, 4, 4 }; Vector4 c = a * b; Concept:

  • function A4 consists of instructions operating on 4 items
  • executing A4 requires the same number of instructions required to

execute A on a single item

  • throughput of A4 is four times higher.

The β€˜4’ in the above is known as the vector width. Modern processors support 4-wide vectors (Pentium 3 and up), 8-wide (i3/i5/i7), 16-wide (Larrabee / Xeon Phi) and 32-wide (NVidia and AMD GPUs).

A A A A A4

slide-5
SLIDE 5

Introduction

Advanced Graphics – SIMD Recap 5

SIMD Using Integers

An integer is a 32-bit value, which means that it stores 4 bytes:

char[] a = { 1, 2, 3, 4 }; uint a4 = (1 << 24) + (2 << 16) + (3 << 8) + 4;

In C++ we can directly exploit this:

union { char a[4]; uint a4; }; a4 = (1 << 24) + (2 << 16) + (3 << 8) + 4; a[0]++; a[1]++; a[2]++; a[3]++; a4 += 0x01010101;

A4

slide-6
SLIDE 6

Introduction

Advanced Graphics – SIMD Recap 6

SIMD Using Integers

An integer is a 32-bit value, which means that it stores 4 bytes:

char[] a = { 1, 2, 3, 4 }; uint a4 = (1 << 24) + (2 << 16) + (3 << 8) + 4;

C# also allows this, although it is a bit of a hack:

[StructLayout(LayoutKind.Explicit)] struct byte_array { [FieldOffset(0)] public byte a; [FieldOffset(1)] public byte b; [FieldOffset(2)] public byte c; [FieldOffset(3)] public byte d; [FieldOffset(0)] public unsigned int abcd; }

A4

slide-7
SLIDE 7

Introduction

Advanced Graphics – SIMD Recap 7

uint = unsigned char[4]

Pinging google.com yields: 74.125.136.101 Each value is an unsigned 8-bit value (0..255). Combing them in one 32-bit integer: 101 + 256 * 136 + 256 * 256 * 125 + 256 * 256 * 256 * 74 = 1249740901. Browse to: http://1249740901 (works!)

Evil use of this: We can specify a user name when visiting a website, but any username will be accepted by google. Like this: http://advgr2016@google.com Or: http://www.ing.nl@1249740901 Replace the IP address used here by your own site which contains a copy of the ing.nl site to obtain passwords, and send the link to a β€˜friend’.

slide-8
SLIDE 8

Introduction

Advanced Graphics – SIMD Recap 8

Other Examples

Rapid string comparison:

char a[] = β€œoptimization skills rule”; char b[] = β€œoptimization is so nice!”; bool equal = true; int l = strlen( a ); for ( int i = 0; i < l; i++ ) { if (a[i] != b[i]) { equal = false; break; } }

Likewise, we can copy byte arrays faster.

char a[] = β€œoptimization skills rule”; char b[] = β€œoptimization is so nice!”; bool equal = true; int q = strlen( a ) / 4; for ( int i = 0; i < q; i++ ) { if (((int*)a)[i] != ((int*)b)[i]) { equal = false; break; } }

slide-9
SLIDE 9

Introduction

Advanced Graphics – SIMD Recap 9

SIMD using 32-bit values - Limitations

Mapping four chars to an int value has a number of limitations:

{ 100, 100, 100, 100 } + { 1, 1, 1, 200 } = { 101, 101, 102, 44 } { 100, 100, 100, 100 } * { 2, 2, 2, 2 } = { … } { 100, 100, 100, 200 } * 2 = { 200, 200, 201, 144 }

In general:

  • Streams are not separated (prone to overflow into next stream);
  • Limited to small unsigned integer values;
  • Hard to do multiplication / division.
slide-10
SLIDE 10

Introduction

Advanced Graphics – SIMD Recap 10

SIMD using 32-bit values - Limitations

Ideally, we would like to see:

  • Isolated streams
  • Support for more data types (char, short, uint, int, float, double)
  • An easy to use approach

Meet SSE!

slide-11
SLIDE 11

Introduction

Advanced Graphics – SIMD Recap 11

Vector Processors - Early systems

The Solomon project (1960) One CPU feeding a number of ALUs with the same instruction, but different

  • data. This way, a single algorithm is executed in parallel for a large dataset.

ILLIAC IV (1962) Design: 1 GFLOPS using 256 ALUs. Actual implementation: 1974, 64 ALUs, ~100 MFLOPS. Fastest machine in the world for massively parallel tasks. Cray-1 (1976) Regular processor, but using vector registers of 64x64 bits. Reached 240 MFLOPS. MMX (1997, P2) and SSE (1999, P3) Vector registers and instructions added to a regular x86 processor.

slide-12
SLIDE 12

Introduction

Advanced Graphics – SIMD Recap 12

SIMD / SSE

SSE was first introduced with the Pentium-3 processor in 1999, and adds a set of 128-bit registers, as well as instructions to operate on these registers. 32-bit: { char, char, char, char } = int 128-bit: { float, float, float, float } = __m128 { int, int, int, int } = __m128i Apart from storing 4 floats or ints, the registers can also store two 64- bit values, eight 16-bit values or sixteen 8-bit values.

slide-13
SLIDE 13

Introduction

Advanced Graphics – SIMD Recap 13

SIMD / SSE

Problems when working with 32-bit integers:

  • Streams are not separated (prone to overflow into next stream);
  • Limited to small unsigned integer values;
  • Hard to do multiplication / division.

Ideal situation:

  • Isolated streams
  • Support for more data types (char, short, uint, int, float, double)
  • An easy to use approach

SSE offers these benefits, except for one (guess which  ).

slide-14
SLIDE 14

Today’s Agenda:

  • Introduction
  • SIMD Concepts
  • C++ / SSE & AVX
  • C# / RyuJIT
slide-15
SLIDE 15

Concepts

Advanced Graphics – SIMD Recap 15

Streams

Consider the following scalar code: Vector3 D = Vector3.Normalize( T - P ); This is quite high-level. What the processor needs to do is:

Vector3 tmp = T – P; float length = sqrt( tmp.x * tmp.x + tmp.y * tmp.y + tmp.z * tmp.z ); D = tmp / length;

slide-16
SLIDE 16

Concepts

Advanced Graphics – SIMD Recap 16

Streams

Consider the following scalar code: Vector3 D = Vector3.Normalize( T - P ); This is quite high-level. What the processor needs to do is:

float tmp_x = T.x – P.x; float tmp_y = T.y – P.y; float tmp_z = T.z – P.z; float sqlen = tmp_x * tmp_x + tmp_y * tmp_y + tmp_z * tmp_z; float length = sqrt( sqlen ); D.x = tmp_x / length; D.y = tmp_y / length; D.z = tmp_z / length;

slide-17
SLIDE 17

Concepts

Advanced Graphics – SIMD Recap 17

Streams

Consider the following scalar code: Vector3 D = Vector3.Normalize( T - P ); Using vector instructions:

__m128 A = T – P float B = dot( A, A ) __m128 C = { B, B, B } __m128 D = A / C // 75% // 75% // 75%, overhead // 75%

slide-18
SLIDE 18

Concepts

Advanced Graphics – SIMD Recap 18

Streams

Consider the following scalar code: Vector3 D = Vector3.Normalize( T - P );

A = T.X – P.X B = T.Y – P.Y C = T.Z – P.Z D = A * A E = B * B F = C * C F += E F += D G = sqrt( F ) D.X = A / G D.Y = B / G D.Z = C / G A = T.X – P.X B = T.Y – P.Y C = T.Z – P.Z D = A * A E = B * B F = C * C F += E F += D G = sqrt( F ) D.X = A / G D.Y = B / G D.Z = C / G A = T.X – P.X B = T.Y – P.Y C = T.Z – P.Z D = A * A E = B * B F = C * C F += E F += D G = sqrt( F ) D.X = A / G D.Y = B / G D.Z = C / G A = T.X – P.X B = T.Y – P.Y C = T.Z – P.Z D = A * A E = B * B F = C * C F += E F += D G = sqrt( F ) D.X = A / G D.Y = B / G D.Z = C / G

0 1 2 3

slide-19
SLIDE 19

Concepts

Advanced Graphics – SIMD Recap 19

Streams

Optimal utilization of SIMD hardware is achieved when we run the same algorithm four times in parallel. This way, the approach also scales naturally to 8-wide, 16- wide and 32-wide SIMD.

slide-20
SLIDE 20

Concepts

Advanced Graphics – SIMD Recap 20

Streams – Data Organization

A1 = T1.X – P1.X B1 = T1.Y – P1.Y C1 = T1.Z – P1.Z D1 = A1 * A1 E1 = B1 * B1 F1 = C1 * C1 F1 += E1 F1 += D1 G1 = sqrt( F1 ) D1.X = A1 / G1 D1.Y = B1 / G1 D1.Z = C1 / G1 A2 = T2.X – P2.X B2 = T2.Y – P2.Y C2 = T2.Z – P2.Z D2 = A2 * A2 E2 = B2 * B2 F2 = C2 * C2 F2 += E2 F2 += D2 G2 = sqrt( F2 ) D2.X = A2 / G2 D2.Y = B2 / G2 D2.Z = C2 / G2 A3 = T3.X – P3.X B3 = T3.Y – P3.Y C3 = T3.Z – P3.Z D3 = A3 * A3 E3 = B3 * B3 F3 = C3 * C3 F3 += E3 F3 += D3 G3 = sqrt( F3 ) D3.X = A3 / G3 D3.Y = B3 / G3 D3.Z = C3 / G3 A4 = T4.X – P4.X B4 = T4.Y – P4.Y C4 = T4.Z – P4.Z D4 = A4 * A4 E4 = B4 * B4 F4 = C4 * C4 F4 += E4 F4 += D4 G4 = sqrt( F4 ) D4.X = A4 / G4 D4.Y = B4 / G4 D4.Z = C4 / G4

slide-21
SLIDE 21

Concepts

Advanced Graphics – SIMD Recap 21

Streams – Data Organization

A1 = TX1 – PX1 B1 = TY1 – PY1 C1 = TZ1 – PZ1 D1 = A1 * A1 E1 = B1 * B1 F1 = C1 * C1 F1 += E1 F1 += D1 G1 = sqrt( F1 ) DX1 = A1 / G1 DY1 = B1 / G1 DZ1 = C1 / G1

Input:

TX = { T1.x, T2.x, T3.x, T4.x }; PX = { P1.x, P2.x, P3.x, P4.x }; TY = { T1.y, T2.y, T3.y, T4.y }; PY = { P1.y, P2.y, P3.y, P4.y }; TZ = { T1.z, T2.z, T3.z, T4.z }; PZ = { P1.z, P2.z, P3.z, P4.z }; A2 = TX2 – PX2 B2 = TY2 – PY2 C2 = TZ2 – PZ2 D2 = A2 * A2 E2 = B2 * B2 F2 = C2 * C2 F2 += E2 F2 += D2 G2 = sqrt( F2 ) DX2 = A2 / G2 DY2 = B2 / G2 DZ2 = C2 / G2 A3 = TX3 – PX3 B3 = TY3 – PY3 C3 = TZ3 – PZ3 D3 = A3 * A3 E3 = B3 * B3 F3 = C3 * C3 F3 += E3 F3 += D3 G3 = sqrt( F3 ) DX3 = A3 / G3 DY3 = B3 / G3 DZ3 = C3 / G3 A4 = TX4 – PX4 B4 = TY4 – PY4 C4 = TZ4 – PZ4 D4 = A4 * A4 E4 = B4 * B4 F4 = C4 * C4 F4 += E4 F4 += D4 G4 = sqrt( F4 ) DX4 = A4 / G4 DY4 = B4 / G4 DZ4 = C4 / G4

slide-22
SLIDE 22

union { __m128 x4[128]; }; union { __m128 y4[128]; }; union { __m128 z4[128]; }; union { __m128i mass4[128]; };

Concepts

Advanced Graphics – SIMD Recap 22

Streams – Data Organization

Consider the following data structure: struct Particle { float x, y, z; int mass; }; Particle particle[512]; float x[512]; float y[512]; float z[512]; int mass[512];

Ao AoS SoA

slide-23
SLIDE 23

Concepts

Advanced Graphics – SIMD Recap 23

Streams – Ray Tracing

Leveraging SIMD for ray tracing:

  • 1. One ray, four primitives
  • 2. One ray, four nodes
  • 3. Four rays, one primitive / node

Option 3 is the least intrusive:

class Ray4 { public: __m128 ox4, oy4, oz4; __m128 dx4, dy4, dz4; __m128 t4; }; vec3 e1 = tri.V2 - tri.V1; vec3 e2 = tri.V3 - tri.V1; vec3 P = cross( D, e2 ); float det = dot( e1, P ); if (det > -EPS && det < EPS) return NOHIT; float inv_det = 1 / det; vec3 T = O - tri.V1; float u = dot( T, P ) * inv_det; if (u < 0 || u > 1) return NOHIT; vec3 Q = cross( T, e1 ); float v = dot( D, Q ) * inv_det; if (v < 0 || u + v > 1) return NOHIT; float t = dot( e2, Q ) * inv_det; if (t > EPSILON) { *out = t; return HIT; } return NOHIT;

slide-24
SLIDE 24

Concepts

Advanced Graphics – SIMD Recap 24

Streams – Flow Divergence

Like other instructions, comparisons between vectors yield a vector of booleans.

__m128 mask = _mm_cmpeq_ps( v1, v2 );

The mask contains a bitfield: 32 x β€˜1’ for each TRUE, 32 x β€˜0’ for each FALSE. The mask can be converted to a 4-bit integer using _mm_movemask_ps:

int result = _mm_movemask_ps( mask );

Now we can use regular conditionals:

if (result == 0) { /* false for all streams */ } if (result == 15) { /* true for all streams */ } if (result < 15) { /* not true for all streams */ } if (result > 0) { /* not false for all streams */ }

slide-25
SLIDE 25

Concepts

Advanced Graphics – SIMD Recap 25

Streams – Masking

More powerful than β€˜any’, β€˜all’ or β€˜none’ via movemask is masking.

if (det > -EPS && det < EPS) return NOHIT;

Translated to SSE:

__m128 mask1 = _mm_cmple_ps( det4, MINUSEPS4 ); __m128 mask2 = _mm_cmpge_ps( det4, EPSILON4 ); __m128 det4mask = _mm_or_ps( mask1, mask2 ); if (_mm_movemask_ps( det4mask ) == 0) return NOHIT; // all rays missed

Note that if only one ray survives, we continue executing the algorithm. A few lines later we have another check:

if (u < 0 || u > 1) return NOHIT;

slide-26
SLIDE 26

Concepts

Advanced Graphics – SIMD Recap 26

Streams – Masking

Like last time, we translate

if (u < 0 || u > 1) return NOHIT;

to

mask1 = _mm_cmpge_ps( u4, ZERO4 ); mask2 = _mm_cmple_ps( u4, ONE4 ); umask = _mm_and_ps( mask1, mask2 );

Some rays may have β€˜died’ in the previous conditional statement, so we include the mask produced by that condition:

combinedmask = _mm_and_ps( det4mask, umask ); if (_mm_movemask_ps( combinedmask ) == 0) return;

slide-27
SLIDE 27

Concepts

Advanced Graphics – SIMD Recap 27

Streams – Masking

Particularly interesting is the last conditional:

if (t > EPSILON) { *out = t; return HIT; }

For four rays, we only want to change the distance we return for those rays that are still β€˜alive’. For this, we use a blend operation:

__m128 t4_out = _mm_blend_ps( t4_in, t4, finalmask );

The beauty here is that, to the processor, this is not conditional code.

slide-28
SLIDE 28

Concepts

Advanced Graphics – SIMD Recap 28

Streams – Masking

An interesting question when working with masking is: should we unconditionally continue, even if the number of surviving rays might be zero? We are balancing the cost of executing conditional code against the cost of sometimes executing code for zero rays. In practice, you will have to try to see which is quickest.

slide-29
SLIDE 29

Concepts

Advanced Graphics – SIMD Recap 29

Streams – Summary

Practical use of SSE / AVX:

  • Translate your algorithm to a pure scalar flow (write out all vector operations).
  • Use vectors of four or eight elements (__m128, __m256).
  • Run the scalar flow four or eight times in parallel.
  • Reorganize data so that each line in the algorithm can fetch 128 or 256 consecutive bits.
  • Use 128-bit masks to store results of comparisons.
  • Convert these to useful integers using _mm_movemask_ps.
  • Continue the algorithm as long as at least one stream is alive; combine masks.
  • Use _mm_blendv_ps to overwrite some values in a __m128 register.

These concepts apply to SSE, AVX and SIMD in C#.

slide-30
SLIDE 30

Today’s Agenda:

  • Introduction
  • SIMD Concepts
  • C++ / SSE & AVX
  • C# / RyuJIT
slide-31
SLIDE 31

C++/SSE

Advanced Graphics – SIMD Recap 31

Basic SSE

Any PC since the Pentium 3 will support SSE (even Atom processors). It is safe to assume a system has at least SSE4. Basic operations:

__m128 a4 = _mm_set_ps( 1.0f, 2.0f, 3.14159f, 1.41f ); __m128 b4 = _mm_set_ps1( 2.0f ); // broadcast __m128 c4 = _mm_add_ps( a4, b4 ); __m128 d4 = _mm_div_ps( a4, b4 ); __m128 e4 = _mm_sqrt_ps( a4 );

slide-32
SLIDE 32

C++/SSE

Advanced Graphics – SIMD Recap 32

Basic SSE

Any PC since the Pentium 3 will support SSE (even Atom processors). It is safe to assume a system has at least SSE4. Example: normalizing four vectors:

__m128 x4 = _mm_set_ps( A.x, B.x, C.x, D.x ); __m128 y4 = _mm_set_ps( A.y, B.y, C.y, D.y ); __m128 z4 = _mm_set_ps( A.z, B.z, C.z, D.z ); __m128 sqX4 = _mm_mul_ps( x4, x4 ); __m128 sqY4 = _mm_mul_ps( y4, y4 ); __m128 sqZ4 = _mm_mul_ps( z4, z4 ); __m128 sqlen4 = _mm_add_ps( _mm_add_ps( sqX4, sqY4 ), sqZ4 ); __m128 len4 = _mm_sqrt_ps( sqlen4 ); x4 = _mm_div_ps( x4, len4 ); y4 = _mm_div_ps( y4, len4 ); z4 = _mm_div_ps( z4, len4 );

slide-33
SLIDE 33

C++/SSE

Advanced Graphics – SIMD Recap 33

Intermediate SSE

SSE includes powerful functions that prevent conditional code, as well as specialized arithmetic functions.

__m128 min4 = _mm_min_ps( a4, b4 ); __m128 max4 = _mm_max_ps( a4, b4 ); __m128 one_over_sq4 = _mm_rsqrt_ps( a4 ); // reciprocal square root __m128i int4 = _mm_cvtps_epi32( a4 ); // cast to integer __m128 f4 = _mm_cvtepi32_ps( int4 ); // cast to float

slide-34
SLIDE 34

C++/SSE

Advanced Graphics – SIMD Recap 34

Advanced SSE

Comparisons and masking.

__m128 mask4a = _mm_cmple_ps( a4, b4 ); // less or equal __m128 mask4b = _mm_cmpgt_ps( a4, b4 ); // greater than __m128 mask4c = _mm_cmpne_ps( a4, b4 ); // not equal __m128 mask4d = _mm_cmpeq_ps( a4, b4 ); // equal __m128 combined = _mm_and_ps( mask4a, mask4b ); __m128 inverted = _mm_andnot_ps( mask4a, mask4b ); __m128 either = _mm_or_ps( mask4a, mask4b ); __m128 blended = _mm_blendv_ps( a4, b4, mask4a );

A good source of additional information is MSDN: https://msdn.microsoft.com/en-us/library/bb892950(v=vs.90).aspx

slide-35
SLIDE 35

C++/SSE

Advanced Graphics – SIMD Recap 35

AVX

Recent CPUs support 8-wide SIMD through AVX. Simply replace __m128 with __m256, and add 256 to each function:

__m256 a8 = _mm256_set_ps1( 0 );

slide-36
SLIDE 36

C++/SSE

Advanced Graphics – SIMD Recap 36

Alignment

SSE and AVX data must be properly aligned: __m128 must be aligned to 16 bytes; __m256 must be aligned to 32 bytes. Visual Studio will do this for you for variables on the stack. When allocating buffers

  • f these values, make sure you use an aligned malloc/ free:

__m128* data = _aligned_malloc( 1024 * sizeof( __m128 ), 16 );

slide-37
SLIDE 37

C++/SSE

Advanced Graphics – SIMD Recap 37

Debugging

The Visual Studio debugger considers __m128 and __m256 to be basic types. In the debugger you can inspect them as arrays of floats, ints, shorts, bytes etc.

slide-38
SLIDE 38

Today’s Agenda:

  • Introduction
  • SIMD Concepts
  • C++ / SSE & AVX
  • C# / RyuJIT
slide-39
SLIDE 39

C#/RyuJIT

Advanced Graphics – SIMD Recap 39

RyuJIT

Needful things for Windows 7 / VS2013:

  • .NET 4.6
  • Nuget 2.8.6 (you may have to remove 2.8.5 first)
  • Install-Package System.Numerics.Vectors–Pre
  • x64

Note: many websites discuss older versions of RyuJIT. You do not need to set environment variables.

slide-40
SLIDE 40

C#/RyuJIT

Advanced Graphics – SIMD Recap 40

System.Numerics.Vectors

namespace System.Numerics { public struct Vector3 : IEquatable<Vector3>, IFormattable { public float X; public float Y; public float Z; public Vector3(float value); public Vector3(Vector2 value, float z); public Vector3(float x, float y, float z); public static Vector3 operator -(Vector3 value); public static Vector3 operator -(Vector3 left, Vector3 right); public static bool operator !=(Vector3 left, Vector3 right); public static Vector3 operator *(float left, Vector3 right); public static Vector3 operator *(Vector3 left, float right); public static Vector3 operator *(Vector3 left, Vector3 right); public static Vector3 operator /(Vector3 value1, float value2); public static Vector3 operator /(Vector3 left, Vector3 right); public static Vector3 operator +(Vector3 left, Vector3 right); public static bool operator ==(Vector3 left, Vector3 right);

slide-41
SLIDE 41

C#/RyuJIT

Advanced Graphics – SIMD Recap 41

class Ray4 { Vector<float> OX4, OY4, OZ4; Vector<float> DX4, DY4, DZ4; Vector<float> t4, u4, v4; Vector<float> NX4, NY4, NZ4; Vector<int> objIdx; } Ray4 ray;

SIMD Data

SoA

  • A

– Structure of Arrays

slide-42
SLIDE 42

C#/RyuJIT

Advanced Graphics – SIMD Recap 42

Vectorization

public Ray Generate( Random rng, int x, int y ) { float r0 = (float)rng.NextDouble(); float r1 = (float)rng.NextDouble(); float r2 = (float)rng.NextDouble() - 0.5f; float r3 = (float)rng.NextDouble() - 0.5f; // calculate sub-pixel ray target position on screen plane float u = ((float)x + r0) / (float)screenWidth; float v = ((float)y + r1) / (float)screenHeight; Vector3 T = p1 + u * (p2 - p1) + v * (p3 - p1); // calculate position on aperture Vector3 P = pos + lensSize * (r2 * right + r3 * up); // calculate ray direction Vector3 D = Vector3.Normalize( T - P ); // return new primary ray return new Ray( P, D, 1e34f ); }

slide-43
SLIDE 43

C#/RyuJIT

Advanced Graphics – SIMD Recap 43

Vectorization

public Ray4 Generate4( Random rng, int x, int y ) { ... }

slide-44
SLIDE 44

C#/RyuJIT

Advanced Graphics – SIMD Recap 44

Vectorization

public Ray4 Generate4( Random rng, int x, int y ) { // float r0 = (float)rng.NextDouble(); // float r1 = (float)rng.NextDouble(); // float r2 = (float)rng.NextDouble() - 0.5f; // float r3 = (float)rng.NextDouble() - 0.5f; float [] r0 = { (float)rng.NextDouble(), (float)rng.NextDouble(), (float)rng.NextDouble(), (float)rng.NextDouble() }; Vector<float> r0_4 = new Vector( r0 ); float [] r1 = { (float)rng.NextDouble(), (float)rng.NextDouble(), (float)rng.NextDouble(), (float)rng.NextDouble() }; Vector<float> r0_4 = new Vector( r1 ); float [] r2 = { (float)rng.NextDouble() - 0.5f, (float)rng.NextDouble() – 0.5f, (float)rng.NextDouble() - 0.5f, (float)rng.NextDouble() – 0.5f }; Vector<float> r0_4 = new Vector( r2 ); float [] r3 = { (float)rng.NextDouble() – 0.5f, (float)rng.NextDouble() – 0.5f, (float)rng.NextDouble() – 0.5f, (float)rng.NextDouble() – 0.5f }; Vector<float> r0_4 = new Vector( r3 ); ... }

slide-45
SLIDE 45

C#/RyuJIT

Advanced Graphics – SIMD Recap 45

Vectorization

public Ray4 Generate4( Random rng, int x, int y ) { ... // calculate sub-pixel ray target position on screen plane // float u = ((float)x + r0) / (float)screenWidth; // float v = ((float)y + r1) / (float)screenHeight; ... }

slide-46
SLIDE 46

C#/RyuJIT

Advanced Graphics – SIMD Recap 46

Vectorization

public Ray4 Generate4( Random rng, int x, int y ) { ... // calculate sub-pixel ray target position on screen plane // float u = ((float)x + r0) / (float)screenWidth; // float v = ((float)y + r1) / (float)screenHeight; float [] values = { x, x + 1, x + 2, x + 3 }; Vector<float> x4 = new Vector<float>( values ); Vector<float> y4 = new Vector<float>( y ); Vector<float> u4 = (x4 + r0_4) / screenWidth4; Vector<float> v4 = (y4 + r1_4) / screenHeight4; ... }

slide-47
SLIDE 47

C#/RyuJIT

Advanced Graphics – SIMD Recap 47

Vectorization

public Ray4 Generate4( Random rng, int x, int y ) { ... // Vector3 T = p1 + u * (p2 - p1) + v * (p3 - p1); ... }

slide-48
SLIDE 48

C#/RyuJIT

Advanced Graphics – SIMD Recap 48

Vectorization

public Ray4 Generate4( Random rng, int x, int y ) { ... // Vector3 T = p1 + u * (p2 - p1) + v * (p3 - p1); Vector<float> Tx4 = p1x4 + u4 * (p2x4 - p1x4) + v4 * (p3x4 - p1x4); ... }

slide-49
SLIDE 49

C#/RyuJIT

Advanced Graphics – SIMD Recap 49

Vectorization

public Ray4 Generate4( Random rng, int x, int y ) { ... // Vector3 T = p1 + u * (p2 - p1) + v * (p3 - p1); Vector<float> Tx4 = p1x4 + u4 * (p2x4 - p1x4) + v4 * (p3x4 - p1x4); Vector<float> Ty4 = p1y4 + u4 * (p2y4 - p1y4) + v4 * (p3y4 - p1y4); Vector<float> Tz4 = p1z4 + u4 * (p2z4 - p1z4) + v4 * (p3z4 - p1z4); ... }

slide-50
SLIDE 50

C#/RyuJIT

Advanced Graphics – SIMD Recap 50

Vectorization

public Ray4 Generate4( Random rng, int x, int y ) { ... // Vector3 P = pos + lensSize * (r2 * right + r3 * up); ... }

slide-51
SLIDE 51

C#/RyuJIT

Advanced Graphics – SIMD Recap 51

Vectorization

public Ray4 Generate4( Random rng, int x, int y ) { ... // Vector3 P = pos + lensSize * (r2 * right + r3 * up); Vector<float> Px4 = posx4 + lensSize4 * (r2_4 * rightx4 + r3_4 * upx4); Vector<float> Py4 = posy4 + lensSize4 * (r2_4 * righty4 + r3_4 * upy4); Vector<float> Pz4 = posz4 + lensSize4 * (r2_4 * rightz4 + r3_4 * upz4); ... }

slide-52
SLIDE 52

C#/RyuJIT

Advanced Graphics – SIMD Recap 52

Vectorization

public Ray4 Generate4( Random rng, int x, int y ) { ... Vector3.Normalize( T - P ); Vector<float> x4 = Tx4 - Px4; Vector<float> y4 = Ty4 - Py4; Vector<float> z4 = Tz4 - Pz4; Vector<float> len4 = Vector.SquareRoot<float>( x4 * x4 + y4 * y4 + z4 * z4 ); x4 /= len4; y4 /= len4; z4 /= len4; ... }

slide-53
SLIDE 53

C#/RyuJIT

Advanced Graphics – SIMD Recap 53

Vectorization

public Ray4 Generate4( Random rng, int x, int y ) { ... Ray4 r4 = new Ray4(); r4.Ox4 = Px4; r4.Oy4 = Py4; r4.Oz4 = Pz4; r4.Dx4 = x4; r4.Dy4 = y4; r4.Dz4 = z4; r4.t4 = new Vector<float>( 1e34f ); return r4; }

slide-54
SLIDE 54

C#/RyuJIT

Advanced Graphics – SIMD Recap 54

Vectorization

Digest:

  • Vectorization starts with identifying a scalar flow.
  • Using vector logic, four streams execute identical code.
  • Data layout must be adjusted for an efficient vector flow:

if we previously used pos.x, we now use pos.x4 (not pos.xyzw!), if we previously used constant PI, we now use { PI, PI, PI, PI }.

  • This is referred to as struct of arrays.

Theoretical improvement is 4x. However:

  • There is always some overhead to get data to the desired format.
  • The C# JIT compiler is far from optimal.
slide-55
SLIDE 55

Today’s Agenda:

  • Introduction
  • SIMD Concepts
  • C++ / SSE & AVX
  • C# / RyuJIT
slide-56
SLIDE 56

Practical

Advanced Graphics – SIMD Recap 56

Ray/AABB Intersection

Intersection of a ray and an AABB can be efficiently calculated using the slab test*:

  • 1. Calculate π‘’π‘›π‘—π‘œ,𝑒𝑛𝑏𝑦 using the intersections
  • f the ray with the horizontal planes;
  • 2. Update π‘’π‘›π‘—π‘œ,𝑒𝑛𝑏𝑦 using the intersections
  • f the ray with the vertical planes.

If π‘’π‘›π‘—π‘œ < 𝑒𝑛𝑏𝑦 and 𝑒𝑛𝑏𝑦 > 0, the ray intersects the AABB.

*: Kay and Kajiya, Ray tracing complex scenes. In: Proceedings of ACM SIGGRAPH 1986, pages 269–278.

slide-57
SLIDE 57

Practical

Advanced Graphics – SIMD Recap 57

Ray/AABB Intersection

Scalar code (3D):

bool intersection( box b, ray r ) { float tx1 = (b.min.x - r.O.x) * r.rD.x; float tx2 = (b.max.x - r.O.x) * r.rD.x; float tmin = min(tx1, tx2); float tmax = max(tx1, tx2); float ty1 = (b.min.y - r.O.y) * r.rD.y; float ty2 = (b.max.y - r.O.y) * r.rD.y; tmin = max(tmin, min(ty1, ty2)); tmax = min(tmax, max(ty1, ty2)); float tz1 = (b.min.z - r.O.z) * r.rD.z; float tz2 = (b.max.z - r.O.z) * r.rD.z; tmin = max(tmin, min(tz1, tz2)); tmax = min(tmax, max(tz1, tz2)); return tmax >= tmin && tmax >= 0; }

slide-58
SLIDE 58

Practical

Advanced Graphics – SIMD Recap 58

Ray/AABB Intersection

Vector code:

bool intersection( box b, ray r ) { __m128 t1 = _mm_mul_ps( _mm_sub_ps( node->bmin4, O4 ), rD4 ); __m128 t2 = _mm_mul_ps( _mm_sub_ps( node->bmax4, O4 ), rD4 ); __m128 vmax4 = _mm_max_ps( t1, t2 ), vmin4 = _mm_min_ps( t1, t2 ); float* vmax = (float*)&vmax4, *vmin = (float*)&vmin4; float tmax = min(vmax[0], min(vmax[1], vmax[2])); float tmin = max(vmin[0], max(vmin[1], vmin[2])); return tmax >= tmin && tmax >= 0; } struct BVHNode { AABB bounds; int leftFirst; int count; }; struct BVHNode { float3 bmin; int leftFirst; float3 bmax; int count; };

struct BVHNode { union { struct { float3 bmin; int leftFirst; }; __m128 bmin4; }; union { struct { float3 bmax; int count; }; }; __m128 bmax4; };

slide-59
SLIDE 59

Practical

Advanced Graphics – SIMD Recap 59

Ray/AABB Intersection

Vector code:

bool intersection( box b, ray r ) { __m128 t1 = _mm_mul_ps( _mm_sub_ps( node->bmin4, O4 ), rD4 ); __m128 t2 = _mm_mul_ps( _mm_sub_ps( node->bmax4, O4 ), rD4 ); __m128 vmax4 = _mm_max_ps( t1, t2 ), vmin4 = _mm_min_ps( t1, t2 ); float* vmax = (float*)&vmax4, *vmin = (float*)&vmin4; float tmax = min(vmax[0], min(vmax[1], vmax[2])); float tmin = max(vmin[0], max(vmin[1], vmin[2])); return tmax >= tmin && tmax >= 0; }

Check here for an even faster version: http://www.flipcode.com/archives/SSE_RayBox_Intersection_Test.shtml

slide-60
SLIDE 60

INFOMAGR – Advanced Graphics

Jacco Bikker - February – April 4016

END of β€œReal-time Ray Tracing”

next lecture: β€œLight Transport”