
INFOMAGR – Advanced Graphics
Jacco Bikker, February – April 2016
Welcome!
$I(x, x') = g(x, x') \left[ \epsilon(x, x') + \int_S \rho(x, x', x'') \, I(x', x'') \, dx'' \right]$

Today's Agenda:
- Introduction
- SIMD Concepts
- C++ / SSE & AVX
- C# / RyuJIT

Advanced Graphics – SIMD Recap
Introduction: S.I.M.D.
Single Instruction Multiple Data: applying the same instruction to several input elements at once. In other words: if we are going to apply the same sequence of instructions to a large input set, SIMD allows us to do this in parallel (and thus faster). SIMD is a form of data-level parallelism; it is sometimes loosely called instruction level parallelism, although that term usually describes a processor executing several independent instructions at the same time.

Introduction: Hardware – VLIW
Vector instructions:
    Vector4 a = { 1, PI, e, 4 };
    Vector4 b = { 4, 4, 4, 4 };
    Vector4 c = a * b;
Concept:
- function A4 consists of instructions that each operate on 4 items;
- executing A4 requires the same number of instructions as executing A on a single item;
- the throughput of A4 is therefore four times higher.
The '4' in the above is known as the vector width. Modern processors support 4-wide vectors (Pentium 3 and up), 8-wide (i3/i5/i7), 16-wide (Larrabee / Xeon Phi) and 32-wide (NVidia and AMD GPUs).
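As a rough illustration of the Vector4 example, the sketch below maps it onto SSE intrinsics. The Vector4 struct and the helpers make_vector4 and store are hypothetical names chosen for this sketch; only __m128, _mm_set_ps, _mm_mul_ps and _mm_storeu_ps come from the SSE header.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE: __m128, _mm_set_ps, _mm_mul_ps, _mm_storeu_ps

// Hypothetical stand-in for the slide's Vector4: four floats in one 128-bit register.
struct Vector4 { __m128 v; };

inline Vector4 make_vector4(float x, float y, float z, float w) {
    // _mm_set_ps lists its arguments from the highest lane down to lane 0.
    return Vector4{ _mm_set_ps(w, z, y, x) };
}

// One multiply instruction produces all four products.
inline Vector4 operator*(Vector4 a, Vector4 b) {
    return Vector4{ _mm_mul_ps(a.v, b.v) };
}

// Copy the four lanes to memory (no alignment requirement).
inline void store(Vector4 a, float out[4]) {
    assert(out != nullptr);
    _mm_storeu_ps(out, a.v);
}
```

Storing make_vector4( 1, 2, 3, 4 ) * make_vector4( 4, 4, 4, 4 ) leaves 4, 8, 12, 16 in the output array, computed by a single multiply.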

Introduction: SIMD Using Integers
A 32-bit integer stores 4 bytes:
    char a[] = { 1, 2, 3, 4 };
    uint a4 = (1 << 24) + (2 << 16) + (3 << 8) + 4;
In C++ we can directly exploit this:
    union { char a[4]; uint a4; };
    a4 = (1 << 24) + (2 << 16) + (3 << 8) + 4;
    a[0]++; a[1]++; a[2]++; a[3]++;   // four byte-sized increments
    a4 += 0x01010101;                 // the same four increments in one addition
Note that on little-endian x86, a[0] is the least significant byte of a4, so after the assignment above a[0] holds 4 and a[3] holds 1.
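The packed-byte idea can also be sketched without the union, using shifts only, which sidesteps both byte-order questions and the fact that reading an inactive union member is not guaranteed to work in standard C++. The function names pack4, byte_of and increment_all are hypothetical and chosen for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Pack four bytes into one 32-bit word, b3 in the most significant byte.
inline std::uint32_t pack4(std::uint8_t b3, std::uint8_t b2,
                           std::uint8_t b1, std::uint8_t b0) {
    return (std::uint32_t(b3) << 24) | (std::uint32_t(b2) << 16) |
           (std::uint32_t(b1) << 8)  |  std::uint32_t(b0);
}

// Extract byte i, counted from the least significant byte (i = 0).
inline std::uint8_t byte_of(std::uint32_t word, int i) {
    assert(i >= 0 && i < 4);
    return static_cast<std::uint8_t>(word >> (8 * i));
}

// One 32-bit addition increments all four packed bytes.
// Only valid while every byte is below 255; otherwise a byte carries into its neighbour.
inline std::uint32_t increment_all(std::uint32_t word) {
    return word + 0x01010101u;
}
```

With pack4( 1, 2, 3, 4 ) the word is 0x01020304, and increment_all turns it into 0x02030405.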

Introduction: SIMD Using Integers
A 32-bit integer stores 4 bytes:
    byte[] a = { 1, 2, 3, 4 };
    uint a4 = (1 << 24) + (2 << 16) + (3 << 8) + 4;
C# also allows this, although it is a bit of a hack:
    using System.Runtime.InteropServices;

    [StructLayout(LayoutKind.Explicit)]
    struct byte_array
    {
        [FieldOffset(0)] public byte a;
        [FieldOffset(1)] public byte b;
        [FieldOffset(2)] public byte c;
        [FieldOffset(3)] public byte d;
        [FieldOffset(0)] public uint abcd;   // overlaps a, b, c and d
    }

Introduction: uint = unsigned char[4]
Pinging google.com yields: 74.125.136.101
Each value is an unsigned 8-bit value (0..255). Combining them in one 32-bit integer:
    101 + 256 * 136 + 256 * 256 * 125 + 256 * 256 * 256 * 74 = 1249740901
Browse to: http://1249740901 (works!)
Evil use of this:
We can specify a user name when visiting a website, but any username will be accepted by google. Like this: http://advgr2016@google.com
Or: http://www.ing.nl@1249740901
Replace the IP address used here by your own site which contains a copy of the ing.nl site to obtain passwords, and send the link to a 'friend'.
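The octet arithmetic above can be checked with a few lines of C++. This is only a sketch; octets_to_u32 and u32_to_octets are hypothetical helper names, not part of any networking API.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Combine the octets a.b.c.d with the same powers of 256 as used above.
inline std::uint32_t octets_to_u32(std::uint8_t a, std::uint8_t b,
                                   std::uint8_t c, std::uint8_t d) {
    return d + 256u * c + 256u * 256u * b + 256u * 256u * 256u * a;
}

// Split a 32-bit value back into four octets, most significant first.
inline std::array<std::uint8_t, 4> u32_to_octets(std::uint32_t v) {
    return { std::uint8_t(v >> 24), std::uint8_t(v >> 16),
             std::uint8_t(v >> 8),  std::uint8_t(v) };
}
```

Calling octets_to_u32( 74, 125, 136, 101 ) gives 1249740901, and u32_to_octets recovers the four octets from that value.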

Introduction: Other Examples
Rapid string comparison. Byte by byte:
    char a[] = "optimization skills rule";
    char b[] = "optimization is so nice!";
    bool equal = true;
    int l = strlen( a );
    for ( int i = 0; i < l; i++ )
    {
        if (a[i] != b[i])
        {
            equal = false;
            break;
        }
    }
Four bytes at a time:
    char a[] = "optimization skills rule";
    char b[] = "optimization is so nice!";
    bool equal = true;
    int q = strlen( a ) / 4;
    for ( int i = 0; i < q; i++ )
    {
        if (((int*)a)[i] != ((int*)b)[i])
        {
            equal = false;
            break;
        }
    }
Both strings are 24 characters long, so the length divides evenly by 4; in general, the remaining 1 to 3 bytes need a separate check.
Likewise, we can copy byte arrays faster.
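A more defensive variant of the four-bytes-at-a-time comparison is sketched below. It swaps the pointer cast for std::memcpy into a local 32-bit value, which avoids alignment and aliasing problems and is typically compiled to a plain 32-bit load, and it handles lengths that are not a multiple of four. The function name equal_bytes is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Compare two buffers of n bytes, four bytes per iteration where possible.
inline bool equal_bytes(const char* a, const char* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        std::uint32_t wa, wb;
        std::memcpy(&wa, a + i, 4);   // safe unaligned 32-bit read
        std::memcpy(&wb, b + i, 4);
        if (wa != wb) return false;
    }
    for (; i < n; ++i) {              // remaining 0..3 bytes
        if (a[i] != b[i]) return false;
    }
    return true;
}
```

For the two strings above, equal_bytes( a, b, 24 ) returns false after comparing the fourth 32-bit word.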

Introduction: SIMD using 32-bit values - Limitations
Mapping four chars to an int value has a number of limitations:
    { 100, 100, 100, 100 } + { 1, 1, 1, 200 } = { 101, 101, 102, 44 }
    { 100, 100, 100, 100 } * { 2, 2, 2, 2 } = { … }
    { 100, 100, 100, 200 } * 2 = { 200, 200, 201, 144 }
In general:
- Streams are not separated (prone to overflow into next stream);
- Limited to small unsigned integer values;
- Hard to do multiplication / division.
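The overflow examples can be reproduced with plain 32-bit arithmetic. The sketch below also shows a well-known masking trick that keeps the four byte streams separate in software; that trick is not from the slides and is included only to show how much extra work isolation costs without hardware support. Both function names are hypothetical, and the rightmost value of each group is taken to be the least significant byte.

```cpp
#include <cassert>
#include <cstdint>

// Plain 32-bit addition of two words that each hold four packed bytes:
// a carry out of one byte spills into the byte above it.
inline std::uint32_t add_packed_naive(std::uint32_t a, std::uint32_t b) {
    return a + b;
}

// Per-byte wrapping addition: add the low 7 bits of every byte (this cannot
// carry across a byte boundary), then restore the top bit of each byte with XOR.
inline std::uint32_t add_packed_isolated(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t low7 = 0x7f7f7f7fu;
    const std::uint32_t top  = 0x80808080u;
    return ((a & low7) + (b & low7)) ^ ((a ^ b) & top);
}
```

With { 100, 100, 100, 100 } as 0x64646464 and { 1, 1, 1, 200 } as 0x010101C8, the naive sum is 0x6565662C, i.e. { 101, 101, 102, 44 }, while the isolated version yields 0x6565652C, i.e. { 101, 101, 101, 44 }.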

Introduction: SIMD using 32-bit values - Limitations
Ideally, we would like to see:
- Isolated streams
- Support for more data types (char, short, uint, int, float, double)
- An easy to use approach
Meet SSE!

Introduction: Vector Processors - Early systems
The Solomon project (1960): One CPU feeding a number of ALUs with the same instruction, but different data. This way, a single algorithm is executed in parallel for a large dataset.
ILLIAC IV (1962): Design: 1 GFLOPS using 256 ALUs. Actual implementation: 1974, 64 ALUs, ~100 MFLOPS. Fastest machine in the world for massively parallel tasks.
Cray-1 (1976): Regular processor, but using vector registers of 64x64 bits. Reached 240 MFLOPS.
MMX (1997, P2) and SSE (1999, P3): Vector registers and instructions added to a regular x86 processor.

Introduction: SIMD / SSE
SSE was first introduced with the Pentium-3 processor in 1999, and adds a set of 128-bit registers, as well as instructions to operate on these registers.
    32-bit:  { char, char, char, char }     = int
    128-bit: { float, float, float, float } = __m128
             { int, int, int, int }         = __m128i
Apart from storing 4 floats or ints, the registers can also store two 64-bit values, eight 16-bit values or sixteen 8-bit values (integer and double-precision operations on these registers were added with SSE2).

Introduction: SIMD / SSE
Problems when working with 32-bit integers:
- Streams are not separated (prone to overflow into next stream);
- Limited to small unsigned integer values;
- Hard to do multiplication / division.
Ideal situation:
- Isolated streams
- Support for more data types (char, short, uint, int, float, double)
- An easy to use approach
SSE offers these benefits, except for one (guess which).
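To make "isolated streams" concrete, here is a minimal sketch using two SSE2 integer intrinsics, _mm_add_epi8 (wrapping) and _mm_adds_epu8 (saturating). The wrapper names add_bytes16 and add_bytes16_saturating are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2: integer operations on 128-bit registers

// Add sixteen 8-bit lanes at once; each lane wraps around on its own.
inline void add_bytes16(const std::uint8_t a[16], const std::uint8_t b[16],
                        std::uint8_t out[16]) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi8(va, vb));
}

// Same, but each unsigned lane clamps at 255 instead of wrapping.
inline void add_bytes16_saturating(const std::uint8_t a[16], const std::uint8_t b[16],
                                   std::uint8_t out[16]) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_adds_epu8(va, vb));
}
```

Adding 200 to a lane holding 100 gives 44 (or 255 with saturation) in that lane only; the neighbouring lanes are unaffected, unlike in the packed 32-bit integer case.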

Today's Agenda:
- Introduction
- SIMD Concepts
- C++ / SSE & AVX
- C# / RyuJIT

Concepts: Streams
Consider the following scalar code:
    Vector3 D = Vector3.Normalize( T - P );
This is quite high-level. What the processor needs to do is:
    Vector3 tmp = T - P;
    float length = sqrt( tmp.x * tmp.x + tmp.y * tmp.y + tmp.z * tmp.z );
    D = tmp / length;

Concepts: Streams
Consider the following scalar code:
    Vector3 D = Vector3.Normalize( T - P );
This is quite high-level. Written out per component, what the processor needs to do is:
    float tmp_x = T.x - P.x;
    float tmp_y = T.y - P.y;
    float tmp_z = T.z - P.z;
    float sqlen = tmp_x * tmp_x + tmp_y * tmp_y + tmp_z * tmp_z;
    float length = sqrt( sqlen );
    D.x = tmp_x / length;
    D.y = tmp_y / length;
    D.z = tmp_z / length;

Concepts: Streams
Consider the following scalar code:
    Vector3 D = Vector3.Normalize( T - P );
Using vector instructions:
    __m128 A = T - P                  // 75%
    float B = sqrt( dot( A, A ) )     // 75%
    __m128 C = { B, B, B }            // 75%, overhead
    __m128 D = A / C                  // 75%
The percentages indicate lane utilization: a 3-component vector occupies only three of the four lanes, and broadcasting B is pure overhead.
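One way this single-vector approach might look with real intrinsics is sketched below, restricted to original SSE instructions. It assumes both inputs keep x, y, z in lanes 0 to 2 and a zero in lane 3, and it includes the square root that a normalization requires. The function name normalize3 and the shuffle-based horizontal sum are choices made for this sketch, not something the slides prescribe.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE

// Normalize (T - P) for one 3D vector stored in lanes 0..2 (lane 3 must be 0).
inline __m128 normalize3(__m128 t, __m128 p) {
    __m128 a  = _mm_sub_ps(t, p);   // A = T - P, only 3 of 4 lanes are useful
    __m128 sq = _mm_mul_ps(a, a);   // x*x, y*y, z*z, 0

    // Horizontal sum (the dot product): swap neighbouring lanes and add,
    // then swap the two halves and add. Every lane ends up with the total.
    __m128 s1 = _mm_add_ps(sq, _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(2, 3, 0, 1)));
    __m128 s2 = _mm_add_ps(s1, _mm_shuffle_ps(s1, s1, _MM_SHUFFLE(1, 0, 3, 2)));

    __m128 len = _mm_sqrt_ps(s2);   // length, already broadcast to all lanes
    return _mm_div_ps(a, len);      // D = A / length
}
```

The two shuffles and two extra additions exist only to combine lanes of the same register, which is the overhead a per-stream layout avoids.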

Concepts: Streams
Consider the following scalar code:
    Vector3 D = Vector3.Normalize( T - P );
The same scalar instruction sequence, executed side by side in four streams (0, 1, 2, 3):
    A = T.X - P.X
    B = T.Y - P.Y
    C = T.Z - P.Z
    D = A * A
    E = B * B
    F = C * C
    F += E
    F += D
    G = sqrt( F )
    D.X = A / G
    D.Y = B / G
    D.Z = C / G

Concepts: Streams
Optimal utilization of SIMD hardware is achieved when we run the same algorithm four times in parallel. This way, the approach also scales naturally to 8-wide, 16-wide and 32-wide SIMD.

Concepts: Streams – Data Organization
    A1 = T1.X - P1.X    A2 = T2.X - P2.X    A3 = T3.X - P3.X    A4 = T4.X - P4.X
    B1 = T1.Y - P1.Y    B2 = T2.Y - P2.Y    B3 = T3.Y - P3.Y    B4 = T4.Y - P4.Y
    C1 = T1.Z - P1.Z    C2 = T2.Z - P2.Z    C3 = T3.Z - P3.Z    C4 = T4.Z - P4.Z
    D1 = A1 * A1        D2 = A2 * A2        D3 = A3 * A3        D4 = A4 * A4
    E1 = B1 * B1        E2 = B2 * B2        E3 = B3 * B3        E4 = B4 * B4
    F1 = C1 * C1        F2 = C2 * C2        F3 = C3 * C3        F4 = C4 * C4
    F1 += E1            F2 += E2            F3 += E3            F4 += E4
    F1 += D1            F2 += D2            F3 += D3            F4 += D4
    G1 = sqrt( F1 )     G2 = sqrt( F2 )     G3 = sqrt( F3 )     G4 = sqrt( F4 )
    D1.X = A1 / G1      D2.X = A2 / G2      D3.X = A3 / G3      D4.X = A4 / G4
    D1.Y = B1 / G1      D2.Y = B2 / G2      D3.Y = B3 / G3      D4.Y = B4 / G4
    D1.Z = C1 / G1      D2.Z = C2 / G2      D3.Z = C3 / G3      D4.Z = C4 / G4
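The four-stream layout above maps almost line for line onto SSE when the data is stored component-wise, with one register for four X values, one for four Y values and one for four Z values. The sketch below assumes that layout; the struct Vec3x4 and the function normalize4 are hypothetical names.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE

// Four 3D vectors stored per component: lane i of x, y, z belongs to vector i.
struct Vec3x4 { __m128 x, y, z; };

// Normalize (T - P) for four vector pairs at once; every lane does useful work.
inline Vec3x4 normalize4(const Vec3x4& t, const Vec3x4& p) {
    __m128 a = _mm_sub_ps(t.x, p.x);                 // A1..A4
    __m128 b = _mm_sub_ps(t.y, p.y);                 // B1..B4
    __m128 c = _mm_sub_ps(t.z, p.z);                 // C1..C4
    __m128 f = _mm_add_ps(_mm_add_ps(_mm_mul_ps(a, a), _mm_mul_ps(b, b)),
                          _mm_mul_ps(c, c));         // squared lengths
    __m128 g = _mm_sqrt_ps(f);                       // G1..G4
    return Vec3x4{ _mm_div_ps(a, g), _mm_div_ps(b, g), _mm_div_ps(c, g) };
}
```

No shuffles or broadcasts are needed here, because each lane only ever combines values that belong to the same stream.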
