/INFOMOV/ Optimization & Vectorization
- J. Bikker - Sep-Nov 2019 - Lecture 13: “Snippets”
Welcome! Today's Agenda:
▪ Self-modifying code ▪ Multi-threading (1) ▪ Multi-threading (2) ▪ Experiments
Fast Polygons on Limited Hardware
Typical span rendering code:
for( int i = 0; i < len; i++ )
{
    *a++ = texture[u, v];
    u += du; v += dv;
}
How do we make this faster? Every cycle counts…
▪ Loop unrolling
▪ Two pixels at a time
▪ …
INFOMOV – Lecture 13 – “Snippets” 3
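The "two pixels at a time" idea halves the loop overhead (one compare and branch per two pixels instead of per pixel). A minimal sketch, assuming 8.8 fixed-point u/v and a flat 16x16 texel array; the function name and layout are illustrative, not the slide's actual renderer:

```cpp
#include <cstdint>

// Hypothetical span renderer: 'texture' is a flat 16x16 texel array,
// u and v are 8.8 fixed-point coordinates.
void render_span( uint8_t* a, const uint8_t* texture, int len,
                  int u, int v, int du, int dv )
{
    // two pixels per iteration: one loop test/branch per two texels
    while (len >= 2)
    {
        a[0] = texture[((v >> 8) & 15) * 16 + ((u >> 8) & 15)];
        u += du; v += dv;
        a[1] = texture[((v >> 8) & 15) * 16 + ((u >> 8) & 15)];
        u += du; v += dv;
        a += 2; len -= 2;
    }
    if (len) // odd-length span: one leftover pixel
        *a = texture[((v >> 8) & 15) * 16 + ((u >> 8) & 15)];
}
```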
Fast Polygons on Limited Hardware
How about…
switch (len)
{
case 8: *a++ = tex[u,v]; u += du; v += dv;
case 7: *a++ = tex[u,v]; u += du; v += dv;
case 6: *a++ = tex[u,v]; u += du; v += dv;
case 5: *a++ = tex[u,v]; u += du; v += dv;
case 4: *a++ = tex[u,v]; u += du; v += dv;
case 3: *a++ = tex[u,v]; u += du; v += dv;
case 2: *a++ = tex[u,v]; u += du; v += dv;
case 1: *a++ = tex[u,v]; u += du; v += dv;
}
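The trick relies on C's case fallthrough: the switch is a computed jump into the middle of an unrolled loop, after which execution falls through the remaining cases. A compilable sketch of the same idea for a plain byte copy (names are illustrative; handles len in 1..8):

```cpp
#include <cstdint>

// Computed jump into an unrolled copy body: jump to the right entry
// point for 'len', then fall through the remaining cases.
void copy_up_to_8( uint8_t* a, const uint8_t* src, int len )
{
    switch (len)
    {
    case 8: *a++ = *src++; // fallthrough
    case 7: *a++ = *src++; // fallthrough
    case 6: *a++ = *src++; // fallthrough
    case 5: *a++ = *src++; // fallthrough
    case 4: *a++ = *src++; // fallthrough
    case 3: *a++ = *src++; // fallthrough
    case 2: *a++ = *src++; // fallthrough
    case 1: *a++ = *src++;
    }
}
```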
Fast Polygons on Limited Hardware
What if a massive unroll isn’t an option, but we have only 4 registers?
for( int i = 0; i < len; i++ )
{
    *a++ = texture[u, v];
    u += du; v += dv;
}
Registers: { i, a, u, v, du, dv, len }. Idea: just before entering the loop,
▪ replace ‘len’ in the code by the correct constant;
▪ replace du and dv by the correct constants.
Our code is now self-modifying.
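On a modern OS, actually patching code requires making the page writable (VirtualProtect/mprotect) and fighting the pipeline and I-cache listed on the next slide. A portable stand-in for "bake len into the code", sketched here rather than the slide's real SMC: instantiate the loop once per compile-time length, so the compiler sees the trip count as a constant, and "patching" becomes picking an entry from a table (names and the 16x16 texture layout are assumptions):

```cpp
#include <cstdint>

// One instantiation per compile-time length: the compiler unrolls the
// loop and spends no register on the counter.
template <int LEN>
void span( uint8_t* a, const uint8_t* tex, int u, int v, int du, int dv )
{
    for (int i = 0; i < LEN; i++) // trip count is a constant
    {
        *a++ = tex[((v >> 8) & 15) * 16 + ((u >> 8) & 15)];
        u += du; v += dv;
    }
}

using SpanFn = void (*)( uint8_t*, const uint8_t*, int, int, int, int );

// 'Patching the constant into the code' becomes selecting the right
// pre-built instantiation.
static const SpanFn spanTable[9] = {
    nullptr, span<1>, span<2>, span<3>, span<4>,
    span<5>, span<6>, span<7>, span<8>
};
```

Usage: `spanTable[len]( a, tex, u, v, du, dv );` for len in 1..8.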
Self-modifying Code
Good reasons for not writing SMC:
▪ the CPU pipeline (mind every potential (future) target)
▪ L1 instruction cache (handles reads only)
▪ code readability
Good reasons for writing SMC:
▪ code readability
▪ genetic code optimization
Hardware Evolution*
Experiment:
▪ take 100 FPGAs, load them with random ‘programs’ of at most 100 logic gates
▪ test each chip’s ability to differentiate between two audio tones
▪ use the best candidates to produce the next generation.
Outcome (generation 4000): one chip capable of the intended task.
Observations:
NASA’s evolved antenna**
*: On the Origin of Circuits, Alan Bellows, 2007, https://www.damninteresting.com/on-the-origin-of-circuits
**: Evolved antenna, Wikipedia.
Compiler Flags*
Experiment: “…we propose a genetic algorithm to determine the combination of flags, that could be used, to generate efficient executable in terms […] compiler flags that can be used to compile a program and the best chromosome corresponding to the best combination of flags is derived over generations, based on the time taken to compile and execute, as the fitness function.”
*: Compiler Optimization: A Genetic Algorithm Approach, P. A. Ballal et al., 2015.
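The chromosome-of-flags idea fits in a few lines: one bit per flag, mutate and select over generations. A toy sketch, where a synthetic fitness function stands in for "compile with these flags and time the executable"; the flag count, effect values, and GA parameters are all invented for illustration:

```cpp
#include <algorithm>
#include <bitset>
#include <random>
#include <vector>

constexpr int NUM_FLAGS = 8;          // e.g. -O2, -funroll-loops, ...
using Chromosome = std::bitset<NUM_FLAGS>;

// Synthetic fitness: pretend each flag shifts runtime by a fixed amount.
// A real implementation would compile and time the program instead.
double fitness( const Chromosome& c )
{
    static const double effect[NUM_FLAGS] =
        { -3.0, -1.5, 0.5, -0.2, 1.0, -0.8, 0.3, -0.1 };
    double t = 10.0;                  // baseline 'runtime'
    for (int i = 0; i < NUM_FLAGS; i++) if (c[i]) t += effect[i];
    return t;                         // lower is better
}

Chromosome evolve( int generations, unsigned seed )
{
    std::mt19937 rng( seed );
    std::vector<Chromosome> pop( 16 );
    for (auto& c : pop) c = Chromosome( rng() & 0xFF );
    for (int g = 0; g < generations; g++)
    {
        // select: keep the fitter half of the population
        std::sort( pop.begin(), pop.end(),
            []( const Chromosome& a, const Chromosome& b )
            { return fitness( a ) < fitness( b ); } );
        // reproduce: copy a surviving parent and flip one random flag
        for (size_t i = 8; i < pop.size(); i++)
        {
            pop[i] = pop[rng() % 8];
            pop[i].flip( rng() % NUM_FLAGS );
        }
    }
    return pop[0]; // best survivor
}
```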
▪ Self-modifying code ▪ Multi-threading (1) ▪ Multi-threading (2) ▪ Experiments
A Brief History of Many Cores
Once upon a time... Then, in 2005: Intel’s Core 2 Duo (April 22). (Also 2005: AMD Athlon 64 X2, April 21.)
2007: Intel Core 2 Quad
2010: AMD Phenom II X6
2017: Threadripper 1950X (16 cores, 32 threads)
2018: Threadripper 2950X
2019: Epyc 7742 (64 cores, 128 threads, $6,950)
Threads / Scalability
(figure: thread scalability graph)
Optimizing for Multiple Cores
What we did before:
Goal:
▪ It’s fast enough when it scales linearly with the number of cores.
▪ It’s fast enough when the parallelizable code scales linearly with the number of cores.
▪ It’s fast enough if there is no sequential code.
Hardware Review
We have:
▪ Four physical cores
▪ Each running two threads
▪ L1 cache: 32 KB, 4 cycles latency
▪ L2 cache: 256 KB, 10 cycles latency
▪ A large shared L3 cache.
Observation: if our code solely requires data from L1 and L2, this processor should do work split over four threads exactly four times faster. (Is that true? Any conditions?)
(diagram: four cores, each running threads T0/T1 with private L1 I-$, L1 D-$ and L2 $, sharing one L3 $)
▪ Work must stay on core ▪ No I/O, sleep ▪ …
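A minimal sketch of splitting an embarrassingly parallel job over the four cores with std::thread (function and variable names are assumptions, not course code). If each thread's slice fits in its core's L1/L2, the speedup should approach the thread count, provided the threads stay on their cores and never block on I/O or sleep:

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// Split an independent summation over 'nthreads' worker threads.
uint64_t parallel_sum( const std::vector<uint32_t>& data, int nthreads )
{
    std::vector<uint64_t> partial( nthreads, 0 );
    std::vector<std::thread> pool;
    const size_t chunk = data.size() / nthreads;
    for (int t = 0; t < nthreads; t++)
    {
        size_t lo = t * chunk;
        size_t hi = (t == nthreads - 1) ? data.size() : lo + chunk;
        pool.emplace_back( [&, t, lo, hi] {
            uint64_t s = 0;                  // thread-local accumulator
            for (size_t i = lo; i < hi; i++) s += data[i];
            partial[t] = s;                  // one shared write per thread
        } );
    }
    for (auto& th : pool) th.join();
    uint64_t total = 0;
    for (uint64_t p : partial) total += p;
    return total;
}
```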
Simultaneous Multi-Threading (SMT)
(Also known as hyperthreading.) Pipelines grow wider and deeper:
▪ Wider: to execute multiple instructions in parallel in a single cycle.
▪ Deeper: to reduce the complexity of each pipeline stage, which allows for a higher frequency.
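A wide pipeline only pays off when consecutive instructions are independent. A sketch of the difference (names are illustrative): summing into one accumulator forms a single serial dependency chain, while four accumulators give the superscalar core four independent chains to execute side by side; both return the same result.

```cpp
#include <cstddef>
#include <cstdint>

// One accumulator: every add depends on the previous add, so the adds
// serialize no matter how wide the pipeline is.
uint64_t sum_serial( const uint32_t* v, size_t n )
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++) s += v[i];
    return s;
}

// Four accumulators: four independent dependency chains expose
// instruction-level parallelism. Same result, better ILP.
uint64_t sum_ilp( const uint32_t* v, size_t n )
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        s0 += v[i]; s1 += v[i + 1]; s2 += v[i + 2]; s3 += v[i + 3];
    }
    for (; i < n; i++) s0 += v[i]; // leftovers
    return s0 + s1 + s2 + s3;
}
```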
(diagram: pipeline execution slots E filling up over time t)
Superscalar Pipeline
fldz
xor ecx, ecx
fld dword ptr [4520h]
mov edx, 28929227h
fld dword ptr [452Ch]
push esi
mov esi, 0C350h
add ecx, edx
mov eax, 91D2A969h
xor edx, 17737352h
shr ecx, 1
mul eax, edx
fld st(1)
faddp st(3), st
mov eax, 91D2A969h
shr edx, 0Eh
add ecx, edx
fmul st(1), st
xor edx, 17737352h
shr ecx, 1
mul eax, edx
shr edx, 0Eh
dec esi
jne tobetimed+1Fh
Superscalar Pipeline
Nehalem (i7): six wide.
▪ Three memory operations
▪ Three calculations (float, int, vector)
(diagram: execution units 1-3 MEM and 4-6 CALC consuming the instruction stream above over time t)
Simultaneous Multi-Threading (SMT)
(Also known as hyperthreading.) Pipelines grow wider and deeper:
▪ Wider, to execute multiple instructions in parallel in a single cycle.
▪ Deeper, to reduce the complexity of each pipeline stage, which allows for a higher frequency.
However, parallel instructions must be independent.
Observation: two threads provide twice as many independent instructions. (Is that true? Any conditions?)
▪ No dependencies between the threads ▪ …
Simultaneous Multi-Threading (SMT)
Nehalem (i7) pipeline: six wide*. ▪ Three memory operations ▪ Three calculations (float, int, vector) SMT: feeding the pipe from two threads. All it really takes is an extra set of registers.
*: Details: The Architecture of the Nehalem Processor and Nehalem-EP SMP Platforms, Thomadakis, 2011.
(diagram: execution units 1-3 MEM and 4-6 CALC, now fed with the interleaved instruction streams of two threads over time t)
Simultaneous Multi-Threading (SMT)
Hyperthreading does mean that two threads now share the same L1 and L2 cache.
▪ In the average case, this will reduce data locality.
▪ If both threads use the same data, data locality remains the same.
▪ One thread can also be used to fetch data that the other thread will need*.
*: Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors, Luk, 2001.
Multiple Processors: NUMA
Two physical processors on a single mainboard:
▪ Each CPU has its own memory
▪ Each CPU can also access the other CPU’s memory
The penalty for accessing ‘foreign’ memory is ~50%.
Multiple Processors: NUMA
Do we care?
▪ Most boards host one CPU.
▪ A quad-core still talks to memory via a single interface.
However: Threadripper is a NUMA device. Threadripper = 2x Zeppelin, where each Zeppelin has:
▪ L1, L2, L3 cache
▪ A link to memory
This CPU behaves as two CPUs in a single socket.
Multiple Processors: NUMA
Threadripper & Windows: ▪ Threadripper hides NUMA from the OS ▪ Most software is not NUMA-aware.
Details: https://www.extremetech.com/computing/283114-new-utility-can-double-amd-threadripper-2990wx-performance https://blog.michael.kuron-germany.de/2018/09/amd-ryzen-threadripper-numa-architecture-cpu-affinity-and-htcondor
▪ Self-modifying code ▪ Multi-threading (1) ▪ Multi-threading (2) ▪ Experiments
Windows
DWORD WINAPI myThread( LPVOID lpParameter )
{
    unsigned int& myCounter = *((unsigned int*)lpParameter);
    while (myCounter < 0xFFFFFFFF) ++myCounter;
    return 0;
}
int main( int argc, char* argv[] )
{
    using namespace std;
    unsigned int myCounter = 0;
    DWORD myThreadID;
    HANDLE myHandle = CreateThread( 0, 0, myThread, &myCounter, 0, &myThreadID );
    char myChar = ' ';
    while (myChar != 'q')
    {
        cout << myCounter << endl;
        myChar = getchar();
    }
    CloseHandle( myHandle );
    return 0;
}
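For comparison, a portable sketch of the same pattern with std::thread (the function name is mine, not the slide's). Note that the slide's plain `unsigned int` counter is shared by two threads without synchronization, which is a data race; std::atomic makes the sharing well-defined:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// A worker increments a shared counter until it reaches 'limit'.
// std::atomic replaces the slide's plain unsigned int.
void count_a_while( std::atomic<uint32_t>& counter, uint32_t limit )
{
    std::thread worker( [&] {
        while (counter.load( std::memory_order_relaxed ) < limit)
            counter.fetch_add( 1, std::memory_order_relaxed );
    } );
    worker.join(); // the slide instead polls the counter until 'q' is typed
}
```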
Boost
#include <boost/thread.hpp>
#include <boost/chrono.hpp>
#include <iostream>

void wait( int seconds )
{
    boost::this_thread::sleep_for( boost::chrono::seconds{ seconds } );
}
void thread()
{
    for (int i = 0; i < 5; ++i) { wait( 1 ); std::cout << i << '\n'; }
}
int main()
{
    boost::thread t{ thread };
    t.join();
}
OpenMP
#pragma omp parallel for
for( int n = 0; n < 10; ++n ) printf( " %d", n );
printf( ".\n" );

float a[8], b[8];
#pragma omp simd
for( int n = 0; n < 8; ++n ) a[n] += b[n];

struct node { node *left, *right; };
extern void process( node* );
void postorder_traverse( node* p )
{
    if (p->left)
        #pragma omp task
        postorder_traverse( p->left );
    if (p->right)
        #pragma omp task
        postorder_traverse( p->right );
    #pragma omp taskwait
    process( p );
}
Intel TBB
#include "tbb/task_group.h"
using namespace tbb;

int Fib( int n )
{
    if (n < 2) return n; else
    {
        int x, y;
        task_group g;
        g.run( [&]{ x = Fib( n - 1 ); } ); // spawn a task
        g.run( [&]{ y = Fib( n - 2 ); } ); // spawn another task
        g.wait();                          // wait for both tasks to complete
        return x + y;
    }
}
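The same fork-join pattern can be expressed in standard C++ with std::async (function name changed to mark it as a sketch). Unlike TBB's work-stealing task scheduler, std::async may create a real OS thread per task, so this is illustrative rather than fast:

```cpp
#include <future>

// The task_group pattern from the slide, in standard C++: spawn one half
// of the recursion as a task, do the other half on this thread, then join.
int fib_async( int n )
{
    if (n < 2) return n;
    auto x = std::async( std::launch::async, fib_async, n - 1 ); // spawn a task
    int y = fib_async( n - 2 );  // compute the other half ourselves
    return x.get() + y;          // wait for the spawned task
}
```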
Considerations
When using external tools to manage your threads, ask yourself:
▪ What is the overhead of creating / destroying a thread?
▪ Do I even know when threads are created?
▪ Do I know on which cores threads execute?
What if… we handled everything ourselves?
(diagram: eight worker threads pulling from a shared task list)
▪ Worker threads never die
▪ Worker threads are pinned to a core
▪ Tasks are claimed by worker threads
▪ Execution of a task may depend on completion of other tasks
▪ Tasks can produce new tasks
Fibers:
▪ Light-weight threads with a complete state: registers (incl. program counter) and a stack
▪ Available in Windows, PS4, …
▪ Allow the task system to suspend a job, e.g. to wait for scheduled sub-tasks
Sub-tasks:
▪ Decrement a counter when done
▪ When the counter reaches zero, the linked task is resumed.
Fibers:
▪ “Cooperative multithreading”, no preemption
Fibers on Windows: https://docs.microsoft.com/en-us/windows/win32/procthread/fibers
ConvertThreadToFiber
CreateFiber
SwitchToFiber
Cross-platform fibers: https://github.com/JarkkoPFC/fiber
Multithreading & Performance
▪ SMT / Hyperthreading: sharing L1 & L2 cache
  ▪ Problems similar to simply having more threads
  ▪ However, without the extra threads we don’t benefit from SMT
  ▪ Mitigate: have the threads work on the same data
▪ Multiple cores
  ▪ Threads may travel from one core to the next (mind the caches)
  ▪ Must share bandwidth
  ▪ Mind false sharing
▪ NUMA
  ▪ Thread assignment now depends on what memory is used
  ▪ No longer a theoretical issue
▪ Libraries
  ▪ Generally favor ease of use over performance
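"Mind false sharing": per-thread counters packed into one array can land in the same 64-byte cache line, so the cores keep stealing the line from each other even though no data is logically shared. A sketch of the standard fix, padding each counter to its own line (struct and function names are mine); correctness is identical either way, only the line ping-pong disappears:

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// Each counter occupies its own 64-byte cache line, so threads writing
// their own counter no longer invalidate each other's line.
struct alignas(64) PaddedCounter { uint64_t value = 0; };

void count_in_parallel( std::vector<PaddedCounter>& counters, uint64_t iters )
{
    std::vector<std::thread> pool;
    for (size_t t = 0; t < counters.size(); t++)
        pool.emplace_back( [&counters, t, iters] {
            for (uint64_t i = 0; i < iters; i++) counters[t].value++;
        } );
    for (auto& th : pool) th.join();
}
```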
▪ Self-modifying code ▪ Multi-threading (1) ▪ Multi-threading (2) ▪ Experiments
Trust No One
How fast does OpenMP make an ‘embarrassingly parallel’ application?
void Game::Tick( float deltaTime )
{
    // draw all tiles
    static int xtiles = SCRWIDTH / TILESIZE, ytiles = SCRHEIGHT / TILESIZE;
    static int tileCount = xtiles * ytiles;
    for( int i = 0; i < tileCount; i++ )
    {
        int tx = i % xtiles;
        int ty = i / xtiles;
        drawtile( screen, tx * TILESIZE, ty * TILESIZE );
    }
}
// #pragma omp parallel for
Trust No One
How fast does OpenMP make an ‘embarrassingly parallel’ application? Can we do better?
Worker Threads
static DWORD threadId[THREADCOUNT];
static int params[THREADCOUNT];
static HANDLE worker[THREADCOUNT];
// spawn worker threads
for( int i = 0; i < THREADCOUNT; i++ )
{
    params[i] = i;
    worker[i] = CreateThread( NULL, 0, workerthread, &params[i], 0, &threadId[i] );
}
Worker Threads
volatile LONG remaining = 0;
HANDLE goSignal[THREADCOUNT], doneSignal[THREADCOUNT];

unsigned long __stdcall workerthread( LPVOID param )
{
    int threadId = *(int*)param;
    while (1)
    {
        WaitForSingleObject( goSignal[threadId], INFINITE );
        while (remaining > 0)
        {
            // InterlockedDecrement returns the decremented value, which
            // is exactly the claimed task index (tileCount-1 .. 0).
            int task = (int)InterlockedDecrement( &remaining );
            if (task >= 0)
            {
                int tx = task % xtiles, ty = task / xtiles;
                drawtile( theScreen, tx * TILESIZE, ty * TILESIZE );
            }
        }
        SetEvent( doneSignal[threadId] );
    }
}
Worker Threads
remaining = tileCount;
for( int i = 0; i < THREADCOUNT; i++ ) SetEvent( goSignal[i] );
WaitForMultipleObjects( THREADCOUNT, doneSignal, true, INFINITE );
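The same claiming scheme in portable C++, as a sketch: std::atomic's fetch_sub plays the role of InterlockedDecrement, and each worker grabs tile indices until the counter runs out. The slide keeps its workers alive and wakes them with events; here threads are spawned per batch to keep the sketch short, and `drawnBy` stands in for drawtile so the result is checkable (all names are mine):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Workers claim tiles with an atomic decrement; fetch_sub returns the
// previous value, so 'prev - 1' is the claimed tile index.
void draw_tiles( int tileCount, int nthreads, std::vector<int>& drawnBy )
{
    std::atomic<int> remaining( tileCount );
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; t++)
        pool.emplace_back( [&, t] {
            for (;;)
            {
                int task = remaining.fetch_sub( 1 ) - 1; // claim a tile
                if (task < 0) break;                     // all tiles taken
                drawnBy[task] = t; // stand-in for drawtile( ... )
            }
        } );
    for (auto& th : pool) th.join();
}
```

Because fetch_sub hands each thread a unique previous value, every tile index is claimed exactly once, with no locks.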
▪ Self-modifying code ▪ Multi-threading (1) ▪ Multi-threading (2) ▪ Experiments