
/INFOMOV/ Optimization & Vectorization
J. Bikker - Sep-Nov 2019 - Lecture 7: “Data-Oriented Design”

Welcome!

Today’s Agenda:
▪ OOP Performance Pitfalls
▪ DOD Concepts
▪ DOD or OO?

Fact Checking

“Floating point code is (typically) non-deterministic”

Source code, with the generated x87 instructions shown as comments:

    float v0 = 1;                    // fld1
    float v1 = 1;                    // fld st(0)
    float v2 = 1;                    // fld st(1)
    float v3 = 1;                    // fld st(2)
    float v4 = 1;                    // fld st(3)
    float v5 = 1;                    // fld st(4)
    float v6 = 1;                    // fld st(5)
    float v7 = 1;                    // fld st(6)
    for (int i = 0; i < 2000000; i++)
    {
        v0 *= 1.00001f;              // fmul st(7),st ; fxch st(7) ; fstp [v0]
        v1 *= 1.00001f;              // fxch st(5) ; fmul st,st(6)
        v2 *= 1.00001f;              // fxch st(4) ; fmul st,st(6)
        v3 *= 1.00001f;              // fxch st(3) ; fmul st,st(6)
        v4 *= 1.00001f;              // fxch st(2) ; fmul st,st(6)
        v5 *= 1.00001f;              // fxch st(1) ; fmul st,st(6)
        v6 *= 1.00001f;              // fxch st(5) ; fmul st,st(6)
        v7 *= 1.00001f;              // fld [v7] ; fmul st,st(7) ; fstp [v7]
    }
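In the listing above, v0 and v7 are written back to 32-bit memory every iteration, while v1 to v6 stay in FPU registers. The following is a minimal sketch of why that can matter, not a reproduction of the slide's experiment: it runs the same multiplication once with a rounding to float after every step and once in a wider type. The function names are hypothetical, and `double` is used here only as a stand-in for a wider register.

```cpp
#include <cassert>

// Rounds to a 32-bit float after every multiplication. The volatile
// qualifier forces a store and reload each step, similar to the
// fstp [v0] in the listing.
float MulStoredAsFloat(int steps) {
    volatile float v = 1.0f;
    for (int i = 0; i < steps; i++) v = v * 1.00001f;
    return v;
}

// Keeps the running product in a wider type (double, as an assumed
// stand-in for an FPU register) and rounds to float only at the end.
float MulKeptWide(int steps) {
    double v = 1.0;
    for (int i = 0; i < steps; i++) v *= 1.00001f;
    return static_cast<float>(v);
}
```

Calling both with 2000000 steps and comparing the results shows how far the two roundings drift apart on a given machine and compiler.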

Fact Checking

“Doubles are slower than floats (4x)”

This statement is mostly true.

The real story, CPU (win32, x64):
▪ A float takes 32 bits in memory, but gets promoted to 80 bits in an FPU register.
▪ A double takes 64 bits in memory, but gets promoted to 80 bits in an FPU register.
▪ A long double takes 64 bits in memory, but gets promoted to 80 bits in an FPU register.
Calculation time on 80-bit FPU registers does not depend on the source of the data.
HOWEVER: the FPU registers are rarely used anymore…

The real story, GPU (Nvidia, AMD):
https://www.geeks3d.com/20140305/amd-radeon-and-nvidia-geforce-fp32-fp64-gflops-table-computing
▪ Titan V: FP64 = 1/2 * FP32 (6900 vs 13800 GFLOPS)
▪ Titan X Pascal: FP64 = 1/32 * FP32 (350 vs 11300 GFLOPS) (same for all 10xx)
▪ Radeon RX Vega 64: FP64 = 1/16 * FP32 (790 vs 12700 GFLOPS)
▪ Radeon HD 7990: FP64 = 1/4 * FP32 (1946 vs 7782 GFLOPS)

FP16 (GPU only):
https://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/5
▪ GTX 1080Ti: FP16 = 1/64 * FP32 (ouch)
▪ Radeon RX Vega 64: FP16 = 2 * FP32 (!)

OOP: “Death by a Thousand Cuts”

Object Oriented Programming:
▪ Objects
▪ Data
▪ Methods
▪ Instances

[Diagram: Actor with a Tick method; tank->Tick, bullet->Tick and smoke->Tick each have their own Tick.]

OOP: “Death by a Thousand Cuts”

Cost of a virtual function call:
1. Virtual Function Table
2. No inlining
…

Calling such a function:
1. Read pointer to VFT of base class (cache miss)
2. Add function offset
3. Read function address from VFT (cache miss)
4. Load address in PC (jump) (branch)

But, that isn’t realistic, right?
It is, if we use OO for what it was designed for: operating on heterogeneous objects.
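To make the steps above concrete, here is a minimal sketch of the heterogeneous case. The class names follow the tank, bullet and smoke labels on the slide; the `ticks` counter, `TickAll` and `MakeScene` are hypothetical and exist only so the example has observable behaviour.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical actor hierarchy, named after the slide's diagram.
struct Actor {
    virtual ~Actor() = default;
    virtual void Tick() = 0;   // resolved through the vtable at run time
    int ticks = 0;
};
struct Tank   : Actor { void Tick() override { ticks += 1; } };
struct Bullet : Actor { void Tick() override { ticks += 2; } };
struct Smoke  : Actor { void Tick() override { ticks += 3; } };

// Per actor: follow the pointer to the object, read its vtable pointer,
// read the function address, then jump. The objects are separate heap
// allocations, so consecutive actors need not be adjacent in memory.
int TickAll(std::vector<std::unique_ptr<Actor>>& actors) {
    int total = 0;
    for (auto& a : actors) {
        assert(a != nullptr);
        a->Tick();             // indirect call, cannot be inlined here
        total += a->ticks;
    }
    return total;
}

// Builds a small mixed scene: one actor of each type.
std::vector<std::unique_ptr<Actor>> MakeScene() {
    std::vector<std::unique_ptr<Actor>> actors;
    actors.push_back(std::make_unique<Tank>());
    actors.push_back(std::make_unique<Bullet>());
    actors.push_back(std::make_unique<Smoke>());
    return actors;
}
```

The compiler cannot know which `Tick` a given pointer refers to, which is why the call goes through the table instead of being inlined.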

OOP: “Death by a Thousand Cuts”

Characteristics of OO:
▪ Virtual calls
▪ Scattered individual objects

OOP: “Death by a Thousand Cuts”

Reading memory: 40 cycles @ 300 MHz
Reading memory: 600 cycles @ 3.2 GHz

The problem is growing with time.

OOP: “Death by a Thousand Cuts”

Dealing with “bandwidth starvation”:
▪ Caching
▪ Contiguous memory access (full cache lines)
▪ Contiguous access over large arrays (caches ‘read ahead’)

OOP: “Death by a Thousand Cuts”

Code performance is typically bound by memory access.

“The ideal data is in a format that we can use with the least amount of effort.”
➔ Effort = CPU-effort.

“Most programs are made faster if we improve their memory access patterns.”
(this will be more true every year)

“You cannot be fast without knowing how data is touched.”

OOP: “Death by a Thousand Cuts”

Parallel processing typically requires synchronization.

[Diagram: tank->Tick, bullet->Tick and smoke->Tick, each performing a read and a write.]

“You cannot multi-thread without knowing how data is touched.”

OOP: “Death by a Thousand Cuts”

Parallel processing requires coherent program flow.

“You cannot multi-thread without knowing how data is touched.”

OOP: “Death by a Thousand Cuts”

    class Bot : public Enemy
    {
        ...                      // cached but not used
        vec3 m_position;
        ...                      // cached but not used
        float m_mod;
        ...
        float m_aimDirection;
        ...
        virtual void updateAim( vec3 target )            // cache miss
        {
            m_aimDirection = dot3( m_position, target ) * m_mod;
            // cache miss, cache miss, cache miss
        }
    }

OOP: “Death by a Thousand Cuts”

    void updateAims( float* aimDir, const AimingData* aim, vec3 target, uint count )
    {
        for (uint i = 0; i < count; ++i)
        {
            // reads from linear array, writes to linear array
            aimDir[i] = dot3( aim->positions[i], target ) * aim->mod[i];
        }
    }

▪ Only reads data that is actually needed into the cache.
▪ Reads from a linear array.
▪ Writes to a linear array.
▪ Actual functionality is unchanged.
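The slide does not show `AimingData`, `vec3` or `dot3`. The sketch below fills them in under the assumption that `AimingData` holds one contiguous array per field; these definitions are illustrative, not taken from the lecture.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed minimal vector type and dot product.
struct vec3 { float x, y, z; };

inline float dot3(const vec3& a, const vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Assumed layout: one contiguous array per field, so the loop below
// only pulls positions and modifiers into the cache.
struct AimingData {
    std::vector<vec3>  positions;
    std::vector<float> mod;
};

// Same computation as Bot::updateAim, applied to all bots in one pass.
void updateAims(float* aimDir, const AimingData* aim, vec3 target, std::size_t count) {
    assert(aim->positions.size() >= count && aim->mod.size() >= count);
    for (std::size_t i = 0; i < count; ++i)
        aimDir[i] = dot3(aim->positions[i], target) * aim->mod[i];
}
```

With two bots at (1,0,0) and (0,2,0), modifiers 2 and 0.5, and target (3,4,5), the output array holds 6 and 4.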

OOP: Algorithm Performance Factors

Estimating algorithm cost:
1. Algorithmic Complexity: $O(N)$, $O(N^2)$, $O(N \log N)$, …
2. Cyclomatic Complexity* (or: Conditional Complexity)
3. Amdahl’s Law / Work-Span Model
4. Cache Effectiveness

*: McCabe, A Complexity Measure, 1976.

DOD: Data Oriented Design*

Origin: low-level game development.

Core idea: focus software design on CPU- and cache-aware data layout.

Take into account:
▪ Cache line size
▪ Data alignment
▪ Data size
▪ Access patterns
▪ Data transformations

Strive for a simple, linear access pattern as much as possible.

*: Nikos Drakos, “Data Oriented Design”, 2008. http://www.dataorienteddesign.com/dodmain

DOD: Bad Access Patterns: Linked List

The Perfect LinkedList™:

    struct LLNode
    {
        LLNode* next;
        int value;
    };

    LLNode* nodes = new LLNode[…];
    LLNode* pool = nodes;
    for( int i = 0; i < ...; i++ )
        nodes[i].next = &nodes[i + 1];
    …

    LLNode* NewNode( int value )
    {
        LLNode* retval = pool;
        pool = pool->next;
        retval->value = value;
        return retval;
    }

    list = NewNode( -MAXINT );
    list->next = NewNode( MAXINT );
    list->next->next = 0;

[Diagram: nodes: 0 0 0 0 0 0 0 0 0; list: -10000, 10000]

DOD: Bad Access Patterns: Linked List

The Perfect LinkedList™, experiment:

Insert 25000 random values in the list so that we obtain a sorted sequence.

    for( int i = 0; i < COUNT; i++ )
    {
        LLNode* node = NewNode( rand() & 8191 );
        LLNode* iter = list;
        while (iter->next->value < node->value) iter = iter->next;
        node->next = iter->next;
        iter->next = node;
    }
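The fragments from this slide and the previous one can be put together into something that compiles. The sketch below is one way to do so; the `SortedList` wrapper, `Count`, `IsSorted` and `FillRandom` are hypothetical additions, the pool is a `std::vector` instead of a raw `new[]`, and `INT_MIN` / `INT_MAX` stand in for the sentinels.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdlib>
#include <vector>

struct LLNode { LLNode* next; int value; };

class SortedList {
public:
    // Pool holds room for 'capacity' values plus the two sentinels.
    explicit SortedList(std::size_t capacity) : nodes(capacity + 2) {
        for (std::size_t i = 0; i + 1 < nodes.size(); i++) nodes[i].next = &nodes[i + 1];
        nodes.back().next = nullptr;
        pool = nodes.data();
        list = NewNode(INT_MIN);          // head sentinel
        list->next = NewNode(INT_MAX);    // tail sentinel
        list->next->next = nullptr;
    }
    SortedList(const SortedList&) = delete;             // nodes point into the pool
    SortedList& operator=(const SortedList&) = delete;

    // Walk the list until the next value is not smaller, then link in.
    void Insert(int value) {
        LLNode* node = NewNode(value);
        LLNode* iter = list;
        while (iter->next->value < node->value) iter = iter->next;
        node->next = iter->next;
        iter->next = node;
    }
    // Number of stored values, excluding both sentinels.
    std::size_t Count() const {
        std::size_t n = 0;
        for (const LLNode* it = list->next; it->next != nullptr; it = it->next) n++;
        return n;
    }
    bool IsSorted() const {
        for (const LLNode* it = list; it->next != nullptr; it = it->next)
            if (it->value > it->next->value) return false;
        return true;
    }
private:
    // Takes the next free node from the pool.
    LLNode* NewNode(int value) {
        assert(pool != nullptr);
        LLNode* retval = pool;
        pool = pool->next;
        retval->value = value;
        return retval;
    }
    std::vector<LLNode> nodes;
    LLNode* pool = nullptr;
    LLNode* list = nullptr;
};

// The experiment loop from the slide, with COUNT passed as a parameter.
void FillRandom(SortedList& l, int count) {
    for (int i = 0; i < count; i++) l.Insert(std::rand() & 8191);
}
```

Even though all nodes come from one contiguous pool, the traversal order follows the `next` pointers, so it jumps around inside that pool once values are inserted out of order.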

DOD: Bad Access Patterns: Linked List

KISS Array™:

    data = new int[…];
    memset( data, 0, … * sizeof( int ) );
    data[0] = -10000;
    data[1] = 10000;
    N = 2;
    for( int i = 0; i < COUNT; i++ )
    {
        int pos = 1, value = rand() & 8191;
        while (data[pos] < value) pos++;
        // memmove instead of memcpy: source and destination overlap
        memmove( data + pos + 1, data + pos, (N - pos + 1) * sizeof( int ) );
        data[pos] = value, N++;
    }
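A compilable version of the array experiment might look as follows. The `SortedArray` wrapper, `IsSorted` and `FillRandomArray` are hypothetical names. Two deliberate choices differ from the slide: `std::memmove` is used because the shifted range overlaps itself (for which `memcpy` is undefined), and exactly `N - pos` elements are shifted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

struct SortedArray {
    std::vector<int> data;
    int N;   // number of used slots, including the two sentinels

    // Room for 'capacity' values plus the sentinels -10000 and 10000.
    explicit SortedArray(int capacity) : data(static_cast<std::size_t>(capacity) + 2, 0), N(2) {
        data[0] = -10000;
        data[1] = 10000;
    }

    // Linear search for the slot, then shift the tail up by one element.
    void Insert(int value) {
        assert(N < static_cast<int>(data.size()));
        assert(value <= 10000);            // the tail sentinel must stop the search
        int pos = 1;
        while (data[pos] < value) pos++;
        std::memmove(data.data() + pos + 1, data.data() + pos,
                     static_cast<std::size_t>(N - pos) * sizeof(int));
        data[pos] = value;
        N++;
    }

    bool IsSorted() const {
        for (int i = 1; i < N; i++)
            if (data[i - 1] > data[i]) return false;
        return true;
    }
};

// The experiment loop from the slide, with COUNT passed as a parameter.
void FillRandomArray(SortedArray& a, int count) {
    for (int i = 0; i < count; i++) a.Insert(std::rand() & 8191);
}
```

Both the search and the shift touch memory strictly front to back, which is the access pattern the hardware prefetcher handles best.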

DOD: Linked list versus array

Linked list:

    for( int i = 0; i < COUNT; i++ )
    {
        LLNode* node = NewNode( rand() & 8191 );
        LLNode* iter = list;
        while (iter->next->value < node->value) iter = iter->next;
        node->next = iter->next;
        iter->next = node;
    }

Array:

    for( int i = 0; i < COUNT; i++ )
    {
        int pos = 1, value = rand() & 8191;
        while (data[pos] < value) pos++;
        // memmove instead of memcpy: source and destination overlap
        memmove( data + pos + 1, data + pos, (N - pos + 1) * sizeof( int ) );
        data[pos] = value, N++;
    }

DOD: Bad Access Patterns: Linked List*

Inserting elements in an array by shifting the remainder of the array is significantly faster than using an optimized linked list.

Why?
▪ Finding the location in the array: pure linear access.
▪ Shifting the remainder: pure linear access.
➔ Even though the amount of transferred memory is huge, this approach wins.

*: Also see: Nathan Reed, Data Oriented Hash Table, 2015. http://www.reedbeta.com/blog/data-oriented-hash-table

DOD: Bad Access Patterns: Octree

[Diagram: octree with Root, Level 1 and Level 2.]

DOD: Bad Access Patterns: Octree

Query: find the color of a voxel visible through pixel (x,y).

Operation: ‘3D DDA’ (basically: Bresenham).

Data layout (tree level of each stored node):
0 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 …

Color data: 32-bit (ARGB).
