/INFOMOV/ Optimization & Vectorization
- J. Bikker - Sep-Nov 2015 - Lecture 10: “Practical”
Welcome!
Today's Agenda:
DotCloud: profiling & high-level (1)
DotCloud: low-level and blind stupidity
DotCloud: high-level (2)
Re-introducing DotCloud
Application breakdown:
Tick Sort Transform Render DrawScaled
Performance Analysis & Scalability
ms per frame    256      1024     4096     16384
Transform       0.002    0.005    0.016    0.061
Sort            0.090    1.190    21.600   480.100
Render          0.650    1.420    5.130    19.681

ms per dot      256      1024     4096     16384
Transform       0.0000   0.0000   0.0000   0.0000
Sort            0.0004   0.0011   0.0053   0.0293
Render          0.0025   0.0014   0.0013   0.0012
Solving the Sort Problem
Current Sort: bubblesort ( O(N²) ).
Alternatives*: Quicksort, Heapsort, Mergesort, Radixsort, Insertionsort, Selectionsort, Monkeysort, Countingsort, Introsort, Shell sort, Binary tree sort, Library sort, Smoothsort, Strand sort, Cocktail sort, Comb sort, Block sort, Odd-even sort, Pigeonhole sort, Bucket sort, Spread sort, Burstsort, Flashsort, Postman sort, Bread sort, Bitonic sort, Stooge sort
* See e.g.: http://www.sorting-algorithms.com
Solving the Sort Problem
Current Sort: bubblesort ( O(N²) ). Best case: O(N). Which case do we have here? Factors:
How much effort should we spend on this?
than rendering
be fine.
that fits well in the current code (save time for other optimizations).
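To make the O(N) best case concrete, here is a minimal sketch of a bubblesort with an early exit. `Dot` and `BubbleSortByZ` are hypothetical names, not taken from the DotCloud code; the function returns the number of passes so the difference between sorted and reversed input is visible.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the dot type; only the depth (z) matters here.
struct Dot { float x, y, z; };

// Bubblesort on z with an early exit; returns the number of passes made.
// A pass without swaps proves the array is sorted, so already sorted input
// costs a single pass, while reversed input needs N - 1 passes.
int BubbleSortByZ( std::vector<Dot>& dots )
{
    int passes = 0;
    bool swapped = true;
    for( size_t end = dots.size(); swapped && end > 1; end-- )
    {
        swapped = false;
        passes++;
        for( size_t i = 0; i + 1 < end; i++ )
            if (dots[i].z > dots[i + 1].z)
            {
                std::swap( dots[i], dots[i + 1] );
                swapped = true;
            }
    }
    return passes;
}
```

Whether DotCloud is close to this best case depends on how much the depth order changes from frame to frame, which is exactly the question the slide asks.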
Solving the Sort Problem
Current Sort: bubblesort ( O(N²) ). Alternative: QuickSort ( O(N log N) ).
void Swap( vec3& a, vec3& b ) { vec3 t = a; a = b; b = t; }

int Pivot( vec3 a[], int first, int last )
{
    int p = first;
    vec3 e = a[first];
    for( int i = first + 1; i <= last; i++ )
        if (a[i].z <= e.z) Swap( a[i], a[++p] );
    Swap( a[p], a[first] );
    return p;
}

void QuickSort( vec3 a[], int first, int last )
{
    int pivotElement;
    if (first >= last) return;
    pivotElement = Pivot( a, first, last );
    QuickSort( a, first, pivotElement - 1 );
    QuickSort( a, pivotElement + 1, last );
}
ms per frame    256      1024     4096     16384
Transform       0.002    0.005    0.016    0.061
Sort (bubble)   0.090    1.190    21.600   480.100
Sort (quick)    0.014    0.063    0.305    1.569
Render          0.650    1.420    5.130    19.681
Repeated Profiling
Low Level Optimization of DrawScaled
void Sprite::DrawScaled( int a_X, int a_Y, int a_Width, int a_Height, Surface* a_Target )
{
    if ((a_Width == 0) || (a_Height == 0)) return;
    for ( int x = 0; x < a_Width; x++ ) for ( int y = 0; y < a_Height; y++ )
    {
        int u = (int)((float)x * ((float)m_Width / (float)a_Width));
        int v = (int)((float)y * ((float)m_Height / (float)a_Height));
        Pixel color = GetBuffer()[u + v * m_Pitch];
        if (color & 0xffffff)
            a_Target->GetBuffer()[a_X + x + ((a_Y + y) * a_Target->GetPitch())] = color;
    }
}
Functionality:
Low Level Optimization of DrawScaled
A few basic optimizations:
void Sprite::DrawScaled( int a_X, int a_Y, int a_Width, int a_Height, Surface* a_Target )
{
    for ( int y = 0; y < a_Height; y++ )
    {
        int v = (int)((float)y * ((float)m_Height / (float)a_Height));
        for ( int x = 0; x < a_Width; x++ )
        {
            int u = (int)((float)x * ((float)m_Width / (float)a_Width));
            Pixel color = GetBuffer()[u + v * m_Pitch];
            if (color & 0xffffff)
                a_Target->GetBuffer()[a_X + x + ((a_Y + y) * a_Target->GetPitch())] = color;
        }
    }
}
Low Level Optimization of DrawScaled
More basic optimizations:
void Sprite::DrawScaled( int a_X, int a_Y, int a_Width, int a_Height, Surface* a_Target )
{
    float rh = (float)m_Height / (float)a_Height, rw = (float)m_Width / (float)a_Width;
    for ( int y = 0; y < a_Height; y++ )
    {
        int v = (int)((float)y * rh);
        Pixel* line = a_Target->GetBuffer() + a_X + (a_Y + y) * a_Target->GetPitch();
        for ( int x = 0; x < a_Width; x++ )
        {
            int u = (int)((float)x * rw);
            Pixel color = GetBuffer()[u + v * m_Pitch];
            if (color & 0xffffff) line[x] = color;
        }
    }
}
Low Level Optimization of DrawScaled
Fixed point optimization:
void Sprite::DrawScaled( int a_X, int a_Y, int a_Width, int a_Height, Surface* a_Target )
{
    // integer division below would fault on a zero-sized target
    if ((a_Width == 0) || (a_Height == 0)) return;
    const int rh = (m_Height << 10) / a_Height, rw = (m_Width << 10) / a_Width;
    Pixel* line = a_Target->GetBuffer() + a_X + a_Y * a_Target->GetPitch();
    for ( int y = 0; y < a_Height; y++, line += a_Target->GetPitch() )
    {
        const int v = (y * rh) >> 10;
        for ( int x = 0; x < a_Width; x++ )
        {
            const int u = (x * rw) >> 10;
            const Pixel color = GetBuffer()[u + v * m_Pitch];
            if (color & 0xffffff) line[x] = color;
        }
    }
}
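The fixed-point step can be isolated in a small sketch. `ScaleFixed` and `ScaleFloat` are hypothetical helper names; they reproduce only the coordinate mapping of the fixed-point and floating-point versions, assuming a positive target size and products that fit in a 32-bit int.

```cpp
// Maps a target coordinate to a source coordinate using 10 fractional bits:
// the ratio srcSize / dstSize is stored as an integer scaled by 1024.
// Assumes dstSize > 0 and that dst * ratio fits in a 32-bit int.
int ScaleFixed( int dst, int srcSize, int dstSize )
{
    const int ratio = (srcSize << 10) / dstSize; // truncated towards zero
    return (dst * ratio) >> 10;                  // drop the fractional bits
}

// Floating-point reference for the same mapping.
int ScaleFloat( int dst, int srcSize, int dstSize )
{
    return (int)((float)dst * ((float)srcSize / (float)dstSize));
}
```

Because the ratio is truncated, the fixed-point result can land one source pixel below the floating-point result (for example target pixel 3 when scaling 64 to 48), but it never exceeds the last source pixel.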
Low Level Optimization of DrawScaled
Now what?
How many different ball sizes do we encounter? …Why don’t we simply precalculate those frames?
High Level Optimization of DrawScaled
Sprite* scaled[64];

void Game::Init()
{
    ...
    for( int i = 0; i < 64; i++ )
    {
        int size = i + 1;
        scaled[i] = new Sprite( new Surface( size, size ), 1 );
        scaled[i]->GetSurface()->Clear( 0 );
        m_Dot->DrawScaled( 0, 0, size, size, scaled[i]->GetSurface() );
    }
}

scaled[size]->Draw( (sx - size / 2), (sy - size / 2), screen );
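The precalculation idea can be tried without the lecture's `Sprite` and `Surface` classes. The sketch below is a stand-alone illustration under that assumption: `Image`, `ScaleNearest` and `BuildScaledCache` are hypothetical names, and the scaling uses the same 10-bit fixed-point mapping as the fixed-point DrawScaled.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal image type: row-major pixels, width * height entries.
struct Image
{
    int width = 0, height = 0;
    std::vector<uint32_t> pixels;
};

// Nearest-neighbour scale of src to size x size (size > 0 assumed).
Image ScaleNearest( const Image& src, int size )
{
    Image dst;
    dst.width = dst.height = size;
    dst.pixels.resize( (size_t)size * size );
    const int rw = (src.width << 10) / size, rh = (src.height << 10) / size;
    for( int y = 0; y < size; y++ )
    {
        const int v = (y * rh) >> 10;
        for( int x = 0; x < size; x++ )
            dst.pixels[x + y * size] = src.pixels[((x * rw) >> 10) + v * src.width];
    }
    return dst;
}

// Builds every size from 1 to maxSize once, at startup.
// Entry i holds the (i + 1)-pixel version, so size s lives at index s - 1.
std::vector<Image> BuildScaledCache( const Image& src, int maxSize )
{
    std::vector<Image> cache;
    cache.reserve( maxSize );
    for( int size = 1; size <= maxSize; size++ )
        cache.push_back( ScaleNearest( src, size ) );
    return cache;
}
```

At draw time the scaling work disappears: the renderer picks the cached image for the required size and copies it.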
ms per frame    256      1024     4096     16384
Transform       0.002    0.005    0.016    0.061
Sort            0.014    0.063    0.305    1.569
Render (old)    0.650    1.420    5.130    19.681
Render (new)    0.350    0.720    1.977    6.383
Repeated Profiling
Optimization of Dense Clouds
Observation: beyond a certain dot count, a large number of particles is occluded. Specifically, we won’t be able to see the back half.
if (m_Rotated[i].z > -0.2f)
    scaled[size]->Draw( (sx - size / 2), (sy - size / 2), screen );
(Perhaps we could also limit rendering to the outer shell of the cloud?)
Rendering is now down to 4.8ms, and sorting is slowly becoming significant again: at 65536 dots, we get 4.7ms for sorting, 17.3ms for rendering.
Low Level Optimization of DrawScaled
Extreme Optimization:
FILE* f = fopen( "drawfunc.h", "w" );
fprintf( f, "void Sprite::DrawBall( int x, int y, int size, Surface* target )\n" );
fprintf( f, "{\nuint* a = target->GetBuffer() + x + y * SCRWIDTH;\nswitch( size )\n{\n" );
for( int i = 0; i < 64; i++ )
{
    ...
    fprintf( f, "case %i:\n", size );
    for( int y = 0; y < size; y++) for( int x = 0; x < size; x++ )
    {
        int a = y * SCRWIDTH + x;
        if (scaled[i]->GetBuffer()[x + y * size] & 0xffffff)
            fprintf( f, "a[%i]=%i;\n", a, scaled[i]->GetBuffer()[x + y * size] & 0xffffff );
    }
    fprintf( f, "break;\n" );
}
fprintf( f, "}\n}\n" );
Low Level Optimization of DrawScaled
The last optimization worked surprisingly well, yielding a final performance of: 65536 dots @ ~7ms (render time only). Sorting is now definitely significant.
Sorting in O(1)
For this specific situation, we can sort in O(1), i.e., independent of particle count.
Observation: dots do not move independently.
Intuition: why rotate 64k dots if you can rotate a single camera?
Sorting in O(1)
For each split:
Where ‘nearest’ is the side that the ‘camera’ is on.
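The per-split steps did not survive on this slide, so the following is only a sketch under an assumption: that the rule is the usual back-to-front one, visiting the far side of each split first and the nearest side last. The dots are arranged once as an implicit kd-tree; per frame, only the camera position changes. `Point`, `Build` and `BackToFront` are hypothetical names.

```cpp
#include <algorithm>
#include <array>
#include <vector>

using Point = std::array<float, 3>;

// One-time build: arranges pts[first, last) as an implicit kd-tree.
// The median along the current axis ends up at index (first + last) / 2.
void Build( std::vector<Point>& pts, int first, int last, int axis = 0 )
{
    if (last - first < 2) return;
    const int mid = (first + last) / 2;
    std::nth_element( pts.begin() + first, pts.begin() + mid, pts.begin() + last,
        [axis]( const Point& a, const Point& b ) { return a[axis] < b[axis]; } );
    Build( pts, first, mid, (axis + 1) % 3 );
    Build( pts, mid + 1, last, (axis + 1) % 3 );
}

// Per-frame traversal: no sorting, only a choice of visiting order per split.
// Appends point indices to 'order', far side first, nearest side last.
void BackToFront( const std::vector<Point>& pts, const Point& cam,
    int first, int last, int axis, std::vector<int>& order )
{
    if (first >= last) return;
    const int mid = (first + last) / 2, next = (axis + 1) % 3;
    // 'nearest' is the side of the split that the camera is on
    const bool camBelow = cam[axis] < pts[mid][axis];
    if (camBelow) BackToFront( pts, cam, mid + 1, last, next, order );
    else BackToFront( pts, cam, first, mid, next, order );
    order.push_back( mid );
    if (camBelow) BackToFront( pts, cam, first, mid, next, order );
    else BackToFront( pts, cam, mid + 1, last, next, order );
}
```

The traversal still visits every dot, but it does no comparisons between dots, so the ordering cost per frame no longer depends on a sort.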
Introducing Billiards
Application breakdown: Clearly, we have O(N²) behavior here. How do we fix this efficiently?
Game::Tick For each ball:
Profiling
Adding a timer:
timer t; t.reset();
Visualizing timing result:
screen->Bar( 10, 10, 170, 20, 0 );
char r[128];
sprintf( r, "simulation time: %5.2f ms", t.elapsed() );
screen->Print( r, 12, 11, 0xffffff );
Apparently, we can check 2k balls against 2k balls in 4.5ms.
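The `timer` class used above comes with the course code and is not shown. If it is not at hand, a stand-in with the same `reset()` / `elapsed()` interface can be sketched on top of `std::chrono`; the class name `Timer` and the choice of milliseconds as the unit are assumptions made to match the printed "ms" value.

```cpp
#include <chrono>

// Minimal stand-in for a frame timer: reset() marks a start point,
// elapsed() returns the milliseconds passed since that point.
class Timer
{
public:
    Timer() { reset(); }
    void reset() { start = std::chrono::steady_clock::now(); }
    float elapsed() const
    {
        const std::chrono::duration<float, std::milli> ms =
            std::chrono::steady_clock::now() - start;
        return ms.count();
    }
private:
    std::chrono::steady_clock::time_point start;
};
```

`steady_clock` is used because it never jumps backwards, which keeps consecutive measurements comparable.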
Low level optimization
Lots of square roots in the main loop:
Improvements:
too close.
problematic (e.g., squared length > 1 + epsilon).
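One common way to cut square roots from a collision loop is to compare squared distances and take the root only for pairs that actually overlap. The sketch below illustrates that idea; it is not the lecture's code, and `Vec2`, `Overlaps` and `Penetration` are hypothetical names. Equal radii are assumed.

```cpp
#include <cmath>

struct Vec2 { float x, y; };

// True if two balls of radius r overlap. Compares the squared distance to
// the squared sum of the radii, so the test itself needs no square root.
bool Overlaps( const Vec2& a, const Vec2& b, float r )
{
    const float dx = b.x - a.x, dy = b.y - a.y;
    const float minDist = 2.0f * r;
    return dx * dx + dy * dy < minDist * minDist;
}

// Penetration depth; the square root is paid only for overlapping pairs.
float Penetration( const Vec2& a, const Vec2& b, float r )
{
    if (!Overlaps( a, b, r )) return 0.0f;
    const float dx = b.x - a.x, dy = b.y - a.y;
    return 2.0f * r - std::sqrt( dx * dx + dy * dy );
}
```

Since most pairs in a frame do not touch, nearly all tests finish after a few multiplies and one compare.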
High Level Optimization
Each ball checks every other ball (with a larger index, so in fact only half of the other balls on average). We need an efficient way to find nearby balls.
Grid it is: we will check the balls in the cell our ball is in, plus the direct neighbors.
Steps:
1. Per frame, update the grid
2. Per ball, determine relevant cells
3. Loop over cells
4. Loop over balls in cell
High Level Optimization
Attempt 1: std::vector
// allocate grid
std::vector<int> grid[SCRHEIGHT / 16][SCRWIDTH / 16];
// fill the grid
for( int y = 0; y < SCRHEIGHT / 16; y++ ) for( int x = 0; x < SCRWIDTH / 16; x++ )
    grid[y][x].clear();
for( int i = 0; i < BALLCOUNT; i++ )
{
    int gx = CLAMP( (int)(pos[i].x / 16 ), 0, SCRWIDTH / 16 - 1 );
    int gy = CLAMP( (int)(pos[i].y / 16 ), 0, SCRHEIGHT / 16 - 1 );
    grid[gy][gx].push_back( i );
}
High Level Optimization
Attempt 1: std::vector
// using the grid
int gx = CLAMP( (int)(pos[i].x / 16 ), 0, SCRWIDTH / 16 - 1 );
int gy = CLAMP( (int)(pos[i].y / 16 ), 0, SCRHEIGHT / 16 - 1 );
int gx1 = MAX( 0, gx - 1 ), gx2 = MIN( SCRWIDTH / 16 - 1, gx + 1 );
int gy1 = MAX( 0, gy - 1 ), gy2 = MIN( SCRHEIGHT / 16 - 1, gy + 1 );
for( int y = gy1; y <= gy2; y++ ) for( int x = gx1; x <= gx2; x++ )
    for( int k = 0; k < grid[y][x].size(); k++ )
    {
        int j = grid[y][x][k];
        if (j <= i) continue;
High Level Optimization
Attempt 1: std::vector
Result: simulation time went down from 4.5ms to 1.17ms (although performance is very unstable).
High Level Optimization
Attempt 2: regular arrays Assumption: in a 16x16 grid cell, we will never have more than 8 balls.
// allocate grid
int grid[SCRHEIGHT / 16][SCRWIDTH / 16][8];
// allocate array for storing grid cell ball count
int nr[SCRHEIGHT / 16][SCRWIDTH / 16];
// fill the grid
memset( nr, 0, sizeof( nr ) );
for( int i = 0; i < BALLCOUNT; i++ )
{
    int gx = CLAMP( (int)(pos[i].x / 16 ), 0, SCRWIDTH / 16 - 1 );
    int gy = CLAMP( (int)(pos[i].y / 16 ), 0, SCRHEIGHT / 16 - 1 );
    grid[gy][gx][nr[gy][gx]++] = i;
}
High Level Optimization
Attempt 2: regular arrays
// using the grid
int gx = CLAMP( (int)(pos[i].x / 16 ), 0, SCRWIDTH / 16 - 1 );
int gy = CLAMP( (int)(pos[i].y / 16 ), 0, SCRHEIGHT / 16 - 1 );
int gx1 = MAX( 0, gx - 1 ), gx2 = MIN( SCRWIDTH / 16 - 1, gx + 1 );
int gy1 = MAX( 0, gy - 1 ), gy2 = MIN( SCRHEIGHT / 16 - 1, gy + 1 );
for( int y = gy1; y <= gy2; y++ ) for( int x = gx1; x <= gx2; x++ )
    for( int k = 0; k < nr[y][x]; k++ )
    {
        int j = grid[y][x][k];
        if (j <= i) continue;
High Level Optimization
Attempt 2: regular arrays
Result: simulation time went down from 1.17ms to 1.07ms (but still quite unstable).
…Why is it unstable?
After fixing this, we are now at 0.45ms, a 10x improvement.
Data Locality
The access pattern is currently as follows:
For every ball
  For every cell in the neighborhood
    Check every ball in the cell
Each individual ball will be in a random cell. Drawing the balls will access the screen randomly.
We can improve locality by looping over cells in the outer loop.
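A minimal sketch of the cell-outer loop order, assuming the fixed-size arrays of attempt 2. The `Grid` struct, the small grid dimensions and `CandidatePairs` are hypothetical and chosen to keep the example self-contained; the function only collects candidate pairs and leaves the actual collision test out.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

constexpr int GRIDW = 4, GRIDH = 4, MAXPERCELL = 8; // hypothetical sizes

struct Grid
{
    int ball[GRIDH][GRIDW][MAXPERCELL]; // ball indices per cell
    int nr[GRIDH][GRIDW];               // number of balls per cell
};

// Visits candidate pairs cell by cell, so consecutive work touches balls
// that are close together on screen. Each pair (i, j) with i < j is
// reported once, using the same 'larger index only' rule as before.
std::vector<std::pair<int, int>> CandidatePairs( const Grid& g )
{
    std::vector<std::pair<int, int>> pairs;
    for( int cy = 0; cy < GRIDH; cy++ ) for( int cx = 0; cx < GRIDW; cx++ )
    {
        // neighborhood of this cell, clamped to the grid
        const int y1 = std::max( 0, cy - 1 ), y2 = std::min( GRIDH - 1, cy + 1 );
        const int x1 = std::max( 0, cx - 1 ), x2 = std::min( GRIDW - 1, cx + 1 );
        for( int n = 0; n < g.nr[cy][cx]; n++ )
        {
            const int i = g.ball[cy][cx][n];
            for( int y = y1; y <= y2; y++ ) for( int x = x1; x <= x2; x++ )
                for( int k = 0; k < g.nr[y][x]; k++ )
                {
                    const int j = g.ball[y][x][k];
                    if (j > i) pairs.push_back( { i, j } );
                }
        }
    }
    return pairs;
}
```

The neighborhood bounds are now computed once per cell instead of once per ball, and all balls of a cell are processed while its neighbors are still in cache.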
More Profiling
Currently, our bottlenecks are:
Also note that the three for loops take quite a bit of time.
High Level Optimization
High level optimization:
High level optimization almost always yields the biggest gains in performance. Typical approach:
Low Level Optimization
Use that list of common opportunities!
And: