Welcome! Todays Agenda: DotCloud: profiling & high-level (1) - - PowerPoint PPT Presentation


/INFOMOV/ Optimization & Vectorization J. Bikker - Sep-Nov 2015 - Lecture 10: Practical Welcome! Todays Agenda: DotCloud: profiling & high-level (1) DotCloud: low-level and blind stupidity DotCloud: high-level (2)


slide-1
SLIDE 1

/INFOMOV/ Optimization & Vectorization

  • J. Bikker - Sep-Nov 2015 - Lecture 10: “Practical”

Welcome!

slide-2
SLIDE 2

Today’s Agenda:

  • DotCloud: profiling & high-level (1)
  • DotCloud: low-level and blind stupidity
  • DotCloud: high-level (2)
  • Billiards: high level
  • Digest
slide-3
SLIDE 3

Re-introducing DotCloud

Application breakdown: Tick, Sort, Transform, Render, DrawScaled

INFOMOV – Lecture 10 – “Practical” 3

Practical Matters

slide-4
SLIDE 4

Performance Analysis & Scalability

ms per frame    256      1024     4096     16384
Transform       0.002    0.005    0.016    0.061
Sort            0.090    1.190    21.600   480.100
Render          0.650    1.420    5.130    19.681

ms per dot      256      1024     4096     16384
Transform       0.0000   0.0000   0.0000   0.0000
Sort            0.0004   0.0011   0.0053   0.0293
Render          0.0025   0.0014   0.0013   0.0012

slide-5
SLIDE 5

Solving the Sort Problem

Current Sort: bubblesort ( O(N²) ).

Alternatives*: Quicksort, Heapsort, Mergesort, Radixsort, Insertionsort, Selectionsort, Monkeysort, Countingsort, Introsort, Shell sort, Binary tree sort, Library sort, Smoothsort, Strand sort, Cocktail sort, Comb sort, Block sort, Odd-even sort, Pigeonhole sort, Bucket sort, Spread sort, Burstsort, Flashsort, Postman sort, Bead sort, Bitonic sort, Stooge sort.

* See e.g.: http://www.sorting-algorithms.com

slide-6
SLIDE 6

Solving the Sort Problem

Current Sort: bubblesort ( O(N²) ). Best case: O(N). Which case do we have here? Factors:

  • Size of set
  • Already sorted / almost sorted?
  • Distributed (even / uneven)
  • Type of data (just key / full records)
  • Key type (float / int / string)

How much effort should we spend on this?

  • For small sets, sorting takes far less time than rendering.
  • Anything that is not O(N²) will probably be fine.
  • It would be nice if we can find something that fits well in the current code (saves time for other optimizations).
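The O(N) best case mentioned above only applies to a bubblesort that stops once a pass makes no swaps. A minimal sketch of that variant, assuming plain float keys (the function name is hypothetical and not taken from the DotCloud code):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Bubble sort with an early-out flag: stops as soon as a full pass
// performs no swap. Returns the number of passes made over the data.
// (Hypothetical helper, not part of the DotCloud code.)
int BubbleSortEarlyOut( std::vector<float>& keys )
{
    int passes = 0;
    bool swapped = true;
    for( std::size_t end = keys.size(); swapped && end > 1; end-- )
    {
        swapped = false;
        passes++;
        // after each pass the largest remaining key sits at position end - 1
        for( std::size_t i = 0; i + 1 < end; i++ )
            if (keys[i] > keys[i + 1])
            {
                std::swap( keys[i], keys[i + 1] );
                swapped = true;
            }
    }
    return passes;
}
```

On already sorted input this makes a single pass; on reversed input of N keys it makes N - 1 passes, which is where the quadratic cost comes from.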

slide-7
SLIDE 7

Solving the Sort Problem

Current Sort: bubblesort ( O(N²) ). Alternative: QuickSort ( O(N log N) on average ).

void Swap( vec3& a, vec3& b ) { vec3 t = a; a = b; b = t; }

// partition a[first..last] around the first element; returns its final position
int Pivot( vec3 a[], int first, int last )
{
    int p = first;
    vec3 e = a[first];
    for( int i = first + 1; i <= last; i++ )
        if (a[i].z <= e.z) Swap( a[i], a[++p] );
    Swap( a[p], a[first] );
    return p;
}

void QuickSort( vec3 a[], int first, int last )
{
    int pivotElement;
    if (first >= last) return;
    pivotElement = Pivot( a, first, last );
    QuickSort( a, first, pivotElement - 1 );
    QuickSort( a, pivotElement + 1, last );
}
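As a point of comparison, the same depth sort can be written with the standard library. This is only a sketch: the `vec3` struct below is a stand-in for the course's own type, and `SortByDepth` is a hypothetical name.

```cpp
#include <algorithm>
#include <vector>

struct vec3 { float x, y, z; }; // stand-in for the course's vec3 type

// Sort dots by depth (z) with std::sort, which guarantees O(N log N) comparisons.
void SortByDepth( std::vector<vec3>& dots )
{
    std::sort( dots.begin(), dots.end(),
        []( const vec3& a, const vec3& b ) { return a.z < b.z; } );
}
```

Unlike the hand-written QuickSort with a first-element pivot, `std::sort` does not degrade to quadratic time on input that is already (almost) sorted.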

slide-8
SLIDE 8

Repeated Profiling

ms per frame    256      1024     4096     16384
Transform       0.002    0.005    0.016    0.061
Sort (bubble)   0.090    1.190    21.600   480.100
Sort (quick)    0.014    0.063    0.305    1.569
Render          0.650    1.420    5.130    19.681

slide-9
SLIDE 9

Low Level Optimization of DrawScaled

void Sprite::DrawScaled( int a_X, int a_Y, int a_Width, int a_Height, Surface* a_Target )
{
    if ((a_Width == 0) || (a_Height == 0)) return;
    for ( int x = 0; x < a_Width; x++ )
        for ( int y = 0; y < a_Height; y++ )
        {
            int u = (int)((float)x * ((float)m_Width / (float)a_Width));
            int v = (int)((float)y * ((float)m_Height / (float)a_Height));
            Pixel color = GetBuffer()[u + v * m_Pitch];
            if (color & 0xffffff) a_Target->GetBuffer()[a_X + x + ((a_Y + y) * a_Target->GetPitch())] = color;
        }
}

Functionality:

  • for every pixel of the rectangular target image,
  • find the corresponding source pixel,
  • using interpolation.

slide-10
SLIDE 10

Low Level Optimization of DrawScaled

A few basic optimizations:

void Sprite::DrawScaled( int a_X, int a_Y, int a_Width, int a_Height, Surface* a_Target )
{
    for ( int y = 0; y < a_Height; y++ )
    {
        int v = (int)((float)y * ((float)m_Height / (float)a_Height));
        for ( int x = 0; x < a_Width; x++ )
        {
            int u = (int)((float)x * ((float)m_Width / (float)a_Width));
            Pixel color = GetBuffer()[u + v * m_Pitch];
            if (color & 0xffffff) a_Target->GetBuffer()[a_X + x + ((a_Y + y) * a_Target->GetPitch())] = color;
        }
    }
}

  • Loop swap (to improve cache usage)
  • Loop hoisting (variable v is constant inside x loop)
  • Removed check for zero-width sprite (doesn’t happen in our case)

slide-11
SLIDE 11

Low Level Optimization of DrawScaled

More basic optimizations:

void Sprite::DrawScaled( int a_X, int a_Y, int a_Width, int a_Height, Surface* a_Target )
{
    float rh = (float)m_Height / (float)a_Height, rw = (float)m_Width / (float)a_Width;
    for ( int y = 0; y < a_Height; y++ )
    {
        int v = (int)((float)y * rh);
        Pixel* line = a_Target->GetBuffer() + a_X + (a_Y + y) * a_Target->GetPitch();
        for ( int x = 0; x < a_Width; x++ )
        {
            int u = (int)((float)x * rw);
            Pixel color = GetBuffer()[u + v * m_Pitch];
            if (color & 0xffffff) line[x] = color;
        }
    }
}

  • Precalculate m_Height / a_Height, m_Width / a_Width
  • Calculate y component of target address once per line

slide-12
SLIDE 12

Low Level Optimization of DrawScaled

Fixed point optimization:

void Sprite::DrawScaled( int a_X, int a_Y, int a_Width, int a_Height, Surface* a_Target )
{
    const int rh = (m_Height << 10) / a_Height, rw = (m_Width << 10) / a_Width;
    Pixel* line = a_Target->GetBuffer() + a_X + a_Y * a_Target->GetPitch();
    for ( int y = 0; y < a_Height; y++, line += a_Target->GetPitch() )
    {
        const int v = (y * rh) >> 10;
        for ( int x = 0; x < a_Width; x++ )
        {
            const int u = (x * rw) >> 10;
            const Pixel color = GetBuffer()[u + v * m_Pitch];
            if (color & 0xffffff) line[x] = color;
        }
    }
}

  • Fixed point works really well here… but doesn’t improve performance.
  • Incremental calculation of line address helps a bit
  • Seems we reached the end here…
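To see the fixed-point step in isolation, here is a small sketch that resamples a single row with the same 10-bit fractional scale factor. `ScaleRow` is a hypothetical free function operating on a plain `std::vector<int>`; the slide code does this inside `Sprite::DrawScaled`.

```cpp
#include <vector>

// Nearest-neighbour resampling of one row using a 10-bit fixed-point step.
// (Hypothetical helper; not part of the Sprite class.)
std::vector<int> ScaleRow( const std::vector<int>& src, int dstWidth )
{
    std::vector<int> dst;
    if (src.empty() || dstWidth <= 0) return dst;
    // source pixels per target pixel, multiplied by 1024
    const int rw = ((int)src.size() << 10) / dstWidth;
    for( int x = 0; x < dstWidth; x++ )
    {
        const int u = (x * rw) >> 10; // drop the 10 fractional bits again
        dst.push_back( src[u] );
    }
    return dst;
}
```

Scaling a 4-pixel row to 8 pixels gives `rw = 512`, so `u` advances by one every two target pixels. Note that `x * rw` must fit in an `int`, which limits the sprite sizes for which 10 fractional bits are safe.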

slide-13
SLIDE 13

Low Level Optimization of DrawScaled

Now what?

  • Plot multiple pixels at a time?

How many different ball sizes do we encounter?

…Why don’t we simply precalculate those frames?

slide-14
SLIDE 14

“More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason – including blind stupidity.” (W.A. Wulf)

slide-15
SLIDE 15

High Level Optimization of DrawScaled

Sprite* scaled[64];

void Game::Init()
{
    ...
    for( int i = 0; i < 64; i++ )
    {
        int size = i + 1;
        scaled[i] = new Sprite( new Surface( size, size ), 1 );
        scaled[i]->GetSurface()->Clear( 0 );
        m_Dot->DrawScaled( 0, 0, size, size, scaled[i]->GetSurface() );
    }
}

// when drawing a dot (note: scaled[i] holds the frame of size i + 1,
// so the index used here must follow the same convention)
scaled[size]->Draw( (sx - size / 2), (sy - size / 2), screen );

slide-16
SLIDE 16

Repeated Profiling

ms per frame    256      1024     4096     16384
Transform       0.002    0.005    0.016    0.061
Sort            0.014    0.063    0.305    1.569
Render (old)    0.650    1.420    5.130    19.681
Render (new)    0.350    0.720    1.977    6.383

slide-17
SLIDE 17

Optimization of Dense Clouds

Observation: beyond a certain dot count, a large number of particles is occluded. Specifically, we won’t be able to see the back half.

if (m_Rotated[i].z > -0.2f) scaled[size]->Draw( (sx - size / 2), (sy - size / 2), screen );

(perhaps we could also limit rendering to the outer shell of the cloud?)

Rendering is now down to 4.8ms, and sorting is slowly becoming significant again: at 65536 dots, we get 4.7ms for sorting, 17.3ms for rendering.

slide-18
SLIDE 18

Low Level Optimization of DrawScaled

Extreme Optimization:

  • We simply generate a function that plots every pixel, without the need for a loop.

FILE* f = fopen( "drawfunc.h", "w" );
fprintf( f, "void Sprite::DrawBall( int x, int y, int size, Surface* target )\n" );
fprintf( f, "{\nuint* a = target->GetBuffer() + x + y * SCRWIDTH;\nswitch( size )\n{\n" );
for( int i = 0; i < 64; i++ )
{
    ...
    fprintf( f, "case %i:\n", size );
    for( int y = 0; y < size; y++) for( int x = 0; x < size; x++ )
    {
        int a = y * SCRWIDTH + x;
        if (scaled[i]->GetBuffer()[x + y * size] & 0xffffff)
            fprintf( f, "a[%i]=%i;\n", a, scaled[i]->GetBuffer()[x + y * size] & 0xffffff );
    }
    fprintf( f, "break;\n" );
}
fprintf( f, "}\n}\n" );
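The generator idea can be tried on a tiny sprite without the rest of the framework. The sketch below builds the text of one `case` in memory instead of writing it to a file; `EmitCase` and its parameters are hypothetical, with `pitch` playing the role of `SCRWIDTH`.

```cpp
#include <string>
#include <vector>

// Emit one 'case' of unrolled pixel writes for a size x size sprite.
// 'pixels' holds size * size colors; a color of 0 is transparent and is skipped.
// 'pitch' is the width of the target surface in pixels.
std::string EmitCase( const std::vector<unsigned>& pixels, int size, int pitch )
{
    std::string code = "case " + std::to_string( size ) + ":\n";
    for( int y = 0; y < size; y++ ) for( int x = 0; x < size; x++ )
    {
        const unsigned color = pixels[x + y * size] & 0xffffff;
        if (color == 0) continue; // transparent: no store is generated at all
        code += "a[" + std::to_string( y * pitch + x ) + "]=" + std::to_string( color ) + ";\n";
    }
    return code + "break;\n";
}
```

For a 2x2 sprite with two opaque pixels this yields two assignments with constant offsets and constant colors, so the generated function contains no loop, no transparency test and no sprite reads at run time.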

slide-19
SLIDE 19

Low Level Optimization of DrawScaled

The last optimization worked surprisingly well, yielding a final performance of: 65536 dots @ ~7ms (render time only).

Sorting is now definitely significant.

slide-20
SLIDE 20

Sorting in O(1)

For this specific situation, we can sort in O(1), i.e., independent of particle count.

Observation: dots do not move independently.

Intuition: why rotate 64k dots if you can rotate a single camera?

slide-21
SLIDE 21

Sorting in O(1)

slide-22
SLIDE 22

Sorting in O(1)

slide-23
SLIDE 23

Sorting in O(1)

slide-24
SLIDE 24

Sorting in O(1)

slide-25
SLIDE 25

Sorting in O(1)

slide-26
SLIDE 26

Sorting in O(1)

For each split:

  • Process nearest half first
  • Then farthest half
  • Recurse

Where ‘nearest’ is the side that the ‘camera’ is on.
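The diagrams for these slides did not survive, so the following is only one possible reading of the split rule above: build a spatial tree over the (static) dots once, then walk it each frame in an order that depends on which side of each split the camera is on. The kd-tree layout, the function names and the `vec3` stand-in are all assumptions of this sketch.

```cpp
#include <algorithm>
#include <vector>

struct vec3 { float x, y, z; }; // stand-in for the course's vec3 type
static float Axis( const vec3& p, int axis ) { return axis == 0 ? p.x : (axis == 1 ? p.y : p.z); }

// Build once: reorder 'idx' so that every range [first, last) is split at its
// median along 'axis', cycling through x, y, z on the way down.
void BuildTree( const std::vector<vec3>& pts, std::vector<int>& idx, int first, int last, int axis )
{
    if (last - first < 2) return;
    const int mid = (first + last) / 2;
    std::nth_element( idx.begin() + first, idx.begin() + mid, idx.begin() + last,
        [&]( int a, int b ) { return Axis( pts[a], axis ) < Axis( pts[b], axis ); } );
    BuildTree( pts, idx, first, mid, (axis + 1) % 3 );
    BuildTree( pts, idx, mid + 1, last, (axis + 1) % 3 );
}

// Per frame: emit the half on the camera's side first, then the split dot,
// then the other half, recursively.
void Traverse( const std::vector<vec3>& pts, const std::vector<int>& idx, int first, int last,
    int axis, const vec3& camera, std::vector<int>& order )
{
    if (first >= last) return;
    const int mid = (first + last) / 2, next = (axis + 1) % 3;
    const bool cameraOnLowSide = Axis( camera, axis ) < Axis( pts[idx[mid]], axis );
    if (cameraOnLowSide) Traverse( pts, idx, first, mid, next, camera, order );
    else Traverse( pts, idx, mid + 1, last, next, camera, order );
    order.push_back( idx[mid] );
    if (cameraOnLowSide) Traverse( pts, idx, mid + 1, last, next, camera, order );
    else Traverse( pts, idx, first, mid, next, camera, order );
}
```

No sort runs per frame in this scheme: the tree is built once, and only the camera position changes. The traversal itself still visits every dot, and reversing the two recursive calls gives the opposite (far to near) order.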

slide-27
SLIDE 27

Today’s Agenda:

  • DotCloud: profiling & high-level (1)
  • DotCloud: low-level and blind stupidity
  • DotCloud: high-level (2)
  • Billiards: high level
  • Digest
slide-28
SLIDE 28

Introducing Billiards

Application breakdown:

Game::Tick
For each ball:

  • Draw ball
  • Update position
  • Bounce off of boundaries
  • For each other ball:
      • Bounce off of peer

Clearly, we have O(N²) behavior here. How do we fix this efficiently?
slide-29
SLIDE 29

Profiling

Adding a timer:

timer t;
t.reset();

Visualizing timing result:

screen->Bar( 10, 10, 170, 20, 0 );
char r[128];
sprintf( r, "simulation time: %5.2f ms", t.elapsed() );
screen->Print( r, 12, 11, 0xffffff );

Apparently, we can check 2k balls against 2k balls in 4.5ms.
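The `timer` class itself is not shown on the slides. A minimal stand-in with the same assumed interface (`reset()` and an `elapsed()` that returns milliseconds) could be built on `std::chrono`:

```cpp
#include <chrono>

// Minimal stand-in for the 'timer' used above. The interface is assumed from
// its usage: reset() restarts the clock, elapsed() returns milliseconds.
struct timer
{
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    void reset() { start = std::chrono::steady_clock::now(); }
    float elapsed() const
    {
        return std::chrono::duration<float, std::milli>(
            std::chrono::steady_clock::now() - start ).count();
    }
};
```

`steady_clock` is used here because it never jumps backwards, which matters when measuring short intervals such as a single simulation step.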

slide-30
SLIDE 30

Low level optimization

Lots of square roots in the main loop:

  • Length
  • Normalize (that’s actually the same square root!)
  • Another length (not being used, cost?)
  • Normalize (this one is for accuracy reasons)
  • Normalize (also for accuracy reasons)

Improvements:

  • Use d² (the squared distance); only calculate the square root if we already know the balls are too close.
  • Do the ‘accuracy fix’ normalizations only if accuracy becomes problematic (e.g., squared length > 1 + epsilon).
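The first improvement can be sketched as follows, assuming two balls of equal radius `r`. The `vec2` struct and the `Collide` function are stand-ins written for this example, not the code from the Billiards application.

```cpp
#include <cmath>

struct vec2 { float x, y; }; // stand-in for the course's vec2 type

// Returns true when two balls of radius r overlap. The square root is only
// taken in that case, to produce the distance and the unit collision normal.
bool Collide( const vec2& a, const vec2& b, float r, vec2& normal, float& dist )
{
    const float dx = b.x - a.x, dy = b.y - a.y;
    const float d2 = dx * dx + dy * dy;                 // squared distance
    if (d2 >= 4 * r * r || d2 == 0) return false;       // early out: no sqrt needed
    dist = std::sqrt( d2 );
    normal = { dx / dist, dy / dist };
    return true;
}
```

Most pairs are far apart, so the common path is two subtractions, two multiplications, an addition and a compare.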

slide-31
SLIDE 31

High Level Optimization

Each ball checks every other ball (with a larger index, so in fact only half of the other balls on average). We need an efficient way to find nearby balls.

  • Quadtree?
  • Grid?

Grid it is: we will check the balls in the cell our ball is in, plus the direct neighbors.

Steps:

  1. Per frame, update the grid
  2. Per ball, determine relevant cells
  3. Loop over cells
  4. Loop over balls in cell

slide-32
SLIDE 32

High Level Optimization

Attempt 1: std::vector

// allocate grid
std::vector<int> grid[SCRHEIGHT / 16][SCRWIDTH / 16];
// fill the grid
for( int y = 0; y < SCRHEIGHT / 16; y++ ) for( int x = 0; x < SCRWIDTH / 16; x++ )
    grid[y][x].clear();
for( int i = 0; i < BALLCOUNT; i++ )
{
    int gx = CLAMP( (int)(pos[i].x / 16 ), 0, SCRWIDTH / 16 - 1 );
    int gy = CLAMP( (int)(pos[i].y / 16 ), 0, SCRHEIGHT / 16 - 1 );
    grid[gy][gx].push_back( i );
}

slide-33
SLIDE 33

High Level Optimization

Attempt 1: std::vector

// using the grid
int gx = CLAMP( (int)(pos[i].x / 16 ), 0, SCRWIDTH / 16 - 1 );
int gy = CLAMP( (int)(pos[i].y / 16 ), 0, SCRHEIGHT / 16 - 1 );
int gx1 = MAX( 0, gx - 1 ), gx2 = MIN( SCRWIDTH / 16 - 1, gx + 1 );
int gy1 = MAX( 0, gy - 1 ), gy2 = MIN( SCRHEIGHT / 16 - 1, gy + 1 );
for( int y = gy1; y <= gy2; y++ ) for( int x = gx1; x <= gx2; x++ )
    for( int k = 0; k < grid[y][x].size(); k++ )
    {
        int j = grid[y][x][k];
        if (j <= i) continue;
        // ... (the rest of the loop body is not shown on the slide)
    }

slide-34
SLIDE 34

High Level Optimization

Attempt 1: std::vector

Result: simulation time went down from 4.5ms to 1.17ms (although performance is very unstable).

slide-35
SLIDE 35

High Level Optimization

Attempt 2: regular arrays

Assumption: in a 16x16 grid cell, we will never have more than 8 balls.

// allocate grid
int grid[SCRHEIGHT / 16][SCRWIDTH / 16][8];
// allocate array for storing grid cell ball count
int nr[SCRHEIGHT / 16][SCRWIDTH / 16];
// fill the grid
memset( nr, 0, sizeof( nr ) );
for( int i = 0; i < BALLCOUNT; i++ )
{
    int gx = CLAMP( (int)(pos[i].x / 16 ), 0, SCRWIDTH / 16 - 1 );
    int gy = CLAMP( (int)(pos[i].y / 16 ), 0, SCRHEIGHT / 16 - 1 );
    // relies on the assumption above: a ninth ball in one cell writes out of bounds
    grid[gy][gx][nr[gy][gx]++] = i;
}

slide-36
SLIDE 36

High Level Optimization

Attempt 2: regular arrays

// using the grid
int gx = CLAMP( (int)(pos[i].x / 16 ), 0, SCRWIDTH / 16 - 1 );
int gy = CLAMP( (int)(pos[i].y / 16 ), 0, SCRHEIGHT / 16 - 1 );
int gx1 = MAX( 0, gx - 1 ), gx2 = MIN( SCRWIDTH / 16 - 1, gx + 1 );
int gy1 = MAX( 0, gy - 1 ), gy2 = MIN( SCRHEIGHT / 16 - 1, gy + 1 );
for( int y = gy1; y <= gy2; y++ ) for( int x = gx1; x <= gx2; x++ )
    for( int k = 0; k < nr[y][x]; k++ )
    {
        int j = grid[y][x][k];
        if (j <= i) continue;
        // ... (the rest of the loop body is not shown on the slide)
    }
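The fixed-capacity grid can be tried in isolation with a small self-contained sketch. The dimensions, the `Grid` struct and the `dropped` counter are assumptions made for this example; in particular, the counter is a guard added here so that a violated capacity assumption is reported rather than overflowing the array.

```cpp
#include <algorithm>
#include <vector>

struct vec2 { float x, y; }; // stand-in for the course's vec2 type

// Assumed dimensions for this sketch: an 8x8 grid of 16-pixel cells.
constexpr int CELL = 16, GW = 8, GH = 8, MAXPERCELL = 8;

struct Grid
{
    int cell[GH][GW][MAXPERCELL];   // ball indices per cell
    int nr[GH][GW];                 // number of balls stored per cell
    int dropped = 0;                // balls that did not fit in their cell

    static int CellX( const vec2& p ) { return std::min( GW - 1, std::max( 0, (int)(p.x / CELL) ) ); }
    static int CellY( const vec2& p ) { return std::min( GH - 1, std::max( 0, (int)(p.y / CELL) ) ); }

    void Fill( const std::vector<vec2>& pos )
    {
        dropped = 0;
        for( int y = 0; y < GH; y++ ) for( int x = 0; x < GW; x++ ) nr[y][x] = 0;
        for( int i = 0; i < (int)pos.size(); i++ )
        {
            const int gx = CellX( pos[i] ), gy = CellY( pos[i] );
            if (nr[gy][gx] < MAXPERCELL) cell[gy][gx][nr[gy][gx]++] = i;
            else dropped++; // capacity assumption violated: count it, do not overflow
        }
    }

    // Number of (i, j) pairs with j > i examined by the 3x3 neighbourhood test.
    int CandidatePairs( const std::vector<vec2>& pos ) const
    {
        int pairs = 0;
        for( int i = 0; i < (int)pos.size(); i++ )
        {
            const int gx = CellX( pos[i] ), gy = CellY( pos[i] );
            for( int y = std::max( 0, gy - 1 ); y <= std::min( GH - 1, gy + 1 ); y++ )
                for( int x = std::max( 0, gx - 1 ); x <= std::min( GW - 1, gx + 1 ); x++ )
                    for( int k = 0; k < nr[y][x]; k++ ) if (cell[y][x][k] > i) pairs++;
        }
        return pairs;
    }
};
```

With three balls, two of them in adjacent cells and one far away, only a single pair is examined instead of the three pairs the all-against-all loop would test.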

slide-37
SLIDE 37

High Level Optimization

Attempt 2: regular arrays

Result: simulation time went down from 1.17ms to 1.07ms (but still quite unstable).

…Why is it unstable?

After fixing this, we are now at 0.45ms, a 10x improvement over the original 4.5ms.

slide-38
SLIDE 38

Data Locality

The access pattern is currently as follows:

For every ball
    For every cell in the neighborhood
        Check every ball in the cell

  • Each individual ball will be in a random cell.
  • Drawing the balls will access the screen randomly.

We can improve locality by looping over cells in the outer loop.
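A cell-major loop order might look like the sketch below. The flat `cells` layout (one index list per cell, stored row by row) and the function name are assumptions for illustration; the slides do not show the final code.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// cells[y * gw + x] lists the ball indices in that cell (hypothetical layout).
// Visits each unordered pair of balls in the same or adjacent cells exactly once.
std::vector<std::pair<int, int>> PairsCellMajor(
    const std::vector<std::vector<int>>& cells, int gw, int gh )
{
    std::vector<std::pair<int, int>> pairs;
    for( int cy = 0; cy < gh; cy++ ) for( int cx = 0; cx < gw; cx++ )   // outer loop: cells
        for( int i : cells[cy * gw + cx] )                              // balls in this cell
            for( int y = std::max( 0, cy - 1 ); y <= std::min( gh - 1, cy + 1 ); y++ )
                for( int x = std::max( 0, cx - 1 ); x <= std::min( gw - 1, cx + 1 ); x++ )
                    for( int j : cells[y * gw + x] )
                        if (j > i) pairs.push_back( { i, j } );
    return pairs;
}
```

Consecutive iterations now touch the same handful of cells, so both the grid data and (when drawing in this order) the screen are accessed in a more coherent pattern than with a ball-major loop.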

slide-39
SLIDE 39

More Profiling

Currently, our bottlenecks are:

  • Clear
  • Sprite::Draw
  • vec2::length

Also note that the three for loops take quite a bit of time.

slide-40
SLIDE 40

Today’s Agenda:

  • DotCloud: profiling & high-level (1)
  • DotCloud: low-level and blind stupidity
  • DotCloud: high-level (2)
  • Billiards: high level
  • Digest
slide-41
SLIDE 41

High Level Optimization

High level optimization:

  1. Reducing the algorithmic complexity of bottlenecks;
  2. Replacing inefficient algorithms.

High level optimization almost always yields the biggest gains in performance. Typical approach:

  • Divide and conquer: spatial subdivision / object subdivision
  • Preventing work for a group of input elements
  • Use your intuition: does this for loop really need every iteration?

Digest

slide-42
SLIDE 42

Low Level Optimization

Use that list of common opportunities!

  • Expensive operations
  • Powers of two enable bitshifts, masks (for cheap modulos), …
  • Lookup tables
  • Branching is evil
  • Late in / early out
  • Work around excessive type conversion

And:

  • Mind memory access patterns: strive for linear
  • Watch for static expressions inside a loop
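As a concrete instance of the powers-of-two item, a modulo or division by a power of two can be replaced by a mask or a shift for unsigned values. The constant and function names below are made up for this sketch.

```cpp
// For a power-of-two SIZE and unsigned x:
//   x % SIZE == x & (SIZE - 1)
//   x / 16   == x >> 4
constexpr unsigned SIZE = 256; // assumed table size; must be a power of two
static_assert( (SIZE & (SIZE - 1)) == 0, "SIZE must be a power of two" );

inline unsigned Wrap( unsigned x ) { return x & (SIZE - 1); }   // same as x % SIZE
inline unsigned CellOf( unsigned x ) { return x >> 4; }         // same as x / 16
```

Compilers usually apply this rewrite themselves when the divisor is a compile-time constant and the operand is unsigned; writing it out matters mostly for signed values or sizes only known at run time.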

slide-43
SLIDE 43

/INFOMOV/ END of “Practical”

next lecture: “Presentations”