Programming for Performance

Textbook Definition of Real-time
A Real-time System responds in a (timely) predictable way to unpredictable external stimuli arrivals. A system is a real-time system when it can support the execution of applications with time constraints on that execution.
Games face the same challenges on a smaller scale
– Hard real-time: any lateness of results is unacceptable
– Soft real-time: occasional lateness is not a total system failure, but there is a rising cost of lateness
– No system in a video game is hard real-time – Failures obviously aren’t as bad as in many real-time systems
– Hardware consumes sound data at 44.1 kHz (stereo) – Any amount of dropout is very bad – Can’t extrapolate to fill in the missing sound data
– Sound must correlate with visual or input events
– 60 fps (frames per second) is ideal – 20 fps is okay – 5 fps is no fun at all – Some games are more sensitive (FPS, fighters)
– Latency (individual operation) – Throughput (individual operation) – Framerate – CPU/GPU utilization
– Time from initiation of DVD read to time the head is placed over the correct track: up to 200 ms
– Display, sound, input: 10-50 ms – Network: 300 ms
– Wireless controllers, wireless headphones, motion smoothing on TVs
Throughput measures work per unit of time
– Most standard computing performance measures (TFLOPS, etc.) – Amount of data that can be read from an Xbox 360 DVD in one second: 6-15 MB – Vertex or pixel processing rate
Consider both latency and throughput when measuring performance
– CPU example: deep pipelines to increase clock rate – GPU example: triangle throughput vs. state change latency – Don’t concentrate solely on one to the detriment of another
– Optimising for throughput alone may make the controls feel sluggish
– Measure frame time in milliseconds per frame (e.g. 33 ms for 30 fps) rather than frames per second; work on one frame can overlap the next
– Because of external constraints (e.g. vsync), different systems may be running for different portions of the frame
– CPU/GPU utilization example: a 33 ms frame where both are running for 30 ms
– You are leaving quality on the table, could get either better performance or more stuff by balancing better
– Want to balance utilization of cores as well as possible
– Peak performance: good for selling things, but not useful for optimisation
– Worst case: must use this to ensure the application always performs better than its lower bound
– Average case: a good indicator, but can be misleading if the performance can spike
– Record per frame rate over many frames, plot the results in a spreadsheet to look for trouble areas or areas of high visibility – Helps if gameplay session can be repeatable (journaling)
– Smooth
– Responsive
– Consistent
– Display rate – Sound latency – Controller response – Load time – Network latency
– Online FPS with a 1000 ms ping
– Memory optimisation
– Designers always want more than you can provide – This puts positive pressure on the programmer to improve the system
– Must out-do previous title, competition
– A game that only works on today’s state-of-the-art hardware may shut out a large portion of your audience (and sales)
– Richer content – Faster, tighter controls – Higher game reviews
– Optimising promotes understanding
– Takes more time to develop
– Compilers can and will beat you some (most?) of the time – Maintainability / readability suffers (even without Assembly) – Portability sacrificed – Hard to debug – Easy to be fooled
– Lost opportunity
– Sometimes slow code is simply buggy; you can improve your speed just by fixing it
– If the code is not a bottleneck, you don't need to worry about it
– Can waste a lot of time optimizing things that don't matter
– Some programming techniques can hide performance issues where you can't find them
– Language features and hardware quirks are common culprits here, since they are resistant to many profiling techniques – So are over-designed and needlessly abstract systems
– Consider the performance costs of design choices up front
– Don't defer all performance thinking until things are nearing completion
– Find performance bottlenecks – Fix them – Repeat
– Finding bottlenecks takes both intuition and measurement
– Have an overall understanding of the problem to be solved – Carefully consider algorithms and data structures – Understand how the compiler translates your code, and how the computer executes it – Identify performance bottlenecks – Eliminate them using the appropriate level of optimisation
– How long do I have to work on this? – Has this been solved before? (yes!)
– What are the characteristics of the data?
– What can be computed offline? – Is there a simpler problem lurking within? – Can the hardware help me?
– A bubble-sort in hand-tweaked assembly is still slow – Have a toolkit of good general purpose algorithms developed by smart people
– In practice, we are less formal about it – Remember that ‘n’ and ‘c’ matter in real code! – We care more about the particularities of compilers and hardware
– Helps if you are familiar with the algorithm/code – Don’t trust it alone though!
– Measure performance to find hot spots – Many tools available:
– Profiling exhibits some quantum uncertainty: you can't always measure performance without changing the behaviour you are measuring
Build performance displays directly into the game:
– Frame rate counter – Rendering statistics
– Memory used per pool – Network ping time – Collision tests per frame – Anything else that is interesting
– Toggle subsystems to measure each one's contribution to the frame rate:
– Disable parts of the renderer
– Turn off sound – Turn off collision
– Being able to switch subsystems off cleanly is your modular architecture paying off
Instrumenting profilers: – Place code at the beginning and end of every function to record timings – Give accurate tallies of function frequency, total function time, etc. – Record the call graph – Intrusive, since code is changed
Sampling profilers: – At regular intervals (e.g. 1 ms), the current PC is recorded – Later, the samples are tallied and cross-referenced with the source code – Fast, non-intrusive – Works on non-instrumented code
– Less accurate: events can be missed – No call graph available
– System library calls – Context switches or other thread events – OpenGL calls
– Thread map – Replay graphics driver calls to generate detail profiling info
– Compilers are sophisticated but dumb – Narrow view of program at any given time – No concept of “the problem”
Learn to play to your compiler's strengths
– Don't get paranoid though, C++ is plenty fast enough for games, as long as you are aware of its limitations
– We'll talk about this in the next lecture
Modern hardware: – Fast CPU – Deep pipelines – Slow memory
Common sources of slowdown: – Poor cache locality – Unpredictable branching – High memory usage
– Memory access costs are largely invisible in source code and to compilers
Consider this:
– Most modern CPUs have L2 cache miss latencies in the range of 100-300 cycles. – If you can compute the value in 50 cycles that would have been read from main memory, you've gained 2-6X performance – There are many opportunities to do this
– Profile, profile, profile!
– Optimisation is very non-linear – It takes a big bag-of-tricks to be effective – Practice!
– Hand-tweaked assembly can beat the compiler by a factor of 100 in some cases – This takes a clear understanding of all factors to succeed – Spending days only to have the compiler beat you is no fun
– Static lighting (light maps) – Potentially visible set (PVS) calculation
“All programming is an exercise in caching”
Caching stores computed results for quick retrieval
– Examples range from hardware caches to a game's inventory cache
– If you are lucky, the result isn’t needed at all
– Store dirty flags – Copy-on-write – Operating systems
– e.g. the operating system copies a shared memory page only when a process modifies a given page (copy-on-write)
– Small == fast – Fit into cache line width – Walk linearly
– Better cache utilization if only touching certain fields – SoA may be better for SIMD
// Array of structures (AoS):
struct {
    float x, y, z;
    float dx, dy, dz;
    float age;
} particles[10];

// Structure of arrays (SoA), alternative layout:
struct {
    float x[10], y[10], z[10];
    float dx[10], dy[10], dz[10];
    float age[10];
} particles;
if (OnScreen(object.BoundingSphere())) {
    // only do the expensive per-object work when the object is visible
}
– Simulate gravity, but not collisions, for particles – Render at a lower resolution and scale up – Use Taylor Series or other mathematical approximations
– Often opens the way for pre-computation
– Calculate properties in vertex shader and interpolate, rather than calculating in pixel shader – Store animation keyframes and linearly interpolate, rather than calculating an animation curve at each point
– Divide the problem into pieces and handle each independently
– Binary search – Quicksort – BSP trees or other spatial hierarchy
– Divide-and-conquer solutions can often be parallelized
a = a / 16; // divide (≈40 cycles)
a = a >> 4; // shift (≈1 cycle); equivalent only for unsigned (non-negative) values
– SIMD: perform the same operation on many pieces of data at once
– Vector and matrix operations – Particle systems – Skinning
– Unless you know how to build your own chips
– Some hardware operations cannot be expressed in C++:
– Conditional writes, bit rotates, cache prefetches, etc. – Different register preservation semantics – Jump tables
– Offload data-parallel work to the GPU on modern hardware
– GPUs are significantly faster at this sort of thing
– No one is even shipping single-core phones any more
– Multithreading is the most important optimisation technique right now
– If you properly multithread a game, and you have 4 cores, you could quadruple the performance
– Heavyweight threads – Job based – Local optimization (OpenMP, OpenCL)
– Deadlocks, race conditions, memory stompage, etc
– Understand the fundamental performance characteristics of the systems you are implementing – Develop a repertoire of performance friendly techniques – Profile relentlessly – Become familiar with your compiler and hardware
– But remember that brute force can also deliver good performance
Rules of Optimisation (M. A. Jackson):
Rule 1: Don’t do it.
Rule 2 (experts only): Don’t do it yet.

“…premature optimisation is the root of all evil.” (Donald Knuth)