Runtime CPU Spike Detection using Manual and Compiler-Automated Instrumentation
Adisak Pochanayon
Principal Software Engineer Netherrealm Studios adisak@wbgames.com
and Compiler-Automated Instrumentation Adisak Pochanayon Principal - - PowerPoint PPT Presentation
Runtime CPU Spike Detection using Manual and Compiler-Automated Instrumentation Adisak Pochanayon Principal Software Engineer Netherrealm Studios adisak@wbgames.com Topics to Cover This talk is about runtime instrumentation-based
Principal Software Engineer Netherrealm Studios adisak@wbgames.com
class CScopedMarker { CScopedMarker(ProfDesc &info) { StartMarker(info); } ~CScopedMarker(ProfDesc &info) { StopMarker(info); } }; #define ScopedMarker(INFO) CScopedMarker(INFO) \ ProfInfo##__LINE__
__wrap_function_name() __real_function_name()
CALLING FUNCTION TARGET FUNCTION
CALLING FUNCTION “WRAPPER” FUNCTION “REAL” FUNCTION
not require source markup of the target function.
CALLING FUNCTION TARGET FUNCTION
CALLING FUNCTION DETOUR FUNCTION TARGET FUNCTION JUMP
CALLING FUNCTION TARGET FUNCTION TARGET PROLOG Trampoline Buffer TARGET PROLOG COPY
CALLING FUNCTION TARGET FUNCTION TARGET PROLOG JUMP Trampoline Buffer TARGET PROLOG
CALLING FUNCTION TARGET FUNCTION TARGET PROLOG JUMP TARGET PROLOG
CALLING FUNCTION DETOUR FUNCTION TARGET FUNCTION JUMP TARGET PROLOG JUMP
– http://research.microsoft.com/en-us/projects/detours/
– http://research.microsoft.com/pubs/68568/huntusenixnt99.pdf
Visual C++ /callcap and /fastcap, GNU gprof (w/ gcc –pg)
FUNCTION BODY PROLOG EPILOG
Compiler Automated Instrumentation
FUNCTION BODY PROLOG EPILOG Log Entry { _penter() __cyg_profile_func_enter () } Log Exit { _pexit() __cyg_profile_func_exit () }
void __cyg_profile_func_enter (void *this_fn, void *call_site); void __cyg_profile_func_exit (void *this_fn,void *call_site);
extern "C" void __declspec(naked) _cdecl _penter( void ) { _asm { push eax push ebx push ecx push edx push ebp push edi push esi } if(0==tls_PET_bIsInProcessing) { tls_PET_bIsInProcessing=true; // Call C Work Function _internal_PET_LogEntry(0); tls_PET_bIsInProcessing=false; } _asm { pop esi pop edi pop ebp pop edx pop ecx pop ebx pop eax ret } }
void __declspec(naked) _cdecl __penter( void ) { __asm { // Tiny Prolog // - Set link register (r12) & return address (two steps) std r12,-20h(r1) // Saving LR here is extra step ! mflr r12 stw r12,-8h(r1) // Return Address bl PET_prolog bl _internal_PET_LogEntry b PET_epilog } }
“C++” Instrumented Function _penter() {asm}
“C++” Instrumented Function _penter() {asm} PET_Prolog {asm}
“C++” Instrumented Function _penter() {asm} PET_Prolog {asm} PET_Prolog Early Out {asm}
“C++” Instrumented Function _penter() {asm} PET_Prolog {asm}
“C++” Profiling Routine for Logging Function Entry
“C++” Instrumented Function _penter() {asm} PET_Prolog {asm}
PET_Epilog {asm} “C++” Profiling Routine for Logging Function Entry
“C++” Instrumented Function _penter() {asm} PET_Prolog {asm} PET_Prolog Early Out {asm}
PET_Epilog {asm} “C++” Profiling Routine for Logging Function Entry
// Get the TLS thread-specific base lwz r11,0(r13) // Do not try to run in DPC! // In DPC { 0(r13) == 0 } cmplwi cr6,r11,0 beq cr6,label__early_exit_prolog
lau r14,_tls_start // Get the TLS global base lau r12,tls_PET_bIsInProcessing lal r14,r14,_tls_start lal r12,r12,tls_PET_bIsInProcessing sub r11,r11,r14 // TLS Base Offset (r11) add r14,r11,r12 // r14 == &tls_PET_bIsInProcessing // Avoid recursion using thread variable tls_PET_bIsInProcessing lwzx r12,r11,r12 cmplwi cr6,r12,0 bne cr6,label__early_exit_prolog li r12,1 stw r12,0(r14) // Set tls_PET_bIsInProcessing
That may have looked complicated but it’s actually pretty
if(tls_PET_bIsInProcessing) goto label__early_exit_prolog; tls_PET_bIsInProcessing=true;
// Save r0/r2-r10 (temporaries) std r0,8h(r1) std r2,10h(r1) // (r2 is reserved on XBOX 360) std r3,18h(r1) std r4,20h(r1) std r5,28h(r1) std r6,30h(r1) std r7,38h(r1) std r8,40h(r1) std r9,48h(r1) std r10,50h(r1) blr // Return To Caller
label__early_exit_prolog: // Tiny Epilog -- adjust stack (r1) & restore r12/r14/r11 addi r1,r1,100h lwz r12,-8h(r1) mtlr r12 ld r12,-20h(r1) ld r14,-28h(r1) ld r11,-30h(r1) blr
“C++” Instrumented Function _pexit() {asm} PET_Prolog {asm} PET_Prolog Early Out {asm}
PET_Epilog {asm} “C++” Profiling Routine for Logging Function Exit
1. time entered 2.
3.
4. description (most of these are optional)
– function address – function name (pointer to static) – line number – source file name (pointer to static) – user generated description (dynamic string)
Why threshold functions? Lets say you set the global threshold to 1ms to find any function that takes over 1ms. Your Game::Tick() function for a game at 60 fps will take 16.66ms so to avoid kicking a spike warning, you can insert this markup into Game::Tick() :
PET_FUNCTION(); PET_SetFunctionThresholdMSec(1000.0/MIN_FPS);
Script Functions (0.1ms) Global Threshhold for Most Game Functions (1ms) Tick Functions (1 Frame = 16 ms) File I/O Functions (2 s)
Threshold (OT).
void ExecuteScript(ScriptContext *sc) { PET_FUNCTION(); sc->ExecuteStep(); PET_OTMessagef("Script function: %s", sc->GetCurrentFunctionName()); }
void ExecuteScript(ScriptContext *sc) { PET_FUNCTION(); PET_SetFunctionThresholdMSec(0.1); PET_SetFunctionDescf("Script function: %s", sc->GetCurrentFunctionName()); sc->ExecuteStep(); }
Control if Profiling is Active PET_SetThreadActive() PET_FUNCTION_IGNORE() PET_FUNCTION_IGNORE_CHILDREN() PET_Pause() PET_Unpause() Conditional Profiling (can allow "channels") PET_FUNCTION_CONDITIONAL(cond) PET_FUNCTION_CONDITIONAL_PAUSED(cond) PET_FUNCTION_CONDITIONAL_IGNORE_CHILDREN(cond) PET_FUNCTION_CONDITIONAL_IGNORE(cond)
PET: Function took too long: 11.302 MSec Thread: MainThread PEAK CHILD @ frame 2323 Function Name Trace 0) 0x82449eb4 - main 1) 0x824499c8 - GuardedMain 2) 0x82444c80 - FEngineLoop::Tick 3) 0x82440ddc - UMK9GameEngine::Tick 4) 0x82d6d39c - MKScriptVM::Tick 5) 0x82d6f2dc - MKListNoDestroy::ForEach 6) 0x82d6cd54 - MKScriptVM::Step 7) 0x82d690ac - _call_c_function 8) 0x82440ddc - CreateEnduranceOpponentPhase1 9) 0x82440ddc - CreatePlayerPhase1 10) 0x82440ddc - SpawnMKFGCharacterObj 11) 0x82440ddc – CreateMeshes PET: Thread (MainThread) Function took too long: 14.499 MSec 10) 0x82440ddc - SpawnMKFGCharacterObj PET: Thread (MainThread) Function took too long: 14.599 MSec 9) 0x82440ddc - CreatePlayerPhase1 PET: Thread (MainThread) Function took too long: 15.097 MSec 8) 0x82440ddc - CreateEnduranceOpponentPhase1 !!!!!-----------------------------------------------------------!!!!! Slow Script->C Call: _CreateEnduranceOpponentPhase1 takes 16.576 ms (0.995 frames) to execute !!!!!-----------------------------------------------------------!!!!! PET: Thread (MainThread) Function took too long: 15.377 MSec 7) 0x82d690ac - _call_c_function PET: Thread (MainThread) Function took too long: 15.401 MSec 6) 0x82d6cd54 - MKScriptVM::Step PET: Thread (MainThread) Function took too long: 15.490 MSec 5) 0x82d6f2dc - MKListNoDestroy::ForEach PET: Thread (MainThread) Function took too long: 15.578 MSec 4) 0x82d6d39c - MKScriptVM::Tick
PET: Function took too long: 8.586 MSec Thread: MainThread PEAK CHILD @ frame 2422 Function Name Trace 0) Tick (LaunchEngineLoop.cpp - LINE: 1981) 1) Tick (MK9Game.cpp - LINE: 499) 2) Tick (ScriptCore.cpp - LINE: 908) 3) Step (ScriptCore.cpp - LINE: 440) 4) _call_c_function (ScriptCore.cpp - LINE: 4194) 5) CreateEnduranceOpponentPhase1 (FGPlayer.cpp - LINE: 881) 6) CreatePlayerPhase1 (FGPlayer.cpp - LINE: 527) 7) SpawnMKFGCharacterObj (FGPlayer.cpp - LINE: 243) 8) CreateMeshes (FGPlayer.cpp - LINE: 1088) 9) SetSkeletalMesh (UnSkeletalComponent.cpp - LINE: 3465) PET: Thread (MainThread) Function took too long: 12.851 MSec 8) CreateMeshes (FGPlayer.cpp - LINE: 1088) PET: Thread (MainThread) Function took too long: 15.429 MSec 7) SpawnMKFGCharacterObj (FGPlayer.cpp - LINE: 243) PET: Thread (MainThread) Function took too long: 15.522 MSec 6) CreatePlayerPhase1 (FGPlayer.cpp - LINE: 527) PET: Thread (MainThread) Function took too long: 15.869 MSec 5) CreateEnduranceOpponentPhase1 (FGPlayer.cpp - LINE: 881) !!!!!-----------------------------------------------------------!!!!! Slow Script->C Call: _CreateEnduranceOpponentPhase1 takes 16.446 ms (0.987 frames) to execute !!!!!-----------------------------------------------------------!!!!! PET: Thread (MainThread) Function took too long: 16.015 MSec 4) _call_c_function (ScriptCore.cpp - LINE: 4194) PET: Thread (MainThread) Function took too long: 16.035 MSec 3) Step (ScriptCore.cpp - LINE: 440) PET: Thread (MainThread) Function took too long: 16.200 MSec 2) Tick (ScriptCore.cpp - LINE: 908)
PET_Pause, PET_Ignore, or appropriate thresholds
– Slow but useful results which were verified in RAZOR – Divide and Conquer / Instrument Children of Slow functions
– Turn on CAI Mode / Execute slow move – A bunch of functions get logged (instrument if desired) – Much faster at locating issues (automatic)
children functions AND give script context info