
Can We Understand Performance Counter Results? Vince Weaver, ICL



  1. Can We Understand Performance Counter Results?
     Vince Weaver
     ICL Lunch Talk, 23 July 2010

  2. How Do We Know if Counters are Working?
     Three common failures:
     • Wrong counter (PAPI, Kernel, User)
     • Counter works but gives wrong values
     • Counter is giving “right” values but documentation is wrong

  3. Deterministic Events Easiest to Validate
     • Retired Instructions
     • Retired Branches
     • Retired Loads and Stores
     • Retired Multiplies and Divides
     • Retired µops
     • Retired Floating Point and SSE
     • Other (fxch, cpuid, move operations, serializing instructions, memory barriers, and not-taken branches)

  4. Ideal Deterministic Events
     • Results are the same run-to-run
     • Event is frequent enough to be useful
     • The expected count can easily be determined by code inspection
     • Available on many processors
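The first criterion, identical results from run to run, can be checked mechanically. The following is a minimal sketch in C, assuming the counts from repeated runs of the same code have already been collected into an array. The helper name `counts_are_deterministic` is hypothetical and does not come from the talk.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if every run reported the same count, 0 otherwise.
 * Hypothetical helper: the talk does not show how runs were compared. */
int counts_are_deterministic(const long long *counts, size_t n_runs)
{
    assert(counts != NULL && n_runs > 0);
    for (size_t i = 1; i < n_runs; i++) {
        if (counts[i] != counts[0]) {
            return 0;
        }
    }
    return 1;
}
```

An event that passes this check still has to match the count obtained by code inspection before it can be called validated.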

  5. Retired Instruction Overcount
     [Figure: three panels, one each for a 550MHz Pentium III, a 2.2GHz Phenom, and a 2.8GHz Pentium 4. Axis: Estimated Timer Frequency (Hz), with marks at 100Hz, 250Hz, and 1000Hz. Labeled benchmarks include 253.perlbmk.535, 253.perlbmk.704, 253.perlbmk.957, 176.gcc.expr, 176.gcc.200, 176.gcc.scilab, and 176.gcc.166.]
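One plausible reading of the "Estimated Timer Frequency" axis is that the overcount is divided by the run time, since slide 7 attributes one extra count to each hardware interrupt. The sketch below works under that assumption only. It is a guess at the relationship, not the method used in the talk, and the function name `estimated_timer_hz` is hypothetical.

```c
#include <assert.h>

/* Assumption: each timer interrupt adds one to the retired-instruction
 * count, so overcount / runtime approximates the timer tick rate. */
double estimated_timer_hz(long long overcount, double runtime_seconds)
{
    assert(runtime_seconds > 0.0);
    return (double)overcount / runtime_seconds;
}
```

For example, an overcount of 25,000 over a 100-second run would suggest a 250Hz timer under this model.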

  6. Tracking Down the Source of Overcounts
     • Work backward from existing benchmarks?
     • Assembly Language!

  7. Contributors to Instruction Count on x86_64
     • Expected count
     • +1 for every hardware interrupt
     • +1 for each memory page touched
     • +1 for the first floating point instruction
     • Processor errata
     • Undocumented processor quirks

  8. Retired Instruction Results
     Machine       Raw Results         Adjusted Results   Adjustments Made
     Expected      226,990,030         226,990,030
     Core2         10,793 ± 40         12 ± 1             HW Int
     Atom          11,601 ± 495        -43 ± 12           HW Int
     Nehalem       11,794 ± 1316       2 ± 7              HW Int
     Nehalem-EX    11,915 ± 9          6 ± 2              HW Int
     Pentium D R   2,610,571 ± 8       200,561 ± 8        Instr Double Counts
     Pentium D C   10,794 ± 28         -52 ± 5            HW Int
     Phenom        310,601 ± 11        11 ± 0             HW Int, FP Except
     Istanbul      311,830 ± 78        9 ± 1              HW Int, FP Except
     Pin           2.51868e9 ± 0       0 ± 0              Count rep string as 1
     Qemu          -16,410,000 ± 0
     Valgrind      -6,909,896 ± 0

  9. Retired Stores Results
     Machine       Raw Results         Adjusted Results   Adjustments Made
     Expected      24,060,000          24,060,000
     Core2         0 ± 0               0 ± 0
     Atom          n/a                 n/a
     Nehalem       411,632 ± 1483      410,014 ± 1        HW Int
     Nehalem-EX    411,914 ± 6         410,018 ± 1        HW Int
     Pentium D     -12,880,000 ± 0
     Phenom        n/a                 n/a
     Istanbul      n/a                 n/a
     Pin           802,180,000 ± 0     980,000 ± 0        Count rep string as 1
     Qemu          n/a                 n/a
     Valgrind      -7,542,176 ± 0

  10. Retired Floating Point
      Machine       FP1                  FP2                     SSE
      Core2         73,500,376 ± 140     40,299,997 ± 0          23,200,000 ± 0
      Atom          38,800,000 ± 0       0 ± 0                   88,299,597 ± 792
      Nehalem       50,150,648 ± 140     17,199,998 ± 1          24,201,639 ± 957
      Nehalem-EX    50,155,704 ± 562     17,199,998 ± 2          24,007,005 ± 197,401
      Pentium D     100,400,262 ± 9      140,940,555 ± 39,287    53,149,435 ± 522,879
      Phenom        26,600,001 ± 0       112,700,001 ± 0         15,800,000 ± 0
      Istanbul      26,600,001 ± 0       112,700,001 ± 0         15,800,000 ± 0

  11. Other Architectures
      • ARM – Cannot select only userspace events
      • ia64 – Loads, Stores, Instructions all deterministic
      • POWER – On Power6, Instructions is deterministic, Branches is not
      • SPARC – On Niagara1, Instructions is deterministic

  12. Non-deterministic Events
      • Cache and Memory Related
      • Branch Predictor
      • Cycles
      • Stalls

  13-21. Simplistic Cache Model
      [Figure, shown across nine slides: cache entries labeled 0, 1, 2, 3, 4, 5, ..., 511]
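A model of this kind can be simulated in a few lines. The sketch below assumes a direct-mapped cache with 512 entries, matching the 0 to 511 labels in the figure. The 64-byte line size, the direct mapping, and all names (`simple_cache`, `cache_init`, `cache_access`) are assumptions made for illustration and are not stated on the slides.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_LINES  512  /* matches the 0..511 labels in the figure */
#define LINE_BYTES 64   /* assumed line size; not stated on the slides */

struct simple_cache {
    uint64_t tag[NUM_LINES];
    int valid[NUM_LINES];
    long long hits;
    long long misses;
};

/* Mark every entry invalid and zero the counters. */
void cache_init(struct simple_cache *c)
{
    assert(c != NULL);
    memset(c, 0, sizeof *c);
}

/* Simulate one access to a byte address in a direct-mapped cache.
 * Returns 1 on a hit, 0 on a miss (the line is then filled). */
int cache_access(struct simple_cache *c, uint64_t addr)
{
    assert(c != NULL);
    uint64_t block = addr / LINE_BYTES;
    uint64_t index = block % NUM_LINES;
    uint64_t tag = block / NUM_LINES;

    if (c->valid[index] && c->tag[index] == tag) {
        c->hits++;
        return 1;
    }
    c->valid[index] = 1;
    c->tag[index] = tag;
    c->misses++;
    return 0;
}
```

Real caches add associativity, prefetching, and replacement policies, which is part of why measured cache events are harder to predict than a model like this suggests.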

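  22. L1 Data Cache Accesses
      #include <papi.h>

      /* events and counts are not declared on the slide; PAPI_L1_DCA is
         assumed here from the slide title */
      int events[1] = { PAPI_L1_DCA };
      long long counts[1];
      float array[1000], sum = 0.0;

      PAPI_start_counters(events, 1);
      for (int i = 0; i < 1000; i++) {
          sum += array[i];
      }
      PAPI_stop_counters(counts, 1);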
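For a loop like this one, a first-order expectation can be worked out by hand before looking at the counter. The sketch below assumes the compiler keeps `sum` and `i` in registers, so that each iteration issues one load, and that the array starts on a cache-line boundary. Both function names are hypothetical, and the 64-byte line size in the example is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Loads of array[i] issued by the loop: one per iteration, assuming
 * sum and i stay in registers. */
long long expected_array_loads(size_t n_elems)
{
    return (long long)n_elems;
}

/* Distinct cache lines covered by the array, assuming the array
 * starts on a line boundary. */
long long expected_lines_touched(size_t n_elems, size_t elem_bytes,
                                 size_t line_bytes)
{
    assert(line_bytes > 0);
    size_t total_bytes = n_elems * elem_bytes;
    return (long long)((total_bytes + line_bytes - 1) / line_bytes);
}
```

With 1000 four-byte floats and 64-byte lines, this gives 1000 loads spread over 63 cache lines. A measured access count that differs from such an estimate points at one of the assumptions or at the counter itself.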
