Can We Understand Performance Counter Results?
Vince Weaver
ICL Lunch Talk, 23 July 2010
How Do We Know if Counters are Working?
Three common failures:
- Wrong counter (PAPI, Kernel, User)
- Counter works but gives wrong values
- Counter is giving “right” values but documentation is wrong
1
Deterministic Events Easiest to Validate
- Retired Instructions
- Retired Branches
- Retired Loads and Stores
- Retired Multiplies and Divides
- Retired µops
- Retired Floating Point and SSE
- Other (fxch, cpuid, move operations, serializing instructions, memory barriers, and not-taken branches)
2
Ideal Deterministic Events
- Results are same run-to-run
- Event is frequent enough to be useful
- The expected count can easily be determined by code inspection
- Available on many processors
3
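The first and third criteria above can be checked mechanically once several runs have been recorded. The following is a minimal sketch, not taken from the talk; the helper names `counts_are_deterministic` and `max_deviation` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true when every run reported exactly the same count. */
bool counts_are_deterministic(const long long *counts, size_t runs)
{
    for (size_t i = 1; i < runs; i++) {
        if (counts[i] != counts[0]) {
            return false;
        }
    }
    return true;
}

/* Hypothetical helper: largest absolute difference between any run and the
 * count expected from code inspection. */
long long max_deviation(const long long *counts, size_t runs, long long expected)
{
    long long worst = 0;
    for (size_t i = 0; i < runs; i++) {
        long long diff = counts[i] - expected;
        if (diff < 0) {
            diff = -diff;
        }
        if (diff > worst) {
            worst = diff;
        }
    }
    return worst;
}
```

An event that passes the first check but shows a nonzero deviation is repeatable yet offset from the expected value, which is a different failure from run-to-run noise.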
Retired Instruction Overcount
[Figure: three panels, one per machine, each with the axis "Estimated Timer Frequency (Hz)" and reference marks at 100Hz, 250Hz and 1000Hz.
Panel 550MHz Pentium III: 253.perlbmk.535, 253.perlbmk.704, 253.perlbmk.957.
Panel 2.2GHz Phenom: 253.perlbmk.704, 253.perlbmk.957, 176.gcc.200, 176.gcc.expr, 176.gcc.scilab.
Panel 2.8GHz Pentium 4: 253.perlbmk.535, 176.gcc.166.]
4
Tracking Down the Source of Overcounts
- Work backward from existing benchmarks?
- Assembly Language!
5
Contributors to Instruction Count on x86_64
- Expected count
- +1 for every hardware interrupt
- +1 for each memory page touched
- +1 for first floating point instruction
- Processor errata
- Undocumented processor quirks
6
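The three "+1" contributors listed above can be subtracted from a raw measurement to recover the expected count. This is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, and it assumes the interrupt and page counts are already known from some other source. Processor errata and undocumented quirks are not modeled.

```c
/* Hypothetical helper: remove the known +1 contributors from a raw
 * retired-instruction count. All parameter names are illustrative. */
long long adjusted_instruction_count(long long raw,
                                     long long hw_interrupts,
                                     long long pages_touched,
                                     int used_floating_point)
{
    long long adjusted = raw;
    adjusted -= hw_interrupts;               /* +1 for every hardware interrupt */
    adjusted -= pages_touched;               /* +1 for each memory page touched */
    adjusted -= used_floating_point ? 1 : 0; /* +1 for first FP instruction */
    return adjusted;
}
```

For example, a raw count that is 9 above expectation with 5 interrupts, 3 pages touched and floating point in use would be fully explained by these terms.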
Retired Instruction Results
machine       Raw Results      Adjusted Results   Adjustments Made
Expected      226,990,030      226,990,030
Core2         10,793±40        12±1               HW Int
Atom          11,601±495       -43±12             HW Int
Nehalem       11,794±1316      2±7                HW Int
Nehalem-EX    11,915±9         6±2                HW Int
Pentium D R   2,610,571±8      200,561±8          Instr Double Counts
Pentium D C   10,794±28        -52±5              HW Int
Phenom        310,601±11       11±0               HW Int, FP Except
Istanbul      311,830±78       9±1                HW Int, FP Except
Pin           2.51868e9±0      0±0                Count rep string as 1
Qemu          -16,410,000±0
Valgrind      -6,909,896±0
7
Retired Stores Results
machine       Raw Results      Adjusted Results   Adjustments Made
Expected      24,060,000       24,060,000
Core2         0±0              0±0
Atom          n/a              n/a
Nehalem       411,632±1483     410,014±1          HW Int
Nehalem-EX    411,914±6        410,018±1          HW Int
Pentium D     -12,880,000±0
Phenom        n/a              n/a
Istanbul      n/a              n/a
Pin           802,180,000±0    980,000±0          Count rep string as 1
Qemu          n/a              n/a
Valgrind      -7,542,176±0
8
Retired Floating Point
machine       FP1               FP2                  SSE
Core2         73,500,376±140    40,299,997±          23,200,000±
Atom          38,800,000±       0±                   88,299,597±792
Nehalem       50,150,648±140    17,199,998±1         24,201,639±957
Nehalem-EX    50,155,704±562    17,199,998±2         24,007,005±197,401
Pentium D     100,400,262±9     140,940,555±39,287   53,149,435±522,879
Phenom        26,600,001±       112,700,001±         15,800,000±
Istanbul      26,600,001±       112,700,001±         15,800,000±
9
Other Architectures
- ARM – Cannot select only userspace events
- ia64 – Loads, Stores, Instructions all deterministic
- POWER – On Power6 Instructions is deterministic, Branches is not
- SPARC – on Niagara1, Instructions is deterministic
10
Non-deterministic Events
- Cache and Memory Related
- Branch Predictor
- Cycles
- Stalls
11
Simplistic Cache Model
[Figure, animated across slides 12 to 20: diagram of a simple cache. Only the labels 1, 2, 3, 4, 5 and 511 survive extraction.]
L1 Data Cache Accesses
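#include <papi.h>

int events[1] = { PAPI_L1_DCA };  /* L1 data cache accesses */
long long counts[1];
float array[1000], sum = 0.0;

PAPI_start_counters(events, 1);
for (int i = 0; i < 1000; i++) {
    sum += array[i];
}
PAPI_stop_counters(counts, 1);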
21
PAPI_L1_DCA
[Figure: bar chart of L1 DCache accesses normalized against 1000. Machines: PPro, P4, Atom, Core2, Istanbul, Nehalem-EX, Nehalem gcc 4.1, Nehalem gcc 4.3. Y axis: Normalized Accesses, 1 to 6. The chart includes a "No Counter Available" marking.]
22
PAPI_L1_DCA
Expected Code
* 4020d8: f3 0f 58 00          addss (%rax),%xmm0
  4020dc: 48 83 c0 04          add $0x4,%rax
  4020e0: 48 39 d0             cmp %rdx,%rax
  4020e3: 75 f3                jne 4020d8 <main+0x328>
Unexpected Code
* 401e18: f3 0f 10 44 24 0c    movss 0xc(%rsp),%xmm0
* 401e1e: f3 0f 58 04 82       addss (%rdx,%rax,4),%xmm0
  401e23: 48 83 c0 01          add $0x1,%rax
  401e27: 48 3d e8 03 00 00    cmp $0x3e8,%rax
* 401e2d: f3 0f 11 44 24 0c    movss %xmm0,0xc(%rsp)
  401e33: 75 e3                jne 401e18 <main+0x398>
23
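The starred instructions in the two listings are the ones with a memory operand. The expected loop has one load per iteration. The unexpected loop has two loads and one store, because the running sum is reloaded from and written back to the stack slot at 0xc(%rsp). The arithmetic can be sketched as follows, assuming each starred instruction counts as exactly one L1 data cache access; the function names are hypothetical.

```c
/* Hypothetical helper: accesses expected for a loop, given the number of
 * loads and stores each iteration performs. */
long long loop_accesses(long long iterations, int loads_per_iter, int stores_per_iter)
{
    return iterations * (loads_per_iter + stores_per_iter);
}

/* Hypothetical helper: measured count relative to the expected count. */
double normalized_accesses(long long measured, long long expected)
{
    return (double)measured / (double)expected;
}
```

Under that assumption the expected loop gives `loop_accesses(1000, 1, 0)`, which is 1000, and the unexpected loop gives `loop_accesses(1000, 2, 1)`, which is 3000, for a normalized value of 3.0 even when the counter itself is behaving correctly.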
L1 Data Cache Misses
- Allocate array as big as L1 DCache
- Walk through the array byte-by-byte
- Count misses with PAPI_L1_DCM event
24
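The miss count this test should produce in an idealized cache can be estimated with a small simulation. This is a sketch under loud assumptions, not the talk's code: it models an initially empty direct-mapped cache, and the geometry used in the usage note (512 lines of 64 bytes) is an assumed example rather than a figure stated in the slides. It deliberately ignores hardware prefetching, replacement policy, operating system activity and measurement overhead.

```c
#include <stddef.h>
#include <stdlib.h>

/* Count misses for a byte-granular walk through an array in an idealized,
 * initially empty, direct-mapped cache. order[i] is the byte offset touched
 * at step i. Returns -1 if the bookkeeping array cannot be allocated. */
long count_misses(const size_t *order, size_t steps,
                  size_t cache_lines, size_t line_bytes)
{
    /* tags[set] stores (line number + 1) so that 0 means "empty". */
    size_t *tags = calloc(cache_lines, sizeof *tags);
    long misses = 0;

    if (tags == NULL) {
        return -1;
    }
    for (size_t i = 0; i < steps; i++) {
        size_t line = order[i] / line_bytes;  /* which memory line is touched */
        size_t set = line % cache_lines;      /* where it lives in the cache */
        if (tags[set] != line + 1) {          /* not present: count a miss */
            tags[set] = line + 1;
            misses++;
        }
    }
    free(tags);
    return misses;
}
```

With an array exactly as large as the modeled cache, forward, reverse and shuffled walks all touch each line once before reusing it, so the model predicts one miss per line (512 for the assumed geometry) regardless of order. Real hardware departs from this figure for reasons the simulation leaves out.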
PAPI_L1_DCM – Forward
[Figure: bar chart of normalized L1 data cache misses for the forward walk. Machines: PPro, P4, Atom, Core2, Core2 NoPrefetch, Istanbul, Nehalem, Nehalem-EX. Y axis: Normalized Misses, 0.00 to 1.25. The chart includes a "No Counter Available" marking.]
25
PAPI_L1_DCM – Reverse
[Figure: bar chart of normalized L1 data cache misses for the reverse walk. Machines: PPro, P4, Atom, Core2, Core2 NoPrefetch, Istanbul, Nehalem, Nehalem-EX. Y axis: Normalized Misses, 0.00 to 1.25. The chart includes a "No Counter Available" marking.]
26
PAPI_L1_DCM – Random
[Figure: bar chart of normalized L1 data cache misses for the random walk. Machines: PPro, P4, Atom, Core2, Core2 NoPrefetch, Istanbul, Nehalem, Nehalem-EX. Y axis: Normalized Misses, 0.00 to 1.25. The chart includes a "No Counter Available" marking.]
27
L1D Sources of Divergences
- Hardware Prefetching
- PAPI Measurement Noise
- Operating System Activity
- Non-LRU Cache Replacement
28
L2 Total Cache Misses
- Allocate array as big as L2 Cache
- Walk through the array byte-by-byte
- Count misses with PAPI_L2_TCM event
29
PAPI_L2_TCM – Forward
[Figure: bar chart of normalized L2 total cache misses for the forward walk. Machines: PPro, P4, Atom, Core2, Core2 NoPrefetch, Istanbul, Nehalem, Nehalem-EX. Y axis: Normalized L2 Misses, 0.00 to 1.75. The chart includes a "No Counter Available" marking.]
30
PAPI_L2_TCM – Reverse
[Figure: bar chart of normalized L2 total cache misses for the reverse walk. Machines: PPro, P4, Atom, Core2, Core2 NoPrefetch, Istanbul, Nehalem, Nehalem-EX. Y axis: Normalized L2 Misses, 0.00 to 1.75. The chart includes a "No Counter Available" marking.]
31
PAPI_L2_TCM – Random
[Figure: bar chart of normalized L2 total cache misses for the random walk. Machines: PPro, P4, Atom, Core2, Core2 NoPrefetch, Istanbul, Nehalem, Nehalem-EX. Y axis: Normalized L2 Misses, 0.00 to 1.75. The chart includes a "No Counter Available" marking.]
32
L2 Sources of Divergences
- Hardware Prefetching
- PAPI Measurement Noise
- Operating System Activity
- Non-LRU Cache Replacement
- Cache Coherency Traffic
33
Conclusions
- Counters are hard to understand
- Automated testing is possible
34
Questions?
Pictures from a Maryland-style Crab Feast
35