SLIDE 1

Can We Understand Performance Counter Results?

Vince Weaver ICL Lunch Talk 23 July 2010

SLIDE 2

How Do We Know if Counters are Working?

Three common failures:

  • Wrong counter (PAPI, Kernel, User)
  • Counter works but gives wrong values
  • Counter is giving “right” values but the documentation is wrong
SLIDE 3

Deterministic Events Easiest to Validate

  • Retired Instructions
  • Retired Branches
  • Retired Loads and Stores
  • Retired Multiplies and Divides
  • Retired µops
  • Retired Floating Point and SSE
  • Other (fxch, cpuid, move operations, serializing instructions, memory barriers, and not-taken branches)
SLIDE 4

Ideal Deterministic Events

  • Results are same run-to-run
  • Event is frequent enough to be useful
  • The expected count can easily be determined by code inspection

  • Available on many processors

SLIDE 5

Retired Instruction Overcount

[Chart: retired-instruction overcount versus estimated timer frequency (100 Hz, 250 Hz, 1000 Hz) on a 550 MHz Pentium III, a 2.2 GHz Phenom, and a 2.8 GHz Pentium 4, for runs of 253.perlbmk and 176.gcc with various inputs]

SLIDE 6

Tracking Down the Source of Overcounts

  • Work backward from existing benchmarks?
  • Assembly Language!

SLIDE 7

Contributors to Instruction Count on x86_64

  • Expected count
  • +1 for every hardware interrupt
  • +1 for each memory page touched
  • +1 for the first floating point instruction
  • Processor errata
  • Undocumented processor quirks

SLIDE 8

Retired Instruction Results

machine       Raw Results     Adjusted Results   Adjustments Made
Expected      226,990,030     226,990,030
Core2         10,793±40       12±1               HW Int
Atom          11,601±495      43±12              HW Int
Nehalem       11,794±1316     2±7                HW Int
Nehalem-EX    11,915±9        6±2                HW Int
Pentium D R   2,610,571±8     200,561±8          Instr Double Counts
Pentium D C   10,794±28       52±5               HW Int
Phenom        310,601±11      11±0               HW Int, FP Except
Istanbul      311,830±78      9±1                HW Int, FP Except
Pin           2.51868e9±0     0±0                Count rep string as 1
Qemu          16,410,000±0
Valgrind      6,909,896±0
SLIDE 9

Retired Stores Results

machine       Raw Results     Adjusted Results   Adjustments Made
Expected      24,060,000      24,060,000
Core2         0±0             0±0
Atom          n/a             n/a
Nehalem       411,632±1483    410,014±1          HW Int
Nehalem-EX    411,914±6       410,018±1          HW Int
Pentium D     12,880,000±0
Phenom        n/a             n/a
Istanbul      n/a             n/a
Pin           802,180,000±0   980,000±0          Count rep string as 1
Qemu          n/a             n/a
Valgrind      7,542,176±0
SLIDE 10

Retired Floating Point

machine       FP1                FP2                  SSE
Core2         73,500,376±140     40,299,997±          23,200,000±
Atom          38,800,000±        0±                   88,299,597±792
Nehalem       50,150,648±140     17,199,998±1         24,201,639±957
Nehalem-EX    50,155,704±562     17,199,998±2         24,007,005±197,401
Pentium D     100,400,262±9      140,940,555±39,287   53,149,435±522,879
Phenom        26,600,001±        112,700,001±         15,800,000±
Istanbul      26,600,001±        112,700,001±         15,800,000±
SLIDE 11

Other Architectures

  • ARM – Cannot select only userspace events
  • ia64 – Loads, Stores, Instructions all deterministic
  • POWER – On Power6, Instructions is deterministic; Branches is not
  • SPARC – On Niagara1, Instructions is deterministic
SLIDE 12

Non-deterministic Events

  • Cache and Memory Related
  • Branch Predictor
  • Cycles
  • Stalls


SLIDE 13-21

Simplistic Cache Model

[Animated diagram built up step by step across slides 13-21; the figure content is not recoverable from the text]
SLIDE 22

L1 Data Cache Accesses

float array[1000], sum = 0.0;

PAPI_start_counters(events, 1);
for (int i = 0; i < 1000; i++) {
    sum += array[i];
}
PAPI_stop_counters(counts, 1);

SLIDE 23

PAPI L1 DCA

[Chart: L1 DCache accesses normalized against the expected 1000, for PPro, P4, Atom, Core2, Istanbul, Nehalem-EX, Nehalem (gcc 4.1), and Nehalem (gcc 4.3); normalized accesses 1-6; "No Counter Available" where the event is unsupported]
SLIDE 24

PAPI L1 DCA

Expected Code

* 4020d8: f3 0f 58 00         addss  (%rax),%xmm0
  4020dc: 48 83 c0 04         add    $0x4,%rax
  4020e0: 48 39 d0            cmp    %rdx,%rax
  4020e3: 75 f3               jne    4020d8 <main+0x328>

Unexpected Code

* 401e18: f3 0f 10 44 24 0c   movss  0xc(%rsp),%xmm0
* 401e1e: f3 0f 58 04 82      addss  (%rdx,%rax,4),%xmm0
  401e23: 48 83 c0 01         add    $0x1,%rax
  401e27: 48 3d e8 03 00 00   cmp    $0x3e8,%rax
* 401e2d: f3 0f 11 44 24 0c   movss  %xmm0,0xc(%rsp)
  401e33: 75 e3               jne    401e18 <main+0x398>

Instructions marked * access memory: one per iteration in the expected code versus three in the unexpected code.
SLIDE 25

L1 Data Cache Misses

  • Allocate array as big as L1 DCache
  • Walk through the array byte-by-byte
  • Count misses with PAPI L1 DCM event

SLIDE 26

PAPI L1 DCM – Forward

[Chart: normalized L1 DCache misses, forward walk, for PPro, P4, Atom, Core2, Core2 No-Prefetch, Istanbul, Nehalem, and Nehalem-EX; normalized misses 0.00-1.25; "No Counter Available" where the event is unsupported]
SLIDE 27

PAPI L1 DCM – Reverse

[Chart: normalized L1 DCache misses, reverse walk, for PPro, P4, Atom, Core2, Core2 No-Prefetch, Istanbul, Nehalem, and Nehalem-EX; normalized misses 0.00-1.25; "No Counter Available" where the event is unsupported]
SLIDE 28

PAPI L1 DCM – Random

[Chart: normalized L1 DCache misses, random walk, for PPro, P4, Atom, Core2, Core2 No-Prefetch, Istanbul, Nehalem, and Nehalem-EX; normalized misses 0.00-1.25; "No Counter Available" where the event is unsupported]
SLIDE 29

L1D Sources of Divergences

  • Hardware Prefetching
  • PAPI Measurement Noise
  • Operating System Activity
  • Non-LRU Cache Replacement

SLIDE 30

L2 Total Cache Misses

  • Allocate array as big as L2 Cache
  • Walk through the array byte-by-byte
  • Count misses with PAPI L2 TCM event

SLIDE 31

PAPI L2 TCM – Forward

[Chart: normalized L2 misses, forward walk, for PPro, P4, Atom, Core2, Core2 No-Prefetch, Istanbul, Nehalem, and Nehalem-EX; normalized L2 misses 0.00-1.75; "No Counter Available" where the event is unsupported]
SLIDE 32

PAPI L2 TCM – Reverse

[Chart: normalized L2 misses, reverse walk, for PPro, P4, Atom, Core2, Core2 No-Prefetch, Istanbul, Nehalem, and Nehalem-EX; normalized L2 misses 0.00-1.75; "No Counter Available" where the event is unsupported]
SLIDE 33

PAPI L2 TCM – Random

[Chart: normalized L2 misses, random walk, for PPro, P4, Atom, Core2, Core2 No-Prefetch, Istanbul, Nehalem, and Nehalem-EX; normalized L2 misses 0.00-1.75; "No Counter Available" where the event is unsupported]
SLIDE 34

L2 Sources of Divergences

  • Hardware Prefetching
  • PAPI Measurement Noise
  • Operating System Activity
  • Non-LRU Cache Replacement
  • Cache Coherency Traffic

SLIDE 35

Conclusions

  • Counters are hard to understand
  • Automated testing is possible

SLIDE 36

Questions?

Pictures from a Maryland-style Crab Feast
