The PAPI libpfm4 Transition and unrelated Software Prefetch Research
Vince Weaver
ICL Lunch Talk, 11 February 2011
Part I: The PAPI libpfm4 Transition
Layers of Abstraction
[Diagram: layers of abstraction, top to bottom: PAPI (PAPI_TOT_INS), Event Name Translator (INSTRUCTION_RETIRED), Operating System (0x5300c0), Hardware]
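The raw value at the bottom of the stack can be unpacked to show what the operating system actually programs. The following is a minimal sketch, assuming that 0x5300c0 is an Intel event-select value laid out like the IA32_PERFEVTSELx register (event select in bits 0-7, unit mask in bits 8-15, control flags above that). The helper name `decode_perfevtsel` is made up for illustration and is not part of PAPI or libpfm4.

```python
# Control-flag bit positions, assuming the Intel IA32_PERFEVTSELx layout.
FLAG_BITS = {16: "USR", 17: "OS", 18: "EDGE", 19: "PC", 20: "INT", 22: "EN", 23: "INV"}


def decode_perfevtsel(raw):
    """Split a raw x86 event-select value into its fields (hypothetical helper)."""
    return {
        "event": raw & 0xFF,          # bits 0-7: event select
        "umask": (raw >> 8) & 0xFF,   # bits 8-15: unit mask
        "flags": [name for bit, name in sorted(FLAG_BITS.items()) if raw & (1 << bit)],
        "cmask": (raw >> 24) & 0xFF,  # bits 24-31: counter mask
    }


fields = decode_perfevtsel(0x5300C0)
print(hex(fields["event"]), hex(fields["umask"]), fields["flags"])
```

Under that assumption, the event select is 0xc0 with a zero unit mask, counted in both user and kernel mode with the counter enabled.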
libpfm3
- Used by PAPI since version 3.0 for Linux:
perfctr, perfmon2, perf events
- No longer supported
- No support for newer chips
- Not really designed for perf events
libpfm4
- Still under development
- Supports newest processors
- Designed for perf events
- Just incompatible enough with libpfm3 to be annoying
Features Not Supported by Linux 2.6.38
(PAPI/libpfm4 will support these once the kernel does)
- AMD Lightweight Profiling (LWP)
- Intel HW Cycle-count Register
- Uncore Events (Intel, AMD 15h, Power)
- Nehalem Offcore Response
- Sampling Interfaces (IBS / PEBS)
- Newer Processors (Sandy Bridge, Bulldozer)
Can Current PAPI Handle All of These New Features?
Original Event Layout
[Diagram: 32-bit PAPI event code. Fields: PAPI_PRESET_MASK (bit 31).]
Example event: PAPI_L1_TCM
PAPI 3.0 (2004)
[Diagram: 32-bit PAPI event code. Fields: PAPI_PRESET_MASK (bit 31), PAPI_NATIVE_MASK.]
Example event: LAST_LEVEL_CACHE_REFERENCES
PAPI 3.5 (2006)
[Diagram: 32-bit PAPI event code. Fields: PAPI_PRESET_MASK (bit 31), PAPI_NATIVE_MASK, UMASK, EVENT. Bit labels shown: 31, 15.]
Example event: L2_RQSTS:SELF_DEMAND_MESI
PAPI 3.6 (2008)
[Diagram: 32-bit PAPI event code. Fields: PAPI_PRESET_MASK (bit 31), PAPI_NATIVE_MASK, UMASK, EVENT. Bit labels shown: 31, 15, 11, 7.]
Example event: BRANCH_RETIRED:MMNP:MMNM:MMTP:MMTM
PAPI 4.0 (2010)
[Diagram: 32-bit PAPI event code. Fields: PAPI_PRESET_MASK (bit 31), PAPI_NATIVE_MASK, COMPONENT, UMASK, EVENT. Bit labels shown: 31, 25, 15, 11, 7.]
Example event: LM_SENSORS.applesmc-isa-0300.temp10.temp10_input
PAPI 5.0???
[Diagram: 32-bit PAPI event code. Fields: PAPI_PRESET_MASK (bit 31), PAPI_NATIVE_MASK, COMPONENT, PMU, UMASK, EVENT. Bit labels shown: 31, 25, 15, 11, 7.]
Example events:
- nhm::BR_INST_RETIRED:ALL_BRANCHES
- nhm_unc::UNC_DRAM_PAGE_MISS
- ix86arch::UNHALTED_CORE_CYCLES
- perf::ext4:ext4_discard_blocks
Not Enough Bits!
[Diagram: 32-bit PAPI event code. Fields: PAPI_PRESET_MASK (bit 31), PAPI_NATIVE_MASK, COMPONENT, PMU, UMASK, EVENT. Bit labels shown: 31, 25, 15, 11, 7.]
Example events that do not fit:
- PM_DC_PMC_9:lpid_mask=0xff:lpid=0x22:pid_mask=0x3fff:pid=0x1b2d:marking
- nhm::OFFCORE_RESPONSE_0:DMND_DATA_RD:DMND_RFO:REMOTE_DRAM:LOCAL_DRAM
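The bit shortage can be sketched by packing fields into the space left under the two flag bits. This is an illustration only: PAPI_NATIVE_MASK is bit 30 of the event code, but the field widths below are hypothetical and chosen for the example, not taken from any PAPI release. The modifier widths for the Power-style event are inferred from the mask values on the slide (0xff needs 8 bits, 0x3fff needs 14).

```python
PAPI_NATIVE_MASK = 0x40000000  # bit 30 marks a native event
FREE_BITS = 30                 # bits 0..29 remain once bits 31 and 30 are flags


def pack(fields):
    """Pack (value, width) pairs, least significant first, under the native flag.

    Hypothetical helper; the widths passed in are assumptions for illustration.
    """
    code, shift = 0, 0
    for value, width in fields:
        if value >= (1 << width):
            raise ValueError(f"value {value:#x} does not fit in {width} bits")
        code |= value << shift
        shift += width
    if shift > FREE_BITS:
        raise OverflowError(f"need {shift} bits, only {FREE_BITS} available")
    return PAPI_NATIVE_MASK | code


# An event, one unit mask, and a component index fit comfortably.
small = pack([(0xC4, 8), (0x00, 8), (1, 4)])
print(hex(small))

# The four modifiers of the Power-style event alone need 8 + 8 + 14 + 14 = 44 bits.
try:
    pack([(0xFF, 8), (0x22, 8), (0x3FFF, 14), (0x1B2D, 14)])
except OverflowError as err:
    print("does not fit:", err)
```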
Move to libpfm4 and String-based Events
- Have a dynamically updated table containing the event names in use as full strings
- A 32-bit PAPI native event is assigned to each string, allowing backward compatibility with the current PAPI interface
- Must make sure that event name lookup is not on the critical path to avoid performance regressions
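One way to picture the table described above is a small interning structure: each event string is assigned the next free index the first time it is seen, and the 32-bit code is that index with the native-event flag set. This is a sketch under assumptions, not PAPI's implementation; the class and method names are hypothetical.

```python
PAPI_NATIVE_MASK = 0x40000000  # bit 30 marks a native event


class NativeEventTable:
    """Maps full event strings to 32-bit codes and back (hypothetical sketch)."""

    def __init__(self):
        self._index_by_name = {}  # string -> table index
        self._names = []          # table index -> string

    def name_to_code(self, name):
        """Return the code for a name, adding it to the table on first use."""
        index = self._index_by_name.get(name)
        if index is None:
            index = len(self._names)
            self._names.append(name)
            self._index_by_name[name] = index
        return PAPI_NATIVE_MASK | index

    def code_to_name(self, code):
        """Recover the full event string from a previously issued code."""
        return self._names[code & ~PAPI_NATIVE_MASK]


table = NativeEventTable()
code = table.name_to_code("nhm::OFFCORE_RESPONSE_0:DMND_DATA_RD:DMND_RFO")
print(hex(code), table.code_to_name(code))
```

The string lookup happens once, when the event is added; after that the caller holds an integer, which keeps name resolution off the measurement path.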
Part II: Investigating Prefetching Using Hardware Performance Counters
Quick Look at Core2 HW Prefetch
- Instruction prefetcher
- L1 Data Cache Unit (DCU) Prefetcher (streaming): ascending data accesses prefetch the next line
- L1 Instruction Pointer Strided Prefetcher: looks for strided accesses from particular load instructions, forward or backward, up to 2 KB apart
- L2 Data Prefetch Logic: fetches to L2 based on requests from the L1 DCU
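The instruction-pointer strided prefetcher can be illustrated with a toy model. The sketch below is not the Core2 hardware algorithm, whose details are not given here. It assumes the simplest possible policy: remember the last address and stride per load instruction, and predict the next address once the same non-zero stride has been seen twice in a row. The 2048-byte limit reads "2k" as bytes, which is also an assumption.

```python
MAX_STRIDE = 2048  # "up to 2k apart", taken here to mean bytes (assumption)


class IPStridePrefetcher:
    """Toy per-instruction stride detector (hypothetical model, not real hardware)."""

    def __init__(self):
        self._last = {}  # load IP -> (last address, last stride or None)

    def access(self, ip, addr):
        """Record a load and return the address to prefetch, or None."""
        prev = self._last.get(ip)
        stride = addr - prev[0] if prev else None
        prediction = None
        if prev and stride != 0 and abs(stride) <= MAX_STRIDE and stride == prev[1]:
            prediction = addr + stride  # works for forward and backward strides
        self._last[ip] = (addr, stride)
        return prediction


pf = IPStridePrefetcher()
predictions = [pf.access(0x400, addr) for addr in (1000, 1064, 1128, 1192)]
print(predictions)
```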
x86 SW Prefetch Instructions (AMD)
- PREFETCHNTA – SSE1, non-temporal (use once)
- PREFETCHT0 – SSE1, prefetch to all levels
- PREFETCHT1 – SSE1, prefetch to L2 + higher
- PREFETCHT2 – SSE1, prefetch to L3 + higher
- PREFETCH – AMD 3DNow!, prefetch to L1
- PREFETCHW – AMD 3DNow!, prefetch for write
Investigating Adding a PAPI_PRF_SW Pre-defined Event
- Can multiple machines count SW Prefetches?
- Does the behavior of the events match expectations?
- Will people use the preset?
Core2
- SSE_PRE_EXEC:NTA – counts NTA
- SSE_PRE_EXEC:L1 – counts T0 (fxsave+2, fxrstor+5)
- SSE_PRE_EXEC:L2 – counts T1/T2
- Problem: Only 2 counters available on Core2
AMD (Istanbul and Later)
- PREFETCH_INSTRUCTIONS_DISPATCHED:NTA
- PREFETCH_INSTRUCTIONS_DISPATCHED:LOAD
- PREFETCH_INSTRUCTIONS_DISPATCHED:STORE
- These events appear to be speculative, and won’t count SW prefetches that conflict with HW prefetches
Atom
- PREFETCH:PREFETCHNTA
- PREFETCH:PREFETCHT0
- PREFETCH:SW_L2
- These events will count SW prefetches, but the numbers counted vary in complex ways
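A preset covering all software prefetches would have to sum several native events per platform. The sketch below is one hypothetical way to express that, with event names written with underscores in the libpfm style. It also shows why the two-counter limit on Core2 is a problem: three sub-events cannot all be counted in one run. The names `PRF_SW_EVENTS`, `runs_needed`, and `total_sw_prefetches` are made up for illustration.

```python
# Native events that would be summed per platform (names as listed on the slides).
PRF_SW_EVENTS = {
    "core2": ["SSE_PRE_EXEC:NTA", "SSE_PRE_EXEC:L1", "SSE_PRE_EXEC:L2"],
    "amd": [
        "PREFETCH_INSTRUCTIONS_DISPATCHED:NTA",
        "PREFETCH_INSTRUCTIONS_DISPATCHED:LOAD",
        "PREFETCH_INSTRUCTIONS_DISPATCHED:STORE",
    ],
    "atom": ["PREFETCH:PREFETCHNTA", "PREFETCH:PREFETCHT0", "PREFETCH:SW_L2"],
}


def runs_needed(platform, counters):
    """Measurement runs needed to cover every sub-event with `counters` counters."""
    events = len(PRF_SW_EVENTS[platform])
    return -(-events // counters)  # ceiling division


def total_sw_prefetches(counts):
    """Derived-sum value: add up the per-event counts gathered for one platform."""
    return sum(counts.values())


# With the 2 counters available on Core2, the 3 sub-events need 2 separate runs.
print(runs_needed("core2", counters=2))
```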
Does anyone use SW Prefetch?
- gcc disables SW prefetch by default unless you specify -fprefetch-loop-arrays
- icc disables it unless you specify -xsse4.2 -opt-prefetch=4
- glibc has hand-coded SW prefetch in memcpy()
- Prefetch can hurt behavior:
  – Can throw out good cache lines
  – Can bring lines in too soon
  – Can interfere with the HW prefetcher
SW Prefetch Distribution
SPEC CPU 2000, Core2, gcc -fprefetch-loop-arrays
[Figure: load distribution per benchmark, one panel for the integer and one for the floating-point SPEC CPU 2000 benchmarks. Y-axis: load instructions (billions). Each bar is split into Loads, T0, T1/T2, and NTA. Three benchmarks are marked N/A.]
Normalized SW Prefetch Runtime
on Core2 (Smaller is Better)
[Figure: normalized runtime when SW prefetch is enabled with -fprefetch-loop-arrays, one panel for Integer SPEC CPU 2000 and one for FP SPEC CPU 2000. Y-axis: normalized runtime (ticks at 0.5 and 1). Three benchmarks are marked N/A.]
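The normalization used in these plots is a simple ratio: the runtime of the variant divided by the runtime of the baseline, so values below 1 mean the variant is faster. A minimal sketch follows; the timings are invented for illustration and are not measurements from the talk.

```python
def normalized_runtime(variant_seconds, baseline_seconds):
    """Runtime of a variant relative to its baseline; below 1.0 means faster."""
    if baseline_seconds <= 0:
        raise ValueError("baseline runtime must be positive")
    return variant_seconds / baseline_seconds


# Hypothetical timings: plain build vs. build with -fprefetch-loop-arrays.
plain_seconds = 200.0
prefetch_seconds = 180.0
ratio = normalized_runtime(prefetch_seconds, plain_seconds)
print(f"{ratio:.2f}")
```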
The HW Prefetcher on Core2 can be Disabled
Runtime with HW Prefetcher Disabled
Normalized against Runtime with HW Prefetcher Enabled
on Core2 (Smaller is Better)
[Figure: normalized runtime when HW prefetch is disabled, one panel for the integer and one for the floating-point SPEC CPU 2000 benchmarks. Two bars per benchmark: plain and w/ SW prefetch. Y-axis: normalized runtime. Off-scale bars are labeled 1.82 and 1.84 in the integer panel and 2.47, 2.58, 3.82, and 3.66 in the floating-point panel. Three benchmarks are marked N/A.]
PAPI_PRF_SW Revisited
- Can multiple machines count SW Prefetches?
Yes.
- Does the behavior of the events match expectations?
Not always.
- Would people use the preset?
Maybe.
Questions?
vweaver1@eecs.utk.edu