The PAPI libpfm4 Transition and unrelated Software Prefetch Research
Vince Weaver
ICL Lunch Talk
11 February 2011

Part I: The PAPI libpfm4 Transition

Layers of Abstraction
PAPI_TOT_INS -> PAPI -> INSTRUCTION_RETIRED -> Event Name Translator -> 0x5300c0 -> Operating System -> Hardware
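The raw value 0x5300c0 at the bottom of this stack can be unpacked by hand. The sketch below assumes the value follows the Intel architectural IA32_PERFEVTSELx layout (event select in bits 0-7, unit mask in bits 8-15, control flags above); the slide does not name the register format, and the helper `decode_perfevtsel` is a hypothetical name used only for illustration.

```python
def decode_perfevtsel(raw):
    """Split a raw x86 event-select value into named fields.

    Assumes the Intel architectural IA32_PERFEVTSELx bit layout.
    """
    return {
        "event_select": raw & 0xFF,          # bits 0-7: event number
        "umask": (raw >> 8) & 0xFF,          # bits 8-15: unit mask
        "usr": bool(raw & (1 << 16)),        # count in user mode
        "os": bool(raw & (1 << 17)),         # count in kernel mode
        "edge": bool(raw & (1 << 18)),       # edge detect
        "int": bool(raw & (1 << 20)),        # interrupt on overflow
        "enable": bool(raw & (1 << 22)),     # counter enabled
        "invert": bool(raw & (1 << 23)),     # invert counter mask
        "cmask": (raw >> 24) & 0xFF,         # bits 24-31: counter mask
    }


fields = decode_perfevtsel(0x5300C0)
for name, value in fields.items():
    print(f"{name:12s} {value}")
```

Under that assumption the low byte is event 0xc0 with unit mask 0x00, and the upper bits enable counting in both user and kernel mode with overflow interrupts turned on.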

libpfm3
• Used by PAPI since version 3.0 for Linux: perfctr, perfmon2, perf events
• No longer supported
• No support for newer chips
• Not really designed for perf events

libpfm4
• Still under development
• Supports newest processors
• Designed for perf events
• Just incompatible enough with libpfm3 to be annoying

Features Not Supported by Linux 2.6.38 (PAPI/libpfm4 will support them once the kernel does)
• AMD Lightweight Profiling (LWP)
• Intel HW Cycle-count Register
• Uncore Events (Intel, AMD 15h, Power)
• Nehalem Offcore Response
• Sampling Interfaces (IBS / PEBS)
• Newer Processors (Sandy Bridge, Bulldozer)

Can Current PAPI Handle All of These New Features?

Original Event Layout
PAPI Event, bit positions marked: 31, 0
Fields: PAPI_PRESET_MASK
Example: PAPI_L1_TCM

PAPI 3.0 (2004)
PAPI Event, bit positions marked: 31, 0
Fields: PAPI_NATIVE_MASK, PAPI_PRESET_MASK
Example: LAST_LEVEL_CACHE_REFERENCES

PAPI 3.5 (2006)
PAPI Event, bit positions marked: 31, 15, 0
Fields: UMASK, EVENT, PAPI_NATIVE_MASK, PAPI_PRESET_MASK
Example: L2_RQSTS:SELF_DEMAND_MESI

PAPI 3.6 (2008)
PAPI Event, bit positions marked: 31, 15, 11, 7, 0
Fields: UMASK, EVENT, PAPI_NATIVE_MASK, PAPI_PRESET_MASK
Example: BRANCH_RETIRED:MMNP:MMNM:MMTP:MMTM

PAPI 4.0 (2010)
PAPI Event, bit positions marked: 31, 25, 15, 11, 7, 0
Fields: COMPONENT, UMASK, EVENT, PAPI_NATIVE_MASK, PAPI_PRESET_MASK
Example: LM_SENSORS.applesmc-isa-0300.temp10.temp10_input

PAPI 5.0???
PAPI Event, bit positions marked: 31, 25, 15, 11, 7, 0
Fields: COMPONENT, PMU, UMASK, EVENT, PAPI_NATIVE_MASK, PAPI_PRESET_MASK
Examples:
• nhm::BR_INST_RETIRED:ALL_BRANCHES
• nhm_unc::UNC_DRAM_PAGE_MISS
• ix86arch::UNHALTED_CORE_CYCLES
• perf::ext4:ext4_discard_blocks

Not Enough Bits!
PAPI Event, bit positions marked: 31, 25, 15, 11, 7, 0
Fields: COMPONENT, PMU, UMASK, EVENT, PAPI_NATIVE_MASK, PAPI_PRESET_MASK
Examples:
• PM_DC_PMC_9:lpid_mask=0xff:lpid=0x22:pid_mask=0x3fff:pid=0x1b2d:marking
• nhm::OFFCORE_RESPONSE_0:DMND_DATA_RD:DMND_RFO:REMOTE_DRAM:LOCAL_DRAM

Move to libpfm4 and String-based Events
• Have a dynamically updated table containing the event names in use as full strings
• A 32-bit PAPI native event is assigned to each string, allowing backward compatibility with the current PAPI interface
• Must make sure that event name lookup is not on the critical path, to avoid performance regressions
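One way to picture the string table is sketched below. This is only an illustration of the idea on the slide, not PAPI's actual implementation: the class `NativeEventTable` and its methods are hypothetical, and the value used for `PAPI_NATIVE_MASK` is assumed to match the constant in papi.h.

```python
PAPI_NATIVE_MASK = 0x40000000  # assumed value of the papi.h constant


class NativeEventTable:
    """Hypothetical table that hands out 32-bit codes for event strings."""

    def __init__(self):
        self._code_by_name = {}   # full event string -> 32-bit code
        self._name_by_code = []   # table index -> full event string

    def code_for(self, name):
        """Return the code for an event string, adding it on first use."""
        code = self._code_by_name.get(name)
        if code is None:
            code = PAPI_NATIVE_MASK | len(self._name_by_code)
            self._code_by_name[name] = code
            self._name_by_code.append(name)
        return code

    def name_for(self, code):
        """Map a previously assigned code back to its event string."""
        if not code & PAPI_NATIVE_MASK:
            raise ValueError("not a native event code")
        index = code & ~PAPI_NATIVE_MASK
        if index >= len(self._name_by_code):
            raise KeyError(f"unknown native event code {code:#x}")
        return self._name_by_code[index]


table = NativeEventTable()
first = table.code_for("nhm::BR_INST_RETIRED:ALL_BRANCHES")
second = table.code_for("ix86arch::UNHALTED_CORE_CYCLES")
print(hex(first), table.name_for(first))
print(hex(second), table.name_for(second))
```

In this sketch the string lookup happens once, when an event is added. After that the caller holds a plain integer, so starting, stopping, and reading counters never touches the table, which is the critical-path concern raised in the last bullet.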

Part II: Investigating Prefetching Using Hardware Performance Counters

Quick Look at Core2 HW Prefetch
• Instruction prefetcher
• L1 Data Cache Unit Prefetcher (streaming). Ascending data accesses prefetch the next line
• L1 Instruction Pointer Strided Prefetcher. Looks for strided access from particular load instructions, forward or backward, up to 2k apart
• L2 Data Prefetch Logic. Fetches to L2 based on the L1 DCU

x86 SW Prefetch Instructions (AMD)
• PREFETCHNTA – SSE1, non temporal (use once)
• PREFETCHT0 – SSE1, prefetch to all levels
• PREFETCHT1 – SSE1, prefetch to L2 + higher
• PREFETCHT2 – SSE1, prefetch to L3 + higher
• PREFETCH – AMD 3DNOW! prefetch to L1
• PREFETCHW – AMD 3DNOW! prefetch for write

Investigating Adding a PAPI_PRF_SW Pre-defined Event
• Can multiple machines count SW prefetches?
• Does the behavior of the events match expectations?
• Will people use the preset?

Core2
• SSE_PRE_EXEC:NTA – counts NTA
• SSE_PRE_EXEC:L1 – counts T0 (fxsave +2, fxrstor +5)
• SSE_PRE_EXEC:L2 – counts T1/T2
• Problem: Only 2 counters available on Core2

AMD (Istanbul and Later)
• PREFETCH_INSTRUCTIONS_DISPATCHED:NTA
• PREFETCH_INSTRUCTIONS_DISPATCHED:LOAD
• PREFETCH_INSTRUCTIONS_DISPATCHED:STORE
• These events appear to be speculative, and won’t count SW prefetches that conflict with HW prefetches

Atom
• PREFETCH:PREFETCHNTA
• PREFETCH:PREFETCHT0
• PREFETCH:SW_L2
• These events will count SW prefetches, but the numbers counted vary in complex ways

Does anyone use SW Prefetch?
• gcc disables SW prefetch by default unless you specify -fprefetch-loop-arrays
• icc disables it unless you specify -xsse4.2 -opt-prefetch=4
• glibc has hand-coded SW prefetch in memcpy()
• Prefetch can hurt behavior:
  – Can throw out good cache lines
  – Can bring lines in too soon
  – Can interfere with the HW prefetcher

SW Prefetch Distribution
SPEC CPU 2000, Core2, gcc -fprefetch-loop-arrays
[Figure: two "Load Distribution" bar charts of load instructions per benchmark (y-axis in billions), with series Loads, T0, T1/T2, and NTA. The first panel covers the integer benchmarks 164.gzip through 300.twolf, with three entries marked N/A. The benchmark labels of the second panel are not recoverable.]

Normalized SW Prefetch Runtime on Core2 (Smaller is Better)
[Figure: two bar charts of normalized runtime per benchmark when SW prefetch is enabled with -fprefetch-loop-arrays. Panel 1: Integer SPEC CPU 2000 (164.gzip through 300.twolf, with three entries marked N/A). Panel 2: FP SPEC CPU 2000.]

The HW Prefetcher on Core2 can be Disabled

Runtime with HW Prefetcher Disabled Normalized against Runtime with HW Prefetcher Enabled on Core2 (Smaller is Better)
[Figure: two bar charts of normalized runtime per benchmark when HW prefetch is disabled, with series "plain" and "w/ SW Prefetch". In the first panel three entries are marked N/A and the off-scale bars are labeled 1.82 and 1.84. In the second panel the off-scale bars are labeled 2.47, 2.58, 3.82, and 3.66.]
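The normalization named in the chart title is a simple ratio, sketched here. The function name and the timings are hypothetical and chosen only to show the arithmetic; they are not measurements from the talk.

```python
def normalized_runtime(runtime, baseline):
    """Return runtime relative to a baseline run (smaller is better)."""
    if baseline <= 0:
        raise ValueError("baseline runtime must be positive")
    return runtime / baseline


# Hypothetical timings in seconds, not data from the slides.
hw_prefetch_enabled = 100.0   # baseline: HW prefetcher on
hw_prefetch_disabled = 182.0  # same benchmark, HW prefetcher off

ratio = normalized_runtime(hw_prefetch_disabled, hw_prefetch_enabled)
print(f"normalized runtime: {ratio:.2f}")
```

A value above 1 means the benchmark ran slower with the HW prefetcher disabled, and a value below 1 means it ran faster.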
