The PAPI libpfm4 Transition and unrelated Software Prefetch Research

SLIDE 1

The PAPI libpfm4 Transition

and unrelated

Software Prefetch Research

Vince Weaver

ICL Lunch Talk

11 February 2011

SLIDE 2

Part I: The PAPI libpfm4 Transition

SLIDE 3

Layers of Abstraction

PAPI_TOT_INS -> [PAPI] -> INSTRUCTION_RETIRED -> [Event Name Translator] -> 0x5300c0 -> [Operating System] -> Hardware
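
The raw value 0x5300c0 at the bottom of this stack can be unpacked by hand. The following is a minimal sketch, assuming the usual Intel IA32_PERFEVTSELx layout (event select in bits 0-7, unit mask in bits 8-15, control flags from bit 16 up, counter mask in bits 24-31). The helper name `decode_perfevtsel` is made up for illustration and is not part of PAPI or libpfm4.

```python
# Control-flag bits of an Intel IA32_PERFEVTSELx register (assumed layout).
FLAG_BITS = {
    16: "USR",   # count while in user mode
    17: "OS",    # count while in kernel mode
    18: "EDGE",  # edge detect
    19: "PC",    # pin control
    20: "INT",   # interrupt on counter overflow
    21: "ANY",   # any thread
    22: "EN",    # enable the counter
    23: "INV",   # invert the counter-mask comparison
}


def decode_perfevtsel(raw):
    """Split a raw event-select value into its fields (hypothetical helper)."""
    return {
        "event": raw & 0xFF,
        "umask": (raw >> 8) & 0xFF,
        "flags": [name for bit, name in sorted(FLAG_BITS.items())
                  if raw & (1 << bit)],
        "cmask": (raw >> 24) & 0xFF,
    }


fields = decode_perfevtsel(0x5300C0)
print(hex(fields["event"]), hex(fields["umask"]), fields["flags"])
```

Under that assumed layout, the value splits into event select 0xc0 with unit mask 0x00 and the USR, OS, INT, and EN flags set, which is the encoding the event name translator would hand to the operating system for INSTRUCTION_RETIRED.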

SLIDE 4

libpfm3

  • Used by PAPI since version 3.0 for Linux: perfctr, perfmon2, perf_events
  • No longer supported
  • No support for newer chips
  • Not really designed for perf_events

SLIDE 5

libpfm4

  • Still under development
  • Supports newest processors
  • Designed for perf_events
  • Just incompatible enough with libpfm3 to be annoying

SLIDE 6

Features Not Supported by Linux 2.6.38

but which PAPI/libpfm4 will support once the kernel does

  • AMD Lightweight Profiling (LWP)
  • Intel HW Cycle-count Register
  • Uncore Events (Intel, AMD 15h, Power)
  • Nehalem Offcore Response
  • Sampling Interfaces (IBS / PEBS)
  • Newer Processors (Sandy Bridge, Bulldozer)

SLIDE 7

Can Current PAPI Handle All of These New Features?

SLIDE 8

Original Event Layout

Fields in the 32-bit event code: PAPI_PRESET_MASK, PAPI event (bit marker shown: 31)

Example: PAPI_L1_TCM

SLIDE 9

PAPI 3.0 (2004)

Fields in the 32-bit event code: PAPI_PRESET_MASK, PAPI_NATIVE_MASK, PAPI event (bit marker shown: 31)

Example: LAST_LEVEL_CACHE_REFERENCES

SLIDE 10

PAPI 3.5 (2006)

Fields in the 32-bit event code: PAPI_PRESET_MASK, PAPI_NATIVE_MASK, UMASK, EVENT (bit markers shown: 31, 15)

Example: L2_RQSTS:SELF_DEMAND_MESI

SLIDE 11

PAPI 3.6 (2008)

Fields in the 32-bit event code: PAPI_PRESET_MASK, PAPI_NATIVE_MASK, UMASK, EVENT (bit markers shown: 31, 15, 11, 7)

Example: BRANCH_RETIRED:MMNP:MMNM:MMTP:MMTM

SLIDE 12

PAPI 4.0 (2010)

Fields in the 32-bit event code: PAPI_PRESET_MASK, PAPI_NATIVE_MASK, COMPONENT, UMASK, EVENT (bit markers shown: 31, 25, 15, 11, 7)

Example: LM_SENSORS.applesmc-isa-0300.temp10.temp10_input

SLIDE 13

PAPI 5.0???

Fields in the 32-bit event code: PAPI_PRESET_MASK, PAPI_NATIVE_MASK, COMPONENT, PMU, UMASK, EVENT (bit markers shown: 31, 25, 15, 11, 7)

Examples:

  • nhm::BR_INST_RETIRED:ALL_BRANCHES
  • nhm_unc::UNC_DRAM_PAGE_MISS
  • ix86arch::UNHALTED_CORE_CYCLES
  • perf::ext4:ext4_discard_blocks

SLIDE 14

Not Enough Bits!

Fields in the 32-bit event code: PAPI_PRESET_MASK, PAPI_NATIVE_MASK, COMPONENT, PMU, UMASK, EVENT (bit markers shown: 31, 25, 15, 11, 7)

Examples that do not fit:

  • PM_DC_PMC_9:lpid_mask=0xff:lpid=0x22:pid_mask=0x3fff:pid=0x1b2d:marking
  • nhm::OFFCORE_RESPONSE_0:DMND_DATA_RD:DMND_RFO:REMOTE_DRAM:LOCAL_DRAM

SLIDE 15

Move to libpfm4 and String-based Events

  • Have a dynamically updated table containing the event names in use as full strings
  • A 32-bit PAPI native event is assigned to each string, allowing backward compatibility with the current PAPI interface
  • Must make sure that event name lookup is not on the critical path, to avoid performance regressions
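
The table described above can be sketched in a few lines. This is an illustration only, not PAPI's implementation: the class and method names are hypothetical, and the value used for `PAPI_NATIVE_MASK` is an assumption about which bit flags a native event.

```python
PAPI_NATIVE_MASK = 0x40000000  # assumed value of the native-event flag bit


class NativeEventTable:
    """Grow-on-demand table mapping full event strings to 32-bit codes."""

    def __init__(self):
        self._names = []   # table index -> event string
        self._codes = {}   # event string -> 32-bit code

    def code_for(self, name):
        """Return the code for name, assigning a new one on first use."""
        code = self._codes.get(name)
        if code is None:
            code = PAPI_NATIVE_MASK | len(self._names)
            self._names.append(name)
            self._codes[name] = code
        return code

    def name_for(self, code):
        """Reverse lookup by table index; no string comparison involved."""
        return self._names[code & ~PAPI_NATIVE_MASK]


table = NativeEventTable()
branches = table.code_for("nhm::BR_INST_RETIRED:ALL_BRANCHES")
offcore = table.code_for(
    "nhm::OFFCORE_RESPONSE_0:DMND_DATA_RD:DMND_RFO:REMOTE_DRAM:LOCAL_DRAM")
print(hex(branches), hex(offcore), table.name_for(offcore))
```

The string lookup happens once, when an event is first added. After that, callers hold an ordinary 32-bit code, so starting, stopping, and reading counters never touches the string table, which is one way to keep name lookup off the critical path.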

SLIDE 16

Part II: Investigating Prefetching Using Hardware Performance Counters

SLIDE 17

Quick Look at Core2 HW Prefetch

  • Instruction prefetcher
  • L1 Data Cache Unit (DCU) prefetcher (streaming): ascending data accesses prefetch the next line
  • L1 instruction-pointer strided prefetcher: looks for strided accesses from particular load instructions, forward or backward, up to 2k apart
  • L2 data prefetch logic: fetches to L2 based on the L1 DCU
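
To make the "ascending accesses prefetch the next line" behavior concrete, here is a toy model. It is a sketch under loud assumptions, not a description of the real Core2 hardware: the cache is unbounded, lines are 64 bytes, and every access pulls in the following line. The function name `count_misses` is made up for illustration.

```python
LINE_SIZE = 64  # bytes per cache line (assumed)


def count_misses(addresses, next_line_prefetch):
    """Count misses in an unbounded toy cache, optionally prefetching line+1."""
    cached = set()
    misses = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        if line not in cached:
            misses += 1
            cached.add(line)
        if next_line_prefetch:
            cached.add(line + 1)  # every access pulls in the next line
    return misses


ascending = range(0, 64 * LINE_SIZE, 8)          # walk 64 lines, 8 bytes at a time
descending = range(64 * LINE_SIZE - 8, -1, -8)   # same data, high to low

print(count_misses(ascending, False), count_misses(ascending, True))
print(count_misses(descending, False), count_misses(descending, True))
```

In this model the ascending walk drops from 64 misses to 1 with the prefetcher on, while the descending walk still misses on all 64 lines, since a next-line prefetcher only helps accesses that move upward in memory.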

SLIDE 18

x86 SW Prefetch Instructions (AMD)

  • PREFETCHNTA – SSE1, non-temporal (use once)
  • PREFETCHT0 – SSE1, prefetch to all levels
  • PREFETCHT1 – SSE1, prefetch to L2 + higher
  • PREFETCHT2 – SSE1, prefetch to L3 + higher
  • PREFETCH – AMD 3DNow!, prefetch to L1
  • PREFETCHW – AMD 3DNow!, prefetch for write

SLIDE 19

Investigating Adding a PAPI_PRF_SW Predefined Event

  • Can multiple machines count SW Prefetches?
  • Does the behavior of the events match expectations?
  • Will people use the preset?

SLIDE 20

Core2

  • SSE_PRE_EXEC:NTA – counts NTA
  • SSE_PRE_EXEC:L1 – counts T0 (fxsave+2, fxrstor+5)
  • SSE_PRE_EXEC:L2 – counts T1/T2
  • Problem: Only 2 counters available on Core2
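
With three prefetch events and two counters, the events cannot all be measured at once. One simple workaround is to split them across repeated runs of the same workload. The sketch below shows only the grouping step; the helper name `schedule_runs` is hypothetical, and counter multiplexing would be an alternative.

```python
def schedule_runs(events, counters):
    """Split events into groups no larger than the number of counters."""
    return [events[i:i + counters] for i in range(0, len(events), counters)]


core2_prefetch_events = [
    "SSE_PRE_EXEC:NTA",
    "SSE_PRE_EXEC:L1",
    "SSE_PRE_EXEC:L2",
]

runs = schedule_runs(core2_prefetch_events, counters=2)
for number, group in enumerate(runs, start=1):
    print("run", number, group)
```

Splitting across runs only gives comparable totals when the workload behaves the same way from run to run.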

SLIDE 21

AMD (Istanbul and Later)

  • PREFETCH_INSTRUCTIONS_DISPATCHED:NTA
  • PREFETCH_INSTRUCTIONS_DISPATCHED:LOAD
  • PREFETCH_INSTRUCTIONS_DISPATCHED:STORE
  • These events appear to be speculative, and won’t count SW prefetches that conflict with HW prefetches

SLIDE 22

Atom

  • PREFETCH:PREFETCHNTA
  • PREFETCH:PREFETCHT0
  • PREFETCH:SW_L2
  • These events will count SW prefetches, but the numbers counted vary in complex ways

SLIDE 23

Does anyone use SW Prefetch?

  • gcc disables SW prefetch by default unless you specify -fprefetch-loop-arrays
  • icc disables it unless you specify -xsse4.2 -opt-prefetch=4
  • glibc has hand-coded SW prefetch in memcpy()
  • Prefetch can hurt behavior:
    – Can throw out good cache lines
    – Can bring lines in too soon
    – Can interfere with the HW prefetcher

SLIDE 24

SW Prefetch Distribution

SPEC CPU 2000, Core2, gcc -fprefetch-loop-arrays

[Figure: "Load Distribution", two bar charts, one for the integer benchmarks (164.gzip through 300.twolf) and one for the floating-point benchmarks (168.wupwise through 301.apsi). Y-axis: load instructions, in billions. Each bar is split into Loads, T0, T1/T2, and NTA. Three entries are marked N/A.]

SLIDE 25

Normalized SW Prefetch Runtime

on Core2 (Smaller is Better)

[Figure: two bar charts, "Integer SPEC CPU 2000 Normalized Runtime when SW Prefetch Enabled with -fprefetch-loop-arrays" and "FP SPEC CPU 2000 Normalized Runtime when SW Prefetch Enabled with -fprefetch-loop-arrays". Y-axis: normalized runtime. Three entries are marked N/A.]
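
For reference, the normalization used in plots like these can be written out. This sketch assumes the baseline is the same benchmark built without -fprefetch-loop-arrays; the benchmark names and timings are hypothetical placeholders, not measurements from the talk.

```python
def normalized_runtime(with_prefetch, baseline):
    """Runtime with SW prefetch divided by baseline runtime.

    Values below 1.0 mean SW prefetch made the run faster.
    """
    return with_prefetch / baseline


# Hypothetical timings in seconds: (with SW prefetch, baseline).
timings = {
    "bench_a": (95.0, 100.0),
    "bench_b": (110.0, 100.0),
}

normalized = {
    name: normalized_runtime(sw, base) for name, (sw, base) in timings.items()
}
for name, value in normalized.items():
    print(name, round(value, 2))
```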

SLIDE 26

The HW Prefetcher on Core2 can be Disabled

SLIDE 27

Runtime with HW Prefetcher Disabled

Normalized against Runtime with HW Prefetcher Enabled

on Core2 (Smaller is Better)

[Figure: "Normalized Runtime when HW Prefetch Disabled", two bar charts, one for the integer benchmarks (164.gzip through 300.twolf) and one for the floating-point benchmarks (168.wupwise through 301.apsi). Y-axis: normalized runtime. Series: plain and w/ SW Prefetch. Off-scale bar labels: 1.82 and 1.84 after the integer chart; 2.47, 2.58, 3.82, and 3.66 after the floating-point chart. Three entries are marked N/A.]

SLIDE 28

PAPI_PRF_SW Revisited

  • Can multiple machines count SW Prefetches? Yes.
  • Does the behavior of the events match expectations? Not always.
  • Would people use the preset? Maybe.

SLIDE 29

Questions?

vweaver1@eecs.utk.edu
