SLIDE 1

Testing AutoFDO for Geant4

Nathalie Rauschmayr

IT-CF-FPP

With help from Benedikt Hegner and Shahzad Malik Muzaffar

SLIDE 5

Introduction

Idea: Autotuning

Compile → Run → Feedback → Compile (cycle)

The concept has existed for some time: Profile Guided Optimization (PGO)

SLIDE 7

Introduction

Why it helps to improve performance:

  • LHC code consists of many branches and dependencies

Figure: Example from Geant4: G4MTRunManager::InitializePhysics()

SLIDE 8

Introduction

Profile Guided Optimization is useful for:

  • Code that contains many branches that are difficult to predict at compile time
  • Performance-sensitive code
  • Code that is run over and over again

SLIDE 9

Introduction

Profile Guided Optimization:

  • Uses profiling to improve runtime performance
  • Analyses code sections that are frequently executed
  • Based on profiles the compiler might change:
      • Inlining
      • Virtual Call Speculation
      • Register allocation
      • Basic Block Optimization
      • Function Layout
      • Conditional Branch Optimization
      • Dead Code Separation

SLIDE 10

Introduction

Two approaches for Profile Guided Optimization (PGO):

  • Modify binary (instrumentation)
  • Monitor unaltered binary (sampling with perf)
      • AutoFDO transforms perf profiles into the format that gcc/clang can use for Feedback Directed Optimization (FDO)
      • Developed by Google: https://github.com/google/autofdo

SLIDE 12

Difference between sampling and instrumentation

Instrumentation based PGO:

  1. Instrumentation: gcc -fprofile-generate test.c -o test
  2. Run (profile files: test.gcno, test.gcda)
  3. Recompile: gcc -fprofile-use test.c -o test
  4. Production Environment

Disadvantages:

  • Tedious dual-compilation
  • Produces a lot of small output files (in case of Geant4: 1698 files, each smaller than 100 KB)

  • Cannot run easily in production environment
  • Instrumented binary might be significantly slower

SLIDE 13

Difference between sampling and instrumentation

Sampling Based FDO (AutoFDO):

  1. Create production binary
  2. Run production binary with perf
  3. Convert perf-profile
  4. Recompile with converted perf-profile

SLIDE 14

Difference between sampling and instrumentation

Sampling Based FDO (AutoFDO):

  1. Create production binary:
     gcc -O3 -ggdb -frecord-compilation-info-in-elf -D DEBUG test.c -o test
  2. Run production binary with perf:
     perf record -b -e cpu/event=0xc4,umask=0x20,name=br_inst_retired_near_taken,period=1000009/pp ./test
  3. Convert perf-profile:
     create_gcov --binary=./test --profile=perf.data --gcov=binary.gcov -gcov_version=1
  4. Recompile with converted perf-profile:
     gcc -O3 -fauto-profile=test.gcov test.c -o test
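
The four steps above can be sketched as a small Python helper that only assembles the command lines and does not run them. The function name `autofdo_commands` and its parameters are hypothetical; the flags are the ones shown on the slide. As an assumption of this sketch, a single profile file name is used both for the `create_gcov` output and for `-fauto-profile`.

```python
import shlex


def autofdo_commands(source="test.c", binary="test", profile="test.gcov"):
    """Return the four AutoFDO steps as argument lists (nothing is executed)."""
    # Hardware event copied from the slide: taken branch instructions retired.
    event = ("cpu/event=0xc4,umask=0x20,"
             "name=br_inst_retired_near_taken,period=1000009/pp")
    return [
        # 1. Create production binary (optimized, with debug information).
        ["gcc", "-O3", "-ggdb", "-frecord-compilation-info-in-elf",
         "-D", "DEBUG", source, "-o", binary],
        # 2. Run production binary with perf; -b enables branch stack sampling.
        ["perf", "record", "-b", "-e", event, f"./{binary}"],
        # 3. Convert the perf profile into the gcov format read by gcc.
        ["create_gcov", f"--binary=./{binary}", "--profile=perf.data",
         f"--gcov={profile}", "-gcov_version=1"],
        # 4. Recompile with the converted profile.
        ["gcc", "-O3", f"-fauto-profile={profile}", source, "-o", binary],
    ]


if __name__ == "__main__":
    for argv in autofdo_commands():
        print(shlex.join(argv))
```

Printing instead of executing keeps the sketch usable on machines that have neither perf nor the Google gcc branch that provides -frecord-compilation-info-in-elf.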

SLIDE 15

Difference between sampling and instrumentation

AutoFDO compared to instrumentation based PGO:

  • Profile data can be obtained in production environment
  • Works on optimized builds
  • It provides a tool to merge profiles from multiple runs
  • Only one output file per run

SLIDE 16

General Caveats

  • The sample needs to be representative of the typical usage scenarios
  • Otherwise PGO could possibly slow the application down
  • Need many profiles and runs
  • Unbiased branches

SLIDE 17

Testcases

Applications:

  • CMS Detector Simulation (FullCMS)
  • Simulation step of CMSSW using static build of Geant4 (cmsRun)

Input data/workflow needs to be representative:

  • How many events needed as training data?
  • What if job configuration changes?
  • What if job type changes?

SLIDE 18

Testcases

Training data     Run               Number of Events
FullCMS run       FullCMS run       100, 500, 1k
cmsRun config1    cmsRun config1    20, 50, 100
cmsRun config1    cmsRun config2    20, 50, 100
cmsRun config2    cmsRun config2    20, 50, 100
FullCMS run       cmsRun config2    1k

FullCMS: Geant4 example with particle gun
cmsRun config1: TTbar event generation and simulation (CMSSW_7_3_1)
cmsRun config2: Wjets event generation and simulation (CMSSW_7_3_1)

SLIDE 19

CMS Full Detector Simulation

Training data             Run        Number of Events
FullCMS run 100 events    FullCMS    100, 500, 1k
FullCMS run 500 events    FullCMS    100, 500, 1k
FullCMS run 1k events     FullCMS    100, 500, 1k

Figure: Runtime in [s] when processing 100 events (axis range 130 to 170) and 500 events (axis range 600 to 700), for the normal build and for AutoFDO builds trained on 100, 500 and 1000 events. Runtime change relative to the normal build:

Processing    AutoFDO 100 events    AutoFDO 500 events    AutoFDO 1000 events
100 events    -8.9%                 -10.4%                -11.5%
500 events    -9.8%                 -9.5%                 -10.2%

SLIDE 20

CMS Full Detector Simulation

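Figure: Runtime in [s] when processing 1000 events (axis range 1,150 to 1,400), for the normal build and for AutoFDO builds trained on 100, 500 and 1000 events. Runtime change relative to the normal build:

Processing     AutoFDO 100 events    AutoFDO 500 events    AutoFDO 1000 events
1000 events    -10.3%                -10.7%                -11.4%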
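
The percentages on these slides read as runtime changes relative to the normal build. A minimal sketch of that arithmetic, assuming this definition (the slides do not state the formula). The function name is hypothetical, and the two runtimes are made-up values inside the plotted axis range, not measurements.

```python
def relative_change(normal_s, autofdo_s):
    """Runtime change of the AutoFDO build relative to the normal build, in percent."""
    return (autofdo_s - normal_s) / normal_s * 100.0


# Hypothetical runtimes in seconds, for illustration only.
normal_runtime = 1400.0
autofdo_runtime = 1250.0
print(f"{relative_change(normal_runtime, autofdo_runtime):+.1f}%")  # prints -10.7%
```

A negative value means the AutoFDO build finished faster than the normal build.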

SLIDE 21

Simulation step of CMSSW using BigProducts

Used CMSSW_7_3_1:

  • SLC6, kernel 3.16
  • gcc 4.8
  • It uses BigProducts by default (developed by Shahzad)
  • pluginSimulation.so: linked against static Geant4 libraries
  • Obtain perf-profile for cmsRun, but then optimize only pluginSimulation.so

Testcase: TTbar

  • Step 1: Event generation and simulation
  • 20, 50, 100 events

SLIDE 22

Simulation step of CMSSW using BigProducts

Training data                Run               Number of Events
cmsRun 20 events config1     cmsRun config1    20, 50, 100
cmsRun 50 events config1     cmsRun config1    20, 50, 100
cmsRun 100 events config1    cmsRun config1    20, 50, 100

Figure: Runtime in [s] when processing 20 events (axis range 520 to 580) and 50 events (axis range 1,250 to 1,350), for the normal build and for AutoFDO builds trained on 20, 50 and 100 events. Runtime change relative to the normal build:

Processing    AutoFDO 20 events    AutoFDO 50 events    AutoFDO 100 events
20 events     -7.1%                -7.8%                -8.4%
50 events     -6.1%                -6.5%                -7.0%

SLIDE 23

Simulation step of CMSSW using BigProducts

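Figure: Runtime in [s] when processing 100 events (axis range 2,500 to 2,700), for the normal build and for AutoFDO builds trained on 20, 50 and 100 events. Runtime change relative to the normal build:

Processing    AutoFDO 20 events    AutoFDO 50 events    AutoFDO 100 events
100 events    -7.4%                -6.5%                -7.4%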

SLIDE 24

Simulation step of CMSSW using BigProducts

cmsRun config2: took Pythia configurations from Wjet_Pt_3000_3500_14TeV_cfi.py in CMSSW_8_1_X

Training data                Run               Number of Events
cmsRun 100 events config1    cmsRun config2    20, 50, 100

Figure: Runtime in [s] when processing 20 events (axis range 1,600 to 1,850), 50 events (axis range 3,800 to 4,400) and 100 events (axis range 7,500 to 8,500), for the normal build and for the AutoFDO build trained on 100 events. Runtime change relative to the normal build:

Processing    AutoFDO 100 events
20 events     -8.9%
50 events     -12.5%
100 events    -11.9%

SLIDE 25

Simulation step of CMSSW using BigProducts

Training data                Run               Number of Events
cmsRun 100 events config1    cmsRun config2    20, 50, 100
cmsRun 100 events config2    cmsRun config2    20, 50, 100

Figure: Runtime in [s] when processing 20 events (axis range 1,600 to 1,850), 50 events (axis range 3,800 to 4,400) and 100 events (axis range 7,500 to 8,500), for the normal build and for the two AutoFDO builds trained on 100 events. Runtime change relative to the normal build:

Processing    AutoFDO 100 events (config1)    AutoFDO 100 events (config2)
20 events     -8.9%                           -7.3%
50 events     -12.5%                          -9.2%
100 events    -11.9%                          -8.5%

SLIDE 26

Simulation step of CMSSW using BigProducts

Training data         Run                   Number of Events
FullCMS 100 events    cmsRun job config2    20, 50, 100

Figure: Runtime in [s] when processing 20 events (axis range 540 to 580), 50 events (axis range 1,260 to 1,380) and 100 events (axis range 2,550 to 2,750), for the normal build and for the AutoFDO build trained on 100 events. Runtime change relative to the normal build:

Processing    AutoFDO 100 events
20 events     -3.8%
50 events     -4.8%
100 events    -5.1%

SLIDE 28

AutoFDO gives useful insight

Google-gcc provides the flag -fcheck-branch-annotation:

  objcopy -O binary --set-section-flags .gnu.switches.text.branch.annotation=alloc -j .gnu.switches.text.branch.annotation libG4processes.so libAnnotated

Example Output:

  G4EnhancedVecAllocator.hh;122;146;0;10000;9550;d9a18bb69d5efaf3d9068625ec56d66a
  G4EnhancedVecAllocator.hh;137;8389;0;225;450;6a740d527b3f213d4868919fc7d9710c
  G4EnhancedVecAllocator.hh;135;8389;0;10000;9550;a17d8feb82daee40febb118864576dc9

Fields, in order, separated by semicolons:

  • File
  • Line
  • Basic block count
  • Annotated
  • Measured branch probability
  • Assumed branch probability
  • Hash value
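
The record layout above can be read with a few lines of Python. This is a sketch only: it assumes the seven fields appear in exactly the order listed on the slide, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BranchAnnotation:
    file: str
    line: int
    bb_count: int        # basic block count
    annotated: int
    measured_prob: int   # branch probability measured from the profile
    assumed_prob: int    # branch probability assumed by the compiler
    hash_value: str


def parse_annotation(record):
    """Split one semicolon-separated record into its seven fields."""
    file, line, count, annotated, measured, assumed, hash_value = \
        record.strip().split(";")
    return BranchAnnotation(file, int(line), int(count), int(annotated),
                            int(measured), int(assumed), hash_value)


# The three example records from the slide.
records = [
    "G4EnhancedVecAllocator.hh;122;146;0;10000;9550;d9a18bb69d5efaf3d9068625ec56d66a",
    "G4EnhancedVecAllocator.hh;137;8389;0;225;450;6a740d527b3f213d4868919fc7d9710c",
    "G4EnhancedVecAllocator.hh;135;8389;0;10000;9550;a17d8feb82daee40febb118864576dc9",
]

for entry in map(parse_annotation, records):
    # A large gap between measured and assumed probability marks a branch
    # where the compiler heuristic and the profile disagree.
    print(entry.file, entry.line, entry.measured_prob - entry.assumed_prob)
```

Sorting the parsed records by the absolute gap between the two probabilities would be one way to find the branches where the heuristics are furthest off.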

SLIDE 29

AutoFDO gives useful insight

Figure: Basic block counts (axis ticks 0.1 to 1.4, scale ·10^4) per source location, for profiles from 20, 50 and 100 events. Source locations: G4PhotoNuclearCrossSection.cc:165, G4NucleiModel.cc:1332, stl_uninitialized.h:74, G4hBremsstrahlungModel.cc:87, G4CookPairingCorrections.hh:56, vector.tcc:184, G4MuBremsstrahlungModel.cc:3, G4InuclParticle.hh:83, G4ElectroNuclearCrossSection.cc:2327, locale_facets.h:867, G4EmCorrections.cc:391, G4VEnergyLossProcess.cc:199, G4FastVector.hh:68, G4VEnergyLossProcess.cc:1412, G4VEnergyLossProcess.cc:1143, G4VEnergyLossProcess.cc:1142, G4UniversalFluctuation.cc:224, G4UniversalFluctuation.cc:213, G4Poisson.hh:57, G4ProcessManager.cc:273, G4Transportation.cc:731.

SLIDE 30

Compiler heuristics are not always accurate for branch probabilities

Figure: Branch probability (axis ticks 0.1 to 1.1, scale ·10^4) per source location, for profiles from 20, 50 and 100 events compared with the compiler's value without profile. Source locations: G4PhotoNuclearCrossSection.cc:165, G4NucleiModel.cc:1332, stl_uninitialized.h:74, G4hBremsstrahlungModel.cc:87, G4CookPairingCorrections.hh:56, vector.tcc:184, G4MuBremsstrahlungModel.cc:3, G4InuclParticle.hh:83, G4ElectroNuclearCrossSection.cc:2327, locale_facets.h:867, G4EmCorrections.cc:391, G4VEnergyLossProcess.cc:199, G4FastVector.hh:68, G4VEnergyLossProcess.cc:1412, G4VEnergyLossProcess.cc:1143, G4VEnergyLossProcess.cc:1142, G4UniversalFluctuation.cc:224, G4UniversalFluctuation.cc:213, G4Poisson.hh:57, G4ProcessManager.cc:273, G4Transportation.cc:731.

SLIDE 31

Summary

  • Quite decent speedups: 5-13%
  • Some fixes needed (shared libraries, gcc flag)
  • Stable against changes of the simulation scenario
  • Easy deployment:
      1. Start perf together with the job
      2. Gather profiles
      3. Convert and merge profiles
      4. Add compiler flag in CMake scripts

SLIDE 32

Backup Slides

SLIDE 33

Applied Optimizations

The perf profile reported G4NeutronHPInelasticCompFS::SelectExitChannel as the number one hotspot, with 5.9% of br_inst_retired_near_taken

SLIDE 34

Applied Optimizations

Information from gcc:

  gcc -fauto-profile=/data/nrauschm/CMSSW_7_3_1/output.gcov -fopt-info-optimized

  G4NeutronHPInelasticCompFS.cc:182:5: note: Unroll loop 9 times
  G4NeutronHPInelasticCompFS.cc:168:3: note: Unroll loop 6 times

SLIDE 35

Problems encountered

  • Google gcc 4.8 has been merged into the gcc main branch
  • But: normal gcc is missing the flag -frecord-compilation-info-in-elf
  • The flag creates a new section header and records the compiler command line options

>>> readelf -S fullcms | less
There are 46 section headers, starting at offset 0x176209c0:

Section Headers:
  [Nr] Name Type Address Offset
       Size EntSize Flags Link Info Align
  [ 0] NULL 0000000000000000 00000000
       0000000000000000 0000000000000000
  [ 1] .note.ABI-tag NOTE 0000000000400190 00000190
       0000000000000020 0000000000000000 A 4
  [...]
  [29] .gnu.switches.tex PROGBITS 0000000000000000 01712830
       0000000000a00569 0000000000000000 1
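
To check whether a binary carries the recorded command line options, the section list can be searched for the .gnu.switches.text section. A rough sketch, assuming the readelf -S text layout shown above, where long section names are cut off (.gnu.switches.tex). The function names and the shortened sample text are hypothetical.

```python
import re

# Matches entries such as "[29] .gnu.switches.tex"; only names starting with
# a dot are captured, so the unnamed section 0 and the header row are skipped.
SECTION_RE = re.compile(r"\[\s*\d+\]\s+(\.\S+)")


def section_names(readelf_output):
    """Return the section names found in `readelf -S` text output."""
    return SECTION_RE.findall(readelf_output)


def has_compilation_info(readelf_output):
    """True if a section whose name starts with .gnu.switches.tex is listed."""
    return any(name.startswith(".gnu.switches.tex")
               for name in section_names(readelf_output))


# Shortened sample in the layout of the slide.
sample = """\
  [Nr] Name Type Address Offset
  [ 0] NULL 0000000000000000 00000000
  [ 1] .note.ABI-tag NOTE 0000000000400190 00000190
  [29] .gnu.switches.tex PROGBITS 0000000000000000 01712830
"""

print(section_names(sample))
print(has_compilation_info(sample))
```

Because the name is matched by prefix, the check would also accept other sections that begin with the same truncated name.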

SLIDE 36

Problems encountered

  • The information in .gnu.switches.text is used to build the module map
  • create_gcov dumps the module map to a file (ending with .imports)

>>> head output.gcov.imports
/data/geant4.10.01.p03/source/event/src/G4EventManager.cc: /data/geant4.10.01.p03/
/data/geant4.10.01.p03/source/event/src/G4SmartTrackStack.cc: /data/geant4.10.01.p03/
/data/geant4.10.01.p03/source/event/src/G4StackManager.cc: /data/geant4.10.01.p03/
/data/geant4.10.01.p03/source/externals/clhep/src/Evaluator.cc: /data/geant4.10.01.p03/
/data/geant4.10.01.p03/source/externals/clhep/src/LorentzRotation.cc
/data/geant4.10.01.p03/source/externals/clhep/src/LorentzVector.cc
/data/geant4.10.01.p03/source/externals/clhep/src/LorentzVectorL.cc

SLIDE 37

Problems encountered

  • Apart from the module map, AutoFDO creates a symbol map
  • The symbol map does not contain symbols coming from shared libraries
  • It limits the usability:
      • Statically linked libraries required
      • Optimize only the library causing the largest hotspots
  • .icc files are not recognized (fixed)
  • Recent versions of perf could be problematic because the data format is different
  • Works best with sampling the Last Branch Records (LBR)
      • LBR: collection of register pairs that store source and destination addresses of recently executed branches (currently only Intel CPUs)
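
The LBR idea can be illustrated with a toy model: a fixed-size buffer of (source, destination) address pairs in which the oldest entry is dropped when a new branch is recorded. This is a sketch for illustration only; the class name, the default depth of 16 and the addresses are assumptions, and real hardware depths differ between CPU generations.

```python
from collections import Counter, deque


class LastBranchRecord:
    """Toy model of the LBR: keeps only the most recent taken branches."""

    def __init__(self, depth=16):
        # deque with maxlen discards the oldest pair once the buffer is full.
        self.pairs = deque(maxlen=depth)

    def taken_branch(self, source, destination):
        self.pairs.append((source, destination))

    def snapshot(self):
        """What a sampling tool would read out at one sample point."""
        return list(self.pairs)


lbr = LastBranchRecord(depth=4)
for i in range(10):
    # Hypothetical addresses: branch i jumps from 0x400+i to 0x500+i.
    lbr.taken_branch(0x400 + i, 0x500 + i)

# Each sample yields several branch edges at once, which can be counted.
edge_counts = Counter(lbr.snapshot())
print(len(lbr.snapshot()), edge_counts.most_common(1))
```

Each sample contributes a short history of branch edges rather than a single instruction address, which gives the profile more information per sample.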
