Testing AutoFDO for Geant4, Nathalie Rauschmayr (IT-CF-FPP)

SLIDE 1

Testing AutoFDO for Geant4

Nathalie Rauschmayr

IT-CF-FPP With help from Benedikt Hegner and Shahzad Malik Muzaffar

1/29 Testing AutoFDO for Geant4 Nathalie Rauschmayr

SLIDE 5

Introduction

Idea: Autotuning

Compile → Run → Feedback → Compile (cycle)

The concept has existed for some time: Profile Guided Optimization (PGO)

SLIDE 6

Introduction

Profile Guided Optimization is useful for:

  • Code that contains many branches that are difficult to predict at compile time
  • Performance-sensitive code
  • Code that is run over and over again

SLIDE 7

Introduction

Profile Guided Optimization:

  • Uses profiling to improve runtime performance
  • Analyses code sections that are frequently executed
  • Based on the profiles, the compiler might change:
      • Inlining
      • Virtual Call Speculation
      • Register allocation
      • Basic Block Optimization
      • Function Layout
      • Conditional Branch Optimization
      • Dead Code Separation

SLIDE 8

Introduction

Two approaches for Profile Guided Optimization (PGO):

  • Modify binary (instrumentation)
  • Monitor unaltered binary (sampling with perf)
  • AutoFDO transforms perf-profiles into a format that can be used by gcc/clang for Feedback Directed Optimization (FDO)
  • Developed by Google: https://github.com/google/autofdo

SLIDE 10

Difference between sampling and instrumentation

Instrumentation based PGO:

  1. Instrumentation: gcc -fprofile-generate test.c -o test
  2. Run the instrumented binary (profile files: test.gcno, test.gcda)
  3. Recompile: gcc -fprofile-use test.c -o test
  4. Production environment

Disadvantages:

  • Tedious dual compilation
  • Produces a lot of small output files (in the case of Geant4: 1698 files, each smaller than 100 KB)
  • Cannot easily be run in a production environment
  • Instrumented binary might be significantly slower

SLIDE 11

Difference between sampling and instrumentation

Sampling Based FDO (AutoFDO):

  1. Create production binary
  2. Run production binary with perf
  3. Convert perf-profile
  4. Recompile with converted perf-profile

SLIDE 12

Difference between sampling and instrumentation

Sampling Based FDO (AutoFDO):

  1. Create production binary:
       gcc -O3 -ggdb -frecord-compilation-info-in-elf -D DEBUG test.c -o test
  2. Run production binary with perf:
       perf record -b -e cpu/event=0xc4,umask=0x20,name=br_inst_retired_near_taken,period=1000009/pp ./test
  3. Convert perf-profile:
       create_gcov --binary=./test --profile=perf.data --gcov=test.gcov -gcov_version=1
  4. Recompile with converted perf-profile:
       gcc -O3 -fauto-profile=test.gcov test.c -o test

SLIDE 13

Difference between sampling and instrumentation

AutoFDO compared to instrumentation based PGO:

  • Profile data can be obtained in production environment
  • Works on optimized builds
  • It provides a tool to merge profiles from multiple runs
  • Only one output file per run

SLIDE 14

General Caveats

  • The sample needs to be representative of the typical usage scenarios
  • Otherwise, PGO could possibly slow down performance
  • Many profiles and runs are needed
  • Unbiased branches

SLIDE 15

Testcases

Applications:

  • CMS Detector Simulation (FullCMS)
  • Simulation step of CMSSW using static build of Geant4 (cmsRun)

Input data/workflow needs to be representative:

  • How many events needed as training data?
  • What if job configuration changes?
  • What if job type changes?

SLIDE 16

Testcases

Training data     Run              Number of Events
FullCMS run       FullCMS run      100, 500, 1k
cmsRun config1    cmsRun config1   20, 50, 100
cmsRun config1    cmsRun config2   20, 50, 100
cmsRun config2    cmsRun config2   20, 50, 100
FullCMS run       cmsRun config2   1k

FullCMS: Geant4 example with particle gun
cmsRun config1: TTbar event generation and simulation (CMSSW_7_3_1)
cmsRun config2: Wjets event generation and simulation (CMSSW_7_3_1)

SLIDE 17

CMS Full Detector Simulation

Training data              Run       Number of Events
FullCMS run, 100 events    FullCMS   100, 500, 1k
FullCMS run, 500 events    FullCMS   100, 500, 1k
FullCMS run, 1k events     FullCMS   100, 500, 1k

[Figure: runtime in s, two panels (processing 100 events, processing 500 events); bars per panel: Normal, AutoFDO 100 events, AutoFDO 500 events, AutoFDO 1000 events. Bar labels in reading order: -8.9%, -10.4%, -11.5%, -9.8%, -9.5%, -10.2%]
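
The percentage labels in these plots read as runtime changes relative to the normal build. The slides do not state the formula, so the following is only a sketch under that assumption; the function name and the runtimes are made up for illustration and are not measurements from the talk.

```python
def runtime_change_percent(t_normal: float, t_autofdo: float) -> float:
    """Relative runtime change of the AutoFDO build versus the normal build.

    Negative values mean the AutoFDO build is faster.
    Assumed definition: (t_autofdo - t_normal) / t_normal * 100.
    """
    if t_normal <= 0:
        raise ValueError("baseline runtime must be positive")
    return (t_autofdo - t_normal) / t_normal * 100.0


# Hypothetical runtimes in seconds (not taken from the slides).
change = runtime_change_percent(160.0, 144.0)
print(f"{change:+.1f}%")
```

With these made-up inputs the AutoFDO build needs 16 s less than a 160 s baseline, which the function reports as a 10% reduction.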

SLIDE 18

CMS Full Detector Simulation

[Figure: runtime in s for processing 1000 events; bars: Normal, AutoFDO 100 events, AutoFDO 500 events, AutoFDO 1000 events. Bar labels in reading order: -10.3%, -10.7%, -11.4%]

SLIDE 19

Simulation step of CMSSW using BigProducts

Used CMSSW_7_3_1:

  • SLC6, kernel 3.16
  • gcc 4.8
  • It uses BigProducts by default (developed by Shahzad)
  • pluginSimulation.so: linked against static Geant4 libraries
  • Obtain the perf-profile for cmsRun, but then optimize only pluginSimulation.so

Testcase: TTbar

  • Step 1: Event generation and simulation
  • 20, 50, 100 events

SLIDE 20

Simulation step of CMSSW using BigProducts

Training data                 Run              Number of Events
cmsRun config1, 20 events     cmsRun config1   20, 50, 100
cmsRun config1, 50 events     cmsRun config1   20, 50, 100
cmsRun config1, 100 events    cmsRun config1   20, 50, 100

[Figure: runtime in s, two panels (processing 20 events, processing 50 events); bars per panel: Normal, AutoFDO 20 events, AutoFDO 50 events, AutoFDO 100 events. Bar labels in reading order: -7.1%, -7.8%, -8.4%, -6.1%, -6.5%, -7.0%]

SLIDE 21

Simulation step of CMSSW using BigProducts

[Figure: runtime in s for processing 100 events; bars: Normal, AutoFDO 20 events, AutoFDO 50 events, AutoFDO 100 events. Bar labels in reading order: -7.4%, -6.5%, -7.4%]

SLIDE 22

Simulation step of CMSSW using BigProducts

cmsRun config2: took Pythia configurations from Wjet_Pt_3000_3500_14TeV_cfi.py in CMSSW_8_1_X

Training data                 Run              Number of Events
cmsRun config1, 100 events    cmsRun config2   20, 50, 100

[Figure: runtime in s, three panels (processing 20, 50 and 100 events); bars per panel: Normal, AutoFDO 100 events. Bar labels in reading order: -8.9%, -12.5%, -11.9%]

SLIDE 23

Simulation step of CMSSW using BigProducts

Training data                 Run              Number of Events
cmsRun config1, 100 events    cmsRun config2   20, 50, 100
cmsRun config2, 100 events    cmsRun config2   20, 50, 100

[Figure: runtime in s, three panels (processing 20, 50 and 100 events); bars per panel: Normal and two AutoFDO 100 events bars, one per training data set. Bar labels in reading order: -8.9%, -12.5%, -11.9%, -7.3%, -9.2%, -8.5%]

SLIDE 24

Simulation step of CMSSW using BigProducts

Training data          Run                  Number of Events
fullcms, 100 events    cmsRun job config2   20, 50, 100

[Figure: runtime in s, three panels (processing 20, 50 and 100 events); bars per panel: Normal, AutoFDO 100 events. Bar labels in reading order: -3.8%, -4.8%, -5.1%]

SLIDE 26

AutoFDO gives useful insight

Google-gcc provides the flag -fcheck-branch-annotation:

  objcopy -O binary --set-section-flags .gnu.switches.text.branch.annotation=alloc -j .gnu.switches.text.branch.annotation libG4processes.so libAnnotated

Example Output:

  G4EnhancedVecAllocator.hh;122;146;0;10000;9550;d9a18bb69d5efaf3d9068625ec56d66a
  G4EnhancedVecAllocator.hh;137;8389;0;225;450;6a740d527b3f213d4868919fc7d9710c
  G4EnhancedVecAllocator.hh;135;8389;0;10000;9550;a17d8feb82daee40febb118864576dc9

  • File
  • Line
  • Basic block count
  • Annotated
  • Measured branch probability
  • Assumed branch probability
  • Hash value
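
The semicolon-separated records in the example output can be split into the seven fields listed above. The sketch below assumes the fields appear in the listed order and that the probabilities are expressed out of 10000 (the largest value in the example). The class and function names are hypothetical and not part of AutoFDO or gcc.

```python
from dataclasses import dataclass

# Assumption: probabilities are reported on a 0..10000 scale.
PROB_SCALE = 10000


@dataclass
class BranchAnnotation:
    file: str
    line: int
    bb_count: int
    annotated: int
    measured_prob: float   # from the profile
    assumed_prob: float    # compiler's estimate
    hash_value: str


def parse_annotation(record: str) -> BranchAnnotation:
    """Parse one 'file;line;count;annotated;measured;assumed;hash' record."""
    fields = record.strip().split(";")
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {record!r}")
    file, line, count, annotated, measured, assumed, digest = fields
    return BranchAnnotation(
        file=file,
        line=int(line),
        bb_count=int(count),
        annotated=int(annotated),
        measured_prob=int(measured) / PROB_SCALE,
        assumed_prob=int(assumed) / PROB_SCALE,
        hash_value=digest,
    )


# The three records shown in the example output.
records = [
    "G4EnhancedVecAllocator.hh;122;146;0;10000;9550;d9a18bb69d5efaf3d9068625ec56d66a",
    "G4EnhancedVecAllocator.hh;137;8389;0;225;450;6a740d527b3f213d4868919fc7d9710c",
    "G4EnhancedVecAllocator.hh;135;8389;0;10000;9550;a17d8feb82daee40febb118864576dc9",
]

for rec in records:
    a = parse_annotation(rec)
    gap = a.measured_prob - a.assumed_prob
    print(f"{a.file}:{a.line} count={a.bb_count} "
          f"measured={a.measured_prob:.4f} assumed={a.assumed_prob:.4f} gap={gap:+.4f}")
```

Sorting such records by the gap between measured and assumed probability is one way to find the branches where the compiler's heuristics are furthest off.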

SLIDE 27

AutoFDO gives useful insight

[Figure: basic block counts (axis scaled by 10^4) per source location, for profiles from 20, 50 and 100 events. Source locations on the x-axis: G4PhotoNuclearCrossSection.cc:165, G4NucleiModel.cc:1332, stl_uninitialized.h:74, G4hBremsstrahlungModel.cc:87, G4CookPairingCorrections.hh:56, vector.tcc:184, G4MuBremsstrahlungModel.cc:3, G4InuclParticle.hh:83, G4ElectroNuclearCrossSection.cc:2327, locale_facets.h:867, G4EmCorrections.cc:391, G4VEnergyLossProcess.cc:199, G4FastVector.hh:68, G4VEnergyLossProcess.cc:1412, G4VEnergyLossProcess.cc:1143, G4VEnergyLossProcess.cc:1142, G4UniversalFluctuation.cc:224, G4UniversalFluctuation.cc:213, G4Poisson.hh:57, G4ProcessManager.cc:273, G4Transportation.cc:731]

SLIDE 28

Compiler heuristics are not always accurate for branch probabilities

[Figure: branch probability (axis scaled by 10^4) per source location, for profiles from 20, 50 and 100 events and for the compiler's estimate without profile. Source locations on the x-axis: G4PhotoNuclearCrossSection.cc:165, G4NucleiModel.cc:1332, stl_uninitialized.h:74, G4hBremsstrahlungModel.cc:87, G4CookPairingCorrections.hh:56, vector.tcc:184, G4MuBremsstrahlungModel.cc:3, G4InuclParticle.hh:83, G4ElectroNuclearCrossSection.cc:2327, locale_facets.h:867, G4EmCorrections.cc:391, G4VEnergyLossProcess.cc:199, G4FastVector.hh:68, G4VEnergyLossProcess.cc:1412, G4VEnergyLossProcess.cc:1143, G4VEnergyLossProcess.cc:1142, G4UniversalFluctuation.cc:224, G4UniversalFluctuation.cc:213, G4Poisson.hh:57, G4ProcessManager.cc:273, G4Transportation.cc:731]

SLIDE 29

Problems encountered

  • Google gcc 4.8 has been merged into the gcc main branch, but standard gcc is missing the flag -frecord-compilation-info-in-elf
  • The flag creates a new section: object files may have distinct compile options, and the new section preserves the compile options for every object file.

>>> readelf -S -W libG4tracking.so | less
There are 37 section headers, starting at offset 0x615a8:

Section Headers:
  [Nr] Name                              Type      Address           Off     Size    ES Flg Lk Inf Al
  [ 0]                                   NULL      0000000000000000  000000  000000  00
  [ 1] .hash                             HASH      0000000000000190  000190  0010ec  04 A 2 8
  [...]
  [26] .gnu.switches.text.quote_paths    PROGBITS  0000000000000000  051440  0006bb  00 1
  [27] .gnu.switches.text.bracket_paths  PROGBITS  0000000000000000  051afb  007c71  00 1
  [28] .gnu.switches.text.system_paths   PROGBITS  0000000000000000  05976c  003330  00 1
  [29] .gnu.switches.text.cpp_defines    PROGBITS  0000000000000000  05ca9c  00117e  00 1
  [30] .gnu.switches.text.cpp_includes   PROGBITS  0000000000000000  05dc1a  0006bb  00 1
  [31] .gnu.switches.text.cl_args        PROGBITS  0000000000000000  05e2d5  0029b0  00 1
  [32] .gnu.switches.text.lipo_info      PROGBITS  0000000000000000  060c85  0006f4  00 1

SLIDE 30

Problems encountered

  • The information in .gnu.switches.text is used to build the module map
  • create_gcov dumps the module map to a file (ending with .imports)

>>> head output.gcov.imports
/data/geant4.10.01.p03/source/event/src/G4EventManager.cc: /data/geant4.10.01.p03/
/data/geant4.10.01.p03/source/event/src/G4SmartTrackStack.cc: /data/geant4.10.01.p03/
/data/geant4.10.01.p03/source/event/src/G4StackManager.cc: /data/geant4.10.01.p03/
/data/geant4.10.01.p03/source/externals/clhep/src/Evaluator.cc: /data/geant4.10.01.p03/
/data/geant4.10.01.p03/source/externals/clhep/src/LorentzRotation.cc
/data/geant4.10.01.p03/source/externals/clhep/src/LorentzVector.cc
/data/geant4.10.01.p03/source/externals/clhep/src/LorentzVectorL.cc

SLIDE 31

Problems encountered

  • Apart from the module map, AutoFDO creates a symbol map
  • The symbol map does not contain symbols coming from shared libraries
  • This limits the usability:
      • Statically linked libraries are required
      • Optimize only the library causing the largest hotspots
  • .icc files are not recognized (fixed)
  • Recent versions of perf could be problematic because the data format is different
  • Works best when sampling the Last Branch Records (LBR)
      • LBR: collection of register pairs that store the source and destination addresses of recently executed branches (currently only Intel CPUs)
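
To illustrate why (source, destination) pairs are useful as profile input, the toy sketch below counts how often each pair occurs across a set of samples. It is not AutoFDO code: the function name, the sample layout and the addresses are invented for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

BranchRecord = Tuple[int, int]  # (source address, destination address)


def edge_counts(samples: Iterable[List[BranchRecord]]) -> Counter:
    """Count how often each (source, destination) pair was observed.

    Each sample stands for one snapshot of the branch record stack.
    """
    counts: Counter = Counter()
    for sample in samples:
        for src, dst in sample:
            counts[(src, dst)] += 1
    return counts


# Made-up addresses; real samples would come from perf.
samples = [
    [(0x400A10, 0x400A40), (0x400A58, 0x400A10)],
    [(0x400A10, 0x400A40), (0x400A10, 0x400A80)],
]

for (src, dst), n in edge_counts(samples).most_common():
    print(f"{src:#x} -> {dst:#x}: {n}")
```

Pairs that show up often mark hot control-flow edges, which is the kind of information a compiler can use for layout and branch decisions.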

SLIDE 32

Summary

  • Quite decent speedups: 5-13%
  • Some fixes needed (shared libraries, gcc-flag)
  • Stable against change of simulation scenario
  • Easy deployment:
      1. Start perf together with the job
      2. Gather profiles
      3. Convert and merge profiles
      4. Add compiler flag in CMake scripts
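
The deployment steps above could be scripted roughly as follows. This is a sketch that only builds the command lines, reusing the perf event and the create_gcov options shown earlier in the talk. The helper names are hypothetical, passing the flag through CMAKE_CXX_FLAGS is an assumption about how a project might wire it in, and merging of several profiles is left out because the talk does not show the merge command.

```python
import shlex
from typing import List

# Raw LBR branch event as used in the perf command earlier in the talk.
LBR_EVENT = ("cpu/event=0xc4,umask=0x20,"
             "name=br_inst_retired_near_taken,period=1000009/pp")


def perf_record_cmd(job: List[str], perf_data: str = "perf.data") -> List[str]:
    """Step 1: start perf together with the job (-b samples branch records)."""
    return ["perf", "record", "-b", "-e", LBR_EVENT, "-o", perf_data, "--", *job]


def create_gcov_cmd(binary: str, perf_data: str, gcov_file: str) -> List[str]:
    """Step 3: convert a perf profile into a gcov file readable by gcc."""
    return ["create_gcov", f"--binary={binary}", f"--profile={perf_data}",
            f"--gcov={gcov_file}", "-gcov_version=1"]


def cmake_configure_cmd(source_dir: str, gcov_file: str) -> List[str]:
    """Step 4: pass the compiler flag to a CMake build (assumed wiring)."""
    flags = f"-O3 -fauto-profile={gcov_file}"
    return ["cmake", f"-DCMAKE_CXX_FLAGS={flags}", source_dir]


if __name__ == "__main__":
    for cmd in (
        perf_record_cmd(["./test"]),
        create_gcov_cmd("./test", "perf.data", "test.gcov"),
        cmake_configure_cmd(".", "test.gcov"),
    ):
        # Print instead of executing; subprocess.run(cmd, check=True) would run it.
        print(shlex.join(cmd))
```

Building argument lists rather than shell strings keeps quoting explicit, which matters for the perf event specification with its commas and slashes.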
