Practical Experience with Performance Monitoring

Ryszard Jurga, CERN openlab, March 29, 2006

Agenda
• Introduction
  – perfctr
  – Pentium 4/Xeon
• Monitoring tool
  – sampling, multiplexing
• Sample measurements
  – Geant4 (test40), Atlas Simulation, make
  – lxbatch
• Applications
  – Profiling
• Conclusions

Introduction
• Special on-chip hardware of modern CPUs
  – Direct access to CPU resources such as branch prediction, data and instruction caches, floating-point instructions, memory operations
  – Event detectors, counters
  – Itanium2: 4 counters, 100+ monitorable events, two sets of registers: PMC, PMD
  – Pentium 4, Xeon: 44 event detectors, 18 counters
• Linux interfaces and libraries
  – Part of the kernel, in order to allow per-thread and per-system measurements
  – Perfmon2
    – uniform across all hardware platforms
    – event multiplexing
    – very few processors are fully supported apart from Itanium
    – kernel 2.6 (integrated for Itanium)
  – perfctr

perfctr
• version 2.6.19
  – per-thread and system-wide measurements
  – user and kernel domain
  – support for many CPUs (Pentium MMX/Pro/II/III/4, Xeon, Celeron, …), no support for Itanium
  – kernels 2.4 and 2.6
  – no multiplexing
  – almost no documentation apart from comments in the source files
  – requires a deep understanding of the performance monitoring features of each processor

Pentium 4 Performance Monitoring Features
• 44 event detectors, 9 pairs of counters
• 2 control registers (ESCR, CCCR)
• 2 classes of events:
  – Non-retirement events: those, per the Intel documentation, that occur at any time during execution (1 counter)
  – At-retirement events: those that occurred on the execution path and whose results were committed to architectural state (1 or 2 counters)
• multiplexing
[Figure from B. Sprunt, "Pentium 4 Performance-Monitoring Features"]

Monitoring tool - gpfmon
• uses perfctr
• enables multiplexing
• user and kernel domain
• per single or total CPU
• events:
  CYC – CPU cycles
  TOT – instructions completed
  BR_TP – branch taken predicted
  BR_TM – branch taken mispredicted
  L2LM – L2 load missed
  L2SM – L2 store missed
  FP – floating-point instructions
  SDS – scalar instructions
  LD – load instructions
  ST – store instructions
  BR – BR_TP + BR_TM
  LDST – LD + ST
• event sets:
  CYC TOT BR_TP BR_TM L2LM L2SM
  CYC TOT L2SM FP LD L2LM
  CYC TOT L2LM L2SM SDS ST
  CYC TOT L2LM L2SM LDST BR
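The slide says gpfmon adds multiplexing on top of perfctr but does not describe how the event sets are scheduled. The sketch below assumes the simplest policy, a round-robin rotation with one set active per sampling period, and counts how often each event is actually being measured. The function names `schedule` and `active_periods` are hypothetical and are not part of gpfmon or perfctr.

```python
from itertools import cycle, islice

# Event sets as listed on the slide (names are used only as labels here).
EVENT_SETS = [
    ["CYC", "TOT", "BR_TP", "BR_TM", "L2LM", "L2SM"],
    ["CYC", "TOT", "L2SM", "FP", "LD", "L2LM"],
    ["CYC", "TOT", "L2LM", "L2SM", "SDS", "ST"],
    ["CYC", "TOT", "L2LM", "L2SM", "LDST", "BR"],
]


def schedule(event_sets, periods):
    """Return the event set active in each sampling period.

    Assumption: plain round-robin, one set per period.
    """
    return list(islice(cycle(event_sets), periods))


def active_periods(event_sets, periods):
    """Count in how many periods each event is being measured."""
    counts = {}
    for active in schedule(event_sets, periods):
        for event in active:
            counts[event] = counts.get(event, 0) + 1
    return counts


if __name__ == "__main__":
    # 4 sets rotated 3 times gives 12 sampling periods.
    for event, n in sorted(active_periods(EVENT_SETS, 12).items()):
        print(f"{event:6s} measured in {n} of 12 periods")
```

Under this assumed policy, events present in every set (CYC, TOT, L2LM, L2SM) are observed in all periods, while the others are observed in only a quarter of them.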

Sw sampling vs. perfctr sampling
• test40
  – 4 sets, 3 times, sampling period (sp) 1 s
  – 1, 2 jobs
  – 3 jobs
• event set: BR_TP BR_TM CYC TOT
• sampling error:

$$\text{sampling error \%} = \frac{1}{n} \sum \frac{X_{WS} - X_{S}}{X_{WS}} \cdot 100\%$$

  X_WS – the value of the counter without sw sampling
  X_S – the value of the counter with sw sampling
  n – the number of collected samples

Collected samples %
  Event   2jobs   1job
  CYC     98.88   98.52
  TOT     99.00   98.9
  BR_TP   99.06   99.09
  BR_TM   97.05   94.31
  FP      99.08   98.87
  LD      99.03   98.84
  ST      99.49   98.97
  L2LM    99.71   97.45
  L2SM    51.51   10

[Figure: sampling error % for TOT, BR_TM, BR_TP, FP, LD, ST; series 1job_av, 2job_av, 1job_max, 2jobs_max]
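The sampling-error definition on this slide can be worked through in a few lines. The sketch below makes two assumptions that the slide does not confirm: that the values are compared sample by sample, and that the absolute deviation is taken so that errors of opposite sign do not cancel. The function name `sampling_error_percent` and the input values in the usage note are illustrative only.

```python
def sampling_error_percent(without_sampling, with_sampling):
    """Return (average %, max %) of the per-sample relative deviation.

    without_sampling: counter values X_WS (without sw sampling)
    with_sampling:    counter values X_S  (with sw sampling)
    Assumption: samples are paired by index and the absolute
    deviation is used.
    """
    if not without_sampling or len(without_sampling) != len(with_sampling):
        raise ValueError("need two non-empty sequences of equal length")
    errors = [
        abs(x_ws - x_s) / x_ws * 100.0
        for x_ws, x_s in zip(without_sampling, with_sampling)
    ]
    n = len(errors)  # number of collected samples
    return sum(errors) / n, max(errors)


if __name__ == "__main__":
    # Made-up counter values, for illustration only.
    avg, worst = sampling_error_percent([100, 200], [99, 204])
    print(f"average {avg:.2f}%  max {worst:.2f}%")
```

Returning both the average and the maximum mirrors the "average %" and "max %" columns reported in the deck.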

Sw sampling vs. perfctr sampling
• 3 jobs

  CYC    1170442842782   1528876572499
  TOT    910449938595    910053742885
  FP     141095332033    141023149439
  LOAD   340282127317    340126643068
  L2LM   36364751649     36374788916
  L2LS   7802195         10010569

[Figures: time series over 0–600 s and bar chart of CYC, TOT, FP, LD, L2LM; annotations 420 s, 540 s, 31%, 28%]

Multiplexing
• test40
  – 4 sets, 3 times, sampling period (sp) 1 s
  – 1, 2 jobs
• event sets:
  CYC TOT BR_TP BR_TM L2LM L2SM
  CYC TOT L2SM FP LD L2LM
  CYC TOT L2LM L2SM SDS ST
  CYC TOT L2LM L2SM LDST BR
• error:

$$\text{error \%} = \frac{1}{n} \sum \frac{X_{WS} - X_{S}}{X_{WS}} \cdot 100\%$$

  X_WS – the value of the counter without sw sampling
  X_S – the value of the counter with sw sampling
  n – the number of collected samples

Collected samples %
  Event   2jobs   1job
  CYC     98.75   98.09
  TOT     98.86   86.01
  BR_TP   99.69   99.51
  BR_TM   98.73   96.84
  FP      98.82   98.63
  LD      99.07   98.89
  ST      98.89   98.29
  L2LM    99.43   86.05
  L2LS    46.58   8.49

Error
          1job                 2jobs
  Event   average %   max %    average %   max %
  TOT     0.12        1.38     0.19        16.65
  BR_TP   0.07        5.64     0.12        5.48
  BR_TM   0.08        11.85    0.13        11.49
  FP      0.10        0.98     0.15        1.12
  LD      0.10        3.14     0.16        3.52
  ST      0.09        4.55     0.15        4.45
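When an event is measured only during part of the run, its total has to be estimated. The slide does not say how gpfmon does this; the sketch below assumes the common approach of scaling the observed count by the fraction of sampling periods in which the event was active. The function name `extrapolate` and the numbers in the usage note are hypothetical.

```python
def extrapolate(observed_count, active_periods, total_periods):
    """Estimate a full-run count from a time-multiplexed measurement.

    Assumption: the event rate is roughly the same in the periods
    where the event was not being counted.
    """
    if not 0 < active_periods <= total_periods:
        raise ValueError("active_periods must be in 1..total_periods")
    return observed_count * total_periods / active_periods


if __name__ == "__main__":
    # An event counted in 3 of 12 periods (one of 4 sets, rotated 3 times).
    print(extrapolate(250, 3, 12))
```

The estimate is only as good as the assumption of a steady event rate, so bursty events measured in few periods can show large maximum errors even when the average error is small.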

test40: total instructions
[Figures: time series over 0–300 s for 1 job and 2 jobs, curves CPU1, CPU2, CPU1+CPU2, annotation 270 s; time series over 0–600 s for 3 jobs, curves CPU1, CPU2, annotations 420 s and 540 s]

Geant4 Atlas Simulations

Total instructions
  Cycles       16067552642403
  Total inst   2216977123726
  INS/CYC      0.138
[Figure: total instructions/cycle over time, x axis 0–6000]

Floating-point instructions
  FP       402251034688
  FP/TOT   18.14%
  FP/CYC   0.025
[Figure: FP/cycle over time, x axis 0–6000]

Geant4 Atlas Simulations

Memory (annotation: 63%)
  LD        848049506780
  LD/TOT    38.25%
  LD/CYC    0.053
  L2LM      61010720039
  L2LM/LD   7.19%
  ST        548061694948
  ST/TOT    24.72%
  ST/CYC    0.034
  L2SM      737751425
  L2SM/ST   0.135%
[Figures: LD/cycle, L2LM/cycle, ST/cycle, L2SM/cycle over time, x axis 0–6000]

Geant4 Atlas Simulations

Branches (annotation: 10%)
  BR_TP       218342330220
  BR_TM       5964007356
  BR_TP/TOT   9.85%
  BR_TM/TOT   0.269%
[Figures: branches taken predicted/cycle and branches taken mispredicted/cycle over time, x axis 0–6000]
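The derived figures on the Atlas simulation slides are ratios of the raw counter totals. As a quick cross-check, the sketch below recomputes them from the totals quoted on the slides; the dictionary layout and the helper name `ratio` are illustrative choices, not part of any tool mentioned in the talk.

```python
# Raw counter totals quoted on the Geant4 Atlas simulation slides.
ATLAS = {
    "CYC": 16067552642403,
    "TOT": 2216977123726,
    "FP": 402251034688,
    "LD": 848049506780,
    "ST": 548061694948,
    "L2LM": 61010720039,
    "L2SM": 737751425,
    "BR_TP": 218342330220,
    "BR_TM": 5964007356,
}


def ratio(counters, numerator, denominator):
    """Return counters[numerator] / counters[denominator] as a float."""
    return counters[numerator] / counters[denominator]


if __name__ == "__main__":
    print(f"INS/CYC   {ratio(ATLAS, 'TOT', 'CYC'):.3f}")
    for num, den in [("FP", "TOT"), ("LD", "TOT"), ("ST", "TOT"),
                     ("L2LM", "LD"), ("L2SM", "ST"),
                     ("BR_TP", "TOT"), ("BR_TM", "TOT")]:
        print(f"{num}/{den:4s} {100 * ratio(ATLAS, num, den):.3f}%")
```

Loads and stores together come to roughly 63% of all instructions, and taken branches to roughly 10%, which is consistent with the percentage annotations on the memory and branch slides.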

make

                      make -j1        make -j2
Total instructions
  CYC                 1328309944643   673216187945
  TOT                 586734515764    586734515764
  INS/CYC             0.44            0.87
Load instructions
  LD                  193925962348    192317045567
  LD/TOT              33.1%           32.7%
  LD/CYC              0.146           0.286

Annotation: 97%
[Figures: total instructions/cycle and LD/cycle over time in s, for make -j1 (0–500 s) and make -j2 (0–300 s)]
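The make comparison can be checked with simple arithmetic on the cycle and instruction totals from the slide. The sketch below recomputes INS/CYC for both runs and the relative gain in cycles; reading the 97% annotation as this gain is an interpretation, not something the slide states. The variable and function names are illustrative.

```python
# Totals quoted on the slide for the two builds.
CYC_J1 = 1328309944643
CYC_J2 = 673216187945
TOT = 586734515764  # the slide gives the same instruction total for both runs


def ins_per_cycle(instructions, cycles):
    """Instructions completed per CPU cycle."""
    return instructions / cycles


def gain_percent(cycles_before, cycles_after):
    """Relative improvement, in percent, from needing fewer cycles."""
    return (cycles_before / cycles_after - 1.0) * 100.0


if __name__ == "__main__":
    print(f"INS/CYC -j1: {ins_per_cycle(TOT, CYC_J1):.2f}")
    print(f"INS/CYC -j2: {ins_per_cycle(TOT, CYC_J2):.2f}")
    print(f"gain:        {gain_percent(CYC_J1, CYC_J2):.0f}%")
```

With the same instruction total in both runs, the gain in cycles and the gain in INS/CYC are the same number by construction.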

lxbatch monitoring
• 14 machines
• running from 2 days to 2 weeks
• Nocona (10), Irwindale (4)
• 2.8 GHz
• 1 MB L2 (10), 2 MB L2 (4)
• SL3 (kernel 2.4)
[Figure: cycles per machine, scale 0 to 3.5E+15; machines avr, 5401, 5403, 5404, 5405, 5406, 5501, 5502, 5504, 5505, 5509, 6104, 6105, 6106, 6108]

lxbatch
[Figure: instructions/cycle per machine (avr, 5401–6108), scale 0 to 1]
[Figure: float/total [%] per machine (avr, 5401–6108), scale 0 to 25]

lxbatch - memory operations
[Figure: Load+Store/total [%] per machine (avr, 5401, 5403, 5404, 5405, 5406, 5501, 5502, 5504, 5505, 5509, 6104, 6105, 6106, 6108), scale 0 to 70]

lxbatch - memory operations
[Figures per machine (avr, 5401–6108): Load/total [%], scale 0 to 45; L2LM/LD [%], scale 0 to 12; ST/total [%], scale 0 to 25; L2SM/ST [%], scale 0 to 20]
