Practical Experience with performance monitoring
Ryszard Jurga CERN openlab
March 29, 2006
PMD
– uniform across all hardware platforms
– event multiplexing
– very few processors are fully supported, apart from Itanium
– kernel 2.6 (integrated for Itanium)
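Event multiplexing means that an event is counted only during part of the run, so its raw count has to be extrapolated. The sketch below shows the usual time-based scaling. The function and the parameter names `time_enabled` and `time_running` are hypothetical and are not taken from the perfmon2 interface.

```python
def scale_multiplexed(raw_count, time_enabled, time_running):
    """Estimate the full-run count of an event that shared a counter.

    raw_count    -- events counted while the event was on a hardware counter
    time_enabled -- total time the event was requested (hypothetical name)
    time_running -- time the event was actually being counted (hypothetical name)
    """
    if time_running <= 0:
        raise ValueError("the event was never scheduled on a counter")
    # Assumes the event rate is roughly constant over the whole run.
    return raw_count * time_enabled / time_running


# Example: an event that was counted for a quarter of a 4-second run.
estimate = scale_multiplexed(1000, time_enabled=4.0, time_running=1.0)
```

The estimate is only as good as the constant-rate assumption, so short runs or bursty events can give larger errors.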
from B. Sprunt “Pentium 4 Performance-Monitoring Features”
from Intel documentation
[Figure: event sets: (CYC, TOT, BR_TP, BR_TM); (CYC, TOT, FP, LD); (CYC, TOT, SDS, ST); (CYC, TOT, LDST, BR); L2SM; L2LM]
CYC – CPU cycles
TOT – Instructions completed
BR_TP – Branch taken predicted
BR_TM – Branch taken mispredicted
L2LM – L2 load missed
L2SM – L2 store missed
FP – Floating point instructions
SDS – scalar instructions
LD – load instructions
ST – store instructions
BR – BR_TP + BR_TM
LDST – LD + ST
[Figure: sampling error (%) per event (TOT, BR_TP, BR_TM, FP, LD, ST); series: 1job_av, 2job_av, 1job_max, 2jobs_max]

Collected samples (%)
          CYC    TOT    BR_TP  BR_TM  FP     LD     ST     L2LM   L2SM
1 job     98.88  99.00  99.06  97.05  99.08  99.03  99.49  99.71  51.51
2 jobs    98.52  98.9   99.09  94.31  98.87  98.84  98.97  97.45  10

[Figure: event set CYC, TOT, BR_TP, BR_TM; further events FP, LD]
[Equation: sampling error (%), expressed in terms of X_WS, X_S and n]
n – the number of collected samples
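As an illustration only, the sketch below computes a percentage error as a plain relative difference between two counts. It assumes that X_WS is a count taken without sampling and X_S the same count taken with sampling. Both readings of the subscripts are assumptions, and the role of n in the slide's formula is not modelled here.

```python
def sampling_error_percent(x_ws, x_s):
    """Relative difference between two counts, in percent.

    x_ws -- reference count (assumed: measured without sampling)
    x_s  -- count to compare (assumed: measured with sampling)
    """
    if x_ws == 0:
        raise ValueError("the reference count must be non-zero")
    return abs(x_ws - x_s) / x_ws * 100.0


# Hypothetical counts, not taken from the slides.
error = sampling_error_percent(1000, 990)
```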
[Figure: event counts for CYC, TOT, FP, LD, L2LM]

Event counts
CYC              TOT            FP             LOAD           L2LM          L2LS
1170442842782    910449938595   141095332033   340282127317   36364751649   7802195
1528876572499    910053742885   141023149439   340126643068   36374788916   10010569
Collected samples (%)
          CYC    TOT    BR_TP  BR_TM  FP     LD     ST     L2LM   L2LS
1 job     98.75  98.86  99.69  98.73  98.82  99.07  98.89  99.43  46.58
2 jobs    98.09  86.01  99.51  96.84  98.63  98.89  98.29  86.05  8.49

[Figure: event sets: (CYC, TOT, BR_TP, BR_TM); (CYC, TOT, FP, LD); (CYC, TOT, SDS, ST); (CYC, TOT, LDST, BR); L2SM; L2LM]

Sampling error (%)
         1 job               2 jobs
Event    average   max       average   max
TOT      0.12      1.38      0.19      16.65
BR_TP    0.07      5.64      0.12      5.48
BR_TM    0.08      11.85     0.13      11.49
FP       0.10      0.98      0.15      1.12
LD       0.10      3.14      0.16      3.52
ST       0.09      4.55      0.15      4.45
[Equation: sampling error (%), expressed in terms of X_WS, X_S and n]
n – the number of collected samples
[Figures: four time-series plots; series: CPU1, CPU2 and CPU1+CPU2 (labelled Series3 in three of the plots)]
Cycles       16067552642403
Total inst   2216977123726    INS/CYC 0.138
FP           402251034688     FP/CYC 0.025    FP/TOT 18.14%

[Figure: Total instructions/cycle]
[Figure: FP/cycle]
LD     848049506780   LD/TOT 38.25%   LD/CYC 0.053
ST     548061694948   ST/TOT 24.72%   ST/CYC 0.034
L2LM   61010720039    L2LM/LD 7.19%
L2SM   737751425      L2SM/ST 0.135%

[Figure: LD/cycle]
[Figure: L2LM/cycle]
[Figure: L2SM/cycle]
[Figure: ST/cycle]
BR_TP   218342330220   BR_TP/TOT 9.85%
BR_TM   5964007356     BR_TM/TOT 0.269%

[Figure: Branches taken predicted/cycle]
[Figure: Branches taken mispredicted/cycle]
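The ratios quoted with the counts above (INS/CYC, FP/TOT, L2LM/LD and so on) are simple quotients of the raw event counts. A minimal sketch that recomputes them from the counts on the slides follows. The dictionary layout and the helper name `ratio` are illustrative choices, not part of any monitoring tool.

```python
# Raw event counts copied from the slides.
counts = {
    "CYC": 16067552642403,
    "TOT": 2216977123726,
    "FP": 402251034688,
    "LD": 848049506780,
    "ST": 548061694948,
    "L2LM": 61010720039,
    "L2SM": 737751425,
    "BR_TP": 218342330220,
    "BR_TM": 5964007356,
}


def ratio(numerator, denominator):
    """Quotient of two event counts, looked up by event name."""
    return counts[numerator] / counts[denominator]


# Per-cycle rates.
per_cycle = {name: ratio(name, "CYC") for name in ("TOT", "FP", "LD", "ST")}

# Shares of all completed instructions, in percent.
share_of_total = {
    name: 100.0 * ratio(name, "TOT")
    for name in ("FP", "LD", "ST", "BR_TP", "BR_TM")
}

# L2 miss ratios, in percent of loads and stores respectively.
l2_load_miss_pct = 100.0 * ratio("L2LM", "LD")
l2_store_miss_pct = 100.0 * ratio("L2SM", "ST")

for name, value in per_cycle.items():
    print(f"{name}/CYC = {value:.3f}")
for name, value in share_of_total.items():
    print(f"{name}/TOT = {value:.3f}%")
print(f"L2LM/LD = {l2_load_miss_pct:.2f}%")
print(f"L2SM/ST = {l2_store_miss_pct:.3f}%")
```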
[Figure: Total instructions/cycle (INS/CYC) over time, in s]
[Figure: LD/cycle]

CYC       1328309944643   673216187945
TOT       586734515764    586734515764
INS/CYC   0.44            0.87
LD        193925962348    192317045567
LD/TOT    33.1%           32.7%
LD/CYC    0.146           0.286

[Figure: Total instructions/cycle (INS/CYC) over time, in s; series: make -j1, make -j2]
[Figure: LD/cycle]

[Figures: bar charts with x-axis labels 5401, 5403, 5404, 5405, 5406, 5501, 5502, 5504, 5505, 5509, 6104, 6105, 6106, 6108, avr. Panels: cycles; Instructions/cycle; Float/total [%]; Load/total [%]; L2LM/LD [%]; ST/total [%]; L2SM/ST [%]; branches taken predicted/total [%]; branches/total [%]; branches taken mispredicted/total [%]]
Profile Information
==============================================
Class      : PAPI
Event      : PAPI_TOT_CYC (Total cycles)
Period     : 50000
Samples    : 719
Domain     : user
Run Time   : 17.52 (seconds)
Min Self % : (all)

Module Summary
    376  52.29%   52.29%  /usr/bin/python
    178  24.76%   77.05%  /lib/ld-2.3.2.so
    159  22.11%   99.17%  /lib/tls/libc-2.3.2.so
      4   0.56%   99.72%  /lib/tls/libpthread-0.60.so
      1   0.14%   99.86%  /lib/libdl-2.3.2.so
      1   0.14%  100.00%  /lib/libutil-2.3.2.so

Function Summary
    376  52.29%   52.29%  ??
    110  15.30%   67.59%  do_lookup_versioned
     40   5.56%   73.16%  _int_malloc
     31   4.31%   77.47%  strcmp
     22   3.06%   80.53%  _dl_lookup_versioned_symbol
     19   2.64%   83.17%  memcpy
     16   2.23%   85.40%  __libc_malloc
     11   1.53%   86.93%  free
      7   0.97%   87.90%  _int_free
      7   0.97%   88.87%  strlen
      6   0.83%   89.71%  memset
      6   0.83%   90.54%  do_lookup
      5   0.70%   91.24%  malloc_consolidate
      5   0.70%   91.93%  __mempcpy
      4   0.56%   92.49%  __i686.get_pc_thunk.bx
      3   0.42%   92.91%  strerror_r
      3   0.42%   93.32%  mremap_chunk
      3   0.42%   93.74%  _int_realloc
      2   0.28%   94.02%  .L969
      2   0.28%   94.30%  realloc
      2   0.28%   94.58%  mallopt

Profile Information
==============================================
Class      : PAPI
Event      : PAPI_TOT_CYC (Total cycles)
Period     : 50000
Samples    : 721514
Domain     : user
Run Time   : 17.60 (seconds)
Min Self % : (all)

Module Summary
 465515  64.52%   64.52%  /afs/cern.ch/user/o/oplaatl3/testdll/libhello2.so.1
 255433  35.40%   99.92%  /afs/cern.ch/user/o/oplaatl3/testdll/libhello1.so.1
    391   0.05%   99.98%  /usr/bin/python
    145   0.02%  100.00%  /lib/tls/libc-2.3.2.so
     26   0.00%  100.00%  /lib/ld-2.3.2.so
      4   0.00%  100.00%  /lib/tls/libpthread-0.60.so

Function Summary
 255433  35.40%   35.40%  hello(int*)
 254920  35.33%   70.73%  sum(int*)
 210595  29.19%   99.92%  count(int*, int)
    392   0.05%   99.98%  ??
     36   0.00%   99.98%  _int_malloc
     22   0.00%   99.98%  memcpy
     13   0.00%   99.99%  __libc_malloc
     11   0.00%   99.99%  free
     10   0.00%   99.99%  do_lookup_versioned
      7   0.00%   99.99%  strcmp
      6   0.00%   99.99%  __open_nocancel
      5   0.00%   99.99%  _int_free
      4   0.00%   99.99%  memset
      4   0.00%   99.99%  malloc_consolidate
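The two percentage columns in the summaries above appear to be each entry's share of all samples and a running total of those shares. The sketch below recomputes both columns for the Module Summary of the first profile (719 samples). The function name `summarize` is hypothetical and is not part of the profiling tool.

```python
from itertools import accumulate


def summarize(samples_by_name):
    """Return (samples, self %, cumulative %, name) rows, largest first."""
    total = sum(samples_by_name.values())
    # Stable sort: entries with equal sample counts keep their input order.
    ranked = sorted(samples_by_name.items(), key=lambda item: item[1], reverse=True)
    running = accumulate(count for _, count in ranked)
    return [
        (count, 100.0 * count / total, 100.0 * cumulative / total, name)
        for (name, count), cumulative in zip(ranked, running)
    ]


# Sample counts per module, copied from the first profile.
modules = {
    "/usr/bin/python": 376,
    "/lib/ld-2.3.2.so": 178,
    "/lib/tls/libc-2.3.2.so": 159,
    "/lib/tls/libpthread-0.60.so": 4,
    "/lib/libdl-2.3.2.so": 1,
    "/lib/libutil-2.3.2.so": 1,
}

rows = summarize(modules)
for count, self_pct, total_pct, name in rows:
    print(f"{count:7d} {self_pct:6.2f}% {total_pct:6.2f}%  {name}")
```

With the sampling period of 50000 cycles, each sample stands for roughly 50000 cycles spent in that module or function.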