Oversubscription on Multicore Processors
Costin Iancu, Steven Hofmeyr, Filip Blagojević, Yili Zheng
Lawrence Berkeley National Laboratory
Parallel & Distributed Processing (IPDPS), 2010
Processor                    Clock GHz  Cores       L1 data/instr  L2 cache      L3 cache     Memory/core  NUMA
Tigerton Intel Xeon E7310    1.6        16 (4x4)    32K/32K        4M / 2 cores  none         2GB          no
Barcelona AMD Opteron 8350   2          16 (4x4)    64K/64K        512K / core   2M / socket  4GB          socket
Nehalem Intel Xeon E5530     2.4        16 (2x4x2)  32K/32K        256K / core   8M / socket  1.5G / core  socket
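Each machine in the table has 16 cores, and oversubscription means running more than one thread per core. The sketch below shows one way to lay out threads on cores for a given threads-per-core factor. The function names and the round-robin placement are assumptions made for illustration; the slides do not say how threads were mapped to cores. The pinning helper uses `os.sched_setaffinity`, which is only available on some platforms (notably Linux).

```python
import os


def pin_layout(num_cores, threads_per_core):
    """Return a thread -> core map with threads_per_core threads on each core.

    Round-robin placement is an assumption for illustration only.
    """
    total_threads = num_cores * threads_per_core
    return {thread: thread % num_cores for thread in range(total_threads)}


def pin_calling_thread(core):
    """Restrict the calling thread to one core where the platform allows it.

    Returns False when os.sched_setaffinity is not available.
    """
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {core})
        return True
    return False


if __name__ == "__main__":
    # 16 cores as in the table, oversubscribed at 2 threads per core.
    layout = pin_layout(16, 2)
    print(len(layout), "threads; thread 17 runs on core", layout[17])
```

With 16 cores and 2 threads per core this yields 32 threads, and threads 1 and 17 share core 1.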
[Figure: Barrier Performance, AMD Barcelona. Y-axis: Time (microsec), ticks 0 to 60, with one off-scale value labeled 160. X-axis: UPC, OpenMP, and MPI, each at 1/core, 2/core, and 4/core. Series: 1, 2, 4, 8, 16.]
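A barrier microbenchmark of the kind plotted above can be sketched as follows. This is only an illustration using Python's `threading.Barrier`, not the UPC, OpenMP, or MPI barriers measured in the slides, and absolute numbers from it are not comparable to the figure. The function name `time_barrier` and the iteration count are assumptions.

```python
import threading
import time


def time_barrier(num_threads, iterations=200):
    """Return the average time per barrier in microseconds.

    All threads repeatedly wait on one shared barrier; the slowest
    thread's elapsed time is divided by the number of iterations.
    """
    barrier = threading.Barrier(num_threads)
    elapsed = [0.0] * num_threads

    def worker(rank):
        barrier.wait()  # align all threads before timing starts
        start = time.perf_counter()
        for _ in range(iterations):
            barrier.wait()
        elapsed[rank] = time.perf_counter() - start

    threads = [threading.Thread(target=worker, args=(rank,))
               for rank in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return max(elapsed) / iterations * 1e6


if __name__ == "__main__":
    for count in (1, 2, 4, 8, 16):
        print(count, "threads:", round(time_barrier(count), 1), "microsec per barrier")
```

Running it with more threads than cores gives a rough feel for how barrier cost grows under oversubscription.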
[Figure: UPC NPB 2.4 Barrier Stats, 16 threads. Y-axis: Inter-barrier time (ms), log scale from 0.1 to 10000. X-axis: benchmarks bt, sp, mg, is, ft, ep, cg, each for classes A, B, C. Bar annotations, in extraction order: 3777, 17877, 17877, 13, 13, 13, 56, 140, 50, 91, 91, 91, 378, 1114, 1240, 13677, 13677, 13677, 7688, 7688, 7688.]
[Figure: UPC Tigerton. Y-axis: Performance relative to 1/core, ticks 0.5 to 2. X-axis: benchmarks ep, ft, is, sp, mg, cg, each for classes A, B, C, with tick labels 2, 4, 8 (sp: 4 only). Legend: CFS PSX yield PIN.]
[Figure: UPC Barcelona. Y-axis: Performance relative to 1/core, ticks 0.5 to 2. X-axis: benchmarks ep, ft, is, sp, mg, cg, each for classes A, B, C, with tick labels 2, 4, 8 (sp: 4 only). Legend: CFS PSX yield PIN.]
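The y-axis in these plots is performance relative to the 1/core run. A minimal sketch of that normalization is shown below, assuming it is computed as the 1/core run time divided by the oversubscribed run time, so that values above 1 mean oversubscription helped. The slides do not state the formula, and the timings used here are made up for illustration, not taken from the figures.

```python
def relative_performance(run_times):
    """Normalize run times to the 1-thread-per-core baseline.

    run_times maps threads-per-core -> wall-clock seconds and must
    contain the key 1. Values above 1.0 mean faster than the baseline.
    """
    baseline = run_times[1]
    return {factor: baseline / seconds for factor, seconds in run_times.items()}


if __name__ == "__main__":
    # Hypothetical timings in seconds, not data from the slides.
    timings = {1: 10.0, 2: 8.0, 4: 12.5}
    for factor, value in sorted(relative_performance(timings).items()):
        print(f"{factor}/core: {value:.2f}")
```

With these made-up timings, 2/core comes out at 1.25 (faster than the baseline) and 4/core at 0.80 (slower).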
[Figure: MPI Tigerton. Y-axis: Performance relative to 1/core, ticks 0.5 to 2. X-axis: benchmarks ep, ft, is, sp, mg, cg, each for classes A, B, C, with tick labels 2, 4 (sp: 4 only). Legend: CFS PSX yield PIN.]
Overall decrease by 10%, caused by barrier overhead (cf. modified UPC).
[Figure: OMP Nehalem. Y-axis: Performance relative to 1/core, ticks 0.5 to 2. X-axis: benchmarks ep, ft, is, sp, mg, cg, each for classes A, B, C, S, with tick labels 2, 4, 8. Legend: CFS PSX yield PIN.]
Slight degradation. Best performance with OMP_STATIC, KMP_BLOCKTIME.