Oversubscription on Multicore Processors (PowerPoint PPT presentation)



SLIDE 1

Oversubscription on Multicore Processors

Costin Iancu, Steven Hofmeyr, Filip Blagojević, Yili Zheng
Lawrence Berkeley National Laboratory
Parallel & Distributed Processing (IPDPS), 2010

1 / 11

SLIDE 2

Motivation

Increasingly parallel and asymmetric hardware (architecture + performance)
Existing runtimes in competitive environments
Partitioning vs. sharing on real hardware

2 / 11

SLIDE 3

Oversubscription

+ Compensate for data and control dependencies
+ Decrease resource contention
+ Improve CPU utilization
− Overhead for migration, context switching and lost hardware state (negligible)
− Slower synchronization due to increased contention
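The upside of the trade-off above can be seen in a toy experiment (a sketch, not the paper's code): when tasks stall on a simulated dependency, running more workers than cores lets the stalls overlap.

```python
import threading
import time

def run_tasks(n_tasks, n_workers, stall_s=0.005):
    """Run n_tasks, each stalling for stall_s seconds (a stand-in for a
    data/control dependency), across n_workers threads; returns wall time.
    Oversubscribing workers overlaps the stalls and raises utilization."""
    remaining = list(range(n_tasks))
    lock = threading.Lock()

    def worker():
        while True:
            with lock:
                if not remaining:
                    return
                remaining.pop()
            time.sleep(stall_s)  # the simulated dependency stall

    start = time.perf_counter()
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start
```

run_tasks(16, 16) finishes well before run_tasks(16, 1) because the oversubscribed run overlaps the stalls; the cost side of the slide (context switches, lost hardware state) does not appear in this sleep-based toy.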

3 / 11

SLIDE 4

Setup

MPI (MPICH 2), UPC, OpenMP
Synchronization: poll + yield
Linux 2.6.27, 2.6.28, 2.6.30
Intel compiler with -O3
NPB without load imbalances (separate paper)
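The poll + yield synchronization mode can be sketched as a centralized barrier whose waiters poll a generation counter and yield the core between polls (an illustration in Python, not the runtimes' actual code; PollYieldBarrier is our name):

```python
import os
import threading
import time

# Fall back to sleep(0) where os.sched_yield is unavailable (non-Unix).
_yield = getattr(os, "sched_yield", lambda: time.sleep(0))

class PollYieldBarrier:
    """Centralized barrier: the last arrival bumps the generation; the
    others poll it, yielding between polls so an oversubscribed thread
    cedes its core instead of spinning hot."""

    def __init__(self, n_threads):
        self.n = n_threads
        self.count = 0
        self.generation = 0
        self.lock = threading.Lock()

    def wait(self):
        with self.lock:
            gen = self.generation
            self.count += 1
            if self.count == self.n:      # last arrival releases the rest
                self.count = 0
                self.generation += 1
                return
        while self.generation == gen:     # poll ...
            _yield()                      # ... + yield
```

Yielding inside the poll loop is what makes oversubscription viable: a waiting thread hands its core to a runnable sibling instead of burning its timeslice.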

Processor   Model              Clock    Cores       L1 data/instr  L2 cache      L3 cache     Memory/core  NUMA
Tigerton    Intel Xeon E7310   1.6 GHz  16 (4x4)    32K/32K        4M / 2 cores  none         2GB          no
Barcelona   AMD Opteron 8350   2.0 GHz  16 (4x4)    64K/64K        512K / core   2M / socket  4GB          socket
Nehalem     Intel Xeon E5530   2.4 GHz  16 (2x4x2)  32K/32K        256K / core   8M / socket  1.5G / core  socket

4 / 11

SLIDE 5

Benchmark Characteristics

[Figure: Barrier performance on AMD Barcelona. Time (microsec, 0-60) at 1/core, 2/core, and 4/core for UPC, OpenMP, and MPI, 1-16 threads.]

5 / 11

SLIDE 6

Benchmark Characteristics

[Figure: barrier performance chart repeated from slide 5.]

[Figure: UPC NPB 2.4 barrier statistics, 16 threads. Inter-barrier time (ms, log scale 0.1-10000) for bt, sp, mg, is, ft, ep, cg, classes A, B, C; per-benchmark data labels range from 13 ms to 17877 ms.]

5 / 11

SLIDE 7

UPC — UMA vs. NUMA

[Figure: UPC on Tigerton. Performance relative to 1/core (0.5-2) for ep, ft, is, sp, mg, cg, classes A, B, C, at 2, 4, 8 threads per core, under CFS, PSX yield, and PIN scheduling.]

sched_yield: default vs. POSIX
Pinning affects variance (120% vs. 10%) and memory affinity

6 / 11

SLIDE 8

UPC — UMA vs. NUMA

[Figure: UPC on Tigerton, repeated from slide 7.]

[Figure: UPC on Barcelona. Performance relative to 1/core (0.5-2) for ep, ft, is, sp, mg, cg, classes A, B, C, at 2, 4, 8 threads per core, under CFS, PSX yield, and PIN scheduling.]

sched_yield: default vs. POSIX
Pinning affects variance (120% vs. 10%) and memory affinity
Small overall effect (±2% avg)
EP: computationally intensive
FT, IS: improvement up to 46%
SP, MG: problem size ↔ granularity
CG: degradation up to 44%
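The PIN configuration comes down to the Linux affinity syscalls; a hedged sketch (pin_to_core is our illustrative helper, Linux-only via os.sched_setaffinity, a no-op elsewhere):

```python
import os

def pin_to_core(core_id=None):
    """Pin the calling process to a single core, as in a PIN-style setup.
    If core_id is None, the lowest currently-allowed core is used (safer
    inside containers whose affinity mask may exclude CPU 0). Returns the
    resulting affinity set, or None where the syscall is unavailable."""
    if hasattr(os, "sched_setaffinity"):
        allowed = os.sched_getaffinity(0)   # cores we may run on
        core = core_id if core_id is not None else min(allowed)
        os.sched_setaffinity(0, {core})     # 0 = the calling process
        return os.sched_getaffinity(0)
    return None  # e.g. macOS/Windows: no portable equivalent in os
```

Pinning both removes migrations (lower run-to-run variance, the 120% vs. 10% contrast) and fixes memory affinity on NUMA, since first-touched pages stay local to the pinned core.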

6 / 11

SLIDE 9

Balance

[Figure: Balance, UPC on Tigerton. Improvement over 1/core (−0.3 to 0.3) for ep, ft, is, sp, mg, cg, classes A, B, C.]

Figure 5. Changes in balance on UMA, reported as the ratio between the lowest and highest user time across all cores compared to the 1/core setting.
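The Figure 5 metric is simple to restate in code (a sketch; the function name is ours):

```python
def balance(user_times):
    """Balance across cores: lowest user time divided by highest.
    1.0 means perfectly even load; small values mean a few hot cores."""
    return min(user_times) / max(user_times)
```

For example, per-core user times of [9.8, 10.0, 9.9, 10.0] give a balance of 0.98, while [5.0, 10.0] gives 0.5.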

7 / 11

SLIDE 10

Cache Miss Rate (LLC / L2)

[Figure: Cache miss rate, UPC on Tigerton. Improvement over 1/core (−0.4 to 0.4) for ep, ft, is, sp, mg, cg, classes A, B, C.]

Figure 6. Changes in the total number of cache misses per 1000 instructions, across all cores compared to 1/core. The EP miss rate is very low.
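The Figure 6 unit, misses per 1000 instructions (commonly called MPKI), as a one-liner (a sketch with our naming; counts would come from hardware performance counters):

```python
def mpki(cache_misses, instructions):
    """Cache misses per 1000 retired instructions (MPKI), the unit
    used in Figure 6's miss-rate comparison."""
    return 1000.0 * cache_misses / instructions
```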

8 / 11

SLIDE 11

MPI and OpenMP

[Figure: MPI on Tigerton. Performance relative to 1/core (0.5-2) for ep, ft, is, sp, mg, cg, classes A, B, C, at 2, 4 threads per core, under CFS, PSX yield, and PIN scheduling.]

Overall decrease by 10%
Caused by barrier overhead (cf. modified UPC)

9 / 11

SLIDE 12

MPI and OpenMP

[Figure: MPI on Tigerton, repeated from slide 11.]

[Figure: OpenMP on Nehalem. Performance relative to 1/core (0.5-2) for ep, ft, is, sp, mg, cg, classes S, A, B, C, at 2, 4, 8 threads per core, under CFS, PSX yield, and PIN scheduling.]

MPI: overall decrease by 10%, caused by barrier overhead (cf. modified UPC)
OpenMP: slight degradation; best performance with OMP_STATIC
KMP_BLOCKTIME:
0: improvement up to 10% for fine-grained benchmarks
∞: best overall performance
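These knobs are set through environment variables; a sketch of a launcher (run_omp is our hypothetical helper; KMP_BLOCKTIME is the Intel OpenMP runtime's spin-wait time before a thread sleeps, OMP_SCHEDULE the standard schedule override):

```python
import os
import subprocess

def run_omp(cmd, blocktime="infinite", schedule="static"):
    """Launch a program with the slide's best-performing settings:
    static scheduling and KMP_BLOCKTIME=infinite (threads keep spinning
    rather than sleeping); pass blocktime="0" for fine-grained codes."""
    env = dict(os.environ,
               KMP_BLOCKTIME=str(blocktime),
               OMP_SCHEDULE=schedule)
    return subprocess.run(cmd, env=env)
```

The blocktime trade-off mirrors the slide: 0 frees cores quickly for oversubscribed siblings (helping fine-grained codes), while infinite spinning wins when threads reach the next barrier soon anyway.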

9 / 11

SLIDE 13

Competitive Environments

Sharing (best effort) vs. partitioning (isolated on sockets), one thread per core

Overall 33%/23% improvement with sharing for UPC/OpenMP on Barcelona (CMP), but no difference for Nehalem (SMT)

Better for applications with differing behavior

Oversubscription improves the benefits of sharing for CMP and changes the relative performance order of UPC, MPI, OpenMP

Imbalanced sharing possible

10 / 11

SLIDE 14

Conclusion

“Intuitively, oversubscription increases diversity in the system and decreases the potential for resource conflicts.”

“All of our results and analysis indicate that the best predictor of application behavior when oversubscribing is the average inter-barrier interval. Applications with barriers executed every few ms are affected, while coarser grained applications are oblivious or their performance improves.”

“We expect the benefits of oversubscription to be even more pronounced for irregular applications that suffer from load imbalance.”
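The predictor named above is easy to compute from a trace of barrier timestamps (a sketch with our naming):

```python
def avg_inter_barrier_ms(barrier_times_ms):
    """Average time between consecutive barriers, in ms. Per the paper's
    conclusion, applications where this is a few ms are affected by
    oversubscription; coarser-grained ones are oblivious or improve."""
    gaps = [b - a for a, b in zip(barrier_times_ms, barrier_times_ms[1:])]
    return sum(gaps) / len(gaps)
```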

11 / 11