Oversubscription on Multicore Processors (PowerPoint PPT presentation)



SLIDE 1

Oversubscription on Multicore Processors

Costin Iancu, Steven Hofmeyr, Filip Blagojević, Yili Zheng
Lawrence Berkeley National Laboratory
Parallel & Distributed Processing (IPDPS), 2010

1 / 11

SLIDE 2

Motivation

Increasingly parallel and asymmetric hardware (architecture + performance)
Existing runtimes in competitive environments
Partitioning vs. sharing on real hardware

2 / 11

SLIDE 3

Oversubscription

+ Compensate for data and control dependencies
+ Decrease resource contention
+ Improve CPU utilization
− Overhead for migration, context switching and lost hardware state (negligible)
− Slower synchronization due to increased contention
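The upside of the trade-off above can be seen in a toy experiment (a sketch, not the paper's code): when tasks stall on a simulated dependency, running more workers than cores lets the stalls overlap.

```python
import threading
import time

def run_tasks(n_tasks, n_workers, stall_s=0.005):
    """Run n_tasks, each stalling for stall_s seconds (a stand-in for a
    data/control dependency), across n_workers threads; returns wall time.
    Oversubscribing workers overlaps the stalls and raises utilization."""
    remaining = list(range(n_tasks))
    lock = threading.Lock()

    def worker():
        while True:
            with lock:
                if not remaining:
                    return
                remaining.pop()
            time.sleep(stall_s)  # the simulated dependency stall

    start = time.perf_counter()
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start
```

run_tasks(16, 16) finishes well before run_tasks(16, 1) because the oversubscribed run overlaps the stalls; the cost side of the slide (context switches, lost hardware state) does not appear in this sleep-based toy.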

3 / 11

SLIDE 4

Setup

MPI (MPICH 2), UPC, OpenMP
Synchronization: poll + yield
Linux 2.6.27, 2.6.28, 2.6.30
Intel compiler with -O3
NPB without load imbalances (separate paper)
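The poll + yield synchronization mode can be sketched as a centralized barrier whose waiters poll a generation counter and yield the core between polls (an illustration in Python, not the runtimes' actual code; PollYieldBarrier is our name):

```python
import os
import threading
import time

# Fall back to sleep(0) where os.sched_yield is unavailable (non-Unix).
_yield = getattr(os, "sched_yield", lambda: time.sleep(0))

class PollYieldBarrier:
    """Centralized barrier: the last arrival bumps the generation; the
    others poll it, yielding between polls so an oversubscribed thread
    cedes its core instead of spinning hot."""

    def __init__(self, n_threads):
        self.n = n_threads
        self.count = 0
        self.generation = 0
        self.lock = threading.Lock()

    def wait(self):
        with self.lock:
            gen = self.generation
            self.count += 1
            if self.count == self.n:      # last arrival releases the rest
                self.count = 0
                self.generation += 1
                return
        while self.generation == gen:     # poll ...
            _yield()                      # ... + yield
```

Yielding inside the poll loop is what makes oversubscription viable: a waiting thread hands its core to a runnable sibling instead of burning its timeslice.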

Processor   Model              Clock    Cores       L1 data/instr  L2 cache      L3 cache     Memory/core  NUMA
Tigerton    Intel Xeon E7310   1.6 GHz  16 (4x4)    32K/32K        4M / 2 cores  none         2GB          no
Barcelona   AMD Opteron 8350   2.0 GHz  16 (4x4)    64K/64K        512K / core   2M / socket  4GB          socket
Nehalem     Intel Xeon E5530   2.4 GHz  16 (2x4x2)  32K/32K        256K / core   8M / socket  1.5G / core  socket

4 / 11

SLIDE 5

Benchmark Characteristics

[Figure: Barrier performance on AMD Barcelona. Time (microsec, 0-60) at 1/core, 2/core, and 4/core for UPC, OpenMP, and MPI, 1-16 threads.]

5 / 11

SLIDE 6

Benchmark Characteristics

[Figure: barrier performance chart repeated from slide 5.]

[Figure: UPC NPB 2.4 barrier statistics, 16 threads. Inter-barrier time (ms, log scale 0.1-10000) for bt, sp, mg, is, ft, ep, cg, classes A, B, C; per-benchmark data labels range from 13 ms to 17877 ms.]

5 / 11

SLIDE 7

UPC — UMA vs. NUMA

[Figure: UPC on Tigerton. Performance relative to 1/core (0.5-2) for ep, ft, is, sp, mg, cg, classes A, B, C, at 2, 4, 8 threads per core, under CFS, PSX yield, and PIN scheduling.]

sched_yield: default vs. POSIX
Pinning affects variance (120% vs. 10%) and memory affinity

6 / 11

SLIDE 8

UPC — UMA vs. NUMA

[Figure: UPC on Tigerton, repeated from slide 7.]

[Figure: UPC on Barcelona. Performance relative to 1/core (0.5-2) for ep, ft, is, sp, mg, cg, classes A, B, C, at 2, 4, 8 threads per core, under CFS, PSX yield, and PIN scheduling.]

sched_yield: default vs. POSIX
Pinning affects variance (120% vs. 10%) and memory affinity
Small overall effect (±2% avg)
EP: computationally intensive
FT, IS: improvement up to 46%
SP, MG: problem size ↔ granularity
CG: degradation up to 44%
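The PIN configuration comes down to the Linux affinity syscalls; a hedged sketch (pin_to_core is our illustrative helper, Linux-only via os.sched_setaffinity, a no-op elsewhere):

```python
import os

def pin_to_core(core_id=None):
    """Pin the calling process to a single core, as in a PIN-style setup.
    If core_id is None, the lowest currently-allowed core is used (safer
    inside containers whose affinity mask may exclude CPU 0). Returns the
    resulting affinity set, or None where the syscall is unavailable."""
    if hasattr(os, "sched_setaffinity"):
        allowed = os.sched_getaffinity(0)   # cores we may run on
        core = core_id if core_id is not None else min(allowed)
        os.sched_setaffinity(0, {core})     # 0 = the calling process
        return os.sched_getaffinity(0)
    return None  # e.g. macOS/Windows: no portable equivalent in os
```

Pinning both removes migrations (lower run-to-run variance, the 120% vs. 10% contrast) and fixes memory affinity on NUMA, since first-touched pages stay local to the pinned core.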

6 / 11

SLIDE 9

Balance

[Figure: Balance, UPC on Tigerton. Improvement over 1/core (−0.3 to 0.3) for ep, ft, is, sp, mg, cg, classes A, B, C.]

Figure 5. Changes in balance on UMA, reported as the ratio between the lowest and highest user time across all cores compared to the 1/core setting.
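The Figure 5 metric is simple to restate in code (a sketch; the function name is ours):

```python
def balance(user_times):
    """Balance across cores: lowest user time divided by highest.
    1.0 means perfectly even load; small values mean a few hot cores."""
    return min(user_times) / max(user_times)
```

For example, per-core user times of [9.8, 10.0, 9.9, 10.0] give a balance of 0.98, while [5.0, 10.0] gives 0.5.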

7 / 11

SLIDE 10

Cache Miss Rate (LLC / L2)

[Figure: Cache miss rate, UPC on Tigerton. Improvement over 1/core (−0.4 to 0.4) for ep, ft, is, sp, mg, cg, classes A, B, C.]

Figure 6. Changes in the total number of cache misses per 1000 instructions, across all cores compared to 1/core. The EP miss rate is very low.
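The Figure 6 unit, misses per 1000 instructions (commonly called MPKI), as a one-liner (a sketch with our naming; counts would come from hardware performance counters):

```python
def mpki(cache_misses, instructions):
    """Cache misses per 1000 retired instructions (MPKI), the unit
    used in Figure 6's miss-rate comparison."""
    return 1000.0 * cache_misses / instructions
```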

8 / 11

SLIDE 11

MPI and OpenMP

[Figure: MPI on Tigerton. Performance relative to 1/core (0.5-2) for ep, ft, is, sp, mg, cg, classes A, B, C, at 2, 4 threads per core, under CFS, PSX yield, and PIN scheduling.]

Overall decrease by 10%
Caused by barrier overhead (cf. modified UPC)

9 / 11

SLIDE 12

MPI and OpenMP

[Figure: MPI on Tigerton, repeated from slide 11.]

[Figure: OpenMP on Nehalem. Performance relative to 1/core (0.5-2) for ep, ft, is, sp, mg, cg, classes S, A, B, C, at 2, 4, 8 threads per core, under CFS, PSX yield, and PIN scheduling.]

MPI: overall decrease by 10%, caused by barrier overhead (cf. modified UPC)
OpenMP: slight degradation; best performance with OMP_STATIC
KMP_BLOCKTIME:
0: improvement up to 10% for fine-grained benchmarks
∞: best overall performance
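These knobs are set through environment variables; a sketch of a launcher (run_omp is our hypothetical helper; KMP_BLOCKTIME is the Intel OpenMP runtime's spin-wait time before a thread sleeps, OMP_SCHEDULE the standard schedule override):

```python
import os
import subprocess

def run_omp(cmd, blocktime="infinite", schedule="static"):
    """Launch a program with the slide's best-performing settings:
    static scheduling and KMP_BLOCKTIME=infinite (threads keep spinning
    rather than sleeping); pass blocktime="0" for fine-grained codes."""
    env = dict(os.environ,
               KMP_BLOCKTIME=str(blocktime),
               OMP_SCHEDULE=schedule)
    return subprocess.run(cmd, env=env)
```

The blocktime trade-off mirrors the slide: 0 frees cores quickly for oversubscribed siblings (helping fine-grained codes), while infinite spinning wins when threads reach the next barrier soon anyway.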

9 / 11

SLIDE 13

Competitive Environments

Sharing (best effort) vs. partitioning (isolated on sockets), one thread per core

Overall 33%/23% improvement with sharing for UPC/OpenMP on Barcelona (CMP), but no difference for Nehalem (SMT)

Better for applications with differing behavior

Oversubscription improves the benefits of sharing for CMP and changes the relative performance order of UPC, MPI, OpenMP

Imbalanced sharing possible

10 / 11

SLIDE 14

Conclusion

“Intuitively, oversubscription increases diversity in the system and decreases the potential for resource conflicts.”

“All of our results and analysis indicate that the best predictor of application behavior when oversubscribing is the average inter-barrier interval. Applications with barriers executed every few ms are affected, while coarser grained applications are oblivious or their performance improves.”

“We expect the benefits of oversubscription to be even more pronounced for irregular applications that suffer from load imbalance.”
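The predictor named above is easy to compute from a trace of barrier timestamps (a sketch with our naming):

```python
def avg_inter_barrier_ms(barrier_times_ms):
    """Average time between consecutive barriers, in ms. Per the paper's
    conclusion, applications where this is a few ms are affected by
    oversubscription; coarser-grained ones are oblivious or improve."""
    gaps = [b - a for a, b in zip(barrier_times_ms, barrier_times_ms[1:])]
    return sum(gaps) / len(gaps)
```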

11 / 11