An improvement of OpenMP pipeline parallelism with the BatchQueue algorithm
Thomas Preud’homme
Team REGAL. Advisors: Julien Sopena and Gaël Thomas. Supervisor: Bertil Folliot
June 10, 2013
1 / 40
Moore’s law: the number of transistors on a chip doubles every 2 years. Now: CPU frequency stagnates while the number of cores increases ⇒ parallelism is needed to take advantage of multi-core systems
2 / 40
Several paradigms of parallel programming already exist:
Task parallelism. E.g.: multitasking. Limit: needs independent tasks.
Data parallelism. E.g.: array/matrix processing. Limit: needs independent data.
3 / 40
Some modern applications require complex computation but cannot use task or data parallelism due to dependencies, e.g. audio and video processing. Example of video editing:
1. decode a frame into a bitmap image
2. rotate the image
3. trim the image
Dependencies:
“task”: each transformation depends on the result of the previous transformation in the chain
“data”: frame decoding depends on previously decoded frames
4 / 40
Method to increase the number of images processed per second:
Split frame processing into 3 sub-tasks:
1. decoding
2. rotation
3. trimming
Perform each sub-task on a different core
Make images flow from one sub-task to the next
⇒ Sub-tasks performed in parallel for different images
5 / 40
General principle: Divide sequential code into several sub-tasks. Execute each sub-task on a different core. Make data flow from one sub-task to another. ⇒ Sub-tasks run in parallel on different parts of the flow
6 / 40
Performance improvement with 6 cores instead of 3:
Latency: slower by 3 T_comm
Throughput: about 2 times faster
In the general case, performance for n cores is:
Latency: T_task + (n − 1) T_comm
Throughput: 1 output every T_subtask + T_comm ⇒ 1 output every T_task/n + T_comm
Problem: communication time limits the speedup
7 / 40
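As a purely illustrative numeric example (values not taken from the presentation): with T_task = 30 ms and T_comm = 1 ms on n = 3 cores, latency becomes 30 + (3 − 1) × 1 = 32 ms and one output is produced every 30/3 + 1 = 11 ms instead of every 30 ms sequentially, so T_comm caps the speedup at 30/11 ≈ 2.7 rather than the ideal 3.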
On n cores, one output is produced every T_task/n + T_comm. Communication time limits the speedup! ⇒ Need for efficient inter-core communication
8 / 40
Problem 1: current communication algorithms perform badly for inter-core communication
Problem 2: changing the communication algorithm of all/many programs doing pipeline parallelism is impractical
Contributions, a two-fold solution:
BatchQueue: a queue optimized for inter-core communication
Automated usage of BatchQueue for pipeline parallelism
9 / 40
BatchQueue: queue optimized for inter-core communication
10 / 40
Data exchanged by reads and writes in a shared buffer ⇒ data read/written sequentially, cycling back at the end of the buffer. 2 indices memorize where to read/write next in the buffer ⇒ the filling of the buffer is detected by comparing the indices
11 / 40
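The presentation does not show code for this baseline queue, but a minimal C11 sketch of such a circular buffer with two shared indices (in the spirit of Lamport's queue; the names spsc_push/spsc_pop are illustrative, not from the original) could look like this:

/* Single-producer/single-consumer circular buffer: entries are read and
 * written sequentially, wrapping around at the end of the buffer, and a
 * full or empty buffer is detected by comparing the two indices. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_ENTRIES 1024                 /* illustrative size */

struct spsc_queue {
    long buf[QUEUE_ENTRIES];
    atomic_size_t prod_idx;                /* next entry to write, shared */
    atomic_size_t cons_idx;                /* next entry to read, shared  */
};

/* Producer side: returns false when the buffer is full. */
bool spsc_push(struct spsc_queue *q, long val)
{
    size_t prod = atomic_load_explicit(&q->prod_idx, memory_order_relaxed);
    size_t cons = atomic_load_explicit(&q->cons_idx, memory_order_acquire);
    if (prod - cons == QUEUE_ENTRIES)       /* buffer full */
        return false;
    q->buf[prod % QUEUE_ENTRIES] = val;
    atomic_store_explicit(&q->prod_idx, prod + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when the buffer is empty. */
bool spsc_pop(struct spsc_queue *q, long *val)
{
    size_t cons = atomic_load_explicit(&q->cons_idx, memory_order_relaxed);
    size_t prod = atomic_load_explicit(&q->prod_idx, memory_order_acquire);
    if (prod == cons)                       /* buffer empty */
        return false;
    *val = q->buf[cons % QUEUE_ENTRIES];
    atomic_store_explicit(&q->cons_idx, cons + 1, memory_order_release);
    return true;
}

Both indices are shared, sit next to each other in memory, and are touched on every push/pop: these are exactly the memory-consistency and false-sharing costs analysed on the next slides.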
Caches holding the same data must be kept consistent. Consistency is maintained in hardware by a cache coherence protocol such as MOESI. MOESI cache consistency protocol: memory in caches is divided into lines ⇒ consistency is enforced at cache-line granularity. Lines in each cache have a consistency status: Modified, Owned, Exclusive, Shared or Invalid. MOESI ensures that at most one cache holds a line in the Modified or Owned state ⇒ implements a read/write exclusion. 3 performance problems arise from using MOESI
12 / 40
Communication is required to update cache lines and their status ⇒ cache consistency = slowdown.
2 sources of communication:
Write from the Shared or Owned state: invalidate remote cache lines
Read from the Invalid state: broadcast to find the up-to-date line
[Figures: modifying a line in the Shared state; reading a line in the Exclusive state]
13 / 40
3 shared variables: buf, prod_idx and cons_idx
Lockless algorithm tailored to single-core systems:
1. high reliance on memory consistency
14 / 40
False sharing problem: the consistency status is kept per cache line ⇒ data sharing is detected at cache-line level ⇒ accesses to distinct data in the same cache line appear concurrent
15 / 40
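As an illustration not taken from the slides, the following C11 fragment shows how two logically independent indices end up in the same cache line, and how cache-line alignment separates them:

#include <stddef.h>

#define CACHE_LINE 64                       /* typical x86 cache-line size */

/* Both fields fit in one cache line: every update of prod_idx by the
 * producer core invalidates the line holding cons_idx in the consumer
 * core's cache, and vice versa, even though the data is independent. */
struct indices_packed {
    size_t prod_idx;                        /* written by the producer core */
    size_t cons_idx;                        /* written by the consumer core */
};

/* Each field gets its own cache line: updates by one core no longer
 * invalidate the line used by the other core. */
struct indices_padded {
    _Alignas(CACHE_LINE) size_t prod_idx;
    _Alignas(CACHE_LINE) size_t cons_idx;
};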
prod_idx and cons_idx may point to nearby entries
Lockless algorithm tailored to single-core systems:
1. high reliance on memory consistency
2. false sharing
16 / 40
Prefetching consists in fetching data into the cache before it is needed. A read plus a disjoint write access in the same cache line = false sharing ⇒ prefetching can create false sharing
17 / 40
All entries are read and written sequentially
Lockless algorithm tailored to single-core systems:
1. high reliance on memory consistency
2. false sharing
3. undesirable prefetch
18 / 40
Algorithm               Shared quantity        False sharing   Wrong prefetch
Lamport [Lam83]         all variables shared   KO              KO
FastForward [GMV08]     only the buffer        KO              KO
CSQ [ZOYB09]            N global variables     OK              KO
MCRingBuffer [LBC10]    2 global variables     OK              KO
Objectives: 3 problems to solve:
1. Problem 1: excessive synchronization
2. Problem 2: false sharing of data
3. Problem 3: undesirable prefetch
19 / 40
Communication through 2 semi-buffers: production in one semi-buffer, consumption in the other.
When one semi-buffer is fully filled/emptied:
producer: switch status to 1 if it equals 0
consumer: switch status to 0 if it equals 1
Synchronization invariant: status switched twice ⇒ semi-buffers can be exchanged
20 / 40
2 private variables: prod_idx and cons_idx
2 semi-private buffers: buf1 and buf2
1 shared variable: status
Problem 1: reduce the amount of synchronization
+ batch processing for less frequent synchronization
+ synchronize on a single variable
21 / 40
Problem 2: avoid false sharing
+ producer and consumer work on separate buffers
+ alignment of buffers and variables on cache-line boundaries
22 / 40
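The exact BatchQueue implementation is not reproduced in the slides; the following is a minimal C11 sketch of the idea under the assumptions stated in the comments (spinning instead of yielding, illustrative sizes, data published one full semi-buffer at a time):

#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE       64
#define SEMI_BUF_ENTRIES 512                /* illustrative semi-buffer size */

/* A zero-initialized struct batchqueue is a valid empty queue. */
struct batchqueue {
    _Alignas(CACHE_LINE) long buf[2][SEMI_BUF_ENTRIES]; /* two semi-buffers */
    _Alignas(CACHE_LINE) atomic_int status; /* shared: 1 = a full semi-buffer
                                               is waiting for the consumer  */
    _Alignas(CACHE_LINE) size_t prod_idx;   /* producer-private */
    int prod_half;                          /* producer-private */
    _Alignas(CACHE_LINE) size_t cons_idx;   /* consumer-private */
    int cons_half;                          /* consumer-private */
};

/* Producer: fill the current semi-buffer; once it is full, wait until the
 * consumer has released the other one (status == 0), hand it over by
 * switching status to 1, then move to the other semi-buffer. */
void bq_push(struct batchqueue *q, long val)
{
    q->buf[q->prod_half][q->prod_idx++] = val;
    if (q->prod_idx == SEMI_BUF_ENTRIES) {
        while (atomic_load_explicit(&q->status, memory_order_acquire) != 0)
            ;                               /* spin: consumer still busy */
        atomic_store_explicit(&q->status, 1, memory_order_release);
        q->prod_half ^= 1;
        q->prod_idx = 0;
    }
}

/* Consumer: before starting a new semi-buffer, wait until the producer has
 * handed one over (status == 1); after draining it, release it by switching
 * status back to 0 and move to the other semi-buffer. */
long bq_pop(struct batchqueue *q)
{
    if (q->cons_idx == 0)
        while (atomic_load_explicit(&q->status, memory_order_acquire) != 1)
            ;                               /* spin: nothing to read yet */
    long val = q->buf[q->cons_half][q->cons_idx++];
    if (q->cons_idx == SEMI_BUF_ENTRIES) {
        atomic_store_explicit(&q->status, 0, memory_order_release);
        q->cons_half ^= 1;
        q->cons_idx = 0;
    }
    return val;
}

Synchronization happens only once per semi-buffer and only on the single status flag, and both indices are private, which is how the design addresses problems 1 and 2.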
Problem 3: prevent undesirable prefetch
+ padding between each component of the structure? ⇒ would prevent the optimizations possible with contiguous buffers
23 / 40
Problem 3: prevent undesirable prefetch
+ add some padding between the semi-buffers and the status variable
+ access each semi-buffer through a different memory mapping ⇒ consistency of L1 caches is based on virtual addresses
24 / 40
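The slides only state the idea; below is a minimal POSIX sketch of how the same physical buffer can be reached through two distinct virtual mappings, one per side. All names are illustrative and the real BatchQueue implementation may set its mappings up differently.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUF_SIZE 4096                       /* one page, illustrative */

int main(void)
{
    /* Shared memory object backing both mappings (link with -lrt if needed). */
    int fd = shm_open("/bq_double_map_demo", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, BUF_SIZE) < 0) {
        perror("shm");
        return 1;
    }

    /* Two independent virtual mappings of the same physical pages: producer
     * and consumer each access the buffer through their own view (per the
     * slides, L1 consistency and prefetching work on virtual addresses). */
    char *prod_view = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
    char *cons_view = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
    if (prod_view == MAP_FAILED || cons_view == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    strcpy(prod_view, "written through the producer view");
    printf("read through the consumer view: %s\n", cons_view);

    munmap(prod_view, BUF_SIZE);
    munmap(cons_view, BUF_SIZE);
    close(fd);
    shm_unlink("/bq_double_map_demo");
    return 0;
}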
Algorithm               Shared quantity        False sharing   Wrong prefetch
Lamport [Lam83]         all variables shared   KO              KO
FastForward [GMV08]     only the buffer        KO              KO
CSQ [ZOYB09]            N boolean variables    OK              KO
MCRingBuffer [LBC10]    2 variables            OK              KO
BatchQueue [PSTF10]     1 boolean variable     OK              OK
BatchQueue: a lockless algorithm tailored to cache coherency:
1. synchronization reduced and simplified
2. no false sharing of data
3. sharing made explicit with different memory mappings
25 / 40
Principle: send data between two cores and measure the time to transfer all the data.
Two variants of the micro-benchmark:
“comm” test ⇒ measures maximum throughput
“matrix” test ⇒ measures throughput when the L1 cache is under pressure
Machines: bossa (all tests except NUMA), amd48 (NUMA tests only)
26 / 40
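For illustration only, a skeleton of a “comm”-style test could look as follows; it reuses the hypothetical spsc_queue sketch shown earlier and omits pinning each thread to a distinct core (e.g. with pthread_setaffinity_np), which the real benchmark would need:

/* struct spsc_queue, spsc_push and spsc_pop come from the earlier
 * ring-buffer sketch; any queue with the same interface would do. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define N_ITEMS (100L * 1000 * 1000)        /* words to transfer */

static struct spsc_queue queue;

static void *producer(void *arg)
{
    (void)arg;
    for (long i = 0; i < N_ITEMS; i++)
        while (!spsc_push(&queue, i))
            ;                               /* spin while the queue is full */
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    long v;
    for (long i = 0; i < N_ITEMS; i++)
        while (!spsc_pop(&queue, &v))
            ;                               /* spin while the queue is empty */
    return NULL;
}

int main(void)
{
    pthread_t prod, cons;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&prod, NULL, producer, NULL);
    pthread_create(&cons, NULL, consumer, NULL);
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.1f million transfers/s\n", N_ITEMS / secs / 1e6);
    return 0;
}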
Order of magnitude in speed of communication algorithms
27 / 40
Comparison of communication algorithms with default configuration
“Comm” test “Matrix” test
28 / 40
Comparison of communication algorithms with same buffer size
“Comm” test “Matrix” test
29 / 40
Influence of the memory hierarchy on BatchQueue’s performance
Panels: sharing of the L2 cache; sharing of the memory node
Prefetching can only hide small latencies
30 / 40
Automated usage of BatchQueue for pipeline parallelism
31 / 40
Parallelizing a program requires a lot of boilerplate code: thread management (creation, scheduling, termination), synchronization (mutexes, barriers), communication. Some high-level frameworks exist to hide these details: Data/task parallelism: OpenMP, Threading Building Blocks, Cilk Plus, ... Pipeline parallelism: StreamIt, the OpenMP stream-computing extension. Improving these frameworks benefits all programs using them
32 / 40
The OpenMP stream-computing extension offers a familiar syntax ⇒ more likely to be adopted by many programs.
Usage example:

#pragma omp parallel
#pragma omp single
for (i = 0; i < N; i++) {
    #pragma omp task input(state) output(x, state)
    x = compute_update(&state);
    #pragma omp task input(x)
    retval = g(x);
}
33 / 40
Problem: the extension uses MPMC (Multiple Producers Multiple Consumers) queues internally for communication. Yet:
1. MPMC incurs extra synchronization cost (among producers and among consumers)
2. pipeline parallelism is mostly about linear streams
Solution: automatic selection of BatchQueue for linear streams ⇒ compatibility retained
34 / 40
2 sets of modifications:
1. make communication algorithms interchangeable
2. allow transparent use of BatchQueue
1st step: interchangeable communication algorithms. Adapt BatchQueue to the OpenMP stream-computing extension API:
adopt similar function calling sequences: the return value of a function is passed as a parameter of subsequent function calls
adopt a similar structure organisation: different functions are passed in different structures
zero-copy communication: production and consumption happen directly in the communication buffer
2nd step: transparent use of BatchQueue:
automatic selection of BatchQueue for linear streams
buffer size proportional to the number of participants ⇒ keeps the memory footprint of both algorithms similar
35 / 40
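To make the calling-sequence and zero-copy points concrete, here is a purely illustrative C sketch; none of these function names come from the actual OpenMP stream-computing extension or BatchQueue API:

typedef struct { char data[64]; } slot_t;

/* Stub queue with a single slot: just enough to show the calling pattern. */
typedef struct { slot_t slot; } queue_t;

/* Reserve space directly inside the queue's buffer (no copy). */
slot_t *queue_push_begin(queue_t *q) { return &q->slot; }
/* Publish the slot previously returned by queue_push_begin(). */
void queue_push_end(queue_t *q, slot_t *s) { (void)q; (void)s; }

/* Get a pointer to the next slot to consume, again without copying. */
slot_t *queue_pop_begin(queue_t *q) { return &q->slot; }
/* Mark the slot returned by queue_pop_begin() as consumed. */
void queue_pop_end(queue_t *q, slot_t *s) { (void)q; (void)s; }

/* The return value of one call is passed to the next call, and data is
 * produced and consumed directly in the communication buffer. */
void producer_step(queue_t *q)
{
    slot_t *s = queue_push_begin(q);        /* slot inside the buffer  */
    s->data[0] = 42;                        /* produce in place        */
    queue_push_end(q, s);                   /* hand the same slot back */
}

void consumer_step(queue_t *q, char *out)
{
    slot_t *s = queue_pop_begin(q);
    *out = s->data[0];                      /* consume in place */
    queue_pop_end(q, s);
}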
Function: FM demodulation via a series of filters
Source: the OpenMP stream-computing extension paper
Particularity: non-linear pipeline
Machine: quadhexa
36 / 40
Function: computation of the most likely CRC from a given analog signal
Source: work from Alcatel-Lucent on AAC decoding
Particularity: fills a trellis with dependencies between columns
37 / 40
Function: template of code only parallelizable with pipeline parallelism
Particularity: backward dependencies between data units
38 / 40
Optimized inter-core communication with BatchQueue:
1. Tackle the problems caused by memory consistency
+ reduce the need for consistency
+ avoid false sharing when accessing the buffer
+ prevent prefetching from creating false sharing
⇒ throughput improved by up to a factor of 2
2. Minimize memory footprint
+ low memory overhead ⇒ only one extra bit per queue for synchronization
Automated usage of BatchQueue for pipeline parallelism:
+ modifications transparent to applications using OpenMP
⇒ automatic selection of BatchQueue for linear streams
+ application speedup improved by up to a factor of 2
39 / 40
Short-term perspectives:
Improve interaction with the scheduler to reduce spinning
Fetch the status bit asynchronously using SMT + prefetch
Long-term perspectives:
Support 1-to-N and N-to-1 communication ⇒ create optimized algorithms for these specialized cases
Support N-to-N communication ⇒ follow a similar approach to make a cache-friendly algorithm
Use BatchQueue in other domains, e.g. offload some computation to a dedicated core
Dynamically adapt the communication algorithms used in applications
40 / 40
[GMV08] John Giacomoni, Tipp Moseley, and Manish Vachharajani. FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM Press, 2008.
[Lam83] Leslie Lamport. Specifying concurrent program modules. ACM Trans. Program. Lang. Syst., 5(2):190–222, 1983.
[LBC10] P. P. C. Lee, T. Bu, and G. Chandranmenon. A lock-free, cache-efficient multi-core synchronization mechanism for line-rate network traffic monitoring. In IPDPS ’10: Proceedings of the 24th IEEE International Parallel and Distributed Processing Symposium, 2010.
[PSTF10] Thomas Preud’homme, Julien Sopena, Gaël Thomas, and Bertil Folliot. BatchQueue: fast and memory-thrifty core to core communication. In 2010 22nd International Symposium on Computer Architecture and High Performance Computing, pages 215–222. IEEE, 2010.
[ZOYB09] Clustered communication for efficient pipelined multithreading. IAENG International Journal of Computer Science, 36, 2009.
40 / 40