TDDD56: Multicore and GPU programming. Lesson 1: Introduction to laboratory work



SLIDE 1

TDDD56: Multicore and GPU programming Lesson 1 Introduction to laboratory work

Nicolas Melot nicolas.melot (at) liu.se

Linköping University (Sweden)

November 4, 2015

Nicolas Melot nicolas.melot (at) liu.se (LIU) TDDD56 lesson 1 November 4, 2015 1 / 40

SLIDE 2

Today

1. Lab 1: Load balancing
2. Lab 2: Non-blocking data structures
3. Lab 3: Parallel sorting
4. Lab work: final remarks

SLIDE 3

Lab 1: Load balancing

Parallelize the generation of the graphic representation of a subset of the Mandelbrot set.

Figure: A representation of the Mandelbrot set (black area) in the range −2 ≤ Cre ≤ 0.6 and −1 ≤ Cim ≤ 1.

SLIDE 4

Computing 2D representation of the Mandelbrot set

Let P = {a + b·i : a ∈ [Cre_min, Cre_max] ∧ b ∈ [Cim_min, Cim_max]} ⊂ C.

Represent P in a 2D picture of size H × W pixels. Each p ∈ P maps to exactly one pixel (x, y) = (a·H / (Cre_max − Cre_min), b·W / (Cim_max − Cim_min)) in the picture.

p ∈ M iff the Julia sequence (eq. 1) is bounded by some b ∈ R with b > 0, for c = p starting at z = 0 + 0i.

  a_0 = z
  a_(n+1) = a_n² + c, ∀n ∈ N    (1)

An iterative algorithm is implemented and provided.

SLIDE 5

Computing 2D representation of the Mandelbrot set

int is_in_mandelbrot(float Cre, float Cim)
{
    int iter;
    float x = 0.0, y = 0.0, xto2 = 0.0, yto2 = 0.0, dist2;

    for (iter = 0; iter <= MAXITER; iter++)
    {
        y = x * y;
        y = y + y + Cim;
        x = xto2 - yto2 + Cre;
        xto2 = x * x;
        yto2 = y * y;

        dist2 = xto2 + yto2;
        if (dist2 >= MAXDIV)
        {
            break; // diverges
        }
    }

    return iter;
}

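For reference, the routine compiles stand-alone once the two constants are defined. MAXITER = 256 and MAXDIV = 4 are assumed values for this sketch; the lab skeleton defines its own.

```c
#define MAXITER 256
#define MAXDIV 4

/* Returns the number of iterations before the sequence diverges,
 * or MAXITER + 1 if it stayed bounded (point assumed in the set). */
int is_in_mandelbrot(float Cre, float Cim)
{
    int iter;
    float x = 0.0f, y = 0.0f, xto2 = 0.0f, yto2 = 0.0f, dist2;

    for (iter = 0; iter <= MAXITER; iter++)
    {
        y = x * y;
        y = y + y + Cim;       /* y' = 2xy + Cim */
        x = xto2 - yto2 + Cre; /* x' = x^2 - y^2 + Cre */
        xto2 = x * x;
        yto2 = y * y;

        dist2 = xto2 + yto2;
        if (dist2 >= MAXDIV)
            break;             /* diverges */
    }

    return iter;
}
```

c = 0 never diverges, so the loop runs to completion; c = 2 + 2i escapes on the first iteration.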
SLIDE 6

Computing 2D representation of the Mandelbrot set

is_in_mandelbrot(Cre, Cim) returns n:

◮ If n >= MAXITER, the pixel is black.
◮ Otherwise, the pixel takes the nth color of a MAXITER-color gradient.

Take each pixel (h, w), run is_in_mandelbrot(h, w) and deduce a suitable color. Data-parallel, "embarrassingly parallel", but with load-balancing issues.

SLIDE 7

Lab 1: Load balancing

A sequential code is provided. Parallelize it using threads and the pthread library.

◮ Naive: partition work as in the upper-right figure. Performance is like the lower-right figure.
◮ Load-balanced: performance scales with the number of threads.

  ⋆ Independent of the subset of P to compute.

◮ Measure individual threads' execution time. Compare global naive and balanced execution times.

Hints on the necessary modifications:

◮ mandelbrot.c: modify lines 122 to 146.
◮ You may modify anything else if you want.

More details and hints in the lab compendium (read it!).

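One possible load-balanced partitioning (a sketch, not the skeleton's actual code): instead of giving each thread one contiguous block of rows, assign rows round-robin so that expensive rows near the set get spread over all threads. NB_THREADS, HEIGHT, WIDTH and compute_pixel are illustrative assumptions, stand-ins for the skeleton's own names.

```c
#include <pthread.h>

#define NB_THREADS 4
#define HEIGHT 64
#define WIDTH 64

static int picture[HEIGHT][WIDTH];

/* Dummy stand-in for the per-pixel Mandelbrot iteration count. */
static int compute_pixel(int y, int x)
{
    return (y * WIDTH + x) % 256;
}

static void *worker(void *arg)
{
    int id = *(int *)arg;
    /* Round-robin rows: thread id handles rows id, id + NB_THREADS, ... */
    for (int y = id; y < HEIGHT; y += NB_THREADS)
        for (int x = 0; x < WIDTH; x++)
            picture[y][x] = compute_pixel(y, x);
    return NULL;
}

void render(void)
{
    pthread_t threads[NB_THREADS];
    int ids[NB_THREADS];
    for (int i = 0; i < NB_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NB_THREADS; i++)
        pthread_join(threads[i], NULL);
}
```

No synchronization is needed inside the loop because each thread writes disjoint rows.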
SLIDE 8

Lab 2: Non-blocking data structures

Synchronization of stacks. A stack is a LIFO (Last In, First Out) data structure with push and pop operations.

Two variants: bounded and unbounded.

Bounded stacks:

◮ One finite contiguous buffer.
◮ An index (head) denotes the offset from the beginning of the buffer to the head.
◮ Empty if head = 0.

Unbounded stacks:

◮ Singly linked list.
◮ head points to the head element.
◮ Empty if head → null.

SLIDE 9

Stacks

Figure: Bounded stack (a buffer with an index head marking the top).

Figure: Unbounded stack (linked list head → 1 → 2 → 3 → null).

SLIDE 10

Stack synchronization

Protect the stack during push or pop operations:

◮ Bounded stack: increment head atomically.
◮ Unbounded stack: atomically update the head pointer.

Protection is usually achieved with a lock.

P(lock);
head = head + sizeof(new_element);
V(lock);

Figure: Bounded stack lock-based synchronization

P(lock);
new_element.next = head;
head = &new_element;
V(lock);

Figure: Unbounded stack lock-based synchronization

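A minimal lock-based unbounded stack using a pthread mutex (a sketch under assumed names, not the lab skeleton's interface):

```c
#include <pthread.h>
#include <stdlib.h>

struct node { int value; struct node *next; };

static struct node *head = NULL;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void stack_push(int value)
{
    struct node *n = malloc(sizeof *n);
    n->value = value;
    pthread_mutex_lock(&lock);   /* P(lock) */
    n->next = head;
    head = n;
    pthread_mutex_unlock(&lock); /* V(lock) */
}

/* Returns 1 and stores the popped value, or 0 if the stack is empty. */
int stack_pop(int *value)
{
    pthread_mutex_lock(&lock);
    struct node *n = head;
    if (n == NULL) {
        pthread_mutex_unlock(&lock);
        return 0;
    }
    head = n->next;
    pthread_mutex_unlock(&lock);
    *value = n->value;
    free(n);
    return 1;
}
```

The mutex serializes every access to head, which is exactly the cost the CAS-based version on the next slides tries to avoid.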
SLIDE 11

Compare-and-swap

Do atomically:

◮ Read *buf; if it equals old, write new into *buf.
◮ In all cases, return the value that was read.

void* CAS(void** buf, void* old, void* new)
{
    atomic
    {
        void* observed = *buf;
        if (observed == old)
            *buf = new;
        return observed;
    }
}

How to protect a stack with it?

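The lab skeleton provides a CAS implementation in assembly; outside the skeleton, the same retry loop can be sketched with C11 atomics (illustrative, not the course's API):

```c
#include <stdatomic.h>
#include <stddef.h>

struct node { int value; struct node *next; };

static _Atomic(struct node *) head = NULL;

/* Lock-free push: retry until the CAS commits our element as the new head. */
void stack_push(struct node *elem)
{
    struct node *old;
    do {
        old = atomic_load(&head);
        elem->next = old;
        /* atomic_compare_exchange_weak: if head == old, set head = elem and
         * return true; otherwise store the observed head into old and retry. */
    } while (!atomic_compare_exchange_weak(&head, &old, elem));
}
```

Note that C11's compare-exchange reports success as a boolean and writes the observed value back into old, whereas the course's CAS returns the observed value directly; the loop structure is the same.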
SLIDE 12

CAS-protected stack

Keep track of the old head.
Set the new element's next member to the head just saved in old.
Check whether head is still the one recorded in old:

◮ If so, commit the change.
◮ Else, restart the process (keep track, update next, check) until the commit succeeds.

do
{
    old = head;
    elem.next = old;
} while (CAS(&head, old, elem) != old);

Figure: CAS-protected push

SLIDE 13

CAS-protected stack

If another thread preempts and pushes an element:

◮ head is changed.
◮ The CAS fails.
◮ The first thread tries again.

Thread 1 gets preempted by thread 2:

do
{
    old = head;
    elem1.next = old;
    <thread 2 preempts thread 1>
} while (CAS(&head, old, elem1) != old);

<just preempted thread 1>
do
{
    old = head;
    elem2.next = old;
} while (CAS(&head, old, elem2) != old);
<returns to thread 1>

SLIDE 14

Misuse cases of CAS

Why is the code below wrong...

push(stack_t stack, elem_t elem)
{
    do
    {
        elem.next = head;
        old = head;
    } while (CAS(&head, old, elem) != old);
}

... while this one is correct?

do
{
    old = head;
    elem.next = old;
} while (CAS(&head, old, elem) != old);

SLIDE 15

Treiber stack

Rather old: published by R. Kent Treiber from IBM Research in 1986.

◮ Search for "Coping with parallelism" on http://domino.research.ibm.com/library/cyberdig.nsf/index.html

Relies on a hardware CAS to atomically update the head of a stack to a new element. Pseudo-code is available on page 6 of "Nonblocking Algorithms and Preemption-Safe Locking on Multiprogrammed Shared Memory Multiprocessors" (M. M. Michael and M. L. Scott, Journal of Parallel and Distributed Computing, 1998), available through Google Scholar.

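The matching pop side of a Treiber stack follows the same retry pattern: swing head from the observed top element to its successor. This sketch again uses C11 atomics as a stand-in for the hardware CAS; it is single-threaded here, so the ABA hazard discussed next does not show up.

```c
#include <stdatomic.h>
#include <stddef.h>

struct node { int value; struct node *next; };

static _Atomic(struct node *) head = NULL;

void stack_push(struct node *elem)
{
    struct node *old;
    do {
        old = atomic_load(&head);
        elem->next = old;
    } while (!atomic_compare_exchange_weak(&head, &old, elem));
}

/* Lock-free pop: returns the former top element, or NULL if empty. */
struct node *stack_pop(void)
{
    struct node *old;
    do {
        old = atomic_load(&head);
        if (old == NULL)
            return NULL;
    } while (!atomic_compare_exchange_weak(&head, &old, old->next));
    return old;
}
```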
SLIDE 16

The ABA problem

Consider an unbounded stack.
Consider that popped elements are not destroyed, but pushed into a pool to be reused by further push() calls:

◮ Saves a malloc to allocate a new element when push and pop are frequent.

Shared-memory programming; 3 threads, each having its own pool of unused elements.

SLIDE 17

The ABA problem

The scenario, step by step; after each step, the state of every affected thread (old, new, pool) and of the shared stack is shown.

1. The scenario starts:
   Thread 0: old→null, new→null, pool→null
   Thread 1: old→null, new→null, pool→null
   Thread 2: old→null, new→null, pool→null
   shared→A→B→C→null

2. Thread 0 pops A, preempted before CAS(head, old=A, new=B):
   Thread 0: old→A, new→B, pool→null
   shared→A→B→C→null

3. Thread 1 pops A, succeeds:
   Thread 1: old→A, new→B, pool→A→null
   shared→B→C→null

4. Thread 2 pops B, succeeds:
   Thread 2: old→B, new→C, pool→B→null
   shared→C→null

5. Thread 1 pushes A, succeeds:
   Thread 1: old→C, new→A, pool→null
   shared→A→C→null

6. Thread 0 resumes and performs CAS(head, old=A, new=B); head points to A again, so the CAS succeeds:
   Thread 0: old→A, new→B, pool→A→null
   shared→B→null

The shared stack should be empty, but it points to B in Thread 2's recycling bin.

SLIDE 24

The ABA problem


Figure: The shared stack should be empty, but points to B in Thread 2’s recycling bin

SLIDE 25

Lab 2: Directions

Implement a stack and protect it using locks.
Implement a CAS-based stack.

◮ A CAS assembly implementation is provided in the lab skeleton.

Use pthread synchronization to make several threads preempt each other in order to play out one ABA scenario. Use an ABA-free performance test to compare the performance of the lock-based and CAS-based concurrent stacks. Get more details and hints in the lab compendium.

SLIDE 26

Lab 3: Parallel sorting

Implement or optimize an existing sequential sort implementation.
Parallelize it with a shared-memory approach (pthread or OpenMP).
Parallelize it with dataflow (Drake).
Test your sorting implementation in various situations:

◮ Random, ascending, descending or constant input.
◮ Small and large input sizes.
◮ Other tricky situations you may imagine.

Built-in sorting functions (qsort(), std::sort()) are forbidden.

◮ You may rewrite them for better performance.

Lab demo: describe the important techniques that accelerate your implementation.

SLIDE 27

Base sequential sort

pivot = array[size / 2];
for (i = 0; i < size; i++)
{
    if (array[i] < pivot)
    {
        left[left_size] = array[i];
        left_size++;
    }
    else if (array[i] > pivot)
    {
        right[right_size] = array[i];
        right_size++;
    }
    else
        pivot_count++;
}

simple_quicksort(left, left_size);
simple_quicksort(right, right_size);

memcpy(array, left, left_size * sizeof(int));
for (i = left_size; i < left_size + pivot_count; i++)
    array[i] = pivot;
memcpy(array + left_size + pivot_count, right, right_size * sizeof(int));

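Wrapped into a self-contained function, with the base case and temporary buffers the excerpt leaves out (the allocation strategy shown is an assumption), the same partition, recurse and copy structure looks like:

```c
#include <stdlib.h>
#include <string.h>

void simple_quicksort(int *array, int size)
{
    if (size <= 1)
        return; /* base case omitted from the slide */

    int pivot = array[size / 2];
    int *left = malloc(size * sizeof(int));
    int *right = malloc(size * sizeof(int));
    int left_size = 0, right_size = 0, pivot_count = 0;

    /* Partition into elements below, above and equal to the pivot. */
    for (int i = 0; i < size; i++) {
        if (array[i] < pivot)
            left[left_size++] = array[i];
        else if (array[i] > pivot)
            right[right_size++] = array[i];
        else
            pivot_count++;
    }

    simple_quicksort(left, left_size);
    simple_quicksort(right, right_size);

    /* Recombine: left part, pivot copies, right part. */
    memcpy(array, left, left_size * sizeof(int));
    for (int i = left_size; i < left_size + pivot_count; i++)
        array[i] = pivot;
    memcpy(array + left_size + pivot_count, right, right_size * sizeof(int));

    free(left);
    free(right);
}
```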
SLIDE 28

Base sequential sort

Figure: execution of the base sequential sort, data index vs. processing time (legend: sequential task, partition, merge, local sort).

SLIDE 36

Parallelization opportunities

Parallelization opportunities:

◮ Recursive calls.
◮ Computing pivots.
◮ Merging, if necessary.

Smart solutions are challenging to implement:

◮ In-place quicksort: false sharing.
◮ Parallel sampling/merging: synchronization.
◮ Follow the KISS rule.

Avoid spawning more threads than the computer has cores. Exploit data locality with caches and cache lines.

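One low-effort way to exploit the recursive calls: spawn a thread for one half of each partition down to a depth cap, so at most 2^depth threads run at once. This is a sketch (in-place Hoare-style partitioning, illustrative names), not the required lab solution.

```c
#include <pthread.h>

struct job { int *array; int size; int depth; };

static void parallel_quicksort(int *array, int size, int depth);

static void *worker(void *arg)
{
    struct job *j = arg;
    parallel_quicksort(j->array, j->size, j->depth);
    return NULL;
}

/* In-place quicksort; a new thread sorts the left part while the
 * current thread sorts the right part, down to a depth cap. */
static void parallel_quicksort(int *array, int size, int depth)
{
    if (size <= 1)
        return;

    int pivot = array[size / 2];
    int i = 0, k = size - 1;
    while (i <= k) { /* partition around the pivot */
        while (array[i] < pivot) i++;
        while (array[k] > pivot) k--;
        if (i <= k) {
            int tmp = array[i]; array[i] = array[k]; array[k] = tmp;
            i++; k--;
        }
    }

    if (depth > 0) {
        pthread_t t;
        struct job left = { array, k + 1, depth - 1 };
        pthread_create(&t, NULL, worker, &left);
        parallel_quicksort(array + i, size - i, depth - 1);
        pthread_join(t, NULL);
    } else {
        parallel_quicksort(array, k + 1, 0);
        parallel_quicksort(array + i, size - i, 0);
    }
}
```

With depth = 2 this uses at most 4 threads, matching a quad-core machine.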
SLIDE 37

Simple parallelization

SLIDE 54

Parallel quicksort with 3 cores

Can only efficiently use a power-of-two number of cores. How to use three cores efficiently?

◮ Choose the pivot to divide the buffer into unequal parts.
◮ Partition and recurse into 3 parts (sample sort).
◮ Makes the implementation harder.

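A hedged sketch of the unequal-split idea: sample the input and pick the pivot near the 1/3 quantile, so one core takes the smaller part while two cores recurse on the larger one. The function name and sample size are illustrative, and qsort here only sorts the tiny sample, not the input itself.

```c
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

/* Pick a pivot near the 1/3 quantile from a small sorted sample, so the
 * split is roughly size/3 vs. 2*size/3 (1 core vs. 2 cores). */
int third_quantile_pivot(const int *array, int size, int sample_size)
{
    int *sample = malloc(sample_size * sizeof(int));
    for (int i = 0; i < sample_size; i++)
        sample[i] = array[(long)i * size / sample_size];
    qsort(sample, sample_size, sizeof(int), cmp_int);
    int pivot = sample[sample_size / 3];
    free(sample);
    return pivot;
}
```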
SLIDE 55

Mergesort

Figure: execution of mergesort, data index vs. processing time (legend: sequential task, partition, merge, local sort).

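The core of mergesort is merging two sorted runs; a minimal sketch (names are illustrative):

```c
/* Merge sorted runs a[0..na) and b[0..nb) into out[0..na+nb). */
void merge(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na)
        out[k++] = a[i++];
    while (j < nb)
        out[k++] = b[j++];
}
```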
SLIDE 63

Simple Mergesort Parallelization

SLIDE 78

Parallel mergesort with 3 cores

Can only efficiently use a power-of-two number of cores. How to use three cores efficiently?

◮ Divide the buffer into 2 unequal parts.
◮ Partition and recurse into 3 parts and 3-way merge.

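A 3-way merge for the three-part variant can be sketched by repeatedly taking the smallest head of the three runs (illustrative, not the lab's required interface):

```c
/* Merge three sorted runs into out by repeatedly taking the smallest
 * remaining head element among a, b and c. */
void merge3(const int *a, int na, const int *b, int nb,
            const int *c, int nc, int *out)
{
    int i = 0, j = 0, k = 0, o = 0;
    while (i < na || j < nb || k < nc) {
        /* Pick the run with the smallest current head. */
        int best = -1, best_val = 0;
        if (i < na) { best = 0; best_val = a[i]; }
        if (j < nb && (best < 0 || b[j] < best_val)) { best = 1; best_val = b[j]; }
        if (k < nc && (best < 0 || c[k] < best_val)) { best = 2; best_val = c[k]; }
        out[o++] = best_val;
        if (best == 0) i++; else if (best == 1) j++; else k++;
    }
}
```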
SLIDE 84

Pipelined parallel mergesort

Classic parallelism: start a task when the previous one is done.

[Figure: merge tasks mapped onto Core 1 to Core 4, one tree level after the other]

Pipeline parallelism: run the next merging task as soon as possible.

[Figure: merge tasks overlapping across Core 1 to Core 4]

Even more speedup. Difficult to implement manually.

SLIDE 89

Pipeline parallelism

Related research since the 1960s:

◮ Program verifiability; parallelism is a mere "consequence".
◮ Sequential tasks communicating through channels.
◮ Theories: Kahn networks, (Synchronous) Data Flow, Communicating Sequential Processes.
◮ Languages: StreamIt, CAL, Esterel.

[Figure: a synchronous dataflow graph of multiplier (*k) and adder (+) actors with unit rates and delay (D) elements]

SLIDE 97

Classic versus stream

Most programming languages are unsuitable for parallelism:

◮ They abstract a single, universal instruction pointer.
◮ They abstract a single, universal address space.
◮ Difficult to read with several threads in mind.
◮ Annotations (OpenMP) do not help with a high number of cores.

Stream programming:

◮ (Mostly) sequential tasks.
◮ Actual parallelism is a matter of scheduling.
◮ No universal shared memory.
◮ A natural fit for pipeline parallelism.
◮ Communication through on-chip memories: on-chip pipelining.

SLIDE 100

Back to parallel merge

Classic parallelism:

[Figure: merge tree on Core 1 to Core 3, one task at a time]

Pipelining (4 initial sorting tasks):

[Figure: merge tasks overlapped on Core 1 to Core 3]

Pipelining (8 initial sorting tasks):

[Figure: a deeper merge tree overlapped on Core 1 to Core 3]

Nicolas Melot nicolas.melot (at) liu.se (LIU) TDDD56 lesson 1 November 4, 2015 32 / 40

slide-110
SLIDE 110

Lab 3: Directions

Part 1: Classic parallel sort
◮ Parallelize the sequential sort provided in src/sort.cpp; keep it simple
◮ Optimize it so it can use 3 cores efficiently

Part 2: Pipelined parallel mergesort using Drake
◮ Reuse the sequential version of part 1 when running on 1 core
◮ Implement a merging pipelined task (merge.c)
◮ Design a pipeline for 1 to 6 cores (merge-X.graphml)
◮ Design schedules for 1 to 6 cores (schedule-X.xml)

More details in the lab compendium


slide-111
SLIDE 111

Lab 3: Inspiration

Quick sort: http://www.youtube.com/watch?v=ywWBy6J5gz8
Merge sort: http://www.youtube.com/watch?v=XaqR3G_NVoo
Select sort: http://www.youtube.com/watch?v=Ns4TPTC8whw
Bubble sort: http://www.youtube.com/watch?v=lyZQPjUT5B4
Shell sort: http://www.youtube.com/watch?v=CmPA7zE8mx0
Insert sort: http://www.youtube.com/watch?v=ROalU379l3U
Bogo sort: http://en.wikipedia.org/wiki/Bogosort
Ineffective sorting: http://xkcd.com/1185/


slide-112
SLIDE 112

Lab 3: Anti-inspiration

Figure: Nobody will pass with these algorithms, even parallelized. Creative Commons Attribution-NonCommercial 2.5 License, http://xkcd.com/1185/


slide-113
SLIDE 113

Optional: High performance parallel programming challenge 2015

Design and implement the fastest algorithms
One round for CPUs and another for GPUs
Participation in both rounds and on-time submission is required to win the prize
Participation in the challenge is optional
Prize: cinema tickets and a certificate

Figure: Specimen of the challenge certificate.


slide-114
SLIDE 114

Suggested planning

Week 45: Lecture – Shared memory and non-blocking synchronization
Week 46: Lab 1 – Load balancing

◮ Soft deadline: Friday 11/13

Weeks 46–47: Lecture – Parallel sorting algorithms
Week 47: Lab 2 – Non-blocking synchronization

◮ Soft deadline: Friday 11/20

Week 48: Lab 3 – Parallel sorting

◮ Soft deadline: Friday 11/27

Thursday 12/18: Deadline for the CPU and GPU parallel sorting challenge
Friday 12/19: Challenge prize ceremony and hard deadline for labs


slide-115
SLIDE 115

Final remarks

Do your homework before the lab session
KISS: Keep It Simple and Stupid

◮ Donald Knuth: “Premature optimization is the root of all evil”
◮ Simple code is often fast and a good base for later optimizations
◮ Start with the simplest possible sequential code until it works
◮ Then parallelize it in the simplest way you can think of; discard optimizations for now
◮ Only then improve your code: running in-place, data locality issues, thread pools, etc.
◮ Read http://en.wikipedia.org/wiki/Program_optimization, section “When to optimize”

Modify lab skeletons at will! The exercise consists in demonstrating understanding, not just implementing an expected outcome.


slide-116
SLIDE 116

Final remarks

Tight deadline

◮ Implement your lab before the lab sessions
◮ Use the lab session to run measurements and demonstrate your work
◮ Contact your lab assistant to get help

Help each other: find inspiration in discussions between groups

◮ But code and explain solutions yourself

Do your homework before the lab session
Find me in my office, 3B:488
Introductory practice exercises on C, pthreads and measurements

◮ http://www.ida.liu.se/~nicme26/tddd56.en.shtml, labs 0a and 0b
◮ http://www.ida.liu.se/~nicme26/measuring.en.shtml

Google is your friend (or claims to be so)
A lot of useful information is in the (long) lab compendium. Read it!


slide-117
SLIDE 117

End of lesson 1

Next lesson: theory exercises
Thank you for your attention
