

slide-1
SLIDE 1

A Comparison Of Shared Memory Parallel Programming Models

Jace A. Mogill, David Haglin

1

slide-2
SLIDE 2

Parallel Programming Gap

Not many innovations...

  • Memory semantics unchanged for over 50 years
  • 2010 Multi-Core x86 programming model identical to the 1982 Symmetric Multi-Processor programming model

Unwillingness to adopt new languages

  • Users want to leverage existing investments in code
  • Prefer to incrementally migrate to parallelism

  • OpenMP/MicroTasking – Data parallelism
  • Pthreads – Task parallelism
  • Wishful Thinking – Mixed task and data parallelism

slide-3
SLIDE 3

Parallelism and Synchronization are Orthogonal

Expressing Synchronization and Parallelism:

                             Parallelism: Explicit     Parallelism: Implicit
  Synchronization: Code      Pthreads / OpenMP         Vectorization
  Synchronization: Data      MTA                       Dataflow

3

Synchronization and parallelism primitives can be mixed

  • OpenMP mixed with Atomic Memory Operations
  • MTA synchronization mixed with OpenMP parallel loops
  • OpenMP mixed with Pthread Mutex Locks
  • OpenMP or Pthreads mixed with Vectorization
  • All of the above mixed with MPI, UPC, CAF, etc.
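
As a concrete illustration (a minimal sketch, not taken from the slides; the function and variable names are invented), an explicitly parallel OpenMP loop can use a hardware atomic memory operation as its only synchronization:

    #include <omp.h>

    /* Count matching elements: OpenMP supplies the explicit parallelism,
     * while the only synchronization is an atomic update (an AMO),
     * showing that the two primitives are chosen independently. */
    long count_matches(const int *a, long n, int key)
    {
        long hits = 0;
        #pragma omp parallel for
        for (long i = 0; i < n; i++) {
            if (a[i] == key) {
                #pragma omp atomic      /* atomic memory operation, no mutex */
                hits++;
            }
        }
        return hits;
    }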

slide-4
SLIDE 4

Thread-Centric versus Data-Centric

TASK: Map millions of degrees of parallelism onto tens of (thousands of) processors

Thread-Centric: Manage Threads
  • Optimizes for a specific machine organization
  • Requires careful scheduling of moving data to/from threads
  • Difficult to load balance dynamic and nested parallel regions

Data-Centric: Manage Data Dependencies
  • Compiler is already doing this for ILP, loop optimization, and vectorization
  • Optimizes for concurrency, which is performance portable
  • Moving the task to the data is a natural option for load balancing
slide-5
SLIDE 5

Lock-Free and Wait-Free Algorithms

5

Lock-free and wait-free algorithms...
  • don't really exist
  • only embarrassingly parallel algorithms don't use synchronization

Compare And Swap...
  • is not lock-free or wait-free
  • has no concurrency
  • is a synchronization primitive, which corresponds to mutex try-lock in the Pthreads programming model

slide-6
SLIDE 6

Compare And Swap

LOCK# CMPXCHG – x86 Locked Compare and Exchange

Programming Idioms
  • Similar to mutex try-lock
  • Mutex locks can spin on try-lock or yield to the OS/runtime

So-Called Lock-Free Algorithms
  • Manually coded secondary lock handler
  • Manually coded tertiary lock handler...
  • All this try-lock handler work is not algorithmically efficient...
  • It's Lock-Free Turtles all the way down...

Implementation
  • Instruction in all i386 and later processors
  • Efficient for processors sharing caches and memory controllers
  • Not efficient or fair for non-uniform machine organizations

6

Atomic Memory Operations Do Not Scale

It is not possible to go 10,000-way parallel on one piece of data.
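
A minimal sketch of the point (illustrative C11 code, not from the slides): the canonical "lock-free" counter update is a compare-and-swap retry loop, which behaves exactly like spinning on a mutex try-lock, so only one thread's update succeeds per attempt.

    #include <stdatomic.h>

    /* "Lock-free" increment via compare-and-swap.  Each failed CAS is a
     * failed try-lock; the retry loop serializes the updates just as a
     * spin lock around counter++ would. */
    void cas_increment(_Atomic long *counter)
    {
        long old = atomic_load(counter);
        /* on failure, atomic_compare_exchange_weak reloads *counter into old */
        while (!atomic_compare_exchange_weak(counter, &old, old + 1)) {
            /* retry, like pthread_mutex_trylock() in a loop */
        }
    }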

slide-7
SLIDE 7

Thread-Centric Parallel Regions

OpenMP
  • Implied fork/join scaffolding
  • Parallel regions are separate from loops
      Unannotated loops: every thread executes all iterations
      Annotated loops: loops are decomposed among the existing threads
  • Joining threads: exit the parallel region, barriers

Pthreads
  • Fully explicit scaffolding
  • Forking threads: one new thread per pthread_create(); loops or trees required to start multiple threads
  • Flow control: pthread_barrier_wait(), mutex locks
  • Joining threads: pthread_join(), return()

7
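
The contrast can be sketched in code (illustrative only; the array x, the size N, and the worker struct are assumptions, not from the slides): the same loop with OpenMP's implied scaffolding and with fully explicit Pthreads scaffolding.

    #include <omp.h>
    #include <pthread.h>

    #define N 1000000
    double x[N];

    /* OpenMP: the parallel region supplies the implied fork/join scaffolding
     * and the annotated loop is decomposed among the existing threads. */
    void scale_openmp(double s)
    {
        #pragma omp parallel              /* fork */
        {
            #pragma omp for               /* annotated loop: iterations divided up */
            for (long i = 0; i < N; i++)
                x[i] *= s;
        }                                 /* implied join/barrier */
    }

    /* Pthreads: all scaffolding is explicit -- a loop of pthread_create()
     * calls forks the workers, each worker handles its own block of
     * iterations, and a loop of pthread_join() calls joins them. */
    struct arg { long lo, hi; double s; };

    static void *worker(void *p)
    {
        struct arg *a = p;
        for (long i = a->lo; i < a->hi; i++)
            x[i] *= a->s;
        return NULL;
    }

    void scale_pthreads(double s, int nthreads)
    {
        pthread_t tid[nthreads];
        struct arg args[nthreads];
        long chunk = (N + nthreads - 1) / nthreads;

        for (int t = 0; t < nthreads; t++) {
            args[t].lo = t * chunk;
            args[t].hi = (t + 1) * chunk < N ? (t + 1) * chunk : N;
            args[t].s  = s;
            pthread_create(&tid[t], NULL, worker, &args[t]);   /* explicit fork */
        }
        for (int t = 0; t < nthreads; t++)
            pthread_join(tid[t], NULL);                        /* explicit join */
    }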

slide-8
SLIDE 8

Data-Centric Parallel Regions on XMT

8

Multiple loops, with different trip counts, after restructuring a reduction, all in a single parallel region:

    Parallel region 1 in foo
        Multiple processor implementation
        Requesting at least 50 streams
    Loop 2 in foo in region 1
        In parallel phase 1
        Dynamically scheduled, variable chunks, min size = 26
        Compiler generated
    Loop 3 in foo at line 7 in loop 2
        Loop summary: 1 loads, 0 stores, 1 floating point operations
        1 instructions, needs 50 streams for full utilization, pipelined
    Loop 4 in foo in region 1
        In parallel phase 2
        Dynamically scheduled, variable chunks, min size = 8
        Compiler generated
    Loop 5 in foo at line 10 in loop 4
        Loop summary: 2 loads, 1 stores, 2 floating point operations
        3 instructions, needs 44 streams for full utilization, pipelined

    void foo(int n, double* restrict a, double* restrict b,
             double* restrict c, double* restrict d)
    {
        int i, j;
        double sum = 0.0;

        for (i = 0; i < n; i++)        /* 3 P:$  ** reduction moved out of 1 loop */
            sum += a[i];

        for (j = 0; j < n/2; j++)      /* 5 P */
            b[j] = c[j] + d[j] * sum;
    }

slide-9
SLIDE 9

Parallel Histogram

Thread-Centric (Time = Nelements):

    PARALLEL-DO i = 0 .. Nelements-1
        j = 0
        while (j < Nbins && elem[i] < binmax[j])
            j++
        BEGIN CRITICAL-REGION
            counts[j]++              ← only 1 thread at a time
        END CRITICAL-REGION

  • Critical region around the update to the counts array
  • Serial bottleneck in the critical region
  • Wastes potential concurrency

Data-Centric (Time = Nelements/Nbins):

    PARALLEL-DO i = 0 .. Nelements-1
        j = 0
        while (j < Nbins && elem[i] < binmax[j])
            j++
        INT_FETCH_ADD(counts[j], 1)   ← updates are atomic

Updates to the count table are atomic:
  • Requires abundant fine-grained synchronization

All concurrency can be exploited:
  • Maximum concurrency limited to the number of bins
  • Every bin can be updated simultaneously
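
A minimal OpenMP rendering of the data-centric version (a sketch under assumptions: the array and parameter names are invented, and the last bin is treated as a catch-all so the index stays in range):

    #include <omp.h>

    /* Data-centric histogram: every bin can be updated concurrently, and the
     * only synchronization is a per-bin atomic increment (a fetch-and-add). */
    void histogram(const double *elem, long nelem,
                   const double *binmax, int nbins, long *counts)
    {
        #pragma omp parallel for
        for (long i = 0; i < nelem; i++) {
            int j = 0;
            while (j < nbins - 1 && elem[i] < binmax[j])   /* bin search as on the slide */
                j++;
            #pragma omp atomic                             /* atomic update, no critical region */
            counts[j]++;
        }
    }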

slide-10
SLIDE 10

Linked List Manipulation

Insertion Sort
  • Time = N*N/2 – inserting from the same side every time
  • Time = N*N/4 – insert at head or tail, whichever is nearer

  • Unlimited concurrency during search
  • Concurrency during manipulations
      Thread parallelism: one update to the list at a time
      Data parallelism: between each pair of elements

  • Grow the list length by 50% on every step
  • Two-phase insertion (search, then lock and modify)

10

slide-11
SLIDE 11

Two-Phase List Insertion (OpenMP)

1. Find the site to insert at
     • Do not lock the list
     • Traverse the list serially

2. Perform the list update
     • Enter the critical region
     • Re-confirm the site is unchanged
     • Update the list pointers
     • End the critical region

Only one insertion at a time; unlimited number of search threads.

More threads means...
  • more failed inter-node locks
  • more repeated (wasted) searches

Wallclock Time = N (parallel search – 1, serial updates – N)

11
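
The OpenMP flavour, sketched in C (illustrative: the node type, the dummy head node, and the named critical section are assumptions, not from the slides): search without locking, then confirm and link inside the single critical region, retrying when the site has changed.

    #include <omp.h>
    #include <stdlib.h>

    typedef struct node { int key; struct node *next; } node_t;

    /* Two-phase insertion into a sorted list with a dummy head node. */
    void insert_sorted(node_t *head, int key)
    {
        node_t *nw = malloc(sizeof *nw);
        nw->key = key;

        for (;;) {
            /* Phase 1: traverse without locking to find the insertion site */
            node_t *prev = head;
            while (prev->next && prev->next->key < key)
                prev = prev->next;
            node_t *succ = prev->next;

            int done = 0;
            /* Phase 2: one update at a time; re-confirm the site first */
            #pragma omp critical(list_update)
            {
                if (prev->next == succ) {
                    nw->next = succ;
                    prev->next = nw;
                    done = 1;
                }
            }
            if (done) return;          /* otherwise the search was wasted: retry */
        }
    }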

slide-12
SLIDE 12

Two-Phase List Insertion (MTA)

1. Find the site to insert at
     • Do not lock the list
     • Traverse the list serially

2. Perform the list update
     • Lock the two elements being inserted between
     • Acquire the locks in lexicographical order to avoid deadlocks
     • Confirm the link between the nodes is unchanged
     • Update the link pointers of the nodes
     • Unlock the two elements in reverse order of locking

N/4 maximum concurrent insertions
  • Insert between every pair of nodes
  • More insertion points means fewer failed inter-node locks

Total Time = log N

12
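
A rough C sketch of the idea (not MTA code: per-node spinlocks built from C11 atomic_flag stand in for the XMT's full/empty bits, and the list is assumed to carry head and tail sentinel nodes so both neighbours always exist):

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct node {
        int key;
        struct node *next;
        atomic_flag lock;            /* per-node lock, initialised with ATOMIC_FLAG_INIT */
    } node_t;

    static void node_lock(node_t *n)   { while (atomic_flag_test_and_set(&n->lock)) ; }
    static void node_unlock(node_t *n) { atomic_flag_clear(&n->lock); }

    /* One insertion attempt; the caller retries when false is returned. */
    bool try_insert(node_t *head, node_t *nw)
    {
        /* Phase 1: unlocked traversal (tail sentinel holds the largest key) */
        node_t *prev = head;
        while (prev->next->key < nw->key)
            prev = prev->next;
        node_t *succ = prev->next;

        /* Phase 2: lock both neighbours in a fixed (address) order -- the
         * stand-in for "lexicographical order" -- so inserters never deadlock */
        node_t *lo = prev < succ ? prev : succ;
        node_t *hi = prev < succ ? succ : prev;
        node_lock(lo);
        node_lock(hi);

        bool ok = (prev->next == succ);     /* link still unchanged? */
        if (ok) {
            nw->next = succ;
            prev->next = nw;
        }

        node_unlock(hi);                    /* unlock in reverse order of locking */
        node_unlock(lo);
        return ok;
    }

Because distinct pairs of neighbours can be locked at the same time, insertions at different points in the list proceed concurrently, which is where the N/4 maximum concurrent insertions above come from.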

slide-13
SLIDE 13

Global HACHAR – Initial Data Structure

[Diagram: region head and non-full region pointers; chain length table with chunk size; Region 0 (Next Free Slot = ∞) and Region 1 (Next Free Slot = 0)]

  • Use "two-step acquire" on the chain length, the region linked-list pointers, and the chain pointers.
  • Use int_fetch_add on "next free slot" to allocate a list node.
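
A small sketch of the node-allocation step (names and sizes are invented; C11 atomic_fetch_add stands in for the XMT's int_fetch_add): a single fetch-and-add on the region's "next free slot" hands out nodes without any lock.

    #include <stdatomic.h>

    #define REGION_SIZE 1024             /* chunk size, chosen arbitrarily here */

    typedef struct hnode {               /* e.g. a Bag-of-Words entry */
        const char *word;
        long word_id;
        struct hnode *next;
    } hnode;

    typedef struct region {
        _Atomic long next_free;          /* "next free slot" counter */
        struct region *next_region;      /* region linked list */
        hnode nodes[REGION_SIZE];
    } region;

    /* Returns a fresh node, or NULL when this region has overflowed and a new
     * region must be linked in (as in the "Region Overflow" slide). */
    hnode *alloc_node(region *r)
    {
        long slot = atomic_fetch_add(&r->next_free, 1);
        return (slot < REGION_SIZE) ? &r->nodes[slot] : NULL;
    }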

slide-14
SLIDE 14

Global HACHAR – Two items inserted

[Diagram: two items inserted, giving two chains of length 1; the hash function range indexes the chain-length table; each list node holds a Word and a Word Id; Region 0 (Next Free Slot = ∞), Region 1 (Next Free Slot = 0)]

  • The length is "locked" and the item is inserted at the head of the list
  • Potential contention only on the length
  • The list node shows an example for Bag Of Words

slide-15
SLIDE 15

Global HACHAR – Collisions

[Diagram: collisions are chained; one chain has length 3, another length 1; Region 0 (Next Free Slot = ∞), Region 1 (Next Free Slot = x); chain length table with chunk size]

  • Lookup: walk the chain, no locking
  • Malloc and free are limited to the few region buffers
  • Growing a chain requires locking only the last pointer (int_fetch_add on the length)

slide-16
SLIDE 16

Global HACHAR – Region Overflow

[Diagram: chains of length 3 and 2; Region 0 and Region 1 are now full (Next Free Slot = ∞); a new Region 2 has been allocated, with Next Free Slot = 1]

slide-17
SLIDE 17

Once Per Thread vs. Once Per Iteration

OpenMP
  • Parallel regions are separate from loops
  • Loop decomposition idiom already captured in conventional pragma syntax

Allocating Storage Once Per Thread

    PARALLEL-DO i = 0 .. Nthreads-1                 ← parallel loop, one iteration per thread
        float *p = malloc(...)                       ← malloc hoisted out of the inner loop, or fused into the outer loop
        int n_iters = Nelements / Nthreads
        DO j = i*n_iters .. min(N, (i+1)*n_iters)    ← block of serial iterations
            p[j] = ... x[j] ...
            x[j] = ... p[j] ...
        ENDDO
        free(p)
    END-PARALLEL-DO

XMT: Abominable Kludge
  • Non-portable pragma semantics
  • Requires separate loops, possibly a parallel region

17
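
In conventional OpenMP syntax the idiom looks roughly like this (a sketch; the array x and the scratch-buffer arithmetic are illustrative): the allocation sits in the parallel region, once per thread, while the annotated loop is decomposed among those same threads.

    #include <omp.h>
    #include <stdlib.h>

    void process(double *x, long n)
    {
        #pragma omp parallel
        {
            double *p = malloc(n * sizeof *p);   /* executed once per thread */

            #pragma omp for
            for (long j = 0; j < n; j++) {       /* decomposed among the threads */
                p[j] = 2.0 * x[j];
                x[j] = p[j] + 1.0;
            }

            free(p);                             /* once per thread */
        }
    }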

slide-18
SLIDE 18

Synchronization is a Natural Part of Parallelism

Synchronization cannot be avoided; it must be made efficient and easy to use.

"Lock-free" algorithms aren't...
  • Sequential execution of atomic ops (compare-and-swap, fetch-and-add)
  • Hidden lock semantics in the compiler or hardware (change either and you have a bug)
  • Communication-free loop-level data parallelism (i.e., vectorization) is a limited kind of parallelism

Synchronization is needed for many purposes
  • Atomicity
  • Enforce order dependencies
  • Manage threads

Synchronization must be abundant
  • Don't want to worry about allocating or rationing synchronization variables

"I'm a lock-free scalable parallel algorithm!"

18

slide-19
SLIDE 19

Conclusions

Synchronization is a natural part of parallel programs
  • Synchronization must be abundant, efficient, and easy to use

Some latencies can be minimized; others must be tolerated
  • Parallelism can mask latency
  • Enough parallelism makes latency irrelevant

Fine-grained synchronization improves utilization
  • More opportunities for exploiting parallelism
  • Proactively share resources

Parallelism is performance portable
  • Same programming model from desktop to supercomputer
  • Quality of Service determined by the amount of concurrency in hardware, not optimizations in software

AMOs versus Tag Bits
  • Tags make fine-grained parallelism possible
  • More Data == More Concurrency
  • AMOs do not scale

19
