
SLIDE 1

Parallel Programming

  • Prof. Jesús Labarta

BSC & UPC

Barcelona, July 1st 2019

SLIDE 2

What am I doing here?

Already used in Mateo12

SLIDE 3

“As below, so above”

  • Leverage computer architecture background in higher levels of the system stack
  • Looking for further insight

SLIDE 4

The Programming model osmotic membrane

[Figure: the programming model as a membrane between Applications above and the runtime below, analogous to the ISA / API boundary; "Power to the runtime"]

  • PM: high-level, clean, abstract interface
  • What is the right degree of porosity?

SLIDE 5

Integrate concurrency and data

  • Single mechanism
  • Concurrency:
    • Dependences built from data accesses (sketch below)
    • Lookahead: about instantiating work
  • Locality & data management:
    • From data accesses
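A minimal sketch of this single mechanism with plain OpenMP task dependences (the values and the printf are illustrative only): the in/out annotations describe data accesses, the runtime derives the ordering edge from them, and in OmpSs the same annotations also feed locality and data management.

#include <stdio.h>

// Assumed to be called from inside a parallel/single region.
void produce_consume(double *x) {
   // Producer task: declares that it writes x[0].
   #pragma omp task depend(out: x[0])
   x[0] = 42.0;

   // Consumer task: the in access on x[0] orders it after the
   // producer; no explicit synchronization construct is needed.
   #pragma omp task depend(in: x[0])
   printf("consumed %f\n", x[0]);

   #pragma omp taskwait   // wait for both tasks before returning
}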

SLIDE 6

OmpSs

  • A forerunner for OpenMP
  • Features prototyped along the timeline, up to today:
    + Task prototyping
    + Task dependences
    + Task priorities
    + Taskloop prototyping
    + Task reductions
    + Taskwait dependences
    + OMPT impl.
    + Multideps
    + Commutative
    + Taskloop dependences
    + Data affinity

SLIDE 7

Important topics/practices

  • Regions
  • Nesting
  • Taskloops + dependences
  • Hints
  • Taskify communications: MPI Interoperability
  • Malleability
  • Homogenize Heterogeneity
  • Hierarchical “acceleration”
  • Memory management & Locality

SLIDE 8

Regions

  • Precise nD subarray accesses
  • "Complex" analysis, but …
  • … an enabler for:
    • Recursion
    • Flexible nesting
    • Taskloop dependences
    • Data management
      • locality
      • layout

void gs (float A[(NB+2)*BS][(NB+2)*BS]) {
   int it, i, j;
   // Sweep the grid NITERS times, tile by tile (BS x BS blocks).
   for (it = 0; it < NITERS; it++)
      for (i = 0; i < N-2; i += BS)
         for (j = 0; j < N-2; j += BS)
            gs_tile(&A[i][j]);
}

// OmpSs-style task annotation on the function: the in regions are
// the four halo borders of the tile, the inout region the tile itself.
#pragma omp task in(A[0][1;BS], A[BS+1][1;BS], \
                    A[1;BS][0], A[1;BS][BS+1]) \
                 inout(A[1;BS][1;BS])
void gs_tile (float A[N][N]) {
   for (int i = 1; i <= BS; i++)
      for (int j = 1; j <= BS; j++)
         A[i][j] = 0.2 * (A[i][j] + A[i-1][j] + A[i+1][j]
                        + A[i][j-1] + A[i][j+1]);
}
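In the clauses above, A[1;BS] uses the OmpSs array-section syntax [start;length], i.e., BS elements starting at row or column 1; the equivalent colon form [lower:upper] denotes an inclusive range, so A[1:BS] names the same elements (the original slide mixed both forms; they are normalized to the semicolon form here).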

SLIDE 9

Nesting

  • Top down (see the sketch below)
  • Every level contributes
  • Flattening the dependence graph:
    • Increase concurrency
    • Take runtime overhead off the critical path
  • Granularity control
    • final clauses, runtime
  • J. M. Perez et al., "Improving the Integration of Task Nesting and Dependencies in OpenMP", IPDPS 2017
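A minimal sketch of top-down nesting with plain OpenMP tasks (block sizes and the kernel are hypothetical). OmpSs-2, in the IPDPS 2017 paper above, additionally offers weak dependence variants so inner dependences can cross the outer task boundary without forcing early synchronization.

void process_chunk(double *c, int n);   // hypothetical fine-grain kernel

void sweep(double *A, int nblocks, int bs) {
   for (int b = 0; b < nblocks; b++) {
      // Coarse outer task: cheap to instantiate; it moves the
      // instantiation of the inner tasks off the critical path.
      #pragma omp task depend(inout: A[b*bs])
      {
         for (int c = 0; c < bs; c += 64) {
            // Fine inner tasks raise concurrency once the outer
            // task runs; they only order among their siblings.
            #pragma omp task depend(inout: A[b*bs + c])
            process_chunk(&A[b*bs + c], 64);
         }
         #pragma omp taskwait   // inner tasks finish before the outer one completes
      }
   }
}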

SLIDE 12

Taskloops & dependences

  • Dependences
    • Intra-loop
    • Inter-loop
  • Dynamic granularities (sketch below)
    • Guided
    • Runtime
    • Combination
  • Enabled by regions support

[Figure: taskloop iteration space split into tasks T1, T2, T3, T4, …, TN]
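A minimal sketch of dynamic granularity with the standard taskloop construct (the grainsize value is arbitrary); the intra- and inter-loop dependences above rely on the region support and are only indicated in the trailing comment.

void scale(double *x, long n) {
   // The runtime splits the iteration space into tasks of roughly
   // 512 iterations each (the T1 … TN of the figure).
   #pragma omp taskloop grainsize(512)
   for (long i = 0; i < n; i++)
      x[i] *= 2.0;
   // With region dependences, each generated task could declare
   // in/out on its own subrange of x, so a chunk of a later loop
   // may start as soon as the matching chunk of this one finishes.
}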

SLIDE 13

Taskifying MPI calls

  • MPI: a "fairly sequential" model
  • Taskifying MPI calls
  • Opportunities:
    • Overlap / out-of-order execution
    • Provide laxity for communications
    • Migrate/aggregate load balance issues
  • Risk of introducing deadlocks
  • TAMPI (sketch below)
    • Virtualize the "communication resource"

[Figure: physics and FFT phases of the IFS weather code kernel, ECMWF]

  • V. Marjanovic et al., "Overlapping Communication and Computation by Using a Hybrid MPI/SMPSs Approach", ICS 2010
  • K. Sala et al., "Extending TAMPI to Support Asynchronous MPI Primitives", OpenMPCon 2018
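A minimal sketch of taskified point-to-point calls (neighbor ranks and buffer sizes are hypothetical). With a plain MPI library this pattern can deadlock once every worker thread blocks in a receive; TAMPI virtualizes the communication resource so a task blocked in MPI is paused and its core reused. TAMPI is requested at startup through an extended threading level (MPI_TASK_MULTIPLE); check the TAMPI documentation for the exact initialization.

#include <mpi.h>

void halo_exchange(double *recv_buf, double *send_buf, int n,
                   int up, int down) {
   // Receive from the upper neighbor; tasks reading recv_buf
   // are ordered after this one by the out dependence.
   #pragma omp task depend(out: recv_buf[0])
   MPI_Recv(recv_buf, n, MPI_DOUBLE, up, 0,
            MPI_COMM_WORLD, MPI_STATUS_IGNORE);

   // Send to the lower neighbor once send_buf has been produced.
   #pragma omp task depend(in: send_buf[0])
   MPI_Send(send_buf, n, MPI_DOUBLE, down, 0, MPI_COMM_WORLD);
}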

SLIDE 14

Exploiting malleability

  • Malleability
    • omp_get_thread_num, threadprivate, large parallel regions …
  • Dynamic load balance & resource management
    • Intra/inter process/application
    • Library (DLB)
    • Runtime interception (PMPI, OMPT, …)
    • API to hint resource demands
    • Core reallocation policy
  • Opportunity to fight Amdahl's law
  • Productive / easy!
    • Hybridize only imbalanced regions (sketch below)
    • Nx1

  • M. Garcia et al., "LeWI: A Runtime Balancing Algorithm for Nested Parallelism", ICPP 2009
  • "Hints to Improve Automatic Load Balancing with LeWI for Hybrid Applications", JPDC 2014

[Figure: ECHAM results]

https://pm.bsc.es/dlb
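A minimal sketch of the Nx1 / "hybridize only the imbalanced region" pattern (the kernels are hypothetical). Note that no DLB calls appear: LeWI works by interception, lending the cores of ranks idling in MPI to still-busy ranks on the same node.

#include <mpi.h>

void imbalanced_phase(int rank);   // per-rank cost varies strongly
void balanced_phase(void);

void step(int rank) {
   // Only the imbalanced phase is hybridized with tasks, so DLB
   // can grow a busy rank's thread pool with borrowed cores.
   #pragma omp parallel
   #pragma omp single
   imbalanced_phase(rank);

   // Intercepted by DLB (e.g. via the MPI profiling interface):
   // while a rank waits here, its cores are lent to node neighbors.
   MPI_Barrier(MPI_COMM_WORLD);

   balanced_phase();
}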

SLIDE 15

Homogenizing Heterogeneity

  • Performance heterogeneity
  • ISA heterogeneity
  • Several non-coherent address spaces

SLIDE 16

On the OmpSs road

  • NT-CHEM: taskify communications, top down
  • Alya: Dynamic Load Balance (DLB), commutative / multideps
  • FFTlib (QE miniapp): taskify communications, top down
  • Lulesh: top down, nesting

SLIDE 17

DMRG structure

  • Density Matrix Renormalization Group app in condensed matter physics (ORNL)
  • Skeleton:
    • 3 nested loops
    • Reduction on a large array
    • Huge variability of op cost
  • Real miniapp
    • Different sizes of Y entries

T Y[N];
for (i)
   for (j)
      for (k)
         Y[i] += M[k] op X[j]

SLIDE 18

OpenMP parallelizations

  • Reduction
    • Based on full array privatization (see the sketch below)
    • Using reduction clauses
  • Nested parallels
    • Worksharings / tasks
  • Synchronization at the end of parallels exposes the cost of load imbalance at all levels
  • Overheads at fine levels
  • Issues activating multiple levels
    • Core partition

[Figure: levels of privatization for parallelization at the i, j and k loops]
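A minimal sketch of the full-array-privatization strategy, written with an OpenMP 4.5 array-section reduction on the DMRG skeleton (loop bounds and the use of * for "op" are placeholders). Parallelizing the j loop means every thread updates all of Y, so each needs a private copy: exactly what makes this approach memory-hungry when Y is large.

void reduce_privatized(double Y[], const double M[], const double X[],
                       int N, int NJ, int NK) {
   // Each thread gets a private Y[0:N], initialized to zero and
   // combined into the original array when the loop ends.
   #pragma omp parallel for reduction(+: Y[0:N])
   for (int j = 0; j < NJ; j++)
      for (int i = 0; i < N; i++)
         for (int k = 0; k < NK; k++)
            Y[i] += M[k] * X[j];   // stand-in for "Y[i] += M[k] op X[j]"
}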

SLIDE 19

Taskification

  • Serialize reductions
  • Multiple dependence chains

T Y[N];
for (i)
   for (j)
      for (k)
         Y[i] += M[k] op X[j]

SLIDE 20

Taskification

  • Serialize reductions
  • Multiple dependence chains
  • Split operations
    • Compute & reduce
  • Persist intermediate results
    • Global array of tmps
    • Used in a circular way
    • Enforce antidependence
  • Reduce overhead → do not split small operations (see the tasked sketch below)
    • Compute directly on the target operand
    • Avoid task instantiation and dependence overhead
    • Avoid memory allocation, initialization & reduction to the target operand

T Y[N]; T tmp[Npriv];
for (i)
   for (j)
      for (k)
         if (small)
            Y[i] += M[k] op X[j]
         else
            tmp[next] = M[k] op X[j]
            Y[i] += tmp[next];
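A minimal sketch of the split scheme with explicit tasks (NPRIV, the small test, and * for "op" are placeholders; assumed to run inside a parallel/single region). The out/in pair on tmp[next] separates compute from reduce, and reusing a slot creates the antidependence that bounds the pool.

enum { NPRIV = 64 };                 // size of the circular pool of temporaries

void taskified(double Y[], const double M[], const double X[],
               int N, int NJ, int NK, int small) {
   static double tmp[NPRIV];         // global array of tmps, reused circularly
   int next = 0;
   for (int i = 0; i < N; i++)
      for (int j = 0; j < NJ; j++)
         for (int k = 0; k < NK; k++) {
            if (small) {
               // Small op: compute directly on the target operand,
               // avoiding the extra task, temporary and reduction.
               #pragma omp task depend(inout: Y[i])
               Y[i] += M[k] * X[j];
            } else {
               // Compute task: heavy op into a temporary. The out on
               // tmp[next] antidepends on the previous reduce task
               // that read this slot, throttling its reuse.
               #pragma omp task depend(out: tmp[next])
               tmp[next] = M[k] * X[j];

               // Reduce task: short, serialized only per Y[i].
               #pragma omp task depend(in: tmp[next]) depend(inout: Y[i])
               Y[i] += tmp[next];

               next = (next + 1) % NPRIV;
            }
         }
   #pragma omp taskwait              // drain all compute/reduce tasks
}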

SLIDE 21

Resulting dependence chains

SLIDE 22

Performance?

  • Causes of the pulsation of red tasks?
    • Instantiation order and granularity
    • Graph dependencies
  • Improvements:
    • Priorities
    • Antidependence distances
    • Nesting

Do these effects also happen at the ISA level? Can similar techniques be used to improve performance?


SLIDE 23

Question on graph scheduling dynamics

$m\ddot{y} + b\dot{y} + k\,y = F_0 \cos(\omega t)$

  • Effective k, m, b?
  • Excitation?
  • Graph generation?
  • Resources?

SLIDE 24

Thanks to Yale

“There is no limit to what you can achieve provided you do not care who takes the credit”

I first heard it from Yale. Thanks!

SLIDE 25

Thanks