A DOMAIN SPECIFIC APPROACH TO HETEROGENEOUS PARALLELISM
Hassan Chafi, Arvind Sujeeth, Kevin Brown, HyoukJoong Lee, Anand Atreya, Kunle Olukotun
Stanford University Pervasive Parallelism Laboratory (PPL)
Mobile
Battery operated Passively cooled
Data center
Energy costs Infrastructure costs
Heterogeneous HW for energy efficiency
Multi-core, ILP, threads, data-parallel engines, custom engines
H.264 encode study
[Chart: performance and energy savings of 4 cores + ILP + SIMD + custom instructions vs. ASIC]
Source: Understanding Sources of Inefficiency in General-Purpose Chips (ISCA’10)
Future performance gains will mainly come from heterogeneous hardware with different specialized resources
Cray Jaguar Sun T2 Nvidia Fermi Altera FPGA
MPI
Pthreads OpenMP
CUDA OpenCL Verilog VHDL
Too many different programming models
Virtual Worlds Personal Robotics Data informatics Scientific Engineering
Applications
Ideal Parallel Programming Language
Performance Productivity Completeness
PPL Target Languages
Domain Specific Languages
Domain Specific Languages (DSLs)
Programming language with restricted expressiveness for a particular domain
Productivity
Shield programmers from the difficulty of parallel programming
Focus on algorithms, not low-level implementation details
Performance
Match the high-level domain abstraction to parallel execution patterns
Restricted expressiveness makes it easier to extract the available parallelism
Portability and forward scalability
DSL and runtime can evolve to exploit new hardware features
Applications remain unchanged, preserving portability
Applications: Virtual Worlds, Personal Robotics, Data informatics, Scientific Engineering
Domain Specific Languages: Physics (Liszt), Data Analysis, Probabilistic (RandomT), Machine Learning (OptiML), Rendering
DSL Infrastructure: Domain Embedding Language (Scala) with Polymorphic Embedding, Staging, Static Domain Specific Opt.; Parallel Runtime (Delite) with Dynamic Domain Spec. Opt., Task & Data Parallelism, Locality Aware Scheduling
Heterogeneous Hardware
Machine Learning domain
Learning patterns from data; applying the learned models to tasks
Regression, classification, clustering, estimation
Computationally expensive; regular and irregular parallelism
Characteristics of ML applications
Iterative algorithms on fixed structures
Large datasets with potential redundancy
Trade-off between accuracy and performance
Large amount of data parallelism with varying granularity
Low arithmetic intensity
Raise the level of abstraction
Focus on algorithmic description, get parallel performance
Use domain knowledge to identify coarse-grained parallelism
Identify parallel and sequential operations in the domain (e.g. summations, batch gradient descent)
Single source => multiple heterogeneous targets
Not possible with today's MATLAB support
Domain specific optimizations
Optimize data layout and operations using domain-specific semantics
A driving example
Flesh out issues with the common framework, embedding etc.
Provides a familiar (MATLAB-like) language and API for writing ML applications
Ex. val c = a * b (a, b are Matrix[Double])
Implicitly parallel data structures
General data types: Vector[T], Matrix[T]
Special data types: TrainingSet, TestSet, IndexVector, Image, Video, ...
Encode semantic information
Implicitly parallel control structures
sum { … }, (0::end) { … }, gradient { … }, untilconverged { … }
Allow anonymous functions with restricted semantics to be passed as arguments of the control structures
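From the user's point of view these control structures are ordinary higher-order functions. As a rough illustration of the semantics only (a hypothetical Python sketch, not OptiML's staged implementation, which is free to parallelize the work inside each step), an untilconverged-style combinator might look like:

```python
import math

def untilconverged(x0, step, tol=1e-6, max_iter=1000):
    """Repeatedly apply `step` until successive values are within `tol`.

    Sequential sketch of the semantics only; in OptiML the runtime can
    parallelize the computation performed by each step.
    """
    x = x0
    for _ in range(max_iter):
        x_next = step(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

# Example: fixed point of cos(x), approx. 0.739
fixed_point = untilconverged(1.0, math.cos, tol=1e-10, max_iter=10000)
print(round(fixed_point, 3))
```

The restricted semantics mentioned above (e.g. no arbitrary side effects in the anonymous function) are what let the runtime reorder or parallelize the iterations of such constructs safely.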
% x : Matrix, y : Vector
% mu0, mu1 : Vector
n = size(x,2);
sigma = zeros(n,n);
parfor i=1:length(y)
  if (y(i) == 0)
    sigma = sigma + (x(i,:)-mu0)'*(x(i,:)-mu0);
  else
    sigma = sigma + (x(i,:)-mu1)'*(x(i,:)-mu1);
  end
end
// x : TrainingSet[Double]
// mu0, mu1 : Vector[Double]
val sigma = sum(0, x.numSamples) {
  if (x.labels(_) == false) {
    (x(_)-mu0).trans.outer(x(_)-mu0)
  } else {
    (x(_)-mu1).trans.outer(x(_)-mu1)
  }
}
MATLAB code vs. OptiML code (parallel)
ML-specific data types; implicitly parallel control structures; restricted index semantics
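Both snippets compute the same quantity: a within-class scatter matrix for two-class Gaussian discriminant analysis, built by summing outer products of per-sample deviations. A plain-Python sketch (hypothetical names, no OptiML semantics) of that computation:

```python
def scatter(x, y, mu0, mu1):
    """Within-class scatter matrix: sum over samples of d^T * d,
    where d is the deviation of sample i from its class mean."""
    n = len(x[0])                       # number of features
    sigma = [[0.0] * n for _ in range(n)]
    for i, label in enumerate(y):       # the summation both snippets express
        mu = mu0 if label == 0 else mu1
        d = [x[i][j] - mu[j] for j in range(n)]
        for r in range(n):              # outer product, accumulated into sigma
            for c in range(n):
                sigma[r][c] += d[r] * d[c]
    return sigma

x = [[1.0, 2.0], [3.0, 4.0]]
y = [0, 1]
mu0, mu1 = [1.0, 2.0], [3.0, 4.0]
print(scatter(x, y, mu0, mu1))  # both deviations are zero here
```

The loop iterations are independent apart from the accumulation into sigma, which is why OptiML can express the whole computation as a parallel sum reduction.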
`parfor` is nice, but not always best
MATLAB uses heavy-weight MPI processes under the hood
Precludes vectorization, a common practice for best performance
GPU code requires different constructs
The application developer must choose an implementation, and these details are all over the code
ind = sort(randsample(1:size(data,2), length(min_dist)));
data_tmp = data(:,ind);
all_dist = zeros(length(ind), size(data,2));
parfor i=1:size(data,2)
  all_dist(:,i) = sum(abs(repmat(data(:,i),1,size(data_tmp,2)) - data_tmp),1)';
end
all_dist(all_dist==0) = max(max(all_dist));
Relaxed dependencies
Iterative algorithms with inter-loop dependencies prohibit task parallelism
Dependencies can be relaxed at the cost of a marginal loss in accuracy
Best effort computations
Some computations can be dropped and still generate acceptable results
Provide data structures with "best effort" semantics, along with policies that can be chosen by DSL users
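As a hypothetical sketch of what a "best effort" data structure could look like (illustrative only; the actual Delite/OptiML structures and policies differ), consider a vector whose bulk update may drop a user-chosen fraction of element updates:

```python
import random

class BestEffortVector:
    """Toy vector whose bulk update may skip a fraction of elements.
    The drop_rate is the user-selected best-effort policy."""

    def __init__(self, data, drop_rate=0.0, seed=0):
        self.data = list(data)
        self.drop_rate = drop_rate
        self._rng = random.Random(seed)

    def update(self, f):
        for i in range(len(self.data)):
            if self._rng.random() < self.drop_rate:
                continue                # drop this element's update
            self.data[i] = f(self.data[i])

v = BestEffortVector([1.0] * 10, drop_rate=0.3)
v.update(lambda e: e + 1.0)
updated = sum(1 for e in v.data if e == 2.0)
print(f"{updated}/10 updates applied")
```

In an algorithm like k-means, skipping some distance computations this way trades a small accuracy loss for less work per iteration, which is the trade-off reported in the results below.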
Source: Best-effort parallel execution framework for recognition and mining applications (IPDPS'09)
Building DSLs is hard
Building parallel DSLs is harder
For the DSL approach to parallelism to work, we need many DSLs
Delite provides a common infrastructure
An interface for mapping domain operations to composable parallel patterns
Provides re-usable components: GPU manager, heterogeneous code generation, etc.
Delite view of a DSL: a collection of OPs
Delite supports OP APIs that express parallel execution patterns
DeliteOP_Map, DeliteOP_ZipWith, DeliteOP_Reduce, etc.
Planning to add more specialized ops
DSL author maps each DSL operation to one of the patterns (can be difficult)
OPs record their dependencies (both mutable and immutable)
case class OP_+[A](val collA: Matrix[A], val collB: Matrix[A], val out: Matrix[A])
    (implicit ops: ArithOps[A])
  extends DeliteOP_ZipWith2[A,A,A,Matrix] {
  def func = (a,b) => ops.+(a,b)
}
Callouts: dependencies (the constructor arguments), interface for this pattern, execution pattern
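The DeliteOP names correspond to standard data-parallel skeletons. As a rough Python analogue (hypothetical, using a thread pool rather than Delite's runtime), map, zipWith, and reduce look like:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical analogues of the Delite patterns.  Delite's real OPs also
# record dependencies and are scheduled across heterogeneous devices.
def delite_map(f, xs, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(f, xs))

def delite_zipwith2(f, xs, ys, workers=4):
    # ZipWith2: element-wise combination of two collections, as in OP_+
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: f(*p), zip(xs, ys)))

def delite_reduce(f, xs):
    # Associativity of f is what licenses a parallel tree reduction;
    # shown sequentially here for brevity.
    acc = xs[0]
    for x in xs[1:]:
        acc = f(acc, x)
    return acc

a, b = [1, 2, 3], [10, 20, 30]
print(delite_zipwith2(lambda x, y: x + y, a, b))  # element-wise +
```

Expressing a DSL operation as one of these patterns is what tells the runtime which parallelization is legal, without the DSL author writing any threading code.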
Executes a task graph on parallel, heterogeneous hardware
(paper) performs dynamic scheduling decisions
(soon) both static and dynamic scheduling
Integrates task and data parallelism in a single framework
Task parallelism at the DSL operation granularity
Data parallelism by data decomposition of a single OP
Provides efficient implementations of the parallel execution patterns
Execution flow: the application calls Matrix DSL methods; the DSL defers OP execution to the Delite runtime; Delite applies generic & domain transformations and generates a mapping to hardware
sigma = gpuArray(zeros(n,n));
for i=1:m
  if (y(i) == 0)
    sigma = sigma + gpuArray(x(i,:)-mu0)'*gpuArray(x(i,:)-mu0);
  else
    sigma = sigma + gpuArray(x(i,:)-mu1)'*gpuArray(x(i,:)-mu1);
  end
end
MATLAB Parallel Computing Toolbox
sigma = gzeros(n,n);
y = gdouble(y); x = gdouble(x);
for i=1:m
  if (y(i) == 0)
    sigma = sigma + (x(i,:)-mu0)'*(x(i,:)-mu0);
  else
    sigma = sigma + (x(i,:)-mu1)'*(x(i,:)-mu1);
  end
end
AccelerEyes Jacket
No change in the application source code
Same application code runs on any kind of heterogeneous system
Good for portability
Runtime (not the DSL user) dynamically determines whether to ship the operation to the GPU or not
Good for productivity
Performance optimizations under the hood
Memory transfer between CPU and GPU
On-chip device memory utilization
Concurrent kernel executions
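To illustrate the kind of decision the runtime makes when choosing a device (a hypothetical cost model, not Delite's actual policy), an OP is worth shipping to the GPU only when the estimated kernel speedup outweighs the host-to-device transfer cost:

```python
# Hypothetical cost model for a CPU-vs-GPU dispatch decision.  All
# names and numbers here are illustrative; Delite's scheduler also
# tracks which data already resides in device memory.
def choose_device(bytes_to_transfer, cpu_time_est, gpu_time_est,
                  already_on_gpu=False, pcie_gbps=8.0):
    # Transfer cost vanishes if the operands are already on the device.
    transfer_time = 0.0 if already_on_gpu else bytes_to_transfer / (pcie_gbps * 1e9)
    return "gpu" if gpu_time_est + transfer_time < cpu_time_est else "cpu"

# Small op on host-resident data: transfer dominates, stay on the CPU.
print(choose_device(bytes_to_transfer=8_000_000, cpu_time_est=0.001,
                    gpu_time_est=0.0005))
# Same op when the data is already resident on the device.
print(choose_device(bytes_to_transfer=8_000_000, cpu_time_est=0.001,
                    gpu_time_est=0.0005, already_on_gpu=True))
```

This is why keeping data in device memory across OPs (one of the optimizations listed above) can flip the decision for otherwise identical operations.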
[Diagram: Delite runtime architecture. The Delite main thread runs the scheduler over a task graph of Delite OPs (e.g. +, *, /) on inputs A, B, C. CPU executor threads operate on main memory; the GPU execution manager handles input/output transfers, a kernel-call cache/map, and device memory on the GPU devices.]
DSL OPs require implementations of GPU kernels
(paper) DSL provides optimized implementations
Libraries (CUBLAS, CUFFT, etc.) can be used
(now) GPU kernels generated from Scala kernels
Write once, run anywhere; libraries can still be used
What about DSL constructs with anonymous functions?
The GPU task is given by the DSL user, not the DSL writer
Impossible to pre-generate kernels
Solution: automatically generate corresponding GPU kernels at compile time
Original code:

val a = Vector[Double](n)
val b = 3.28
val c = (0::n) { i => i * b * a(i) }

Transformed code (Scala compiler plugin / embedding, AST manipulation):

val a = Vector[Double](n)
val b = 3.28
val c = (0::n) { DeliteGPUFunc( {i => i * b * a(i)}, 0, List(a,b) ) }

Generated CUDA code:

__global__ void kernel0(double *input, double *output, int length, double *a, double b) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < length)
    output[i] = i * b * a[i];
}
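The key idea is staging: the anonymous function's body is captured as an expression tree and then pretty-printed as kernel source. A toy Python analogue (hypothetical; the real system manipulates Scala ASTs via a compiler plugin) can capture the tree by running the lambda on symbolic values with overloaded operators:

```python
# Toy staging sketch: trace a lambda symbolically, then emit CUDA-like
# kernel source.  Exp/SymVector are illustrative names, not Delite APIs.
class Exp:
    def __init__(self, s): self.s = s
    def __mul__(self, other):
        other = other.s if isinstance(other, Exp) else repr(other)
        return Exp(f"({self.s} * {other})")
    __rmul__ = __mul__

class SymVector:
    def __init__(self, name): self.name = name
    def __getitem__(self, idx): return Exp(f"{self.name}[{idx.s}]")

def generate_kernel(f, free_vars):
    body = f(Exp("i"))                    # run the lambda on a symbol
    params = ", ".join(f"double *{v}" for v in free_vars)
    return (f"__global__ void kernel0(double *out, int n, {params}) {{\n"
            f"  int i = blockIdx.x*blockDim.x + threadIdx.x;\n"
            f"  if (i < n) out[i] = {body.s};\n"
            f"}}")

a = SymVector("a")
b = 3.28
print(generate_kernel(lambda i: i * b * a[i], ["a"]))
```

Because the lambda is executed on symbols rather than data, the generator sees the whole computation at compile time, which is what makes pre-generating a kernel for a user-supplied function possible.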
4 different implementations:
OptiML + Delite, MATLAB (original, GPU, Jacket)
System:
Intel Nehalem, 2 sockets, 8 cores, 16 threads, 24 GB DRAM
NVIDIA GTX 275 GPU
6 machine learning applications
Gaussian Discriminant Analysis (GDA)
Generative learning algorithm for probability distribution
Loopy Belief Propagation (LBP)
Graph based inference algorithm
Naïve Bayes (NB)
Supervised learning algorithm for classification
K-means Clustering (K-means)
Unsupervised learning algorithm for clustering
Support Vector Machine (SVM)
Optimal margin classifier using SMO algorithm
Restricted Boltzmann Machine (RBM)
Stochastic recurrent neural network
[Charts: normalized execution time vs. number of CPUs; annotations give speedup over the 1 CPU baseline]

App         | DELITE (1/2/4/8 CPU)  | Parallelized MATLAB (1/2/4/8 CPU)
K-means     | 1.0 / 1.8 / 3.6 / 6.3 | 1.1 / 1.2 / 1.2 / 1.2
SVM         | 1.0 / 3.1 / 4.4 / 5.5 | 0.7 / 1.6 / 2.1 / 2.3
LBP         | 1.0 / 1.9 / 3.4 / 5.2 | 0.1 / 0.1 / 0.1 / 0.1
RBM         | 1.0 / 1.9 / 3.1 / 3.0 | 1.0 / 1.9 / 3.4 / 4.7
GDA         | 1.0 / 1.7 / 1.8 / 1.9 | 0.5 / 1.0 / 1.4 / 1.6
Naive Bayes | 1.0 / 2.0 / 3.4 / 4.6 | 0.6 / 0.8 / 1.0 / 1.1
[Chart: normalized speedup of GDA, RBM, SVM, KM, NB, LBP under DELITE, MATLAB (GPU), and MATLAB (Jacket GPU)]
Speedup relative to single core execution time on the Nehalem system
[Chart: normalized execution time for best effort and relaxed dependencies]
K-means: best-effort with 1.2% error gives 1.8x, with 4.2% error 4.9x, with 7.4% error 12.7x speedup
SVM: relaxed dependencies with +1% error give 1.8x speedup
Speedup relative to 8 core execution time on the Nehalem system
Using Domain Specific Languages (DSLs) is a potential solution for heterogeneous parallelism
OptiML is a proof-of-concept DSL for ML
Productive, portable, performant
Delite is a framework for building DSLs and a parallel runtime
Simplifies developing implicitly parallel DSLs
Maps DSLs to heterogeneous devices
Performs GPU specific optimizations and automatic code generation
Experimental results show that OptiML + Delite outperforms parallelized MATLAB on most of the benchmarks