Automatic task-based parallelization of Python codes
MS12: Task-based Programming for Scientific Computing: Runtime Support
Cristián Ramón-Cortés, Ramon Amela, Jorge Ejarque, Philippe Clauss, Rosa M. Badia
Outline
Introduction
PLUTO
AutoParallel
Evaluation
Conclusions and Future Work
THE GOAL: Any field expert can scale up an application to hundreds of cores
Parallel issues: execution orchestration, identifying parallel regions, concurrency management
Distributed issues: remote execution, data transfers
Ease the development of distributed applications
Based on sequential programming
Task-based programming model
Potential concurrency
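The runtime discovers this potential concurrency from the data each task reads and writes. As a rough illustration (a hypothetical helper, not the PyCOMPSs runtime), a scheduler can build the task dependency graph from per-task read/write sets:

```python
# Minimal sketch (not the PyCOMPSs runtime): derive task dependencies
# from the data each task reads and writes, in program order.
def build_dependencies(trace):
    """trace: list of (task_name, reads, writes) tuples in program order.
    Returns edges (i, j) meaning task j must wait for task i."""
    edges = []
    last_writer = {}  # datum -> index of the last task that wrote it
    for j, (name, reads, writes) in enumerate(trace):
        for d in reads + writes:
            if d in last_writer:  # true (read-after-write) or output dependence
                edges.append((last_writer[d], j))
        for d in writes:
            last_writer[d] = j
    return edges

trace = [
    ("multiply", ["a0", "b0"], ["c"]),  # writes c
    ("multiply", ["a1", "b1"], ["c"]),  # accumulates on c -> waits for task 0
    ("wordcount", ["block0"], ["p0"]),  # independent: may run concurrently
]
print(build_dependencies(trace))  # -> [(0, 1)]
```

Tasks with no edge between them (here, the wordcount) are the potential concurrency the runtime exploits.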
Infrastructure agnostic
Supports other types of parallelism
Python decorators for task selection + synchronization API
Instance and class methods
Task data directions
@task(a=IN, b=IN, c=INOUT)
def multiply_acum(a, b, c):
    c += a * b
@task(returns=dict)
def wordcount(block):
    ...

@task(result=INOUT)
def reduce(result, pres):
    ...

def main(a, b, c):
    for block in data:
        pres = wordcount(block)
        reduce(result, pres)
    result = compss_wait_on(result)
    # f = compss_open(fn)
    # compss_delete_file(f)
    # compss_delete_object(o)
    # compss_barrier()

@task(returns=int)
def multiply(a, b, c):
    return c + a * b

@constraint(computingUnits="2")
@task(file=FILE_IN)
def my_task(x):
    ...

@binary(binary="sed")
@task(f=FILE_INOUT)
def binary_task(flag, expr, f):
    pass
▶ The Polyhedral Model represents the instances of the loop nests' statements as integer points inside a polyhedron
▶ PLUTO is an automatic parallelization tool based on the Polyhedral Model to optimize arbitrarily nested loop sequences with affine dependencies
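As a concrete illustration of this abstraction (a toy sketch, not PLUTO itself), the iteration domain of a triangular two-level loop nest is the set of integer points satisfying its affine bounds:

```python
# Sketch of the polyhedral abstraction: the iteration domain of
#   for i in range(n):
#       for j in range(i, n):   # statement S(i, j)
# is the polyhedron { (i, j) : 0 <= i <= j <= n - 1 }; each integer
# point inside it is one dynamic instance of the statement S.
def iteration_domain(n):
    return [(i, j) for i in range(n) for j in range(n)
            if 0 <= i <= j <= n - 1]

print(len(iteration_domain(4)))  # 10 points: a triangular polyhedron
```

Tools like PLUTO reason about such domains (and affine dependences between their points) to find legal loop transformations and parallel dimensions.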
A single Python decorator to parallelize sequential code containing affine loop nests and execute it in a distributed way
from pycompss.api.parallel import parallel

@parallel()
def matmul(a, b, c, m_size):
    for i in range(m_size):
        for j in range(m_size):
            for k in range(m_size):
                c[i][j] += np.dot(a[i][k], b[k][j])
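To make the semantics of the decorated function concrete, here is a self-contained pure-Python analogue of the loop nest above, with the np.dot block product replaced by a scalar product so it runs without NumPy or PyCOMPSs:

```python
# Pure-Python analogue of the blocked matmul above: each c[i][j]
# accumulates sum over k of a[i][k] * b[k][j]. In the real code the
# entries are matrix blocks and the product is np.dot.
def matmul(a, b, c, m_size):
    for i in range(m_size):
        for j in range(m_size):
            for k in range(m_size):
                c[i][j] += a[i][k] * b[k][j]

m = 2
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[0.0] * m for _ in range(m)]
matmul(a, b, c, m)
print(c)  # [[19.0, 22.0], [43.0, 50.0]]
```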
Grid, Cluster, Cloud, Container
Python decorator
Sequential code
Automatic taskification
No data management
No resource management
▶ Taskification of affine loop nests at runtime
@parallel()
def matmul(a, b, c, m_size):
    for i in range(m_size):
        for j in range(m_size):
            for k in range(m_size):
                c[i][j] += np.dot(a[i][k], b[k][j])

# [COMPSs AutoParallel] Begin Autogenerated code
@task(var2=IN, var3=IN, var1=INOUT)
def S1(var2, var3, var1):
    var1 += np.dot(var2, var3)

def matmul(a, b, c, m):
    if m >= 1:
        for t1 in range(0, m - 1 + 1):  # i
            lbp = 0
            ubp = m - 1
            for t2 in range(lbp, ubp + 1):  # k
                lbv = 0
                ubv = m - 1
                for t3 in range(lbv, ubv + 1):  # j
                    S1(a[t1][t2], b[t2][t3], c[t1][t3])
    compss_barrier()
# [COMPSs AutoParallel] End Autogenerated code
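Note that the autogenerated version permutes the two inner loops (i, k, j instead of i, j, k). Each c[i][j] still accumulates over k in the same order, so both schedules compute identical results; a quick PyCOMPSs-free check (scalar product standing in for np.dot) confirms this:

```python
# Check that the permuted loop order of the autogenerated version
# (i, k, j) matches the original (i, j, k): per element c[i][j], the
# additions over k happen in the same order, so the results are equal.
def matmul_ijk(a, b, m):
    c = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_ikj(a, b, m):
    c = [[0.0] * m for _ in range(m)]
    for t1 in range(m):          # i
        for t2 in range(m):      # k
            for t3 in range(m):  # j
                c[t1][t3] += a[t1][t2] * b[t2][t3]
    return c

a = [[2.0, 1.0], [0.0, 3.0]]
b = [[1.0, 4.0], [5.0, 6.0]]
assert matmul_ijk(a, b, 2) == matmul_ikj(a, b, 2)
```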
AutoParallel module:
─ Decorator: @parallel()
─ Python to OpenScop translator: builds, from Python's AST, an OpenScop representation of each affine loop nest detected in the user function
─ Parallelizer: runs PLUTO on each loop nest and retrieves its Python code using OpenMP syntax
─ Python to PyCOMPSs translator: adds the PyCOMPSs constructs (task annotations and data synchronizations) to the annotated Python code (uses Python's AST)
─ Code replacer: replaces the original user code by the auto-generated code
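The first translation stage can be sketched with Python's standard ast module (a toy sketch, not the actual AutoParallel code): parse the user function and report the depth of each candidate nest of perfectly nested for loops:

```python
# Toy sketch of loop-nest detection: parse a function's source into an
# AST and measure the depth of each perfectly nested `for` loop chain.
import ast
import textwrap

SRC = textwrap.dedent("""
def matmul(a, b, c, m_size):
    for i in range(m_size):
        for j in range(m_size):
            for k in range(m_size):
                c[i][j] += a[i][k] * b[k][j]
""")

def nest_depth(node):
    depth = 0
    while isinstance(node, ast.For):
        depth += 1
        # follow the body only while it is a single nested statement
        node = node.body[0] if len(node.body) == 1 else None
    return depth

tree = ast.parse(SRC)
func = tree.body[0]  # the FunctionDef node
nests = [nest_depth(stmt) for stmt in func.body if isinstance(stmt, ast.For)]
print(nests)  # [3]: one loop nest of depth 3
```

The real translator additionally checks that bounds and accesses are affine before emitting OpenScop.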
Code Analysis    LoC    CC    NPath
User             220    26    112
Auto             274    36    14,576
LoC: Lines of Code; CC: Cyclomatic Complexity; NPath: NPath Complexity
Problem Size (same for User and Auto): Total Matrix Size 65,536 x 65,536; #Blocks 32 x 32; Block Size 2,048 x 2,048

Execution    Task Types    #Tasks    Speedup @ 192 cores
User         3             6,512     1.95
Auto         4             7,008     2.04

Loop Analysis    #Main    #Total    Depth
User             1        4         3
Auto             3        9         3
Code Analysis    LoC    CC    NPath
User             238    35    79,872
Auto             320    39    331,776

Problem Size (same for User and Auto): Total Matrix Size 49,152 x 49,152; #Blocks 24 x 24; Block Size 2,048 x 2,048

Execution    Task Types    #Tasks    Speedup @ 192 cores
User         4             14,676    2.45
Auto         12            15,227    2.13

Loop Analysis    #Main    #Total    Depth
User             2        6         3
Auto             2        6         3
▶ In-depth performance analysis
[Figure: execution traces, UserParallel vs. AutoParallel]
Code Analysis    LoC    CC    NPath
User             303    41    168
Auto             406    43    344

Problem Size (same for User and Auto): Total Matrix Size 32,768 x 32,768; #Blocks 16 x 16; Block Size 2,048 x 2,048

Execution    Task Types    #Tasks    Speedup @ 192 cores
User         4             19,984    2.37
Auto         20            26,304    2.10

Loop Analysis    #Main    #Total    Depth
User             1        6         3
Auto             2        7         3
▶ AutoParallel goes one step further in easing the development of distributed applications
▶ It can automatically taskify affine loop nests and execute them in distributed infrastructures
▶ The automatically generated codes for the Cholesky, LU, and QR applications can achieve the same performance as the manually parallelized versions
▶ Next steps
─ Automatically build data blocks for sequential applications based on loop tiles. Requires:
  ─ Research on how to simplify the chunk accesses from the AutoParallel module
  ─ Extend PyCOMPSs to support collection objects (e.g., lists)
─ Support a larger scope of loop nests (e.g., APOLLO)
cristian.ramon-cortes@bsc.es
cristianrcv/pycompss-autoparallel
http://compss.bsc.es/