  1. Array Replication to Increase Parallelism in Applications Mapped to Configurable Architectures
  Heidi Ziegler, Priyadarshini Malusare, and Pedro Diniz
  University of Southern California / Information Sciences Institute
  4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292, USA
  {ziegler,priya,pedro}@isi.edu
  * This work was partly funded by the National Science Foundation (NSF) under award numbers CCR-0209228 and NGS-0204040. Heidi Ziegler is also supported by Intel and the Boeing Company.

  2. Motivation
  • Emerging architectures have configurable memories
    – Number, size, interconnect
  • Opportunities
    – Large on-chip storage area for data
    – Large on-chip read and write bandwidth
    – Relatively low cost to replicate and copy data
  • How do we make use of these features?
  [Figure: configurable architecture with a CPU, D-RAM, configurable on-chip S-RAM blocks, and configurable logic]

  3. Basic Ideas
  • Expose concurrency between loop nests
    – Replicate arrays to eliminate anti- and output-dependences
    – Add synchronization
    – Add update logic
  • Eliminate memory contention
    – Replicate arrays
    – Add synchronization
    – Add update logic
  [Figure: loop nests L1, L2, L3 contending for a single memory vs. each accessing its own replica]

  4. Dependence Definitions
  [Figure: loop nests L1, L2, L3 executing in order over time, each accessing a 2-D array in row-wise order; a read in an earlier nest followed by a write to the same data in a later nest forms a WAR (anti-) dependence]

  5. Example Kernel
  • Start with all data mapped to the same memory and sequential execution
  [Figure: Loop 1 reads A[i-1][*] and Loop 2 reads A[i][*] from a single memory; Loop 3 writes A[i][*], creating an anti-dependence with the reads]
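The slide does not show the kernel's source. A minimal Python sketch of the access pattern it describes (the array contents, the bounds, and the doubling performed by Loop 3 are all invented for illustration) could be:

```python
N = 8
A = [[i + j for j in range(N)] for i in range(N)]

sums = []
for i in range(1, N):                        # outer control loop
    s1 = sum(A[i - 1][j] for j in range(N))  # Loop 1: reads row i-1
    s2 = sum(A[i][j] for j in range(N))      # Loop 2: reads row i
    for j in range(N):                       # Loop 3: writes row i --
        A[i][j] *= 2                         # anti-dependent on Loop 2's read
    sums.append((s1, s2))
```

Because Loop 3 overwrites the very row that Loop 2 just read, the three nests must execute strictly in order within each control-loop iteration.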

  6. Exploiting Basic Data Independence
  • L1 and L2 can be parallelized
  • {L1, L2} and L3 cannot
  [Figure: fork/join — Loop 1 reads A[i-1][*] and Loop 2 reads A[i][*] in parallel; only after the join may Loop 3 write A[i][*]]

  7. Using Array Renaming
  • Create a local copy of array A in order to remove the anti-dependence
  [Figure: fork/join — Loops 1 and 2 read A[i-1][*] and A[i][*] while Loop 3 writes to a copy A_1[i][*]; after the join, A_1 is copied back. All three loops now run in parallel, but sharing one memory causes contention]
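Assuming a kernel shape like the one on slide 5 (contents and bounds invented), the renaming transformation might be sketched as follows. Loop 3 writes into a fresh copy instead of A, so the anti-dependence disappears; here it even runs before the readers, an order the original code would have forbidden:

```python
N = 8
A = [[i + j for j in range(N)] for i in range(N)]

sums = []
for i in range(1, N):
    # Loop 3 writes into a renamed copy instead of A, removing the
    # WAR dependence -- the three nests could be forked concurrently.
    new_row = [2 * A[i][j] for j in range(N)]  # Loop 3 (renamed target)
    s1 = sum(A[i - 1][j] for j in range(N))    # Loop 1
    s2 = sum(A[i][j] for j in range(N))        # Loop 2
    sums.append((s1, s2))
    # update logic: after the join, commit the copy back into A so the
    # next control-loop iteration observes the written values
    A[i] = new_row
```

The results match the sequential original exactly; only the legal execution order has been relaxed.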

  8. Using Array Renaming & Replication
  [Figure: three memories — Loop 1 reads A_1[i-1][*], Loop 2 reads A_2[i][*], and Loop 3 writes A_3[i][*], each in its own replica; after the join, the written data is copied into all replicas]
  • No memory contention, but more memory space
  • Care is needed in updating the copies across iterations of the control loop
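With full replication, each loop nest owns a replica of A (hypothetically mapped to a separate on-chip memory), so no two nests ever touch the same memory. A Python sketch under the same invented kernel shape:

```python
N = 8
base = [[i + j for j in range(N)] for i in range(N)]
# full replication: one copy of A per loop nest
A_1 = [row[:] for row in base]   # replica read by Loop 1
A_2 = [row[:] for row in base]   # replica read by Loop 2
A_3 = [row[:] for row in base]   # replica written by Loop 3

sums = []
for i in range(1, N):
    s1 = sum(A_1[i - 1])                 # Loop 1 touches only A_1
    s2 = sum(A_2[i])                     # Loop 2 touches only A_2
    new_row = [2 * x for x in A_3[i]]    # Loop 3 touches only A_3
    sums.append((s1, s2))
    # update logic: broadcast the written row into every replica
    # (on the hardware, by writing the memories in tandem on one bus)
    for rep in (A_1, A_2, A_3):
        rep[i] = new_row[:]
```

The broadcast at the end of each control-loop iteration is exactly the "care in updating copies" the slide warns about: skip it, and the replicas diverge.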

  9. Mapping to a Configurable Architecture
  [Figure: target architecture — CPU, D-RAM, and on-chip S-RAM blocks holding A_1, A_2, A_3, connected to configurable logic implementing L1, L2, L3]
  • Many on-chip memories, so each array may be accessed in parallel
  • The replication operation is done by writing to the memories in tandem over the same bus

  10. Array Data Access Descriptor

      A_{n_i}^{{ER,WR}} = { lb_1 ≤ d_1 ≤ ub_1, …, lb_x ≤ d_x ≤ ub_x }

  The set describes basic data access information:
  – n_i: program point / loop nest
  – A: array name
  – ER, WR: exposed read or write array access
  – lb, ub: lower and upper bound of each dimension
  – d_1 … d_x: accessed array section, as integer linear inequalities
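A toy version of such a descriptor can be built for the accesses in the running example. This sketch simplifies the descriptor to one (lb, ub) pair per dimension; the analysis proper uses general systems of integer linear inequalities, and the helper `section` is invented for illustration:

```python
def section(i_lb, i_ub, row_offset, n_cols):
    """Rectangular section touched by accesses A[i + row_offset][*]
    as the control index ranges over i_lb <= i <= i_ub."""
    return [(i_lb + row_offset, i_ub + row_offset), (0, n_cols - 1)]

# For an 8x8 array with control loop 1 <= i <= 7:
er_L1 = section(1, 7, -1, 8)   # exposed read of A[i-1][*]: rows 0..6
wr_L3 = section(1, 7, 0, 8)    # exposed write of A[i][*]:  rows 1..7
```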

  11. Compiler Analysis: Data Dependence
  Solve for data dependences between the write section of one loop nest and the exposed read section of another:

      WR_{L1}^B = { lb_1 ≤ d_1 ≤ ub_1, lb_2 ≤ d_2 ≤ ub_2 }
      ER_{L2}^B = { lb_1 ≤ d_1 ≤ ub_1, lb_2 ≤ d_2 ≤ ub_2 }

      data_dependence = f( WR_{n_i}^B , ER_{n_j}^B )
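For the simplified rectangular descriptors above, the dependence test reduces to a section-intersection check: a dependence exists iff the write section of one nest and the exposed read section of another overlap in every dimension. A sketch (the full analysis instead solves the integer linear inequality system):

```python
def data_dependence(wr, er):
    """True iff the two rectangular sections intersect in every
    dimension, i.e. the write of one nest may touch data the other
    nest reads."""
    return all(max(l1, l2) <= min(u1, u2)
               for (l1, u1), (l2, u2) in zip(wr, er))

wr_L1 = [(1, 7), (0, 7)]   # hypothetical write section: rows 1..7
er_L2 = [(0, 6), (0, 7)]   # hypothetical exposed read: rows 0..6
dep = data_dependence(wr_L1, er_L2)   # rows 1..6 overlap -> dependence
```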

  12. Compiler Analysis: Outline
  • Outline:
    – Identify control loop
      • CFG with coarse-grain task information
    – Extract data dependence information
      • Exposed read and write information
      • Array sections for affine array accesses
    – Extract parallel regions
      • Using array renaming to eliminate anti-dependences
    – Identify array copies for reduced contention
      • Currently replicate written arrays (partial replication)
      • Replicate all written and read arrays (full replication)
  • Status:
    – Compiler analysis implemented in SUIF
    – Code generation and translation to VHDL still manual

  13. Analysis Example
  [Figure: worked analysis example]

  14. Experimental Methodology
  • Goal
    – Evaluate the cost/benefit of array renaming and replication
    – Target a configurable logic device: a field programmable gate array (FPGA)
    – Use many memory blocks for array storage
  • Synthetic kernels
    – HIST: 3 loop nests; 3 arrays
    – BIC: 4 loop nests; 4 arrays (most array data)
    – LCD: 3 loop nests; 2 arrays
  • Methodology
    – Analyze using SUIF and transform benchmarks manually
    – Loop-level execution times and memory schedules from Monet™
    – Simulate total execution times using loop-level inputs
    – Manual replication transformation

  15. Execution Time Results
  [Table: kernel execution cycles (simulation) for the original code, partial replication (written arrays only), and full replication (all arrays) — computation, update of replicas, stall due to memory contention, total execution, and overall reduction (percentage)]
  • Fully replicated code versions achieve speedups between 1.4 and 2.1

  16. Storage Requirement Results
  [Table: kernel space requirements — size in KBytes and increment]
  • Fully replicated code versions require storage increases of up to a factor of 2

  17. Discussion
  • Overhead of updating copies can be negligible
    – Provided there is enough bandwidth for updates
    – And a small number of replicas
  • Memory contention
    – Can be substantial even with a small number of arrays
    – Scheduling could mitigate this issue somewhat
  • Preliminary results reveal:
    – Removal of anti-dependences can enable substantial increases in execution speed at the cost of a modest increase in storage
    – The increase in space can be non-negligible if the initial footprint is small

  18. Related Work
  • Array privatization
    – Eigenmann et al., LCPC 1991; Li, ICS 1992; Tu et al., LCPC 1993
  • Fine-grain memory parallelism
    – So et al., CGO 2004
  • Pipelining and communication for FPGAs
    – Tseng, PPoPP 1995; Ziegler et al., DAC 2003
  • This work:
    – Relaxes constraints of previous analyses
      • Loop nests rather than statements
      • Coarser-grain and loop-carried dependences
    – Combines the transformations to take advantage of configurable architecture characteristics
      • Many on-chip memories
      • Low-cost array replication

  19. Conclusion
  • This paper:
    – Describes a simple loop-nest analysis for task-level parallelism
      • Uses renaming and replication to eliminate dependences across loop nests, with selected replication strategies
      • Uses array section analysis to identify the replication regions for each array
    – Targets configurable architectures with
      • Many on-chip memories
      • Programmable on-chip routing
  • Results
    – Respectable speedups with modest space increases
    – Need to be expanded to larger kernels

  20. Thank You
