Optimizing Remote Accesses for Offloaded Kernels: Application to High-Level Synthesis for FPGA


Slide 1: Optimizing Remote Accesses for Offloaded Kernels: Application to High-Level Synthesis for FPGA

  Christophe Alias, Alain Darte, Alexandru Plesco
  Compsys Team, Laboratoire de l'Informatique du Parallélisme (LIP), École normale supérieure de Lyon
  Workshop on Polyhedral Compilation Techniques (IMPACT'12), Jan. 23, 2012, Paris, France

Slide 2: Outline

  1. Context and motivations (see ASAP'10 paper)
     - HLS tools, interfaces, and communications
     - Optimizing DDR accesses
  2. Communicating processes and "double buffering"
  3. Kernel off-loading with polyhedral techniques

Slide 3: High-level synthesis (HLS) tools

  Many industrial and academic tools: Spark, Gaut, Ugh, MMalpha, Catapult-C,
  Pico-Express, Impulse-C, etc.

  Quite good at optimizing the computation kernel:
  - Optimizes the finite state machine (FSM).
  - Exploits instruction-level parallelism (ILP).
  - Performs operator selection, resource sharing, scheduling, etc.
  But most designers prefer to ignore HLS tools and code in VHDL.

  Still a huge problem for feeding the accelerators with data:
  - Lack of good interface support ☛ write (expert) VHDL glue.
  - Lack of communication optimizations ☛ redesign the algorithm.
  - Lack of powerful code analyzers ☛ rename or find tricks.

Slide 4: Our goal: use HLS tools as back-end compilers

  Focus on accelerators limited by bandwidth:
  - Use the adequate FPGA resources for computation throughput.
  - Optimize bandwidth throughput.

  Apply source-to-source transformations:
  - Push the dirty work onto the back-end compiler.
  - Optimize transfers at C level.
  - Compile any new functions with the same HLS tool.

  Use Altera C2H as a back-end compiler. Main features:
  - Syntax-directed translation to hardware.
  - Basic DDR-latency-aware software pipelining with internal FIFOs.
  - Full interface within the complete system.
  - A few compilation pragmas.

Slide 5: Asymmetric DDR accesses: need burst communications

  Example: DDR-400 128Mb×8, size 16MB, CAS 3, 200MHz. Successive reads to
  the same row take 10 ns; to different rows, 80 ns.
  ➽ Bad spatial DDR locality can kill performance by a factor of 8!

      void vector_sum(int* __restrict__ a, int* __restrict__ b,
                      int* __restrict__ c, int n) {
        for (int i = 0; i < n; i++)
          c[i] = a[i] + b[i];
      }

  [Timing diagram, non-optimized version: the loads of a(i) and b(i) and the
  store of c(i) interleave, so each single-word access pays a PRECHARGE +
  ACTIVATE cycle on the /RAS, /CAS, /WE command lines before data appears on
  DQ.] Non-optimized version: time gaps + data thrown away.

  [Timing diagram, block version: one burst of loads a(i)...a(i+k), one burst
  of loads b(i)...b(i+k), one burst of stores c(i)...c(i+k), with a single
  PRECHARGE + ACTIVATE per block.] Optimized block version: reduces gaps,
  exploits bursts.
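At C level, the block version sketched in the diagram can be made explicit. Below is a minimal sketch, not code from the talk: the names vector_sum_blocked, la, lb, lc and the block size B are ours, and the local arrays stand for on-chip buffers filled by burst reads and drained by burst writes.

    #define B 256   /* block size: hypothetical value, tuned in practice */

    void vector_sum_blocked(int* __restrict__ a, int* __restrict__ b,
                            int* __restrict__ c, int n) {
      int la[B], lb[B], lc[B];                 /* local (on-chip) buffers */
      for (int i = 0; i < n; i += B) {
        int len = (n - i < B) ? (n - i) : B;   /* last block may be short */
        for (int k = 0; k < len; k++) la[k] = a[i + k];   /* burst read of a */
        for (int k = 0; k < len; k++) lb[k] = b[i + k];   /* burst read of b */
        for (int k = 0; k < len; k++) lc[k] = la[k] + lb[k];
        for (int k = 0; k < len; k++) c[i + k] = lc[k];   /* burst write of c */
      }
    }

Each inner loop touches consecutive addresses, so each block needs only one row activation instead of one per element.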

Slide 6: Experimental results: typical examples

  [Figure: typical speed-up vs. block size (here, vector sum); speed-up on a
  0-7 scale, for block sizes from 2 to 8192.]

  Kernel | Speed-up | ALUTs | Dedicated registers | Total registers | Total block memory bits | DSP block 9-bit elements | Max frequency (MHz > 100)
  -------|----------|-------|---------------------|-----------------|-------------------------|--------------------------|--------------------------
  SA     | 1        | 5105  | 3606                | 3738            | 66908                   | 8                        | 205.85
  VS0    | 1        | 5333  | 4607                | 4739            | 68956                   | 8                        | 189.04
  VS1    | 6.54     | 10345 | 10346               | 11478           | 269148                  | 8                        | 175.93
  MM0    | 1        | 6452  | 4557                | 4709            | 68956                   | 40                       | 191.09
  MM1    | 7.37     | 15255 | 15630               | 15762           | 335196                  | 188                      | 162.02

  SA: system alone. VS0 & VS1: vector sum, direct & optimized versions.
  MM0 & MM1: matrix-matrix multiply, direct & optimized versions.

Slide 7: Outline

  1. Context and motivations (see ASAP'10 paper)
  2. Communicating processes and "double buffering"
     - Loop tiling and the polytope model
     - Overview of the compilation scheme
     - Communication coalescing: related work
  3. Kernel off-loading with polyhedral techniques
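To fix intuitions before the detailed slides, here is what "double buffering" means in this context: while the accelerator computes on the current block, the next block is prefetched into a second buffer. The sketch below is hypothetical (all names are ours, and the computation is a placeholder); in the real design, load, compute, and store run as communicating hardware processes, which sequential C can only suggest.

    #define B 256   /* block size: hypothetical value */

    static void load_block(const int* src, int k, int buf[B]) {
      for (int i = 0; i < B; i++) buf[i] = src[k * B + i];   /* burst read */
    }
    static void compute_block(int buf[B]) {
      for (int i = 0; i < B; i++) buf[i] = 2 * buf[i];       /* placeholder work */
    }
    static void store_block(int* dst, int k, const int buf[B]) {
      for (int i = 0; i < B; i++) dst[k * B + i] = buf[i];   /* burst write */
    }

    void process_double_buffered(const int* in, int* out, int nblocks) {
      int buf[2][B];                               /* ping-pong buffers */
      load_block(in, 0, buf[0]);                   /* prologue: fetch block 0 */
      for (int k = 0; k < nblocks; k++) {
        if (k + 1 < nblocks)                       /* prefetch the next block   */
          load_block(in, k + 1, buf[(k + 1) & 1]); /* while computing this one  */
        compute_block(buf[k & 1]);
        store_block(out, k, buf[k & 1]);
      }
    }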

Slide 8: Polyhedral model in a nutshell

  Example: product of polynomials.

      S1: for (i = 0; i <= 2*N; i++)
            c[i] = 0;
      S2: for (i = 0; i <= N; i++)
            for (j = 0; j <= N; j++)
              c[i+j] = c[i+j] + a[i]*b[j];

  Schedules: θ(S1, i) = (0, i) and θ(S2, i, j) = (1, i, j).
  [Figure: iteration domain of S2 in the (i, j) plane, drawn for N = 3.]

  - Affine (parameterized) loop bounds and accesses.
  - Iteration domain, iteration vector.
  - Instance-wise analysis, affine transformations.
  - PIP: lexicographic minimum in a polytope, given as a Quast (a tree whose
    internal nodes are affine inequalities on the parameters and whose leaves
    are affine functions).
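To make the Quast representation concrete, a toy example (ours, not from the slides): the lexicographic minimum of a one-dimensional set with one parameter p, as PIP would return it.

    % Example Quast: one internal node (the test p >= 0), two leaves (p and 0).
    \[
      \operatorname{lexmin}\{\, i \mid i \ge 0,\ i \ge p \,\}
      = \begin{cases} p & \text{if } p \ge 0,\\ 0 & \text{otherwise.} \end{cases}
    \]

The internal node is the affine test p ≥ 0 on the parameter; the two leaves are the affine functions p and 0.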

Slide 9: Polyhedral model: tiling

  Tiled product of polynomials, with θ(i, j) = (i + j, i).
  [Figure: the (i, j) iteration domain partitioned into tiles after skewing by θ.]

  - n loops are transformed into n tile loops + n intra-tile loops.
  - Tiling is expressed from permutable loops by an affine function θ,
    here θ: (i, j) ↦ (i + j, i).
  - A tile is an atomic block operation:
    - it increases the granularity of computations;
    - it enables communication coalescing (hoisting).
  A sketch of the resulting loop nest follows this slide.
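What the tile loops and intra-tile loops might look like for statement S2 of the polynomial product, as a hypothetical C sketch (ours, not the tool's output), with new coordinates t1 = i + j and t2 = i after the skewing θ, and a tile size B:

    static int imax(int x, int y) { return x > y ? x : y; }
    static int imin(int x, int y) { return x < y ? x : y; }

    /* Domain in the new coordinates: 0 <= t1 <= 2N and
       max(0, t1 - N) <= t2 <= min(N, t1). */
    void poly_product_tiled(int* c, const int* a, const int* b, int N, int B) {
      for (int T1 = 0; T1 <= 2 * N; T1 += B)          /* tile loops */
        for (int T2 = 0; T2 <= N; T2 += B)
          /* intra-tile loops: tile bounds intersected with the domain */
          for (int t1 = T1; t1 <= imin(T1 + B - 1, 2 * N); t1++)
            for (int t2 = imax(T2, imax(0, t1 - N));
                 t2 <= imin(T2 + B - 1, imin(N, t1)); t2++) {
              int i = t2, j = t1 - t2;                /* back to original indices */
              c[i + j] += a[i] * b[j];
            }
    }

Because each tile executes atomically, all remote reads of a and b that a tile needs can be hoisted before it and coalesced into bursts, which is the point of the next slides.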
