Refinements in Data Manipulation Method for Coarse Grained - PowerPoint PPT Presentation

Refinements in Data Manipulation Method for Coarse Grained Reconfigurable Architectures Takuya Kojima and Hideharu Amano Keio University, Japan 14th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC 2019)

Importance of Programmability and High Energy Efficiency
- Forthcoming devices
  - IoT devices
  - Wearable computers
  - Edge computing
- Challenges for these devices
  - Programmability: to satisfy various demands
  - High energy efficiency: to extend battery life

CGRAs: Coarse-Grained Reconfigurable Architectures
- CGRAs
  - Support word-level reconfiguration (in contrast to the bit-level reconfiguration of FPGAs)
  - Have many PEs (Processing Elements) in a 2D grid
  - Change the functionality of each ALU and the interconnection between PEs, dynamically or statically

Power-hungry Dynamic Reconfiguration
- Dynamic reconfiguration
  - Changes configuration cycle-by-cycle
  - Provides more flexibility
  - Causes large dynamic power consumption
Figure: details of power consumption for a dynamic reconfiguration CGRA [1] (pie chart; labels Others and Computation with 30% and 30%, Clock Tree and Reconfiguration with 15% and 25%)
[1] Ozaki, Nobuaki, et al. "Cool mega-arrays: Ultralow-power reconfigurable accelerator chips." IEEE Micro 31.6 (2011): 6-18.

SF-CGRAs: Straight-Forward CGRAs
- Key features of straight-forward CGRAs
  - Limited data flow direction
  - Pipelined PE array
  - Less frequent reconfiguration
  - High energy efficiency
- PipeRench [2]
- XPP [3]
- EGRA [4]
- RSPA [5]
Figure: block diagram (permutation networks, PE array, pipeline register, data memory)
[2] H. Schmit, et al., CICC 2002
[3] M. Petrov, et al., FPL 2004
[4] G. Ansaloni, et al., TVLSI 2011
[5] Yoon, Jonghee W., et al., ASP-DAC 2008

VPCMA: Variable Pipelined Cool Mega Array [2]
- PE array consists of
  - 8 x 12 PEs
  - 7 pipeline registers
- PE has
  - No register file
  - No clock tree
- Pipeline register works in
  1. latch mode, or
  2. bypass mode
- μ-controller
  - Controls data transfer between the data memory and the PE array
Figure: block diagram (PE array, pipeline registers, μ-controller, data manipulator, data memory)
[2] N. Ando, et al. "Variable pipeline structure for Coarse Grained Reconfigurable Array CMA." Field-Programmable Technology, 2016.

Computation on the PE array
- Fetch registers are connected to the inputs of the PE array
- Gather registers are connected to the outputs of the PE array
- The micro-controller
  - Writes data to the fetch registers
  - Reads results from the gather registers
Figure: PE array with fetch registers and gather registers

Variable Pipeline Structure
- No registers in each pipeline stage
  → Pure combinational circuit
- Clock tree only for activated pipeline registers
- Variable pipeline structure depending on the application
Figure: example in which the eight PE rows (1st to 8th) are divided into four pipeline stages (stage1 to stage4)
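The relationship between pipeline-register modes and pipeline stages can be sketched in a few lines of Python. This is only an illustrative model, not the authors' implementation: it assumes 8 PE rows with one pipeline register between each pair of adjacent rows (7 in total), and that a register in latch mode closes the current stage while a register in bypass mode merges the next row into it. The function name `partition_stages` and the example configuration are hypothetical.

```python
from typing import List

NUM_ROWS = 8  # PE rows in the array (8 x 12 PEs)


def partition_stages(latch: List[bool]) -> List[List[int]]:
    """Group PE rows (numbered from 1) into pipeline stages.

    latch[i] is True when the pipeline register between row i+1 and
    row i+2 is in latch mode, and False when it is in bypass mode.
    """
    if len(latch) != NUM_ROWS - 1:
        raise ValueError("expected one mode per pipeline register (7)")
    stages = [[1]]
    for i, is_latch in enumerate(latch):
        row = i + 2
        if is_latch:
            stages.append([row])    # a latched register starts a new stage
        else:
            stages[-1].append(row)  # bypass: the row joins the current stage
    return stages


if __name__ == "__main__":
    # Hypothetical configuration: latch every second register.
    modes = [False, True, False, True, False, True, False]
    print(partition_stages(modes))
```

With every register in bypass mode the whole array behaves as a single combinational stage; with every register latched it becomes eight one-row stages.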

Multi-cycle Execution on PE Array
- Micro-controller
  - A custom tiny RISC processor controls the processing
  - "Fetch" op kicks off the execution
  - "Gather" op writes back the results
  - "Delay" op specifies the delay time of "Gather" execution
  - "Branch" op makes a loop
  (Slide annotation beside these ops: fused into an instruction)
Figure: timing chart over cycles. Each row runs Fetch, stage1, stage2, stage3, stage4, Gather; Gather is delayed 4 cycles; Delay and Branch ops are marked.

Multi-cycle Execution on PE Array
Figure: the same timing chart with Gather delayed 8 cycles. NOPs appear in the instruction stream, and the timing is adjusted by inserting other instructions.
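The overlap of fetches and delayed gathers can be illustrated with a small timeline model. It is a sketch under simplifying assumptions that the slides do not state: one fetch is issued per cycle, and the result of a fetch is gathered exactly `delay` cycles later. The exact cycle accounting of the real micro-controller may differ, and the function name `schedule` is hypothetical.

```python
from typing import List, Optional, Tuple


def schedule(n_iters: int, delay: int) -> List[Tuple[int, Optional[int], Optional[int]]]:
    """Return (cycle, fetched_iteration, gathered_iteration) for each cycle.

    Assumption: iteration i is fetched at cycle i and gathered at cycle i + delay.
    """
    if n_iters < 0 or delay < 1:
        raise ValueError("n_iters must be >= 0 and delay must be >= 1")
    timeline = []
    for cycle in range(n_iters + delay):
        fetched = cycle if cycle < n_iters else None
        gathered = cycle - delay if cycle >= delay else None
        timeline.append((cycle, fetched, gathered))
    return timeline


if __name__ == "__main__":
    for cycle, fetched, gathered in schedule(n_iters=3, delay=4):
        print(f"cycle {cycle}: fetch={fetched} gather={gathered}")
```

In this model a larger `delay` lengthens the fill and drain phases of the loop, which is where other instructions could be scheduled.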

Data Manipulator of VPCMA
- Data manipulator
  - Placed between the data memory and the PE array
  - Transfers any input data to any outputs
  - Loads at most 12 consecutive data from 12 memory banks
  - Increments the address automatically for the next fetch
Figure: 1st fetch. Data memory (BANK0 to BANK11), data manipulator (fetch address, next fetch address, fetch registers, shifted data), PE array.
Transfer Table #0 (header on the slide: dst., src., mask; rows in slide order):
  0  1    col0
  1  1    col1
  0  N/A  col2
  2  1    col3
  3  1    col4
  0  N/A  col5
  ...

Data Manipulator of VPCMA
Figure: the same diagram at the 2nd fetch, with the fetch address advanced to the next fetch address.
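The fetch behaviour of the data manipulator described above can be approximated by the following Python sketch. It is a toy model under stated assumptions, not the hardware design: memory is treated as one flat list, a fetch reads a window of 12 consecutive words, and the transfer table maps each PE column to an index inside that window (or `None` for a masked column). The class name `DataManipulatorModel`, the `stride` parameter, and the example table are all hypothetical.

```python
from typing import List, Optional, Sequence

WINDOW = 12  # at most 12 consecutive data are loaded per fetch


class DataManipulatorModel:
    """Toy model of a fetch routed through a transfer table."""

    def __init__(self, memory: Sequence[int],
                 table: Sequence[Optional[int]], stride: int) -> None:
        for src in table:
            if src is not None and not 0 <= src < WINDOW:
                raise ValueError("table entries must be None or in 0..11")
        self.memory = memory
        self.table = list(table)  # table[col] = window index, or None if masked
        self.stride = stride      # assumed address increment per fetch
        self.fetch_addr = 0

    def fetch(self) -> List[Optional[int]]:
        """Read the current window, route it, then advance the address."""
        window = self.memory[self.fetch_addr:self.fetch_addr + WINDOW]
        out = [window[src] if src is not None and src < len(window) else None
               for src in self.table]
        self.fetch_addr += self.stride  # automatic increment for the next fetch
        return out


if __name__ == "__main__":
    mem = list(range(100, 148))
    table = [0, 1, None, 2, 3, None]  # hypothetical routing for six columns
    dm = DataManipulatorModel(mem, table, stride=4)
    print(dm.fetch())
    print(dm.fetch())
```

Because every source index must fall inside the 12-word window, two operands that sit farther apart in memory cannot be delivered by the same fetch in this model.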

Ultra Low Power Consumption of CMA
- Non-pipelined version of CMA [6]
  - Works with a lemon battery
  - Achieves 743 MOPS/mW (297 MOPS / 0.4 mW)
- VPCMA
  - Keeps the same energy efficiency
  - Achieves 4x higher peak performance
- Problem
  - Limited flexibility, a side effect of the aggressive energy saving
[6] M. Koichiro, et al. "A 297MOPS/0.4mW ultra low power coarse-grained reconfigurable accelerator CMA-SOTB-2." 2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig).

Limitation of data handling in VPCMA
- The data manipulator cannot simultaneously access data that are more than 12 steps apart
  → needs data rearrangement
  → often incurs extra copies of data
Figure: loop example and memory allocation in bank memory. Array a (A0, A1, ...) and array b (B0, B1, ...) are too far apart.

Limitation of data handling in VPCMA
Figure: loop example and memory allocation in bank memory after rearrangement. Copies of array b are placed next to the elements of array a:
  A0  B0  A1  B1
  A16 B0  A17 B1
  A32 B0  A33 B1

Other limitations of VPCMA
- Also, VPCMA
  1. Suffers from a lack of constant registers for the PE array
     - A PE row (12 PEs) shares two constant registers, or borrows from other rows via the interconnection
     → Degrades mappability of complex kernels
  2. Depends on a host processor for overall control
     - The micro-controller basically controls only data transfer and the loop counter
     - All other controls (e.g. reconfiguration) are carried out by the host processor, even if only a trivial change is needed

Proposed architecture
- A new architecture: VPCMA2
- Relaxes the aforementioned limitations
  1. Improved bank access by a new data manipulator
  2. Refined connectivity of constant registers
     - The PE array has 16 constant registers (same as VPCMA)
     - All PEs can use any of the 16 registers
  3. Introduced an extended data bus for the micro-controller

New Data Manipulator
- Offset values for each bank are introduced
- Relaxes the limitation of consecutive data access
Figure: 1st fetch. Fetch address 0x0, increment 4. An offset is added to the fetch address for each bank of the data memory; array a and array b are read in the same fetch and passed through the data manipulator (shifted data, fetch registers) to the PE array.

New Data Manipulator
Figure: 2nd fetch. Fetch address 0x4, increment 4. The same diagram with the per-bank offsets applied to the advanced fetch address (shifted array a, array b).
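The effect of per-bank offsets can be sketched as follows. This is an illustrative model only: it assumes each of the 12 banks is addressed independently, and that the row read from a bank is the shared fetch address plus that bank's offset. The function name `fetch_with_offsets`, the bank contents, the offset values, and the increment of 1 are hypothetical choices for the example.

```python
from typing import List, Sequence

NUM_BANKS = 12


def fetch_with_offsets(banks: Sequence[Sequence[int]], fetch_addr: int,
                       offsets: Sequence[int]) -> List[int]:
    """Read one word per bank at row fetch_addr + offsets[bank]."""
    if len(banks) != NUM_BANKS or len(offsets) != NUM_BANKS:
        raise ValueError("expected 12 banks and 12 offsets")
    return [banks[b][fetch_addr + offsets[b]] for b in range(NUM_BANKS)]


if __name__ == "__main__":
    # Hypothetical contents: the word in bank b, row r is 100 * b + r.
    banks = [[100 * b + row for row in range(16)] for b in range(NUM_BANKS)]
    # Hypothetical layout: banks 4..7 hold a second array starting at row 5.
    offsets = [0, 0, 0, 0, 5, 5, 5, 5, 0, 0, 0, 0]
    increment = 1  # free parameter of this sketch
    addr = 0
    for _ in range(2):
        print(fetch_with_offsets(banks, addr, offsets))
        addr += increment
```

With all offsets set to zero the model falls back to reading the same row from every bank, which corresponds to the consecutive-access restriction of the earlier design.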

Extended Data Bus
- The micro-controller can handle any data in other modules
Figure: block diagram. The external host processor connects through the external bus (22-bit address bus, 32-bit data bus) to the config. controller, inst. mem, DMAC, constant registers (16 registers), data mem, config. registers, PE array, and micro-controller. A general-purpose bus for the micro-controller (22-bit address bus, 32-bit data bus) connects the micro-controller to the other modules.

Evaluation Setup
- An implementation of VPCMA2
  - Using Renesas SOTB 65-nm technology
  - LSTP (Low STandby Power) version
  - Synthesized by Synopsys Design Compiler 2017
- A real chip of VPCMA [7]
  - Fabricated in the same technology
  - LP (Low Power) version (75% slower than LSTP)
Figure: chip photo of VPCMA [7] (3 mm x 6 mm; PE array and TCI labeled)
[7] T. Kojima, et al. "Real chip evaluation of a low power CGRA with optimized application mapping," 9th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies. ACM, 2018, p. 13.
