The SiLago Method: Next Generation VLSI Architectures and Design Automation


slide-1
SLIDE 1

The SiLago Method: Next Generation VLSI Architectures and Design Automation

Ahmed Hemani, Dept. of Electronics and Embedded Systems, School of ICT, KTH

Acknowledgement: Nasim Farahini, Muhammad Asad, Li Shuo, Hassan Sohofi, Muhammad Ali Shami, Adeel Tajammul, Omer Malik, Anders Lansner, Christer Svensson

slide-2
SLIDE 2

The Core Ideas behind the SiLago Method

Today: manufacturing cost 5 MUSD, engineering cost 45 MUSD
With the SiLago Method: manufacturing cost < 5 MUSD, engineering cost << 45 MUSD

The SiLago Method

  • 1. Higher abstraction of the physical design platform
  • 2. A structured grid based physical design scheme

slide-3
SLIDE 3

The Large Engineering and Manufacturing Cost

Software Centric, Accelerator Rich, Platform Based Design

  • Loss in silicon and computational efficiencies
  • Blocks application categories
  • Stifles innovation

slide-4
SLIDE 4

Generality vs. Customisation

Generality comes at a huge cost in silicon, computational and engineering efficiency. Custom solutions are orders of magnitude more efficient.

[Figure: Accelerator Rich, Software Centric, Platform Based SOC Design set against Hardware Centric Custom Design, with the SiLago Method marked]

slide-5
SLIDE 5

Energy Breakdown in a GPP

  • Data supply: 28%
  • Instruction supply: 42%
  • Clock & control: 24%
  • Arithmetic: 6%

Source: William J. Dally, James Balfour, David Black-Shaffer, James Chen, R. Curtis Harting, Vishal Parikh, Jongsoo Park, and David Sheffield, Stanford University, "Efficient Embedded Computing", IEEE Computer, July 2008
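The breakdown can be restated as a ratio of overhead to useful arithmetic. The short sketch below only re-adds the four percentages quoted on the slide; the variable names are illustrative.

```python
# Energy breakdown of a general purpose processor, percentages from the slide.
breakdown = {
    "data supply": 28,
    "instruction supply": 42,
    "clock & control": 24,
    "arithmetic": 6,
}

# Everything except arithmetic is overhead spent on feeding and steering the datapath.
overhead = sum(share for name, share in breakdown.items() if name != "arithmetic")
ratio = overhead / breakdown["arithmetic"]

print(f"overhead: {overhead}% of energy, {ratio:.1f}x the arithmetic energy")
```

Under these figures, roughly 94% of the energy goes to something other than arithmetic, which is the headroom that customization targets.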

slide-6
SLIDE 6

The Impact of Customization

[Figure: GFlops/W on a logarithmic axis for FFT 2048 and matrix-matrix multiplication, comparing CPU (Core i7), GPU (GTX255), FPGA (LX760), SiLago and ASIC]

slide-7
SLIDE 7

A Brief History of VLSI Design Automation

To explain why the path of customization has been abandoned

slide-8
SLIDE 8

The Mead Conway Era

[Figure: abstraction level vs. design space. Abstraction levels: Physical, Gates, RTL / μ-architecture, Algorithms, Application, System. Synthesis steps: Physical Synthesis, RTL/Logic Synthesis, High-level Synthesis, Application-level Synthesis, System-level Synthesis. The design space is labelled "Manual: Stick Diagram, Mead Conway, Silicon Compiler".]

The number of solutions increases exponentially with the abstraction gap.

slide-9
SLIDE 9

The Mead Conway Era survived as long as the complexity was of the order of O(10K gates)

slide-10
SLIDE 10

The Standard Cell Era

[Figure: abstraction level vs. design space. Abstraction levels: Physical, Standard-Cell, RTL / μ-architecture, Algorithms, Application, System. Synthesis steps: Physical Synthesis, RTL/Logic Synthesis, High-level Synthesis, Application-level Synthesis, System-level Synthesis. The design space is labelled "Manual" and "Automated"; standard cells are labelled "One time".]

The number of solutions increases exponentially with the abstraction gap.

slide-11
SLIDE 11

What Standard Cells Did

Physical Design Discipline

  • Standard pitch and row based layout
  • Enabled physical design automation

Abstraction

  • Boolean level abstraction
  • Hides circuit and physical design details
  • Enabled logic synthesis

Improves efficiency of

  • 1. Synthesis from RTL to GDSII
  • 2. Verification at RTL
  • 3. System Design

slide-12
SLIDE 12

Standard Cells as building blocks are not scalable for 10‐100 million gate designs

[Figure: a standard cell as building block, for designs of ~10-100 K gates and of ~10-100 million gates]

slide-13
SLIDE 13

An Analogy

slide-14
SLIDE 14

So what happens when you try to build skyscrapers with bricks?

slide-15
SLIDE 15

Commercial HLS achieves local optimisation

[Figure: modem application built from algorithms: ADC, FDEC, DEC, RRC, ↓2, EQ Filter, CR Comp, SLICER, Carrier Adaptation, EQ Adaptation, Clock Adaptation, System Control]

  • Global area, energy and latency constraints are specified for the application
  • The global constraints are manually partitioned to local partitions
  • The synthesis tool does the local optimisation

Local optimisation: min(L), where L is the number of algorithms in the application

slide-16
SLIDE 16

Commercial HLS: No synthesis of inter-algorithm interfaces in an Application

[Figure: modem application built from algorithms: ADC, FDEC, DEC, RRC, ↓2, EQ Filter, CR Comp, SLICER, Carrier Adaptation, EQ Adaptation, Clock Adaptation, System Control]

The user has to manually refine the interface between the synthesized algorithms. This manual refinement induces a functional verification step, because the correct-by-construction contract assured by machine translation is now violated.

slide-17
SLIDE 17

The 45 MUSD State of the Art SOC Design Flow

Functional Verification

Constraints Verification: Timing/Energy/Power/Area

Chip

Automatic: High‐level, RTL / Logic & Physical Synthesis

Logic: Algorithm + RTL + Boolean System: Multiple applications

Software Design

Architecture Definition in terms of pre‐designed IPs

Stitch Architecture: Buy and Assemble

System Architecting

  • 1. HW/SW Partitioning
  • 2. Interface Design
  • 3. Memory & Interconnect Hierarchy
  • 4. I/O Design

slide-18
SLIDE 18

Solution:

The SiLago Method

SiLago = Silicon Large Grain Object

Inspired by Lego

slide-19
SLIDE 19

We shifted to pre‐fabricated wall segments

slide-20
SLIDE 20

The First Proposition – Raise Abstraction to Arch level

SiLago Block (Register Files, DPUs, Switch boxes, Processors, SRAM banks etc.)

  • Standard cell: characterised Boolean operations
  • SiLago block: characterised micro-architectural operations, 4-5 orders of magnitude larger than a standard cell

SiLago blocks are NOT IPs, soft or hard.

slide-21
SLIDE 21

Solutions to VLSI Design Complexity:

  • 1. Abstraction
  • 2. Physical Design Discipline / Regularity

The VLSI community has largely forgotten the second component.

slide-22
SLIDE 22

London Manhattan

slide-23
SLIDE 23

A grid based structured layout scheme

[Figure: a traditional SOC floorplan with blocks numbered 1 to 9, next to a SiLago fabric based SOC with regions: Inner Modem, Outer Modem, Protocol Processing, Streaming Storage, Data Storage, Program Storage, System Ctrl, Flexilators, DRAM CTRL, Flash CTRL, Ethernet, PLL/CGU, PMC]

Physical Design Regularity is the sword that can slay the demons of VLSI Design Complexity

slide-24
SLIDE 24

The SiLago Method

Ahmed Hemani, Nasim Farahini, Syed M.A.H. Jafri, Hassan Sohofi, Shuo Li and Kolin Paul, "The SiLago Solution: Architecture and Design Methods for a Heterogeneous Dark Silicon Aware Coarse Grain Reconfigurable Fabric", Chapter 3 in the book "The Dark Side of Silicon", Springer, DOI 10.1007/978-3-319-31596-6

slide-25
SLIDE 25

The SiLago Concepts

A Virtual GRID

  • All SiLago design objects are aligned with grid lines and occupy multiples of contiguous grid cells
  • The grid has no predetermined size; it is as big as the synthesis tool or the designer decides

[Figure: example grid with regions: Inner Modem, Outer Modem, Protocol Processing, Streaming Storage, Data Storage, Program Storage, System Ctrl, Flexilators, DRAM CTRL, Flash CTRL, Ethernet, PLL/CGU, PMC]

REGIONS

  • A grid is divided into regions
  • Each region is specialized in a type of functionality
  • Some regions are infrastructural while others are functional
  • Regions are separated by corridors that accommodate the NOCs connecting the regions
  • Each region has its own internal interconnect scheme

SiLago Blocks

  • Each region is occupied by SiLago blocks that are region specific
  • These SiLago blocks occupy one or more contiguous grid cells
  • SiLago blocks are hardened and characterized with post layout data
  • SiLago blocks absorb global nets, including the power grids and the clock grid, and connect to the neighbouring SiLago blocks by abutment
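The grid rules above can be pictured with a small data structure. The sketch below is only an illustration, not the SiLago tool chain: the `Grid` class, its method names and the block names are hypothetical, and footprints are assumed to be rectangles for simplicity.

```python
class Grid:
    """Virtual grid with no fixed size: it grows with whatever is placed on it."""

    def __init__(self):
        self.cells = {}  # (col, row) -> name of the block occupying that cell

    def place(self, name, col, row, width=1, height=1):
        """Place a block on a contiguous rectangle of grid cells (assumed shape)."""
        footprint = [(c, r)
                     for c in range(col, col + width)
                     for r in range(row, row + height)]
        for cell in footprint:
            if cell in self.cells:
                raise ValueError(f"{name} overlaps {self.cells[cell]} at {cell}")
        for cell in footprint:
            self.cells[cell] = name

    def abutting(self, name):
        """Return the names of blocks that share a cell edge with `name`."""
        neighbours = set()
        for (c, r), owner in self.cells.items():
            if owner != name:
                continue
            for cell in ((c + 1, r), (c - 1, r), (c, r + 1), (c, r - 1)):
                other = self.cells.get(cell)
                if other is not None and other != name:
                    neighbours.add(other)
        return neighbours


grid = Grid()
grid.place("dpu", 0, 0)
grid.place("regfile", 1, 0, width=2)   # occupies two contiguous cells
grid.place("sram", 0, 2)               # one empty row away from the others
print(grid.abutting("dpu"))
```

Because every block is aligned to cell boundaries, finding which blocks connect by abutment reduces to a lookup of edge-adjacent cells.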

slide-26
SLIDE 26

The SiLago Concepts

[Figure: SiLago design instance with regions: Inner Modem, Outer Modem, Protocol Processing, Streaming Storage, Data Storage, Program Storage, System Ctrl, Flexilators, DRAM CTRL, Flash CTRL, Ethernet, PLL/CGU, PMC, and NOCs marked]

This is a SiLago design instance. It is automatically generated by the SiLago synthesis tool chain. The number, size and position of regions vary from one instance to another.

slide-27
SLIDE 27

SiLago Interconnects are also hardened

The SiLago interconnects are not just logical, i.e. soft, interconnects. They are physical and electrical objects, hardened in a templatized or parametric manner.

slide-28
SLIDE 28

SiLago fabrics are composed by abutment

  • 1. Each SiLago block absorbs

a) Clock tree and power ring
b) Regional and global interconnect
c) Pins on the periphery at the right positions

  • 2. Fabric composition by abutment

[Figure: Block 1 and Block 2 composed by abutment]

slide-29
SLIDE 29

SiLago Platform Cost Metrics are Space Invariant

[Figure: SiLago blocks with power rings and power stripes]

  • 1. The 16 global wires in each cell vary by about 70% from cell to cell
  • 2. This variation is proof that, even in a hierarchical design, the cost metrics would vary

  • 1. The SiLago physical design discipline ensures that all wires are of exactly the same length

slide-30
SLIDE 30

Clocking & STA

  • Clock

– Three levels of clocking: local, regional and global

– Local: each SiLago block is hardened to be timing clean and is synthesized with a certain margin for skew and latency. The local clock is synthesized using a standard EDA flow.

– Regional: each region is a synchronous region. The regional clock is manually synthesized with sufficient buffers to maintain good edges, and its delays are balanced to keep the skew and latency within the margins of the local clock.

– Global: regions communicate with each other on a latency insensitive basis using a previously developed GRLS scheme. For more details see http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6507330&tag=1

  • STA – Static Timing Analysis

– ILMs are created for each SiLago block
– Once the regional clocks are synthesized and inserted back into the database, a hierarchical STA script is run to ensure that the entire design is timing clean
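The regional clocking rule amounts to a simple check: the arrival times of the regional clock at the blocks of a region must stay inside the skew and latency margins the blocks were hardened with. The sketch below illustrates that check only; the function name, the block names and all picosecond values are invented for illustration and are not taken from the SiLago flow.

```python
def clock_within_margin(arrival_ps, skew_margin_ps, latency_margin_ps):
    """Check regional clock arrival times (ps) at each block's clock pin."""
    latency = max(arrival_ps.values())          # worst-case insertion delay
    skew = latency - min(arrival_ps.values())   # spread between earliest and latest
    return {
        "skew_ps": skew,
        "latency_ps": latency,
        "ok": skew <= skew_margin_ps and latency <= latency_margin_ps,
    }


# Hypothetical arrival times for three blocks in one region.
arrivals = {"dpu0": 410, "dpu1": 425, "rf0": 418}
report = clock_within_margin(arrivals, skew_margin_ps=20, latency_margin_ps=450)
print(report)
```

If such a check holds for every region, the blocks' local timing closure is not disturbed by the regional clock, which is what allows the hierarchical STA to work with per-block models.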

slide-31
SLIDE 31

Characterization

  • 1. Each SiLago block is hardened
  • 2. Sufficiently exhaustive simulation is performed for molecules of SiLago blocks at gate level with post layout data back annotated

– The SiLago blocks cannot be too large and complex
– The same pipeline cannot be used for multiple operations

  • 3. Concurrent operations within a SiLago block and in neighbouring SiLago blocks couple weakly, and we model this coupling
  • 4. The NOCs are parametrically hardened

slide-32
SLIDE 32

The SiLago Proof of Concept

How are the μ-architectural design decisions made?

slide-33
SLIDE 33

Target Application Domain: Modems & Codecs

[Figure: modem block diagram (ADC, FDEC, DEC, RRC, ↓2, EQ Filter, CR Comp, SLICER, Carrier Adaptation, EQ Adaptation, Clock Adaptation, System Control) grouped into Streaming Functions, Adaptive Functions and System Control Functions]

  • 1. Streaming DSP functions
  • 2. Nearest neighbour connectivity
  • 3. Rich in address generation functionality

  • 1. Adaptive functions
  • 2. Spatial locality but not nearest neighbour
  • 3. Control intensive and non-deterministic

Outer Modem: bit level operations absorbed in AGUs

slide-34
SLIDE 34

Proof of Concept SiLago Platform

[Figure: proof of concept platform with regions: DRRA (Streaming DSP), DiMArch (Streaming Data Storage), Adaptive Functions, Flexilators, System Control, Data Storage, Program Storage, Memory Control, Power Mngmt, PLL + CGU, Sensors, RF/Analog]

slide-35
SLIDE 35

DRRA – Computational Fabric

Dynamically Reconfigurable Resource Array

[Figure: DRRA cells, labelled DPU, Register File, Sequencer]

DPU and register file outputs reach 3 columns to the left and to the right, and this 3 column window slides.

This is only a fragment. 22 nm, 100 mm2: 10 000 DRRA cells

slide-36
SLIDE 36

Distributed Memory Fabric – DiMArch

[Figure labels: Streaming Register Files, ALU, Sequencer, Interconnect fabric, Memory banks, Instruction NOC (packet switched), Data NOC (circuit switched)]

slide-37
SLIDE 37

Private Execution Partitions

  • Memory banks can be clustered to serve as one large bank
  • Programmed to stream data
  • Can be connected to clusters in the computational fabric

  • Time division to space division multiplexing of resources
  • Fine grain power management
  • Composable and predictable systems

Parallelism in computation is matched with parallelism in access to scratchpad memory

slide-38
SLIDE 38

38

  • 1. M.A. Shami, A. Hemani, Address generation scheme for a coarse grain reconfigurable architecture,

in 2011 IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP) (2011), pp. 17–24

  • 2. N. Farahini, A. Hemani, K. Paul, Distributed runtime computation of constraints for multiple inner

loops, in 2013 Euromicro Conference on Digital System Design (DSD) (2013)

  • 3. M.A. Shami, A. Hemani, Classification of massively parallel computer architectures, in 2012IEEE

26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW) (2012), pp. 344–351

  • 4. N. Farahini, A. Hemani, Atomic stream computation unit based on micro-thread level parallelism, in

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP) (2015), pp. 25–29

  • 5. N. Farahini, A. Hemani, H. Sohofi, S.M.A.H. Jafri, M.A. Tajammul, K. Paul, Parallel distributed

scalable runtime address generation scheme for a coarse grain reconfigurable computation and storage fabric. Microprocess. Microsyst. 38, 788–802 (2014)

  • 6. M.A. Shami, A. Hemani, Morphable DPU: smart and efficient data path for signal processing

applications, in IEEE Workshop on Signal Processing Systems, 2009 (SiPS 2009) (2009), pp. 167– 172

  • 7. M.A. Shami, A. Hemani, Control scheme for a CGRA, in 2010 22nd International Symposium on

Computer Architecture and High Performance Computing (SBAC-PAD) (2010), pp. 17–24

  • 8. M.A. Shami, A. Hemani, An improved self-reconfigurable interconnection scheme for a coarse

grain reconfigurable architecture, in NORCHIP, 2010 (2010), pp. 1–6

  • 9. M. A. Shami, A. Hemani, Partially reconfigurable interconnection network for dynamically

reprogrammable resource array, in IEEE 8th International Conference on ASIC, 2009. ASICON’09 (2009), pp. 122–125 10.M.A. Tajammul, M.A. Shami, A. Hemani, S. Moorthi, NoC based distributed partitionable memory system for a coarse grain reconfigurable architecture, in 2011 24th International Conference on VLSI Design (VLSI Design) (2011), pp. 232–237

slide-39
SLIDE 39

[Figure: DFG fragments built from +, * and - operators on operands x0, x1, W1 and ci, xi, xn-i, mapped onto Datapath, Register File, Sequencer and Switchbox resources]

  • Datapaths can be clustered to create arbitrary DFGs
  • Register files/SRAMs can also be clustered to create larger and/or more parallel data storage
  • Sequencers can be organized to create hierarchical FSMs

Clustering SiLago blocks is like clustering standard cells, or LUTs in FPGAs

Variations in function, capacity and parallelism are created by clustering micro-architectural operations in SiLago blocks

slide-40
SLIDE 40

SiLago Design Flow

[Figure labels: Application Model (Simulink, L algorithms); Sampling Rate, Total Latency; FSMD Library (M FSMDs); SiLago Platform; Number and types of SiLago blocks + Mapping; Compose GDSII Macro; GDSII Macro; Reports; One Time Engineering Effort]

  • 1. Select optimal solution from M^L solutions
  • 2. Global interconnect, buffers and control
  • 3. Floorplanning

slide-41
SLIDE 41

The Mead Conway Era

[Figure: abstraction level vs. design space. Abstraction levels: Physical, Gates, RTL / μ-architecture, Algorithms, Application, System. Synthesis steps: Physical Synthesis, RTL/Logic Synthesis, High-level Synthesis, Application-level Synthesis, System-level Synthesis. The design space is labelled "Manual: Stick Diagram, Mead Conway, Silicon Compiler".]

The number of solutions increases exponentially with the abstraction gap.

slide-42
SLIDE 42

The Standard Cell Era

[Figure: abstraction level vs. design space. Abstraction levels: Physical, Standard-Cell, RTL / μ-architecture, Algorithms, Application, System. Synthesis steps: Physical Synthesis, RTL/Logic Synthesis, High-level Synthesis, Application-level Synthesis, System-level Synthesis. The design space is labelled "Manual" and "Automated"; standard cells are labelled "One time".]

The number of solutions increases exponentially with the abstraction gap.

slide-43
SLIDE 43

The SiLago Era

[Figure: abstraction level vs. design space. Abstraction levels: Physical, Standard-Cell, RTL / μ-architecture, Algorithms, Application, System. Synthesis steps: Physical Synthesis, RTL/Logic Synthesis, High-level Synthesis, Application-level Synthesis, System-level Synthesis. The design space is labelled "Automatic"; two parts of the flow are labelled "One time".]

The number of solutions increases exponentially with the abstraction gap.

slide-44
SLIDE 44

SiLago achieves Global Optimisation

[Figure: modem application built from algorithms: ADC, FDEC, DEC, RRC, ↓2, EQ Filter, CR Comp, SLICER, Carrier Adaptation, EQ Adaptation, Clock Adaptation, System Control]

L: number of algorithms in the application
M: number of ways of implementing each algorithm

Global area, energy and latency constraints are specified for the application.

SiLago: global optimisation, min(M^L)
Commercial HLS: local optimisation, min(L)
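With M ways of implementing each of L algorithms, a global search considers M**L combinations and checks the application-wide constraints on each one. The sketch below illustrates that idea with a plain exhaustive search; it is not the SiLago synthesis algorithm, and the algorithm names and the (area, energy, latency) numbers are invented.

```python
from itertools import product

# Hypothetical implementation options per algorithm: (area, energy, latency).
OPTIONS = {
    "filter":    [(4, 9, 2), (8, 5, 1)],
    "equalizer": [(3, 7, 3), (6, 4, 2)],
    "slicer":    [(1, 2, 2), (2, 1, 1)],
}


def count_solutions(options):
    """Size of the global design space: the product of options per algorithm."""
    n = 1
    for impls in options.values():
        n *= len(impls)
    return n


def global_search(options, max_area, max_latency):
    """Try every combination; return (energy, choice) with least energy, or None."""
    best = None
    for choice in product(*options.values()):
        area, energy, latency = (sum(v) for v in zip(*choice))
        if area <= max_area and latency <= max_latency:
            if best is None or energy < best[0]:
                best = (energy, choice)
    return best


print(count_solutions(OPTIONS))                      # M**L with M = 2, L = 3
print(global_search(OPTIONS, max_area=12, max_latency=6))
```

A local flow would instead fix a budget per algorithm up front and pick one option for each algorithm independently, so it never sees trade-offs between algorithms that the global search can exploit.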

slide-45
SLIDE 45

SiLago also automates Interface Synthesis

[Figure: SiLago application level synthesis applied to the modem application: ADC, FDEC, DEC, RRC, ↓2, EQ Filter, CR Comp, SLICER, Carrier Adaptation, EQ Adaptation, Clock Adaptation, System Control]

The interfaces are automatically synthesized depending on the chosen degree of parallelism of the algorithms. Machine translation ensures a correct-by-construction guarantee.

slide-46
SLIDE 46

What does the SiLago Method promise to achieve?

Functional Verification

Constraints Verification: Timing/Energy/Power/Area

Chip

Automatic: High‐level, RTL / Logic & Physical Synthesis

Logic: Algorithm + RTL + Boolean System: Multiple applications

Software Design

Architecture Definition in terms of pre‐designed IPs

Stitch Architecture: Buy and Assemble

System Architecting

  • 1. HW/SW Partitioning
  • 2. Interface Design
  • 3. Memory & Interconnect Hierarchy
  • 4. I/O Design

slide-47
SLIDE 47

Experimental Proof that the proposed Solution Works

slide-48
SLIDE 48

SiLago FSMD Library Development Efficiency

[Figure, two panels, each comparing SiLago with standard cell based synthesis. Synthesis runtime in seconds, split into physical, logic and high-level synthesis: annotated "100-1000X Better". Energy estimation error in percent (axis 100% to 300%): annotated "1.69 %".]

slide-49
SLIDE 49

And what do we pay for it?

[Figure, two panels: area overhead and energy overhead of SiLago compared with standard cell based synthesis, each on an axis from 0.2 to 1.2]

slide-50
SLIDE 50

Normalized Energy and Area overhead of the Systems generated by the SiLago Design Flow

[Figure: normalized area and energy, axis from 0.2 to 2.0]

slide-51
SLIDE 51

SiLago provides significant improvement in Predictability

[Figure: energy estimation error (%) on a logarithmic axis from 10^0 to 10^3, std. cell based flow vs. SiLago flow. Values annotated in the plot: 369 %, 223 %, 282 %, 8.3 %, 5.9 %, 8.1 %, 5.5 %, 185 %, 46 %, 73 %, 6.2 %, 3.9 %]

slide-52
SLIDE 52

Design Space Exploration in SiLago Application‐level Synthesis

[Figure: for JPEG Encoder, WLAN Tx and LTE Uplink, the number of solutions evaluated by SiLago SLS (axis from 10 to 90) and the time required for DSE in seconds (axis from 25 to 225)]

slide-53
SLIDE 53

SiLago Design Space Exploration

[Figure: design space plot over sample interval and number of FSMDs; the axis scales (×10^6, ×10^7) are not recoverable from the extraction]

slide-54
SLIDE 54

Application of SiLago to Neuromorphic Computing

slide-55
SLIDE 55

The Evolution of Embedded Systems

Interaction between machine and environment

Static and Finite Dynamic and Infinite

slide-56
SLIDE 56

Neuromorphic Machines are the answer

Reference: http://www.21stcentech.com/heard‐synapse/

slide-57
SLIDE 57

Implementing Brain in Electronics is non‐trivial

The brain: 20 Watts

Riken, the world's most efficient supercomputer: 7 GFlops/Watt
Abstract model of cortex, BCPNN → 1 PetaFlops
BCPNN → 140 kW
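The 140 kW figure follows from dividing the required compute by the supercomputer's efficiency. A minimal check using only the two numbers quoted on the slide:

```python
PETA = 1e15
GIGA = 1e9

required_flops = 1 * PETA        # BCPNN model of cortex, from the slide
flops_per_watt = 7 * GIGA        # quoted supercomputer efficiency, from the slide

power_watts = required_flops / flops_per_watt
brain_watts = 20                 # biological reference point, from the slide

print(f"{power_watts / 1e3:.0f} kW, about {power_watts / brain_watts:.0f}x the brain's 20 W")
```

This gives roughly 143 kW for computation alone, which the slide rounds to 140 kW before adding the realistic system-level figure.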

Realistically 1 Mega Watts

slide-58
SLIDE 58

What can eBrain achieve ?

2 Kilo Watts 20 Watts ~30 MilliWatts 2 Watts

eBrain@KTH

~1 Mega Watts ~1000 Watts

The most efficient Supercomputer

slide-59
SLIDE 59

Functional Requirements

BCPNN Requirements

  • 1. Real-time simulation
  • 2. 2 million HCUs
  • 3. 1 PetaFlops: BCPNN computation
  • 4. 40 TB: HCU state storage
  • 5. 130 TB/s: bandwidth
  • 6. 20 billion spikes/s
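Two of these totals can be cross-checked against the per-HCU figures given with the BCPNN computation model (20 MB of state and 10 000 input spikes/s per HCU). The sketch below assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes); the per-HCU compute and bandwidth shares are simple divisions, not numbers stated in the deck.

```python
HCUS = 2_000_000
STATE_BYTES_PER_HCU = 20 * 10**6      # 20 MB HCU state memory (decimal MB assumed)
INPUT_SPIKES_PER_HCU = 10_000         # input spikes per second per HCU

state_tb = HCUS * STATE_BYTES_PER_HCU / 10**12
spikes_per_s = HCUS * INPUT_SPIKES_PER_HCU

# Derived per-HCU shares of the system-level requirements.
flops_per_hcu = 10**15 / HCUS                 # share of 1 PetaFlops
bandwidth_per_hcu = 130 * 10**12 / HCUS       # share of 130 TB/s, in bytes/s

print(state_tb, spikes_per_s, flops_per_hcu, bandwidth_per_hcu)
```

Under these assumptions the storage and spike-rate requirements are exactly the per-HCU figures multiplied by 2 million HCUs.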

Infrastructural Requirements

slide-60
SLIDE 60

100 MCUs 10 000 Connections HCU State Memory (20 MB) MCU State Vector MCU Row

HCU =  MCUs

The BCPNN Computation Model

Input Spike Computation 10 000 Spikes/s 100 × 100 Spikes/s Support Computation 100 / s Output Spike Computation

Delay Buffer

slide-61
SLIDE 61

The eBrain System Concept

(a) eBrain: multi-chip fabric of BCUs (BCU 1, BCU 2, ...) connected by the inter-BCU Spike Propagation Network (SPN). A system controller boots, initializes and saves/restores the HCU state.

(b) BCU: the Brain Computation Unit is a regular fabric of 1000s of H-Tiles, with an intra-BCU spike propagation interconnect and clock, reset, boot/configuration and power management.

L: number of BCU chips
M: number of H-Tiles in each BCU
N: number of HCUs in each H-Tile

L × M × N = 2 million HCUs

slide-62
SLIDE 62

BCU Logic Chip - Organisation

  • BCU Controller
  • Network Interface Switch for the inter-BCU Spike Propagation Network
  • Incoming Spike Dispatcher
  • Outgoing Spike Dispatcher
  • iSDIN – Incoming Spike Distribution Interconnect
  • oSDIN – Outgoing Spike Distribution Interconnect
  • H-Tile 1, H-Tile 2, ..., H-Tile M
  • HMWI: H-Tiles to HCU-State Memory Write Interconnect
  • HMRI: HCU-State Memory to H-Tiles Read Interconnect

slide-63
SLIDE 63

H‐Tile Organisation

  • Incoming Spikes Queue & Controller
  • iSDIN: incoming Spike Distribution Interconnect
  • Input Computation Controller
  • Output Computation Controller
  • Outgoing Spikes Queue & Controller
  • oSDIN: outgoing Spike Distribution Interconnect
  • Delay Buffers & Controller for fanout spikes
  • Input computation: Scratchpad Memories, Input Computation FSM, Input Computation Unit (R1 SP FPUs)
  • Output computation: Scratchpad Memories, Output Computation FSM, Output Computation Unit (R2 SP FPUs)
  • HCU State Storage Memory Interface
  • ms Timer

Legend: 1 Petaflops; Infrastructural Operations

slide-64
SLIDE 64

The SiLago Method

A Structured Physical Design Scheme to enable System‐level synthesis

[Figure: array of nine H-Tiles with NOC corridors. H-Tile contents: TSVs + Controller, FPUs, SRAMs, H-Tile Controller, NOC Interface, Queues + Controller]

slide-65
SLIDE 65

The SiLago Method

A Structured Physical Design Scheme to enable System‐level synthesis

[Figure: BCU floorplan with BCU Ctrl, BCU SRAM, PMC, PLL, CGU, BCU NI+SW and NOCs]

slide-66
SLIDE 66

The Basis for dimensioning

Technology

  • 22 nm node
  • 3D integrated custom DRAM
  • 16 × 82 mm2 die integrated on an interposer

Mouse

  • 31 250 HCUs
  • 71 MCUs and 1225 connections

Results

  • Post layout data for logic: 40 nm results conservatively scaled to the 22 nm node
  • Qualified circuit level models of 3D DRAM from TU Kaiserslautern

slide-67
SLIDE 67

Mouse eBrain Package Level Organisation

[Figure: package with interposers carrying computation + infrastructure dies]

  • Interposer based package level integration
  • 16 × 82 mm2 chip with 8 layers of 3D DRAM and 32 channels per chip
  • 32 H-Tiles of 2.52 mm2
  • 16 HCUs / H-Tile

slide-68
SLIDE 68

Organisation and dimensions of H-Tile

[Figure: H-Tile DRAM stack, Layer 0 to Layer 7, with columns of banks (Bank 0: 64 Mb, Bank 1: 64 Mb), TSV areas and RIBs. Dimensions shown: 1200 µm, 584 µm, 200 µm, 500 µm]

4 HCUs per bank, 2*8 banks per layer → 64 HCUs per H-Tile

slide-69
SLIDE 69

Energy Consumption

  • Computation: 4.032 Joules
  • Infrastructure: 1.814 Joules
  • DRAM: 6.912 Joules

+

9.878 Joules

Sparse activity, temporal locality, low resolution: ~2 Joules

slide-70
SLIDE 70

The SiLago Method also has the potential to lower the Manufacturing cost

slide-71
SLIDE 71

SiLago can reduce the mask development cost

[Figure: SiLago design instance with regions: Inner Modem, Outer Modem, Protocol Processing, Streaming Storage, Data Storage, Program Storage, System Ctrl, Flexilators, DRAM CTRL, Flash CTRL, Ethernet, PLL/CGU, PMC]

  • All SiLago designs are composed of a finite number of SiLago block types
  • Each SiLago block can only have a finite number of neighbour types
  • Each SiLago block's mask, which depends on its neighbour types, can be saved as a component mask
  • The entire design mask can be composed from such component masks
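The component-mask idea can be illustrated by keying every grid cell on its block type and the types of its four neighbours: cells with the same key could share one stored mask. The sketch below is a toy model under that assumption; the function name, the 2 × 4 layout and the block type names are hypothetical and the real mask flow is not described on the slide.

```python
def component_keys(layout):
    """Return one key per grid cell: (block type, neighbour types N, E, S, W)."""
    rows, cols = len(layout), len(layout[0])

    def at(r, c):
        # Cells outside the fabric have no neighbour type.
        return layout[r][c] if 0 <= r < rows and 0 <= c < cols else None

    keys = []
    for r in range(rows):
        for c in range(cols):
            neighbours = (at(r - 1, c), at(r, c + 1), at(r + 1, c), at(r, c - 1))
            keys.append((layout[r][c], neighbours))
    return keys


# Hypothetical fabric fragment: a row of DPUs above a row of register files.
layout = [
    ["DPU", "DPU", "DPU", "DPU"],
    ["RF",  "RF",  "RF",  "RF"],
]

keys = component_keys(layout)
print(len(keys), "cells,", len(set(keys)), "distinct component masks")
```

In this fragment the interior cells of each row share a key, so the number of distinct component masks stays bounded as a regular fabric grows wider while the number of cells keeps increasing.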

slide-72
SLIDE 72

Future & Ongoing Work

  • SiLago regions are being expanded to cover the 13 dwarfs of the Berkeley report on parallel computing
  • Extending application level synthesis to system level synthesis
  • Ability to deal with non-determinism
  • Using the SiLago Method to design

1. Complex radio systems (project with Catena)
2. Custom supercomputer for brain simulation and bioinformatics
3. Resilient autonomous systems based on neural networks

  • Extending SiLago to 3D SiLago to achieve end-to-end parallelism

slide-73
SLIDE 73

Thanks for your attention! Questions?