SLIDE 1

CUstom Built HEterogeneous Multi-Core ArCHitectures (CUBEMACH):

Breaking the Conventions

Nagarajan Venkateswaran
Director, Waran Research Foundation

Karthikeyan Palavedu Saravanan - Nachiappan Chidambaram Nachiappan
Research Trainees (2008 - 2010), Waran Research Foundation

Aravind Vasudevan - Balaji Subramaniam - Ravindhiran Mukundarajan
Former Research Trainees (2007 - 2009), Waran Research Foundation


Chennai, India

SLIDE 2

Motivation : Heterogeneity Redefined

  • Cost Effective High Performance Custom Built Heterogeneous Multi-Core Node Design for a wider class of applications

– Inter and Intra core heterogeneity

  • Breaking the Conventions

– Multiple User Multiple Application without Space-Time Sharing in a Cluster : Cost sharing across users
– Single User Multiple Application without Space-Time Sharing (non-multiprogramming) : Cost sharing across applications

SLIDE 3

Overview

  • Custom Built Heterogeneous Multi-Core Architectures (CUBEMACH)

  • Design Space

– Architectural Space
– Optimization Space
– Customer Vendor Interaction
– Simulation Space

  • CUBEMACH Design and Simulation Tool Framework

  • Conclusion

SLIDE 4

Custom Built Heterogeneous Multi-Core Architectures (CUBEMACH)

  • CUBEMACH promises

– Increased Resource Utilization
– Multiple Application Flavored Architectures
– Elimination of Space Time Sharing at the Quantum Level during Multiple Application Execution
– Manufacturing and Operational Cost reduction

SLIDE 5

Overview

  • Custom Built Heterogeneous Multi-Core Architectures (CUBEMACH)

  • Design Space

– Architectural Space
– Optimization Space
– Customer Vendor Interaction
– Simulation Space

  • CUBEMACH Design and Simulation Tool Framework

  • Conclusion

SLIDE 6

CUBEMACH Design Paradigm

SLIDE 7

Architectural Design Space - CUBEMACH

[Figure: CUBEMACH Architectural Space, with blocks labelled ONNET, Compiler-On-Silicon (SCOS, PCOS), ALFU, Memory (SRAM, DRAM) and ALISA]

SLIDE 8

Architectural Space

  • Why ALU? Why Not ALFU?

– Hardwired units
– Design : Homogeneously Structured
– Reduced Instruction Generation & Fetches : Employ a Higher Level ISA
– Reduced memory-functional unit interaction
– Helps execute multiple applications without space & time sharing
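
Illustrative sketch (not from the slides): one way to read the "Reduced Instruction Generation & Fetches" point is to count the scalar multiply and add instructions a conventional ALU needs for a naive n x n matrix multiply, and compare that with a single algorithm-level instruction. The function names are hypothetical. The assumption that one instruction triggers a whole Matmul ALFU regardless of n is ours, and loads, stores and branches are ignored.

```python
def scalar_alu_instruction_count(n: int) -> int:
    """Multiply and add instructions for a naive n x n matrix multiply on a scalar ALU."""
    multiplies = n ** 3            # one multiply per (i, j, k) triple
    additions = n ** 2 * (n - 1)   # n - 1 accumulations per output element
    return multiplies + additions


def alfu_instruction_count(n: int) -> int:
    """Assumption: one algorithm-level instruction triggers the hardwired Matmul unit."""
    return 1


if __name__ == "__main__":
    for n in (4, 16, 64):
        print(n, scalar_alu_instruction_count(n), alfu_instruction_count(n))
```

Under these assumptions the instruction count on the ALU side grows roughly as 2n^3, while the ALFU side stays constant; the real ratio depends on how a given ALFU is sized.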

SLIDE 9

ALFU : Algorithm Level Functional Unit

[Figure: ALFU design parameters. Labels: Control Unit, Algorithm Size, Memory In Processor, HLFU Characteristics, ALFU Requirements, Delay, ALFU Types, Class of Algorithms, Type of MIP Cell, Number of MIP Cells, HLFU Control, Centralized / Decentralized, Grain Size, Architecture, Class of Units, Input Bits, Scalar]

SLIDE 10

ALU vs ALFU Instruction Generation Results

SLIDE 11

Sample Algorithm Level Functional Units

  • Matrix Centric Units

– Matmul
– Matadd
– Chain Matadd

  • Scalar Units

– Scalar Adder / Subtractor
– Scalar Multiplier
– Scalar Divider
– Comparator
– Sorter
– Multiple Operand Adder
– Min / Max Finder

  • Vector Units

– Inner Product

  • Graph Theoretic Units

– Graph Traversal Unit : BFS, DFS
– KL Graph Partitioning

Architectural Space Contd…

SLIDE 12

ALISA – Algorithm Level Instruction Set Architecture

  • Algorithm Level Instructions
  • Triggers ALFUs
  • ALISA Multiple VLIWs
  • ALISA for heterogeneous multi-cores

Architectural Space Contd…

SLIDE 13

Hierarchical Compilation Scheme

  • PCOS Partitions A Problem Into Sub-Problems – Level 1
  • SCOS Partitions The Sub-Problems Into ALFU Level Instructions – Level 2

[Figure: Application → PCOS → Sub-Application → SCOS → Instruction]
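
Illustrative sketch (not from the slides): the two-level scheme above can be pictured in software as two successive partitioning passes. The slides do not say how PCOS and SCOS partition in hardware, so the data layout, the function names `pcos_partition` and `scos_partition`, and the one-instruction-per-sub-problem rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlfuInstruction:
    unit: str        # hypothetical ALFU name, e.g. "matmul"
    operands: tuple  # symbolic operand names


def pcos_partition(application: dict) -> list:
    """Level 1 (assumed): split an application into one sub-problem per listed kernel."""
    return [{"kernel": kernel, "operands": operands}
            for kernel, operands in application["kernels"]]


def scos_partition(sub_problem: dict) -> list:
    """Level 2 (assumed): lower one sub-problem to ALFU-level instructions."""
    return [AlfuInstruction(unit=sub_problem["kernel"],
                            operands=tuple(sub_problem["operands"]))]


def compile_application(application: dict) -> list:
    """Run level 1, then level 2 on every sub-problem, preserving order."""
    instructions = []
    for sub_problem in pcos_partition(application):
        instructions.extend(scos_partition(sub_problem))
    return instructions


app = {"name": "demo", "kernels": [("matmul", ["A", "B"]), ("bfs", ["G", "root"])]}
program = compile_application(app)
```

Here `program` holds one instruction per kernel; a realistic SCOS would presumably emit several instructions per sub-problem.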

Architectural Space Contd…

SLIDE 14

ALISA & Compiler On Silicon

[Figure: design parameters of the Compiler-On-Silicon and ALISA blocks]

Compiler-On-Silicon (SCOS, PCOS)

– No. of parallel Units
– Rate of Output Generation
– Scheduler Processing rate
– No. of Ports
– BISA Length
– No. of I/O Ports
– Scheduler O/P Generation rate

ALISA

– Number of Instructions Per ALISA
– Types of Instructions Per ALISA
– Types of Instructions in ISA
– No of ALISA Fields
– Decoding / Encoding Logic
– Field Length

SLIDE 15

ON-Node-Network Architecture

[Figure: 2D Torus network. Labels: Sub-Local Router, Local Router, Global Router, ALFU Population, Core]

Architectural Space Contd…

SLIDE 16

ON-Node-Network Architecture

H-Tree Topology

[Figure: H-Tree network. Labels: Sub-Local Router, Local Router, Global Router]

Architectural Space Contd…

SLIDE 17

Comparison of Conventional NOCs with ONNET

                    | ONNET                                  | Conventional NOCs
Type of Switch      | MIN                                    | Crossbar
Number of Routers   | N * log2(N)                            | N^2
Hierarchy           | Yes                                    | No
Switching Latency   | log2(Number of Inputs) * Switch Delay  | Number of Inputs * Switch Delay
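
Illustrative sketch (not from the slides): the router-count and switching-latency expressions in the comparison can be evaluated for a sample size. The function names are hypothetical, "N2" is read as N squared, and the switch delay is an arbitrary unit value.

```python
import math


def onnet_router_count(n: int) -> float:
    """ONNET (MIN based): N * log2(N)."""
    return n * math.log2(n)


def crossbar_router_count(n: int) -> int:
    """Conventional crossbar NOC: N squared (our reading of 'N2')."""
    return n ** 2


def onnet_switching_latency(inputs: int, switch_delay: float) -> float:
    """ONNET: log2(number of inputs) * switch delay."""
    return math.log2(inputs) * switch_delay


def crossbar_switching_latency(inputs: int, switch_delay: float) -> float:
    """Conventional crossbar NOC: number of inputs * switch delay."""
    return inputs * switch_delay


if __name__ == "__main__":
    n = 16
    print(onnet_router_count(n), crossbar_router_count(n))
    print(onnet_switching_latency(n, 1.0), crossbar_switching_latency(n, 1.0))
```

For N = 16 and a unit switch delay, these expressions give 64 versus 256 routers and a latency of 4 versus 16 delay units.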

SLIDE 18

ONNET : On Node Network Architecture

[Figure: ONNET design parameters. Labels: Routers, Organization, Packet Switching, Input/Output, Packetization, Address Decoding, Destination ID, Buffer Size, Output Data Size, HLFU ID, Path Latency, Destination Router ID, Type of MIN, Input Traffic, HLFU Count, Logical Grouping, Buffer / Stack Size, Route Location, Data Rate, I/O Port, No. of Decoders, Decoding Rate, Length of Stack, No. of Buffers, Packet Size, I/P Data Size, Word Length]

SLIDE 19

Overview

  • Motivation
  • Custom Built Heterogeneous Multi-Core Architectures (CUBEMACH)

  • Design Space

– Architectural Space
– Optimization Space
– Customer Vendor Interaction
– Simulation Space

  • CUBEMACH Design and Simulation Tool Framework

  • Conclusion

SLIDE 20

Optimization Space

[Figure: CUBEMACH optimization flow, with steps numbered 1 to 9 (including 5a and 5b). Labels: Multiple Applications Input, CUBEMACH Architectural Space, Initial Candidate Architecture Parameters, CUBEMACH Optimization Space, Core Formation, Power Model, Performance Model, Desired Power to Performance Ratio, Game Theory, Simulated Annealing, Selected Parameters, CUBEMACH Simulation Space, Calculated Power to Performance Ratio, Final CUBEMACH]

SLIDE 21

Optimization Space

  • Generates Optimized CUBEMACH for input specifications such as,

– Power
– Performance
– Cost
– Initial Architecture

  • Power and Performance Model
  • Uses Game Theory (GT) and Simulated Annealing (SA) for optimization of Power and Performance
  • Uses KL Graph Partitioning For Core Grouping
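
Illustrative sketch (not from the slides): a generic simulated annealing loop that nudges architecture parameters until the modelled power to performance ratio approaches a desired value. The power and performance models, the parameter names `alfus` and `routers`, and the cooling schedule are placeholders invented for this example; the slides do not give the actual models or how Game Theory is combined with the annealing step.

```python
import math
import random


def power_model(params: dict) -> float:
    """Hypothetical linear power model."""
    return 2.0 * params["alfus"] + 0.5 * params["routers"]


def performance_model(params: dict) -> float:
    """Hypothetical linear performance model."""
    return 3.0 * params["alfus"] + 1.0 * params["routers"]


def cost(params: dict, desired_ratio: float) -> float:
    """Distance between the calculated and desired power to performance ratio."""
    return abs(power_model(params) / performance_model(params) - desired_ratio)


def neighbour(params: dict, rng: random.Random) -> dict:
    """Change one parameter by +/- 1, keeping every count at least 1."""
    candidate = dict(params)
    key = rng.choice(sorted(candidate))
    candidate[key] = max(1, candidate[key] + rng.choice((-1, 1)))
    return candidate


def simulated_annealing(initial, desired_ratio, steps=2000, t0=1.0, cooling=0.995, seed=0):
    rng = random.Random(seed)
    current, current_cost = dict(initial), cost(initial, desired_ratio)
    best, best_cost = dict(current), current_cost
    temperature = t0
    for _ in range(steps):
        candidate = neighbour(current, rng)
        delta = cost(candidate, desired_ratio) - current_cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, current_cost + delta
            if current_cost < best_cost:
                best, best_cost = dict(current), current_cost
        temperature *= cooling
    return best, best_cost


if __name__ == "__main__":
    print(simulated_annealing({"alfus": 4, "routers": 4}, desired_ratio=0.6))
```

In the flow described on the slides, the cost would come from the simulator rather than from closed-form models, so each annealing step would be far more expensive than in this toy version.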

SLIDE 22

Sample CUBEMACH Architecture

SLIDE 23

CUBEMACH Design Implementation : Supercomputer On Chip (SCOC) IP Cores

SLIDE 24

SCOC IP Cores

  • ALFUs designed as SCOC IP Cores
  • Soft IP Core
  • Coarse-grained Reusable Soft IP Cores
  • Scalable IP Cores

SLIDE 25

Customer Vendor Interaction

[Figure: customer vendor interaction flow. Labels: Customers, App 1, App 2, App 3, App 4, Simultaneous Multiple Applications, Application Requirements (Power & Performance), Workload Generation, Intermediate Format, Initial Heterogeneous Multi Core Candidate Architecture, CUBEMACH Design Space, Optimizer, Simulator, Intermediate CUBEMACH, Optimized CUBEMACH, Final Architecture, SCOC IP Cores, Layout, Fabrication of IP Cores, CUBEMACH Node Manufacturers / System Vendors]

Optimization Space Contd…

SLIDE 26

Overview

  • Motivation
  • Custom Built Heterogeneous Multi-Core Architectures (CUBEMACH)

  • Design Space

– Architectural Space
– Optimization Space
– Customer Vendor Interaction
– Simulation Space

  • CUBEMACH Design and Simulation Tool Framework

  • Conclusion

SLIDE 27

CUBEMACH Simulator

  • Pthreads based Simulator
  • Evaluates candidate CUBEMACH Architecture
  • Feeds results to CUBEMACH Optimizer
  • CUBEMACH Optimization Engine (COE) produces Optimized Architecture
  • Simulation & Optimization : An iterative process
  • Consists of

– ALFU Sub-Simulator
– COS Sub-Simulator
– ONNET Sub-Simulator
– Memory Sub-Simulator

SLIDE 28

CUBEMACH Simulator

SLIDE 29

Integrated CUBEMACH Design Paradigm … What we have seen . . .

SLIDE 30

[Figure: integrated CUBEMACH design paradigm, combining the Architectural, Optimization and Simulation Spaces]

CUBEMACH Architectural Space

– ONNET : Routers, Organization, Packet Switching, Input/Output, Packetization, Address Decoding, Destination ID, Buffer Size, Output Data Size, HLFU ID, Path Latency, Destination Router ID, Type of MIN, Input Traffic, HLFU Count, Logical Grouping, Buffer / Stack Size, Route Location, Data Rate, I/O Port, No. of Decoders, Decoding Rate, Length of Stack, No. of Buffers, Packet Size, I/P Data Size, Word Length
– Compiler-On-Silicon (SCOS, PCOS) : No. of parallel Units, Rate of Output Generation, Scheduler Processing rate, No. of Ports, BISA Length, No. of I/O Ports, Scheduler O/P Generation rate
– ALFU : Control Unit, Algorithm Size, Memory In Processor, HLFU Characteristics, ALFU Requirements, Delay, ALFU Types, Class of Algorithms, Type of MIP Cell, Number of MIP Cells, HLFU Control, Centralized / Decentralized, Grain Size, Architecture, Class of Units, Input Bits, Scalar
– Memory (SRAM, DRAM) : Mapping and Replacement Heuristic, Packet Size, No of Blocks, DRAM Size, SRAM Size, Cache Line Size, Word Length, No Of Ports
– ALISA : Number of Instructions Per ALISA, Types of Instructions Per ALISA, Types of Instructions in ISA, No of ALISA Fields, Decoding / Encoding Logic, Field Length

CUBEMACH Optimization Space

– Multiple Applications Input, Initial Candidate Architecture Parameters, Core Formation, Power Model, Performance Model, Desired Power to Performance Ratio, Game Theory, Simulated Annealing, Selected Parameters, Intermediate Architectural Parameters, Calculated Power to Performance Ratio, Final CUBEMACH

CUBEMACH Simulation Space

– Simulator, Simulation Results

SLIDE 31

Sample CUBEMACH Architecture : Simulation Results

[Figure panels: Matrix Based Algorithms; Graph Based Algorithms]

SLIDE 32

Sample CUBEMACH Architecture : Simulation Results

[Figure panels: Mixture of Algorithms; Comparison of Performance delivered by Optimized Architectures for corresponding types of Algorithms]

SLIDE 33

Sample CUBEMACH Architecture : Simulation Results

Overall Resource Utilization of :

(i) Initial CUBEMACH Architecture : Mean = 59 %
(ii) Optimized CUBEMACH Architecture : Mean = 74 %

SLIDE 34

In Initial Candidate CUBEMACH Architecture,

  • Matrix ALFUs – low usage
  • Scalar ALFUs – average usage
  • Graph ALFUs – high usage

In Optimized Candidate CUBEMACH Architecture,

  • Matrix ALFUs – high usage
  • Scalar ALFUs – high usage
  • Graph ALFUs – high usage

Sample CUBEMACH Architecture : Simulation Results

SLIDE 35

Conclusion

  • Custom Built Heterogeneous Multi-Core Architectures (CUBEMACH) promises,

– Increased Resource Utilization
– Multiple application flavored architectures
– Elimination of Space Time Sharing at the Quantum Level during Multiple Application Execution (without multiprogramming)
– Manufacturing and Running Cost reduction

SLIDE 36

Thank You

Questions?

SLIDE 37

Customizable Compiler-On-Silicon

  • What is Compiler-On-Silicon?
  • Why do we need Compiler-On-Silicon?
  • Why go for Customizable Compiler-On-Silicon?

Architectural Space Contd…

SLIDE 38

ONNET

Architecture uses -

  • Multistage Interconnect Network
  • Hardware Packetization Unit
  • ONNET Design Space

– H-Tree Structure within a Core
– 2D Torus Across Cores
– MIN Type
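
Illustrative sketch (not from the slides): in a 2D torus, each node links to four neighbours with wrap-around at the edges, so the hop count between two nodes takes the shorter way around each dimension. The coordinate scheme and function names below are assumptions for illustration; the slides do not specify ONNET's addressing or routing.

```python
def torus_neighbours(x: int, y: int, width: int, height: int) -> list:
    """Left, right, up and down neighbours of (x, y), with wrap-around links."""
    return [
        ((x - 1) % width, y),
        ((x + 1) % width, y),
        (x, (y - 1) % height),
        (x, (y + 1) % height),
    ]


def torus_hops(a: tuple, b: tuple, width: int, height: int) -> int:
    """Minimum hop count between a and b, taking the shorter way around each axis."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, width - dx) + min(dy, height - dy)


if __name__ == "__main__":
    print(torus_neighbours(0, 0, 4, 4))
    print(torus_hops((0, 0), (3, 3), 4, 4))
```

On a 4 x 4 torus, opposite corners are only 2 hops apart because of the wrap-around links, compared with 6 hops on a plain mesh of the same size.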

Architectural Space Contd…

[Figure: ONNET design parameters. Labels: Routers, Organization, Packet Switching, Input/Output, Packetization, Address Decoding, Destination ID, Buffer Size, Output Data Size, HLFU ID, Path Latency, Destination Router ID, Type of MIN, Input Traffic, HLFU Count, Logical Grouping, Buffer / Stack Size, Route Location, Data Rate, I/O Port, No. of Decoders, Decoding Rate, Length of Stack, No. of Buffers, Packet Size, I/P Data Size, Word Length]

SLIDE 39

Architectural Design Space - CUBEMACH

  • ALFU – Algorithm Level Functional Units
  • BISA – Backbone Instruction Set Architecture
  • COS – Compiler On Silicon
  • ONNET – On Node Network
  • Novel Cache Mapping Scheme
  • SCOC IP Cores (Super Computer On Chip - IP Cores) : Achieving cost effectiveness

SLIDE 40

Features -

  • Communication across heterogeneous multi-cores
  • Data requirements of diverse ALFUs
  • High bandwidth
  • Scalable
  • Hierarchical Network-On-Chip

On Node Network Architecture

SLIDE 41

Memory

[Figure: memory design parameters. Blocks: SRAM, DRAM. Labels: Mapping and Replacement Heuristic, Packet Size, No of Blocks, DRAM Size, SRAM Size, Cache Line Size, Word Length, No Of Ports]

SLIDE 42

Advantages of SCOC IP Cores

  • Fully Customizable
  • Greatly reduces Design-Turnaround-Time
  • Physical Design Friendly

– Constraints of Area, Power and Performance

  • Constrained & Rigid Design Methodology
