Network Interface Architecture and Prototyping for Chip and Cluster Multiprocessors

slide-1
SLIDE 1

University of Crete School of Sciences & Engineering Computer Science Department Master Thesis by Michael Papamichael

Network Interface Architecture and Prototyping for Chip and Cluster Multiprocessors

Supervisor

  • Prof. Manolis G.H. Katevenis

Heraklion, Crete, July, 2007

slide-2
SLIDE 2

Presentation Outline

  • NI Queue Manager
      • Introduction
      • Key Concepts
      • Architecture & Implementation
  • NI Design for CMPs
      • NI Design Goals
      • NI Design Issues
      • Proposed NI Design
  • Conclusions & Future Work

slide-4
SLIDE 4

NI Queue Manager

Introduction

  • FPGA-based Prototyping Platform
      • PCI-X RDMA-capable NIC in cluster environment
      • Buffered crossbar switch
  • Goals
      • Confirm buffered crossbar behavior
      • Interprocessor communication research

slide-6
SLIDE 6

NI Queue Manager - Key Concepts

Head-Of-Line (HOL) Blocking

[Figure: Input Queues, Switching Fabric and Outputs 1 and 2, with the labels "HOL Blocking" and "Idle!"]

  • HOL Blocking reduces switch throughput

slide-7
SLIDE 7

NI Queue Manager - Key Concepts

Virtual Output Queues (VOQs)

[Figure: Input Queues organized as VOQs, Switching Fabric and Outputs 1 and 2]

  • VOQs eliminate HOL Blocking
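The difference between the last two slides can be reproduced with a small slot-by-slot simulation. This is only an illustrative sketch, not the switch from the thesis: the function names, the greedy one-pass matching and the 2x2 traffic pattern are all assumptions made for the example.

```python
from collections import deque

def run_fifo(inputs, slots):
    """One FIFO per input: only the head packet of each input may be served."""
    queues = [deque(q) for q in inputs]
    delivered = 0
    for _ in range(slots):
        busy = set()                      # outputs already taken in this slot
        for q in queues:
            if q and q[0] not in busy:    # head blocked if its output is busy
                busy.add(q.popleft())
                delivered += 1
    return delivered

def run_voq(inputs, num_outputs, slots):
    """One queue per (input, output): any non-empty VOQ may be served."""
    voqs = []
    for q in inputs:
        per_out = [deque() for _ in range(num_outputs)]
        for dst in q:
            per_out[dst].append(dst)
        voqs.append(per_out)
    delivered = 0
    for _ in range(slots):
        busy = set()
        for per_out in voqs:              # greedy matching (assumed policy)
            for out in range(num_outputs):
                if per_out[out] and out not in busy:
                    per_out[out].popleft()
                    busy.add(out)
                    delivered += 1
                    break
    return delivered

# Two inputs, two outputs (0 and 1); both head packets want output 0.
traffic = [[0, 1], [0, 1]]
fifo_first_slot = run_fifo(traffic, 1)    # second input is HOL-blocked
voq_first_slot = run_voq(traffic, 2, 1)   # second input sends to output 1
```

With this traffic the FIFO organization leaves output 1 idle in the first slot, while the VOQ organization keeps both outputs busy.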

slide-8
SLIDE 8

NI Queue Manager - Key Concepts

Traffic Segmentation Schemes

Example traffic: packets of 300, 160, 80 and 40 bytes (580 bytes in total)

  • Variable-size Multipacket segments: 3 segments (256, 256, 68), S ≤ 256 B, 580 bytes
  • Fixed-size Multipacket segments: 3 segments (256, 256, 256), S = 256 B, 768 bytes
  • Fixed-size Unipacket segments: 11 segments (64 each), S = 64 B, 704 bytes
  • Variable-size Unipacket segments: 5 segments (256, 44, 160, 80, 40), S ≤ 256 B, 580 bytes

  • Traffic segmented to optimize switching
  • Variable-Size MultiPacket (VSMP) Segmentation well suited to buffered crossbar
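The byte counts on this slide can be checked with a few lines of code. The sketch below assumes that all four packets sit in the same queue when segmentation happens and that fixed-size segments are padded up to S; the function names are made up for the example.

```python
def unipacket(packets, max_seg, fixed):
    """Segments never cross a packet boundary."""
    segs = []
    for size in packets:
        while size > 0:
            payload = min(size, max_seg)
            # A fixed-size segment always occupies max_seg bytes (padding).
            segs.append(max_seg if fixed else payload)
            size -= payload
    return segs

def multipacket(packets, max_seg, fixed):
    """Queued packets are treated as one byte stream; segments may span packets."""
    remaining = sum(packets)
    segs = []
    while remaining > 0:
        payload = min(remaining, max_seg)
        segs.append(max_seg if fixed else payload)
        remaining -= payload
    return segs

packets = [300, 160, 80, 40]                    # sizes taken from the slide
vsmp = multipacket(packets, 256, fixed=False)   # variable-size multipacket
fsmp = multipacket(packets, 256, fixed=True)    # fixed-size multipacket
fsup = unipacket(packets, 64, fixed=True)       # fixed-size unipacket
vsup = unipacket(packets, 256, fixed=False)     # variable-size unipacket
```

Only the two variable-size schemes carry exactly the 580 payload bytes, and the multipacket variant does so with the fewest segments.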

slide-10
SLIDE 10

NI Queue Manager – Architecture & Implementation

Overview

  • Virtual Output Queues (VOQs)
  • VOQ migration to external memory
      • Hardware-managed linked lists
  • VSMP Segmentation
  • Scheduling
  • Flow Control
  • 3 major versions implemented

slide-11
SLIDE 11

NI Queue Manager – Architecture & Implementation

Architecture

[Block diagram: Host (PCI-X), Packet Sorter, On-Chip VOQs, Linked List Manager, Memory Controller, External Memory (Off-chip VOQs), Scheduler, Flow Control, Packet Processor, Network (RocketIO)]

slide-12
SLIDE 12

NI Queue Manager – Architecture & Implementation

Packet Sorter

[Block diagram repeated from slide 11]

slide-13
SLIDE 13

NI Queue Manager – Architecture & Implementation

Packet Sorter

  • Sorts packets according to:
      • destination
      • other criteria (e.g. priority)
  • Notifies Scheduler about incoming traffic
  • Light-weight packet processing
      • e.g. enforce maximum packet size
  • 2 versions implemented
      • with packet segmentation
      • without packet segmentation

slide-14
SLIDE 14

NI Queue Manager – Architecture & Implementation

On-Chip VOQs

[Block diagram repeated from slide 11]

slide-15
SLIDE 15

NI Queue Manager – Architecture & Implementation

On-Chip VOQs

  • Accumulates traffic in VOQs
  • VOQs implemented as:
      • Circular buffers in single statically partitioned on-chip memory
      • Xilinx FIFOs
  • 2 versions implemented
      • VOQs in BRAM
      • VOQs in Xilinx FIFOs
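As a rough software analogue of the first option, the sketch below keeps every VOQ as a circular buffer inside one shared array that is split into equal, fixed partitions. The class name, the word granularity and the per-queue counters are assumptions for illustration; the BRAM implementation in the thesis may differ in all of these.

```python
class PartitionedVOQs:
    """Circular buffers laid out in a single statically partitioned memory."""

    def __init__(self, num_voqs, words_per_voq):
        self.size = words_per_voq
        self.mem = [None] * (num_voqs * words_per_voq)  # one shared memory
        self.head = [0] * num_voqs    # read offset inside each partition
        self.tail = [0] * num_voqs    # write offset inside each partition
        self.count = [0] * num_voqs   # words currently stored per VOQ

    def push(self, voq, word):
        if self.count[voq] == self.size:
            return False              # this partition is full
        self.mem[voq * self.size + self.tail[voq]] = word
        self.tail[voq] = (self.tail[voq] + 1) % self.size   # wrap around
        self.count[voq] += 1
        return True

    def pop(self, voq):
        if self.count[voq] == 0:
            return None               # this VOQ is empty
        word = self.mem[voq * self.size + self.head[voq]]
        self.head[voq] = (self.head[voq] + 1) % self.size   # wrap around
        self.count[voq] -= 1
        return word
```

Because the partitions are static, a full VOQ cannot borrow space from an empty one, which is one reason large queues need somewhere else to go.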

slide-16
SLIDE 16

NI Queue Manager – Architecture & Implementation

Linked List Manager

[Block diagram repeated from slide 11]

slide-17
SLIDE 17

Linked List Manager

  • Performs Segment Transfers
      • variable-size segments
      • fixed-size segments
  • Manages Linked Lists
      • Head, Tail pointers in on-chip memory
      • Next Block pointers in DRAM (along with the data)
  • Optimization Techniques
      • Free Block Preallocation
      • Free-List Bypass
  • FSM-based Implementation

slide-18
SLIDE 18

Linked List Manager

Segment Transfers

  • 1. From On-Chip VOQs to Packet Processor
  • 2. From On-Chip to Off-Chip VOQs
  • 3. From Off-Chip VOQs to Packet Processor

[Figure: On-Chip VOQs (variable-size segments) and Off-Chip VOQs (fixed-size blocks), with traffic arriving from the Packet Sorter and leaving to the Packet Processor along transfers 1, 2 and 3]

  • max. segment size = 1 block

slide-22
SLIDE 22

Linked List Manager

Linked List Management

  • Large VOQs migrate to DRAM
      • Traffic stored in linked-lists of fixed-size blocks
      • Dynamic allocation of external memory
  • Block size needs to be:
      • Large to benefit from DRAM burst length
      • Small to minimize size of On-Chip VOQs
  • 2 Basic Operations
      • Enqueue
      • Dequeue
  • Free blocks stored in Free-Block List

slide-23
SLIDE 23

Linked List Manager

Basic Linked List Operations

  • Enqueue
      • Get free block from Free-Block list
      • Write data in the new block
      • Update Next-Block pointer of the last block
      • Update VOQ Tail pointer
  • Dequeue
      • Read data from the first block
      • Read Next-Block from first block
      • Update VOQ Head pointer
      • Put free block in Free-Block list
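The two operations can be sketched in software as follows. This is a minimal model under stated assumptions, not the hardware FSM: block payloads and Next-Block pointers live in plain lists standing in for DRAM, Head/Tail pointers stand in for the on-chip pointer memory, and the Free-Block list is chained through the same Next-Block array. All names are invented for the example.

```python
class LinkedListVOQs:
    NIL = -1

    def __init__(self, num_voqs, num_blocks):
        self.data = [None] * num_blocks        # block payloads ("DRAM")
        self.next = [self.NIL] * num_blocks    # Next-Block pointers ("DRAM")
        self.head = [self.NIL] * num_voqs      # per-VOQ Head pointers (on-chip)
        self.tail = [self.NIL] * num_voqs      # per-VOQ Tail pointers (on-chip)
        for b in range(num_blocks - 1):        # chain all blocks as free
            self.next[b] = b + 1
        self.free_head = 0 if num_blocks else self.NIL

    def enqueue(self, voq, payload):
        block = self.free_head                 # get free block
        if block == self.NIL:
            return False                       # external memory exhausted
        self.free_head = self.next[block]
        self.data[block] = payload             # write data in the new block
        self.next[block] = self.NIL
        if self.tail[voq] == self.NIL:
            self.head[voq] = block             # queue was empty
        else:
            self.next[self.tail[voq]] = block  # update Next-Block of last block
        self.tail[voq] = block                 # update VOQ Tail pointer
        return True

    def dequeue(self, voq):
        block = self.head[voq]
        if block == self.NIL:
            return None                        # queue is empty
        payload = self.data[block]             # read data from the first block
        self.head[voq] = self.next[block]      # read Next-Block, update Head
        if self.head[voq] == self.NIL:
            self.tail[voq] = self.NIL
        self.next[block] = self.free_head      # put block in Free-Block list
        self.free_head = block
        return payload
```

For example, with three blocks shared by all queues, a block released by one VOQ can immediately be reused by another, which is the dynamic allocation the previous slide refers to.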

slide-24
SLIDE 24

Linked List Manager

Enqueue/Dequeue Example

[Figure: DRAM blocks 1 to 20, the VOQ Pointers (Head and Tail per VOQ) and the Free List (Next Free Block). Animation steps: Enqueue into VOQ 5, Enqueue into VOQ 5, Dequeue from VOQ 5]

slide-25
SLIDE 25

Linked List Manager

Finite State Machine (FSM)

[FSM diagram. Operation groups: DEQUEUE, PUSH FREE BLOCK, POP FREE BLOCK, ENQUEUE, SRAM2XBAR. State labels: Idle, DEQ1, DEQ2, PUSHFB2, POPFB2, ENQ1, ENQ2, SRAM2XBAR1, SRAM2XBAR2]

slide-26
SLIDE 26

Linked List Manager

Finite State Machine (FSM)

[FSM diagram, simplified view: IDLE, DEQUEUE, PUSH FREE BLOCK, POP FREE BLOCK, ENQUEUE, On-Chip VOQs to Packet Processor]

slide-27
SLIDE 27

NI Queue Manager

Scheduler

[Block diagram repeated from slide 11]

slide-28
SLIDE 28

Scheduler

  • Keeps track of each VOQ
      • On-chip occupancy
      • Off-chip occupancy
  • Employs Flow Control (network & local)
      • Number of sent data words
      • Number of credits
  • Implements Scheduling
      • Builds VOQ eligibility masks
      • Enforces scheduling policy
  • Instructs Linked List Manager

slide-29
SLIDE 29

Scheduler

Scheduling Issues

  • Determining Eligibility
      • One eligibility mask for each kind of transfer
      • Eager approach
      • Lazy approach
  • Scheduling Policy
      • Round-Robin
      • Weighted Round-Robin
      • Deficit Round-Robin
      • Strict Priority
  • Starvation?
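A plain Round-Robin pick over an eligibility mask could look like the sketch below. The eligibility rule used here (a VOQ is eligible when it holds data and has at least one flow-control credit) is an assumption chosen for illustration, as are the function names and the sample occupancy and credit values; the slides do not spell out the exact conditions.

```python
def eligibility_mask(occupancy, credits):
    """Bit i is set when VOQ i has queued data and at least one credit."""
    mask = 0
    for i, (occ, cred) in enumerate(zip(occupancy, credits)):
        if occ > 0 and cred > 0:
            mask |= 1 << i
    return mask

def round_robin(mask, num_voqs, last):
    """First eligible VOQ after `last`, wrapping around; None if mask is empty."""
    for step in range(1, num_voqs + 1):
        candidate = (last + step) % num_voqs
        if mask & (1 << candidate):
            return candidate
    return None

# Hypothetical state for 8 VOQs.
occupancy = [0, 3, 0, 1, 2, 0, 0, 4]
credits = [1, 1, 1, 0, 2, 1, 1, 1]
mask = eligibility_mask(occupancy, credits)   # VOQs 1, 4 and 7 are eligible
chosen = round_robin(mask, 8, last=1)         # next eligible VOQ after VOQ 1
```

Starting the search just after the previously served queue is what keeps any single eligible VOQ from being skipped indefinitely; a Strict Priority policy gives no such guarantee.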

slide-30
SLIDE 30

NI Queue Manager

Packet Processor

[Block diagram repeated from slide 11]

slide-31
SLIDE 31

Packet Processor

  • Processes Network Traffic
      • Receives variable-size segments
      • Creates autonomous network packets
  • Performs 3 Basic Operations
      • Insert header
      • Modify header
      • Delete header
  • Implemented as 3-stage pipeline
  • Greatly depends on packet nature
      • RDMA packets well suited

slide-32
SLIDE 32

Packet Processor

Example of packet processing

[Figure: traffic passing through the Packet Processor. Segments 1 to 5 are turned into Packets 1 to 5 by header operations (Modify, Insert, Delete, Modify, Insert, Insert, Insert). Legend: Packet Header, Packet Body, End of Packet]

slide-33
SLIDE 33

NI Queue Manager

Implementation

  • 3 major versions
      • Full
      • “No external memory”
      • “No VSMP segmentation”
  • Variations of individual modules
      • Packet Processor
          • with/without segmentation
      • On-Chip VOQs
          • with BRAM/Xilinx FIFOs
      • Linked List Manager
          • with/without external memory support

slide-34
SLIDE 34

NI Queue Manager

FPGA Hardware Cost Results (8 VOQs)

Module                           LUTs       Slices      Flip Flops  BRAMs     Gate Count
Packet Sorter with segmentation  179 (1%)   135 (1%)    80 (1%)     0 (0%)    1909
Packet Sorter no segmentation    42 (1%)    25 (1%)     12 (1%)     0 (0%)    392
On-Chip VOQs BRAM                320 (1%)   236 (1%)    170 (1%)    31 (16%)  2035342
On-Chip VOQs Xilinx FIFOs        904 (2%)   1408 (7%)   1648 (4%)   32 (16%)  2119015
Scheduler                        2240 (5%)  1256 (6%)   428 (1%)    1 (1%)    86226
Linked List Mgr with ext mem     680 (1%)   365 (1%)    425 (1%)    0 (0%)    7069
Linked List Mgr no ext mem       33 (1%)    23 (1%)     18 (1%)     0 (0%)    426
Packet Processor                 521 (1%)   511 (2%)    617 (1%)    0 (0%)    9983

Queue Mgr Version                LUTs       Slices      Flip Flops  BRAMs     Gate Count
Full                             2962 (7%)  2106 (10%)  1571 (3%)   34 (17%)  2263151
No External Memory               2713 (6%)  1900 (9%)   1467 (3%)   34 (17%)  2260321
No VSMPS                         3430 (8%)  2639 (13%)  2286 (5%)   37 (19%)  2471047

slide-35
SLIDE 35

NI Queue Manager

Network Performance Results*

[Figure: Throughput measurements using unbalanced traffic patterns.]
[Figure: Average Delay vs. Input Load under uniform traffic. Max load is 96%.]

*The experiments were conducted by Vassilis Papaefstathiou, member of the CARV team

  • Verified previous theoretical and simulation results about the behavior and performance of buffered crossbar.

slide-37
SLIDE 37

IPC: NI Design Issues

NI Design Goals

  • High Performance
      • Low Latency
      • High Bandwidth
  • Scalability
  • Reliability
  • Protection & Security
  • Communication Overhead Minimization

slide-39
SLIDE 39

NI Design Issues

NI Placement

  • 1. I/O Bus
  • 2. Memory Bus
  • 3. Cache
  • 4. CPU Registers

Processor Proximity

  • Closer to the processor: Higher Performance, Proprietary Interfaces, Resource Sharing, Less Buffer Space
  • Farther from the processor: Lower Performance, Standard Interfaces, More Buffer Space

slide-40
SLIDE 40

NI Design Issues

NI Data Transfer Mechanisms

  • Programmed IO (PIO)
      • Uncached stores, loads or I/O instructions
          • Low bandwidth
      • Cached stores, loads
          • data transferred in cache blocks
          • requires coherence
      • Occupies Processor to copy data
  • Using Direct Memory Access (DMA)
      • Decouples Processor
      • Requires Virtual to Physical Address Translation

slide-41
SLIDE 41

NI Design Issues

NI Virtualization & Protection

  • Traditional approach
      • System Call to Access NI
      • Very Costly (high latency)
  • User-level NIs (ULNIs)
      • Bypass the OS
      • Mapped to Virtual Memory
      • Uses OS paging mechanism for protection

[Figure: two stacks, each showing User-level Software, Operating System and NI Hardware]

slide-43
SLIDE 43

Proposed NI Design

Design Goals / Desired Features

  • Targeted to future Chip Multiprocessors
      • thousands to millions of nodes
  • Tightly coupled with the processor
  • Low Complexity/Cost
      • Compared to processor and local memory
  • Resources (e.g. memory)
      • Shared with Processor
      • Dynamically allocated (only to active connections)
  • Low Overhead transfer initiation
  • Powerful Communication Primitives
      • Bulky Data Transfers
      • Control & Synchronization Traffic

slide-44
SLIDE 44

Proposed NI Design

Target Hardware/System

Xilinx XUP board

  • CPU: PowerPC
  • OS: Linux
  • On-chip BRAM
      • Scratchpad
      • Cache?
  • External DRAM
  • 10 Gbps Network
      • RocketIO

slide-45
SLIDE 45

Proposed NI Design

Target Hardware/System

[Block diagram of the Xilinx XUP board. Labels: FPGA; PowerPC @ 266 MHz; On-chip Memory (low latency, high throughput, 8 dual-port BRAM banks); Shared among CPU and NI; NI; PLB; OCM ?; DRAM; Network; 10 Gbps, 10 Gbps; clock labels @133 MHz, @133 MHz, @166 MHz]

  • Network Interface simple and small compared to CPU and its local memory

slide-46
SLIDE 46

Proposed NI Design

Communication Primitives

  • Remote DMA
      • Bulky Data Transfers
      • Producer-Consumer
      • Facilitates Zero-Copy Protocols
      • Requires Virtual-to-Physical Translation
  • Message Queues
      • Low Latency, Minimal Overhead communication
      • Small Data Transfers
      • One-to-One, One-to-Many, Many-to-One
      • Powerful synchronization primitive

slide-47
SLIDE 47

Proposed NI Design

Message Queues

[Figure: Queues]

  • One-to-One: e.g. Small Data Transfers
  • One-to-Many: e.g. Job Dispatching
  • Many-to-One: e.g. Synchronization

slide-48
SLIDE 48

Proposed NI Design

Connections

  • Mechanism for
      • Sending Messages
      • Initiating RDMA operations
  • A Connection consists of:
      • Queue Pair (Incoming & Outgoing)
      • Information needed for RDMA
      • Various Queue, Flow Control metadata
  • Connection Types
      • One-to-One (lower overhead, higher security)
      • Many-to-Many (better resource utilization)

slide-49
SLIDE 49

Proposed NI Design

Connection Types

  • One-to-One
      • Both the Incoming and the Outgoing Queue are One-to-One
  • Many-to-Many
      • Incoming Queue is Many-to-One and/or Outgoing Queue is One-to-Many

slide-50
SLIDE 50

Proposed NI Design

Connection Types - Example

  • One-to-One: Local Communication
  • Many-to-Many: Global Communication

slide-51
SLIDE 51

Proposed NI Design

Connection Table (CT)

[Figure: Node Memory, the Scratchpad, and the Connection Table (CT), whose Connection Table Entries (CTEs) are indexed by ConnectionID (CID)]

  • Connections reside in Connection Table in the form of Connection Table Entries (CTEs)

slide-52
SLIDE 52

Proposed NI Design

Connection Table Entry (CTE)

  • Each CTE stores
      • Destination Info (for one-to-one)
      • Protection Info (for many-to-many)
      • Incoming/Outgoing Queue Info
      • Incoming/Outgoing RDMA Info
      • Flow Control Info
      • Other Info
          • e.g. Notification Type

slide-53
SLIDE 53

Proposed NI Design

Connection Table Entry for Xilinx XUP

Each CTE (selected by CID) spans eight 32-bit words, B8 (32) to B1 (32), holding the following fields (widths in bits):

  • System-level Outgoing Queue Info: Base (6), Start (8), End (8), Head (10)
  • Outgoing RDMA Info
  • System-level Incoming Queue Info: Base (6), Start (8), End (8), Tail (10)
  • Incoming RDMA Info
  • User-level Outgoing Queue Info: Outgoing Tail (10)
  • User-level Incoming Queue Info: Incoming Head (10)
  • System-level Connection Info: Remote Node ID (24), Remote CID (16), PGID (16)
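The four system-level outgoing queue fields add up to exactly one 32-bit word (6 + 8 + 8 + 10). The sketch below packs and unpacks such a word; the bit order (Base in the most significant bits, Head in the least significant) and the helper names are assumptions, since the slide gives only the field widths.

```python
# Field names and widths from the slide; the ordering within the word is assumed.
FIELDS = (("base", 6), ("start", 8), ("end", 8), ("head", 10))

def pack(values):
    """Pack the four fields into one 32-bit word, first field most significant."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word

def unpack(word):
    """Inverse of pack(): split a 32-bit word back into its fields."""
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values

entry = {"base": 5, "start": 0, "end": 255, "head": 17}   # hypothetical values
word = pack(entry)
```

The same helper would work for the incoming queue word by renaming the last field to "tail".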

slide-54
SLIDE 54

Proposed NI Design

Software Interface

  • Send Message or Initiate RDMA
      • Read Head/Tail of Outgoing Queue
      • Post Descriptor
      • Write Tail of Outgoing Queue
      • (Poll Head of Outgoing Queue)
  • Receive Message or RDMA Notification
      • Poll Tail of Incoming Queue (or get Interrupt)
      • Read Message (or Descriptor)
      • Write Head of Incoming Queue
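Both sequences follow the usual single-producer, single-consumer ring pattern, in which the producer advances Tail and the consumer advances Head. The sketch below models that pattern in software only; the class name, the one-empty-slot convention and the method names are assumptions, and the real interface would use memory-mapped queue pointers rather than Python attributes.

```python
class Ring:
    """Head/Tail ring: the producer writes Tail, the consumer writes Head."""

    def __init__(self, slots):
        self.slots = [None] * slots
        self.head = 0    # next entry to consume
        self.tail = 0    # next free slot to fill

    def produce(self, entry):
        # Read Head/Tail and check for space (one slot is always kept empty).
        next_tail = (self.tail + 1) % len(self.slots)
        if next_tail == self.head:
            return False                  # queue full
        self.slots[self.tail] = entry     # post descriptor or message
        self.tail = next_tail             # write Tail
        return True

    def consume(self):
        if self.head == self.tail:        # poll: nothing new yet
            return None
        entry = self.slots[self.head]     # read message or descriptor
        self.head = (self.head + 1) % len(self.slots)   # write Head
        return entry

# Outgoing queue: software produces, the NI consumes.
outgoing = Ring(4)
outgoing.produce("descriptor 0")
# Incoming queue: the NI produces, software consumes.
incoming = Ring(4)
incoming.produce("message 0")
received = incoming.consume()
```

Since each pointer has a single writer, neither side needs a lock, which fits the low-overhead transfer initiation the design aims for.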

slide-55
SLIDE 55

Proposed NI Design

Software Interface - Descriptors


slide-56
SLIDE 56

Proposed NI Design

Protection – Intranode Protection

[Figure: User-Level processes (Process 13, Process 42, Process 27), the Kernel and the NI Hardware]

  • Isolation of malicious processes. Not allowed to:
      • Read or write data of connections of other processes
      • Corrupt connections of other processes

slide-57
SLIDE 57

Proposed NI Design

Protection – Internode Protection

  • Isolation of compromised node. Not allowed to:
      • Compromise other nodes
      • Corrupt connections of other processes

[Figure: Node A, Node B, Node C, Node D, Node E]

slide-58
SLIDE 58

Proposed NI Design

Protection in the Proposed NI

  • Creation of 3 Protection Zones
      • High protection: e.g. banks written only by special NI hardware or run-time system
      • Moderate protection: e.g. banks written only by system-level software (e.g. kernel driver)
      • Low protection: e.g. banks written by everyone including user-level software

[Figure: the Connection Table, one CID per column, spread across Bank 1 to Bank 8]

slide-59
SLIDE 59

Proposed NI Design

Controlling Access to the CT

  • Distinguish User-level from Kernel-level
      • Use of Shadow Address Spaces
      • Double the required Address Space

[Figure: Virtual Address Space and Physical Address Space. Labels: Connection Table, CTE, Normal Physical Address Space, Shadow Physical Address Space, Mapped to system-level process, Mapped to user-level process. Example addresses: 0xFFFF1ABC, 0xFFFF0ABC, 0xDEADBEEF, 0x123456EF]

slide-60
SLIDE 60

Proposed NI Design

Controlling Access to the CT

  • Distinguish different User-level processes
      • Fine-grain protection

[Figure: the virtual pages of Processes 1 to 4 in the Virtual Address Space map to the physical pages of Processes 1 to 4 in the Physical Address Space; each page of the Connection Table holds 32 CTEs]
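Page-granularity protection means that a process can reach a CTE only if the page holding it is mapped into that process. The sketch below illustrates the resulting check; only the figure of 32 CTEs per page comes from the slide, while the consecutive CID layout, the function names and the example mapping are hypothetical.

```python
CTES_PER_PAGE = 32   # from the slide: each page holds 32 CTEs

def page_of(cid):
    """Page index holding a given CID, assuming CTEs are laid out consecutively."""
    return cid // CTES_PER_PAGE

def may_access(process_pages, pid, cid):
    """True when the page holding `cid` is mapped into process `pid`."""
    return page_of(cid) in process_pages.get(pid, set())

# Hypothetical mapping: process k owns Connection Table page k - 1.
process_pages = {1: {0}, 2: {1}, 3: {2}, 4: {3}}
```

In the actual design this check would be carried out implicitly by the OS paging mechanism rather than by explicit code.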

slide-61
SLIDE 61

Presentation Outline

  • NI Queue Manager
      • Key Concepts
      • Architecture
      • Implementation
  • IPC: NI Design Issues
  • Proposed NI Design
  • Conclusions & Future Work

slide-62
SLIDE 62

Conclusions & Future Work

Conclusions

  • NI Queue Manager
      • Feasibility and Effectiveness of VSMPS
      • Confirmed buffered crossbar results
      • Novel Packet Processing Mechanism
          • Eliminates need for reassembly
  • Proposed NI Design
      • Lightweight, well suited to future CMPs
      • Powerful communication primitives
          • Message Queues, Remote DMA
      • Notion of Connections/Queues
      • Versatile Protection & Security Mechanisms

slide-63
SLIDE 63

Conclusions & Future Work

Future Work

  • NI Queue Manager
      • Scalability
          • Dynamic memory allocation for On-Chip VOQs
  • Proposed NI Design
      • Process migration
      • Flow control
      • Efficient cache coherence support

slide-64
SLIDE 64

Thank You!
