  1. On-Chip Interconnection Networks
     Chip Multiprocessors (ACS MPhil)
     Robert Mullins

     Introduction
     • Vast transistor budgets, but ....
     • Poor interconnect scaling
       – Pressure to decentralise designs
     • Need to manage complexity and power
     • Need for flexible/fault tolerant designs
     • Parallel architectures
       – Keep core complexity constant or simplify
       – The result is a need to interconnect lots of cores, memories and other IP cores.

     Introduction
     • On-chip communication requirements:
       – High-performance
         • Latency and bandwidth
       – Flexibility
         • Move away from fixed application-specific wiring
         • Ability to share global wiring resources between different flows
       – Fault tolerance (in the long term)
         • The existence of multiple communication paths between module pairs
       – Support for different traffic types and QoS

     Introduction
     • On-chip communication requirements (continued):
       – Simplicity (ease of design and verification)
         • Structured, modular and regular
         • Optimize channel and router once
       – Efficiency
       – Scalability
         • Number of modules is rapidly increasing

  2. Introduction
     • Don't we already know how to design interconnection networks?
       – Many existing network topologies, router designs and much of the theory have already been developed for high-end supercomputers and telecom switches
       – Yes, and we'll cover some of this material, but the trade-offs on-chip lead to very different designs.
     • "integrated microarchitectural networks"

     Introduction
     • The design of the on-chip network is not an isolated design decision (or afterthought)
       – e.g. consider impact on cache coherency protocol
       – What is the correct balance of resources (wires and transistors, silicon area, power etc.) between the on-chip network and computational resources?
       – Where does the on-chip network stop and the design of a module or core start?
       – Does the network simply blindly allow modules to communicate or does it have additional functionality?

     On-chip vs. Off-chip
     • Compare availability of pins and wiring tracks on-chip to cost of pins/connectors and cables off-chip
     • Compare communication latencies on- and off-chip
       – What is the impact on router and network design?
     • Applications and workloads
     • Amount of memory available on-chip
       – What is the impact on router design/flow control?
     • Power budgets on- and off-chip
     • Need to map network to planar chip (or perhaps more recently a 3D stack of dies)

     On-chip interconnect
     • Typical interconnect at 45nm node:
       – 10-14 metal layers
       – Local interconnect (M1)
         • 65nm metal width, 65nm spacing
         • 7700 metal tracks/mm
       – Global (e.g. M10)
         • 400nm metal width, 400nm spacing
         • 1250 metal tracks/mm
     • Remember global interconnects scale poorly when compared to transistors
     [Figure: cross-section of a 9Cu+1Al metal stack (Fujitsu, 2007)]
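     The track densities quoted above follow directly from the wire pitch. A minimal sanity check, assuming tracks/mm is approximately 1 mm divided by (width + spacing):

```python
# Rough check of the metal track densities quoted above (45 nm node).
# Assumes tracks/mm ~= 1 mm / pitch, where pitch = wire width + spacing.

def tracks_per_mm(width_nm: float, spacing_nm: float) -> float:
    pitch_nm = width_nm + spacing_nm          # one wire plus one gap
    return 1_000_000 / pitch_nm               # 1 mm = 1,000,000 nm

print(tracks_per_mm(65, 65))    # local (M1):   ~7692 tracks/mm (slide quotes 7700)
print(tracks_per_mm(400, 400))  # global (M10): 1250 tracks/mm
```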

  3. Bus-based interconnects
     • Bus-based interconnects
       – Central arbiter provides access to bus
       – Logically the bus is simply viewed as a set of wires shared by all processors

     Bus-based interconnects
     • Real bus implementations are typically switch based
       – Multiplexers and unidirectional interconnects with repeaters
       – Tri-states are rarely used now
       – Interconnect itself may be pipelined
     • A bus-based CMP usually exploits multiple unidirectional buses
       – e.g. address bus, response bus and data bus

     Bus-based interconnects for multicore?
     • Metal/wiring is cheap on-chip!
     • Avoid complexity of packet-switched networks
     • Keep cache-coherency simple
     • Performance issues
       – Centralised arbitration
       – Low clock frequency (pipeline?)
       – Power?
       – Scalability?
     [Figure: repeated bus global interconnect, Shekhar Borkar (OCIN'06)]

     Bus-based interconnects for multicore?
     • Optimising bus-based solutions:
       – Arbitrate for next cycle on current clock cycle
       – Use wide, low-swing interconnects
       – Limit broadcast to subset of processors?
         • Segment bus and filter redundant broadcasts to some segments by maintaining some knowledge of cache contents. So called, "Filtered Segmented Buses"
       – Employ multiple buses
       – Move from electrical to on-chip optical solutions?
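     To make the "switch based" view concrete, here is a minimal sketch of a shared bus as a central round-robin arbiter plus a multiplexer: one master is granted per cycle and its value is forwarded to all attached modules. The class and method names are illustrative only, not taken from any real design.

```python
# Minimal sketch of a switch-based shared bus: a central round-robin arbiter
# grants one master per cycle; the bus itself is just a multiplexer that
# forwards the granted master's value. (Names and structure are illustrative.)

class SharedBus:
    def __init__(self, num_masters: int):
        self.num_masters = num_masters
        self.last_grant = num_masters - 1     # round-robin pointer

    def arbitrate(self, requests: list[bool]):
        """Return the index of the granted master, or None if nobody requests."""
        for offset in range(1, self.num_masters + 1):
            candidate = (self.last_grant + offset) % self.num_masters
            if requests[candidate]:
                self.last_grant = candidate
                return candidate
        return None

    def cycle(self, requests: list[bool], values: list):
        granted = self.arbitrate(requests)
        if granted is None:
            return None
        return values[granted]                # multiplexer: drive the granted value

bus = SharedBus(4)
print(bus.cycle([False, True, True, False], ["a", "b", "c", "d"]))
# grants master 1 -> 'b'; a repeat call with the same requests would grant master 2
```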

  4. Filtered Segmented Bus
     • Filter broadcasts to segments with Bloom filter
     • Energy savings possible vs. mesh and flattened butterfly networks (for 16, 32 and 64 cores) because routers can be removed
     • For large numbers of cores multiple (address-interleaved) buses are required to avoid a significant performance penalty due to contention
     "Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks", Udipi et al., HPCA 2010

     Bus-based interconnects for multicore?
     • Exploiting multiple buses (or rings):
       – Multiple address-interleaved buses
         • e.g. Sun Wildfire/Starfire
       – Use different buses for different message types
       – Subspace snooping [Huh/Burger06]
         • Associate (dynamic) address ranges with each bus. Each subspace is a region of data shared by a stable subset of the processors.
         • This technique tackles snoop bandwidth limitations as not all processors are required to snoop all buses
       – Exploit buses at the lowest level of a hierarchical network (e.g. mesh interconnecting tiles, where each tile is a group of cores connected by a bus)

     Sun Starfire (UE10000)
     • Up to 64-way SMP using bus-based snooping protocol
     • 4 processors + memory module per system board
     • Uses 4 interleaved address buses to scale snooping protocol
     • Separate data transfer over high-bandwidth 16x16 data crossbar
     [Figure: system boards connected via board interconnects and the 16x16 data crossbar; slide from Krste Asanovic (Berkeley)]

     Ring Networks
     • k-node ring (or k-ary 1-cube)
     • Exploit short point-to-point interconnects
     • Can support many concurrent data transfers
     • Can keep coherence protocol simple and avoid need for directory-based schemes
       – We may still broadcast transactions
     • Modest area requirements
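     As a sketch of the filtering idea: suppose each bus segment keeps a small Bloom filter summarising the block addresses its caches may hold, and a snoop broadcast is forwarded to a segment only if the filter might match. This is only an illustration of the general mechanism, not the exact design from Udipi et al.; the hash functions and sizes are assumptions.

```python
# Illustrative broadcast filtering with per-segment Bloom filters, in the
# spirit of "filtered segmented buses". Hash functions and filter size are
# made up for the example, not taken from the HPCA 2010 paper.

class SegmentFilter:
    def __init__(self, bits: int = 1024):
        self.bits = bits
        self.filter = 0

    def _hashes(self, block_addr: int):
        # Two cheap, illustrative hash functions over the block address.
        yield block_addr % self.bits
        yield ((block_addr * 2654435761) >> 10) % self.bits

    def insert(self, block_addr: int):
        for h in self._hashes(block_addr):
            self.filter |= 1 << h

    def may_contain(self, block_addr: int) -> bool:
        return all(self.filter & (1 << h) for h in self._hashes(block_addr))

segments = [SegmentFilter() for _ in range(4)]
block = 0x801000 >> 6                         # block address of a cached line
segments[2].insert(block)                     # a cache in segment 2 holds it

targets = [i for i, s in enumerate(segments) if s.may_contain(block)]
print(targets)   # snoop is forwarded only to segment 2 (plus any false positives)
```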

  5. Ring Networks
     • Control
       – May be distributed
         • Need to be a little careful to avoid possibility of deadlock (more later!)
       – Or a centralised arbiter/scheduler may be used
         • e.g. IBM Cell BE and Larrabee both appear to use a centralised scheduler
         • Try and schedule as many concurrent (non-overlapping) transfers on each available ring as possible
     • Trivial routers at each node
       – Simple routers are attractive as they don't introduce significant latency, power and area overheads

     Ring Networks: Examples
     • IBM
       – Power4, Power5
     • IBM/Sony/Toshiba
       – Cell BE (PS3, HDTV, Cell blades, ...)
     • Intel
       – Larrabee (graphics), 8-core Xeon processor
     • Kendall Square Research (1990's)
       – Massively parallel supercomputer design
       – Ring of rings (hierarchical or multi-ring) topology
         • Cluster = 32 nodes connected in a ring
         • Up to 34 clusters connected by higher level ring

     Ring Networks: Example IBM Cell BE
     • Cell Broadband Engine
       – Message-passing style (no $ coherence)
       – Element Interconnect Bus (EIB)
         • 2 rings are provided in each direction
         • Crossbar solution was deemed too large

     Ring Networks: Example Larrabee
     • Cache coherent
     • Bi-directional ring network, 512-bit wide links
       – Short linked rings proposed for >16 processors
       – Routing decisions are made before injecting messages
       – The clockwise ring delivers on even clock cycles and the anticlockwise ring on odd clock cycles
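     The "trivial router" and "route before injection" points amount to a single comparison per message on a bidirectional ring: pick whichever direction has fewer hops. A minimal sketch (the function name and return format are illustrative):

```python
# Trivial source routing on a k-node bidirectional ring: before injecting a
# message, choose the direction (clockwise or anticlockwise) with fewer hops.

def ring_route(src: int, dst: int, k: int):
    cw_hops = (dst - src) % k                 # hops going clockwise
    ccw_hops = (src - dst) % k                # hops going anticlockwise
    if cw_hops <= ccw_hops:
        return ("clockwise", cw_hops)
    return ("anticlockwise", ccw_hops)

print(ring_route(1, 6, 8))   # ('anticlockwise', 3)
print(ring_route(1, 3, 8))   # ('clockwise', 2)
```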

  6. Crossbar Networks
     • A crossbar switch is able to directly connect any input to any output without any intermediate stages
       – It is an example of a strictly non-blocking network
         • It can connect any input to any output, incrementally, without the need to rearrange any of the circuits currently set up.
       – The main limitation of a crossbar is its cost. Although very useful in small configurations, n x n crossbars can quickly become prohibitively expensive as their cost increases as n²
     (Dally/Towles book, Chapter 6)

     Crossbar Networks
     • A 4x3 crossbar implemented using three 4:1 multiplexers
     • Each multiplexer selects a particular input to be connected to the corresponding output

     Crossbar Networks: Example Niagara
     • Crossbar switch interconnects 8 processors to banked on-chip L2 cache
       – A crossbar is actually provided in each direction: Forward and Return
     • Simple cache coherence protocol
       – See earlier seminar
     [Figure reproduced from IEEE Micro, Mar'05]

     Crossbar Networks: Example Cyclops
     • IBM, US Dept. of Energy/Defense, Academia
     • Full system 1M+ processors, 80 cores per chip
     • Interconnect: centralised 96x96 buffered crossbar switch with a 7-stage pipeline
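     The multiplexer view of a crossbar can be modelled directly: one n:1 multiplexer per output, each with its own select, so setting up a new connection never disturbs existing ones (strictly non-blocking) and crosspoint cost grows as inputs × outputs, i.e. n² for a square crossbar. A minimal sketch with illustrative names:

```python
# Model of an n-input x m-output crossbar built from m n:1 multiplexers:
# each output has a select register choosing which input drives it.
# Crosspoint cost grows as n*m (n^2 for a square n x n crossbar).

class Crossbar:
    def __init__(self, num_inputs: int, num_outputs: int):
        self.num_inputs = num_inputs
        self.select = [None] * num_outputs    # one mux select per output

    def connect(self, inp: int, out: int):
        self.select[out] = inp                # never disturbs other outputs

    def propagate(self, inputs: list):
        return [None if s is None else inputs[s] for s in self.select]

xbar = Crossbar(4, 3)                         # the 4x3 example from the slide
xbar.connect(2, 0)
xbar.connect(0, 2)
print(xbar.propagate(["A", "B", "C", "D"]))   # ['C', None, 'A']
print(4 * 3, "crosspoints here; an n x n crossbar needs n**2")
```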
