slide-1
SLIDE 1

RF-Interconnect for Communications On-Chip

Frank Chang (1), Jason Cong (2), Glenn Reinman (2), Eran Socher (1), Rocco Tam (1)

(1) Department of Electrical Engineering
(2) Department of Computer Science

Current Trend in CMP - NoC

  • 65nm CMOS 80-tile NoC
  • 10x8 2D mesh network-on-chip running @ 4GHz
  • Bisection bandwidth 256GB/s
  • 1 TFLOPS @ 1V, about 98W

ISSCC 2007: An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS (Sriram Vangal et al., Intel)

slide-2
SLIDE 2

What is The Challenge?

  • Cores would keep shrinking in size but maintain the same operation frequency (2~4GHz) due to thermal constraints
  • More cores would be integrated on the same chip to achieve performance boost through parallelism
  • Performance would be limited by the communication efficiency between cores and memories on- and off-chip

The Scaling Trend

  • Scaling reduces delay of logic gates but not wires

[Figure: Transistor and Wire Delay Trend in CMOS. Delay [ps] vs. Technology Node for FO4, 1mm RC global wire, and Repeated 1mm RC global wire]

slide-3
SLIDE 3

Traditional Interconnect

  • Units communicate through a parallel bus using voltage signaling (charging and discharging the wire capacitance)
  • Latency is RC limited (~L^2)
  • Using CMOS repeaters reduces latency (~L) but does not benefit from scaling
  • Supply no longer scales due to leakage
  • Baseband-only signaling requires extensive equalization
  • Waste of broad bandwidth available from modern CMOS devices (fT > 150GHz, fmax > 250GHz)

Major Interconnect Issues

  • Latency is large across chip
  • Bandwidth is RC limited (~1Gbps/wire)
  • Communication pattern is fixed
  • Energy consumption is high and not scalable (~10pJ/bit)
  • Future microprocessors may encounter communication congestion and most of the energy will be spent on “talking” instead of computing

slide-4
SLIDE 4

How Can RF Help?

  • EM waves travel at the (effective) speed of light (~10ps/mm)
  • Carrier frequencies can be modulated by modern CMOS with high data rates
  • Transmission lines on- or off-chip can guide the waves (RF modulated data) from the transmitter to receiver with recoverable attenuation
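
As a rough illustration of the ~10ps/mm figure, the sketch below computes the time of flight along a transmission line and expresses it in clock periods. The function names and the 20mm, 4GHz example values are assumptions chosen for illustration, and transmitter and receiver circuit delays are ignored.

```python
PS_PER_MM = 10.0  # approximate propagation delay quoted on the slide


def time_of_flight_ps(length_mm, ps_per_mm=PS_PER_MM):
    """Propagation delay of a wave guided along a transmission line."""
    return length_mm * ps_per_mm


def delay_in_cycles(freq_ghz, delay_ps):
    """Express a delay as a number of clock periods at freq_ghz."""
    period_ps = 1000.0 / freq_ghz
    return delay_ps / period_ps


# Hypothetical example: a 20mm cross-chip link seen by a 4GHz core
delay = time_of_flight_ps(20.0)
print(delay, delay_in_cycles(4.0, delay))
```

Under these assumptions the wave crosses the link in less than one core clock period, and the delay grows linearly with length rather than with L^2 as for an unrepeated RC wire.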

RF-Interconnect Concept

  • Data is transmitted through transmission lines at the speed of light, with less dispersion across the band and less baseband interference
  • Data rate is limited only by CMOS mixer modulation speed
slide-5
SLIDE 5

RF-I using Multi-band FDMA

  • More bands are used with same modulation speed at each band
  • Higher aggregate data rates can be achieved on the same transmission line

3.6Gbps Multi-drop Multiband Bi-directional RF-I *

[Figure: input data patterns, recovered data waveforms, and recovered data eye diagrams for four FDMA chips (chip1 to chip4) exchanging Data-B and Data-R over a 10 cm FR4 interconnect]

Data-B: 1.8Gb/s PRBS through baseband
Data-R: 1.8Gb/s PRBS through RF-band

* World’s 1st Multiband RF-I, Ko & Chang, 2005 ISSCC

slide-6
SLIDE 6

RF-Interconnect for NoC

  • RF-I is built on top of 2D-Mesh NoC and serves as a “super-highway”
  • Multiple carrier frequencies in the RF and MMW range (100GHz to over 500GHz)
  • Data encoding by amplitude modulation of carrier
  • Direct coupling between the transmission line and electronic circuits
  • Improves with device performance scaling (higher data rates, more carriers)
  • Potentially lower energy consumption

Can We Implement RF-I in CMOS?

  • Today’s RF-CMOS circuits are in the wireless communication “sweet spots” of 500MHz-5GHz
    – Insufficient bandwidth for RF-I to be effective!
  • Millimeter-wave CMOS circuits have been developed for 60GHz and recently for 324GHz bands

slide-7
SLIDE 7

CMOS 324GHz Generator

  • -76dBm before calibration
  • -46dBm after calibration

*Huang, Larocca and Chang, “324GHz CMOS Frequency Generator using Linear Superposition Technique,” pp. 476-477, 2008 ISSCC

Frequency Generation in Multiband RF-Interconnect

[Figure: multi-band synthesizer block diagram. Six carriers from f1 = 10GHz to f6 = 60GHz in 10GHz steps; six TX and six RX channels (Data1 to Data6) with mixers, an output buffer, the transmission line, and an LPF]

slide-8
SLIDE 8

Simultaneous Sub-harmonic Injection Locked mm-Wave Frequency Generation

  • Sub-harmonic injection-locked VCOs simultaneously lock to a single reference frequency
  • Advantages:
    – Eliminate PLLs
    – Low Power Consumption
    – Small Area

[Figure: block diagram with Master VCO, Non-linear Harmonic Generator, and Slave VCOs]

Sub-harmonic Injection Locked VCO*

  • LC-based VCO core
  • Differential pair for odd harmonic generation
  • Single-ended even harmonic generation
  • Injection locking to high harmonic within locking range of the VCO

This Work*:
Process: 90nm CMOS
Free Running Frequency: 29.3GHz
Max Locking Range: 5.6GHz
Locking Harmonics: 2nd, 4th, 6th, 8th; 3rd, 5th, 7th
Power: 4mW

*Sai-Wang Tam, M.-C. Frank Chang, et al., "Simultaneous Sub-harmonic Injection-Locked mm-Wave Frequency Generators for Multi-band Communications in CMOS", IEEE RFIC Symp., 2008

slide-9
SLIDE 9

RF-I using Amplitude shift-Key (ASK) Modulation

  • TX: A transformer couples the VCO output to the ASK modulator, and a simple modulator generates the ASK RF signal.
  • RX: A self-mixer performs envelope detection. A simple buffer and Schmitt trigger then restore the signal to rail-to-rail swing.
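
The TX/RX chain above can be sketched behaviorally in a few lines. This is only a toy model under stated assumptions: the carrier is switched fully on or off, the self-mixer is modeled as squaring, the low-pass filter as a per-bit average, and the Schmitt trigger as a two-threshold comparator. The function names, sample counts, and thresholds are hypothetical and not taken from the slides.

```python
import math


def ask_modulate(bits, cycles_per_bit=12, samples_per_cycle=16):
    """On-off ASK: the carrier is present for a 1 and absent for a 0."""
    samples_per_bit = cycles_per_bit * samples_per_cycle
    signal = []
    for bit in bits:
        for k in range(samples_per_bit):
            signal.append(bit * math.sin(2 * math.pi * k / samples_per_cycle))
    return signal, samples_per_bit


def envelope_detect(signal, samples_per_bit, low=0.15, high=0.35):
    """Self-mixing (squaring), per-bit averaging, then hysteresis."""
    bits = []
    state = 0
    for start in range(0, len(signal), samples_per_bit):
        chunk = signal[start:start + samples_per_bit]
        # Mean of the squared carrier is about 0.5 for a 1 and 0 for a 0
        level = sum(s * s for s in chunk) / len(chunk)
        if state == 0 and level > high:
            state = 1   # rising threshold of the hypothetical Schmitt trigger
        elif state == 1 and level < low:
            state = 0   # falling threshold
        bits.append(state)
    return bits


data = [1, 0, 1, 1, 0, 0, 1]
tx, n = ask_modulate(data)
print(envelope_detect(tx, n) == data)
```

Because the receiver only looks at the envelope, this model needs no carrier phase or frequency recovery, which is consistent with the absence of a PLL in the ASK receiver described above.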

Differential Transmission Line (TML)

  • Loss of 0.6-1.6 dB/mm

slide-10
SLIDE 10

RF-I using Amplitude Shift-Key (ASK) Modulation

[Figure: waveform labels: VCO output, 60GHz ASK modulated signal, mixer output, 5Gbit/s data input]

3DIC ASK RF-I Tested at 11Gbps*

[Figure: output eye diagram and output versus input waveforms (scale labels: 10ps/div, 50mV/div, 500ps/div); die photo showing the coupling capacitor, TX in Layer 2, and RX in Layer 1]

*Gu and Chang, pp.448-449, 2007 ISSCC (0.33pJ/bit)

slide-11
SLIDE 11

Single Channel ASK RF-I Performance Summary

  • Simple Architecture: One TX VCO, One Mixer, One RX Buffer
  • No synchronization circuits such as PLL or clock data recovery needed in ASK RF-I
  • Can expand the same architecture to multi-band RF-I

Process: IBM 90nm CMOS Digital Process
RF-Carrier Freq.: 60GHz
Data Rate: 5Gbit/s
Power: TX 2mW, RX 3mW
Energy per bit: 1pJ/bit
Active Area: 1300 µm2

Future Trends in Multi-band ASK RF-I

Technology  # of Carriers  Data rate per carrier (Gb/s)  Total data rate per wire (Gb/s)  Power (mW)  Energy per bit (pJ)  Area (TX+RX) (mm2)  Area/Gbit (µm2/Gbit)
90nm        3RF + 1 BB     5                             20                               20          1.00                 0.022               1100
65nm        4RF + 1 BB     6                             30                               25          0.83                 0.0238              800
45nm        5RF + 1 BB     7                             42                               30          0.71                 0.0228              540
32nm        6RF + 1 BB     8                             56                               35          0.63                 0.0211              380
22nm        7RF + 1 BB     9                             72                               40          0.56                 0.0193              260
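
The derived columns of this table follow from the others by simple arithmetic, which the sketch below recomputes. It assumes the baseband (BB) channel runs at the same rate as each RF carrier, which is what the tabulated totals imply; the `derive` helper and the tuple layout are hypothetical names for illustration.

```python
# (node, RF carriers, baseband channels, Gb/s per carrier, power in mW, area in mm^2)
ROWS = [
    ("90nm", 3, 1, 5, 20, 0.022),
    ("65nm", 4, 1, 6, 25, 0.0238),
    ("45nm", 5, 1, 7, 30, 0.0228),
    ("32nm", 6, 1, 8, 35, 0.0211),
    ("22nm", 7, 1, 9, 40, 0.0193),
]


def derive(rf, bb, rate_gbps, power_mw, area_mm2):
    """Return (total Gb/s, energy in pJ/bit, area in um^2 per Gb/s)."""
    total_gbps = (rf + bb) * rate_gbps       # assumes BB rate equals RF rate
    energy_pj = power_mw / total_gbps        # mW / (Gb/s) = pJ/bit
    area_um2_per_gbit = area_mm2 * 1e6 / total_gbps
    return total_gbps, energy_pj, area_um2_per_gbit


for node, *params in ROWS:
    total, energy, area = derive(*params)
    print(f"{node}: {total} Gb/s, {energy:.3f} pJ/bit, {area:.0f} um^2/Gbit")
```

The recomputed values land on or close to the tabulated ones, which appear to be rounded.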

slide-12
SLIDE 12


Interconnect Topology Comparison

[Figure: three plots for a 2cm interconnect comparing Bus, RF-I, and Optical-I across technology nodes (90nm, 65nm, 45nm, 32nm, 22nm): Data Rate Density [Gbps/um], Energy [pJ/bit], and Latency [ps]]

  • Comparison across process technology of…
    – Traditional RC parallel bus
    – RF-Interconnect
    – Optical Interconnect
  • As process technology scales toward 22nm…
    – RF-I has lowest latency
    – RF-I consumes least energy
    – RF-I has highest data rate density
  • RF-I is fully compatible with modern CMOS technology

Advantages of RF- Interconnects

  • Latency
  • Bandwidth
  • Energy
  • Reconfigurability
slide-13
SLIDE 13

Example: RF-I for CMP NoC Design

  • 10x10 mesh of 5-cycle pipelined routers
    – NoC runs at 2GHz
    – XY/YX routing
  • 64 4GHz 3-wide processor cores containing
    – 8KB L1 Data Cache
    – 8KB L1 Instruction Cache
  • 32 L2 Cache Banks
    – 256KB each
    – Organized as shared NUCA cache
  • 4 Main Memory Interfaces
    – Labeled with + in the figure

[Figure: 10x10 mesh floorplan. R (square) = router, C (circle) = processor core, $ (diamond) = L2 cache bank, + (plus) = main memory interface]

MORFIC: Mesh Overlaid with RF-InterConnect

  • Shared Z-shaped RF waveguide
  • Organized as 8 bidirectional shortcut links
  • Each direction of each shortcut can transmit simultaneously over shared medium
  • Router A can send a flit to other router A, B to B, … H to H in a single cycle
  • Router labeled X cannot directly send to any router not labeled X
    – E.g. Router B in upper left cannot send to router E in upper right directly
    – However, B in upper left can send to B in upper right, and then north to E using normal mesh link
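
The "shortcut, then mesh" example above can be illustrated by counting hops with and without one shortcut. This is a minimal sketch: the router coordinates are hypothetical (the slides do not give the positions of routers A through H), the helper names are assumptions, and a shortcut traversal is counted as a single hop.

```python
def mesh_hops(src, dst):
    """Manhattan distance between two (x, y) routers on a 2D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def hops_with_shortcut(src, dst, shortcut):
    """Fewest hops when one bidirectional shortcut (a, b) may be used once."""
    a, b = shortcut
    via_ab = mesh_hops(src, a) + 1 + mesh_hops(b, dst)
    via_ba = mesh_hops(src, b) + 1 + mesh_hops(a, dst)
    return min(mesh_hops(src, dst), via_ab, via_ba)


# Hypothetical placement of the two "B" routers near the top corners
shortcut_b = ((1, 8), (8, 8))
src = (1, 8)   # "B" in the upper left
dst = (8, 9)   # router one hop north of "B" in the upper right
print(mesh_hops(src, dst), hops_with_shortcut(src, dst, shortcut_b))
```

With these made-up coordinates the path drops from 8 mesh hops to 2 (one shortcut hop plus one mesh hop north), which is the kind of saving the overlay is meant to provide for long-distance traffic.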

[Figure: physical organization and logical organization of the shortcut links, with routers labeled A through H]

slide-14
SLIDE 14

MORFIC Results For 256B Total RF-I [HPCA’2008]

  • 256B RF-I consumes 0.18% silicon overhead on 400mm2 die
    – RF-I components: 0.13%, Router overhead: 0.05%
  • Normalized Splash-2 Execution Time and Average Packet Latency Results
    – Normalized to baseline mesh run-cycles/latency at 1
    – Average 13% (max 18%) performance improvement
    – Average 22% (max 24%) packet latency improvement

[Figure: Normalized Avg Packet Lat, 256B RF-I, for fft, radix, water-sp, watern^2, lu, ocean, barnes]

[Figure: Normalized Run Cycles, 256B RF-I, for fft, radix, water-sp, watern^2, lu, ocean, barnes]

The Bad News …

Most Interconnect Optimization Techniques May Not be Relevant …

  • Performance-driven interconnect design based on distributed RC delay model. Jason Cong, Kwok-Shing Leung, and Dian Zhou, Design Automation Conference, 1993. Cited by 141
  • Interconnect design for deep submicron ICs. J Cong, L He, KY Khoo, CK Koh, Z Pan, Proc. Int. Conf. on Computer Aided Design, 1997. Cited by 139
  • Efficient algorithms for the minimum shortest path Steiner arborescence problem with applications to … Jason Cong, Andrew B. Kahng, and Kwok-Shing Leung, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 17, No. 1, January 1998. Cited by 127
  • Buffer block planning for interconnect-driven floorplanning. J Cong, T Kong, DZ Pan, Proc. Int. Conf. Computer-Aided Design, 1999. Cited by 130

… (from Google Scholar)

slide-15
SLIDE 15

Good News -- Plenty of New Problems for Future PhD Students

  • How many/which routers should be RF-enabled?
    – How many RF-I ports should each router have?
    – Dedicated or multiplexed with other ports?
  • How much RF-I bandwidth to allocate?
    – Total? Per communicating pair?
    – Impacts active layer area consumed by RF-I components
  • Which routing strategy to employ in presence of RF-I express channels?
  • Dynamic or static allocation of frequency bands to sources/destinations?
    – Dynamic: requires arbitration overhead for channel assignment
    – Static: may miss opportunity to match changing communication demand
  • Support of multi-cast

Example: Deadlock: To Avoid or Confront?

  • South-Last Strategy [Ogras and Marculescu, 2006]
    – Routes that can lead to circular buffer dependence are forbidden, which avoids deadlock
  • Deadlock Detection & Recovery (DDR)
    – Based on Duato and Pinkston’s theory [Duato and Pinkston 2001]
    – If deadlock occurs, route all packets in the network on a spare virtual channel
    – Use deadlock-free XY-routing
    – Packets entering network after this point may be routed normally

slide-16
SLIDE 16

Deadlock Results

– South-Last strategy too restrictive

  • Halves the average realizable performance

– Deadlock is best detected and recovered from when it occurs

  • Detection happens reasonably quickly
  • Performance during recovery no worse than baseline

Example: RF-I Topology and Bandwidth Optimization

  • For each channel
    – Source and destination may be reconfigured via frequency-band reassignment
  • Can assign variable # of channels to each source, destination pair (s,d)
    – Critical channels given more bandwidth
  • A flexible means to reconfigure topology
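
One simple way to picture "a variable number of channels per (s,d) pair" is a proportional split of a fixed channel budget by measured demand. The sketch below is only an illustration, not the allocation policy used in the work: the largest-remainder rule, the `allocate_channels` name, the demand figures, and the budget of 8 channels are all assumptions.

```python
def allocate_channels(demand, total_channels):
    """Split total_channels among (s, d) pairs in proportion to demand.

    Uses the largest-remainder method so the counts sum to total_channels.
    """
    total_demand = sum(demand.values())
    if total_demand == 0:
        return {pair: 0 for pair in demand}
    quotas = {p: total_channels * d / total_demand for p, d in demand.items()}
    alloc = {p: int(q) for p, q in quotas.items()}   # integer part first
    leftover = total_channels - sum(alloc.values())
    # Hand remaining channels to the pairs with the largest fractional part
    by_remainder = sorted(demand, key=lambda p: quotas[p] - alloc[p], reverse=True)
    for pair in by_remainder[:leftover]:
        alloc[pair] += 1
    return alloc


# Hypothetical message counts per source/destination pair
demand = {("A", "E"): 600, ("B", "F"): 300, ("C", "G"): 100}
print(allocate_channels(demand, 8))
```

Re-running the split when the measured demand changes corresponds to the frequency-band reassignment described above, with heavier pairs receiving more channels.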

[Figure: one PHYSICAL organization mapped to two logical topologies, LOGICAL A and LOGICAL B]

slide-17
SLIDE 17

Variance In Communication Patterns

[Figure: Mpeg2Enc time varying behavior. Event count (log scale) per interval (250k cycles) for L2 ACCESS, NW INJECT, BW STALL, and FLITS SENT]

[Figure: mpeg2enc traffic by Manhattan distance (# msgs vs. distance 1 to 14)]

[Figure: waterspatial traffic by Manhattan distance (# msgs vs. distance 1 to 14)]

[Figure: WaterSpatial time varying behavior. Event count (log scale) per interval (250k cycles) for L2 ACCESS, NW INJECT, BW STALL, and FLITS SENT]

Conclusions

  • RF-I on CMOS is real
  • RF-I is a very promising solution to global interconnect bottleneck
    – Latency
    – Bandwidth
    – Energy
    – Reconfigurability
  • RF-I introduces many interesting physical and architecture design problems in NoC designs