
SLIDE 1

RF-Interconnect for Communications On-Chip

Frank Chang¹, Jason Cong², Glenn Reinman², Eran Socher¹, Rocco Tam¹

¹Department of Electrical Engineering, ²Department of Computer Science

Current Trend in CMP - NoC

65nm CMOS 80 tile NoC
10X8 2D mesh network-on-chip running @ 4GHz
Bisection bandwidth 256GB/s
1 TFLOPS @ 1V, about 98W

ISSCC 2007: An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS (Sriram Vangal et al., Intel)

SLIDE 2

What is The Challenge?

Cores would keep shrinking in size but maintain the same operation frequency (2~4GHz) due to thermal constraints
More cores would be integrated on the same chip to achieve performance boost through parallelism
Performance would be limited by the communication efficiency between cores and memories on- and off-chip

The Scaling Trend

Scaling reduces delay of logic gates but not wires

[Figure: Transistor and Wire Delay Trend in CMOS. Delay [ps] versus technology node (180nm down to 32nm) for three curves: FO4, 1mm RC global wire, and repeated 1mm RC global wire]

SLIDE 3

Traditional Interconnect

Units communicate through a parallel bus using voltage signaling (charging and discharging the wire capacitance)
Latency is RC limited (~L^2)
Using CMOS repeaters reduces latency (~L) but does not benefit from scaling
Supply no longer scales due to leakage
Baseband-only signaling requires extensive equalization
Waste of broad bandwidth available from modern CMOS devices (ft>150GHz, fmax>250GHz)
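The ~L^2 versus ~L behaviour listed above can be sketched with the standard distributed-RC estimate (50% delay of roughly 0.38 times total resistance times total capacitance). The per-millimetre resistance, capacitance, and repeater delay below are illustrative placeholders, not values from the slides.

```python
def unrepeated_delay_ps(length_mm, r_ohm_per_mm, c_ff_per_mm):
    """50% delay of a distributed RC line: about 0.38 * R_total * C_total."""
    r_total = r_ohm_per_mm * length_mm
    c_total_farad = c_ff_per_mm * length_mm * 1e-15
    return 0.38 * r_total * c_total_farad * 1e12  # seconds -> picoseconds


def repeated_delay_ps(length_mm, segment_mm, r_ohm_per_mm, c_ff_per_mm, repeater_ps):
    """Wire cut into equal segments, each followed by a fixed-delay repeater."""
    n_segments = max(1, round(length_mm / segment_mm))
    segment_length = length_mm / n_segments
    per_segment = unrepeated_delay_ps(segment_length, r_ohm_per_mm, c_ff_per_mm) + repeater_ps
    return n_segments * per_segment


# Assumed (hypothetical) wire parameters, chosen only to show the scaling.
R, C, REPEATER_PS = 1000.0, 200.0, 20.0

if __name__ == "__main__":
    for length in (1, 2, 5, 10, 20):
        print(length,
              round(unrepeated_delay_ps(length, R, C), 1),
              round(repeated_delay_ps(length, 1.0, R, C, REPEATER_PS), 1))
```

Doubling the length quadruples the unrepeated delay but only doubles the repeated delay. The repeater delay itself is fixed by the gate and wire technology, which is why the repeated wire stops improving with scaling.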


Major Interconnect Issues

Latency is large across chip
Bandwidth is RC limited (~1Gbps/wire)
Communication pattern is fixed
Energy consumption is high and not scalable (~10pJ/bit)
Future microprocessors may encounter communication congestion and most of the energy will be spent on “talking” instead of computing

SLIDE 4

How Can RF Help?

EM waves travel at the (effective) speed of light (~10ps/mm)
Carrier frequencies can be modulated by modern CMOS with high data rates
Transmission lines on- or off-chip can guide the waves (RF modulated data) from the transmitter to receiver with recoverable attenuation
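As a rough illustration of the ~10ps/mm figure above, the following sketch converts a line length into time of flight and into clock cycles. The 20 mm length and the clock frequencies in the example are assumptions chosen for illustration; transmitter and receiver circuit delays are not modelled.

```python
PS_PER_MM = 10.0  # effective propagation delay quoted above (~10 ps/mm)


def time_of_flight_ps(length_mm, ps_per_mm=PS_PER_MM):
    """Propagation delay along a transmission line of the given length."""
    return length_mm * ps_per_mm


def delay_in_cycles(delay_ps, clock_ghz):
    """Express a delay in clock cycles (1 GHz clock has a 1000 ps period)."""
    return delay_ps * clock_ghz / 1000.0


if __name__ == "__main__":
    tof = time_of_flight_ps(20)          # a 2 cm cross-chip line
    print(tof, delay_in_cycles(tof, 4))  # delay in ps and in 4 GHz cycles
```

Under these assumptions a 2 cm line is crossed in less than one cycle of a 4GHz clock.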

RF-Interconnect Concept

Data is transmitted through transmission lines at the speed of light, with less dispersion across the band and less baseband interference
Data rate is limited only by CMOS mixer modulation speed

SLIDE 5

RF-I using Multi-band FDMA

More bands are used with same modulation speed at each band
Higher aggregate data rates can be achieved on the same transmission line

3.6Gbps Multi-drop Multiband Bi-directional RF-I *

[Figure: four FDMA chips (chip1 to chip4) sharing a 10 cm FR4 interconnect, each carrying Data-B or Data-R. Panels: input data patterns, recovered data waveforms, recovered data eye diagrams]

Data-B : 1.8Gb/s PRBS through baseband
Data-R : 1.8Gb/s PRBS through RF-band

* World’s 1st Multiband RF-I, Ko & Chang, 2005 ISSCC

SLIDE 6

RF-Interconnect for NoC

RF-I is built on top of 2D-Mesh NoC and serves as a “super-highway”
Multiple carrier frequencies in the RF and MMW range (100GHz to over 500GHz)
Data encoding by amplitude modulation of carrier
Direct coupling between the transmission line and electronic circuits
Improves with device performance scaling (higher data rates, more carriers)
Potentially lower energy consumption

Can We Implement RF-I in CMOS?

Today’s RF-CMOS circuits are in the wireless communication “sweet spots” of 500MHz-5GHz
– Insufficient bandwidth for RF-I to be effective!
Millimeter-wave CMOS circuits have been developed for 60GHz and recently for 324 GHz bands

SLIDE 7

CMOS 324GHz Generator

76dBm before calibration
46dBm after calibration

*Huang, Larocca and Chang, “324GHz CMOS Frequency Generator using Linear Superposition Technique,” pp. 476-477, 2008 ISSCC

Frequency Generation in Multiband RF-Interconnect

[Figure: multiband RF-I block diagram with carriers f1 = 10GHz, f2 = 20GHz, f3 = 30GHz, f4 = 40GHz, f5 = 50GHz, f6 = 60GHz. Labels: Multi-Band Synthesizer, 10GHz, TX x6, RX x6, Mixer, LPF, Output Buffer, Transmission Line, Data1 to Data6]

SLIDE 8

Simultaneous Sub-harmonic Injection Locked mm-Wave Frequency Generation

Sub-harmonic injection-locked VCOs lock simultaneously to one single reference frequency
Advantages:
– Eliminate PLLs
– Low Power Consumption
– Small Area

[Figure: block diagram with Master VCO, Non-linear Harmonic Generator, and Slave VCOs]

Sub-harmonic Injection Locked VCO*

LC-based VCO core
Differential pair for odd harmonic generation
Single-ended even harmonic generation
Injection locking to high harmonic within locking range of the VCO

| | Process | Free Running Frequency (GHz) | Max Locking Range (GHz) | Locking Harmonics | Power (mW) |
|---|---|---|---|---|---|
| This Work* | 90nm CMOS | 29.3 | 5.6 | 2nd, 4th, 6th, 8th; 3rd, 5th, 7th | 4 |

*Sai-Wang Tam, M.-C. Frank Chang, et al., "Simultaneous Sub-harmonic Injection-Locked mm-Wave Frequency Generators for Multi-band Communications in CMOS", IEEE RFIC Sym., 2008

SLIDE 9

RF-I using Amplitude shift-Key (ASK) Modulation

TX: A transformer couples the VCO output to a simple ASK modulator, which generates the ASK RF signal.
RX: A self-mixer performs envelope detection. A simple buffer and Schmitt Trigger then recover the signal to rail-to-rail swing.
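The TX/RX chain described above can be mimicked with a small behavioural model: amplitude-keyed carrier, squaring as a stand-in for the self-mixer, a moving average as a stand-in for low-pass filtering, and a hysteresis comparator as the Schmitt trigger. This is a numerical sketch only, not the circuit from the slides. The sample counts, the residual "off" amplitude, and the thresholds are assumptions. The 12 carrier cycles per bit correspond to a 60GHz carrier with 5Gbit/s data.

```python
import math


def ask_modulate(bits, cycles_per_bit=12, samples_per_cycle=16, off_level=0.1):
    """Full-amplitude carrier for a 1, reduced amplitude for a 0."""
    samples = []
    samples_per_bit = cycles_per_bit * samples_per_cycle
    for bit in bits:
        amplitude = 1.0 if bit else off_level
        for k in range(samples_per_bit):
            samples.append(amplitude * math.sin(2 * math.pi * k / samples_per_cycle))
    return samples, samples_per_bit


def self_mix_envelope(samples, window):
    """Square the signal (self-mixing), then smooth with a moving average."""
    squared = [s * s for s in samples]
    envelope, running_sum = [], 0.0
    for i, value in enumerate(squared):
        running_sum += value
        if i >= window:
            running_sum -= squared[i - window]
        envelope.append(running_sum / min(i + 1, window))
    return envelope


def schmitt_trigger(envelope, low, high):
    """Comparator with hysteresis, producing a clean 0/1 output."""
    state, output = 0, []
    for value in envelope:
        if state == 0 and value > high:
            state = 1
        elif state == 1 and value < low:
            state = 0
        output.append(state)
    return output


def recover_bits(bits):
    samples, samples_per_bit = ask_modulate(bits)
    envelope = self_mix_envelope(samples, window=16)  # one carrier cycle
    logic = schmitt_trigger(envelope, low=0.15, high=0.35)
    # Bit boundaries are assumed known here, purely to check the result.
    return [logic[(i + 1) * samples_per_bit - 1] for i in range(len(bits))]


if __name__ == "__main__":
    print(recover_bits([1, 0, 1, 1, 0, 0, 1, 0]))
```

The receiver in this model never needs the carrier phase, which is consistent with envelope detection requiring no carrier synchronization.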

Differential Transmission Line

Loss of 0.6-1.6 dB/mm

Differential TML

SLIDE 10

RF-I using Amplitude Shift-Key (ASK) Modulation

[Figure: waveforms labeled VCO output (60GHz), ASK modulated signal, mixer output, and 5Gbit/s data input]

3DIC ASK RF-I Tested at 11Gbps*

[Figure: output eye diagram and output versus input waveforms (scales shown: 10ps/div, 50mV/div, 500ps/div); die photo showing the coupling capacitor, TX in Layer 2, and RX in Layer 1]

*Gu and Chang, pp.448-449, 2007 ISSCC (0.33pJ/bit)

SLIDE 11

Single Channel ASK RF-I Performance Summary

Simple Architecture: One TX VCO, One Mixer, One RX Buffer
No synchronization circuits such as PLL or clock data recovery needed in ASK RF-I
Can expand the same architecture to multi-band RF-I

| Parameter | Value |
|---|---|
| Process | IBM 90nm CMOS Digital Process |
| RF-Carrier Freq. | 60GHz |
| Data Rate | 5Gbit/s |
| Power | TX: 2mW, RX: 3mW |
| Energy per bit | 1pJ/bit |
| Active Area | 1300 µm2 |


Future Trends in Multi-band ASK RF-I

| Technology | # of Carriers | Data rate per carrier (Gb/s) | Total data rate per wire (Gb/s) | Power (mW) | Energy per bit (pJ) | Area (TX+RX) (mm2) | Area/Gbit (µm2/Gbit) |
|---|---|---|---|---|---|---|---|
| 90nm | 3RF + 1 BB | 5 | 20 | 20 | 1.00 | 0.022 | 1100 |
| 65nm | 4RF + 1 BB | 6 | 30 | 25 | 0.83 | 0.0238 | 800 |
| 45nm | 5RF + 1 BB | 7 | 42 | 30 | 0.71 | 0.0228 | 540 |
| 32nm | 6RF + 1 BB | 8 | 56 | 35 | 0.63 | 0.0211 | 380 |
| 22nm | 7RF + 1 BB | 9 | 72 | 40 | 0.56 | 0.0193 | 260 |

SLIDE 12


Interconnect Topology Comparison

[Figure: three plots comparing Bus, RF-I, and Optical-I for a 2cm interconnect across technology nodes 90nm, 65nm, 45nm, 32nm, 22nm. Panels: 2cm Interconnect Data Rate Density [Gbps/um], 2cm Interconnect Energy [pJ/bit], 2cm Interconnect Latency [ps]]

Comparison across process technology of…
– Traditional RC parallel bus
– RF-Interconnect
– Optical Interconnect
As process technology scales toward 22nm…
– RF-I has lowest latency
– RF-I consumes least energy
– RF-I has highest data rate density
RF-I is fully compatible with modern CMOS technology

Advantages of RF-Interconnects

Latency
Bandwidth
Energy
Reconfigurability

SLIDE 13

Example: RF-I for CMP NoC Design

10x10 mesh of 5-cycle pipelined routers
– NoC runs at 2GHz
– XY/YX routing
64 4GHz 3-wide processor cores containing
– 8KB L1 Data Cache
– 8KB L1 Instruction Cache
32 L2 Cache Banks
– 256KB each
– Organized as shared NUCA cache
4 Main Memory Interfaces
– Labeled with + in the figure

[Figure: floorplan of the 10x10 mesh. Legend: R (square) = router, C (circle) = processor core, $ (diamond) = L2 cache bank, + (plus) = main memory interface]

MORFIC: Mesh Overlaid with RF-InterConnect

Shared Z-shaped RF waveguide
Organized as 8 bidirectional shortcut links
Each direction of each shortcut can transmit simultaneously over shared medium
Router A can send a flit to other router A, B to B, … H to H in a single cycle
Router labeled X cannot directly send to any router not labeled X
– E.g. Router B in upper left cannot send to router E in upper right directly
– However, B in upper left can send to B in upper right, and then north to E using normal mesh link
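The shortcut rule above (a labeled router reaches only its same-labeled partner over RF-I, then continues on the mesh) can be sketched as a hop-count comparison. The coordinates, the `shortcuts` dictionary, and the restriction to at most one shortcut per route are assumptions made for illustration; they are not the placement used in MORFIC.

```python
def manhattan(a, b):
    """Mesh hop count between two routers given as (x, y) coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def best_hops(src, dst, shortcuts):
    """Fewest hops using the plain mesh or at most one RF-I shortcut.

    shortcuts maps a label to the pair of router coordinates it connects.
    A shortcut traversal is counted as a single hop in either direction.
    """
    best = manhattan(src, dst)
    for end_a, end_b in shortcuts.values():
        for entry, exit_point in ((end_a, end_b), (end_b, end_a)):
            via = manhattan(src, entry) + 1 + manhattan(exit_point, dst)
            best = min(best, via)
    return best


if __name__ == "__main__":
    # Hypothetical placement: one "B" router near each upper corner of a 10x10 mesh.
    shortcuts = {"B": ((0, 1), (9, 1))}
    upper_left_b = (0, 1)
    upper_right_e = (9, 0)  # one mesh hop north of the right-hand B
    print(manhattan(upper_left_b, upper_right_e))            # mesh only
    print(best_hops(upper_left_b, upper_right_e, shortcuts)) # B to B, then north
```

With this placement the route drops from ten mesh hops to two: one shortcut hop plus one mesh hop north.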

[Figure: PHYSICAL ORGANIZATION and LOGICAL ORGANIZATION of the shortcut endpoints, labeled A through H]

SLIDE 14

MORFIC Results For 256B Total RF-I [HPCA’2008]

256B RF-I consumes 0.18% silicon overhead on 400mm2 die

– RF-I components: 0.13%, Router overhead: 0.05%

Normalized Splash-2 Execution Time and Average Packet Latency Results
– Normalized to baseline mesh run-cycles/latency at 1
– Average 13% (max 18%) performance improvement
– Average 22% (max 24%) packet latency improvement

[Figure: two bar charts for 256B RF-I over the benchmarks fft, radix, water-sp, watern^2, lu, ocean, barnes. Panels: Normalized Avg Packet Lat and Normalized Run Cycles]

The Bad News …

Most Interconnect Optimization Techniques May Not be Relevant …

Performance-driven interconnect design based on distributed RC delay model. Jason Cong, Kwok-Shing Leung, and Dian Zhou, Design Automation Conference, 1993. Cited by 141.

Interconnect design for deep submicron ICs. J Cong, L He, KY Khoo, CK Koh, Z Pan, Proc. Int. Conf. on Computer Aided Design, 1997. Cited by 139.

Efficient algorithms for the minimum shortest path Steiner arborescence problem with applications to … Jason Cong, Andrew B. Kahng, and Kwok-Shing Leung, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 1, January 1998. Cited by 127.

Buffer block planning for interconnect-driven floorplanning. J Cong, T Kong, DZ Pan, Proc. Int. Conf. Computer-Aided Design, 1999. Cited by 130.

… (from Google Scholar)

SLIDE 15

Good News -- Plenty of New Problems for Future PhD Students

How many/which routers should be RF-enabled?
– How many RF-I ports should each router have?
Dedicated or multiplexed with other ports?
How much RF-I bandwidth to allocate?
– Total? Per communicating pair?
– Impacts active layer area consumed by RF-I components
Which routing strategy to employ in presence of RF-I express channels?
Dynamic or static allocation of frequency bands to sources/destinations
– Dynamic: requires arbitration overhead for channel assignment
– Static: may miss opportunity to match changing communication demand
Support of multi-cast

Example: Deadlock: To Avoid or Confront?

South-Last Strategy [Ogras and Marculescu, 2006]
– Routes which can lead to circular buffer dependence are forbidden, which avoids deadlock
Deadlock Detection & Recovery (DDR)
– Based on Duato and Pinkston’s theory [Duato and Pinkston 2001]
If deadlock occurs, route all packets in the network on a spare virtual channel
Use deadlock-free XY-routing
Packets entering network after this point may be routed normally
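The XY-routing used during recovery is dimension-order routing: a packet finishes all of its X moves before making any Y move, so it never turns from Y back to X and no cyclic channel dependence can form on a mesh. A minimal sketch follows. The coordinate convention and function name are assumptions, and virtual-channel handling is not modelled.

```python
def xy_route(src, dst):
    """Return the routers visited from src to dst, X dimension first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
    print(xy_route((4, 3), (1, 5)))
```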

SLIDE 16

Deadlock Results

– South-Last strategy too restrictive

Halves the average realizable performance

– Deadlock is best detected and recovered from when it occurs

Detection happens reasonably quickly
Performance during recovery no worse than baseline

Example: RF-I Topology and Bandwidth Optimization

For each channel
– Source and destination may be reconfigured via frequency-band reassignment
Can assign variable # of channels to each source, destination pair (s,d)
– critical channels given more bandwidth
A flexible means to reconfigure topology
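One way to picture "critical channels given more bandwidth" is a greedy allocator that hands out frequency bands one at a time to the (s,d) pair with the most traffic per band already held. This heuristic, the traffic numbers, and the names below are assumptions for illustration only, not the optimization method from the slides.

```python
def allocate_bands(demand, num_bands):
    """Assign num_bands frequency bands to (source, destination) pairs.

    demand maps (source, destination) to a traffic estimate. Each band goes to
    the pair whose traffic per band would still be largest after receiving it.
    """
    allocation = {pair: 0 for pair in demand}
    for _ in range(num_bands):
        neediest = max(demand, key=lambda pair: demand[pair] / (allocation[pair] + 1))
        allocation[neediest] += 1
    return allocation


if __name__ == "__main__":
    # Hypothetical traffic estimates for three communicating pairs.
    traffic = {("A", "B"): 900, ("C", "D"): 400, ("E", "F"): 100}
    print(allocate_bands(traffic, 4))
```

Re-running the allocator with fresh traffic estimates yields a different logical topology over the same physical waveguide, which is the reconfiguration idea in the bullets above.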

PHYSICAL LOGICAL A LOGICAL B

SLIDE 17

Variance In Communication Patterns

[Figure: Mpeg2Enc time varying behavior. Event count (log scale) per interval (250k cycles) for L2 ACCESS, NW INJECT, BW STALL, FLITS SENT]

[Figure: mpeg2enc traffic by manhattan distance and waterspatial traffic by manhattan distance. # msgs versus distance (1 to 14)]

[Figure: WaterSpatial time varying behavior. Event count (log scale) per interval (250k cycles) for L2 ACCESS, NW INJECT, BW STALL, FLITS SENT]

Conclusions

interconnect bottleneck

– Latency – Bandwidth – Energy – Reconfigurability

architecture design problems in NoC designs