Design Techniques for Scalable, Sub-pJ/b Serial I/O Transceivers

SLIDE 1

Samuel Palermo spalermo@tamu.edu Analog & Mixed-Signal Center Texas A&M University

Design Techniques for Scalable, Sub-pJ/b Serial I/O Transceivers

SLIDE 2

Outline

  • Motivation
  • Power-Scalable I/O Techniques
  • Low-Power Clocking
  • Low-Power Equalizers
  • Conclusion

SLIDE 3

More and More Data …

Human-driven traffic growth: high-definition video conferencing

Machine-driven traffic growth: cloud services, big data, IoT, enterprise services, supercomputers

SLIDE 4

High-Speed Serial I/O

  • Found in applications ranging from high-end computing systems to smart mobile devices
  • Typical processor platform
  • Processor-to-memory: DDR3
  • Processor-to-peripheral: PCIe & USB
  • Storage: SATA
  • Network: LAN
  • Mobile systems
  • DSI: Display Serial Interface
  • CSI: Camera Serial Interface
  • UniPro: MIPI Unified Protocol

[Figure: Intel Ivy Bridge w/ Chipset]

SLIDE 5

High-Speed Electrical Link System

  • Data serialization required due to limited I/O channel count
  • Future systems demand efficient high-speed drivers, receivers, and clock generation/recovery circuitry
  • Equalization circuitry compensates for high-frequency channel loss

[Figure labels: Serializer, Deserializer]

SLIDE 6

I/O Energy Efficiency is Paramount

  • High-performance processor aggregate I/O bandwidth demands will soon approach 1TB/s
  • Typical I/O power budgets are 10W or less
  • Energy efficiencies near 1mW/Gbps are necessary
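The three figures above can be cross-checked with a short unit conversion. The sketch below assumes that 1TB/s means 10^12 bytes per second (8 Tb/s); the function name is illustrative and not from the talk. Note that 1 mW/Gbps is the same quantity as 1 pJ/b.

```python
def required_efficiency_pj_per_bit(power_budget_w: float,
                                   bandwidth_bytes_per_s: float) -> float:
    """Energy per bit (pJ/b) needed to deliver a bandwidth within a power budget."""
    bits_per_s = bandwidth_bytes_per_s * 8
    joules_per_bit = power_budget_w / bits_per_s
    return joules_per_bit * 1e12  # J/b -> pJ/b (1 pJ/b == 1 mW/Gbps)


if __name__ == "__main__":
    # 10 W budget at 1 TB/s aggregate bandwidth (assumed to mean 1e12 bytes/s)
    print(f"{required_efficiency_pj_per_bit(10.0, 1e12):.2f} pJ/b")
```

Under this assumption the budget works out to 1.25 pJ/b, which is in line with the "near 1mW/Gbps" target.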

HPC I/O Bandwidth*

*M. Mansuri et al., “A Scalable 0.128–1 Tb/s, 0.8–2.6 pJ/bit, 64-Lane Parallel I/O in 32-nm CMOS,” IEEE JSSC, Dec. 2013.

SLIDE 7

Outline

  • Motivation
  • Power-Scalable I/O Techniques
  • Low-Power Clocking
  • Low-Power Equalizers
  • Conclusion

SLIDE 8

Scaling Supply with Data Rate

  • Adaptive power supply regulation allows operation at the minimum supply voltage required for a given data rate
  • Efficient DC-DC converters driven by a frequency controller generate the supply voltage for the I/O clocking and serialization
  • Dramatic energy efficiency improvements possible, particularly as data rates scale down based on I/O bandwidth demand

[Kim JSSC 2002] * GP 65nm CMOS Technology

SLIDE 9

Increasing Data Rate with Parallelism

  • Utilizing large mux/demux factors allows parallel segments to operate at low clock frequencies and low supply voltages
  • Important to minimize jitter and static phase offset of multiple clock phases

$\text{Data Rate} = N \times f_{clk}$
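As a quick illustration of the relation between data rate, multiplexing factor N, and per-segment clock frequency, here is a minimal sketch. The function name is hypothetical; the example values (8 Gb/s with 4:1 and 1:8 factors) are taken from the transceiver described later in the talk.

```python
def per_segment_clock_hz(data_rate_bps: float, mux_factor: int) -> float:
    """Clock frequency each parallel segment runs at, from Data Rate = N * f_clk."""
    if mux_factor < 1:
        raise ValueError("mux_factor must be a positive integer")
    return data_rate_bps / mux_factor


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        f_clk = per_segment_clock_hz(8e9, n)
        print(f"N = {n}: f_clk = {f_clk / 1e9:.1f} GHz")
```

With N = 4 an 8 Gb/s stream needs only 2 GHz clock phases, and with N = 8 the segments run at 1 GHz, which is what makes low-voltage operation of each segment feasible.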

SLIDE 10

Fast Power-State Transitioning

  • Efficient system operation demands minimal latency when adjusting the I/O per-channel data rates
  • Certain applications, such as memory interfaces, have bursts of data traffic which necessitate rapidly achieving maximum I/O bandwidth
  • Techniques must be developed to enable fast power-state transitioning of key I/O circuits

[O’Mahony VLSI-DAT 2009]

SLIDE 11

Low-Swing TX Driver Comparison

Current-Mode Driver (CM)

  • Norton-equivalent parallel termination
  • High PSRR
  • Low pre-driver complexity
  • High signaling power

Voltage-Mode Driver (VM)

  • Thevenin-equivalent series termination
  • Voltage regulator is required
  • ¼ signaling power of CM

VM driver uses 4X less current than CM driver
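The 4X figure follows from a simple DC analysis, sketched below under idealized assumptions that are not stated on the slide: both drivers are differential, the TX and RX are perfectly matched to a line of impedance Z0, and the same peak-to-peak differential swing Vppd is delivered to the receiver. Function names and the 400 mV example swing are hypothetical.

```python
def cm_driver_current(vppd: float, z0: float = 50.0) -> float:
    """Current-mode driver: the tail current sees Z0 (TX termination) in parallel
    with the terminated line on each side, so Vppd = I * Z0."""
    return vppd / z0


def vm_driver_current(vppd: float, z0: float = 50.0) -> float:
    """Voltage-mode driver: the supply (equal to Vppd) drives a series loop of
    Z0 (pull-up) + 2*Z0 (RX differential termination) + Z0 (pull-down)."""
    return vppd / (4.0 * z0)


if __name__ == "__main__":
    swing = 0.4  # hypothetical 400 mVppd swing
    i_cm = cm_driver_current(swing)
    i_vm = vm_driver_current(swing)
    print(f"CM: {i_cm * 1e3:.1f} mA, VM: {i_vm * 1e3:.1f} mA, ratio: {i_cm / i_vm:.1f}")
```

The ratio is independent of the chosen swing and impedance, which is why the comparison is usually quoted simply as a factor of four.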

SLIDE 12

Low-Voltage Serial I/O Transceiver

  • Utilizes a high TX output multiplexing (4:1) and RX input multiplexing (1:8) factor for low-voltage operation

Y.-H. Song, R. Bai, P. Chiang, and S. Palermo, “A 0.47-0.66pJ/bit, 4.8-8Gb/s I/O Transceiver in 65nm-CMOS,” IEEE JSSC, vol. 48, no. 5, pp. 1276-1289, May 2013.

SLIDE 13

4:1 Output Multiplexing Voltage-Mode TX

  • 4 parallel voltage-mode output segments perform output multiplexing
  • Efficient quadrature clock generation with 2-stage poly-phase filter
  • Level-shifting pre-driver allows for smaller output transistors

[Figure: 4:1 Voltage Mode Output Driver. Block labels: 8:4 MUX, AND Gate, and Level Shifter; DFF; Pulse Generator; 2 Stages PPF; CML to CMOS Converter; Error Amp; Scalable DVDD; VREF 0.65 V; Cdec. Signal labels: Txdata, 8x1Gb/s, 2Gb/s, 8Gb/s, 2GHz, CKP, CKN, CK0/90/180/270, CP0/90/180/270, VZUP, VZDN, TXP, TXN]

Y.-H. Song, R. Bai, P. Chiang, and S. Palermo, “A 0.47-0.66pJ/bit, 4.8-8Gb/s I/O Transceiver in 65nm-CMOS,” IEEE JSSC, vol. 48, no. 5, pp. 1276-1289, May 2013.

SLIDE 14

1:8 Input De-Multiplexing RX

  • 1:8 input de-multiplexing allows input comparators to operate at low voltages
  • Injection-locked oscillator is used for efficient multi-phase clock generation and de-skew

Y.-H. Song, R. Bai, P. Chiang, and S. Palermo, “A 0.47-0.66pJ/bit, 4.8-8Gb/s I/O Transceiver in 65nm-CMOS,” IEEE JSSC, vol. 48, no. 5, pp. 1276-1289, May 2013.

SLIDE 15

0.47-0.66pJ/bit, 4.8-8Gb/s GP 65nm CMOS Prototype

  • Optimal 0.47pJ/b energy efficiency achieved at 6.4Gb/s
  • At low data rates, less amortization of static current
  • At high data rates, higher voltage required for serialization timing

Testing with 20cm FR-4 Channel

[Figure: Energy Efficiency [pJ/b] vs. Data Rate [Gb/s] at 4.8, 6.4, and 8 Gb/s for TX, RX, and TX+RX. Supply annotations: TX and RX (VDD=0.6V); TX (VDD=0.8V); RX (VDD=0.75V); TX and RX (VDD=0.65V)]

Y.-H. Song, R. Bai, P. Chiang, and S. Palermo, “A 0.47-0.66pJ/bit, 4.8-8Gb/s I/O Transceiver in 65nm-CMOS,” IEEE JSSC, vol. 48, no. 5, pp. 1276-1289, May 2013.

SLIDE 16

Outline

  • Motivation
  • Power-Scalable I/O Techniques
  • Low-Power Clocking
  • Low-Power Equalizers
  • Conclusion

SLIDE 17

Low-Power Transmitter Clocking

  • Transmitters which utilize voltage-scaling to save power require efficient generation of multi-phase clocks
  • Key issue is the extreme phase variations faced with low-voltage operation

SLIDE 18

Passive Poly-Phase Filter Clock Generation

  • 2-stage passive poly-phase filter generates 4 clock phases for output multiplexing from low-swing global TX ¼-rate differential clocks
  • Requires subsequent CML2CMOS converter to generate TX clocks

[Figure: I&Q Phase Diff [Deg] vs. Frequency [GHz] from 1 to 2.2 GHz for 1-Stage and 2-Stage filters; annotation: < 6°]

Y.-H. Song, R. Bai, P. Chiang, and S. Palermo, “A 0.47-0.66pJ/bit, 4.8-8Gb/s I/O Transceiver in 65nm-CMOS,” IEEE JSSC, vol. 48, no. 5, pp. 1276-1289, May 2013.

SLIDE 19

Injection-Locked Oscillator (ILO) Clock Generation

  • 4-phase CLK generation by ILO
  • Eliminates CML2CMOS converter
  • Fine frequency control by EN_VCTL also enables fast power-state transition

[Figure: Injection Lock Oscillator. Labels: Diff. ¼-Rate CLK, 2mm, Cs, Cw, Dummy, IN, INB, OUT, OUTB, I, IB, Q, QB, ENCLK, ENBCLK, EN_VCTL, VCTL, 1V]

Y.-H. Song, H.-W. Yang, H. Li, P. Chiang, and S. Palermo, “An 8–16 Gb/s, 0.65–1.05 pJ/b, Voltage-Mode Transmitter With Analog Impedance Modulation Equalization and Sub-3 ns Power-State Transitioning,” IEEE JSSC, vol. 49, no. 11, pp. 2631-2643, Nov. 2014.

SLIDE 20

Async. Sampling Based Phase Calibration

  • Compensates for deterministic jitter (DJ) due to duty-cycle distortion (DCD) and phase mismatches of quadrature clocks

Y.-H. Song, H.-W. Yang, H. Li, P. Chiang, and S. Palermo, “An 8–16 Gb/s, 0.65–1.05 pJ/b, Voltage-Mode Transmitter With Analog Impedance Modulation Equalization and Sub-3 ns Power-State Transitioning,” IEEE JSSC, vol. 49, no. 11, pp. 2631-2643, Nov. 2014.

SLIDE 21

Automatic Phase Correction

  • Eye diagrams without and with phase calibration

Y.-H. Song, H.-W. Yang, H. Li, P. Chiang, and S. Palermo, “An 8–16 Gb/s, 0.65–1.05 pJ/b, Voltage-Mode Transmitter With Analog Impedance Modulation Equalization and Sub-3 ns Power-State Transitioning,” IEEE JSSC, vol. 49, no. 11, pp. 2631-2643, Nov. 2014.

[Figure: eye diagrams at 8Gb/s and 16Gb/s, without and with phase calibration. Annotated eye width variations, in the order extracted: 28.5%, 4.7%, 13.1%, 5.4%]

SLIDE 22

RX-Forwarded Clock I/O De-Skew

  • “Coherent” clocking allows jitter tracking, but per-channel de-skew is still needed to maximize timing margins

DLL/PLL + Phase Interpolator (PI)

  • DLL can have jitter amplification, while PLL can have jitter accumulation
  • Both circuits can occupy significant area

Injection-Locked Oscillator (ILO)

  • Compact low-power implementation
  • High jitter tracking bandwidth

SLIDE 23

ILO-Based De-Skew

  • Current-starved inverter-based ILO produces the multiple clock phases necessary for the receiver samplers
  • Fine de-skew control by 6-bit binary current mirror which changes ILO free-running frequency

[Figure: Deskew Range [ps] and Normalized Deskew Range [deg] vs. Data Rate [Gb/s] at 4.8, 5.6, 6.4, and 8 Gb/s]

Y.-H. Song, R. Bai, P. Chiang, and S. Palermo, “A 0.47-0.66pJ/bit, 4.8-8Gb/s I/O Transceiver in 65nm-CMOS,” IEEE JSSC, vol. 48, no. 5, pp. 1276-1289, May 2013.

SLIDE 24

Phase Drifts with ILO-Based Clocking

  • Voltage and temperature variations can cause the TX/RX ILOs’ free-running frequency to change, and thus the phase relationship can drift with time

[Figure labels: Parallel Data In, 4:1, PLL, Data, 1/4 Rate FWD Clk, ILRO w/ Skew Tuning, 8:4 Demuxed Data Out, ILRO PVT Phase Drift, <±0.5UI Deskew Range]

SLIDE 25

Low-Overhead CDR w/ ILO-Based De-Skew

  • Introducing a low-overhead CDR into a forwarded-clock system allows tracking of low-frequency phase drifts, while maintaining correlated jitter tracking

[Figure labels: Parallel Data In, 4:1, PLL, Data, 1/4 Rate FWD Clk, ILRO, CDR, 8:4 Demuxed Data Out]

SLIDE 26

Multi-Phase Errors at Low VDD

[Figure labels: Parallel Data In, 4:1, PLL, Data, 1/4 Rate FWD Clk, ILRO, CDR, 8:4 Demuxed Data Out, Quadrature Phase Error]

SLIDE 27

Edge-Rotating 5/4X Sub-Rate CDR

  • An additional periodically rotating edge sampler provides the 4-eye phase information to CDR logic
  • Allows tracking of phase drift and optimization of each sampler timing margin

H. Li, S. Chen, L. Yang, R. Bai, W. Hu, F. Zhong, S. Palermo, and P. Chiang, “A 0.8V, 560fJ/bit, 14Gb/s Injection-Locked Receiver with Input Duty-Cycle Distortion Tolerable Edge-Rotating 5/4X Sub-Rate CDR in 65nm CMOS,” VLSI Symp., June 2014.

SLIDE 28

14Gb/s GP 65nm CMOS Prototype

Tracking Non-Uniform Eyes

[Figure labels: CTLE, ILRO, Phase Rotator, PI array & Quantizer, CDR Logic, Clock Buffer, Shift Register (×2); dimensions 1mm, 1mm]

[Figure: Correlated Jitter Tolerance. Normalized SJ (UI) vs. SJ Frequency (MHz) for 14Gbps and 12Gbps, with the equipment limit marked]

[Figure: Uncorrelated Jitter Tolerance. Jitter Amplitude (UIpp) vs. Jitter Frequency (MHz) for 14Gbps and 12Gbps, w/ and w/o CDR]

H. Li, S. Chen, L. Yang, R. Bai, W. Hu, F. Zhong, S. Palermo, and P. Chiang, “A 0.8V, 560fJ/bit, 14Gb/s Injection-Locked Receiver with Input Duty-Cycle Distortion Tolerable Edge-Rotating 5/4X Sub-Rate CDR in 65nm CMOS,” VLSI Symp., June 2014.

SLIDE 29

Outline

  • Motivation
  • Power-Scalable I/O Techniques
  • Low-Power Clocking
  • Low-Power Equalizers
  • Conclusion

SLIDE 30

Link with Equalization

  • Equalization goal is to flatten the frequency response out to the Nyquist frequency and remove time-domain ISI

[Figure labels: Serializer, Deserializer]

SLIDE 31

TX-FIR Equalizer Comparisons

  • FIR equalization can easily be implemented in a current-mode driver by summing tap currents on the termination resistors
  • More difficult to implement in voltage-mode drivers due to the series impedance

[Figure labels: Current-Mode Driver (CM), Voltage-Mode Driver (VM)]
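To make the behavior of a 2-tap TX FIR (de-emphasis) equalizer concrete, the sketch below computes the normalized output level for each bit. It is a behavioral model only, not a circuit from the talk: the tap weighting y[n] = (1 - α)·d[n] - α·d[n-1], the parameter name alpha, and the assumption that the line idles at the first bit's level are all illustrative choices.

```python
def two_tap_ffe(bits, alpha):
    """Normalized 2-tap de-emphasis output, one value per bit.

    Symbols are mapped to +/-1. A transition bit is sent at full swing (+/-1),
    while a repeated bit is de-emphasized to +/-(1 - 2*alpha).
    """
    if not 0.0 <= alpha < 0.5:
        raise ValueError("alpha must be in [0, 0.5)")
    symbols = [1.0 if b else -1.0 for b in bits]
    out = []
    prev = symbols[0] if symbols else 0.0  # assume the line idled at the first level
    for d in symbols:
        out.append((1.0 - alpha) * d - alpha * prev)
        prev = d
    return out


if __name__ == "__main__":
    print(two_tap_ffe([0, 1, 1, 1, 0], alpha=0.25))
```

In a current-mode driver the two weighted terms are simply two tap currents summed on the termination resistors; the voltage-mode techniques on the following slides are different ways of producing the same two output levels while keeping the driver impedance under control.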

SLIDE 32

VM Equalization w/ Shunt Voltage Divider (1)

[Wong JSSC 2004]

  • 2-Tap FFE
  • Parallel combination
  • Z-termination
  • Zo = RP ǁ RN
  • More current for de-emp. voltage swing

[Equations: R_P and R_N as functions of Z_o; I_sig as a function of V_ref, V_de-em, and R_T]

[Figure: Normalized Power vs. Vppd,min/Vppd,max for CM and VM]

SLIDE 33

VM Equalization w/ Added Parallel Path (2)

[Dettloff ISSCC 2010]

  • 2-Tap FFE
  • Extra series-connected path
  • Constant current path
  • Z-termination (ZO = RP ǁ RN ǁ RS)
  • Constant signaling power for all VSW
  • Non-linear impedance mapping
  • Decoding/pre-driver complexity

[Figure label: Current-Mode]

[Figure: Normalized Power vs. Vppd,min/Vppd,max for CM, VM1, and VM2]

SLIDE 34

VM Equalization w/ Impedance Modulation (3)

  • 2-Tap FFE: Z-modulation (for de-emphasis, higher TX impedance)
  • Signaling power ∝ Vppd,min / Vdd,max
  • Sacrificing the output termination
  • High digital power

[Equations: relation between R_TX, R_EQ, and Z_O; I_sig as a function of V_ppd,max and R_T]

[Sredojevic JSSC 2011]

[Figure: Normalized Power vs. Vppd,min/Vppd,max for CM, VM1, VM2, and VM3]

SLIDE 35

VM Equalization w/ Analog Impedance Modulation

  • Segmented pre-driver and output driver significantly increase dynamic power consumption with increased equalization resolution
  • Analog tap control obviates output stage segmentation

[Figure labels: Digitally-Controlled Segmented Output; Analog-Controlled Non-Segmented Output]

SLIDE 36

VM Equalization w/ Analog Impedance Modulation

  • Maximum transmitter output swing during a transition bit

SLIDE 37

VM Equalization w/ Analog Impedance Modulation

  • De-emphasis transmitter output swing for run-length > 1

SLIDE 38

16Gb/s Operation

  • 5.8 inch FR4 + 0.6m SMA cable: -15.5dB loss at 8GHz

[Figure: S21 [dB] vs. Frequency [GHz] for the 5.8 inch FR4+SMA channel]

[Figure: Amplitude [V] vs. Time [ns]]

Y.-H. Song, H.-W. Yang, H. Li, P. Chiang, and S. Palermo, “An 8–16 Gb/s, 0.65–1.05 pJ/b, Voltage-Mode Transmitter With Analog Impedance Modulation Equalization and Sub-3 ns Power-State Transitioning,” IEEE JSSC, vol. 49, no. 11, pp. 2631-2643, Nov. 2014.
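The loss is quoted at 8GHz because that is the Nyquist frequency of a 16Gb/s NRZ stream. A small sketch, with illustrative helper names, converts the quoted loss into the fraction of signal amplitude that survives at that frequency.

```python
def nrz_nyquist_hz(bit_rate_bps: float) -> float:
    """Nyquist frequency of an NRZ stream: half the bit rate."""
    return bit_rate_bps / 2.0


def db_to_amplitude_ratio(gain_db: float) -> float:
    """Convert a voltage gain in dB to a linear amplitude ratio."""
    return 10.0 ** (gain_db / 20.0)


if __name__ == "__main__":
    print(f"Nyquist: {nrz_nyquist_hz(16e9) / 1e9:.0f} GHz")
    print(f"Amplitude remaining at -15.5 dB: {db_to_amplitude_ratio(-15.5):.3f}")
```

Roughly 17% of the amplitude remains at Nyquist, which gives a sense of how much high-frequency boost the TX equalizer has to provide.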

SLIDE 39

Low-Voltage DFE w/ Charge-Based Latches

  • First stage has small aperture time
  • Second stage has small delay to quantized output

R. Bai, S. Palermo, and P. Chiang, “A 0.25pJ/b 0.7V 16Gb/s 3-Tap Decision-Feedback Equalizer in 65nm CMOS,” ISSCC, Feb. 2014.

SLIDE 40

16Gb/s Operation

R. Bai, S. Palermo, and P. Chiang, “A 0.25pJ/b 0.7V 16Gb/s 3-Tap Decision-Feedback Equalizer in 65nm CMOS,” ISSCC, Feb. 2014.

SLIDE 41

DFE with Feedback FIR Filter Issues

  • DFE critical path timing → speed/power trade-off
  • High-loss channels require large number of DFE taps
  • Increases area and power
  • Increases loading → limits speed

[Figure: 20 inch Backplane Channel 10Gb/s Pulse Response; Peak Distortion Analysis; critical path highlighted]

SLIDE 42

DFE with Feedback IIR Filter

  • IIR feedback filter provides efficient long-tail ISI cancellation
  • Typical backplane channel well approximated with 2 IIR taps

[Figure: Peak Distortion Analysis]
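The reason an IIR feedback filter handles long-tail ISI so economically can be seen in a small numerical sketch. It assumes, purely for illustration, that the post-cursor tail is a single decaying exponential h[k] = a·r^(k-1); the values of a and r and all function names are hypothetical. A first-order recursive filter then reproduces the whole tail with one state variable, whereas an FIR feedback filter leaves everything beyond its last tap uncancelled.

```python
def postcursor_tail(a, r, length):
    """Hypothetical long-tail ISI: h[k] = a * r**(k-1) for k = 1..length."""
    return [a * r ** (k - 1) for k in range(1, length + 1)]


def tail_isi(symbols, tail):
    """ISI seen at each symbol from all earlier symbols through the tail."""
    out = []
    for n in range(len(symbols)):
        reach = min(n, len(tail))
        out.append(sum(tail[k - 1] * symbols[n - k] for k in range(1, reach + 1)))
    return out


def iir_feedback(symbols, a, r):
    """First-order IIR feedback: fb[n] = r * fb[n-1] + a * d[n-1]."""
    fb, prev, out = 0.0, 0.0, []
    for d in symbols:
        fb = r * fb + a * prev
        out.append(fb)
        prev = d
    return out


def residual_after_fir(tail, n_taps):
    """Worst-case ISI left after an n-tap FIR DFE cancels the first n post-cursors."""
    return sum(abs(h) for h in tail[n_taps:])


if __name__ == "__main__":
    a, r = 0.2, 0.7  # hypothetical tail amplitude and per-UI decay
    symbols = [1, -1, -1, 1, 1, 1, -1, 1]
    isi = tail_isi(symbols, postcursor_tail(a, r, len(symbols)))
    fb = iir_feedback(symbols, a, r)
    print("max IIR cancellation error:", max(abs(x - y) for x, y in zip(isi, fb)))
    print("residual after 3-tap FIR:", residual_after_fir(postcursor_tail(a, r, 50), 3))
```

A channel described by two IIR taps, as on the slide, would correspond to the sum of two such filters with different amplitudes and decay rates.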

SLIDE 43

10Gb/s 2-IIR-Tap DFE w/ 35dB Loss Compensation

[Figure labels: IIR Filter/Mux (×2), Path I, Path Q]

  • Summation/slicing merged
  • Three-input double-tail comparator
  • Comparator output directly connected to the IIR1 Mux
  • Lowers critical path delay

O. El-Hadidy and S. Palermo, “A 10 Gb/s 2-IIR-Tap DFE Receiver with 35 dB Loss Compensation in 65-nm CMOS,” IEEE Symposium on VLSI Circuits, June 2013.

SLIDE 44

10Gb/s 2-IIR-Tap DFE w/ 35dB Loss Compensation

O. El-Hadidy and S. Palermo, “A 10 Gb/s 2-IIR-Tap DFE Receiver with 35 dB Loss Compensation in 65-nm CMOS,” IEEE Symposium on VLSI Circuits, June 2013.

SLIDE 45

PAM4 Signaling

  • PAM-4 modulation offers improved spectral efficiency over NRZ
  • Main characteristics:
  • Lower symbol rate
  • Lower voltage margin → higher sensitivity is required

[Figure: 32 Gb/s PAM4 Eye and 32 Gb/s NRZ Eye, Voltage (V) vs. Time (ps); annotated eye heights: 2a (NRZ) and 2a/3 (PAM4)]
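The two characteristics listed above can be quantified with a few lines of arithmetic. The sketch below is illustrative (the function names are not from the talk) and assumes equally spaced levels with the same peak-to-peak swing for both modulations.

```python
import math


def symbol_rate_baud(bit_rate_bps: float, levels: int) -> float:
    """Symbol rate for an M-level PAM signal carrying log2(M) bits per symbol."""
    return bit_rate_bps / math.log2(levels)


def eye_height(peak_to_peak: float, levels: int) -> float:
    """Height of each eye for equally spaced levels at a fixed peak-to-peak swing."""
    return peak_to_peak / (levels - 1)


def amplitude_penalty_db(levels: int) -> float:
    """Eye-height reduction relative to NRZ (2 levels), in dB."""
    return 20.0 * math.log10(levels - 1)


if __name__ == "__main__":
    for name, m in (("NRZ", 2), ("PAM4", 4)):
        rate = symbol_rate_baud(32e9, m) / 1e9
        height = eye_height(2.0, m)  # swing of 2a with a = 1
        print(f"{name}: {rate:.0f} GBaud, eye height = {height:.3f} a, "
              f"penalty = {amplitude_penalty_db(m):.2f} dB")
```

At 32 Gb/s, PAM4 halves the symbol rate to 16 GBaud but shrinks each eye to one third of the NRZ height, about a 9.5 dB amplitude penalty, which is why higher receiver sensitivity is required.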

SLIDE 46

A 32 Gb/s 0.55 mW/Gbps PAM4 1-FIR 2-IIR Tap DFE Receiver in 65-nm CMOS

  • PAM4 DFE employs 1-FIR tap for 1st post-cursor multi-level ISI cancellation and 2-IIR taps for long tail ISI cancellation
  • Multi-level ISI cancellation is achieved with thermometer feedback to tap DACs

[Figure labels: In; S&H (×4) clocked by clk0, clk90, clk180, clk270; 3-bit Flash ADC/Summer; RZ/NRZ Conversion (×4); Qt1[1:3], Qt2[1:3], Qt3[1:3], Qt4[1:3]; MUX/IIR1; MUX/IIR2; IIR Filter/Mux; VIIR1; VIIR2]

O. El-Hadidy, A. Roshan-Zamir, H.-W. Yang, and S. Palermo, “A 32 Gb/s 0.55 mW/Gbps PAM4 1-FIR 2-IIR Tap DFE Receiver in 65-nm CMOS,” IEEE Symposium on VLSI Circuits, June 2015.

SLIDE 47

Dynamic Regenerative Comparator

  • Second stage regeneration through small Mn3, Mp3 in parallel with second stage
  • Full swing output
  • Smaller delay (versus regenerative comparator)
  • Second stage regeneration current is controlled through NMOS transistor
  • Only requires one clock phase

[Figure labels: VIN, VX, Vo, clk, Mn1, Mn2, Mn3, Mp1, Mp2, Mp3, VDD; waveforms: clk, Vx, Vo]

O. El-Hadidy, A. Roshan-Zamir, H.-W. Yang, and S. Palermo, “A 32 Gb/s 0.55 mW/Gbps PAM4 1-FIR 2-IIR Tap DFE Receiver in 65-nm CMOS,” IEEE Symposium on VLSI Circuits, June 2015.

SLIDE 48

GP 65nm CMOS Prototype & Measurement Results

  • At 32Gb/s consumes 17.7mW or 0.55mW/Gbps

[Figure labels: PRBS (×2), Combiner (×2), Bias-T (×2), DFE Rx; 600 mVppd 25Gb/s PAM4 Data]

O. El-Hadidy, A. Roshan-Zamir, H.-W. Yang, and S. Palermo, “A 32 Gb/s 0.55 mW/Gbps PAM4 1-FIR 2-IIR Tap DFE Receiver in 65-nm CMOS,” IEEE Symposium on VLSI Circuits, June 2015.

SLIDE 49

Conclusion

  • I/O transceivers need to achieve near 1pJ/b at 10+ Gb/s to support future systems
  • Low-voltage operation with parallelism can achieve significant power savings
  • Source synchronous architectures reduce clocking complexity
  • Circuitry which supports fast power-state transitioning can reduce system average power
  • Low-voltage equalizers are necessary to support channel loss for data rates >10Gb/s

SLIDE 50

Acknowledgements

  • Many of the projects discussed today were collaborative works with Prof. Patrick Chiang’s group at Oregon State
  • Funding support from SRC and TI