
SLIDE 1

Review

Numbers Formats and Simple Arithmetic
FPGA Structure (CLB, Routing, IO, Clocks)
Pipelining (Resource VS Speed VS Latency)
Memories and Waveform Generation
ADCs and DACs applications in DSP
Constraints (Timing and Placement)
More Complex Arithmetic (Series Expansion and the CORDIC algorithm for sin & cos)

DSP Resources (DSP48E2 Block)
Filtering: FIR and IIR Implementations
Serial multi-rate DSP (decimation and interpolation) and applications

SLIDE 2

Looking Forward

Multi-Rate, Parallel DSP (1 week)
FFTs (2 weeks)
Digital Compensation (1 week)
PLLs
AGCs
Complete DSP Chains & SDRs (2 weeks)
Miscellaneous (1 week)
Pseudo-Random Noise Generators and CRC checks

PWM and PDM (audio systems)

SLIDE 3

Parallel Processing

In some instances, the timing requirements cannot be met with a serial process even after a DSP function is fully pipelined.

Example:
In desktop computers, video processing requires the values of many pixels to be computed simultaneously within one refresh period.

Since many of the operations are independent, GPUs are well suited to handle the computational load in a parallel fashion.

SLIDE 4

Parallel Processing

FPGAs are well suited to handle parallel tasks.

We need to understand what can be computed independently, or how to modify the DSP algorithm to work in a parallel fashion.

As with pipelining, there is a trade-off between use of resources and achievable clock rates.

Common Applications
FFTs
Video Processing
GSPS ADCs and DACs

SLIDE 5

Parallel Processing

Many FPGAs now have dedicated hardware components to facilitate the use of high-speed data converters that operate at rates exceeding what the FPGA fabric can support.

Gigabit Transceivers
Serializers
Deserializers
RFSoC Integrated ADCs and DACs
Extreme care must be taken to understand clock rates and data formats.

ODDR processing of the DAC channels on the dev board.

SLIDE 6

GigaBit Transceiver

SLIDE 7

Zynq RFSoC

DATA_ADC0[127:0]: 8x 16-bit samples
Up to 16 Converters
128 Values per clock cycle.

DATA_ADC0[255:0]: 16x 16-bit samples
Up to 16 Converters
256 Values per clock cycle.
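The bus widths above can be illustrated with a small Python model that packs and unpacks eight 16-bit samples in one 128-bit word. This is only a sketch: the function names are hypothetical, and the ordering (sample 0 in the least significant bits) and the two's complement encoding are assumptions that should be checked against the RFSoC data converter documentation.

```python
def pack_samples(samples, bits=16):
    """Pack signed samples into one bus word (assumption: sample 0 in the LSBs)."""
    mask = (1 << bits) - 1
    word = 0
    for i, s in enumerate(samples):
        word |= (s & mask) << (i * bits)
    return word


def unpack_samples(word, n_samples=8, bits=16):
    """Split a bus word back into signed two's complement samples."""
    mask = (1 << bits) - 1
    samples = []
    for i in range(n_samples):
        raw = (word >> (i * bits)) & mask
        if raw & (1 << (bits - 1)):  # sign bit set: convert to a negative value
            raw -= 1 << bits
        samples.append(raw)
    return samples


example = [0, 1, -1, 32767, -32768, 100, -100, 5]
bus_word = pack_samples(example)
```

With 16 converters each delivering one such word, the fabric sees 16 x 8 = 128 samples per clock cycle, which matches the count on the slide.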

SLIDE 8

Zynq RFSoC

I/Q Mixers, Decimation, Interpolation all implemented in dedicated hardware.

SLIDE 9

Zynq RFSoC

SLIDE 10

Serializer

Part of the IO Logic
Data_In D8 to D1
Data_Out OQ
Achieve output data rates that are up to 14x the fabric rate.

SLIDE 11

Detailed View

4-to-1
Signals
2 clocks
Global Clock: Slower clock from FPGA fabric.
IO Clock: High-speed Input/Output clock.

SDR and DDR
IO Data
4 input lines
1 output line
Enables
Training Data

SLIDE 12

Detailed View

Structure
Registers
Two Columns
Parallel Load (Global Clock)
Shift Regs, Serialized Output (IO Clock)
Muxes
Shift data from parallel load to Shift Registers.
Use Training Data.
Width Expansion

SLIDE 13

Detailed View

Operation
Global Clock
Loads parallel data from D4 to D1.
Strobe
Not used on OSERDESE2.
Selects mux to shift parallel data into shift registers.
I/O Clock
When Strobe is high, loads shift registers with new data.
When Strobe is low, shifts out the serial data.
Train pin selects preset training data rather than D4 to D1.
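The load-and-shift behavior described above can be sketched as a small behavioral model in Python. The class and function names are hypothetical, the training-data path is left out, and the timing is simplified: the strobe is assumed to be high on the first I/O clock after each global clock edge, and D1 is assumed to leave first.

```python
class Serializer4to1:
    """Behavioral sketch of a 4:1 output serializer (not cycle-accurate)."""

    def __init__(self):
        self.parallel = [0, 0, 0, 0]  # registers loaded by the global clock
        self.shift = [0, 0, 0, 0]     # shift registers clocked by the I/O clock

    def global_clock(self, d1, d2, d3, d4):
        # Global clock edge: capture the parallel inputs D1..D4.
        self.parallel = [d1, d2, d3, d4]

    def io_clock(self, strobe):
        if strobe:
            # Strobe high: the muxes load the shift registers with new data.
            self.shift = list(self.parallel)
        else:
            # Strobe low: shift the next bit toward the output.
            self.shift = self.shift[1:] + [0]
        return self.shift[0]  # bit currently on the serial output


def run_serializer(words):
    """Feed one (d1, d2, d3, d4) tuple per global clock; return the serial stream."""
    ser = Serializer4to1()
    stream = []
    for word in words:
        ser.global_clock(*word)
        for k in range(4):  # four I/O clocks per global clock
            stream.append(ser.io_clock(strobe=(k == 0)))
    return stream


serial_out = run_serializer([(1, 0, 1, 1), (0, 0, 1, 0)])
```

Each parallel word appears on the output as D1, D2, D3, D4 before the next word is loaded.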

SLIDE 14 to SLIDE 24

Detailed View

Animation frames that repeat the Operation text of the previous slide. Successive frames label the serial output D1, D2, D3, D4, then D1, D2, D3, D4 again, then D1.

SLIDE 25

Example

SLIDE 26

Example: 2.5 GSPS DAC

Think about the required clock and data rates.

Device requires two deinterleaved DDR data paths.

Data Rate (per path): 2.5 GSPS/2 = 1.25 GSPS

Clock Freq (per path): 2.5 GHz/4 = 625 MHz

This is the FPGA OSERDES IO Clock, operating in DDR mode.
We will drive each data path with an 8:1 OSERDESE2.

Our choice is based on the available FPGA (Host Processor in the figure).
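The rate arithmetic above can be checked with a short Python helper. The function name and dictionary keys are hypothetical; the fabric clock it reports is simply the per-path data rate divided by the serializer ratio, which is an assumption that holds when every serializer input is updated once per fabric clock.

```python
def dac_clock_plan(dac_rate_sps, n_paths=2, serdes_ratio=8, ddr=True):
    """Derive per-path rates and clocks for a DAC fed by parallel SerDes paths."""
    path_rate = dac_rate_sps / n_paths           # samples per second on each path
    io_clock = path_rate / (2 if ddr else 1)     # DDR moves two samples per clock
    fabric_clock = path_rate / serdes_ratio      # one parallel load per fabric clock
    return {
        "path_rate_sps": path_rate,
        "io_clock_hz": io_clock,
        "fabric_clock_hz": fabric_clock,
        "samples_per_fabric_clock": n_paths * serdes_ratio,
    }


plan = dac_clock_plan(2.5e9)
for name, value in plan.items():
    print(name, value)
```

For the 2.5 GSPS device this gives 1.25 GSPS per path, a 625 MHz DDR I/O clock, and 16 samples to produce on every fabric clock cycle.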

SLIDE 27

Example: 2.5 GSPS DAC

FPGA Requirements
Global Clock
Not DDR
1.25 GSPS/8
156.25 MHz

Every clock cycle must update 16x 14-bit data samples.

    reg [13:0] D [15:0];
    always @(posedge GCLK) begin
      D[15] <= ?;
      D[14] <= ?;
      ...
      D[0] <= ?;
    end

Figure: two 14-bit data buses, DB0 and DB1, each bit driven by an 8:1 SerDes clocked by GCLK and IOCLK. One bus carries samples D15, D13, D11, D9, D7, D5, D3, D1 and the other carries D14, D12, D10, D8, D6, D4, D2, D0.
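To make the wiring concrete, the following Python sketch maps the 16 samples of one GCLK cycle onto serializer input words. The function name is hypothetical, and two points are assumptions: even-numbered samples go to path 0 and odd-numbered samples to path 1, and the lowest-numbered sample on each path is transmitted first.

```python
def to_serdes_words(samples, bits=14):
    """Map 16 samples to words[path][bit] = 8 serializer input bits (earliest first)."""
    if len(samples) != 16:
        raise ValueError("expected 16 samples per GCLK cycle")
    # Assumption: deinterleave even samples to path 0, odd samples to path 1.
    paths = [samples[0::2], samples[1::2]]
    words = []
    for path_samples in paths:
        lanes = []
        for b in range(bits):
            # One 8:1 serializer per data pin: it carries bit b of 8 successive samples.
            lanes.append([(s >> b) & 1 for s in path_samples])
        words.append(lanes)
    return words


words = to_serdes_words(list(range(16)))
n_serializers = sum(len(lanes) for lanes in words)
```

Under these assumptions the design uses 2 x 14 = 28 data serializers, each receiving 8 bits per GCLK cycle.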

SLIDE 28

Clock Gen. and Dist.

Informational Resources
IOSERDES: SelectIO User's Guide
BUFR & BUFIO: Clocking User's Guide
Instantiation: Libraries Guide

SLIDE 29

Example: 2.5 GSPS DAC

Repeats the FPGA requirements (Global Clock, not DDR, 1.25 GSPS/8 = 156.25 MHz), the 16x 14-bit sample update code, and the SerDes figure from SLIDE 27, and adds clock generation.

Figure: a BUFIO drives IOCLK and a BUFR with a divider (DIV) drives GCLK. An additional 8:1 SerDes, fed with a fixed pattern (the extraction shows 1 1 1 1), drives DCI.

SLIDE 30

Waveform Generation

Waveform generation for a serializer becomes more complicated.

Start simple: using the previous example, how would a linear ramp be generated?

SLIDE 31

Waveform Generation

Linear ramp, option 1: give each of the 16 samples its own accumulator, starting one count apart and stepping by 16 on every GCLK cycle.

    reg [13:0] D [15:0];
    initial begin
      D[15] <= 14'h0F;
      D[14] <= 14'h0E;
      ...
      D[1] <= 14'h01;
      D[0] <= 14'h00;
    end
    always @(posedge GCLK) begin
      D[15] <= D[15] + 14'h10;
      D[14] <= D[14] + 14'h10;
      ...
      D[0] <= D[0] + 14'h10;
    end
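A quick Python model can confirm that sixteen accumulators stepping by 16 reproduce a single ramp once their outputs are interleaved. The function name is hypothetical, and the model assumes D[0] is the first sample sent in each GCLK cycle.

```python
MASK_14 = (1 << 14) - 1  # samples are 14 bits wide and wrap on overflow


def ramp_from_lane_accumulators(n_cycles, lanes=16):
    """Simulate one accumulator per lane and return the interleaved sample stream."""
    d = list(range(lanes))  # initial values: D[i] = i
    stream = []
    for _ in range(n_cycles):
        stream.extend(d)                              # assumption: D[0] goes out first
        d = [(x + lanes) & MASK_14 for x in d]        # every lane steps by 16
    return stream


ramp = ramp_from_lane_accumulators(3)
```

The interleaved stream counts 0, 1, 2, ... and wraps after 16384 samples, the same as one 14-bit counter running at the full DAC rate.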

SLIDE 32

Waveform Generation

Linear ramp, option 2: share one 10-bit counter for the upper bits and hard-wire the lower 4 bits of each sample.

    reg [9:0] DH = 0;
    always @(posedge GCLK) DH <= DH + 10'h01;
    wire [13:0] D [15:0];
    assign D[15] = {DH,4'hF};
    assign D[14] = {DH,4'hE};
    ...
    assign D[1] = {DH,4'h1};
    assign D[0] = {DH,4'h0};
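The shared-counter version can be checked the same way. This Python sketch (hypothetical function name, D[0] assumed to be sent first) builds each sample by concatenating a 10-bit counter with a fixed 4-bit lane index and compares the result with a plain 14-bit ramp.

```python
def ramp_from_shared_counter(n_cycles):
    """Model {DH, lane} concatenation: one 10-bit counter plus constant low bits."""
    dh = 0
    stream = []
    for _ in range(n_cycles):
        # Lane i outputs the 14-bit value {DH, i}; no per-lane adder is needed.
        stream.extend((dh << 4) | lane for lane in range(16))
        dh = (dh + 1) & 0x3FF  # 10-bit counter wraps at 1024
    return stream


shared = ramp_from_shared_counter(1030)
reference = [n & 0x3FFF for n in range(16 * 1030)]
```

Because the low four bits never change, this form replaces sixteen 14-bit adders with a single 10-bit counter.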

SLIDE 33

Waveform Generation

Using the previous example, how would a sinusoidal signal be generated using the CORDIC algorithm?
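One possible answer is to run 16 phase accumulators in parallel, each starting at a different phase offset and advancing by 16 phase steps per GCLK cycle. The Python sketch below models only the phase bookkeeping; `math.sin` stands in for the CORDIC core, and the names, the 32-bit phase width, and the amplitude are assumptions.

```python
import math

PHASE_BITS = 32  # assumed phase accumulator width
LANES = 16


def parallel_phases(step, n_cycles):
    """Return the interleaved phase sequence produced by 16 lane accumulators."""
    mask = (1 << PHASE_BITS) - 1
    acc = [(lane * step) & mask for lane in range(LANES)]  # per-lane starting phase
    lane_step = (LANES * step) & mask                      # each lane skips 16 samples
    phases = []
    for _ in range(n_cycles):
        phases.extend(acc)
        acc = [(a + lane_step) & mask for a in acc]
    return phases


def phase_to_sample(phase, amplitude=8191):
    """Stand-in for a CORDIC core: convert an integer phase to a signed sample."""
    return round(amplitude * math.sin(2 * math.pi * phase / 2**PHASE_BITS))


step = 123456789  # arbitrary example phase increment per DAC sample
phases = parallel_phases(step, 4)
samples = [phase_to_sample(p) for p in phases]
```

The interleaved phases match a single accumulator running at the full sample rate, so each lane can feed its own sine generator.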

SLIDE 34

Waveform Generation

Using the previous example, how would an arbitrary signal be generated from memory?
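One possible arrangement is a wide memory that stores 16 consecutive samples at each address, so that a single read per GCLK cycle supplies every serializer input. The Python sketch below models that layout; the function names are hypothetical, and the waveform length is assumed to be a multiple of 16.

```python
def build_wide_memory(waveform, lanes=16):
    """Group a waveform into rows of `lanes` consecutive samples (one row per address)."""
    if len(waveform) % lanes:
        raise ValueError("waveform length must be a multiple of the lane count")
    return [waveform[a:a + lanes] for a in range(0, len(waveform), lanes)]


def playback(memory, n_cycles):
    """Read one row per GCLK cycle and return the interleaved output stream."""
    addr = 0
    stream = []
    for _ in range(n_cycles):
        stream.extend(memory[addr])        # all 16 lanes come from one address
        addr = (addr + 1) % len(memory)    # wrap to repeat the waveform
    return stream


waveform = [(7 * n) % 100 for n in range(64)]
memory = build_wide_memory(waveform)
output = playback(memory, 8)
```

The same result could be obtained with 16 narrower memories sharing one address counter; which is cheaper depends on the block RAM geometry of the device.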

SLIDE 35

Waveform Generation

Using the previous example, how would a chirp signal be generated from the CORDIC algorithm?

SLIDE 36

Waveform Generation

Simplify the problem. Generate a quadratic as the phase: x^2

Chirp: 2π(f0/fS)n + π(k/fS^2)n^2 + ϕ

Phase = A0 + A1*n + A2*n*n

SLIDE 37

Waveform Generation

Phase[n] = A0 + A1*n + A2*n^2

Expand for each generator:
n0 = 0, 16, 32, ...
n1 = 1, 17, 33, ...
n2 = 2, 18, 34, ...

Phase0[n] = A0 + A1*(16*n) + A2*(16*n)^2
Phase0[n] = A0 + (16*A1)*n + (256*A2)*n^2

Phase1[n] = A0 + A1*(16*n + 1) + A2*(16*n + 1)^2
Phase1[n] = (A0 + A1 + A2) + (16*A1 + 32*A2)*n + (256*A2)*n^2

Phase2[n] = A0 + A1*(16*n + 2) + A2*(16*n + 2)^2
Phase2[n] = (A0 + 2*A1 + 4*A2) + (16*A1 + 64*A2)*n + (256*A2)*n^2

Phase3[n] = A0 + A1*(16*n + 3) + A2*(16*n + 3)^2
Phase3[n] = (A0 + 3*A1 + 9*A2) + (16*A1 + 96*A2)*n + (256*A2)*n^2

PhaseI[n] = A0 + A1*(16*n + I) + A2*(16*n + I)^2
PhaseI[n] = (A0 + I*A1 + I^2*A2) + (16*A1 + 32*I*A2)*n + (256*A2)*n^2

Each CORDIC synthesizer would start with a different phase and frequency.
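The per-generator coefficients can be verified numerically. The Python sketch below (hypothetical function name, integer coefficients chosen only for the check) computes the constant, linear, and quadratic terms for generator I and compares them with the original phase polynomial evaluated at sample 16*n + I.

```python
def lane_coefficients(a0, a1, a2, lane, lanes=16):
    """Coefficients of PhaseI[n] for generator `lane` out of `lanes` generators."""
    b0 = a0 + lane * a1 + lane * lane * a2   # starting phase of this generator
    b1 = lanes * a1 + 2 * lanes * lane * a2  # 16*A1 + 32*I*A2 when lanes = 16
    b2 = lanes * lanes * a2                  # 256*A2 when lanes = 16
    return b0, b1, b2


def phase(a0, a1, a2, m):
    """Original full-rate phase polynomial evaluated at sample index m."""
    return a0 + a1 * m + a2 * m * m


A0, A1, A2 = 1, 2, 3  # arbitrary integers used only for the check
mismatches = 0
for lane in range(16):
    b0, b1, b2 = lane_coefficients(A0, A1, A2, lane)
    for n in range(50):
        if phase(b0, b1, b2, n) != phase(A0, A1, A2, 16 * n + lane):
            mismatches += 1
```

Every generator shares the same quadratic term and differs only in its starting phase and its linear term, which is why each one starts with a different phase and frequency.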

SLIDE 38

Waveform Generation

In general, the x16 increase in output data rate requires a x16 increase in the number of resources needed in the FPGA fabric.

For a sinusoid, we would need 16 CORDIC blocks.

If the instantaneous bandwidth requirement does not use the entire DAC bandwidth (1/2 the DAC rate), the resource requirement can be relaxed by using an interpolation filter.

SLIDE 39

Example: Hardware Interpolation

SLIDE 40

Example: Side-note

SLIDE 41

Waveform Generation

Example:
1 GSPS DAC (Reconstruction Filter ?)
8:1 serializer
GCLK = ? MSPS
DAC Bandwidth = ? MHz
The application only requires DC to 125 MHz output bandwidth.

SLIDE 42

Waveform Generation

Example:
1 GSPS DAC w/ 500 MHz Analog Filter
8:1 serializer
GCLK = 125 MHz
DAC Bandwidth = 500 MHz
Only require DC to 125 MHz output bandwidth.

This implies we only need to generate data at a rate of 250 MSPS before interpolation.

At a 125 MHz GCLK this would be 2 CORDIC cores or 2 AWGs.
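The sizing argument above can be written as a short Python calculation. The function name is hypothetical, and the generated rate is taken as exactly twice the required bandwidth, which is the Nyquist minimum used on the slide; a practical design would likely leave some margin for the interpolation filter's transition band.

```python
import math


def generators_needed(dac_rate_sps, serdes_ratio, bandwidth_hz):
    """Return (fabric clock, generator count, interpolation factor) for a parallel DAC feed."""
    fabric_clock = dac_rate_sps / serdes_ratio
    min_rate = 2 * bandwidth_hz                    # Nyquist minimum for the wanted band
    n_generators = math.ceil(min_rate / fabric_clock)
    interpolation = dac_rate_sps / (n_generators * fabric_clock)
    return fabric_clock, n_generators, interpolation


fabric_clock, n_generators, interpolation = generators_needed(1e9, 8, 125e6)
print(fabric_clock, n_generators, interpolation)
```

For the 1 GSPS example this gives two generators followed by interpolation by 4, instead of eight generators running in parallel.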

SLIDE 43

Waveform Generation

Example:
Red values are the interpolated values.

Interpolation filter typically uses far fewer FPGA resources.

Figure: AWG1 supplies samples D0, D8, D16 and AWG2 supplies D4, D12, D20. A parallel interpolation filter fills in the remaining samples and feeds the eight inputs of the 8:1 SerDes:

D4, D12, D20
D3, D11, D19
D2, D10, D18
D1, D9, D17
D0, D8, D16
D-1, D7, D15
D-2, D6, D14
D-3, D5, D13

The serialized output is ..., D0, D1, D2, D3, D4, D5, D6, D7, ...
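As an illustration of a parallel interpolation filter, the Python sketch below fills the eight serializer inputs from two generated samples per GCLK cycle using linear interpolation. Linear interpolation is only a stand-in chosen for brevity (the slides do not name the filter), and the function names are hypothetical. The previous cycle's AWG2 sample (D-4) is kept in a register so that D-3, D-2, and D-1 can be computed.

```python
def interp_cycle(prev, a1, a2):
    """One GCLK cycle: return lanes D-3..D4 given D-4 (prev), D0 (a1), and D4 (a2)."""
    lanes = []
    for left, right in ((prev, a1), (a1, a2)):
        for k in (1, 2, 3, 4):
            # Fixed weights per lane; k = 4 passes the generated sample through.
            lanes.append(((4 - k) * left + k * right) / 4)
    return lanes


def interpolate_stream(awg1, awg2, prev=0):
    """Run the filter over paired AWG1/AWG2 sample sequences."""
    out = []
    for a1, a2 in zip(awg1, awg2):
        out.extend(interp_cycle(prev, a1, a2))
        prev = a2  # register D4 for use as D-4 in the next cycle
    return out


stream = interpolate_stream([0, 8, 16], [4, 12, 20], prev=-4)
```

Each lane applies constant weights to at most two inputs, so the cost is a handful of shifts and adds rather than six additional waveform generators.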

SLIDE 44

Waveform Generation

Example:
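Red values are the interpolated values.

Nearest Neighbor just uses routing.

Figure: the same arrangement as on the previous slide, without the parallel interpolation filter block. AWG1 (D0, D8, D16) and AWG2 (D4, D12, D20) are routed directly to the eight inputs of the 8:1 SerDes, which carry D-3 through D4, then D5 through D12, then D13 through D20. The serialized output is ..., D0, D1, D2, D3, D4, D5, D6, D7, ...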
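Nearest-neighbor filling can be modeled as pure fan-out. In the Python sketch below each generated sample is copied to four adjacent serializer inputs. This is a sample-and-hold reading of the slide, which is an assumption: a version centered on each generated sample would differ only by a fixed delay and one extra register. The function name is hypothetical.

```python
def nearest_neighbor_cycle(a1, a2):
    """One GCLK cycle: fan each generated sample out to four serializer inputs."""
    # No arithmetic is involved; in hardware this is only wiring.
    return [a1] * 4 + [a2] * 4


def nearest_neighbor_stream(awg1, awg2):
    """Apply the fan-out to paired AWG1/AWG2 sample sequences."""
    out = []
    for a1, a2 in zip(awg1, awg2):
        out.extend(nearest_neighbor_cycle(a1, a2))
    return out


held = nearest_neighbor_stream([10, 30], [20, 40])
```

The trade-off is signal quality: holding each sample for four output periods leaves stronger images than a proper interpolation filter would.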

SLIDE 45

Waveform Generation

Example:
Interpolation filter does not need to be an LPF. You can use any image of the up-sampled signal depending on the filter, or insert an I/Q mixer.
Provides the flexibility of using the entire DAC bandwidth (500 MHz) by changing coefficients.

Figure: the same arrangement as on the earlier slide, with AWG1 (D0, D8, D16) and AWG2 (D4, D12, D20) feeding a parallel interpolation filter whose eight outputs (D-3 through D4, then D5 through D12, then D13 through D20) drive the 8:1 SerDes. The serialized output is ..., D0, D1, D2, D3, D4, D5, D6, D7, ...